WO2022023218A1 - Methods and apparatus for managing a system that controls an environment - Google Patents


Info

Publication number
WO2022023218A1
Authority
WO
WIPO (PCT)
Prior art keywords
environment
action
candidate
agent
baseline
Application number
PCT/EP2021/070720
Other languages
French (fr)
Inventor
Filippo VANNELLA
Ezeddin AL HAKIM
Saman FEGHHI
Erik AUMAYR
Grigorios IAKOVIDIS
Original Assignee
Telefonaktiebolaget Lm Ericsson (Publ)
Application filed by Telefonaktiebolaget Lm Ericsson (Publ)
Publication of WO2022023218A1


Classifications

    • All classifications fall under Section G (PHYSICS), class G06 (COMPUTING; CALCULATING OR COUNTING), subclass G06N (COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS):
    • G06N 20/00 Machine learning
    • G06N 3/006 Artificial life, i.e. computing arrangements simulating life, based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • G06N 3/08 Learning methods (neural networks, G06N 3/02)
    • G06N 5/04 Inference or reasoning models
    • G06N 5/01 Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
    • G06N 7/01 Probabilistic graphical models, e.g. probabilistic networks

Definitions

  • the present disclosure relates to a method for managing a system controlling an environment that is operable to perform a task.
  • the present disclosure also relates to a management node, and to a computer program and a computer program product for managing a system controlling an environment that is operable to perform a task.
  • In Reinforcement Learning (RL), an agent interacts with an environment by exploring its states and selecting actions to be executed on the environment. Actions are selected with the aim of maximising the long-term return of the actions according to a reward signal. More formally, an RL problem is defined by a state space S, an action space A, state transition probabilities, and a reward function.
  • the agent's policy π defines the control strategy implemented by the agent, and is a mapping from states to a probability distribution over possible actions, the distribution indicating the probability that each possible action is the most favourable given the current state.
  • An RL interaction proceeds as follows: at each time instant t, the agent finds the environment in a state s_t ∈ S. The agent selects an action a_t ∼ π(·|s_t) and receives from the environment a reward r(s_t, a_t). The agent's goal is to find the optimal policy, i.e. a policy that maximizes the expected cumulative reward over a predefined period of time, also known as the policy value function V^π(s).
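  • Using the notation above, the policy value function can be written in the standard form below, where T is the predefined period of time and γ ∈ (0, 1] is a discounting reward factor; this is a conventional RL formulation consistent with the description rather than an equation quoted from the disclosure:

$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{T} \gamma^{t}\, r(s_t, a_t) \,\middle|\, s_0 = s\right]$$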
  • While executing the above discussed dynamic optimisation process in an unknown environment (with respect to transition and reward probabilities), the RL agent needs to try out, or explore, different state-action combinations with sufficient frequency to be able to make accurate predictions about the rewards and the transition probabilities of each state-action pair. It is therefore necessary for the agent to repeatedly choose suboptimal actions, which conflict with its goal of maximizing the accumulated reward, in order to sufficiently explore the state-action space.
  • the agent must decide whether to prioritize further gathering of information (exploration) or to make the best move given current knowledge (exploitation). Exploration may create opportunities by discovering higher rewards on the basis of previously untried actions. However, exploration also carries the risk that previously unexplored decisions will not provide increased reward and may instead have a negative impact on the environment. This negative impact may only be short term or may persist, for example if the explored actions place the environment in an undesirable state from which it does not recover.
  • an optimal policy is usually derived in a trial-and-error fashion by direct interaction with the environment.
  • the agent will explore suboptimal regions of the state-action space.
  • this suboptimal exploration may result in unacceptable performance degradation, risk taking, or breaching of safety regulations. Consequently, the standard approach for RL solutions is to employ a simulator as a proxy for the real environment during the training phase, thus allowing for unconstrained exploration without concern for performance degradation.
  • simulators are often subject to modelling errors related to inherent environment stochasticity, and this calls into question their reliability for training an RL agent policy that will be deployed into the real world.
  • a computer implemented method for managing a system controlling an environment that is operable to perform a task.
  • the method comprises providing, to a plurality of Agents, a representation of a current state of the environment, wherein the plurality of Agents comprises a Learning Agent operable to implement a Reinforcement Learning model for selecting actions to be executed on the environment, and a plurality of Baseline Agents, each Baseline Agent operable to implement a policy for selecting actions to be executed on the environment, wherein each policy implemented by a Baseline Agent satisfies a criterion with respect to performance of the task.
  • the method further comprises receiving, from the Learning Agent, a candidate Learning Agent action for execution on the environment, and, from the plurality of Baseline Agents, a plurality of candidate Baseline Agent actions for execution on the environment.
  • the method further comprises generating an environment action on the basis of the candidate Learning Agent action and the plurality of candidate Baseline Agent actions, and providing the environment action to the system for execution on the environment.
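  • As a minimal sketch of these three steps (the class and method names such as ManagementNode, propose_action and execute are assumptions for illustration, not identifiers from the disclosure), the core loop could look like this:

```python
from typing import Callable, Sequence


class ManagementNode:
    """Illustrative management-node loop; names are assumptions, not from the disclosure."""

    def __init__(self, learning_agent, baseline_agents: Sequence, combine: Callable):
        # combine() encapsulates the logic that turns the candidate actions
        # into the single environment action (e.g. safe-set selection).
        self.learning_agent = learning_agent
        self.baseline_agents = list(baseline_agents)
        self.combine = combine

    def step(self, system, state):
        # Provide the current state representation to all Agents and collect candidates.
        candidate_learning = self.learning_agent.propose_action(state)
        candidate_baselines = [agent.propose_action(state) for agent in self.baseline_agents]

        # Generate the environment action on the basis of all candidates.
        environment_action = self.combine(state, candidate_learning, candidate_baselines)

        # Provide the environment action to the system controlling the environment.
        return system.execute(environment_action)
```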
  • a computer program and a computer program product configured, when run on a computer, to carry out a method according to any one of the aspects or examples of the present disclosure.
  • a management node for managing a system controlling an environment that is operable to perform a task.
  • the management node comprises processing circuitry configured to provide, to a plurality of Agents, a representation of a current state of the environment, wherein the plurality of Agents comprises a Learning Agent operable to implement a Reinforcement Learning model for selecting actions to be executed on the environment, and a plurality of Baseline Agents, each Baseline Agent operable to implement a policy for selecting actions to be executed on the environment, wherein each policy implemented by a Baseline Agent satisfies a criterion with respect to performance of the task.
  • the processing circuitry is further configured to receive, from the Learning Agent, a candidate Learning Agent action for execution on the environment, and, from the plurality of Baseline Agents, a plurality of candidate Baseline Agent actions for execution on the environment.
  • the processing circuitry is further configured to generate an environment action on the basis of the candidate Learning Agent action and the plurality of candidate Baseline Agent actions, and to provide the environment action to the system for execution on the environment.
  • a system for controlling an environment that is operable to perform a task.
  • the system comprises a management node according to the preceding aspect of the present disclosure, a Learning Agent operable to implement a Reinforcement Learning model for selecting actions to be executed on the environment, and a plurality of Baseline Agents, each Baseline Agent operable to implement a policy for selecting actions to be executed on the environment, wherein each policy implemented by a Baseline Agent satisfies a criterion with respect to performance of the task.
  • Examples of the present disclosure thus propose a method according to which a plurality of Baseline Agents complement at least one Learning Agent in proposing actions for execution on an environment.
  • Policies implemented by the Baseline Agents for the selection of actions satisfy a criterion with respect to performance of a task by the environment, thus providing a benchmark for safety of the Learning Agent proposed action or actions with respect to the environment and the task it performs.
  • Both the Baseline Agent proposed actions and the Learning Agent proposed action or actions are used to generate an environment action, which is the action that is actually provided to a system controlling the environment for execution on the environment.
  • the method proposed in the present disclosure thus in effect shields the environment from the Learning Agent, allowing the Learning Agent to propose actions but generating an action for forwarding to the environment on the basis not just of the Learning Agent proposal, but also on the basis of proposals from a plurality of Baseline Agents.
  • the plurality of Baseline Agents may offer different perspectives, implement differing logic, be based on different datasets, and/or be optimised for different regions of the state-action space of the environment, thus offering a more complete, flexible and scalable representation of "safety" with respect to the environment than is offered by methods proposed in existing art.
  • each Learning Agent may implement a different learning model and/or use different training techniques, so enabling parallel experimentation with different learning techniques and parameters, supporting optimal control of the environment.
  • a computer implemented method for managing a system controlling a communication network that is operable to provide a communication network service.
  • the method comprises providing, to a plurality of Agents, a representation of a current state of the communication network, wherein the plurality of Agents comprises a Learning Agent operable to implement a Reinforcement Learning model for selecting actions to be executed on the communication network, and a plurality of Baseline Agents, each Baseline Agent operable to implement a policy for selecting actions to be executed on the communication network, wherein each policy implemented by a Baseline Agent satisfies a criterion with respect to provision of a communication network service by the communication network.
  • the method further comprises receiving, from the Learning Agent, a candidate Learning Agent action for execution on the communication network, and, from the plurality of Baseline Agents, a plurality of candidate Baseline Agent actions for execution on the communication network.
  • the method further comprises generating a communication network action on the basis of the candidate Learning Agent action and the plurality of candidate Baseline Agent actions, and providing the communication network action to the system for execution on the communication network.
  • Examples of the present disclosure thus provide a method that facilitates the use of Reinforcement Learning in control of a communication network, without risking the compromised network performance that could arise through direct use of a Reinforcement Learning model or algorithm, for example during a learning or exploration phase.
  • Figure 1 is a flow chart illustrating process steps in a computer implemented method for managing a system controlling an environment that is operable to perform a task;
  • Figures 2A to 2D show a flow chart illustrating process steps in another example of a computer implemented method for managing a system controlling an environment that is operable to perform a task;
  • Figures 3A to 3C show a flow chart illustrating process steps in another example of a computer implemented method for managing a system controlling an environment that is operable to perform a task;
  • Figure 4 illustrates an overview of a modular Safe Reinforcement Learning architecture;
  • Figure 5 illustrates a process flow during which examples of the methods of Figures 1 to 3C may be implemented;
  • Figure 6 illustrates coverage and capacity in Remote Electrical Tilt optimisation;
  • Figure 7 is a block diagram illustrating functional modules in a management node; and
  • Figure 8 is a block diagram illustrating functional modules in another example of a management node.
  • Examples of the present disclosure propose a method for Safe Reinforcement Learning (SRL), and an architecture on which it may be implemented, that change the standard RL interaction cycle so as to ensure the safety of the environment with respect to performance of a task.
  • the method may be envisaged as implementing a safety shield which protects an environment by preventing a Learning agent from interacting directly with the environment, and safety logic which determines what action should be provided by the shield to the environment for execution.
  • the action for execution is determined by the safety logic on the basis of propositions from at least one Learning Agent, which may be implementing an RL model, and a heterogeneous plurality of Baseline Agents, each of which implements a policy that is "safe" in the sense that it fulfils a criterion with respect to task performance.
  • Safety of the actions proposed by the one or more Learning Agents is evaluated with respect to a minimum performance level that is assured by the criterion respected by the Baseline Agent policies. For example, average improvement over the Baseline policies, or improvement over the best/worst Baseline policy, may be considered.
  • the safety shield implemented by the method proposed herein protects the environment by acting as the interface through which the Learning Agent may interact with the environment, ensuring that any action forwarded to the environment for execution satisfies safety requirements encapsulated by the Baseline Agent policies and their respect for the task performance criterion.
  • the Learning Agent or Agents do not therefore have direct access to the environment, ensuring that unsafe actions proposed by the Learning Agent or Agents will not be executed on the environment.
  • the one or more Learning Agents and the Baseline Agents propose actions for execution on the environment.
  • the method proposed herein involves evaluating those actions on the basis of safety logic that enables the generation of an action for execution on the basis of all proposed actions, and forwarding that action for execution on the environment.
  • the evaluation of proposed actions may for example involve the building of a safe set of actions and an unsafe set of actions, with the action for execution being selected as the safe action that has the highest predicted performance.
  • Figure 1 is a flow chart illustrating process steps in a computer implemented method for managing a system controlling an environment that is operable to perform a task.
  • the method comprises, in a first step 110, providing, to a plurality of Agents, a representation of a current state of the environment.
  • the plurality of Agents comprises a Learning Agent operable to implement a Reinforcement Learning (RL) model for selecting actions to be executed on the environment.
  • the plurality of Agents may in some examples comprise a plurality of Learning Agents, as discussed in further detail with reference to Figure 2A.
  • the plurality of Agents further comprises a plurality of Baseline Agents, each Baseline Agent operable to implement a policy for selecting actions to be executed on the environment, wherein each policy implemented by a Baseline Agent satisfies a criterion with respect to performance of the task.
  • an Agent comprises a physical or virtual entity that is operable to implement a policy for the selection of actions on the basis of an environment state.
  • a physical Agent may include a computer system, computing device, server etc.
  • a virtual entity may include a piece of software or computer program, a code fragment operable to implement a computer program, a virtualised function, or any other logical entity.
  • a virtual entity may for example be instantiated in a cloud, edge cloud or fog deployment.
  • a learning Agent comprises an Agent that is operable to implement an RL model for selecting actions to be executed on an environment.
  • RL policy models may include Q-learning, State-Action-Reward-State-Action (SARSA), Deep Q Network, Policy Gradient, Actor-Critic, Asynchronous Advantage Actor-Critic (A3C), etc.
  • a learning Agent is operable to use feedback for training in an online environment, in order to continually update the RL model and improve the quality of actions selected.
  • a Baseline Agent comprises an Agent operable to implement a policy for selecting actions to be executed on an environment, which policy satisfies a criterion with respect to performance of a task by the Environment.
  • Example policies that may be implemented by a Baseline Agent include rule-based policies developed by domain experts to conform to the relevant task performance criterion, and data-driven policies, such as Machine Learning models that have been trained on actions that fulfil the task performance criterion, and are therefore considered to be "safe" with respect to task performance.
  • the criterion that is satisfied by a policy implemented by a Baseline Agent may take any number of different forms, for example according to the nature of different policies implemented by the Baseline Agents.
  • the criterion may also or alternatively encompass a plurality of requirements with respect to task performance.
  • the criterion may require that the policy be configured to avoid specific undesirable outcomes with respect to task performance, or specific undesirable environment states, in which states the environment cannot perform its task to an acceptable level.
  • the criterion may require that the policy has already been trained to an acceptable level of performance, or has been used in the online setting for a minimum length of time without adverse incident.
  • the method 100 further comprises, in step 120, receiving, from the Learning Agent, a candidate Learning Agent action for execution on the environment, and, from the plurality of Baseline Agents, a plurality of candidate Baseline Agent actions for execution on the environment.
  • an action comprises any intervention which may be made on the environment.
  • the intervention may for example comprise a change in a parameter of the environment, a change of state of a component element of the environment, a decision with respect to the environment, allocation of resources within the environment, etc.
  • a candidate Learning Agent action comprises an action that has been proposed by a Learning Agent for execution on the environment.
  • a candidate Baseline Agent action comprises an action that has been proposed by a Baseline Agent for execution on the environment.
  • a candidate Learning Agent action and/or candidate Baseline Agent action may in some examples comprise a vector of action probabilities, each element of the vector corresponding to a possible action and comprising a probability, evaluated by the Agent that proposed the candidate action, that the corresponding possible action is the most favourable of the possible actions according to a performance measure of the task.
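  • As a small illustration of this probability-vector form (the action set and probabilities below are placeholders, not values from the disclosure):

```python
import numpy as np

# Hypothetical candidate action: a probability vector over three possible actions.
possible_actions = ["decrease", "keep", "increase"]   # placeholder action set
candidate = np.array([0.2, 0.1, 0.7])                 # probabilities sum to 1

assert np.isclose(candidate.sum(), 1.0)
# The proposing Agent considers "increase" most likely to be the most favourable action.
print(possible_actions[int(np.argmax(candidate))])    # -> increase
```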
  • the method 100 comprises generating an environment action on the basis of the candidate Learning Agent action and the plurality of candidate Baseline Agent actions.
  • An environment action comprises an action that is to be executed on the environment.
  • the method 100 comprises providing the environment action to the system for execution on the environment.
  • Examples of the method 100 thus facilitate the use of RL in control of an environment that is operable to perform a task, without compromising task performance.
  • the method 100 offers a modularity and adaptability that lends itself to application on a range of different dynamically evolving use cases. For example, the number of Baseline Agents, the number of Learning Agents, and the policies or RL models that they implement, may be adjusted at any time, allowing the control afforded by the method to evolve and adapt over time.
  • the plurality of Baseline Agents offer a range of options for evaluating the Learning Agent proposed actions, and assessing what is a "safe" action with respect to task performance.
  • the Baseline Agents may implement different policies, including a mix of rule based and data driven policies.
  • the Baseline Agent policies may be optimised for different regions of the state-action space, allowing for an improved overall performance, as discussed in greater detail below with respect to Figures 2A to 2D.
  • the method 100 could be implemented using only a single Baseline Agent implementing a policy that satisfies a criterion with respect to performance of the task by the Environment.
  • the single Baseline Agent may provide, at each performance of the method 100, a candidate Baseline Agent action.
  • Figures 2A to 2D show a flow chart illustrating process steps in another example of a computer implemented method 200 for managing a system controlling an environment that is operable to perform a task.
  • the method 200 provides one example of how the steps of the method 100 may be implemented and supplemented to achieve the above discussed and additional functionality.
  • the method comprises, in a first step 202, receiving a representation of a current state of the environment from the system controlling the environment.
  • the method comprises providing, to a plurality of Agents, a representation of a current state of the environment.
  • the plurality of Agents comprises at least one, and may comprise a plurality of, Learning Agents operable to implement an RL model for selecting actions to be executed on the environment, and a plurality of Baseline Agents, each Baseline Agent operable to implement a policy for selecting actions to be executed on the environment, wherein each policy implemented by a Baseline Agent satisfies a criterion with respect to performance of the task.
  • at least two of the Baseline Agents may implement policies that are optimized for different regions of the state-action space of the environment.
  • the state-action space of the environment comprises the range of possible states in which the environment may exist and the available actions for execution on the environment in those states.
  • Different Baseline Agent policies may be optimised for different regions of the state-action space, for example having been trained on a particular set of training data relating to a specific region of the state-action space, or having been specifically configured for a set of circumstances corresponding to a region of the state-action space.
  • each Learning Agent may be implementing a different RL policy model for the selection of actions.
  • the models may be based on the same RL algorithm but trained using different hyperparameters or different training data, or may be based on different RL algorithms.
  • Baseline Agents may implement a policy that has been trained in an offline environment. In an online environment, a Baseline Agent policy is only used to recommend actions.
  • a Learning Agent is continuously training during online operation, while also recommending actions at each time step.
  • the method comprises receiving, from the Learning Agent, a candidate Learning Agent action for execution on the environment, or, from the plurality of Learning Agents, a plurality of candidate Learning Agent actions for execution on the environment.
  • Step 220 further comprises receiving, from the plurality of Baseline Agents, a plurality of candidate Baseline Agent actions for execution on the environment.
  • the method then comprises, in step 230, generating an environment action on the basis of the one or plurality of candidate Learning Agent actions and the plurality of candidate Baseline Agent actions. Steps which may be involved in the generation of an environment action are discussed in further detail with reference to Figures 2C and 2D.
  • the method 200 further comprises, in step 240, providing the environment action to the system for execution on the environment.
  • This may for example comprise sending the action, or a representation of the action, to the system using any suitable communication channel.
  • the method 200 comprises receiving, from the system, a representation of a state of the environment following execution of the environment action, and a value of a reward function representing an impact of execution of the environment action on performance of the task.
  • the reward function may be specific to the environment in which the action is executed, and may be known to the system controlling the environment.
  • the reward function may also be known to the Learning Agent or Learning Agents and to the plurality of Baseline Agents.
  • the method 200 comprises providing, to the plurality of Agents, the representation of a state of the environment following execution of the environment action, the value of the reward function, and a representation of the environment action. It will be appreciated that this may help to render the conducting of the method effectively transparent to the one or more Learning Agents and to the Baseline Agents.
  • the one or more Learning Agents and the plurality of Baseline Agents receive a state representation, provide a proposed action, and then receive an updated state representation, a reward value, and a representation of the action that was executed on the environment (which may not be the action that was proposed by any given Agent).
  • the Agents then have all the information they require to propose a new action on the basis of the updated state of the environment.
  • Both Baseline Agents and Learning Agents may use the received feedback elements of updated state representation, reward value, and representation of the action that was executed on the environment, to propose a new action for execution on the environment.
  • the number of feedback elements that are used by a particular Agent in order to propose a next action may depend upon the particular policy implemented by the Agent.
  • Learning Agents also use the feedback information (including any one or more of updated state representation, reward value, and representation of the action that was executed on the environment) to continuously train and update their prediction model according to the particular RL algorithm they are implementing. In this manner, the Learning Agents continually seek to improve the actions that they will recommend in the future, according to a goal for example of maximising future reward.
  • the method 200 comprises receiving, from at least one of the plurality of Agents, performance feedback for the Agent.
  • This performance feedback may be taken into account in the generating of the environment action at step 230, as discussed in further detail below with reference to Figures 2C and 2D.
  • logic for generating the environment action may depend upon previous performance of different agents, or on learning parameters such as the Value function, or Action-Value function, which may be supplied as feedback by the Agents in step 270.
  • the feedback may include any metric or parameter relating to function or performance of the Agent, for example allowing tracking of Agent performance and the determining of trends in performance over time.
  • Figures 2C and 2D illustrate examples of how the step 230 of generating an environment action on the basis of the one or more candidate Learning Agent actions and the plurality of candidate Baseline Agent actions may be carried out.
  • the step 230 of generating an environment action initially comprises, in step 232c, evaluating the candidate Learning Agent action or actions, and the plurality of candidate Baseline Agent actions against a criterion relating to at least one of task performance or environment state.
  • the evaluation of step 232c may be against a single criterion or against multiple criteria, which may relate to task performance, environment state or a combination of task performance and environment state.
  • the evaluation step may in some examples comprise, in step 232ci, for each candidate Learning Agent action and candidate Baseline Agent action, predicting a state of the environment following execution of the candidate Learning Agent action or candidate Baseline Agent action.
  • the prediction may for example be generated using a trained supervised learning model that takes as input the current environment state and the proposed action, and predicts the environment state after execution of the proposed action.
  • the evaluation step may further comprise, at 232cii, predicting a value of a reward function representing an impact of execution of the candidate Learning Agent action or candidate Baseline Agent action on performance of the task on the basis of the predicted environment state.
  • step 232ci may be omitted, and the evaluation may comprise simply predicting, on the basis of a proposed action, the value of a reward function that would be generated as a result of execution of the action.
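  • A minimal sketch of this evaluation step, assuming a pre-trained transition model predict_next_state(state, action) and a reward function reward(state) are available (both names are assumptions, not part of the disclosure):

```python
def evaluate_candidates(state, candidates, predict_next_state, reward):
    """Predict, for each candidate action, the next environment state and the
    resulting reward value (a sketch of steps 232ci and 232cii)."""
    evaluations = []
    for action in candidates:
        predicted_state = predict_next_state(state, action)   # step 232ci
        predicted_reward = reward(predicted_state)             # step 232cii
        evaluations.append((action, predicted_state, predicted_reward))
    return evaluations
```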
  • the reward function may for example comprise a Key Performance Indicator (KPI), or a function of several KPIs, for the environment and/or its performance of its task.
  • the step 230 of generating an environment action may comprise assembling a candidate set of environment actions from the candidate Learning Agent action and the plurality of candidate Baseline Agent actions on the basis of the evaluation.
  • the assembling of the candidate set may for example be based on a threshold value for the predicted reward function value, or on a threshold improvement or increase in reward value.
  • the candidate set may in some examples be considered as a "safe" candidate set, comprising those actions proposed by the plurality of Agents that are considered to be "safe" with respect to the environment and its performance of its task.
  • the definition of "safe" may be dependent upon a particular technical domain, use case or deployment, but may encompass acceptable limits for task performance as determined for the specific situation, and/or mandated safety requirements for the technical domain etc.
  • the step 230 of generating an environment action may then comprise, in step 236, generating the environment action as a function of the one or more candidate Learning Agent actions and the candidate Baseline Agent actions on the basis of the evaluation. If a candidate set has been assembled at step 234c, then the generation may be made using the candidate set of environment actions.
  • the function used to generate the environment action may comprise a weighted sum. In other examples, the function may comprise a selection from among the one or more candidate Learning Agent actions and the candidate Baseline Agent actions.
  • selecting may comprise selecting from among the candidate Learning Agent action and the candidate Baseline Agent actions the action which is predicted to generate at least one of the highest value of a reward function representing an impact of execution of the action on performance of the task and/or the greatest increase in value of the reward function from a value based on the current state of the environment.
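  • Continuing the sketch above, the candidate set assembly and selection could, under the same assumptions, look as follows (the reward threshold and the fallback behaviour are assumed design choices, not taken from the disclosure):

```python
def select_environment_action(evaluations, current_reward, reward_threshold):
    """Assemble a "safe" candidate set and pick the best-performing action.

    evaluations: list of (action, predicted_state, predicted_reward) tuples,
    e.g. as produced by evaluate_candidates() above.
    """
    # Keep only candidates whose predicted reward clears the threshold.
    safe_set = [e for e in evaluations if e[2] >= reward_threshold]
    if not safe_set:
        safe_set = evaluations  # assumed fallback: consider all candidates

    # Select the action with the highest predicted reward (equivalently,
    # the greatest increase over the current reward value).
    best = max(safe_set, key=lambda e: e[2] - current_reward)
    return best[0]
```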
  • selecting may comprise selecting an action on the basis of how often it has been proposed in the past, or selecting an action based upon past performance metrics or learning parameters of the individual agents.
  • learning parameters may for example include the Value function V, and/or Action-Value function Q.
  • the Value Function V gives the expected sum of reward for a system starting in state s and controlled by an Agent acting optimally according to its selection policy.
  • the Action-Value function Q gives the expected sum of reward for a system starting in state s and controlled by an Agent which first takes action a and then acts optimally according to its selection policy.
  • the feedback received at step 270 from the plurality of Agents may include performance and/or learning parameters for the Agents, and may be taken into account in order to select the environment actions from among the candidate Learning Agent actions and Baseline Agent actions.
  • performance trends and/or other tracking of individual Agent proposals or comparative performance data between Agents may be used to select the environment action from among the candidate actions.
  • Figure 2D illustrates another example of how the step 230 of generating an environment action on the basis of the one or more candidate Learning Agent actions and the plurality of candidate Baseline Agent actions may be carried out.
  • generating an environment action may comprise generating a weighted combination of candidate Baseline Agent actions, wherein weights are assigned to individual candidate Baseline Agent actions according to at least one of the representation of the current state of the environment, a candidate Baseline Agent action, and/or a performance measure of the Baseline Agent.
  • the selection criterion for the weighted combination may therefore take account of the current environment state and/or proposed actions, for example prioritising the actions proposed by Baseline Agents that are optimised for the current part of the state-action space.
  • the selection criterion may also or alternatively take account of individual Agent performance, for example prioritising actions proposed by those Baseline Agents that have recently been performing well.
  • the weighted combination may be a weighted sum.
  • the candidate Learning Agent action and candidate Baseline Agent actions may comprise candidate action vectors, each element of a candidate action vector corresponding to a possible action and comprising a probability that the corresponding action is the most favourable of the possible actions according to a performance measure of the task.
  • generating a weighted combination of candidate Baseline actions may comprise using a control policy to compute a weighted combination, such as a weighted sum, of candidate Baseline Agent action vectors, as illustrated at 232dii.
  • generating an environment action may comprise generating a weighted combination of: (1) the weighted combination of the candidate Baseline Agent actions, and (2) the one or more candidate Learning Agent actions, according to at least one of a predetermined risk schedule and/or performance feedback of the Baseline Agents and one or more Learning Agents.
  • the weighted combination may be a weighted sum.
  • the candidate Learning Agent action and candidate Baseline Agent actions may comprise candidate action vectors, and generating a weighted combination at step 234d may comprise, at step 234dii, using a control policy to compute a weighted sum of the weighted combination of candidate Baseline Agent action vectors and the one or more candidate Learning Agent action vectors.
  • the predetermined risk schedule that may be used to combine the candidate Baseline Agent actions with the candidate Learning Agent action or actions may balance the "safety" of the Baseline Agent actions against the possibility that the Learning Agent actions may offer greater reward, for example as the Learning Agent or Agents train their prediction models on the environment.
  • the risk schedule may therefore evolve with time and/or with performance of the one or more Learning Agents.
  • generating an environment action may comprise selecting the environment action from the weighted combination generated at step 234d.
  • selecting the environment action may comprise selecting as the environment action the possible action corresponding to the highest probability value in the weighted sum at step 234di.
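  • A compact sketch of these weighted combinations (the baseline weights, risk-schedule value k and example numbers are illustrative assumptions):

```python
import numpy as np


def combine_action_vectors(baseline_vectors, baseline_weights, learning_vector, k):
    """Combine candidate action-probability vectors and pick the final action.

    baseline_vectors: one probability vector per Baseline Agent
    baseline_weights: one weight per Baseline Agent
    learning_vector:  the Learning Agent's probability vector
    k:                risk-schedule weight in [0, 1]; values near 1 favour the Baselines
    """
    baselines = np.asarray(baseline_vectors, dtype=float)
    weights = np.asarray(baseline_weights, dtype=float)
    weights = weights / weights.sum()

    # Weighted sum of the Baseline Agent action vectors.
    baseline_combined = weights @ baselines

    # Weighted sum of the Baseline combination and the Learning Agent vector,
    # according to the risk schedule k.
    combined = k * baseline_combined + (1.0 - k) * np.asarray(learning_vector, dtype=float)

    # Select the possible action with the highest combined probability.
    return int(np.argmax(combined)), combined


# Illustrative usage with assumed numbers (three possible actions).
action_index, probabilities = combine_action_vectors(
    baseline_vectors=[[0.6, 0.3, 0.1], [0.5, 0.4, 0.1]],
    baseline_weights=[0.5, 0.5],
    learning_vector=[0.1, 0.2, 0.7],
    k=0.8)
```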
  • In step 238d, the combination logic for generating the weighted combinations at steps 232d and 234d may be updated.
  • the weights for the weighted sums in steps 232di, 232dii, 234di and 234dii may be updated.
  • the combination logic (for example the weights of the weighted sums) may be updated based on at least one of the representation of the current state of the environment, a candidate Baseline Agent action or candidate Learning Agent action and/or a performance measure of the Baseline Agents or Learning Agent.
  • Figures 3A to 3C illustrate different examples of how the methods 100 and 200 may be applied to different technical domains.
  • a more detailed discussion of example use cases is provided below, for example with reference to Figure 6, however Figures 3A to 3C provide an indication of example environments, actions, state representations etc. for different technical domains. It will be appreciated that the technical domains illustrated in Figures 3A to 3C are merely for the purpose of illustration, and application of the methods 100 and 200 to other technical domains may be envisaged.
  • Figures 3A and 3B illustrate steps of a method 300 for managing a system controlling an environment that is operable to perform a task.
  • the steps of the method 300 largely correspond to the steps of the method 200, and reference is made to the above discussion of the method 200 for the detail of the corresponding method steps.
  • the method 300 comprises providing, to a plurality of Agents, a representation of a current state of the environment, wherein the plurality of Agents comprises at least one Learning Agent operable to implement an RL model for selecting actions to be executed on the environment, and a plurality of Baseline Agents, each Baseline Agent operable to implement a policy for selecting actions to be executed on the environment, wherein each policy implemented by a Baseline Agent satisfies a criterion with respect to performance of the task.
  • the environment may comprise a communication network
  • the task that the environment is operable to perform may comprise provision of communication network services.
  • the system controlling the environment may comprise a network management system, a network operations centre, a core network function, etc.
  • the environment may comprise a cell of a communication network, and the task that the environment is operable to perform may comprise provision of communication network services.
  • the system controlling the environment may comprise a cloud RAN system, a virtualised or other function within a cloud RAN system, an eNodeB, a gNodeB, any other implementation of node that is operable to transmit, receive, process and/or orchestrate wireless signals, etc.
  • the environment may comprise a vehicle, and the task that the environment is operable to perform may comprise advancing over a terrain.
  • the system controlling the environment may comprise a vehicle navigation system, collision avoidance system, engine control system, steering system etc.
  • the method 300 comprises receiving, from the one or more Learning Agents, one or more candidate Learning Agent actions for execution on the environment, and, from the plurality of Baseline Agents, a plurality of candidate Baseline Agent actions for execution on the environment.
  • Step 320i illustrates example actions that may be envisaged in the case of an environment comprising a communication network, cell of a communication network, or cell sector of a communication network.
  • Such example actions include: an allocation decision for a communication network resource; a configuration for a communication network node, which may be a physical or virtual node, for example implementing a Virtualised Network Function; a configuration for communication network equipment; a configuration for a communication network operation; a decision relating to provision of communication network services for a wireless device; a configuration for an operation performed by a wireless device in relation to the communication network.
  • the method 300 comprises generating an environment action on the basis of the candidate Learning Agent action and the plurality of candidate Baseline Agent actions. This step may be carried out according to any of the examples illustrated in Figures 2C and 2D and/or discussed above.
  • steps 335 and 337 may then be carried out.
  • the method 300 comprises verifying an impact of the environment action on a neighbour cell or sector. If the impact of the environment action on a neighbour cell or sector violates a neighbour cell or sector performance condition, the method comprises, at step 337, generating a new environment action on the basis of the candidate Learning Agent action and the plurality of candidate Baseline Agent actions.
  • consideration of neighbour cell or sector impact could also be included in the actions of the Baseline Agents and/or Learning Agents. If it can be ensured that the newly generated environment action of step 337 is "safe” in that its impact on a neighbour cell or sector is acceptable, then the action may be forwarded directly to the system for execution on the environment in step 340. This may be the case for example if one or more Baseline Agent policies take such impact into account and if an action proposed by such a Baseline Agent is selected at step 337. Alternatively, if it cannot be ensured that the impact of the newly generated environment action on a neighbour cell or sector is acceptable, then the check of step 335 may be repeated.
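  • One way the check and regeneration of steps 335 and 337 could be realised is a bounded verify-and-regenerate loop; the helper names below (generate_environment_action, verify_neighbour_impact) and the fallback to no action are assumptions for illustration:

```python
def safe_action_for_cell(state, candidates, generate_environment_action,
                         verify_neighbour_impact, max_attempts=5):
    """Regenerate the environment action until its impact on neighbour
    cells or sectors is acceptable (a sketch of steps 335 and 337)."""
    excluded = []
    for _ in range(max_attempts):
        action = generate_environment_action(state, candidates, excluded)
        if verify_neighbour_impact(state, action):   # step 335: impact acceptable?
            return action
        excluded.append(action)                      # step 337: generate a new action
    # Assumed fallback when no acceptable action is found within the attempt budget.
    return None
```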
  • the method 300 comprises providing the environment action to the system for execution on the environment.
  • the method then comprises, in step 350, receiving, from the system, a representation of a state of the environment following execution of the environment action, and a value of a reward function representing an impact of execution of the environment action on performance of the task.
  • the reward function comprises a function of at least one performance parameter for the environment, such as a performance parameter for the communication network, vehicle, factory, power plant etc.
  • the precise reward function may be determined at the domain level, and may be considered to be comprised within the domain knowledge for the particular environment.
  • the method 300 comprises providing, to the plurality of Agents, the representation of a state of the environment following execution of the environment action, the value of the reward function, and a representation of the environment action.
  • both Baseline Agents and Learning Agents may use the received feedback elements of updated state representation, reward value, and representation of the action that was executed on the environment, to propose a new action for execution on the environment.
  • the number of feedback elements that are used by a particular Agent in order to propose a next action may depend upon the particular policy implemented by the Agent.
  • Learning Agents In addition to proposing a new action, Learning Agents also use the feedback information (including any one or more of updated state representation, reward value, and representation of the action that was executed on the environment) to continuously train and update their prediction model according to the particular RL algorithm they are implementing. In this manner, the Learning Agents continually seek to improve the actions that they will recommend in the future, according to a goal for example of maximising future reward.
  • feedback information including any one or more of updated state representation, reward value, and representation of the action that was executed on the environment
  • the representation of the environment state that is provided to the Learning Agent or Agents and the Baseline Agents may comprise parameter values for suitable parameters according to the particular environment.
  • Figure 3C illustrates example elements that may be included in the representation 301 of the environment state for an environment comprising a communication network, cell of a communication network, or cell sector of a communication network.
  • the example elements include: a value of a network coverage parameter 301a; a value of a network capacity parameter 301b; a value of a network congestion parameter 301c; a current network resource allocation 301d; a current network resource configuration 301e; a current network usage parameter 301f; a current network parameter of a neighbour communication network cell 301g; a value of a network signal quality parameter 301h; a value of a network signal interference parameter 301i; a value of a network power parameter 301j; a current network frequency band 301k; a current network antenna down-tilt angle 301l; a current network antenna vertical beamwidth 301m; a current network antenna horizontal beamwidth 301n; a current network antenna height 301o; a current network geolocation 301p; and a current network inter-site distance 301q.
  • Elements for the representation of environment state for other environments may be envisaged according to the parameters that may be used to describe a particular environment.
  • the environment state may be represented by any one or more of vehicle geographic or communication network position, vehicle orientation, vehicle position or orientation with respect to a route, roadway or path, vehicle speed, velocity, acceleration, engine speed, engine temperature, fuel reserve, detection of and parameters representing peripheral objects, etc.
  • Suitable parameters including temperature, pressure, humidity, chemical composition, velocity of components, flow rates etc., may be envisaged for other industrial, manufacturing, commercial and other environments.
  • Figures 1 to 3C discussed above provide an overview of methods which may be performed according to different examples of the present disclosure.
  • the methods involve the receipt of proposed actions from Learning Agents and Baseline Agents, and the generation of an environment action for execution on an environment on the basis of the received proposed actions.
  • FIG 4 illustrates an overview of a modular Safe Reinforcement Learning (SRL) architecture, elements of which may implement the methods disclosed herein.
  • the architecture comprises n Learning Agents (illustrated as Reinforcement Learning (RL) Agents) 3 that are continuously benchmarked against m Baseline Agents (illustrated as Safe Baselines) 4 by a Safety Shield 2 that, through the imposition of Safety Constraints 9 generated by Safety Logic 5, selects an action 6 to be performed on the environment 1.
  • the Safety Shield 2 and Safety Logic 5 may thus jointly implement the methods 100, 200, 300.
  • Environment 1: This component represents the real environment that the agent is acting upon, modelled as a standard RL problem following the definition provided above.
  • the environment is assumed to be controlled by a system with which the Safety Shield 2 interacts.
  • Safety Shield 2: This component is the mediator between the Learning Agents 3 and the Environment 1.
  • the Safety Shield 2 of the architecture of Figure 4 acts as a proxy between the Learning Agents 3 and the Environment 1, protecting the Environment 1 from "unsafe" actions 8 that may be proposed by the Learning Agents.
  • the Safety Shield 2 is the representation of the Environment 1 : the Agents receive feedback from the Environment via the Safety Shield 2 and propose the next action directly to the Safety Shield 2.
  • the safety shield collects suggested actions 8 from Learning Agents 3 and Baseline Agents 4 (the candidate Learning Agent actions and candidate Baseline Agent actions).
  • the Safety Shield 2 then chooses the final safe action 6 (the environment action discussed above) to be performed on the real Environment 1, with input of Safety Constraints 9 from the Safety Logic 5 (providing the criteria, combination logic, weighting, etc. discussed above with reference to Figures 2C and 2D).
  • Feedback 7 from the Environment 1 is also collected by the Safety Shield 2, and together with the performed action 6 (the environment action), whether that action was a candidate Baseline Agent action or a candidate Learning Agent action, is fed back to the Learning Agents 3 and Baseline Agents 4 through the safety feedback 10, allowing the Agents to prepare proposed actions for the next time step.
  • the Learning Agents 3 and Baseline Agents 4 may also provide feedback 11 to the Safety Logic 5 regarding their performance and/or learning parameters, which feedback may be used for example to adapt dynamic rules or to collect trajectory information from the Agents.
  • the architecture comprises a set of n online RL Agents 3 that indirectly interact with the Environment 1 by suggesting actions 8 to the Safety Shield 2.
  • the Learning Agents 3 are also continuously trained from previous interactions and collected feedback that improves their future recommended actions.
  • different Learning Agents 3 may use different learning models and hyperparameters, and may therefore suggest different actions at a given time.
  • Using multiple Learning Agents 3 enables parallel experimentation with different learning techniques and parameters, allowing the Safety Shield 2 to determine a suitable safe action from amongst those proposed.
  • the architecture comprises a set of m Safe Baselines 4.
  • Baselines receive feedback about the current state of the environment to suggest actions that are considered to be safe.
  • Some embodiments of safe baselines include (i) models that have been trained on previous actions known to be safe, (ii) rule-based modules that recommend actions satisfying safety criteria, or any other implementation of a safe action proposer.
  • the Safe Baselines thus fulfil at least one criterion that defines them as "Safe” with respect to the environment and/or its performance of a task. It will be appreciated that the precise definition of "safe”, as discussed above, will vary according to technical domain, use case etc., and may even vary with time or circumstances for a single use case. It may be envisaged that the Safe Baselines provide performance that is safe but not necessarily optimal.
  • Safety Logic 5 provides the constraints 9 to the Safety Shield 2, which constraints are used to select a safe action from among those proposed by the Learning Agents and Baseline Agents.
  • the constraints thus encapsulate the criterion or criteria which are used to generate the environment action 6 from the candidate Baseline Agent actions and candidate Learning Agent actions 8.
  • Some examples of safety logic include:
  • Rule-based, where the safety logic uses predefined rules based on safety criteria to choose a safe action from amongst those proposed by the agents and baselines.
  • the safety logic is independent of the performance and feedback of the Learning Agents 3 and Baseline Agents 4.
  • In another example, the Safety Logic 5 uses a pre-trained supervised learning model to assess the value of the actions that are proposed by the different Learning Agents 3 and Baseline Agents 4.
  • a supervised learning model is trained on historical data from the Environment 1, with the input for the model defined as a given Environment state plus an action, and the output defined as the Environment state after applying the action. With a sufficiently large sample size, a model can be trained that is reasonably accurate in predicting the likely change in Environment state for each of the proposed actions.
  • the Safety Logic may use such a pre-trained model to compare the actions proposed by Learning Agents and Baseline Agents by querying the model with each action in addition to the current environment state. From the predicted next state, the Safety Logic can calculate various metrics to score the proposed actions, such as reward (e.g. KPI(state_{i+1}) − KPI(state_i)) or a safety constraint (e.g. KPI(state_{i+1}) > t for any known safety threshold t). In one example, the safety logic may compare proposed actions by using the state KPIs directly. The Safety Logic may choose the action that achieves the best predicted state KPIs and forward it to the Safety Shield 2 for provision to the environment.
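  • A sketch of this scoring, assuming a pre-trained model exposing model.predict(state, action) that returns a predicted next-state KPI value, together with a KPI-delta reward and a safety threshold (all of which are assumptions for illustration):

```python
def score_and_choose(current_kpi, state, candidate_actions, model, safety_threshold):
    """Score proposed actions by predicted KPI and apply a safety constraint."""
    best_action, best_reward = None, float("-inf")
    for action in candidate_actions:
        predicted_kpi = model.predict(state, action)   # predicted KPI of the next state
        if predicted_kpi <= safety_threshold:          # safety constraint: KPI_next > threshold
            continue
        reward = predicted_kpi - current_kpi           # reward: KPI_next - KPI_current
        if reward > best_reward:
            best_action, best_reward = action, reward
    # Best safe action is forwarded to the Safety Shield; None if no action is safe.
    return best_action
```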
  • Figure 5 illustrates a process flow during which examples of the methods 100, 200, 300 may be implemented by elements of the example architecture of figure 4.
  • STEP A Collect the policies/models implemented by the Baseline Agents, or train the models for example using offline data.
  • STEP B Create the Learning Agent models.
  • STEP C Design safety logic (may incorporate information from Learning Agent models and Baselines)
  • STEP D Observe current environment state (see for example step 202 of method 200)
  • STEP E The Safety shield queries probability of actions from all Learning Agent policies and Baseline Agents (see for example discussion above of action probability vectors and steps 120, 220, 320).
  • STEP F Safety shield applies (calls) safety constraint from safety logic on proposed actions (see for example steps 130, 230, 330).
  • STEP G Safe action is propagated to the environment by the safety shield (see for example steps 140, 240, 340).
  • STEP H Feedback (next state, reward/loss, possibly additional feedback information) from environment to safety shield (see for example steps 250, 350).
  • STEP I The safety feedback is returned to the agents and baselines (see for example steps 260, 360).
  • STEP J Learning Agents are trained on the safety feedback (according to RL learning algorithms).
  • STEP K Safety logic receives feedback from agents and baselines (see for example step 270).
  • Modern cellular networks increasingly face the need to satisfy consumer demand that is highly variable in both the spatial and the temporal domains.
  • In order to maintain an acceptable Quality of Service (QoS) for User Equipments (UEs), networks must adjust their configuration in an automatic and timely manner.
  • The antenna vertical tilt angle is referred to as the downtilt angle.
  • the downtilt angle can be modified both in a mechanical and an electronic manner, but owing to the cost associated with manually adjusting the downtilt angle, Remote Electrical Tilt (RET) optimisation is used in the vast majority of modern networks.
  • KPIs Key Performance Indicators
  • CCO Coverage Capacity Optimization
  • Figure 6 illustrates the trade-off between coverage and capacity for the present RET use case: an increase in antenna downtilt correlates with a stronger signal in a more concentrated area, as well as higher capacity and reduced interference radiation towards other cells in the network.
  • excessive downtilting can result in insufficient coverage in a given area, with some UEs unable to receive a minimum signal quality.
  • There exists a dataset D, created by observing the cellular network while the baseline policies were in effect.
  • D consists of N trajectories describing the interaction of the baseline policies with the network environment.
  • Each trajectory component contains the current state of the system, the action chosen by the baseline method, and the corresponding reward.
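  • One possible in-memory representation of the dataset D described above is sketched below; the field names and types are illustrative assumptions rather than a format prescribed by the disclosure.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TransitionRecord:
    state: List[float]   # KPI-based representation of the current network state
    action: int          # discrete downtilt change selected by the baseline policy
    reward: float        # reward observed after the action was executed

# D consists of N trajectories, each a sequence of transition records.
Dataset = List[List[TransitionRecord]]
```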
  • the goal is to train an RL agent to solve the RET optimization problem using the methods and architecture discussed above.
  • the network area is divided into C sectors, each served by an antenna.
  • Actions which may be proposed by Agents comprise possible discrete changes to the antenna downtilt.
  • the reward signal or function may be defined on the basis of domain knowledge.
  • a set of n ≥ 1 RL agents, whose policies take as input the state of the network and return a downtilt angle variation for the antenna sector in a reactive manner.
  • a single Learning Agent per cell is considered, and optimisation of C cell sectors is executed independently.
  • the Learning Agent's policy is a Machine Learning model.
  • the Learning Agent's policy π_w may consist of an Artificial Neural Network (ANN) parametrised by a weight vector w.
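  • A minimal sketch of such a policy network is given below. It is not the architecture used in the disclosure (which is not reproduced here); the dimensions and layer sizes are illustrative assumptions only.

```python
import torch
import torch.nn as nn

class DowntiltPolicy(nn.Module):
    """Policy pi_w: maps a state vector to a probability distribution over
    discrete downtilt changes (e.g. decrease, keep, increase)."""

    def __init__(self, state_dim: int = 4, n_actions: int = 3, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
            nn.Softmax(dim=-1),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)
```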
  • Baseline Agents: a set of m rule-based Baselines π_S1, ..., π_Sm.
  • the Baselines can be known or estimated from the dataset D (e.g. by modelling each of the policies according to an ANN and estimating the probability of action through logistic regression).
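  • A sketch of the estimation step mentioned above is shown below, using multinomial logistic regression to model the probability of each baseline action given the state. The toy data and feature layout are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def estimate_baseline_policy(states: np.ndarray, actions: np.ndarray) -> LogisticRegression:
    """Fit a model of P(action | state) for one Baseline from logged data in D."""
    model = LogisticRegression(max_iter=1000)
    model.fit(states, actions)
    return model

# Toy usage: 100 logged states with 4 KPI features and 3 possible downtilt actions.
rng = np.random.default_rng(0)
logged_states = rng.normal(size=(100, 4))
logged_actions = rng.integers(0, 3, size=100)
baseline_model = estimate_baseline_policy(logged_states, logged_actions)
action_probabilities = baseline_model.predict_proba(logged_states[:1])  # estimate of pi_S(a | s)
```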
  • Safety Logic: it is assumed that the RL agent is following a learning policy π_L that is trained alongside the safe Baselines π_Si.
  • the Safety Logic implements a control policy π_C that is initially dominated by the safe Baselines. As more trials are conducted and the learning policy π_L is trained to recommend better actions than the π_Si, the control policy begins to rely more on the newly trained models and less on the Baselines.
  • the control policy π_C is a linear combination of the π_Si and π_L.
  • the weight k, described above as the risk schedule, is initially close to 1 so as to ensure that the actions of the Baselines are prioritised during an initial training period for π_L. As π_L approaches the desired objective, k is reduced so as to take more of the Learning Agent's recommendations into account.
  • the Baseline weights ξ(s, a) are hyper-parameters controlling the importance of each Baseline. As discussed above, in some examples of the safety logic such hyper-parameters can be state-action dependent, allowing the weights to be increased for policies whose performance is greater on (s, a) and decreased for policies whose performance is poorer on (s, a). Such weights can be based, for example, on concentration bounds over previously available data from the policy, allowing an evaluation of the performance of the policy on a given (s, a).
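  • The control policy described above can be sketched as follows. This is a minimal illustration, assuming scalar Baseline weights and action-probability vectors over a common discrete action set; it is not the only way the linear combination could be realised.

```python
import numpy as np

def control_policy_action(
    baseline_probs,      # list of arrays, pi_S_i(.|s) for each Baseline
    baseline_weights,    # xi_i importance weights for the Baselines
    learning_probs,      # pi_L(.|s) from the Learning Agent
    k: float,            # risk schedule: close to 1 during early training of pi_L
) -> int:
    baseline_probs = np.asarray(baseline_probs, dtype=float)
    xi = np.asarray(baseline_weights, dtype=float)
    xi = xi / xi.sum()                                       # normalise Baseline importance
    combined_baselines = (xi[:, None] * baseline_probs).sum(axis=0)
    pi_c = k * combined_baselines + (1.0 - k) * np.asarray(learning_probs, dtype=float)
    return int(np.argmax(pi_c))                              # action with the highest probability

# Example: two Baselines and one Learning Agent over three downtilt actions.
chosen = control_policy_action(
    baseline_probs=[[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]],
    baseline_weights=[1.0, 0.5],
    learning_probs=[0.1, 0.1, 0.8],
    k=0.9,
)
```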
  • the Learning Agent may use an artificial neural network with the following architecture to approximate the value function:
  • Example hyperparameters that may be used in training include the discounting reward factor γ.
  • a plurality of services may compete over resources in a shared environment such as a Cloud.
  • the services can have different requirements and their performance may be indicated by their specific QoS KPIs. Additional KPIs that can be similar across services can also include time consumption, cost, carbon footprint, etc.
  • the shared environment may also have a list of resources that can be partially or fully allocated to services. These resources can include CPU, memory, storage, network bandwidth, Virtual Machines (VMs), Virtual Network Functions (VNFs), etc.
  • Each resource on the shared platform may have an associated Learning Agent, which may propose actions in the form of resource allocations to services.
  • a Learning Agent can suggest a resource allocation to be applied to the competing services.
  • the Learning Agents also train in parallel with correlated data comprising state (service KPIs), actions (allocated resources), next state and rewards.
  • Safe baselines may comprise resource allocation templates that are predefined by domain experts for different types of services. There can be a plurality of predefined Baselines suggesting different resource allocations for the same scenario.
  • the Safety Logic can compare suggested resource allocations from the Baselines to suggestions from the RL agent(s) and select the one with the highest predicted reward. The selected resource allocation is then performed on the environment by the shield, meaning that it is allocated to the services and their new KPIs are monitored and fed back to the Baselines and Agents through the Safety Shield.
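  • The selection step described above can be sketched as follows; the allocation representation and the predict_reward model are illustrative assumptions supplied from outside this snippet.

```python
from typing import Callable, Dict, Sequence

Allocation = Dict[str, float]  # e.g. {"cpu": 2.0, "memory_gb": 8.0, "bandwidth_mbps": 100.0}

def select_allocation(
    service_kpis: Dict[str, float],
    candidate_allocations: Sequence[Allocation],
    predict_reward: Callable[[Dict[str, float], Allocation], float],
) -> Allocation:
    """Return the suggested allocation (from Baselines or RL agents) with the
    highest predicted reward, to be applied by the Safety Shield."""
    return max(candidate_allocations, key=lambda allocation: predict_reward(service_kpis, allocation))
```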
  • Example methods according to the present disclosure can be implemented in a distributed manner, including for example a number W of distributed workers and one central worker (master) that acts as coordinator.
  • the safe baselines π_S1, ..., π_Sm may be known at the level of the distributed workers or at the central worker node. In the latter case, the communication cost between the distributed workers and the central node, in terms of communicating Baseline Agent actions and safety constraints, may be taken into account.
  • the safety shield and safety logic that implement the methods 100, 200, 300 can be implemented at the central worker node that outputs safe actions at a global level. The safety shield may also consider possible conflicting safety requirements between adjacent workers and handle them accordingly (as illustrated for example in steps 335 and 337 of method 300). In one example of a cloud implementation, there could be different hierarchical levels between workers that consider intermediate decision entities, for example at cluster level.
  • the training for the distributed model could be conducted according to a Federated Learning process.
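  • One way in which such a Federated Learning process could aggregate locally trained Learning Agent models is sketched below, assuming a simple federated-averaging scheme at the central worker; this is an illustration of the idea rather than a procedure prescribed by the disclosure.

```python
import numpy as np
from typing import Dict, Sequence

def federated_average(
    worker_weights: Sequence[Dict[str, np.ndarray]],
    worker_sample_counts: Sequence[int],
) -> Dict[str, np.ndarray]:
    """Average model parameters from W distributed workers, weighting each
    worker by the amount of local experience it trained on."""
    total = float(sum(worker_sample_counts))
    averaged: Dict[str, np.ndarray] = {}
    for name in worker_weights[0]:
        averaged[name] = sum(
            (count / total) * weights[name]
            for weights, count in zip(worker_weights, worker_sample_counts)
        )
    return averaged
```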
  • Hierarchical levels of cell, cluster of cells, and network may exist.
  • the methods 100, 200, 300 for example as implemented by a Safety Shield and Safety Logic as discussed above, may be performed at any of these levels. Carrying out the methods at a higher level allows for resolution of conflicting safety requirements but implies higher communication cost.
  • the methods 100, 200 and 300 may be implemented via a Safety Shield and Safety Logic, which are logical entities in a functional architecture proposed according to the present disclosure.
  • the Safety Shield and Safety Logic represent one logical framework for implementing the methods.
  • the present disclosure provides a management node that is adapted to perform any or all of the steps of the above discussed methods.
  • the management node may be a physical or virtual node, and may for example comprise a virtualised function that is running in a cloud, edge cloud or fog deployment.
  • the management node may carry out the methods by implementing a logical Safety Shield and Safety Logic as described above, or in any other appropriate manner.
  • the Learning Agents and Baseline Agents that supply candidate actions for the methods 100, 200 and 300 may also be running on the management node, or may be running on a separate physical or logical node, as discussed above for example with reference to the cloud implementation.
  • the management node may for example comprise or be instantiated in any part of a logical core network node, network management center, network operations center, Radio Access node etc. Any such communication network node may itself be divided between several logical and/or physical functions, and any one or more parts of the management node may be instantiated in one or more logical or physical functions of a communication network node.
  • FIG. 7 is a block diagram illustrating an example management node 700 which may implement the method 100, 200 and/or 300, as elaborated in Figures 1 to 6, according to examples of the present disclosure, for example on receipt of suitable instructions from a computer program 750.
  • the management node 700 comprises a processor or processing circuitry 702, and may comprise a memory 704 and interfaces 706.
  • the processing circuitry 702 is operable to perform some or all of the steps of the method 100, 200 and/or 300 as discussed above with reference to Figures 1 to 6.
  • the memory 704 may contain instructions executable by the processing circuitry 702 such that the management node 700 is operable to perform some or all of the steps of the method 100, 200 and/or 300, as elaborated in Figures 1 to 6.
  • the instructions may also include instructions for executing one or more telecommunications and/or data communications protocols.
  • the instructions may be stored in the form of the computer program 750.
  • the processor or processing circuitry 702 may include one or more microprocessors or microcontrollers, as well as other digital hardware, which may include digital signal processors (DSPs), special-purpose digital logic, etc.
  • the processor or processing circuitry 702 may be implemented by any type of integrated circuit, such as an Application Specific Integrated Circuit (ASIC), Field Programmable Gate Array (FPGA) etc.
  • the memory 704 may include one or several types of memory suitable for the processor, such as read-only memory (ROM), random-access memory, cache memory, flash memory devices, optical storage devices, solid state disk, hard disk drive etc.
  • FIG 8 illustrates functional modules in another example of management node 800 which may execute examples of the methods 100, 200 and/or 300 of the present disclosure, for example according to computer readable instructions received from a computer program.
  • the modules illustrated in Figure 8 are functional modules, and may be realised in any appropriate combination of hardware and/or software.
  • the modules may comprise one or more processors and may be integrated to any degree.
  • the management node 800 is for managing a system controlling an environment that is operable to perform a task.
  • the management node comprises an Agent module 802 for providing, to a plurality of Agents, a representation of a current state of the environment, wherein the plurality of Agents comprises a Learning Agent operable to implement an RL model for selecting actions to be executed on the environment and a plurality of Baseline Agents, each Baseline Agent operable to implement a policy for selecting actions to be executed on the environment, wherein each policy implemented by a Baseline Agent satisfies a criterion with respect to performance of the task.
  • the Agent module 802 is also for receiving, from the Learning Agent, a candidate Learning Agent action for execution on the environment, and, from the plurality of Baseline Agents, a plurality of candidate Baseline Agent actions for execution on the environment.
  • the management node also comprises a processing module 804 for generating an environment action on the basis of the candidate Learning Agent action and the plurality of candidate Baseline Agent actions, and an environment module 806 for providing the environment action to the system for execution on the environment.
  • the management node may further comprise interfaces 808, which may be operable to facilitate communication with the Agents and system over a suitable communication channel.
  • Examples of the present disclosure thus propose a method according to which a plurality of Baseline Agents complement at least one Learning Agent in the proposing of actions for execution on an environment.
  • Policies implemented by the Baseline Agents for the selection of actions satisfy a criterion with respect to performance of a task by the environment, thus providing a benchmark for safety of the Learning Agent proposed action with respect to the environment and the task it performs.
  • Both the Baseline Agent proposed actions and the Learning Agent proposed action are used to generate an environment action, which is the action that is actually provided to a system controlling the environment for execution on the environment.
  • Advantages offered by examples of the present disclosure include safety, the possibility to tune an acceptable level of risk, heterogeneity, modularity and scalability.
  • a risk hyperparameter allows a trade-off between improvement over the Baselines and safety. For high values of the risk hyperparameter, a greater degree of improvement of the RL policies will be achieved at the expense of safety guarantees.
  • the risk hyperparameter may evolve with time, and/or may be dependent upon state, action or state-action pairs, allowing for flexibility in how different Baselines are prioritised according to their performance or optimisation for a given region of the state-action space.
  • Baseline Agents may be heterogeneous, for example mixing rule-based and data-driven policies. Using different Baseline policies can be beneficial as different Baselines may have better performance over different regions of the state-action space. Combining inputs from different Baseline policies may therefore lead to improved overall performance.
  • methods according to the present disclosure may operate according to any safety logic (performance criteria) and may incorporate candidate actions from any number of Learning Agents and Baseline Agents, choosing for example the best performing action among those proposed by the Learning Agents and Baseline Agents. Multiple Learning Agents and Baseline Agents can be added, removed and replaced as new scenarios are considered.
  • the complexity of the decision-making process can be customised based on specific application need without requiring any modifications to the environment, which remains completely insulated.
  • the methods of the present disclosure are easily scalable through distributed computing.
  • the network can be divided into multiple cell clusters and therefore multiple environments, each with its own safety shield.
  • the Learning Agents and Baseline Agents can run in parallel, which also lends itself to distributed computing.
  • the methods of the present disclosure may be implemented in hardware, or as software modules running on one or more processors.
  • the methods may also be carried out according to the instructions of a computer program, and the present disclosure also provides a computer readable medium having stored thereon a program for carrying out any of the methods described herein.
  • a computer program embodying the disclosure may be stored on a computer readable medium, or it could, for example, be in the form of a signal such as a downloadable data signal provided from an Internet website, or it could be in any other form.
  • a computer implemented method (100) for managing a system controlling an environment that is operable to perform a task comprising: providing, to a plurality of Agents, a representation of a current state of the environment (110), wherein the plurality of Agents comprises (110a): a Learning Agent operable to implement a Reinforcement Learning model for selecting actions to be executed on the environment; and a plurality of Baseline Agents, each Baseline Agent operable to implement a policy for selecting actions to be executed on the environment, wherein each policy implemented by a Baseline Agent satisfies a criterion with respect to performance of the task; receiving, from the Learning Agent, a candidate Learning Agent action for execution on the environment, and, from the plurality of Baseline Agents, a plurality of candidate Baseline Agent actions for execution on the environment (120); generating an environment action on the basis of the candidate Learning Agent action and the plurality of candidate Baseline Agent actions (130); and providing the environment action to the system for execution on the environment (140).
  • the method of embodiment 1 further comprising: receiving, from the system, a representation of a state of the environment following execution of the environment action, and a value of a reward function representing an impact of execution of the environment action on performance of the task (250); and providing, to the plurality of Agents (260): the representation of a state of the environment following execution of the environment action; the value of the reward function; and a representation of the environment action.
  • evaluating the candidate Learning Agent action and the plurality of candidate Baseline Agent actions against a criterion relating to at least one of task performance or environment state comprises, for each candidate Learning Agent action and candidate Baseline Agent action: predicting a value of a reward function representing an impact of execution of the candidate Learning Agent action or candidate Baseline Agent action on performance of the task (232cii).
  • evaluating the candidate Learning Agent action and the plurality of candidate Baseline Agent actions against a criterion relating to at least one of task performance or environment state comprises, for each candidate Learning Agent action and candidate Baseline Agent action: predicting a state of the environment following execution of the candidate Learning Agent action or candidate Baseline Agent action (232ci); and predicting a value of a reward function representing an impact of execution of the candidate Learning Agent action or candidate Baseline Agent action on performance of the task on the basis of the predicted environment state (232cii).
  • generating an environment action on the basis of the candidate Learning Agent action and the plurality of candidate Baseline Agent actions comprises: generating a weighted sum of a combination of the candidate Baseline Agent actions and the candidate Learning Agent action according to at least one of (234d): a predetermined risk schedule; performance feedback of the Baseline Agents and Learning Agent.
  • generating an environment action on the basis of the candidate Learning Agent action and the plurality of candidate Baseline Agent actions comprises: generating a weighted combination of candidate Baseline Agent actions, wherein weights are assigned to individual candidate Baseline Agent actions according to at least one of (232d): the representation of the current state of the environment; a candidate Baseline Agent action; a performance measure of the Baseline Agent.
  • the candidate Learning Agent action and candidate Baseline Agent actions comprise candidate action vectors, each element of a candidate action vector corresponding to a possible action and comprising a probability that the corresponding action is the most favourable of the possible actions according to a performance measure of the task; and wherein generating an environment action on the basis of the candidate Learning Agent action and the plurality of candidate Baseline Agent actions comprises: using a control policy to generate the environment action, wherein the control policy is configured to: compute a weighted combination of candidate Baseline Agent action vectors (232dii); compute a weighted sum of the weighted combination of candidate Baseline Agent action vectors and the candidate Learning Agent action vector (234dii); and select as the environment action the action corresponding to the highest probability value in the weighted sum (236di).
  • the plurality of Agents comprises a plurality of Learning Agents, each Learning Agent operable to implement a Reinforcement Learning model for selecting actions to be executed on the environment; wherein receiving, from the Learning Agent, a candidate Learning Agent action for execution on the environment, comprises receiving a plurality of candidate Learning Agent actions from the plurality of Learning Agents (220); and wherein generating an environment action on the basis of the candidate Learning Agent action and the plurality of candidate Baseline Agent actions comprises generating an environment action on the basis of the plurality of candidate Learning Agent actions and the plurality of candidate Baseline Agent actions (230).
  • the representation of a current state of the environment comprises at least one of: a value of a network coverage parameter (301a); a value of a network capacity parameter (301b); a value of a network congestion parameter (301c); a current network resource allocation (301d); a current network resource configuration (301e); a current network usage parameter (301f); a current network parameter of a neighbour communication network cell (301g); a value of a network signal quality parameter (301h); a value of a network signal interference parameter (301i); a value of a network power parameter (301j); a current network frequency band (301k); a current network antenna down-tilt angle (301l); a current network antenna vertical beamwidth (301m); a current network antenna horizontal beamwidth (301n); a current network antenna height (301o); a current network geolocation (301p); a current network inter-site distance (301q).
  • an action for execution on the environment comprises at least one of (320i): an allocation decision for a communication network resource; a configuration for a communication network node; a configuration for communication network equipment; a configuration for a communication network operation; a decision relating to provision of communication network services for a wireless device; a configuration for an operation performed by a wireless device in relation to the communication network.
  • the environment comprises a sector of a cell of a communication network and wherein the task that the environment is operable to perform comprises provision of radio access network services; wherein the representation of a current state of the environment comprises at least one of: a coverage parameter for the sector; a capacity parameter for the sector; a signal quality parameter for the sector; a down tilt angle of the antenna serving the sector; and wherein an action for execution on the environment comprises a down tilt adjustment value for the antenna serving the sector.
  • a computer program product comprising a computer readable medium, the computer readable medium having computer readable code embodied therein, the computer readable code being configured such that, on execution by a suitable computer or processor, the computer or processor is caused to perform a method of any one of embodiments 1 to 22.
  • a management node (700) for managing a system controlling an environment that is operable to perform a task comprising processing circuitry (702) configured to: provide, to a plurality of Agents, a representation of a current state of the environment, wherein the plurality of Agents comprises: a Learning Agent operable to implement a Reinforcement Learning model for selecting actions to be executed on the environment; and a plurality of Baseline Agents, each Baseline Agent operable to implement a policy for selecting actions to be executed on the environment, wherein each policy implemented by a Baseline Agent satisfies a criterion with respect to performance of the task; receive, from the Learning Agent, a candidate Learning Agent action for execution on the environment, and, from the plurality of Baseline Agents, a plurality of candidate Baseline Agent actions for execution on the environment; generate an environment action on the basis of the candidate Learning Agent action and the plurality of candidate Baseline Agent actions; provide the environment action to the system for execution on the environment.


Abstract

A computer implemented method (100) is disclosed for managing a system controlling an environment that is operable to perform a task. The method comprises providing, to a plurality of Agents, a representation of a current state of the environment (110), wherein the plurality of Agents comprises a Learning Agent operable to implement a Reinforcement Learning model for selecting actions to be executed on the environment, and a plurality of Baseline Agents, each Baseline Agent operable to implement a policy for selecting actions to be executed on the environment, wherein each policy implemented by a Baseline Agent satisfies a criterion with respect to performance of the task (110a). The method further comprises receiving, from the Learning Agent, a candidate Learning Agent action for execution on the environment, and, from the plurality of Baseline Agents, a plurality of candidate Baseline Agent actions for execution on the environment (120), generating an environment action on the basis of the candidate Learning Agent action and the plurality of candidate Baseline Agent actions (130), and providing the environment action to the system for execution on the environment.

Description

Methods and Apparatus for Managing a System that Controls an Environment
Technical Field
The present disclosure relates to a method for managing a system controlling an environment that is operable to perform a task. The present disclosure also relates to a management node, and to a computer program and a computer program product for managing a system controlling an environment that is operable to perform a task.
Background
Reinforcement Learning (RL) is a decision-making framework in which an agent interacts with an environment by exploring its states and selecting actions to be executed on the environment. Actions are selected with the aim of maximising the long-term return of the actions according to a reward signal. More formally, an RL problem is defined by:
• The state space S which is the set of all possible states in which the environment may exist,
• The action space A which is the set of all possible actions which may be executed on the environment,
• The transition probability distribution P, which is the probability of transitioning from one state to another based on an action selected by the agent,
• The reward distribution R, which incentivises or penalises specific state-action pairs.
The agent's policy π defines the control strategy implemented by the agent, and is a mapping from states to a policy distribution over possible actions, the distribution indicating the probability that each possible action is the most favourable given the current state. An RL interaction proceeds as follows: at each time instant t, the agent finds the environment in a state s_t ∈ S. The agent selects an action a_t ~ π(·|s_t) ∈ A, receives a stochastic reward r_t ~ R(·|s_t, a_t), and the environment transitions to a new state s_t+1 ~ P(·|s_t, a_t). The agent's goal is to find the optimal policy, i.e. a policy that maximizes the expected cumulative reward over a predefined period of time, also known as the policy value function V_π(s) = E_π[ Σ_{t=0}^{T} γ^t r_t | s_0 = s ], where γ is a discount factor and T is the length of the predefined period of time.
While executing the above discussed dynamic optimisation process in an unknown environment (with respect to transition and reward probabilities), the RL agent needs to try out, or explore, different state-action combinations with sufficient frequency to be able to make accurate predictions about the rewards and the transition probabilities of each state-action pair. It is therefore necessary for the agent to repeatedly choose suboptimal actions, which conflict with its goal of maximizing the accumulated reward, in order to sufficiently explore the state-action space. At each time step, the agent must decide whether to prioritize further gathering of information (exploration) or to make the best move given current knowledge (exploitation). Exploration may create opportunities by discovering higher rewards on the basis of previously untried actions. However, exploration also carries the risk that previously unexplored decisions will not provide increased reward and may instead have a negative impact on the environment. This negative impact may only be short term or may persist, for example if the explored actions place the environment in an undesirable state from which it does not recover.
In the context of RL, an optimal policy is usually derived in a trial-and-error fashion by direct interaction with the environment. In the course of such interaction, the agent will explore suboptimal regions of the state-action space. In many technical domains and real world use cases, this suboptimal exploration may result in unacceptable performance degradation, risk taking, or breaching of safety regulations. Consequently, the standard approach for RL solutions is to employ a simulator as a proxy for the real environment during the training phase, thus allowing for unconstrained exploration without concern for performance degradation. However, simulators are often subject to modelling errors related to inherent environment stochasticity, and this calls into question their reliability for training an RL agent policy that will be deployed into the real world.
Significant research has been directed to the challenge of addressing the risk of unacceptable performance degradation in RL agent training, and to circumventing the issue of inaccurate simulations, resulting in the development of Safe Reinforcement Learning (SRL) techniques.
Many definitions of "safety” in the context of RL have been proposed, as well as a wide range of "safe RL methods” that seek to respect these definitions. Examples of strategies used in SRL include the use of accumulated past knowledge, making conservative action choices that prioritise avoiding the worst-case scenario, or requesting guidance from an external agent or a human operator, when the current state is considered too risky. Notable recent developments in the SRL field include L. Torrey and M. E. Taylor,
"Help an agent out: Student/teacher learning in sequential decision tasks”, Proceedings of the Adaptive and Learning Agents Workshop 2012, ALA 2012 - 2012, in which uncertainties in the problem environment are modelled as a set of possible environments and safety is achieved by providing a solution that is safe for all of the possible environments. Another approach, proposed in a non-published reference document, provides a mechanism to execute constrained exploration based on the distance between the actions executed by a safe baseline policy and the action executed by the learning policy. The mechanism is based on a hyperparameter e , which implements a trade-off between exploration and safety. A limitation of both the above approaches is their specificity. Many problem domains contain highly diverse state spaces, which are extremely difficult to accurately model, even with multiple possible environments. In addition, a safe baseline policy may be overly limiting for significant regions of the state space, and the e based mechanism discussed above for choosing an action based on the safe baseline policy may also be overly conservative in many cases.
Alshiekh, M., Bloem, R., Ehlers, R., Konighofer, B., Niekum, S. and Topcu, U., 2018, April, Safe reinforcement learning via shielding, proposes filtering out unsafe actions that are proposed by the RL agent before they can be executed on the environment. However the proposed approach uses temporal logic, a model-based method, to formulate safety into logical rules and calculate safety guarantees. This makes a safety model difficult to scale, as each model formulating the logical rules has to be updated with evolution of an environment, or be completely reconstructed for each new environment or use case.
Summary
It is an aim of the present disclosure to provide a method, management node, and computer readable medium which at least partially address one or more of the challenges discussed above. It is a further aim of the present disclosure to provide a method, management node and computer readable medium which facilitate selection of optimal or close to optimal actions for an environment that is operable to perform a task, without compromising task performance.
According to a first aspect of the present disclosure, there is provided a computer implemented method for managing a system controlling an environment that is operable to perform a task. The method comprises providing, to a plurality of Agents, a representation of a current state of the environment, wherein the plurality of Agents comprises a Learning Agent operable to implement a Reinforcement Learning model for selecting actions to be executed on the environment, and a plurality of Baseline Agents, each Baseline Agent operable to implement a policy for selecting actions to be executed on the environment, wherein each policy implemented by a Baseline Agent satisfies a criterion with respect to performance of the task. The method further comprises receiving, from the Learning Agent, a candidate Learning Agent action for execution on the environment, and, from the plurality of Baseline Agents, a plurality of candidate Baseline Agent actions for execution on the environment. The method further comprises generating an environment action on the basis of the candidate Learning Agent action and the plurality of candidate Baseline Agent actions, and providing the environment action to the system for execution on the environment.
According to another aspect of the present disclosure, there is provided a computer program and a computer program product configured, when run on a computer to carry out a method according to any one of the aspects or examples of the present disclosure. According to another aspect of the present disclosure, there is provided a management node for managing a system controlling an environment that is operable to perform a task. The management node comprises processing circuitry configured to provide, to a plurality of Agents, a representation of a current state of the environment, wherein the plurality of Agents comprises a Learning Agent operable to implement a Reinforcement Learning model for selecting actions to be executed on the environment, and a plurality of Baseline Agents, each Baseline Agent operable to implement a policy for selecting actions to be executed on the environment, wherein each policy implemented by a Baseline Agent satisfies a criterion with respect to performance of the task. The processing circuitry is further configured to receive, from the Learning Agent, a candidate Learning Agent action for execution on the environment, and, from the plurality of Baseline Agents, a plurality of candidate Baseline Agent actions for execution on the environment. The processing circuitry is further configured to generate an environment action on the basis of the candidate Learning Agent action and the plurality of candidate Baseline Agent actions, and to provide the environment action to the system for execution on the environment.
According to another aspect of the present disclosure, there is provided a system for controlling an environment that is operable to perform a task. The system comprises a management node according to the preceding aspect of the present disclosure, a Learning Agent operable to implement a Reinforcement Learning model for selecting actions to be executed on the environment, and a plurality of Baseline Agents, each Baseline Agent operable to implement a policy for selecting actions to be executed on the environment, wherein each policy implemented by a Baseline Agent satisfies a criterion with respect to performance of the task.
Examples of the present disclosure thus propose a method according to which a plurality of Baseline Agents complement at least one Learning Agent in proposing actions for execution on an environment. Policies implemented by the Baseline Agents for the selection of actions satisfy a criterion with respect to performance of a task by the environment, thus providing a benchmark for safety of the Learning Agent proposed action or actions with respect to the environment and the task it performs. Both the Baseline Agent proposed actions and the Learning Agent proposed action or actions are used to generate an environment action, which is the action that is actually provided to a system controlling the environment for execution on the environment. The method proposed in the present disclosure thus in effect shields the environment from the Learning Agent, allowing the Learning Agent to propose actions but generating an action for forwarding to the environment on the basis not just of the Learning Agent proposal, but also on the basis of proposals from a plurality of Baseline Agents. The plurality of Baseline Agents may offer different perspectives, implement differing logic, be based on different datasets, and/or be optimised for different regions of the state-action space of the environment, thus offering a more complete, flexible and scalable representation of "safety” with respect to the environment than is offered by methods proposed in existing art. In examples in which a plurality of Learning Agents propose actions, each learning agent may implement different learning models and/or use different training techniques, so enabling parallel experimentation with different learning techniques and parameters, supporting optimal control of the environment.
According to another aspect of the present disclosure, there is provided a computer implemented method for managing a system controlling a communication network that is operable to provide a communication network service. The method comprises providing, to a plurality of Agents, a representation of a current state of the communication network, wherein the plurality of Agents comprises a Learning Agent operable to implement a Reinforcement Learning model for selecting actions to be executed on the communication network, and a plurality of Baseline Agents, each Baseline Agent operable to implement a policy for selecting actions to be executed on the communication network, wherein each policy implemented by a Baseline Agent satisfies a criterion with respect to provision of a communication network service by the communication network. The method further comprises receiving, from the Learning Agent, a candidate Learning Agent action for execution on the communication network, and, from the plurality of Baseline Agents, a plurality of candidate Baseline Agent actions for execution on the communication network. The method further comprises generating a communication network action on the basis of the candidate Learning Agent action and the plurality of candidate Baseline Agent actions, and providing the communication network action to the system for execution on the communication network.
Examples of the present disclosure thus provide a method that facilitates the use of Reinforcement Learning in control of a communication network, without risking the compromised network performance that could arise through direct use of a Reinforcement Learning model or algorithm, for example during a learning or exploration phase.
Brief Description of the Drawings
For a better understanding of the present disclosure, and to show more clearly how it may be carried into effect, reference will now be made, by way of example, to the following drawings in which:
Figure 1 is a flow chart illustrating process steps in a computer implemented method for managing a system controlling an environment that is operable to perform a task;
Figures 2A to 2D show a flow chart illustrating process steps in another example of a computer implemented method for managing a system controlling an environment that is operable to perform a task; Figures 3A to 3C show a flow chart illustrating process steps in another example of a computer implemented method for managing a system controlling an environment that is operable to perform a task;
Figure 4 illustrates an overview of a modular Safe Reinforcement Learning architecture;
Figure 5 illustrates a process flow during which examples of the methods of Figures 1 to 3C may be implemented;
Figure 6 illustrates coverage and capacity in Remote Electronic Tilt optimisation;
Figure 7 is a block diagram illustrating functional modules in a management node; and
Figure 8 is a block diagram illustrating functional modules in another example of a management node.
Detailed Description
Examples of the present disclosure propose a method for Safe Reinforcement Learning (SRL), and an architecture on which it may be implemented, that change the standard RL interaction cycle so as to ensure the safety of the environment with respect to performance of a task. Conceptually, the method may be envisaged as implementing a safety shield which protects an environment by preventing a Learning agent from interacting directly with the environment, and safety logic which determines what action should be provided by the shield to the environment for execution. The action for execution is determined by the safety logic on the basis of propositions from at least one Learning Agent, which may be implementing an RL model, and a heterogeneous plurality of Baseline Agents, each of which implements a policy that is "safe” in the sense that it fulfils a criterion with respect to task performance. Safety of the actions proposed by the one or more Learning Agents is evaluated with respect to a minimum performance level that is assured by the criterion respected by the Baseline Agent policies. For example, average improvement over the Baseline policies, or improvement over the best/worst Baseline policy, may be considered. The safety shield implemented by the method proposed herein protects the environment by acting as the interface through which the Learning Agent may interact with the environment, ensuring that any action forwarded to the environment for execution satisfies safety requirements encapsulated by the Baseline Agent policies and their respect for the task performance criterion. The Learning Agent or Agents do not therefore have direct access to the environment, ensuring that unsafe actions proposed by the Learning Agent or Agents will not be executed on the environment. At each time step, the one or more Learning Agents and the Baseline Agents propose actions for execution on the environment. The method proposed herein involves evaluating those actions on the basis of safety logic that enables the generation of an action for execution on the basis of all proposed actions, and forwarding that action for execution on the environment. The evaluation of proposed actions may for example involve the building of a safe set of actions and an unsafe set of actions, with the action for execution being selected as the safe action that has the highest predicted performance.
Figure 1 is a flow chart illustrating process steps in a computer implemented method for managing a system controlling an environment that is operable to perform a task. Referring to Figure 1, the method comprises, in a first step 110, providing, to a plurality of Agents, a representation of a current state of the environment. As illustrated at 110a, the plurality of Agents comprises a Learning Agent operable to implement a Reinforcement Learning (RL) model for selecting actions to be executed on the environment.
The plurality of Agents may in some examples comprise a plurality of Learning Agents, as discussed in further detail with reference to Figure 2A. The plurality of Agents further comprises a plurality of Baseline Agents, each Baseline Agent operable to implement a policy for selecting actions to be executed on the environment, wherein each policy implemented by a Baseline Agent satisfies a criterion with respect to performance of the task.
For the purpose of the present disclosure, an Agent comprises a physical or virtual entity that is operable to implement a policy for the selection of actions on the basis of an environment state. Examples of a physical Agent may include a computer system, computing device, server etc. Examples of a virtual entity may include a piece of software or computer program, a code fragment operable to implement a computer program, a virtualised function, or any other logical entity. A virtual entity may for example be instantiated in a cloud, edge cloud or fog deployment. A learning Agent comprises an Agent that is operable to implement an RL model for selecting actions to be executed on an environment. Examples of RL policy models may include Q-learning, State-Action-Reward-State-Action (SARSA), Deep Q Network, Policy Gradient, Actor-Critic, Asynchronous Advantage Actor-Critic (A3C), etc. A learning Agent is operable to use feedback for training in an online environment, in order to continually update the RL model and improve the quality of actions selected. A Baseline Agent comprises an Agent operable to implement a policy for selecting actions to be executed on an environment, which policy satisfies a criterion with respect to performance of a task by the Environment. Example policies that may be implemented by a Baseline Agent include rule based policies developed by domain experts to conform to the relevant task performance criterion, and data-driven policies, such as Machine Learning models that have been trained on actions that fulfil the task performance criterion, and are therefore considered to be "safe” with respect to task performance. The criterion that is satisfied by a policy implemented by a Baseline Agent may take any number of different forms, for example according to the nature of different policies implemented by the Baseline Agents. The criterion may also or alternatively encompass a plurality of requirements with respect to task performance. For example, in the case of a rule based policy implemented by a Baseline Agent, the criterion may require that the policy be configured to avoid specific undesirable outcomes with respect to task performance, or specific undesirable environment states, in which states the environment cannot perform its task to an acceptable level. In another example, in the case of a data-driven policy that is implemented by a Baseline Agent and has been used previously in an online setting, the criterion may require that the policy has already been trained to an acceptable level of performance, or has been used in the online setting for a minimum length of time without adverse incident.
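As an illustration of one of the RL model families named above, a minimal tabular Q-learning update is sketched below; the state and action encodings and the hyperparameter values are illustrative assumptions and not part of the disclosure.

```python
import numpy as np

n_states, n_actions = 10, 3
alpha, gamma = 0.1, 0.9                 # learning rate and discount factor
Q = np.zeros((n_states, n_actions))     # tabular action-value estimates

def q_update(state: int, action: int, reward: float, next_state: int) -> None:
    """Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))."""
    td_target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (td_target - Q[state, action])
```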
Referring still to Figure 1, the method 100 further comprises, in step 120, receiving, from the Learning Agent, a candidate Learning Agent action for execution on the environment, and, from the plurality of Baseline Agents, a plurality of candidate Baseline Agent actions for execution on the environment.
For the purposes of the present disclosure, an action comprises any intervention which may be made on the environment. The intervention may for example comprise a change in a parameter of the environment, a change of state of a component element of the environment, a decision with respect to the environment, allocation of resources within the environment, etc. A candidate Learning Agent action comprises an action that has been proposed by a Learning Agent for execution on the environment. Similarly, a candidate Baseline Agent action comprises an action that has been proposed by a Baseline Agent for execution on the environment. A candidate Learning Agent action and/or candidate Baseline Agent action may in some examples comprise a vector of action probabilities, each element of the vector corresponding to a possible action and comprising a probability, evaluated by the Agent that proposed the candidate action, that the corresponding possible action is the most favourable of the possible actions according to a performance measure of the task.
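A candidate action expressed as a vector of action probabilities can be illustrated as follows; the action names and probability values are purely illustrative.

```python
import numpy as np

possible_actions = ["decrease_downtilt", "keep_downtilt", "increase_downtilt"]
candidate_action_vector = np.array([0.15, 0.25, 0.60])  # probabilities sum to 1.0
# The Agent's proposal corresponds to the most favourable action in the vector.
proposed_action = possible_actions[int(np.argmax(candidate_action_vector))]
```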
Referring again to Figure 1, in step 130, the method 100 comprises generating an environment action on the basis of the candidate Learning Agent action and the plurality of candidate Baseline Agent actions. An environment action comprises an action that is to be executed on the environment. In step 140, the method 100 comprises providing the environment action to the system for execution on the environment.
Examples of the method 100 thus facilitate the use of RL in control of an environment that is operable to perform a task, without compromising task performance. In addition to protecting the environment from unfiltered access by a Learning Agent which could, owing to a need to explore the relevant state-action space, propose actions leading to undesirable outcomes, the method 100 offers a modularity and adaptability that lends itself to application on a range of different dynamically evolving use cases. For example, the number of Baseline Agents, the number of Learning Agents, and the policies or RL models that they implement, may be adjusted at any time, allowing the control afforded by the method to evolve and adapt over time. The plurality of Baseline Agents offer a range of options for evaluating the Learning Agent proposed actions, and assessing what is a "safe” action with respect to task performance. For example, the Baseline Agents may implement different policies, including a mix of rule based and data driven policies. In one example, the Baseline Agent policies may be optimised for different regions of the state-action space, allowing for an improved overall performance, as discussed in greater detail below with respect to Figures 2A to 2D. In some examples, it may be envisaged that the method 100 could be implemented using only a single Baseline Agent implementing a policy that satisfies a criterion with respect to performance of the task by the Environment. The single Baseline Agent may provide, at each performance of the method 100, a candidate Baseline Agent action.
Figures 2A to 2D show a flow chart illustrating process steps in another example of a computer implemented method 200 for managing a system controlling an environment that is operable to perform a task. The method 200 provides one example of how the steps of the method 100 may be implemented and supplemented to achieve the above discussed and additional functionality. Referring first to Figure 2A, the method comprises, in a first step 202, receiving a representation of a current state of the environment from the system controlling the environment. In step 210, the method comprises providing, to a plurality of Agents, a representation of a current state of the environment. As illustrated at 210i, the plurality of Agents comprises at least one, and may comprise a plurality of, Learning Agents operable to implement an RL model for selecting actions to be executed on the environment, and a plurality of Baseline Agents, each Baseline Agent operable to implement a policy for selecting actions to be executed on the environment, wherein each policy implemented by a Baseline Agent satisfies a criterion with respect to performance of the task. Also as illustrated at 210i, at least two of the Baseline Agents may implement policies that are optimized for different regions of the state-action space of the environment. As discussed above, the state-action space of the environment comprises the range of possible states in which the environment may exist and the available actions for execution on the environment in those states. Different Baseline Agent policies may be optimised for different regions of the state-action space, for example having been trained on a particular set of training data relating to a specific region of the state-action space, or having been specifically configured for a set of circumstances corresponding to a region of the state-action space. In the case of multiple Learning Agents, each Learning Agent may be implementing a different RL policy model for the selection of actions. The models may be based on the same RL algorithm but trained using different hyperparameters or different training data, or may be based on different RL algorithms. It will be appreciated that Baseline Agents may implement a policy that has been trained in an offline environment. In an online environment, a Baseline Agent policy is only used to recommend actions. In contrast, a Learning Agent is continuously training during online operation, while also recommending actions at each time step. In step 220, the method comprises receiving, from the Learning Agent, a candidate Learning Agent action for execution on the environment, or, from the plurality of Learning Agents, a plurality of Learning Agent actions for execution on the environment. Step 220 further comprises receiving, from the plurality of Baseline Agents, a plurality of candidate Baseline Agent actions for execution on the environment. The method then comprises, in step 230, generating an environment action on the basis of the one or plurality of candidate Learning Agent actions and the plurality of candidate Baseline Agent actions. Steps which may be involved in the generation of an environment action are discussed in further detail with reference to Figures 2C and 2D.
Referring now to Figure 2B, the method 200 further comprises, in step 240, providing the environment action to the system for execution on the environment. This may for example comprise sending the action, or a representation of the action, to the system using any suitable communication channel.
In step 250, the method 200 comprises receiving, from the system, a representation of a state of the environment following execution of the environment action, and a value of a reward function representing an impact of execution of the environment action on performance of the task. The reward function may be specific to the environment in which the action is executed, and may be known to the system controlling the environment. The reward function may also be known to the Learning Agent or Learning Agents and to the plurality of Baseline Agents.
In step 260, the method 200 comprises providing, to the plurality of Agents the representation of a state of the environment following execution of the environment action, the value of the reward function, and a representation of the environment action. It will be appreciated that this action may aid in rendering the conducting of the method effectively transparent to the one or more Learning Agents and to the Baseline Agents. The one or more Learning Agents and the plurality of Baseline Agents receive a state representation, provide a proposed action, and then receive an updated state representation, a reward value, and a representation of the action that was executed on the environment (which may not be the action that was proposed by any given Agent). The Agents then have all the information they require to propose a new action on the basis of the updated state of the environment. Both Baseline Agents and Learning Agents may use the received feedback elements of updated state representation, reward value, and representation of the action that was executed on the environment, to propose a new action for execution on the environment. The number of feedback elements that are used by a particular Agent in order to propose a next action may depend upon the particular policy implemented by the Agent. In addition to proposing a new action, Learning Agents also use the feedback information (including any one or more of updated state representation, reward value, and representation of the action that was executed on the environment) to continuously train and update their prediction model according to the particular RL algorithm they are implementing. In this manner, the Learning Agents continually seek to improve the actions that they will recommend in the future, according to a goal for example of maximising future reward.
In step 270, the method 200 comprises receiving, from at least one of the plurality of Agents, performance feedback for the Agent. This performance feedback may be taken into account in the generating of the environment action at step 230, as discussed in further detail below with reference to Figures 2C and 2D.
For example, some implementations of logic for generating the environment action may depend upon previous performance of different agents, or on learning parameters such as the Value function, or Action-Value function, which may be supplied as feedback by the Agents in step 270. The feedback may include any metric or parameter relating to function or performance of the Agent, for example allowing tracking of Agent performance and the determining of trends in performance over time.
It will be appreciated that the steps of Figures 2A and 2B may be repeated at each time step, with the plurality of Agents proposing actions on the basis of each environment state, and the method generating, on the basis of the proposed actions, an environment action to be forwarded to the system for execution on the environment.
Figures 2C and 2D illustrate examples of how the step 230 of generating an environment action on the basis of the one or more candidate Learning Agent actions and the plurality of candidate Baseline Agent actions may be carried out.
Referring initially to Figure 2C, in a first example, the step 230 of generating an environment action initially comprises, in step 232c, evaluating the candidate Learning Agent action or actions, and the plurality of candidate Baseline Agent actions against a criterion relating to at least one of task performance or environment state. The evaluation of step 232c may be against a single criterion or against multiple criteria, which may relate to task performance, environment state or a combination of task performance and environment state. The evaluation step may in some examples comprise, in step 232ci, for each candidate Learning Agent action and candidate Baseline Agent action, predicting a state of the environment following execution of the candidate Learning Agent action or candidate Baseline Agent action. The prediction may for example be generated using a trained supervised learning model that takes as input the current environment state and the proposed action, and predicts the environment state after execution of the proposed action. The evaluation step may further comprise, at 232cii, predicting a value of a reward function representing an impact of execution of the candidate Learning Agent action or candidate Baseline Agent action on performance of the task on the basis of the predicted environment state. In still further examples, step 232ci may be omitted, and the evaluation may comprise simply predicting, on the basis of a proposed action, the value of a reward function that would be generated as a result of execution of the action. The reward function may for example comprise a Key Performance Indicator (KPI), or a function of several KPIs, for the environment and/or its performance of its task.
In step 234c, the step 230 of generating an environment action may comprise assembling a candidate set of environment actions from the candidate Learning Agent action and the plurality of candidate Baseline Agent actions on the basis of the evaluation. As illustrated at 234ci, the assembling of the candidate set may for example be based on a threshold value for the predicted reward function value, or on a threshold improvement or increase in reward value. The candidate set may in some examples be considered as a "safe" candidate set, comprising those actions proposed by the plurality of Agents that are considered to be "safe" with respect to the environment and its performance of its task. As discussed above, the definition of "safe" may be dependent upon a particular technical domain, use case or deployment, but may encompass acceptable limits for task performance as determined for the specific situation, and/or mandated safety requirements for the technical domain etc.
The step 230 of generating an environment action may then comprise, in step 236c, generating the environment action as a function of the one or more candidate Learning Agent actions and the candidate Baseline Agent actions on the basis of the evaluation. If a candidate set has been assembled at step 234c, then the generation may be made using the candidate set of environment actions. In some examples of the present disclosure, the function used to generate the environment action may comprise a weighted sum. In other examples, the function may comprise a selection from among the one or more candidate Learning Agent actions and the candidate Baseline Agent actions. As illustrated at 236ci, selecting may comprise selecting from among the candidate Learning Agent action and the candidate Baseline Agent actions the action which is predicted to generate at least one of the highest value of a reward function representing an impact of execution of the action on performance of the task and/or the greatest increase in value of the reward function from a value based on the current state of the environment. In other examples, selecting may comprise selecting an action on the basis of how often it has been proposed in the past, or selecting an action based upon past performance metrics or learning parameters of the individual Agents. Such learning parameters may for example include the Value function V and/or the Action-Value function Q. The Value function V gives the expected sum of reward for a system starting in state s and controlled by an Agent acting optimally according to its selection policy. The Action-Value function Q gives the expected sum of reward for a system starting in state s and controlled by an Agent which first takes action a and then acts optimally according to its selection policy. The feedback received at step 270 from the plurality of Agents may include performance and/or learning parameters for the Agents, and may be taken into account in order to select the environment action from among the candidate Learning Agent actions and Baseline Agent actions. In some examples, performance trends and/or other tracking of individual Agent proposals, or comparative performance data between Agents, may be used to select the environment action from among the candidate actions.
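As a minimal sketch of the path of Figure 2C (assuming a pre-trained reward predictor of the kind described above; predict_reward and reward_threshold are hypothetical names), the candidates may be scored, filtered into a "safe" candidate set, and the best-scoring member of that set selected:

def generate_environment_action(candidates, state, predict_reward, reward_threshold):
    # Step 232c: evaluate every candidate action by predicting its reward.
    scored = [(action, predict_reward(state, action)) for action in candidates]
    # Step 234c: assemble the "safe" candidate set using a reward threshold.
    safe_set = [(action, reward) for action, reward in scored if reward >= reward_threshold]
    # Step 236c: select the candidate with the highest predicted reward.
    # (Falling back to all candidates when the safe set is empty is an assumption
    # of this sketch, not a requirement of the method.)
    pool = safe_set if safe_set else scored
    best_action, _ = max(pool, key=lambda pair: pair[1])
    return best_action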
Figure 2D illustrates another example of how the step 230 of generating an environment action on the basis of the one or more candidate Learning Agent actions and the plurality of candidate Baseline Agent actions may be carried out.
Referring to Figure 2D, in a first step 232d, generating an environment action may comprise generating a weighted combination of candidate Baseline Agent actions, wherein weights are assigned to individual candidate Baseline Agent actions according to at least one of the representation of the current state of the environment, a candidate Baseline Agent action, or a performance measure of the Baseline Agent. The selection criterion for the weighted combination may therefore take account of the current environment state and/or proposed actions, for example prioritising the actions proposed by Baseline Agents that are optimised for the current part of the state-action space. The selection criterion may also or alternatively take account of individual Agent performance, for example prioritising actions proposed by those Baseline Agents that have recently been performing well. As illustrated at 232di, the weighted combination may be a weighted sum.
In some examples, the candidate Learning Agent action and candidate Baseline Agent actions may comprise candidate action vectors, each element of a candidate action vector corresponding to a possible action and comprising a probability that the corresponding action is the most favourable of the possible actions according to a performance measure of the task. In such examples, generating a weighted combination of candidate Baseline actions may comprise using a control policy to compute a weighted combination, such as a weighted sum, of candidate Baseline Agent action vectors, as illustrated at 232dii.
In step 234d, generating an environment action may comprise generating a weighted combination of: (1) the combination of the candidate Baseline Agent actions and (2) the one or more candidate Learning Agent actions, according to at least one of a predetermined risk schedule and/or performance feedback of the Baseline Agents and one or more Learning Agents. As illustrated at step 234di, the weighted combination may be a weighted sum. As discussed above with reference to step 232dii, the candidate Learning Agent action and candidate Baseline Agent actions may comprise candidate action vectors, and generating a weighted combination at step 234d may comprise, at step 234dii, using a control policy to compute a weighted sum of the weighted combination of candidate Baseline Agent action vectors and the one or more candidate Learning Agent action vectors. The predetermined risk schedule that may be used to combine the candidate Baseline Agent actions with the candidate Learning Agent action or actions may balance the "safety" of the Baseline Agent actions against the possibility of the Learning Agent actions to offer greater reward, for example as the Learning Agent or Agents train their prediction models on the environment. The risk schedule may therefore evolve with time and/or with performance of the one or more Learning Agents. For example, the risk schedule may have an inversely proportional weight over time: risk schedule k = f(1/t) for t ≥ 1, where f is an arbitrary function (for example f(x) = x^a for a > 0).
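As a purely illustrative numerical sketch of such a schedule (assuming f(x) = x^a with a chosen as a hyperparameter; the function name below is hypothetical), the weight applied to the Baseline combination starts at 1 at t = 1 and decays towards 0:

def risk_schedule(t, a=0.5):
    # k = f(1/t) with f(x) = x**a: k = 1 at t = 1 and decays towards 0 as t grows,
    # so the Baseline Agents are prioritised early and the Learning Agent later.
    assert t >= 1 and a > 0
    return (1.0 / t) ** a

# Example: k at the first few time steps.
print([round(risk_schedule(t), 3) for t in (1, 2, 5, 10, 100)])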
In step 236d, generating an environment action may comprise selecting the environment action from the weighted combination generated at step 234d. For example, if the weighted combination is a weighted sum of action probability vectors, selecting the environment action may comprise selecting as the environment action the possible action corresponding to the highest probability value in the weighted sum generated at step 234di.
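The following is a minimal sketch of this combination-and-selection path (steps 232d, 234d and 236d), assuming the candidate actions are probability vectors over a common discrete action set and that the Baseline weights have already been chosen; the function and parameter names are illustrative only:

import numpy as np

def combine_and_select(baseline_vectors, baseline_weights, learner_vector, k):
    # Step 232d: weighted combination (here a weighted sum) of the Baseline action vectors.
    baseline_mix = sum(w * np.asarray(v, dtype=float)
                       for w, v in zip(baseline_weights, baseline_vectors))
    # Step 234d: combine the Baseline mixture with the Learning Agent vector
    # according to the risk schedule weight k.
    combined = k * baseline_mix + (1.0 - k) * np.asarray(learner_vector, dtype=float)
    # Step 236d: select the action with the highest combined probability value.
    return int(np.argmax(combined))

# Example with three possible actions, two Baselines and one Learning Agent.
action_index = combine_and_select(
    baseline_vectors=[[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]],
    baseline_weights=[0.6, 0.4],
    learner_vector=[0.1, 0.1, 0.8],
    k=0.9,
)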
In step 238d, the combination logic for generating the weighted combinations at steps 232d and 234d may be updated. For example at step 238di, the weights for the weighted sums in steps 232di, 232dii, 234di and 234dii may be updated. The combination logic (for example the weights of the weighted sums) may be updated based on at least one of the representation of the current state of the environment, a candidate Baseline Agent action or candidate Learning Agent action and/or a performance measure of the Baseline Agents or Learning Agent.
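One possible form of such an update, given here only as a sketch (the multiplicative, exponential-weights style rule is an illustrative choice and is not mandated by the present disclosure), increases the weight of Baselines whose recent proposals were associated with higher reward:

import math

def update_baseline_weights(weights, recent_rewards, learning_rate=0.1):
    # Multiplicative update: each Baseline weight grows or shrinks with the
    # reward recently associated with that Baseline's proposals, then the
    # weights are renormalised to sum to one.
    raw = [w * math.exp(learning_rate * r) for w, r in zip(weights, recent_rewards)]
    total = sum(raw)
    return [w / total for w in raw]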
Figures 3A to 3C illustrate different examples of how the methods 100 and 200 may be applied to different technical domains. A more detailed discussion of example use cases is provided below, for example with reference to Figure 6, however Figures 3A to 3C provide an indication of example environments, actions, state representations etc. for different technical domains. It will be appreciated that the technical domains illustrated in Figures 3A to 3C are merely for the purpose of illustration, and application of the methods 100 and 200 to other technical domains may be envisaged.
Figures 3A and 3B illustrate steps of a method 300 for managing a system controlling an environment that is operable to perform a task. The steps of the method 300 largely correspond to the steps of the method 200, and reference is made to the above discussion of the method 200 for the detail of the corresponding method steps.
Referring initially to Figure 3A the method 300 comprises providing, to a plurality of Agents, a representation of a current state of the environment, wherein the plurality of Agents comprises at least one Learning Agent operable to implement an RL model for selecting actions to be executed on the environment, and a plurality of Baseline Agents, each Baseline Agent operable to implement a policy for selecting actions to be executed on the environment, wherein each policy implemented by a Baseline Agent satisfies a criterion with respect to performance of the task.
As illustrated at 310i, in one example, the environment may comprise a communication network, and the task that the environment is operable to perform may comprise provision of communication network services. In such examples, the system controlling the environment may comprise a network management system, a network operations centre, a core network function, etc.
As illustrated at 310ii, in another example, the environment may comprise a cell of a communication network, and the task that the environment is operable to perform may comprise provision of communication network services. In such examples, the system controlling the environment may comprise a cloud RAN system, a virtualised or other function within a cloud RAN system, an eNodeB, a gNodeB, any other implementation of node that is operable to transmit, receive, process and/or orchestrate wireless signals, etc.
As illustrated at 310iii, in another example, the environment may comprise a vehicle, and the task that the environment is operable to perform may comprise advancing over a terrain. In such examples, the system controlling the environment may comprise a vehicle navigation system, collision avoidance system, engine control system, steering system etc.
Other examples of technical domain may be envisaged, including for example industrial and manufacturing domains, in which an environment may comprise a factory, production line, reaction chamber, item of automated equipment, etc.; commercial, residential or office spaces, in which an environment may comprise a room, a floor, a building, etc.; and energy generation, in which an environment may comprise a power plant, a turbine, a solar array, etc.; as well as many others.
Referring still to Figure 3A, in step 320, the method 300 comprises receiving, from the one or more Learning Agents, one or more candidate Learning Agent actions for execution on the environment, and, from the plurality of Baseline Agents, a plurality of candidate Baseline Agent actions for execution on the environment. Step 320i illustrates example actions that may be envisaged in the case of an environment comprising a communication network, cell of a communication network, or cell sector of a communication network. Such example actions include: an allocation decision for a communication network resource; a configuration for a communication network node, which may be a physical or virtual node, for example implementing a Virtualised Network Function; a configuration for communication network equipment; a configuration for a communication network operation; a decision relating to provision of communication network services for a wireless device; a configuration for an operation performed by a wireless device in relation to the communication network.
In step 330, the method 300 comprises generating an environment action on the basis of the candidate Learning Agent action and the plurality of candidate Baseline Agent actions. This step may be carried out according to any of the examples illustrated in Figures 2C and 2D and/or discussed above. Referring now to Figure 3B, in the case of an environment comprising a communication network cell or cell sector, steps 335 and 337 may then be carried out. At step 335, the method 300 comprises verifying an impact of the environment action on a neighbour cell or sector. If the impact of the environment action on a neighbour cell or sector violates a neighbour cell or sector performance condition, the method comprises, at step 337, generating a new environment action on the basis of the candidate Learning Agent action and the plurality of candidate Baseline Agent actions. In some examples, consideration of neighbour cell or sector impact could also be included in the action selection of the Baseline Agents and/or Learning Agents. If it can be ensured that the newly generated environment action of step 337 is "safe", in that its impact on a neighbour cell or sector is acceptable, then the action may be forwarded directly to the system for execution on the environment in step 340. This may be the case for example if one or more Baseline Agent policies take such impact into account and if an action proposed by such a Baseline Agent is selected at step 337. Alternatively, if it cannot be ensured that the impact of the newly generated environment action on a neighbour cell or sector is acceptable, then the check of step 335 may be repeated.
It will be appreciated that similar checks on the impact of the generated environment action upon neighbouring environments may be envisaged for the methods 100, 200 and/or 300 when applied in other technical domains.
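A sketch of this verification loop (steps 335 and 337) is given below; it assumes hypothetical helper functions for generating an environment action from the candidate actions and for assessing neighbour cell or sector impact, and simply excludes rejected actions from subsequent attempts:

def safe_action_for_cell(candidates, state, generate_action, neighbour_impact_ok,
                         max_attempts=10):
    # Step 335: verify the impact of the generated action on neighbour cells/sectors.
    # Step 337: if the impact violates the performance condition, generate a new action.
    excluded = set()
    for _ in range(max_attempts):
        action = generate_action(candidates, state, excluded)
        if neighbour_impact_ok(action, state):
            return action
        excluded.add(action)  # do not propose the same rejected action again
    raise RuntimeError("no acceptable environment action found within the attempt budget")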
In step 340, the method 300 comprises providing the environment action to the system for execution on the environment. The method then comprises, in step 350, receiving, from the system, a representation of a state of the environment following execution of the environment action, and a value of a reward function representing an impact of execution of the environment action on performance of the task. As illustrated at 350i, the reward function comprises a function of at least one performance parameter for the environment, such as a performance parameter for the communication network, vehicle, factory, power plant etc. The precise reward function may be determined at the domain level, and may be considered to be comprised within the domain knowledge for the particular environment. In step 360, the method 300 comprises providing, to the plurality of Agents, the representation of a state of the environment following execution of the environment action, the value of the reward function, and a representation of the environment action. As discussed above with reference to the method 200, both Baseline Agents and Learning Agents may use the received feedback elements of updated state representation, reward value, and representation of the action that was executed on the environment, to propose a new action for execution on the environment. The number of feedback elements that are used by a particular Agent in order to propose a next action may depend upon the particular policy implemented by the Agent. In addition to proposing a new action, Learning Agents also use the feedback information (including any one or more of updated state representation, reward value, and representation of the action that was executed on the environment) to continuously train and update their prediction model according to the particular RL algorithm they are implementing. In this manner, the Learning Agents continually seek to improve the actions that they will recommend in the future, according to a goal for example of maximising future reward.
The representation of the environment state that is provided to the Learning Agent or Agents and the Baseline Agents may comprise parameter values for suitable parameters according to the particular environment. Figure 3C illustrates example elements that may be included in the representation 301 of the environment state for an environment comprising a communication network, cell of a communication network, or cell sector of a communication network. The example elements include:
a value of a network coverage parameter 301a;
a value of a network capacity parameter 301b;
a value of a network congestion parameter 301c;
a current network resource allocation 301d;
a current network resource configuration 301e;
a current network usage parameter 301f;
a current network parameter of a neighbour communication network cell 301g;
a value of a network signal quality parameter 301h;
a value of a network signal interference parameter 301i;
a value of a network power parameter 301j;
a current network frequency band 301k;
a current network antenna down-tilt angle 301l;
a current network antenna vertical beamwidth 301m;
a current network antenna horizontal beamwidth 301n;
a current network antenna height 301o;
a current network geolocation 301p;
a current network inter-site distance 301q.
Elements for the representation of environment state for other environments may be envisaged according to the parameters that may be used to describe a particular environment. For example, for a vehicle, the environment state may be represented by any one or more of vehicle geographic or communication network position, vehicle orientation, vehicle position or orientation with respect to a route, roadway or path, vehicle speed, velocity, acceleration, engine speed, engine temperature, fuel reserve, detection of and parameters representing peripheral objects, etc. Suitable parameters including temperature, pressure, humidity, chemical composition, velocity of components, flow rates etc., may be envisaged for other industrial, manufacturing, commercial and other environments.
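As a simple illustration (the parameter names and values below are arbitrary placeholders, not values taken from this disclosure), a state representation for a single cell might be assembled as a mapping of such elements and flattened into a fixed-order vector before being provided to the Agents:

cell_state = {
    "coverage": 0.93,              # network coverage parameter (301a)
    "capacity": 0.71,              # network capacity parameter (301b)
    "congestion": 0.12,            # network congestion parameter (301c)
    "signal_quality": 0.88,        # network signal quality parameter (301h)
    "antenna_downtilt_deg": 6.0,   # current antenna down-tilt angle (301l)
    "antenna_height_m": 25.0,      # current antenna height (301o)
    "neighbour_coverage": 0.90,    # parameter of a neighbour cell (301g)
}

# Fixed key ordering so that every Agent receives the elements in the same positions.
state_vector = [cell_state[key] for key in sorted(cell_state)]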
Figures 1 to 3C discussed above provide an overview of methods which may be performed according to different examples of the present disclosure. The methods involve the receipt of proposed actions from Learning Agents and Baseline Agents, and the generation of an environment action for execution on an environment on the basis of the received proposed actions. There now follows a detailed discussion of how different process steps illustrated in Figures 1 to 3C and discussed above may be implemented according to an example architecture and process flow according to the present disclosure.
Figure 4 illustrates an overview of a modular Safe Reinforcement Learning (SRL) architecture, elements of which may implement the methods disclosed herein. The architecture comprises n Learning Agents (illustrated as Reinforcement Learning (RL) Agents) 3 that are continuously benchmarked against m Baseline Agents (illustrated as Safe Baselines) 4 by a Safety Shield 2 that, through the imposition of Safety Constraints 9 generated by Safety Logic 5, selects an action 6 to be performed on the Environment 1. The Safety Shield 2 and Safety Logic 5 may thus jointly implement the methods 100, 200, 300.
The architectural elements illustrated in Figure 4, and their interactions, are discussed in further detail below:
Environment 1: This component represents the real environment that the Agents are acting upon, modelled as a standard RL problem following the definition provided above. The environment is assumed to be controlled by a system with which the Safety Shield 2 interacts.
Safety Shield 2: This component is the mediator between the Learning Agents 3 and the Environment 1. As opposed to traditional RL models in which agents interact with the environment directly, providing actions and receiving feedback, the Safety Shield 2 of the architecture of Figure 4 acts as a proxy between the Learning Agents 3 and the Environment 1, protecting the Environment 1 from "unsafe" actions 8 that may be proposed by the Learning Agents. From the perspective of the Learning Agents, and of the Baseline Agents 4 discussed below, the Safety Shield 2 is the representation of the Environment 1: the Agents receive feedback from the Environment via the Safety Shield 2 and propose the next action directly to the Safety Shield 2.
The Safety Shield collects suggested actions 8 from the Learning Agents 3 and Baseline Agents 4 (the candidate Learning Agent actions and candidate Baseline Agent actions). The Safety Shield 2 then chooses the final safe action 6 (the environment action discussed above) to be performed on the real Environment 1, with input of Safety Constraints 9 from the Safety Logic 5 (providing the criteria, combination logic, weighting, etc. discussed above with reference to Figures 2C and 2D). Feedback 7 from the Environment 1 (for example in the form of a reward function value) is also collected by the Safety Shield 2, and together with the performed action 6 (the environment action), whether that action was a candidate Baseline Agent action or a candidate Learning Agent action, is fed back to the Learning Agents 3 and Baseline Agents 4 through the safety feedback 10, allowing the Agents to prepare proposed actions for the next time step. The Learning Agents 3 and Baseline Agents 4 may also provide feedback 11 to the Safety Logic 5 regarding their performance and/or learning parameters, which feedback may be used for example to make dynamic rules or to collect trajectory information from the Agents.
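The interaction for a single time step may be sketched as follows (a sketch only; the class and method names are hypothetical and any real implementation of the Safety Shield may differ):

def shield_time_step(environment, learning_agents, baseline_agents, safety_logic, state):
    agents = list(learning_agents) + list(baseline_agents)
    # Collect suggested actions 8 from the Learning Agents 3 and Baseline Agents 4.
    candidates = [agent.propose_action(state) for agent in agents]
    # Apply the Safety Constraints 9 from the Safety Logic 5 to choose the safe action 6.
    action = safety_logic.select_action(state, candidates)
    # Execute the action on the Environment 1 and collect feedback 7 (next state, reward).
    next_state, reward = environment.execute(action)
    # Return the safety feedback 10 to all Agents; Learning Agents also train on it.
    for agent in agents:
        agent.receive_feedback(next_state, reward, action)
    return next_state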
Learning Agents 3: The architecture comprises a set of n online RL Agents 3 that indirectly interact with the Environment 1 by suggesting actions 8 to the Safety Shield 2. The Learning Agents 3 are also continuously trained from previous interactions and collected feedback that improves their future recommended actions.
It is envisaged that different Learning Agents 3 may use different learning models and hyperparameters, and may therefore suggest different actions at a given time. Using multiple Learning Agents 3 enables parallel experimentation with different learning techniques and parameters, allowing the Safety Shield 2 to determine a suitable safe action from amongst those proposed.
Baseline Agents 4: The architecture comprises a set of m Safe Baselines 4. Baselines receive feedback about the current state of the environment and use it to suggest actions that are considered to be safe. Some embodiments of safe baselines include (i) models that have been trained on previous actions known to be safe, (ii) rule-based modules that recommend actions satisfying safety criteria, or any other implementation of a safe action proposer. The Safe Baselines thus fulfil at least one criterion that defines them as "Safe" with respect to the environment and/or its performance of a task. It will be appreciated that the precise definition of "safe", as discussed above, will vary according to technical domain, use case etc., and may even vary with time or circumstances for a single use case. It may be envisaged that the Safe Baselines provide performance that is safe but not necessarily optimal.
Safety Logic 5: The Safety Logic 5 provides the constraints 9 to the Safety Shield 2, which constraints are used to select a safe action from among those proposed by the Learning Agents and Baseline Agents. The constraints thus encapsulate the criterion or criteria which are used to generate the environment action 6 from the candidate Baseline Agent actions and candidate Learning Agent actions 8. Some examples of safety logic include:
(i) rule-based, where the safety logic uses predefined rules based on safety criteria to choose a safe action amongst agents and baselines. In this example, the safety logic is independent of the performance and feedback of the Learning Agents 3 and Baseline Agents 4.
(ii) performance dependent, where the logic relies on the feedback and output of the Learning Agents 3 and Baseline Agents 4. This may include for example expected rewards, convergence, etc., and can be used to determine the weights with which Learning Agent and Baseline Agent suggested actions are prioritised by the Safety Shield 2.
(iii) hybrid, where rule-based and performance-dependent logic, or any other type of logic, is used to select a safe action from among the actions suggested by the Learning Agents 3 and Baseline Agents 4.
A detailed discussion of performance dependent logic using a weighted sum is provided in the context of an example use case below, with reference to Figure 6. Another example of performance-based safety logic is a prediction model. According to such an example, the Safety Logic 5 uses a pre-trained supervised learning model to assess the value of the actions that are proposed by the different Learning Agents 3 and Baseline Agents 4. A supervised learning model is trained on historical data from the Environment 1, with the input for the model defined as a given Environment state plus an action, and the output defined as the Environment state after applying the action. With a sufficiently large sample size, a model can be trained that is reasonably accurate in predicting the likely change in Environment state for each of the proposed actions. The Safety Logic may use such a pre-trained model to compare the actions proposed by Learning Agents and Baseline Agents by querying the model with each action in addition to the current environment state. From the predicted next state, the Safety Logic can calculate various metrics to score the proposed actions, such as reward (e.g. KPI_state+1 − KPI_state) or a safety constraint (e.g. KPI_state+1 > t for any known safety threshold t). In one example, the Safety Logic may compare proposed actions by using the state KPIs directly. The Safety Logic may choose the action that achieves the best predicted state KPIs and forward it to the Safety Shield 2 for provision to the environment.
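A sketch of this prediction-model-based logic is given below, reduced to a single scalar KPI for readability (a real deployment would typically combine several KPIs); predict_next_kpi stands for the pre-trained supervised learning model and is a hypothetical name:

def score_candidates_with_model(current_kpi, candidates, predict_next_kpi, safety_threshold):
    # Query the pre-trained model with the current state and each proposed action,
    # discard actions whose predicted KPI violates the safety constraint KPI > t,
    # and return the action with the best predicted reward (KPI improvement).
    best_action, best_reward = None, float("-inf")
    for action in candidates:
        predicted_kpi = predict_next_kpi(current_kpi, action)
        if predicted_kpi <= safety_threshold:
            continue
        reward = predicted_kpi - current_kpi
        if reward > best_reward:
            best_action, best_reward = action, reward
    return best_action  # None if no candidate satisfies the safety constraint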
Figure 5 illustrates a process flow during which examples of the methods 100, 200, 300 may be implemented by elements of the example architecture of figure 4.
Referring to Figure 5, the following steps may be performed:
STEP A: Collect the policies/models implemented by the Baseline Agents, or train the models, for example using offline data.
STEP B: Create the Learning Agent models.
STEP C: Design safety logic (may incorporate information from Learning Agent models and Baselines)
STEP D: Observe current environment state (see for example step 202 of method 200)
STEP E: The Safety shield queries probability of actions from all Learning Agent policies and Baseline Agents (see for example discussion above of action probability vectors and steps 120, 220, 320).
STEP F: Safety shield applies (calls) safety constraint from safety logic on proposed actions (see for example steps 130, 230, 330).
STEP G: Safe action is propagated to the environment by the safety shield (see for example steps 140, 240, 340).
STEP H: Feedback (next state, reward/loss, possibly additional feedback information) from environment to safety shield (see for example steps 250, 350).
STEP I: The safety feedback is returned to the agents and baselines (see for example steps 260, 360).
STEP J: Learning Agents are trained on the safety feedback (according to RL learning algorithms).
STEP K: Safety logic receives feedback from agents and baselines (see for example step 270).
There now follows a detailed discussion of two example use cases, illustrating how the methods of the present disclosure may be implemented to address example technical scenarios. The following detailed use case examples are drawn from the telecoms domain, although it will be appreciated that other use cases in other technical domains may be envisaged, as discussed above.
Use Case 1: Remote Electronic Tilt
Modern cellular networks increasingly face the need to satisfy consumer demand that is highly variable in both the spatial and the temporal domains. In order to provide a high level of Quality of Service (QoS) to User Equipments (UEs) efficiently, networks must adjust their configuration in an automatic and timely manner. The antenna vertical tilt angle, referred to as the downtilt angle, is one of the most important variables to control for QoS management. The downtilt angle can be modified both mechanically and electronically, but owing to the cost associated with manually adjusting the downtilt angle, Remote Electrical Tilt (RET) optimisation is used in the vast majority of modern networks.
There exist several Key Performance Indicators (KPIs) that may be taken into consideration when evaluating the performance of a RET optimization strategy, and significant existing material is dedicated to Coverage Capacity Optimization (CCO), in which the coverage KPI relates to the area covered in terms of a minimum received signal strength, and the capacity KPI relates to the average total throughput in a given area of interest. Figure 6 illustrates the trade-off between coverage and capacity for the present RET use case: an increase in antenna downtilt correlates with a stronger signal in a more concentrated area, as well as higher capacity and reduced interference radiation towards other cells in the network. However, excessive downtilting can result in insufficient coverage in a given area, with some UEs unable to receive a minimum signal quality.
During the past decade, there has been a great deal of research into the field of RET optimization using RL methods. However, the majority of existing methods for RET optimisation using RL techniques do not consider any notion of safety, resulting in arbitrary performance disruption during the learning procedure. As reliability of services is one of the most important features for a network provider, the possibility of performance degradation has in effect prohibited the real-world deployment of RL methods for RET optimisation, despite these methods having been shown to outperform the currently used traditional methods in simulated network environments. By addressing the issue of performance degradation through the above described shielding action, and the use of Baselines to ensure forwarding of an action that meets defined "safety" criteria, methods according to the present disclosure can enable the safe deployment of RL Agents for RET optimisation, as discussed below.
The following scenario may be considered for the RET optimisation use case:
• There exist m sub-optimal safe baselines which have been used by the network operator for RET optimization up to the current time instant.
• There exists a dataset D, created by observing the cellular network while the baseline policies were in effect. D consists of N trajectories describing the interaction of the baseline policies with the network environment. Each trajectory component contains the current state of the system, the action chosen by the baseline method, and the corresponding reward.
• The goal is to train an RL agent to solve the RET optimization problem using the methods and architecture discussed above.
It is noted that in the present example, the number of Learning Agents n = 1, although it will be understood that multiple Learning Agents may be included, submitting their proposed actions as discussed above.
For this scenario, the elements of the above discussed architecture may be:
Environment: The physical 4G or 5G mobile cellular network area considered for RET optimisation. The network area is divided into C sectors, each served by an antenna. In one example, the state of the environment can be described by the vector s_t = [cov(t), cap(t), θ(t)] ∈ [0,1] x [0,1] x [0,90], where cov(t) is the coverage network metric, cap(t) is the capacity metric and θ(t) is the downtilt of the antenna's sector at time t. Actions which may be proposed by Agents comprise possible discrete changes to the antenna downtilt. In the present example, only three actions are available: a_t ∈ {−Δθ, 0, +Δθ}. The reward signal at time t may be defined as: r_t = log(1 + cov(t)^2 + cap(t)^2). As discussed above, the reward signal or function may be defined at the level of domain knowledge.
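A minimal sketch of these definitions is given below; Δθ = 1 degree and the completion of the reward expression as log(1 + cov(t)^2 + cap(t)^2) are assumptions of this sketch:

import math

DELTA_THETA = 1.0                             # assumed downtilt step size in degrees
ACTIONS = (-DELTA_THETA, 0.0, +DELTA_THETA)   # a_t in {-delta_theta, 0, +delta_theta}

def ret_state(cov, cap, downtilt):
    # s_t = [cov(t), cap(t), theta(t)] in [0, 1] x [0, 1] x [0, 90]
    assert 0.0 <= cov <= 1.0 and 0.0 <= cap <= 1.0 and 0.0 <= downtilt <= 90.0
    return (cov, cap, downtilt)

def ret_reward(cov, cap):
    # r_t = log(1 + cov(t)^2 + cap(t)^2)
    return math.log(1.0 + cov ** 2 + cap ** 2)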
Learning Agents: A set of n = 1 RL agents, whose policy takes as input the state of the network and returns a downtilt angle variation for the antenna sector in a reactive manner. In the present example, a single Learning Agent per cell is considered, and optimisation of the C cell sectors is executed independently. The Learning Agent's policy is a Machine Learning model. In one example, the Learning Agent's policy π_w may consist of an Artificial Neural Network (ANN) parametrised by a weight vector w.
Baseline Agents: A set of m rule-based Baselines π_S1, ..., π_Sm. The Baselines can be known or estimated from the dataset D (e.g. by modelling each of the policies according to an ANN and estimating the probability of action through logistic regression).
Safety Logic: It is assumed that the RL agent is following a learning policy π_L that is trained alongside the safe Baselines π_Si. The Safety Logic implements a control policy π_C that is initially dominated by the safe Baselines. As more trials are conducted and the learning policy π_L is trained to recommend better actions than the π_Si, the control policy begins to rely more on the newly trained models and less on the Baselines.
The control policy π_C is a linear combination of the π_Si and π_L:

π_C(a|s) = k · Σ_{i=1..m} t_i(s, a) · π_Si(a|s) + (1 − k) · π_L(a|s)
The weight k, described above as the risk schedule, is initially close to 1 so as to ensure the actions of the Baselines are prioritised during an initial training period for π_L. As π_L approaches the desired objective, k is reduced so that more weight is given to the Learning Agent's recommendations.
The Baseline weights t_i(s, a) are hyper-parameters controlling the importance of each Baseline. As discussed above, in some examples of the safety logic such hyper-parameters can be state-action dependent, allowing the weights to be increased for policies whose performance is greater on (s, a) and decreased for policies whose performance is poorer on (s, a). Such weights can be based, for example, on concentration bounds computed from previously available data for a policy, allowing an evaluation of the performance of that policy on a given (s, a).
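A sketch of this control policy is given below, assuming each policy returns a probability vector over the available actions and that a baseline_weight function provides the state-action dependent weights t_i(s, a); the renormalisation at the end is an assumption of this sketch:

import numpy as np

def control_policy(state, baseline_policies, baseline_weight, learning_policy, k):
    # pi_C(a|s) = k * sum_i t_i(s, a) * pi_Si(a|s) + (1 - k) * pi_L(a|s)
    learner = np.asarray(learning_policy(state), dtype=float)
    baseline_mix = np.zeros_like(learner)
    for i, pi_s in enumerate(baseline_policies):
        probs = np.asarray(pi_s(state), dtype=float)
        weights = np.array([baseline_weight(i, state, a) for a in range(len(probs))])
        baseline_mix += weights * probs
    combined = k * baseline_mix + (1.0 - k) * learner
    return combined / combined.sum()   # renormalise to a probability distribution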
As discussed above, the Learning Agent may use an artificial neural network to approximate the value function. [Table of example network architecture omitted.]
Example hyperparameters that may be used in training include the discounting reward factor γ. [Table of example training hyperparameter values omitted.]
Use Case 2: Dynamic Resource Allocation
In many communication networks, a plurality of services may compete over resources in a shared environment such as a Cloud. The services can have different requirements and their performance may be indicated by their specific QoS KPIs. Additional KPIs that can be similar across services can also include time consumption, cost, carbon footprint, etc. The shared environment may also have a list of resources that can be partially or fully allocated to services. These resources can include CPU, memory, storage, network bandwidth, Virtual Machines (VMs), Virtual Network Functions (VNFs), etc. For this scenario, the elements of the above discussed architecture may be:
Environment: The combination of services that are sharing available resources. The performance of the various services with their current allocated resources is monitored. The state of this environment is the combination of service KPIs across services, and reward can also be calculated from KPIs.
Learning Agents: Each resource on the shared platform may have an associated Learning Agent, which may propose actions in the form of resource allocations to services. At each step a Learning Agent can suggest a resource allocation to be applied to the competing services. The Learning Agents also train in parallel with correlated data comprising state (service KPIs), actions (allocated resources), next state and rewards.
Baseline Agents: Safe baselines may comprise resource allocation templates that are predefined by domain experts for different types of services. There can be a plurality of predefined Baselines suggesting different resource allocations for the same scenario.
Safety Logic: The Safety Logic can compare suggested resource allocations from the Baselines to suggestions from the RL agent(s) and select the one with the highest predicted reward. This resource allocation is then selected and performed on the environment by the shield, meaning that it is allocated to the services, whose new KPIs are monitored and fed back to the Baselines and Agents through the Safety Shield.
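Reduced to its essentials (a sketch with hypothetical names, assuming a predictor of reward for a given set of service KPIs and a proposed allocation), the selection performed by the Safety Logic may look as follows:

def select_allocation(current_service_kpis, suggested_allocations, predict_reward):
    # Compare the resource allocations suggested by the Baselines and the RL agent(s)
    # and return the one with the highest predicted reward.
    return max(suggested_allocations,
               key=lambda allocation: predict_reward(current_service_kpis, allocation))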
Cloud implementation
Example methods according to the present disclosure can be implemented in a distributed manner, including for example a number W of distributed workers and one central worker (master) that acts as coordinator.
The safe baselines π_S1, ..., π_Sm may be known at the level of the distributed workers or at the central worker node. In the latter case, the cost of communicating Baseline Agent actions and safety constraints between the distributed workers and the central node may be taken into account. The safety shield and safety logic that implement the methods 100, 200, 300 can be implemented at the central worker node that outputs safe actions at a global level. The safety shield may also consider possible conflicting safety requirements between adjacent workers and handle them accordingly (as illustrated for example in steps 335 and 337 of method 300). In one example of a cloud implementation, there could be different hierarchical levels between workers that consider intermediate decision entities, for example at cluster level. The training for the distributed model could be conducted according to a Federated Learning process. Referring to the first Use Case discussed above, for RET optimisation, hierarchical levels of cell, cluster of cells, and network may exist. The methods 100, 200, 300, for example as implemented by a Safety Shield and Safety Logic as discussed above, may be performed at any of these levels. Carrying out the methods at a higher level allows for resolution of conflicting safety requirements but implies higher communication cost.
As discussed above, the methods 100, 200 and 300 may be implemented via a Safety Shield and Safety Logic, which are logical entities in a functional architecture proposed according to the present disclosure.
The Safety Shield and Safety Logic represent one logical framework for implementing the methods. The present disclosure provides a management node that is adapted to perform any or all of the steps of the above discussed methods. The management node may be a physical or virtual node, and may for example comprise a virtualised function that is running in a cloud, edge cloud or fog deployment. The management node may carry out the methods by implementing a logical Safety Shield and Safety Logic as described above, or in any other appropriate manner. The Learning Agents and Baseline Agents that supply candidate actions for the methods 100, 200 and 300 may also be running on the management node, or may be running on a separate physical or logical node, as discussed above for example with reference to the cloud implementation. In the context of an environment comprising a communication network, the management node may for example comprise or be instantiated in any part of a logical core network node, network management center, network operations center, Radio Access node etc. Any such communication network node may itself be divided between several logical and/or physical functions, and any one or more parts of the management node may be instantiated in one or more logical or physical functions of a communication network node.
Figure 7 is a block diagram illustrating an example management node 700 which may implement the method 100, 200 and/or 300, as elaborated in Figures 1 to 6, according to examples of the present disclosure, for example on receipt of suitable instructions from a computer program 750. Referring to Figure 7, the management node 700 comprises a processor or processing circuitry 702, and may comprise a memory 704 and interfaces 706. The processing circuitry 702 is operable to perform some or all of the steps of the method 100, 200 and/or 300 as discussed above with reference to Figures 1 to 6. The memory 704 may contain instructions executable by the processing circuitry 702 such that the management node 700 is operable to perform some or all of the steps of the method 100, 200 and/or 300, as elaborated in Figures 1 to 6. The instructions may also include instructions for executing one or more telecommunications and/or data communications protocols. The instructions may be stored in the form of the computer program 750.
In some examples, the processor or processing circuitry 702 may include one or more microprocessors or microcontrollers, as well as other digital hardware, which may include digital signal processors (DSPs), special-purpose digital logic, etc. The processor or processing circuitry 702 may be implemented by any type of integrated circuit, such as an Application Specific Integrated Circuit (ASIC), Field Programmable Gate Array (FPGA) etc. The memory 704 may include one or several types of memory suitable for the processor, such as read-only memory (ROM), random-access memory, cache memory, flash memory devices, optical storage devices, solid state disk, hard disk drive etc.
Figure 8 illustrates functional modules in another example of management node 800 which may execute examples of the methods 100, 200 and/or 300 of the present disclosure, for example according to computer readable instructions received from a computer program. It will be understood that the modules illustrated in Figure 8 are functional modules, and may be realised in any appropriate combination of hardware and/or software. The modules may comprise one or more processors and may be integrated to any degree.
Referring to Figure 8, the management node 800 is for managing a system controlling an environment that is operable to perform a task. The management node comprises an Agent module 802 for providing, to a plurality of Agents, a representation of a current state of the environment, wherein the plurality of Agents comprises a Learning Agent operable to implement an RL model for selecting actions to be executed on the environment and a plurality of Baseline Agents, each Baseline Agent operable to implement a policy for selecting actions to be executed on the environment, wherein each policy implemented by a Baseline Agent satisfies a criterion with respect to performance of the task. The Agent module 802 is also for receiving, from the Learning Agent, a candidate Learning Agent action for execution on the environment, and, from the plurality of Baseline Agents, a plurality of candidate Baseline Agent actions for execution on the environment. The management node also comprises a processing module 804 for generating an environment action on the basis of the candidate Learning Agent action and the plurality of candidate Baseline Agent actions, and an environment module 806 for providing the environment action to the system for execution on the environment. The management node may further comprise interfaces 808, which may be operable to facilitate communication with the Agents and system over a suitable communication channel.
Examples of the present disclosure thus propose a method according to which a plurality of Baseline Agents complement at least one Learning Agent in the proposing of actions for execution on an environment. Policies implemented by the Baseline Agents for the selection of actions satisfy a criterion with respect to performance of a task by the environment, thus providing a benchmark for safety of the Learning Agent proposed action with respect to the environment and the task it performs. Both the Baseline Agent proposed actions and the Learning Agent proposed action are used to generate an environment action, which is the action that is actually provided to a system controlling the environment for execution on the environment. Advantages offered by examples of the present disclosure include safety, the possibility to tune an acceptable level of risk, heterogeneity, modularity and scalability.
Safety: As opposed to the standard RL interaction scheme whose trained policies do not consider any safety requirement, the methods of the present disclosure offer use case appropriate levels of safety in terms of improvement over Baseline performance.
State-action based Risk tunability: A risk hyperparameter allows a trade-off between improvement over the Baselines and safety. For high values of the risk hyperparameter, a greater amount of improvement from the RL policies will be achieved at the expense of safety guarantees. The risk hyperparameter may evolve with time, and/or may be dependent upon state, action or state-action pairs, allowing for flexibility in how different Baselines are prioritised according to their performance or optimisation for a given region of the state-action space.
Heterogeneity: Methods according to the present disclosure allow for integration of different learning procedures and algorithms to train the Learning Agents. Using multiple Agents enables experimentation with different RL models and hyperparameters in parallel and during live deployment, without having to run separate experiments or compromise environment task performance. The Baseline Agents may be heterogeneous, for example mixing rule-based and data-driven policies. Using different Baseline policies can be beneficial as different baselines may have better performance over different regions of the state- action space. Combining inputs from different Baseline policies may therefore lead to improved overall performance.
Modularity: In contrast to existing SRL solutions, methods according to the present disclosure may operate according to any safety logic (performance criteria) and may incorporate candidate actions from any number of Learning Agents and Baseline Agents, choosing for example the best performing action among those proposed by the Learning Agents and Baseline Agents. Multiple Learning Agents and Baseline Agents can be added, removed and replaced as new scenarios are considered. In addition, the complexity of the decision-making process can be customised based on specific application need without requiring any modifications to the environment, which remains completely insulated.
Scalability: The methods of the present disclosure are easily scalable through distributed computing. For example in the RET use case, the network can be divided into multiple cell clusters and therefore multiple environments, each one with their own safety shield. Furthermore, the Learning Agents and Baseline Agents can run in parallel, which also lends itself to distributed computing.

The methods of the present disclosure may be implemented in hardware, or as software modules running on one or more processors. The methods may also be carried out according to the instructions of a computer program, and the present disclosure also provides a computer readable medium having stored thereon a program for carrying out any of the methods described herein. A computer program embodying the disclosure may be stored on a computer readable medium, or it could, for example, be in the form of a signal such as a downloadable data signal provided from an Internet website, or it could be in any other form.
It should be noted that the above-mentioned examples illustrate rather than limit the disclosure, and that those skilled in the art will be able to design many alternative embodiments without departing from the scope of the appended embodiments. The word "comprising” does not exclude the presence of elements or steps other than those listed in an embodiment or claim, "a” or "an” does not exclude a plurality, and a single processor or other unit may fulfil the functions of several units recited in the embodiments. Any reference signs in the claims shall not be construed so as to limit their scope.
The following are certain enumerated embodiments further illustrating various aspects the disclosed subject matter.
1. A computer implemented method (100) for managing a system controlling an environment that is operable to perform a task, the method comprising: providing, to a plurality of Agents, a representation of a current state of the environment (110), wherein the plurality of Agents comprises (110a): a Learning Agent operable to implement a Reinforcement Learning model for selecting actions to be executed on the environment; and a plurality of Baseline Agents, each Baseline Agent operable to implement a policy for selecting actions to be executed on the environment, wherein each policy implemented by a Baseline Agent satisfies a criterion with respect to performance of the task; receiving, from the Learning Agent, a candidate Learning Agent action for execution on the environment, and, from the plurality of Baseline Agents, a plurality of candidate Baseline Agent actions for execution on the environment (120); generating an environment action on the basis of the candidate Learning Agent action and the plurality of candidate Baseline Agent actions (130); and providing the environment action to the system for execution on the environment (140).
2. The method of embodiment 1 , further comprising: receiving, from the system, a representation of a state of the environment following execution of the environment action, and a value of a reward function representing an impact of execution of the environment action on performance of the task (250); and providing, to the plurality of Agents (260): the representation of a state of the environment following execution of the environment action; the value of the reward function; and a representation of the environment action.
3. The method of embodiment 1 or 2, wherein generating an environment action on the basis of the candidate Learning Agent action and the plurality of candidate Baseline Agent actions comprises: evaluating the candidate Learning Agent action and the plurality of candidate Baseline Agent actions against a criterion relating to at least one of task performance or environment state (232c); and generating the environment action as a function of the candidate Learning Agent action and the candidate Baseline Agent actions on the basis of the evaluation (236c).
4. The method of embodiment 3, wherein evaluating the candidate Learning Agent action and the plurality of candidate Baseline Agent actions against a criterion relating to at least one of task performance or environment state comprises, for each candidate Learning Agent action and candidate Baseline Agent action: predicting a value of a reward function representing an impact of execution of the candidate Learning Agent action or candidate Baseline Agent action on performance of the task (232cii).
5. The method of embodiment 3 or 4, wherein evaluating the candidate Learning Agent action and the plurality of candidate Baseline Agent actions against a criterion relating to at least one of task performance or environment state comprises, for each candidate Learning Agent action and candidate Baseline Agent action: predicting a state of the environment following execution of the candidate Learning Agent action or candidate Baseline Agent action (232ci); and predicting a value of a reward function representing an impact of execution of the candidate Learning Agent action or candidate Baseline Agent action on performance of the task on the basis of the predicted environment state (232cii).
6. The method of any one of embodiments 3 to 5, wherein generating the environment action as a function of the candidate Learning Agent action and the candidate Baseline Agent actions on the basis of the evaluation comprises: selecting from among the candidate Learning Agent action and the candidate Baseline Agent actions the action which is predicted to generate at least one of (236ci): the highest value of a reward function representing an impact of execution of the action on performance of the task; the greatest increase in value of the reward function from a value based on the current state of the environment.
7. The method of any one of embodiments 3 to 6, wherein generating an environment action on the basis of the candidate Learning Agent action and the plurality of candidate Baseline Agent actions further comprises assembling a candidate set of environment actions from the candidate Learning Agent action and the plurality of candidate Baseline Agent actions on the basis of the evaluation (234c); and wherein generating the environment action as a function of the candidate Learning Agent action and the candidate Baseline Agent actions on the basis of the evaluation comprises selecting the environment action from the candidate set of environment actions (236c).
8. The method of any one of embodiments 1 to 7, wherein generating an environment action on the basis of the candidate Learning Agent action and the plurality of candidate Baseline Agent actions comprises: generating a weighted sum of a combination of the candidate Baseline Agent actions and the candidate Learning Agent action according to at least one of (234d): a predetermined risk schedule; performance feedback of the Baseline Agents and Learning Agent.
9. The method of any one of embodiments 1 to 8, wherein generating an environment action on the basis of the candidate Learning Agent action and the plurality of candidate Baseline Agent actions comprises: generating a weighted combination of candidate Baseline Agent actions, wherein weights are assigned to individual candidate Baseline Agent actions according to at least one of (232d): the representation of the current state of the environment; a candidate Baseline Agent action; a performance measure of the Baseline Agent.
10. The method of any one of embodiments 1 to 9, wherein the candidate Learning Agent action and candidate Baseline Agent actions comprise candidate action vectors, each element of a candidate action vector corresponding to a possible action and comprising a probability that the corresponding action is the most favourable of the possible actions according to a performance measure of the task; and wherein generating an environment action on the basis of the candidate Learning Agent action and the plurality of candidate Baseline Agent actions comprises: using a control policy to generate the environment action, wherein the control policy is configured to: compute a weighted combination of candidate Baseline Agent action vectors (232dii); compute a weighted sum of the weighted combination of candidate Baseline Agent action vectors and the candidate Learning Agent action vector (234dii); and select as the environment action the action corresponding to the highest probability value in the weighted sum (236di).
11. The method of any one of embodiments 8 to 10, wherein generating an environment action on the basis of the candidate Learning Agent action and the plurality of candidate Baseline Agent actions further comprises: updating weights of at least one of the weighted combination or weighted sum on the basis of at least one of (238d, 238di): the representation of the current state of the environment; a candidate Baseline Agent action or candidate Learning Agent action; a performance measure of the Baseline Agents or Learning Agent.
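One possible, non-limiting way to update the mixing weights of embodiment 11 from performance feedback is an exponential-weights style update; the learning rate and the per-agent reward signal used below are assumptions made for this sketch.

```python
import numpy as np

def update_agent_weights(weights, observed_rewards, learning_rate=0.1):
    """Update per-agent mixing weights from performance feedback (steps 238d, 238di)."""
    weights = np.asarray(weights, dtype=float) * np.exp(learning_rate * np.asarray(observed_rewards))
    return weights / weights.sum()  # renormalise so the weights remain a distribution
```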
12. The method of any one of embodiments 1 to 11, wherein the plurality of Agents comprises a plurality of Learning Agents, each Learning Agent operable to implement a Reinforcement Learning model for selecting actions to be executed on the environment; wherein receiving, from the Learning Agent, a candidate Learning Agent action for execution on the environment, comprises receiving a plurality of candidate Learning Agent actions from the plurality of Learning Agents (220); and wherein generating an environment action on the basis of the candidate Learning Agent action and the plurality of candidate Baseline Agent actions comprises generating an environment action on the basis of the plurality of candidate Learning Agent actions and the plurality of candidate Baseline Agent actions (230).
13. The method of any one of embodiments 1 to 12, wherein at least two of the Baseline Agents implement policies that are optimized for different regions of the state-action space of the environment, wherein the state-action space of the environment comprises the range of possible states in which the environment may exist and the available actions for execution on the environment in those states (210i).
14. The method of any one of embodiments 1 to 13, further comprising: receiving, from at least one of the plurality of Agents, performance feedback for the Agent (270).
15. The method of any one of embodiments 1 to 14, wherein the environment comprises a communication network (310i) and wherein the task that the environment is operable to perform comprises provision of communication network services.
16. The method of any one of embodiments 1 to 15, wherein the environment comprises a cell of a communication network (310ii) and wherein the task that the environment is operable to perform comprises provision of communication network services.
17. The method of embodiment 16, further comprising: verifying an impact of the environment action on a neighbour cell (335); and if the impact of the environment action on a neighbour cell violates a neighbour cell performance condition: generating a new environment action on the basis of the candidate Learning Agent action and the plurality of candidate Baseline Agent actions (337); and providing the new environment action to the environment for execution (340).
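The neighbour-cell safeguard of embodiment 17 can be pictured as a simple re-generation loop; generate_action and neighbour_impact_ok are hypothetical stand-ins for steps 337 and 335, and the bound on attempts is an assumption of the sketch.

```python
def safe_environment_action(candidates, generate_action, neighbour_impact_ok, max_attempts=10):
    """Generate an environment action, re-generating while it would harm a neighbour cell."""
    excluded = set()
    for _ in range(max_attempts):
        action = generate_action(candidates, excluded)  # generate / re-generate (337)
        if neighbour_impact_ok(action):                 # verify neighbour cell impact (335)
            return action                               # provide for execution (340)
        excluded.add(action)                            # do not propose this action again
    raise RuntimeError("no candidate action satisfied the neighbour cell condition")
```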
18. The method of any one of embodiments 15 to 17, wherein the representation of a current state of the environment comprises at least one of: a value of a network coverage parameter (301a); a value of a network capacity parameter (301b); a value of a network congestion parameter (301c); a current network resource allocation (301d); a current network resource configuration (301e); a current network usage parameter (301f); a current network parameter of a neighbour communication network cell (301g); a value of a network signal quality parameter (301h); a value of a network signal interference parameter (301i); a value of a network power parameter (301j); a current network frequency band (301k); a current network antenna down-tilt angle (301l); a current network antenna vertical beamwidth (301m); a current network antenna horizontal beamwidth (301n); a current network antenna height (301o); a current network geolocation (301p); a current network inter-site distance (301q).
19. The method of any one of embodiments 15 to 18, when dependent on embodiment 2, wherein the reward function comprises a function of at least one performance parameter for the communication network (350i).
20. The method of any one of embodiments 15 to 19, wherein an action for execution on the environment comprises at least one of (320i): an allocation decision for a communication network resource; a configuration for a communication network node; a configuration for communication network equipment; a configuration for a communication network operation; a decision relating to provision of communication network services for a wireless device; a configuration for an operation performed by a wireless device in relation to the communication network.
21. The method of any one of embodiments 1 to 20, wherein the environment comprises a sector of a cell of a communication network and wherein the task that the environment is operable to perform comprises provision of radio access network services; wherein the representation of a current state of the environment comprises at least one of: a coverage parameter for the sector; a capacity parameter for the sector; a signal quality parameter for the sector; a down tilt angle of the antenna serving the sector; and wherein an action for execution on the environment comprises a down tilt adjustment value for the antenna serving the sector.
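For the antenna-tilt example of embodiment 21, the state representation and action space might look as follows; the concrete fields, units and adjustment values are illustrative assumptions, not requirements of the embodiment.

```python
from dataclasses import dataclass

@dataclass
class SectorState:
    """Illustrative state of a cell sector served by a tiltable antenna."""
    coverage: float        # e.g. fraction of users above a signal strength threshold
    capacity: float        # e.g. normalised sector throughput
    signal_quality: float  # e.g. average SINR in the sector
    downtilt_deg: float    # current electrical down-tilt of the serving antenna

# Candidate environment actions: down-tilt adjustment values in degrees.
TILT_ACTIONS = (-1.0, 0.0, +1.0)
```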
22. The method of any one of embodiments 1 to 14, wherein the environment comprises a vehicle (310iii) and wherein the task that the environment is operable to perform comprises advancing over a terrain.
23. A computer program product comprising a computer readable medium, the computer readable medium having computer readable code embodied therein, the computer readable code being configured such that, on execution by a suitable computer or processor, the computer or processor is caused to perform a method of any one of embodiments 1 to 22.

24. A management node (700) for managing a system controlling an environment that is operable to perform a task, the management node comprising processing circuitry (702) configured to: provide, to a plurality of Agents, a representation of a current state of the environment, wherein the plurality of Agents comprises: a Learning Agent operable to implement a Reinforcement Learning model for selecting actions to be executed on the environment; and a plurality of Baseline Agents, each Baseline Agent operable to implement a policy for selecting actions to be executed on the environment, wherein each policy implemented by a Baseline Agent satisfies a criterion with respect to performance of the task; receive, from the Learning Agent, a candidate Learning Agent action for execution on the environment, and, from the plurality of Baseline Agents, a plurality of candidate Baseline Agent actions for execution on the environment; generate an environment action on the basis of the candidate Learning Agent action and the plurality of candidate Baseline Agent actions; and provide the environment action to the system for execution on the environment.
25. The management node of embodiment 24, wherein the processing circuitry is further configured to perform the steps of any one of embodiments 2 to 22.
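Bringing the embodiments together, one possible realisation of the overall management loop (embodiments 1, 2 and 24) is sketched below; the system and agent interfaces and the combine function (standing in for any of the generation strategies sketched above) are hypothetical.

```python
def management_loop(system, learning_agent, baseline_agents, combine, steps=100):
    """Sketch of the management method: propose, combine, execute, feed back."""
    state = system.current_state()
    for _ in range(steps):
        baseline_candidates = [agent.propose(state) for agent in baseline_agents]  # step 120
        learning_candidate = learning_agent.propose(state)                         # step 120
        action = combine(learning_candidate, baseline_candidates, state)           # step 130
        next_state, reward = system.execute(action)                                # steps 140, 250
        for agent in baseline_agents + [learning_agent]:                           # step 260
            agent.observe(state, action, reward, next_state)
        state = next_state
```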

Claims

1. A computer implemented method (100) for managing a system controlling an environment that is operable to perform a task, the method comprising: providing, to a plurality of Agents, a representation of a current state of the environment (110), wherein the plurality of Agents comprises (110a): a Learning Agent operable to implement a Reinforcement Learning model for selecting actions to be executed on the environment; and a plurality of Baseline Agents, each Baseline Agent operable to implement a policy for selecting actions to be executed on the environment, wherein each policy implemented by a Baseline Agent satisfies a criterion with respect to performance of the task; receiving, from the Learning Agent, a candidate Learning Agent action for execution on the environment, and, from the plurality of Baseline Agents, a plurality of candidate Baseline Agent actions for execution on the environment (120); generating an environment action on the basis of the candidate Learning Agent action and the plurality of candidate Baseline Agent actions (130); and providing the environment action to the system for execution on the environment (140).
2. The method of claim 1, further comprising: receiving, from the system, a representation of a state of the environment following execution of the environment action, and a value of a reward function representing an impact of execution of the environment action on performance of the task (250); and providing, to the plurality of Agents (260): the representation of a state of the environment following execution of the environment action; the value of the reward function; and a representation of the environment action.
3. The method of claim 1 or 2, wherein generating an environment action on the basis of the candidate Learning Agent action and the plurality of candidate Baseline Agent actions comprises: evaluating the candidate Learning Agent action and the plurality of candidate Baseline Agent actions against a criterion relating to at least one of task performance or environment state (232c); and generating the environment action as a function of the candidate Learning Agent action and the candidate Baseline Agent actions on the basis of the evaluation (236c).
4. The method of claim 3, wherein evaluating the candidate Learning Agent action and the plurality of candidate Baseline Agent actions against a criterion relating to at least one of task performance or environment state comprises, for each candidate Learning Agent action and candidate Baseline Agent action: predicting a value of a reward function representing an impact of execution of the candidate Learning Agent action or candidate Baseline Agent action on performance of the task (232cii).
5. The method of claim 3 or 4, wherein evaluating the candidate Learning Agent action and the plurality of candidate Baseline Agent actions against a criterion relating to at least one of task performance or environment state comprises, for each candidate Learning Agent action and candidate Baseline Agent action: predicting a state of the environment following execution of the candidate Learning Agent action or candidate Baseline Agent action (232ci); and predicting a value of a reward function representing an impact of execution of the candidate Learning Agent action or candidate Baseline Agent action on performance of the task on the basis of the predicted environment state (232cii).
6. The method of any one of claims 3 to 5, wherein generating the environment action as a function of the candidate Learning Agent action and the candidate Baseline Agent actions on the basis of the evaluation comprises: selecting, from among the candidate Learning Agent action and the candidate Baseline Agent actions, the action which is predicted to generate at least one of (236ci): the highest value of a reward function representing an impact of execution of the action on performance of the task; the greatest increase in value of the reward function from a value based on the current state of the environment.
7. The method of any one of claims 3 to 6, wherein generating an environment action on the basis of the candidate Learning Agent action and the plurality of candidate Baseline Agent actions further comprises assembling a candidate set of environment actions from the candidate Learning Agent action and the plurality of candidate Baseline Agent actions on the basis of the evaluation (234c); and wherein generating the environment action as a function of the candidate Learning Agent action and the candidate Baseline Agent actions on the basis of the evaluation comprises selecting the environment action from the candidate set of environment actions (236c).
8. The method of any one of claims 1 to 7, wherein generating an environment action on the basis of the candidate Learning Agent action and the plurality of candidate Baseline Agent actions comprises: generating a weighted sum of a combination of the candidate Baseline Agent actions and the candidate Learning Agent action according to at least one of (234d): a predetermined risk schedule; performance feedback of the Baseline Agents and Learning Agent.
9. The method of any one of claims 1 to 8, wherein generating an environment action on the basis of the candidate Learning Agent action and the plurality of candidate Baseline Agent actions comprises: generating a weighted combination of candidate Baseline Agent actions, wherein weights are assigned to individual candidate Baseline Agent actions according to at least one of (232d): the representation of the current state of the environment; a candidate Baseline Agent action; a performance measure of the Baseline Agent.
10. The method of any one of claims 1 to 9, wherein the candidate Learning Agent action and candidate Baseline Agent actions comprise candidate action vectors, each element of a candidate action vector corresponding to a possible action and comprising a probability that the corresponding action is the most favourable of the possible actions according to a performance measure of the task; and wherein generating an environment action on the basis of the candidate Learning Agent action and the plurality of candidate Baseline Agent actions comprises: using a control policy to generate the environment action, wherein the control policy is configured to: compute a weighted combination of candidate Baseline Agent action vectors (232dii); compute a weighted sum of the weighted combination of candidate Baseline Agent action vectors and the candidate Learning Agent action vector (234dii); and select as the environment action the action corresponding to the highest probability value in the weighted sum (236di).
11. The method of any one of claims 8 to 10, wherein generating an environment action on the basis of the candidate Learning Agent action and the plurality of candidate Baseline Agent actions further comprises: updating weights of at least one of the weighted combination or weighted sum on the basis of at least one of (238d, 238di): the representation of the current state of the environment; a candidate Baseline Agent action or candidate Learning Agent action; a performance measure of the Baseline Agents or Learning Agent.
12. The method of any one of claims 1 to 11, wherein the environment comprises at least one of a communication network (310i) or a cell of a communication network (310ii), and wherein the task that the environment is operable to perform comprises provision of communication network services.
13. The method of claim 12, wherein the environment comprises a cell of a communication network, the method further comprising: verifying an impact of the environment action on a neighbour cell (335); and if the impact of the environment action on a neighbour cell violates a neighbour cell performance condition: generating a new environment action on the basis of the candidate Learning Agent action and the plurality of candidate Baseline Agent actions (337); and providing the new environment action to the environment for execution (340).
14. The method of claim 12 or 13, wherein the representation of a current state of the environment comprises at least one of: a value of a network coverage parameter (301a); a value of a network capacity parameter (301b); a value of a network congestion parameter (301c); a current network resource allocation (301d); a current network resource configuration (301e); a current network usage parameter (301f); a current network parameter of a neighbour communication network cell (301g); a value of a network signal quality parameter (301h); a value of a network signal interference parameter (301i); a value of a network power parameter (301j); a current network frequency band (301k); a current network antenna down-tilt angle (301l); a current network antenna vertical beamwidth (301m); a current network antenna horizontal beamwidth (301n); a current network antenna height (301o); a current network geolocation (301p); a current network inter-site distance (301q); and wherein an action for execution on the environment comprises at least one of (320i): an allocation decision for a communication network resource; a configuration for a communication network node; a configuration for communication network equipment; a configuration for a communication network operation; a decision relating to provision of communication network services for a wireless device; a configuration for an operation performed by a wireless device in relation to the communication network.
15. The method of any one of claims 12 to 14, when dependent on claim 2, wherein the reward function comprises a function of at least one performance parameter for the communication network (350i).
16. The method of any one of claims 1 to 15, wherein the environment comprises a sector of a cell of a communication network and wherein the task that the environment is operable to perform comprises provision of radio access network services; wherein the representation of a current state of the environment comprises at least one of: a coverage parameter for the sector; a capacity parameter for the sector; a signal quality parameter for the sector; a down tilt angle of the antenna serving the sector; and wherein an action for execution on the environment comprises a down tilt adjustment value for the antenna serving the sector.
17. A computer program product comprising a computer readable medium, the computer readable medium having computer readable code embodied therein, the computer readable code being configured such that, on execution by a suitable computer or processor, the computer or processor is caused to perform a method of any one of claims 1 to 16.
18. A management node (700) for managing a system controlling an environment that is operable to perform a task, the management node comprising processing circuitry (702) configured to: provide, to a plurality of Agents, a representation of a current state of the environment, wherein the plurality of Agents comprises: a Learning Agent operable to implement a Reinforcement Learning model for selecting actions to be executed on the environment; and a plurality of Baseline Agents, each Baseline Agent operable to implement a policy for selecting actions to be executed on the environment, wherein each policy implemented by a Baseline Agent satisfies a criterion with respect to performance of the task; receive, from the Learning Agent, a candidate Learning Agent action for execution on the environment, and, from the plurality of Baseline Agents, a plurality of candidate Baseline Agent actions for execution on the environment; generate an environment action on the basis of the candidate Learning Agent action and the plurality of candidate Baseline Agent actions; provide the environment action to the system for execution on the environment.
19. The management node of claim 18, wherein the processing circuitry is further configured to perform the steps of any one of claims 2 to 16.
20. The management node of claim 18 or 19, wherein the management node comprises or is instantiated in part of a logical core network node, network management center, network operations center, or Radio Access node.

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202063058985P 2020-07-30 2020-07-30
US63/058,985 2020-07-30

Publications (1)

Publication Number Publication Date
WO2022023218A1 (en)

Family

ID=77168244

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2021/070720 WO2022023218A1 (en) 2020-07-30 2021-07-23 Methods and apparatus for managing a system that controls an environment

Country Status (1)

Country Link
WO (1) WO2022023218A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023174564A1 (en) * 2022-03-18 2023-09-21 Telefonaktiebolaget Lm Ericsson (Publ) Management of communication network parameters
WO2023222188A1 (en) * 2022-05-16 2023-11-23 Telefonaktiebolaget Lm Ericsson (Publ) Methods, apparatus and computer-readable media for managing a system operative in a telecommunication environment

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
DUNG T PHAN ET AL: "Neural Simplex Architecture", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 1 August 2019 (2019-08-01), XP081627516 *
NGUYEN CONG LUONG ET AL: "Applications of Deep Reinforcement Learning in Communications and Networking: A Survey", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 18 October 2018 (2018-10-18), pages 1 - 37, XP081067281 *
SHAOSHUAI FAN ET AL: "Self-optimization of coverage and capacity based on a fuzzy neural network with cooperative reinforcement learning", EURASIP JOURNAL ON WIRELESS COMMUNICATIONS AND NETWORKING, vol. 2014, no. 1, 1 December 2014 (2014-12-01), pages 57, XP055767808, DOI: 10.1186/1687-1499-2014-57 *
WENBO ZHANG ET AL: "MAMPS: Safe Multi-Agent Reinforcement Learning via Model Predictive Shielding", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 25 October 2019 (2019-10-25), XP081567313 *


Similar Documents

Publication Publication Date Title
Liu et al. Multi-agent reinforcement learning for resource allocation in IoT networks with edge computing
Chen et al. Learning and management for Internet of Things: Accounting for adaptivity and scalability
WO2022023218A1 (en) Methods and apparatus for managing a system that controls an environment
Alwarafy et al. Deep reinforcement learning for radio resource allocation and management in next generation heterogeneous wireless networks: A survey
CN108293005A (en) Method and apparatus for sharing status related information
WO2013027026A1 (en) Adaptive communications network
Rezazadeh et al. On the specialization of fdrl agents for scalable and distributed 6g ran slicing orchestration
Bai et al. Towards autonomous multi-UAV wireless network: A survey of reinforcement learning-based approaches
Alwarafy et al. The frontiers of deep reinforcement learning for resource management in future wireless HetNets: Techniques, challenges, and research directions
US20230216737A1 (en) Network performance assessment
US20230217264A1 (en) Dynamic spectrum sharing based on machine learning
Bag et al. Machine learning-based recommender systems to achieve self-coordination between SON functions
Noman et al. Machine Learning Empowered Emerging Wireless Networks in 6G: Recent Advancements, Challenges & Future Trends
Wu et al. Distributed reinforcement learning algorithm of operator service slice competition prediction based on zero-sum markov game
US20240086715A1 (en) Training and using a neural network for managing an environment in a communication network
Nagib et al. Accelerating reinforcement learning via predictive policy transfer in 6G RAN slicing
US20240195689A1 (en) Methods and apparatus for managing an environment within a domain
US20240205698A1 (en) Coordinating management of a plurality of cells in a cellular communication network
WO2022253453A1 (en) Training a policy for managing a communication network environment
WO2023174630A1 (en) Hybrid agent for parameter optimization using prediction and reinforcement learning
Liu et al. Slice sandwich: Jagged slicing multi-tier dynamic resources for diversified V2X services
Alwarafy Deep Reinforcement Learning for Radio Resource Management in Future AI-Driven HetNets
WO2023174564A1 (en) Management of communication network parameters
Martiradonna et al. Deep Reinforcement Learning-Aided RAN Slicing Enforcement for B5G Latency Sensitive Services
Gillani et al. Computational complexity reduction algorithms for Markov decision process based vertical handoff in mobile networks

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 21749165; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: PCT application non-entry in European phase (Ref document number: 21749165; Country of ref document: EP; Kind code of ref document: A1)