IL310753B2 - A system and method for determining a distributed dynamic channel allocation (DDCA) policy - Google Patents

A system and method for determining a distributed dynamic channel allocation (DDCA) policy

Info

Publication number
IL310753B2
Authority
IL
Israel
Prior art keywords
given
agents
ddca
reward
policy
Prior art date
Application number
IL310753A
Other languages
Hebrew (he)
Other versions
IL310753A (en)
IL310753B1 (en)
Inventor
Cohen Yaniv
GREENBERG Ronen
Original Assignee
Elbit Systems C4I & Cyber Ltd
Cohen Yaniv
GREENBERG Ronen
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Elbit Systems C4I & Cyber Ltd, Cohen Yaniv, GREENBERG Ronen filed Critical Elbit Systems C4I & Cyber Ltd
Priority to IL310753A priority Critical patent/IL310753B2/en
Publication of IL310753A publication Critical patent/IL310753A/en
Publication of IL310753B1 publication Critical patent/IL310753B1/en
Priority to PCT/IL2025/050124 priority patent/WO2025169194A1/en
Publication of IL310753B2 publication Critical patent/IL310753B2/en

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/092 Reinforcement learning
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/004 Artificial life, i.e. computing arrangements simulating life
    • G06N3/006 Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/098 Distributed learning, e.g. federated learning
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00 Traffic control in data switching networks
    • H04L47/70 Admission control; Resource allocation
    • H04L47/78 Architectures of resource allocation
    • H04L47/783 Distributed allocation of resources, e.g. bandwidth brokers
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W16/00 Network planning, e.g. coverage or traffic planning tools; Network deployment, e.g. resource partitioning or cells structures
    • H04W16/22 Traffic simulation tools or models
    • H04W72/00 Local resource management
    • H04W72/50 Allocation or scheduling criteria for wireless resources

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Mobile Radio Communication Systems (AREA)

Description

A SYSTEM AND METHOD FOR DETERMINING A DISTRIBUTED DYNAMIC CHANNEL ALLOCATION (DDCA) POLICY

TECHNICAL FIELD

The invention relates to a system and method for determining a Distributed Dynamic Channel Allocation (DDCA) policy.
BACKGROUND

Dynamic Channel Allocation (DCA), also referred to as dynamic channel selection, is one of the most challenging tasks in the domain of resource allocation. DCA is widely recognized as a crucial mechanism for facilitating efficient communication within both centralized and distributed systems operating in a constantly changing spectrum environment. DCA algorithms can be categorized into two distinct branches, namely centralized and distributed methods. Within the centralized approach, notable studies have employed techniques such as graph coloring. Conversely, in the distributed approach to DCA, widely regarded as more suitable for a range of wireless communication applications, the problem assumes a more intricate nature. In wireless networks characterized by the coexistence of diverse networks lacking intercommunication, the effective implementation of a Distributed Dynamic Channel Allocation (DDCA) mechanism holds the potential to yield multifaceted advantages, thereby notably augmenting the efficient utilization of the electromagnetic spectrum.

Current solutions apply the knowledge and know-how of skilled engineers to determine a DDCA policy used to control the dynamic selection of channels by diverse communication networks that may have no or only partial communication between them. Such an engineered solution is limited in its scalability and in its flexibility to fit communication networks with different topologies. There is thus a need in the art for a new method and system for determining a DDCA policy.

GENERAL DESCRIPTION

In accordance with a first aspect of the presently disclosed subject matter, there is provided a system for determining a Distributed Dynamic Channel Allocation (DDCA) policy allowing members of a wireless communication network that communicate through communication channels to decide on a given communication channel of the communication channels to allocate for communication, the system comprising a processing circuitry configured to: obtain a multi-stepped simulation within a simulated environment, the simulated environment comprising: (i) models of the communication channels, and (ii) two or more agents, at least one given agent of the agents representing a wireless communication network with two or more members utilizing the given communication channel, the given agent having: (i) a spatial location, and (ii) a corresponding DDCA policy; execute the multi-stepped simulation, wherein at each step of the multi-stepped simulation, at least one given agent of the agents: (a) senses a state of the communication channels, utilizing the models of the simulated environment and the spatial locations of the agents, (b) allocates, utilizing the state of the communication channels and the corresponding DDCA policy, a given communication channel of the communication channels to be used in a next step of the multi-stepped simulation by the wireless communication network associated with the given agent, (c) receives a reward, wherein the reward is a weighted linear combination of a personal reward, associated with a quality of the given communication channel allocated by the given agent, and a social reward, associated with the personal rewards of other agents of the agents, other than the given agent, wherein a sum of the weights of the personal reward and the social reward is one, and (d) makes changes to the corresponding DDCA policy in accordance with the reward; and, upon the execution of the multi-stepped simulation meeting a convergence condition at a given step, determine the DDCA policy to be the corresponding DDCA policy at the given step.

In some cases, the convergence condition is that a deviation of an expectation of an accumulated reward for a threshold number of previous steps, previous to the given step, is below a deviation threshold. In some cases, the other agents are neighboring agents of the given agent, being agents associated with a spatial location having a Euclidean distance below a distance threshold from the spatial location associated with the given agent.
In some cases, no information is passed between the agents during the execution of the multi-stepped simulation. In some cases, the communication channels are non-orthogonal. In some cases, the models are radio wave propagation models. In some cases, the state of the communication channels sensed by the given agent is a Signal-to-Interference-and-Noise Ratio (SINR) of the corresponding sensed communication channel.

In accordance with a second aspect of the presently disclosed subject matter, there is provided a system for allocating communication channels in a wireless communication network, wherein at least one member of the wireless communication network utilizes the DDCA policy of claim 1 to decide on a given communication channel of the communication channels to allocate for communication with at least one member of the members of the wireless communication network.

In accordance with a third aspect of the presently disclosed subject matter, there is provided a method for determining a Distributed Dynamic Channel Allocation (DDCA) policy allowing members of a wireless communication network that communicate through communication channels to decide on a given communication channel of the communication channels to allocate for communication, the method comprising: obtaining, by a processing circuitry, a multi-stepped simulation within a simulated environment, the simulated environment comprising: (i) models of the communication channels, and (ii) two or more agents, at least one given agent of the agents representing a wireless communication network with two or more members utilizing the given communication channel, the given agent having: (i) a spatial location, and (ii) a corresponding DDCA policy; executing, by the processing circuitry, the multi-stepped simulation, wherein at each step of the multi-stepped simulation, at least one given agent of the agents: (a) senses a state of the communication channels, utilizing the models of the simulated environment and the spatial locations of the agents, (b) allocates, utilizing the state of the communication channels and the corresponding DDCA policy, a given communication channel of the communication channels to be used in a next step of the multi-stepped simulation by the wireless communication network associated with the given agent, (c) receives a reward, wherein the reward is a weighted linear combination of a personal reward, associated with a quality of the given communication channel allocated by the given agent, and a social reward, associated with the personal rewards of other agents of the agents, other than the given agent, wherein a sum of the weights of the personal reward and the social reward is one, and (d) makes changes to the corresponding DDCA policy in accordance with the reward; and upon the execution of the multi-stepped simulation meeting a convergence condition at a given step, determining, by the processing circuitry, the DDCA policy to be the corresponding DDCA policy at the given step.

In some cases, the convergence condition is that a deviation of an expectation of an accumulated reward for a threshold number of previous steps, previous to the given step, is below a deviation threshold. In some cases, the other agents are neighboring agents of the given agent, being agents associated with a spatial location having a Euclidean distance below a distance threshold from the spatial location associated with the given agent.
In some cases, no information is passed between the agents during the execution of the multi-stepped simulation. In some cases, the communication channels are non-orthogonal. In some cases, the models are radio wave propagation models. In some cases, the state of the communication channels sensed by the given agent is a Signal-to-Interference-and-Noise Ratio (SINR) of the corresponding sensed communication channel.

In accordance with a fourth aspect of the presently disclosed subject matter, there is provided a method for allocating communication channels in a wireless communication network, wherein at least one member of the wireless communication network utilizes the DDCA policy of claim 9 to decide on a given communication channel of the communication channels to allocate for communication with at least one member of the members of the wireless communication network.

In accordance with a fifth aspect of the presently disclosed subject matter, there is provided a non-transitory computer readable storage medium having computer readable program code embodied therewith, the computer readable program code executable by at least one processor to perform a method for determining a Distributed Dynamic Channel Allocation (DDCA) policy allowing members of a wireless communication network that communicate through communication channels to decide on a given communication channel of the communication channels to allocate for communication, the method comprising: obtaining, by a processing circuitry, a multi-stepped simulation within a simulated environment, the simulated environment comprising: (i) models of the communication channels, and (ii) two or more agents, at least one given agent of the agents representing a wireless communication network with two or more members utilizing the given communication channel, the given agent having: (i) a spatial location, and (ii) a corresponding DDCA policy; executing, by the processing circuitry, the multi-stepped simulation, wherein at each step of the multi-stepped simulation, at least one given agent of the agents: (a) senses a state of the communication channels, utilizing the models of the simulated environment and the spatial locations of the agents, (b) allocates, utilizing the state of the communication channels and the corresponding DDCA policy, a given communication channel of the communication channels to be used in a next step of the multi-stepped simulation by the wireless communication network associated with the given agent, (c) receives a reward, wherein the reward is a weighted linear combination of a personal reward, associated with a quality of the given communication channel allocated by the given agent, and a social reward, associated with the personal rewards of other agents of the agents, other than the given agent, wherein a sum of the weights of the personal reward and the social reward is one, and (d) makes changes to the corresponding DDCA policy in accordance with the reward; and upon the execution of the multi-stepped simulation meeting a convergence condition at a given step, determining, by the processing circuitry, the DDCA policy to be the corresponding DDCA policy at the given step.

BRIEF DESCRIPTION OF THE DRAWINGS

In order to understand the presently disclosed subject matter and to see how it may be carried out in practice, the subject matter will now be described, by way of non-limiting examples only, with reference to the accompanying drawings, in which:

Fig. 1 is an exemplary illustration of the channel partition, wherein $k$ and $f_k$ stand for the channel and carrier frequency, respectively;

Fig. 2 is an exemplary illustration of a hypothetical scenario involving six networks, wherein each shade corresponds to a distinct network, distinguished by a unique serial number, and wherein the symbol 'Ma' denotes the network manager present at each respective network;

Fig. 3 is an exemplary illustration of accumulated rewards as a function of episode number during training, for implementations with and without the action-masking approach;

Fig. 4 is an exemplary illustration of the mean ($\overline{CQ}$), median ($\widetilde{CQ}$), and minimum (minCQ) values of channel quality during training;

Fig. 5 is an exemplary illustration of the score values of Average Number of Channel Changes (ANCC), Convergence Time (CT), Spectrum Efficiency (SE), and the Weighted Score (WS) during training;

Fig. 6 is an exemplary illustration of the weighted score value, WS, as a function of the number of networks participating in the scenario for different values of $\rho$;

Fig. 7 is an exemplary illustration of the expectation of the mean value of CQ and minCQ as a function of the number of networks participating in the scenario for different values of $\rho$;

Fig. 8 is an exemplary illustration of the weighted score values, WS, as a function of the number of networks participating in the scenario for different values of $\tau$ [m]; Insert: the average value of WS across all scenarios as a function of $\tau$ [m];

Fig. 9 is an exemplary illustration of the expectation of the mean value of CQ and minCQ as a function of the number of networks participating in the scenario for different values of $\tau$ [m]; Insert: the average value of E[(CQ+minCQ)/2] across all scenarios as a function of $\tau$ [m];

Fig. 10 is an exemplary illustration of the expectation of the weighted score, WS, under the post-processing mechanism, as a function of the number of networks participating in each scenario; Insert: the average value of WS across all scenarios as a function of the threshold value $\eta$, wherein $\eta$ represents the percentage of change between the current quality of the channel in use and the anticipated channel quality of the next channel to be used, as determined by the policy, and wherein 0 stands for None (no post-processing);

Fig. 11 is an exemplary illustration of the expectation of the average between CQ and minCQ as a function of the number of networks participating in each scenario; Insert: the average value of E[(CQ+minCQ)/2] across all scenarios as a function of the threshold value $\eta$, wherein 0 stands for None;

Fig. 12 is an exemplary illustration of the weighted score value, WS, as a function of the number of networks participating in the scenario for different alternative algorithms;

Fig. 13 is an exemplary illustration of the expectation of the mean value of CQ and minCQ as a function of the number of networks participating in the scenario for different alternative algorithms, wherein the alternative algorithms include a centralized graph-coloring algorithm, which is a centralized approach that serves as a near-optimal solution, and other algorithms that utilize distributed techniques;

Fig. 14 is an exemplary illustration of the Convergence Time Score (CTS) as a function of the number of networks participating in the scenario for different alternative algorithms;

Fig. 15 is an exemplary illustration of the average value of E[(CQ+minCQ)/2] across all scenarios for each algorithm, including the proposed algorithm;

Fig. 16 is an exemplary illustration of the accumulated rewards during training based on different values of the temperature parameter $\omega$;

Fig. 17 is an exemplary block diagram schematically illustrating one example of a system for determining a Distributed Dynamic Channel Allocation (DDCA) policy, in accordance with the presently disclosed subject matter; and

Fig. 18 is an exemplary flowchart illustrating one example of a sequence of operations carried out for determining a Distributed Dynamic Channel Allocation (DDCA) policy, in accordance with the presently disclosed subject matter.
DETAILED DESCRIPTION

In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the presently disclosed subject matter. However, it will be understood by those skilled in the art that the presently disclosed subject matter may be practiced without these specific details. In other instances, well-known methods, procedures, and components have not been described in detail so as not to obscure the presently disclosed subject matter. In the drawings and descriptions set forth, identical reference numerals indicate those components that are common to different embodiments or configurations.

Unless specifically stated otherwise, as apparent from the following discussions, it is appreciated that throughout the specification, discussions utilizing terms such as "generating", "formatting", "determining", "capturing", "performing", "updating", "transmitting", "obtaining", "executing", "receiving" or the like, include actions and/or processes of a computer that manipulate and/or transform data into other data, said data represented as physical quantities, e.g., such as electronic quantities, and/or said data representing the physical objects.

The terms "computer", "processor", "processing resource", "processing circuitry" and "controller" should be expansively construed to cover any kind of electronic device with data processing capabilities, including, by way of non-limiting example, a personal desktop/laptop computer, a server, a computing system, a communication device, a smartphone, a tablet computer, a smart television, a processor (e.g., a digital signal processor (DSP), a microcontroller, a field programmable gate array (FPGA), an application specific integrated circuit (ASIC), etc.), a group of multiple physical machines sharing performance of various tasks, virtual servers co-residing on a single physical machine, any other electronic computing device, and/or any combination thereof.

The operations in accordance with the teachings herein may be performed by a computer specially constructed for the desired purposes, or by a general-purpose computer specially configured for the desired purpose by a computer program stored in a non-transitory computer readable storage medium. The term "non-transitory" is used herein to exclude transitory, propagating signals, but to otherwise include any volatile or non-volatile computer memory technology suitable to the application.

As used herein, the phrases "for example", "such as", "for instance" and variants thereof describe non-limiting embodiments of the presently disclosed subject matter. Reference in the specification to "one case", "some cases", "other cases" or variants thereof means that a particular feature, structure or characteristic described in connection with the embodiment(s) is included in at least one embodiment of the presently disclosed subject matter. Thus, the appearance of the phrase "one case", "some cases", "other cases" or variants thereof does not necessarily refer to the same embodiment(s).

It is appreciated that, unless specifically stated otherwise, certain features of the presently disclosed subject matter, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the presently disclosed subject matter, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable sub-combination.
In embodiments of the presently disclosed subject matter, fewer, more and/or different stages than those shown in Fig. 18 may be executed. In embodiments of the presently disclosed subject matter, one or more stages illustrated in Fig. 18 may be executed in a different order and/or one or more groups of stages may be executed simultaneously. Fig. 17 illustrates a general schematic of the system architecture in accordance with an embodiment of the presently disclosed subject matter. Each module in Fig. 17 can be made up of any combination of software, hardware and/or firmware that performs the functions as defined and explained herein. The modules in Fig. 17 may be centralized in one location or dispersed over more than one location. In other embodiments of the presently disclosed subject matter, the system may comprise fewer, more, and/or different modules than those shown in Fig. 17.

Any reference in the specification to a method should be applied mutatis mutandis to a system capable of executing the method, and should be applied mutatis mutandis to a non-transitory computer readable medium that stores instructions that, once executed by a computer, result in the execution of the method. Any reference in the specification to a system should be applied mutatis mutandis to a method that may be executed by the system, and should be applied mutatis mutandis to a non-transitory computer readable medium that stores instructions that may be executed by the system. Any reference in the specification to a non-transitory computer readable medium should be applied mutatis mutandis to a system capable of executing the instructions stored in the non-transitory computer readable medium, and should be applied mutatis mutandis to a method that may be executed by a computer that reads the instructions stored in the non-transitory computer readable medium.

Described herein are a system and method for determining a Distributed Dynamic Channel Allocation (DDCA) policy based on a Neural Network (NN) model, or on any other Machine Learning (ML) or Artificial Intelligence (AI) model, designed for large-scale networks in a domain of non-orthogonal channels and utilizing multi-agent deep reinforcement learning. The main objective of the system is to determine a DDCA policy that improves channel quality while minimizing convergence time. The problem is approached as a multi-agent system, and the NN model can be optimized using a Centralized Training with Decentralized Execution (CTDE) paradigm, employing a reinforcement learning algorithm; a non-limiting example of such a reinforcement learning algorithm is the DeepMellow value-based reinforcement learning algorithm. The results manifest exceptional performance and robust generalization, showcasing superior efficacy in comparison to alternative methodologies while evincing a marginally diminished performance relative to a fully centralized approach.
The described system for determining the DDCA policy for communication networks, also referred to herein as "the system", is based on reinforcement learning. Reinforcement learning is a subset of machine learning that allows an AI-driven system (sometimes referred to as an agent) to learn through trial and error using feedback from its actions. This feedback is either negative or positive, signaled as punishment or reward, with the aim of maximizing the expectation (average) of the accumulated reward along the scenario, which is the accumulation of the personal rewards and the social rewards given through the execution of the multi-stepped simulation. Reinforcement learning learns from its mistakes and offers artificial intelligence that mimics natural intelligence as closely as is currently possible. This means that the DDCA policy is not pre-determined based on specific experience and/or know-how, but is instead iteratively learned by the system in a simulated environment as described herein. The learning process ends in determining the DDCA policy.

The determined DDCA policy can be utilized by at least one member of a wireless communication network to decide on the allocation of a given communication channel, of a plurality of available communication channels of the wireless communication network, for communication with at least one other member of the members of the wireless communication network. The decision of each member can be made with no information from other members of the communication network. The decision to allocate a channel can be made in more than one communication network without sharing information between the communication networks.

The system determines and/or generates the DDCA policy by utilizing a multi-stepped simulation within a target simulated environment of multiple communication networks. In this simulated environment, each agent represents an entire communication network with multiple members that communicate through a selected given communication channel. The agent can be seen as a network manager that makes the channel allocation decision for all members of the network represented by that agent. The selection of the given communication channel to be used by each of the agents is the task of the DDCA policy.

The multi-stepped simulation starts with the agents having a default DDCA policy. At each step, also referred to herein as an "episode", the DDCA policy of the agents changes, for example by changing weights on elements of the DDCA policy. Each agent receives an individual reward for the decisions made following its current DDCA policy, and changes the DDCA policy accordingly. At the end of each episode, the trajectories, comprised of pairs of states and actions taken by the agents, are stored in a global storage. The network is then optimized based on this stored information to improve the accumulated reward achieved by the agents.

The reward is a weighted linear combination of a personal reward, associated with the quality of the given communication channel allocated by the agent, and a social reward, associated with the personal rewards of other agents. These other agents can be agents that are neighbors of the agent, for example those agents located at a Euclidean distance of less than a threshold from the agent (a non-limiting example is 500 meters). This means that in some cases not all agents are taken into account when calculating the social reward, but only those agents that represent communication networks at a distance that can affect the given agent and the communication network it represents. The sum of the weights of the personal reward and the social reward is one, which means that there is a tradeoff between giving more weight to the personal reward (the greediness of the algorithm) and giving more weight to the social reward (the algorithm taking the neighbors' needs into account). A non-limiting example is a weight of 0.7 for the personal reward and 0.3 for the social reward.

Upon the execution of the multi-stepped simulation meeting a convergence condition at a given step, the system can determine the DDCA policy to be the corresponding DDCA policy at that given step. An example of the convergence condition is that the deviation of the expectation of the accumulated reward for a threshold number of previous steps, previous to the given step, is below a deviation threshold. The determined DDCA policy can be used by at least one member of the wireless communication network to decide on a given communication channel of the communication channels to allocate for communication with at least one member of the members of the wireless communication network.
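By way of illustration only, the reward combination, the neighbor criterion, and the convergence condition described above can be sketched in Python as follows. The 0.7/0.3 weights and the 500-meter radius follow the non-limiting examples given above; the function names, the window size, and the deviation threshold are assumptions made for the sake of the sketch.

```python
import numpy as np

def combined_reward(personal_reward, neighbor_personal_rewards, w_personal=0.7):
    """Weighted linear combination of the personal and social rewards.
    The two weights sum to one: w_personal controls the greediness of the
    algorithm, and (1 - w_personal) weights the social reward, taken here
    as the mean of the neighbors' personal rewards."""
    social_reward = (np.mean(neighbor_personal_rewards)
                     if len(neighbor_personal_rewards) > 0 else 0.0)
    return w_personal * personal_reward + (1.0 - w_personal) * social_reward

def neighbors_of(agent_idx, locations, distance_threshold=500.0):
    """Indices of the agents whose Euclidean distance from the given agent's
    spatial location is below the threshold (500 m in the example above)."""
    distances = np.linalg.norm(locations - locations[agent_idx], axis=1)
    mask = distances < distance_threshold
    mask[agent_idx] = False  # an agent is not its own neighbor
    return np.nonzero(mask)[0]

def has_converged(accumulated_rewards, window=100, deviation_threshold=0.01):
    """Example convergence condition: the deviation of the accumulated reward
    over a threshold number of previous steps is below a deviation threshold."""
    if len(accumulated_rewards) < window:
        return False
    return float(np.std(accumulated_rewards[-window:])) < deviation_threshold
```

Because the two weights sum to one, a single scalar suffices to trade off the greediness of the algorithm against the neighbors' needs, which is the tradeoff described above.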
When a need for the determination and/or generation of a new DDCA policy arises, for example when the DDCA policy is going to be used in another target environment which includes communication channels having different frequencies and/or communication networks with different topologies, the system can run the multi-stepped simulation and/or a changed multi-stepped simulation within a new and/or updated simulated environment that has models and/or a topology in accordance with that other target environment. The members of the communication networks in the other environment will use the new DDCA policy.

Bearing this in mind, attention is drawn to Fig. 17, which is an exemplary block diagram schematically illustrating one example of a system for determining a Distributed Dynamic Channel Allocation (DDCA) policy, in accordance with the presently disclosed subject matter.

The system for determining a DDCA policy 200, also referred to herein as "system 200", can comprise or be otherwise associated with a data repository 210 (e.g., a database, a storage system, a memory including Read Only Memory (ROM), Random Access Memory (RAM), or any other type of memory, etc.) configured to store data, including, inter alia, models of communication channels, spatial locations of one or more agents, one or more DDCA policies, a multi-stepped simulation and its properties, a simulated environment and its properties, simulated communication channels and their properties, etc. In some cases, data repository 210 can be further configured to enable retrieval and/or update and/or deletion of the data stored thereon. It is to be noted that in some cases data repository 210 can be local, and in other cases it can be distributed. It is to be noted that in some cases data repository 210 can be stored on a cloud-based storage.

System 200 can further comprise a network interface 220 enabling the connection of system 200 to the communication network and enabling it to send and receive data, such as models of communication channels, spatial locations of one or more agents, one or more DDCA policies, and simulated communication channels and their properties.
In some cases, the network interface 220 can be connected to a Local Area Network (LAN), to a Wide Area Network (WAN), or to the Internet. In some cases, the network interface 220 can connect to a wireless network.

System 200 further comprises processing circuitry 230. Processing circuitry 230 can be one or more processing units (e.g., central processing units), microprocessors, microcontrollers (e.g., microcontroller units (MCUs)) or any other computing devices or modules, including multiple and/or parallel and/or distributed processing units, which are adapted to independently or cooperatively process data for controlling relevant system 200 resources and for enabling operations related to system 200 resources.

The processing circuitry 230 comprises the following module: DDCA policy determination module 240.
DDCA policy determination module 240 can be configured to perform a DDCA determination process, as further detailed herein, inter alia with reference to Fig. 18.

Fig. 18 is an exemplary flowchart illustrating one example of a sequence of operations carried out for determining a Distributed Dynamic Channel Allocation (DDCA) policy, in accordance with the presently disclosed subject matter. According to certain examples of the presently disclosed subject matter, system 200 can be configured to perform a DDCA determination process 300, e.g., utilizing the DDCA policy determination module 240.

For this purpose, system 200 can be configured to obtain a multi-stepped simulation within a simulated environment, the simulated environment comprising: (i) models of the communication channels, and (ii) two or more agents, at least one given agent of the agents representing a wireless communication network with two or more members utilizing the given communication channel, the given agent having: (i) a spatial location, and (ii) a corresponding DDCA policy (block 310).

In some cases, the communication channels are non-orthogonal. In some cases, the models are radio wave propagation models. In a non-limiting example, the models of communication channels can be models of VHF and/or UHF communication channels. The number of agents can be, for example, six, each representing a communication network that communicates through the VHF and/or UHF communication channels. In some cases, there are one or more members in at least one wireless communication network associated with an agent of the agents. An example topology of these exemplary networks is depicted in Fig. 2, wherein the agents are the network managers of each network and are denoted as "Ma". The spatial location of a network can be, for example, the spatial location of the network manager of that network.

Each agent starts the multi-stepped simulation with a default DDCA policy. This default policy will change during the episodes of the multi-stepped simulation in accordance with a reward. The DDCA policy can be manifested in some cases as a DDCA ML model. The DDCA policy can be seen as a function that maps the state of the simulated environment to one or more actions. An action can be, for example, a channel to be allocated for communication by the wireless communication network associated with an agent utilizing the DDCA policy. For example, the actions can be associated with the one or more possible channels of communication (for example: k possible channels) to be chosen by the wireless communication network. In this case, an action can be to choose the allocation of channel k for communication of the wireless communication network.

The DDCA model can be, for example, a DDCA neural network. This DDCA neural network can have an input representing the state of the environment. For example, the input can be the state of one or more of the communication channels, sensed utilizing the models of the simulated environment and the spatial locations of the agents. This DDCA neural network can have one or more hidden layers and an output layer. The output layer can comprise one or more nodes. At least one node of these output nodes can represent an action. For example, the action can be: allocate a given channel of the available k channels for communication for the wireless communication network associated with the agent that utilizes this DDCA neural network.
In a non-limiting example, the DDCA neural network can have k output nodes, associated with k actions, where each action is to allocate one unique channel of the possible k channels for communication. The value calculated by the DDCA neural network for each action is the expected accumulated future reward that the agent is expected to receive by choosing that action, i.e., by choosing to allocate a given channel out of the possible k channels. The agent can choose, for example, to perform the action that receives the highest expected accumulated future reward.

Continuing the non-limiting example above, the six agents that represent six wireless communication networks can have 10 possible communication channels to choose from. The input layer in this example is the state of the communication channels sensed by the agent at each step of the multi-stepped simulation. The ten actions in the output layer of the DDCA neural network in this example can be: choose to allocate channel 1, choose to allocate channel 2, choose to allocate channel 3, choose to allocate channel 4, choose to allocate channel 5, choose to allocate channel 6, choose to allocate channel 7, choose to allocate channel 8, choose to allocate channel 9, and choose to allocate channel 10. The DDCA neural network calculates the expected accumulated future reward for each of these actions. At runtime, a communication node can utilize the DDCA neural network to choose a channel to allocate by taking the action that receives the highest expected accumulated future reward. The training data for this DDCA neural network is generated during the execution of the multi-stepped simulation. The processing circuitry can be further configured for at least one agent of the agents to input the state of the simulated environment into the DDCA policy and to allocate a channel in accordance with the action receiving the highest expected accumulated future reward from the DDCA policy.
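As a minimal sketch of the runtime decision just described: the sensed state is fed into the DDCA network, which outputs one expected accumulated future reward per channel, and the channel with the highest value is allocated. The Q-network here is a stand-in callable whose internals are an assumption; the architecture itself is discussed further below.

```python
import numpy as np

def select_channel(ddca_q_network, state):
    """Feed the sensed state into the DDCA neural network and allocate the
    channel whose action value (expected accumulated future reward) is
    highest. `ddca_q_network` is any callable mapping a state vector to a
    length-K vector of action values."""
    q_values = ddca_q_network(state)   # shape (K,): one value per channel
    return int(np.argmax(q_values))    # channel index in {0, ..., K - 1}

# Illustrative use with the K = 10 channels of the example above:
rng = np.random.default_rng(seed=0)
dummy_network = lambda s: rng.standard_normal(10)  # placeholder for a trained model
allocated = select_channel(dummy_network, state=np.zeros(20))
```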
In some cases, the processing circuitry can be further configured to train a DDCA neural network from training data generated during the execution of the multi-stepped simulation. The DDCA neural network starts at time 0 with random weights. Those weights are updated using a gradient descent algorithm during the training procedure, utilizing the training data that is generated during the execution of the multi-stepped simulation. The changes made to the DDCA policy in these cases are embodied in the training data that is generated during the execution of the multi-stepped simulation.

One optional approach for generating training data for the DDCA neural network during the execution of the multi-stepped simulation involves employing a MellowMax softmax operator on the predictions made by the DDCA neural network. Throughout the simulation, a trajectory comprising states, actions, and rewards is generated for each agent. These trajectories serve as the basis for creating training data with the objective of minimizing the loss of Eq. (14). In Eq. (14), it is evident that a MellowMax operator is applied to the DDCA neural network's prediction for the subsequent state. The MellowMax operation functions as a damping mechanism, mitigating the risk of overestimating the predictions made by the DDCA neural network.

After obtaining the multi-stepped simulation, system 200 can be further configured to execute the multi-stepped simulation, wherein in at least one step of the multi-stepped simulation, at least one given agent of the agents performs the following: (a) senses the state of one or more of the communication channels, utilizing the models of the simulated environment and the spatial locations of the agents, (b) allocates, utilizing the state of the communication channels and the corresponding DDCA policy, a given communication channel of the communication channels to be used in a next step of the multi-stepped simulation by the wireless communication network associated with the given agent, (c) receives a reward, wherein the reward is a weighted linear combination of a personal reward, associated with the quality of the given communication channel allocated by the given agent, and a social reward, associated with the personal rewards of other agents of the agents, other than the given agent, wherein a sum of the weights of the personal reward and the social reward is one, and (d) makes changes to the corresponding DDCA policy in accordance with the reward (block 320).

Generally, we store the subsequent transition (s, r, a, s'), where s, r, a, s' stand for the state, reward, action, and next state, respectively. Then, we use a bootstrap technique with the same neural network to generate a target value for estimation, using the reward and s'. Subsequently, we estimate the predicted value using the current state and action. By minimizing the difference between the estimated and predicted values, we train our neural network.

In some cases, the state of the communication channels sensed by the given agent is a Signal-to-Interference-and-Noise Ratio (SINR) of the corresponding sensed communication channel. In some cases, no information is passed between the agents during the execution of the multi-stepped simulation. In other cases, partial information can be passed between the agents during the execution of the multi-stepped simulation.
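The transition storage described above can be sketched as a simple replay memory; the (s, r, a, s') ordering follows the text, while the capacity and the class names are illustrative assumptions. The bootstrap target built from the reward and s' is sketched after the discussion of Eq. (14) below.

```python
import random
from collections import deque, namedtuple

# One stored transition, in the (s, r, a, s') order used above.
Transition = namedtuple("Transition", ["state", "reward", "action", "next_state"])

class ReplayMemory:
    """Global storage for the trajectories generated during the multi-stepped
    simulation; batches are sampled from it to optimize the DDCA network."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, reward, action, next_state):
        self.buffer.append(Transition(state, reward, action, next_state))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```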
In some cases, the other agents are neighboring agents of the given agent, being agents associated with a spatial location having a Euclidean distance below a distance threshold from the spatial location associated with the given agent.

Continuing the non-limiting example above, the multi-stepped simulation can be run through a plurality of episodes. At each episode, at least one of the agents senses the state of the communication channels, for example by obtaining the SINR of at least one of the communication channels. At least one of the agents makes a channel allocation decision for its network based on the sensed SINR of the communication channel and its current DDCA policy. The agent receives a reward for its choice and makes changes to the DDCA policy based on the reward. The changed DDCA policy will be utilized by the agent at the next episode of the multi-stepped simulation. This is how the DDCA policy changes and evolves through the multi-stepped simulation. The reward given to the agents at each episode can be a weighted linear combination of a personal reward (a reward associated with the quality of the communication channel allocated by the agent at the current episode) and a social reward (a reward associated with the personal rewards of agents other than the given agent, for example an average of the rewards given, between two consecutive steps, to agents that are neighbors of the agent, wherein the neighbors can be determined by the Euclidean distance between the agent and its neighboring agents). The sum of the weights of the personal reward and the social reward can be one.

Upon the execution of the multi-stepped simulation meeting a convergence condition at a given step, system 200 can be further configured to determine the DDCA policy to be the corresponding DDCA policy at the given step (block 330). The convergence condition can be, for example, that the deviation of the expectation of the accumulated reward for a threshold number of previous steps, previous to the given step, is below a deviation threshold. When the convergence condition is met, the multi-stepped simulation has ended and the current DDCA policy becomes the determined DDCA policy that can be used by real communication devices participating in actual communication networks.

It is to be noted that, with reference to Fig. 18, some of the blocks can be integrated into a consolidated block or can be broken down to a few blocks, and/or other blocks may be added. Furthermore, in some cases, the blocks can be performed in a different order than described herein. It is to be further noted that some of the blocks are optional. It should also be noted that whilst the flow diagram is described also with reference to the system elements that realize them, this is by no means binding, and the blocks can be performed by elements other than those described herein.

Bearing this in mind, attention is drawn to a non-limiting exemplary possible system and method for determining a DDCA policy that utilizes one possible exemplary DDCA determination algorithm, which can demonstrate exceptional performance and generalization capabilities, surpassing alternative methodologies, including Jamming Avoidance Response (JAR) and random agent approaches, for example, by an approximate margin of 14.5% and 44.5%, respectively.
Furthermore, this non-limiting exemplary algorithm exhibits only a small discrepancy of around 2.5% in performance when compared with a centralized approach founded on known graph coloring techniques.

In this non-limiting example, a scenario with N networks denoted by the set $\mathcal{N} = \{N_1, \dots, N_N\}$ is considered. The number of users in each network $N_n$ can be defined as $M_n \in \{1, 2, \dots, M_{max}\}$ for all $1 \le n \le N$. A total bandwidth (TB) can be equally divided into K equal-sized overlapping channels with bandwidth B, denoted by the set $\mathcal{K} = \{1, 2, \dots, K\}$, such that each network can operate on a single channel at each time slot. Here, each channel is indexed according to its serial number in $\mathcal{K}$, as illustrated in Figure 1.

An Egli model for radio frequency propagation, which was developed for UHF and VHF outdoor signals, can be used, for example, as a model for the communication channel. Based on the model, the Signal-to-Interference-plus-Noise Ratio (SINR) of each user at each available channel is calculated. The SINR accounts for the interference from the other networks in the environment and the thermal noise at the receiver. For instance, the SINR of users at network $N_n \in \mathcal{N}$ on some channel $k \in \mathcal{K}$ can be calculated as follows. Consider a specific user $i$ transmitting a message to user $j$, where both users belong to network $N_n$ and operate on channel $k \in \mathcal{K}$. The received SINR at the receiver of user $j$ is given by:

$$\mathrm{SINR}_{i,j,k}^{N_n} = \frac{PR_{i,j,k}^{N_n}}{I_T + I_{j,k}^{N_n}} \tag{1}$$

where $PR_{i,j,k}^{N_n}$ is the power at the receiver of user $j$ at channel $k$, $I_{j,k}^{N_n}$ is the interference at the receiver of user $j$, and $I_T$ is the thermal noise. In general, the indices in the power symbols signify the network of the users, with each user represented in the subscript, respectively. The calculation of the received power is given by:

$$PR_{i,j,k}^{N_n} = PT_{i}^{N_n} - PL_{i,j,k}^{N_n} \tag{2}$$

where $PT_{i}^{N_n}$ is the transmitter power of user $i$ (possibly a broadcast signal) in dBW, and $PL_{i,j,k}^{N_n}$ is the path loss, which obeys the Egli model and is given by:

$$PL_{i,j,k}^{N_n} = 40\log(d_{i,j}^{N_n}) - 20\log(f_k) - 20\log(h_{TX} h_{RX}) - 10\log(G_{TX} G_{RX}) \;\; [\mathrm{dB}] \tag{3}$$

where $d_{i,j}^{N_n}$ is the Euclidean distance between user $i$ and user $j$ in [m], $f_k$ is the carrier frequency of channel $k$ in [MHz], $h_{TX}$ and $h_{RX}$ are the heights of the transmitting and receiving antennas in [m], respectively, and $G_{TX}$ and $G_{RX}$ are the absolute gains of the transmitter and receiver antennas, respectively. The interference power from the other networks, $N_m \in \mathcal{N}$, $m \ne n$, at the receiver of user $j$ at channel $k$, is given by:

$$I_{j,k}^{N_n} = \sum_{m \in \{1,\dots,n-1,n+1,\dots,N\}} \Phi_{j,k}^{N_n}(m) \tag{4}$$

where

$$\Phi_{j,k}^{N_n}(m) = \sum_{i \in \{1,\dots,M_m\}} I_{i,j,k}^{N_m}$$

and $I_{i,j,k}^{N_m}$ is the interference from user $i$ of network $N_m$, computed by:

$$I_{i,j,k}^{N_m} = PR_{i,j,k}^{N_m} - A(k, \tilde{k}) \tag{5}$$

where $\tilde{k}$ is the channel used by network $N_m$, and $A(k, \tilde{k})$ is the attenuation at the channel of interest $k$ with respect to channel $\tilde{k}$. An exemplary attenuation used here is given in Table I.
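The link budget of Eqs. (1)-(3) can be sketched as follows; the signs follow Eq. (3) as reconstructed above, and the unit conversions and default gains are assumptions made for illustration.

```python
import numpy as np

def egli_path_loss_db(d_m, f_mhz, h_tx_m, h_rx_m, g_tx=1.0, g_rx=1.0):
    """Path loss of Eq. (3) in dB: d_m is the transmitter-receiver Euclidean
    distance [m], f_mhz the carrier frequency [MHz], h_* the antenna heights
    [m], and g_* the absolute antenna gains."""
    return (40.0 * np.log10(d_m)
            - 20.0 * np.log10(f_mhz)
            - 20.0 * np.log10(h_tx_m * h_rx_m)
            - 10.0 * np.log10(g_tx * g_rx))

def received_power_dbw(pt_dbw, d_m, f_mhz, h_tx_m, h_rx_m):
    """Received power of Eq. (2): transmit power [dBW] minus path loss [dB]."""
    return pt_dbw - egli_path_loss_db(d_m, f_mhz, h_tx_m, h_rx_m)

def sinr(pr_w, interference_w, thermal_noise_w):
    """SINR of Eq. (1), with all quantities in linear units [W]."""
    return pr_w / (thermal_noise_w + interference_w)

def dbw_to_watts(p_dbw):
    """Helper for moving from the dB domain of Eqs. (2)-(3) to Eq. (1)."""
    return 10.0 ** (p_dbw / 10.0)
```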
The thermal noise $I_T$ at the receiver of any network $N_n$ can be calculated as follows:

$$I_T = k_B \cdot Temp \cdot B \cdot NF \;[\mathrm{W}] = 10\log_{10}(k_B \cdot Temp \cdot B) + 10\log_{10}(NF) + 30 \;[\mathrm{dBm}] \tag{6}$$

where $k_B$, $Temp$, $B$, and $NF$ stand for the Boltzmann constant, the temperature, the bandwidth, and the noise figure, respectively. At room temperature, a bandwidth of 2 MHz, and $10\log_{10}(NF) = 6$, $I_T = -104.9$ [dBm].

Let $c_n \in \{1,\dots,K\}$ represent the operating channel of network $N_n$, and let $\overline{\mathrm{SINR}}^{N_n}$ be the average SINR of network $N_n$ operating at channel $c_n$, given by:

$$\overline{\mathrm{SINR}}_j^{N_n} = (M_n - 1)^{-1} \sum_{i=1, i \ne j}^{M_n} \mathrm{SINR}_{i,j,c_n}^{N_n}, \qquad \overline{\mathrm{SINR}}^{N_n} = M_n^{-1} \sum_{j=1}^{M_n} \overline{\mathrm{SINR}}_j^{N_n} \tag{7}$$

The objective is to maximize the average $\overline{\mathrm{SINR}}^{N_n}$ across all participating networks, under the constraint that the $\overline{\mathrm{SINR}}^{N_n}$ at each network is above the target SINR required for high quality communication, defined by $\mathrm{SINR}^*$:

$$\max_{\{c_n\}} \; \frac{1}{N} \sum_{n=1}^{N} \overline{\mathrm{SINR}}^{N_n} \quad \text{subject to} \quad \overline{\mathrm{SINR}}^{N_n} \ge \mathrm{SINR}^* \;\; \forall n \in \mathcal{N} \tag{8}$$

We further require the algorithm to have a small convergence time and a distributed implementation.
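As a quick numerical check of Eq. (6): evaluating the dBm form at room temperature with a 2 MHz bandwidth and a 6 dB noise figure reproduces the quoted thermal-noise level (the exact value depends slightly on the room temperature assumed, taken here as 290 K).

```python
import math

K_B = 1.380649e-23  # Boltzmann constant [J/K]

def thermal_noise_dbm(temp_k=290.0, bandwidth_hz=2e6, nf_db=6.0):
    """Thermal noise of Eq. (6): 10*log10(kB * Temp * B) + NF[dB] + 30 [dBm]."""
    return 10.0 * math.log10(K_B * temp_k * bandwidth_hz) + nf_db + 30.0

print(thermal_noise_dbm())  # ~ -105.0 dBm, consistent with I_T = -104.9 dBm above
```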
It can be ascertained, for example, that in instances of the simulations, a certain number of networks, denoted by N, are selected randomly, subject to the condition $N \le \lfloor 0.7 \cdot K \rfloor$. Within each network, the number of users is randomly assigned, following an exemplary uniform distribution of 2-22 users. The spatial distribution of users within each network is determined, for example, through a multivariate Gaussian distribution. The expectation vector can be defined by a 2D center point, and the covariance can be represented by a diagonal matrix with entries [50, 50] [m]. It is important to note that the current investigation primarily addresses a 2D spatial problem, but the methodology can be readily extended to encompass three dimensions. The center point for each network is determined using the procedure depicted in Algorithm 1 below.
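Algorithm 1 itself is not reproduced here; the following Python sketch only implements the sampling rules stated above (N bounded by ⌊0.7·K⌋, a uniform 2-22 users per network, and a Gaussian spread with diagonal covariance [50, 50] around each center point). The uniform placement of the center points within a square area is an assumption standing in for the actual center-point procedure of Algorithm 1.

```python
import numpy as np

def generate_scenario(k_channels=10, area_m=2000.0, seed=None):
    """Draw a random scenario following the stated sampling rules.

    Returns a list of networks, each a dict holding the 2D center point
    and the users' 2D positions."""
    rng = np.random.default_rng(seed)
    n_networks = int(rng.integers(2, int(0.7 * k_channels) + 1))  # N <= floor(0.7K)
    networks = []
    for _ in range(n_networks):
        n_users = int(rng.integers(2, 23))            # uniform over 2..22 users
        center = rng.uniform(0.0, area_m, size=2)     # assumed placement rule
        users = rng.multivariate_normal(center, np.diag([50.0, 50.0]), size=n_users)
        networks.append({"center": center, "users": users})
    return networks
```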
In the context of the simulation, the hypothesis that each network comprises a specific user designated as a "network manager" is adopted. The network manager is chosen to be the user within the network whose total Euclidean distance from all other users in the same network is minimized. In precise terms, this user is selected based on their ability to maintain the shortest average distance to all other users within the confines of the given network. The role of the network manager is crucial, as they assume responsibility for gathering information from the other users within their network and disseminating decisions as feedback. This feedback mechanism facilitates effective communication and coordination among users, enabling the network manager to act as a central point of control and information exchange within their respective network by using a dedicated control channel. By fulfilling this pivotal function, the network manager contributes to optimizing the overall efficacy of the network's operations. Figure 2 illustrates an exemplary representative scenario, showcasing a possible configuration of networks, each accompanied by its designated network manager.

Following the spatial localization of the networks and the identification of their respective managers, a random sequence that remains constant throughout the scenario is generated. This random sequence determines the order in which the networks will execute their actions, adhering to a cyclical pattern. Specifically, the sequence dictates which network will act first and which will follow, as the scenario progresses.

In this non-limiting example, the channel allocation problem can be formulated as a Multi-Agent Reinforcement Learning (MARL) framework, where each network manager acts as an autonomous agent within the global framework. The objective of each agent is to optimize the channel quality for its respective network, subject to meeting a specified quality criterion and minimizing spectrum mobility until convergence. In line with standard RL paradigms, a well-defined environment comprising observation, action, and reward components is crucial. The environment has been previously characterized and is described in detail herein. By adhering to these requirements, the MARL framework enables the network managers (agents) to iteratively learn and adapt their channel allocation strategies, striving for improved channel quality and reduced spectrum mobility until the system converges to a desirable solution. To achieve these objectives, it is imperative to address and define the mentioned requirements in detail.

In this non-limiting example, the observation space can represent a fusion of the SINR measurements of all users within the network. Prior to the action decided by the network manager, each user transmits their individual SINR measurements for all available channels via a dedicated control channel to the network manager. This vector of SINR measurements possesses a dimension of $\mathbb{R}^K$, where K denotes the number of channels. The network manager, having also conducted SINR sensing for all channels, receives unique SINR measurement vectors from all users in the network. These individual vectors can then be concatenated to form a SINR matrix with dimensions $\mathbb{R}^{M_n \times K}$, where $M_n$ corresponds to the total number of users in the network $N_n$, and $\mathrm{SINR}_{j,k}^{N_n}$ is the entry at row $j$ and column $k$ of the matrix. Similarly, a binary matrix $\mathrm{BSINR}^{N_n}$ is defined, where each entry equals

$$\mathrm{BSINR}_{j,k}^{N_n} = \mathbb{1}(\mathrm{SINR}_{j,k}^{N_n} > \mathrm{SINR}^*) \tag{9}$$

where $\mathbb{1}$ is the indicator function, equal to 1 if $\mathrm{SINR}_{j,k}^{N_n} > \mathrm{SINR}^*$, and zero otherwise. Next, the manager performs an averaging operation on the $\mathrm{BSINR}^{N_n}$ matrix along the user axis to mitigate the influence of the number of users in the network, given by:

$$QV_{N_n}(k) = M_n^{-1} \sum_{j=1}^{M_n} \mathrm{BSINR}_{j,k}^{N_n} \tag{10}$$

This averaging process results in a vector representing the average number of users experiencing sufficient communication quality. This vector is referred to as the Quality Vector ($QV_{N_n} \in \mathbb{R}^K$), and it serves as the primary observation during the simulation.

In the above non-limiting example, the action space refers to the ensemble of permissible actions available to the agent at each decision point. In the present context, the action space comprises all potential frequency channels and is denoted by $\mathcal{K}$, such that at each time step $t$ the action of network $N_n$ is defined as $a_n(t) \in \mathcal{K}$. The primary objective of all agents in this context is to solve a combinatorial problem, specifically, establishing a one-to-one mapping between each network and a channel. The aim is to achieve a state where, after convergence (where each network remains connected to the same chosen channel until the completion of the scenario) or at the scenario's end, the total mean quality of the selected channels across all networks is maximized, as described above in Eq. (8).

To address this objective, a bifurcated reward system is adopted, comprising two distinct components. The first element pertains to the personal reward, denoted as $r_p$. Its magnitude is determined by the $QV_{N_n}$ entry associated with the channel selected by the agent, evaluated prior to the selection cycle of all other networks at time step $t$. In essence, $r_p$ reflects a greedy mechanism aimed at selecting high quality channels. Additionally, the personal reward is increased by a constant value if the action remains unchanged. The reward computation is detailed in exemplary Algorithm 2 below. The second component of the reward mechanism is designated as the "social welfare reward", denoted as $r_{sw}$. This reward takes into consideration the implications of a network's actions on the individual rewards of its neighboring networks during the interval between two consecutive decision-making instances. Specifically, a network is categorized as a neighboring network if the Euclidean Distance (ED) between the network center points is less than a predetermined threshold, represented here as $\tau$ meters. In essence, $r_{sw}$ captures the collective welfare and interdependence among the networks, accounting for the influence exerted by the behavior of each network on the rewards of others over successive iterations. The social welfare reward is calculated as the arithmetic mean of the personal rewards of all neighboring networks. The pseudocode for the social welfare reward is outlined in Algorithm A.1 below.
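The manager selection rule and the observation construction of Eqs. (9)-(10) can be sketched as follows. Algorithms 2 and A.1 themselves are not reproduced above, so only the quality-vector computation and the manager choice are shown, with names chosen for illustration.

```python
import numpy as np

def select_network_manager(user_positions):
    """The network manager is the user whose total Euclidean distance to all
    other users in the same network is minimal."""
    diffs = user_positions[:, None, :] - user_positions[None, :, :]
    total_distance = np.linalg.norm(diffs, axis=-1).sum(axis=1)
    return int(np.argmin(total_distance))

def quality_vector(sinr_matrix, sinr_target):
    """Eqs. (9)-(10): binarize the (M_n x K) SINR matrix gathered over the
    control channel against SINR*, then average along the user axis.
    Returns QV in R^K, the fraction of users with sufficient quality on
    each channel -- the agent's primary observation."""
    bsinr = (sinr_matrix > sinr_target).astype(float)  # Eq. (9)
    return bsinr.mean(axis=0)                          # Eq. (10)
```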
Finally, the overall reward is a linear combination of the two:

$r = \rho\, r_p + (1 - \rho)\, r_{sw}$ (11)

where $\rho \in [0,1]$ is a design parameter. In this non-limiting example, we utilize a multi-agent paradigm known as Centralized Training with Decentralized Execution (CTDE). At the beginning of each episode, each individual agent, i.e., each network manager, is assigned an exclusive serial number. This serial number determines the timing of its actions, effectively simulating the delay times characteristic of real-world scenarios. This technique is strategically employed to circumvent potential collisions of decisions among different networks and to mitigate the overall dynamic complexity of the scenario. To ensure a consistent and balanced decision-making process, the scenario horizon is explicitly defined as $M$ times the total number of participating networks within the given scenario. Consequently, this arrangement allows each network to possess precisely $M$ decision points per scenario, guaranteeing equitable opportunities for each participant to influence the outcome of the scenario.
The resource allocation related to channel frequency selection during the training phase, executed by each agent, obeys the following prescribed methodology. Specifically, with a probability denoted $\epsilon_t$, the agents draw samples from the distribution in Eq. (12), which was first introduced in the Exp3 algorithm for the non-stochastic multi-armed bandit problem; otherwise, the agents follow a greedy policy:

$\Pr\left(a_n(t) = k \mid s_n(t)\right) = (1-\alpha)\,\frac{e^{\beta Q(s_n(t),k;\theta)}}{\sum_{k' \in \{1,\dots,K\}} e^{\beta Q(s_n(t),k';\theta)}} + \frac{\alpha}{K}$ (12)

Here $s_n(t)$ stands for the state of network $N_n$ at time step $t$, $Q(s_n(t),k;\theta)$ stands for the action value estimated by the neural network with weights $\theta$, whose architecture is described below, and $\alpha$ and $\beta$ are constants. Additionally, during the training period comprising $B$ episodes, we annealed $\epsilon_t$ from 0.5 to 0.01 after each of the first $B/2$ episodes, and then maintained it at a steady value for the remaining half, i.e., the latter $B/2$ episodes. Furthermore, an additional step involves applying a masking technique over the $Q(s_n(t),k)$ values, based on the $QV^n$ value of each channel $k$. For any channel with $QV^n(k) = 0$, the probability of selecting that channel is masked to 0, while in the event of $QV^n = \vec{0}$, no action selection is performed and the previous action is preserved. This technique serves the purpose of preventing collisions among different networks and can be expressed as follows:

$Q(s_n(t),k;\theta) = \begin{cases} -\infty, & \text{if } QV^n(k) = 0 \\ Q(s_n(t),k;\theta), & \text{otherwise} \end{cases}$ (13)

Next, as is inherent in all Deep-RL value-based algorithms, our principal objective is to find an estimating function, denoted $Q_\theta$, which approximates the action value $Q(s_n(t),a_n;\theta)$. This function maps any given state-action pair $(s,a)$ to a real number in $\mathbb{R}$. The optimization of $Q_\theta$ is commonly achieved through gradient descent over a differentiable loss function. In this context, we employ the Huber loss to compute our loss value, given by:

$L_\delta(e) = \begin{cases} \frac{1}{2}e^2, & \text{if } |e| \le \delta \\ \delta \cdot \left(|e| - \frac{\delta}{2}\right), & \text{otherwise} \end{cases}$ (14)

where $e_t = r_n(t) + \gamma\, \mathrm{mm}_{\omega,\, a' \in \mathcal{K}}\, Q(s_n(t+1),a';\theta) - Q(s_n(t),a_n(t);\theta)$, and $\mathrm{mm}_\omega$ is a smooth maximum operator, named the Mellowmax operator, with temperature parameter $\omega$. Generally, $e$ is the error between the target and predicted values, and $\delta$ is a discrimination hyper-parameter threshold.
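The following non-limiting Python sketch combines the exploration rule of Eq. (12) with the masking of Eq. (13). How the uniform $\alpha/K$ term interacts with masked channels is not fully specified in the text, so zeroing the masked probabilities and renormalizing is an assumption of this sketch.

```python
import numpy as np

def select_action(q_values: np.ndarray, qv: np.ndarray, prev_action: int,
                  alpha: float, beta: float, eps_t: float,
                  rng: np.random.Generator) -> int:
    """Eqs. (12)-(13): epsilon-gated Exp3-style sampling over masked
    action values; keeps the previous channel when QV is all zeros."""
    if not np.any(qv > 0):
        return prev_action                       # QV == 0-vector: no selection
    q = np.where(qv > 0, q_values, -np.inf)      # Eq. (13): mask dead channels
    if rng.random() >= eps_t:
        return int(np.argmax(q))                 # greedy branch
    k = q.shape[0]
    logits = beta * q - np.max(beta * q)         # stabilize the exponentials
    w = np.exp(logits)                           # exp(-inf) -> 0 for masked k
    p = (1.0 - alpha) * w / w.sum() + alpha / k  # Eq. (12)
    p[qv == 0] = 0.0                             # assumed: mask uniform term too
    p /= p.sum()
    return int(rng.choice(k, p=p))
```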
Subsequently, with the aid of these estimated action values, the learned policy can be derived. To enable this process, we formally define the input state as follows:

$s_n(t) = \mathrm{concatenate}\left(\mathrm{CBR},\, QV^n_t\right)$ (15)

Here, the term CBR, representing Channel Binary Representation, refers to the channel that the network utilizes before executing the action; in other words, it is a binary representation of the preceding action, if that action was fulfilled. The term $QV^n_t$ stands for the Quality Vector of network $N_n$ at time step $t$. The neural network architecture can consist, for example, of three dense layers with skip connections. The hidden layers can use, for example, the Leaky-ReLU activation function with a slope of 0.2, while the output layer can employ the identity activation function. Furthermore, to initialize the weights of the neural network, the Glorot-Uniform method was applied, and the biases were initialized to zero.
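A hedged, non-limiting PyTorch sketch of such a network follows; the layer widths and the exact placement of the skip connection are illustrative assumptions, since the text specifies only the layer count, activations, and initialization.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Three dense layers with a skip connection, LeakyReLU(0.2) hidden
    activations, identity output, Glorot-uniform weights, zero biases."""
    def __init__(self, state_dim: int, n_channels: int, hidden: int = 128):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.out = nn.Linear(hidden, n_channels)
        self.act = nn.LeakyReLU(0.2)
        for layer in (self.fc1, self.fc2, self.out):
            nn.init.xavier_uniform_(layer.weight)  # Glorot-uniform init
            nn.init.zeros_(layer.bias)             # zero-initialized biases

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        h1 = self.act(self.fc1(s))
        h2 = self.act(self.fc2(h1)) + h1           # skip connection
        return self.out(h2)                        # identity output layer
```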
The training procedure unfolded across $B$ scenarios as follows. First, within each main training loop, we uniformly sampled the number of participating networks, $N \sim U(2, \lfloor 0.7K \rfloor)$. Subsequently, we generated a scenario following the rules outlined in Algorithm 1. For each network, we then randomly assigned a network number $n \in \{1,\dots,N\}$ to define $N_n$ and allocated a unique replay memory ($RM_n$). Next, we iterated over $T$ time steps for each scenario, where at each step we executed actions for each network sequentially, corresponding to their assigned network numbers, and stored the relevant information in the respective unique replay buffer. Upon scenario completion, we transferred the data from each $RM_n$ into a global replay memory (GRM). Subsequently, we uniformly sampled from the GRM and trained the neural network by minimizing Eq. (14) for a fixed number of update steps. The training procedure is illustrated in exemplary Algorithm 3 below, providing a comprehensive outline of the steps involved. For a clear overview of the hyperparameters utilized during the training process, please refer to Table II.
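The per-episode data flow can be sketched as follows; the container types, the batch size, and the `update_fn` gradient-step callback are placeholders for details given in Algorithm 3 and Table II, and are assumptions of this non-limiting illustration.

```python
import random

def end_of_episode_update(per_network_memories, grm, batch_size,
                          n_updates, update_fn):
    """Flush each per-network replay memory RM_n into the global replay
    memory (GRM), then train on uniform samples from the GRM."""
    for rm in per_network_memories:
        grm.extend(rm)                  # transfer transitions to the GRM
        rm.clear()
    for _ in range(n_updates):
        batch = random.sample(list(grm), min(batch_size, len(grm)))
        update_fn(batch)                # one gradient step on Eq. (14)
```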
The above-described non-limiting exemplary algorithm can be subjected to two distinct training scenarios: one with the action-masking approach referred to in Eq. (13), and one without it. As anticipated, the utilization of masking during action selection leads to higher accumulated rewards during the initial phases of training (Episodes ≤ 400). This improvement arises from the reduction of unnecessary exploration, as depicted in Figure 3. However, as training progresses and convergence is achieved (Episodes ≥ 500), both techniques yield comparable results. It is important to note that both approaches remain valid when the messages exchanged between the network users and the network manager are transmitted through a control channel, which is assigned and freed for communication at any given time. On the other hand, in situations where no dedicated control channel is available, the utilization of masking becomes essential to ensure the existence of a functional communication link between the users and the network manager; it guarantees that at least some form of connection facilitating communication and data exchange exists. In addition, the value of the masking approach not only depends on the presence or absence of a control channel, but also lies in speeding up convergence when the action space is large. Similarly, consistent convergence results were observed for diverse Mellowmax temperature parameters (ω), as illustrated in Figure 16.
Turning to the physical results obtained during the training process, it is evident that the proposed algorithm exhibits noteworthy enhancements in the mean, median, and minimum values of the channel quality. The improvement of these values, as defined in Eq. (16), is shown in Figure 4. Furthermore, upon convergence, it is conspicuous that the agents have acquired the ability to cooperate effectively, resulting in a scenario where the lowest channel quality ($CQ_{min}$) among the agents in the game remains above 0.95 (i.e., 95%).

$CQ = \left[QV^1(a_1),\dots,QV^N(a_N)\right]$ (16)
$\overline{CQ} = E[CQ], \quad \widetilde{CQ} = \mathrm{median}(CQ), \quad CQ_{min} = \min(CQ)$

Similarly, we have undertaken an exploration of the advancements observed during training concerning supplementary physical parameters. The first additional parameter is the average number of channel changes (ANCC), defined as the average number of channel changes per network until convergence. The ANCC score (ANCCS) is given by:

$\mathrm{ANCCS} = 1 - \frac{\mathrm{ANCC}}{M}$ (17)

The second parameter is the convergence time (CT), which refers to the time of the last channel change made by one of the networks during the scenario. The convergence time score (CTS) can be calculated as follows:

$\mathrm{CTS} = 1 - \frac{\mathrm{CT}}{T}$ (18)

The third parameter, denoted spectrum efficiency (SE), is intended to assess the spectral distortion resulting from the utilization of the networks after convergence or at the end of the episode. A higher SE value can indicate that the integration of a new network into the group becomes an easier task, by providing more channels of high quality to operate on, along with causing minimal distortion to the converged solution achieved by the earlier participants. The SE score (SES) can be computed as follows:

$\mathrm{SES} = E[\Psi] = \frac{1}{N}\sum_{n=1}^{N}\Psi_n, \qquad \Psi_n = \frac{\left(\sum_{k=1}^{K} QV^n(k)\right)^{0.5}}{K^{0.5}}$ (19)

The last parameter is a linearly weighted score (WS) of the parameters of interest, given by:

$\mathrm{WS} = w_1 \cdot \overline{CQ} + w_2 \cdot \mathrm{ANCCS} + w_3 \cdot \mathrm{CTS} + w_4 \cdot \mathrm{SES}$ (20)

where $w_1, w_2, w_3, w_4$ are 0.4, 0.1, 0.4, 0.1, respectively. As depicted in Figure 5, all parameter scores, except SE, demonstrate improvement. These results align with our reward function, as specified in Eq. (11), wherein the impact of the CT and ANCC on the reward value is given in line 10 of Algorithm 2. Moreover, in view of the absence of a direct motivation to enhance the SE score, the algorithm possesses the freedom to establish an SE score that harmonizes with the converged policy, resulting in an average SE score of approximately 0.8. Furthermore, we conducted an analysis of the influence of the personal reward weight, denoted ρ in Eq. (11), on the algorithm's physical behavior. All options were tested over 30 games for each scenario, where the scenarios differ in the number of participating networks, ranging from 2 to 15, summing to 420 games in total. Although the training process was limited to scenarios involving a maximum of 7 networks, our examination also evaluates the algorithm's generalization capabilities across scenarios with a larger number of participating networks. It was observed that, with respect to the WS metric depicted in Figure 6, higher values of ρ demonstrated superior performance and enhanced generality. Conversely, in the context of $\overline{CQ}$ and $CQ_{min}$, lower values of ρ exhibited better performance, as evidenced by the expectation of the mean value displayed in Figure 7.
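For concreteness, the following non-limiting sketch evaluates Eqs. (16)-(20); the normalizers of ANCCS and CTS (the per-network decision count and the scenario horizon) reflect our reading of the garbled source and are assumptions, as are the argument names.

```python
import numpy as np

def weighted_score(cq, ancc, ct, qv_all, m_decisions, horizon,
                   w=(0.4, 0.1, 0.4, 0.1)):
    """Eqs. (16)-(20): combine mean channel quality, the ANCC score,
    the convergence-time score, and the spectrum-efficiency score.
    cq      : per-network channel qualities, Eq. (16)
    ancc    : average number of channel changes per network
    ct      : time of the last channel change in the scenario
    qv_all  : (N, K) matrix of final Quality Vectors, used for SES"""
    cq_mean = np.asarray(cq, dtype=float).mean()
    anccs = 1.0 - ancc / m_decisions                  # Eq. (17)
    cts = 1.0 - ct / horizon                          # Eq. (18)
    psi = np.sqrt(qv_all.sum(axis=1) / qv_all.shape[1])
    ses = psi.mean()                                  # Eq. (19)
    return w[0]*cq_mean + w[1]*anccs + w[2]*cts + w[3]*ses  # Eq. (20)
```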
This disparity in results can be attributed to the fact that higher values of ρ lead to more aggressive policies, resulting in quicker convergence, whereas lower values facilitate increased cooperation and fairness among the participating networks, leading to a slower but more stable mechanism. An exemplary value of ρ = 0.7 was found to effectively strike a balance between these two contrasting aspects, rendering it an optimal selection. In the context of cooperative missions, this equilibrium point is particularly noteworthy, as it arises from the operational constraints of decentralized agents without information-exchange capabilities, effectively achieving equilibrium between self-interest and social responsibility.
An analysis was conducted of the influence of the designated threshold parameter Γ, which serves to establish the criteria for considering networks as neighbors. To briefly recapitulate, if the ED between the central points of two networks falls below the threshold Γ, both networks are deemed neighbors. These neighbor networks influence each other's rewards through the term $r_{sw}$, as delineated in Eq. (11) and elaborated in Algorithm A.1. In the scenario where Γ approaches zero (Γ → 0 [m]), each network progressively exhibits more self-centered behavior, prioritizing its individual performance exclusively. Conversely, when Γ tends towards infinity (Γ → ∞), each network gradually adopts a more "socialistic" stance, with its reward encompassing a broader spectrum of network performances, including networks on whose QVs it has no direct impact. For example, when the Euclidean Distance (ED) between the central points equals 700 [m], there exists a 3% probability that the two networks will exert an influence on each other's QVs. Concerning WS performance, designating a network as a neighbor when the distance falls below an exemplary distance of 500 meters (corresponding to a 50% probability that both networks influence each other's QVs) within the environmental framework (expounded below) yielded the highest average performance across all tested scenarios, as depicted in Figure 8. On the contrary, when we shift our attention to the expectation value of $[(\overline{CQ} + CQ_{min})/2]$, it becomes evident that Γ = 400 meters (representing an 86% probability that the networks impact each other's QVs) yields the most favorable average value across the entire domain under examination, as illustrated in Figure 9. These findings underscore the notion that, in order to achieve a high level of social performance encompassing all participants within a given scenario, a network should only consider interactions with those networks that have a high probability of interaction. This concept not only benefits individual performance but also augments the performance of the entire system. Conversely, it is evident that adopting and learning a selfish policy (Γ = 0) exhibits the least favorable performance across all scenarios examined. It is noteworthy that, when taking the average of both parameters into account, Γ = 500 meters stands out as an exemplary possible choice. Subsequently, we sought to further enhance the performance of our proposed algorithm by implementing a post-processing step in the decision-making mechanism. Given the algorithm's generalization across diverse and extensive domains involving varying numbers of participating networks, certain instances of unnecessary channel switches were observed.
For example, in a 2-network scenario, the algorithm would switch from a channel with a CQ of 1.0 to another channel with the same CQ value, even when this move had no impact on the other participant. Such occurrences are a result of the algorithm's generalization mechanism, which has learned to create sparsity and adopt policies that aim to provide favorable performance in numerous situations. To address this, we introduced a post-processing step as follows: if the CQ of the desired channel is not greater than the CQ of the current channel by at least φ%, the algorithm maintains its current channel, effectively abstaining from unnecessary changes. However, if the difference in CQ values exceeds the specified threshold, the algorithm follows the proposed action and executes the channel switch accordingly. The baseline algorithm, without post-processing, is denoted by φ = None. Furthermore, in multi-agent systems with a high number of members, it can be challenging to account for every possible scenario; consequently, the agents tend to learn policies that exhibit bias toward the scenarios within the training set. Applying a threshold based on channel quality, serving as a post-processing block, can therefore reduce unnecessary spectrum mobility. To illustrate, consider a scenario in which a single network, initially operating on a randomly selected channel, transitions to an alternative channel without any rational reason, a behavior likely stemming from the biased policy learned by the agents.
The findings illustrate that the weighted score increases as the threshold value φ increases, as evident from the observations in Figure 10. Conversely, there is a decrease in the channel quality, as represented in Figure 11. This indicates that the inclusion of the threshold significantly expedites the convergence time, albeit at the cost of a poorer solution with respect to channel quality. The selection of the optimal threshold value is contingent upon two crucial factors: the size of the network, where a switch to another channel can be expensive in terms of communication operations, and the quality threshold set by the user. These considerations emphasize the significance of setting a thoughtful threshold to achieve a trade-off between convergence speed and channel quality. The introduction of a post-decision mechanism, as an optional modification, holds promise for overall performance enhancement, thereby warranting further investigation and experimentation to assess its potential benefits fully.
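A minimal, non-limiting sketch of this post-processing rule follows; interpreting the φ% margin as a relative improvement over the current channel's CQ is an assumption, and `phi=None` disables the step as in the baseline.

```python
def post_process(current_channel: int, proposed_channel: int,
                 qv, phi) -> int:
    """Keep the current channel unless the proposed channel improves the
    channel quality (CQ) by at least phi percent; phi=None reproduces
    the baseline behavior with no post-processing."""
    if phi is None or proposed_channel == current_channel:
        return proposed_channel
    if qv[proposed_channel] > qv[current_channel] * (1.0 + phi / 100.0):
        return proposed_channel     # improvement exceeds threshold: switch
    return current_channel          # abstain from an unnecessary change
```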
A comparative analysis was conducted to assess the performance of the proposed algorithm in relation to other well-known algorithms. To establish an upper bound, we considered an algorithm based on graph coloring, which operates in a centralized manner and is not suitable for distributed implementations. Furthermore, we evaluated our algorithm against alternative approaches, including Jammer Avoidance Response (JAR), which switches to nearby channels (±2 MHz) only if doing so can improve the channel quality by a margin of at least 0.05, and random static initialization, wherein the initial channel is set randomly and no further channel switches are made. The comparison with these alternative algorithms allowed us to gain insights into the effectiveness and advantages of our proposed approach. In a manner analogous to the comparison involving ρ, all algorithms underwent testing across the identical set of 420 games. The obtained results demonstrated the superior performance of our proposed algorithm in terms of both the WS metric and the expectation of $[(\overline{CQ} + CQ_{min})/2]$, as illustrated in Figure 12 and Figure 13, respectively. Nevertheless, as shown in Figure 14, the convergence time score is lower for the proposed algorithm without post-processing (φ = None), indicating that achieving superior performance in channel quality required a more comprehensive interaction among the network participants than the alternative algorithms, which translates into a longer convergence time. Furthermore, the proposed decentralized algorithm displayed a marginal difference of 2% compared to the centralized graph-coloring approach in the in-sample domain (i.e., #networks ≤ 7), underscoring its efficacy in distributed systems, as illustrated in Figure 15. Finally, the obtained results unequivocally demonstrate the exceptional performance of the proposed algorithm even in scenarios on which it was not explicitly trained (i.e., #networks > 7, named out-of-sample), thereby highlighting its remarkable generality. This remarkable adaptability can be attributed to the underlying concept that scenarios with a high number of networks in certain constellations can be effectively represented by combinations of multiple scenarios involving a lower number of networks. Consequently, each agent cooperates not with all networks but rather selectively with a subset, likely comprising fewer than 7 networks, which are considered its neighboring networks. This premise elucidates the compelling rationale behind the observed excellent generality. Figure 16 exhibits the impact of the Mellowmax temperature parameter on the training procedure and convergence. The pseudocode for the social welfare reward is given in exemplary Algorithm A.1 below.
It is to be understood that the presently disclosed subject matter is not limited in its application to the details set forth in the description contained herein or illustrated in the drawings. The presently disclosed subject matter is capable of other embodiments and of being practiced and carried out in various ways. Hence, it is to be understood that the phraseology and terminology employed herein are for the purpose of description and should not be regarded as limiting. As such, those skilled in the art will appreciate that the conception upon which this disclosure is based may readily be utilized as a basis for designing other structures, methods, and systems for carrying out the several purposes of the presently disclosed subject matter. It will also be understood that the system according to the presently disclosed subject matter can be implemented, at least partly, as a suitably programmed computer. Likewise, the presently disclosed subject matter contemplates a computer program being readable by a computer for executing the disclosed method. The presently disclosed subject matter further contemplates a machine-readable memory tangibly embodying a program of instructions executable by the machine for executing the disclosed method.

Claims (17)

CLAIMS:
1. A system for determining a Distributed Dynamic Channel Allocation (DDCA) policy allowing members of a wireless communication network that communicate through communication channels, to decide on a given communication channel of the communication channels to allocate for communication, the system comprising a processing circuitry configured to: obtain a multi-stepped simulation within a simulated environment, the simulated environment comprising: (i) models of the communication channels, and (ii) two or more agents, at least one given agent of the agents representing a wireless communication network with two or more members utilizing the given communication channel, the given agent having: (i) a spatial location, and (ii) a corresponding DDCA policy; execute the multi-stepped simulation, wherein at each step of the multi-stepped simulation, at least one given agent of the agents: (a) senses a state of the communication channels, utilizing the models of the simulated environment and the spatial locations of the agents, (b) allocates, utilizing the state of the communication channels and the corresponding DDCA policy, a given communication channel of the communication channels to be used in a next step of the multi-stepped simulation by the wireless communication network associated with the given agent, (c) receives a reward, the reward being a weighted linear combination of a personal reward, associated with a quality of the given communication channel allocated by the given agent, and a social reward, associated with the personal rewards of other agents of the agents, other than the given agent, wherein a sum of the weights of the personal reward and the social reward is one, and (d) makes changes to the corresponding DDCA policy in accordance with the reward; and upon the execution of the multi-stepped simulation meeting a convergence condition at a given step, determine the DDCA policy to be the corresponding DDCA policy at the given step.
2. The system of claim 1, wherein the convergence condition is that a deviation of an expectation of an accumulated reward for a threshold number of previous steps, previous to the given step, is below a deviation threshold.
3. The system of claim 1, wherein the other agents are neighboring agents of the given agent, being agents associated with a spatial location having a Euclidean distance below a distance threshold from the spatial location associated with the given agent.
4. The system of claim 1, wherein no information is passed between the agents during the execution of the multi-stepped simulation.
5. The system of claim 1, wherein the communication channels are non-orthogonal.
6. The system of claim 1, wherein the models are radio waves propagation models.
7. The system of claim 1, wherein the state of the communication channels sensed by the given agent is a Signal-to-Interference-and-Noise Ratio (SINR) of the corresponding sensed communication channel.
8. A system for allocating communication channels in a wireless communication network, wherein at least one member of the wireless communication network utilizes the DDCA policy of claim 1 to decide on a given communication channel of the communication channels to allocate for communication with at least one member of the members of the wireless communication network.
9. A method for determining a Distributed Dynamic Channel Allocation (DDCA) policy allowing members of a wireless communication network that communicate through communication channels, to decide on a given communication channel of the communication channels to allocate for communication, the method comprising: obtaining, by a processing circuitry, a multi-stepped simulation within a simulated environment, the simulated environment comprising: (i) models of the communication channels, and (ii) two or more agents, at least one given agent of the agents representing a wireless communication network with two or more members utilizing the given communication channel, the given agent having: (i) a spatial location, and (ii) a corresponding DDCA policy; executing, by the processing circuitry, the multi-stepped simulation, wherein at each step of the multi-stepped simulation, at least one given agent of the agents: (a) senses a state of the communication channels, utilizing the models of the simulated environment and the spatial locations of the agents, (b) allocates, utilizing the state of the communication channels and the corresponding DDCA policy, a given communication channel of the communication channels to be used in a next step of the multi-stepped simulation by the wireless communication network associated with the given agent, (c) receives a reward, the reward being a weighted linear combination of a personal reward, associated with a quality of the given communication channel allocated by the given agent, and a social reward, associated with the personal rewards of other agents of the agents, other than the given agent, wherein a sum of the weights of the personal reward and the social reward is one, and (d) makes changes to the corresponding DDCA policy in accordance with the reward; and upon the execution of the multi-stepped simulation meeting a convergence condition at a given step, determining, by the processing circuitry, the DDCA policy to be the corresponding DDCA policy at the given step.
10. The method of claim 9, wherein the convergence condition is that a deviation of an expectation of an accumulated reward for a threshold number of previous steps, previous to the given step, is below a deviation threshold.
11. The method of claim 9, wherein the other agents are neighboring agents of the given agent, being agents associated with a spatial location having a Euclidean distance below a distance threshold from the spatial location associated with the given agent.
12. The method of claim 9, wherein no information is passed between the agents during the execution of the multi-stepped simulation.
13. The method of claim 9, wherein the communication channels are non-orthogonal.
14. The method of claim 9, wherein the models are radio waves propagation models.
15. The method of claim 9, wherein the state of the communication channels sensed by the given agent is a Signal-to-Interference-and-Noise Ratio (SINR) of the corresponding sensed communication channel.
16. A method for allocating communication channels in a wireless communication network, wherein at least one member of the wireless communication network utilizes the DDCA policy of claim 9 to decide on a given communication channel of the communication channels to allocate for communication with at least one member of the members of the wireless communication network.
17. A non-transitory computer readable storage medium having computer readable program code embodied therewith, the computer readable program code, executable by at least one processor to perform a method for determining a Distributed Dynamic Channel Allocation (DDCA) policy allowing members of a wireless communication network that communicate through communication channels, to decide on a given communication channel of the communication channels to allocate for communication, the method comprising: obtaining, by a processing circuitry, a multi-stepped simulation within a simulated environment, the simulated environment comprising: (i) models of the communication channels, and (ii) two or more agents, at least one given agent of the agents representing a wireless communication network with two or more members utilizing the given communication channel, the given agent having: (i) a spatial location, and (ii) a corresponding DDCA policy; executing, by the processing circuitry, the multi-stepped simulation, wherein at each step of the multi-stepped simulation, at least one given agent of the agents: (a) senses a state of the communication channels, utilizing the models of the simulated environment and the spatial locations of the agents, (b) allocates, utilizing the state of the communication channels and the corresponding DDCA policy, a given communication channel of the communication channels to be used in a next step of the multi-stepped simulation by the wireless communication network associated with the given agent, (c) receives a reward, the reward being a weighted linear combination of a personal reward, associated with a quality of the given communication channel allocated by the given agent, and a social reward, associated with the personal rewards of other agents of the agents, other than the given agent, wherein a sum of the weights of the personal reward and the social reward is one, and (d) makes changes to the corresponding DDCA policy in accordance with the reward; and upon the execution of the multi-stepped simulation meeting a convergence condition at a given step, determining, by the processing circuitry, the DDCA policy to be the corresponding DDCA policy at the given step. For the Applicant: S.J. Intellectual Property Ltd. By: Avi Jencmen Advocate, Patent Attorney
IL310753A 2024-02-08 2024-02-08 A system and method for determininig a distributed dynamic channel allocation (ddca) policy IL310753B2 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
IL310753A IL310753B2 (en) 2024-02-08 2024-02-08 A system and method for determininig a distributed dynamic channel allocation (ddca) policy
PCT/IL2025/050124 WO2025169194A1 (en) 2024-02-08 2025-02-06 A system and method for determining a distributed dynamic channel allocation (ddca) policy

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
IL310753A IL310753B2 (en) 2024-02-08 2024-02-08 A system and method for determininig a distributed dynamic channel allocation (ddca) policy

Publications (3)

Publication Number Publication Date
IL310753A IL310753A (en) 2024-04-01
IL310753B1 IL310753B1 (en) 2025-01-01
IL310753B2 true IL310753B2 (en) 2025-05-01

Family

ID=94170964

Family Applications (1)

Application Number Title Priority Date Filing Date
IL310753A IL310753B2 (en) 2024-02-08 2024-02-08 A system and method for determininig a distributed dynamic channel allocation (ddca) policy

Country Status (2)

Country Link
IL (1) IL310753B2 (en)
WO (1) WO2025169194A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2408226A1 (en) * 2009-03-09 2012-01-18 Alcatel Lucent Optimization method for channel assignment and optimization equipment for channel assignment
US20220217792A1 (en) * 2020-05-09 2022-07-07 Shenyang Institute Of Automation, Chinese Academy Of Sciences Industrial 5G dynamic multi-priority multi-access method based on deep reinforcement learning
CN113254197A (en) * 2021-04-30 2021-08-13 Xidian University Network resource scheduling method and system based on deep reinforcement learning
CN116456480A (en) * 2023-04-20 2023-07-18 Southeast University Multi-agent collaborative decision-making method based on deep reinforcement learning under communication resource limitation
CN117119613A (en) * 2023-08-25 2023-11-24 Defense Innovation Institute, Academy of Military Sciences PLA A multi-channel wireless communication system access method based on behavior prediction

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LI, Z.; GUO, C., Multi-agent deep reinforcement learning based spectrum allocation for D2D underlay communications, 18 December 2019 (2019-12-18) *
NADERIALIZADEH, N., et al., Resource management in wireless networks via multi-agent deep reinforcement learning, 20 January 2021 (2021-01-20) *

Also Published As

Publication number Publication date
IL310753A (en) 2024-04-01
WO2025169194A1 (en) 2025-08-14
IL310753B1 (en) 2025-01-01
