WO2023160827A1 - Assembling a multi-purpose dataset - Google Patents

Assembling a multi-purpose dataset

Info

Publication number
WO2023160827A1
Authority
WO
WIPO (PCT)
Prior art keywords
dataset
data
model
data sample
diversity
Prior art date
Application number
PCT/EP2022/055012
Other languages
French (fr)
Inventor
Hannes LARSSON
Jalil TAGHIA
Andreas Johnsson
Farnaz MORADI
Xiaoyu LAN
Masoumeh EBRAHIMI
Original Assignee
Telefonaktiebolaget Lm Ericsson (Publ)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Telefonaktiebolaget Lm Ericsson (Publ)
Priority to PCT/EP2022/055012
Publication of WO2023160827A1

Classifications

    • G06N20/00 Machine learning
    • G06N3/092 Reinforcement learning
    • G06N3/096 Transfer learning
    • H04L41/16 Maintenance, administration or management of data switching networks using machine learning or artificial intelligence
    • G06N3/0442 Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G06N3/047 Probabilistic or stochastic networks
    • G06N3/0475 Generative networks
    • G06N7/01 Probabilistic graphical models, e.g. probabilistic networks
    • H04L41/0895 Configuration of virtualised networks or elements, e.g. virtualised network function or OpenFlow elements
    • H04L41/142 Network analysis or design using statistical or mathematical methods
    • H04L41/5009 Determining service level performance parameters or violations of service level contracts, e.g. violations of agreed response time or mean time between failures [MTBF]

Definitions

  • This disclosure relates to methods and nodes in distributed computing systems such as communications networks. More particularly but non-exclusively, the disclosure relates to assembling multi-purpose datasets suitable for use in training different machine learning models to perform different tasks.
  • Transfer learning has received considerable attention, specifically in areas such as image, video, and sound recognition.
  • Conventionally, each task is learnt from scratch using training data obtained from a domain, and the respective model is trained to make predictions for new data from the same domain.
  • Transfer learning can be used to transfer knowledge from a domain where sufficient training data is available to the domain of interest, in order to improve the accuracy of the machine learning task.
  • Transfer learning is defined as follows. Given a source domain D_S and learning task T_S, and a target domain D_T and learning task T_T, transfer learning aims to help improve the learning of the target predictive function f_T(·) in D_T using the knowledge in D_S and T_S, where D_S ≠ D_T or T_S ≠ T_T.
  • Transfer learning methods can be divided into two main categories; homogeneous and heterogeneous. In homogeneous transfer learning the feature space in the source and target domains are the same, while in heterogeneous transfer learning the source and target domains can have different feature spaces.
  • a source domain may refer to an ML model trained for a specific type of execution environment (e.g. a VM executing with a specific configuration), whereas the target domain corresponds to a scaled or migrated version of the same environment.
  • the disclosure herein aims to address some of these issues amongst others.
  • transfer learning is an approach that aims to address certain issues associated with training machine learning models in target domains for which there is limited data with which to train the model.
  • Transfer learning addresses the problem by incorporating knowledge gained from other source domains into the target domain.
  • the training task is then reduced to one of fine-tuning in the target domain (as opposed to complete training of a completely new model).
  • the transferred knowledge from other sources should be relevant to the target domain; in the setting of the distributed 5G cloud, there are typically multiple deployments of infrastructure similar to that of the target.
  • the server represents the source dataset using a mixture of experts (MoE) which partitions the source dataset into mutually exclusive subsets and trains a classifier for each subset.
  • the client downloads the experts to evaluate the performance of each expert on the target data.
  • the performance information is sent to the server so that the most relevant samples in the source can be selected.
  • the selected data samples are then sent to the client, so that it can use them for training its model.
  • the disclosure herein aims to improve on some of the problems associated with data collection in transfer learning.
  • a computer implemented method of assembling a multi-purpose dataset suitable for use in training different machine learning models to perform different tasks comprises: obtaining a first data sample; and adding the first data sample to the dataset if the addition of the first data sample to the dataset increases diversity of the dataset, according to a diversity measure.
  • a computing node for assembling a multi-purpose dataset suitable for use in training different machine learning models to perform different tasks, the node being configured to: obtain a first data sample; and add the first data sample to the dataset if the addition of the first data sample to the dataset increases diversity of the dataset, according to a diversity measure.
  • the node comprises: a memory comprising instruction data representing a set of instructions; and a processor configured to communicate with the memory and to execute the set of instructions.
  • the set of instructions when executed by the processor, cause the processor to: obtain a first data sample; and add the first data sample to the dataset if the addition of the first data sample to the dataset increases diversity of the dataset, according to a diversity measure.
  • a computer program comprising instructions which, when executed on at least one processor, cause the at least one processor to carry out a method according to the first aspect.
  • a carrier containing a computer program according to the fourth aspect wherein the carrier comprises one of an electronic signal, optical signal, radio signal or computer readable storage medium.
  • in a sixth aspect there is a computer program product comprising non-transitory computer readable media having stored thereon a computer program according to the fourth aspect.
  • Fig. 1 shows a node according to some embodiments herein;
  • Fig. 2 shows a method according to some embodiments herein;
  • Fig. 3 shows example distributions;
  • Fig. 4 shows an example apparatus architecture;
  • Fig. 5 shows an example method;
  • Fig. 6 shows an example signalling diagram;
  • Fig. 7 shows an example signalling diagram; and
  • Fig. 8 shows example data centres interacting with a source manager.
  • a communications network may comprise any one, or any combination of: a wired link (e.g. ADSL) or a wireless link such as Global System for Mobile Communications (GSM), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), New Radio (NR), WiFi, Bluetooth or future wireless technologies.
  • wireless network may implement communication standards, such as Global System for Mobile Communications (GSM), Universal Mobile Telecommunications System (UMTS), Long Term Evolution (LTE), and/or other suitable 2G, 3G, 4G, or 5G standards; wireless local area network (WLAN) standards, such as the IEEE 802.11 standards; and/or any other appropriate wireless communication standard, such as the Worldwide Interoperability for Microwave Access (WiMax), Bluetooth, Z-Wave and/or ZigBee standards.
  • Fig. 1 shows a node (e.g. a computing node, or computer node) according to some embodiments herein.
  • the node 100 is configured (e.g. adapted, operative, or programmed) to perform any of the embodiments of the method 200 as described below.
  • the node 100 may comprise one or more virtual machines running different software and/or processes.
  • the node 100 may therefore comprise one or more servers, switches and/or storage devices and/or may comprise cloud computing infrastructure or infrastructure configured to perform in a distributed manner, that runs the software and/or processes.
  • the node 100 may comprise a processor (e.g. processing circuitry or logic) 102.
  • the processor 102 may control the operation of the node 100 in the manner described herein.
  • the processor 102 can comprise one or more processors, processing units, multi-core processors or modules that are configured or programmed to control the node 100 in the manner described herein.
  • the processor 102 can comprise a plurality of software and/or hardware modules that are each configured to perform, or are for performing, individual or multiple steps of the functionality of the node 100 as described herein.
  • the node 100 may comprise a memory 104.
  • the memory 104 of the node 100 can be configured to store program code or instructions 106 that can be executed by the processor 102 of the node 100 to perform the functionality described herein.
  • the memory 104 of the node 100 can be configured to store any requests, resources, information, data, signals, or similar that are described herein.
  • the processor 102 of the node 100 may be configured to control the memory 104 of the node 100 to store any requests, resources, information, data, signals, or similar that are described herein.
  • the node 100 may generally be any computing node or computer device suitable for performing the functionality herein.
  • the node 100 is a network node in a communications network, as described above.
  • node 100 may comprise any component or network function (e.g. any hardware or software module) in the communications network suitable for performing the functions described herein.
  • a node may comprise equipment capable, configured, arranged and/or operable to communicate directly or indirectly with a UE (such as a wireless device) and/or with other network nodes or equipment in the communications network to enable and/or provide wireless or wired access to the UE and/or to perform other functions (e.g., administration) in the communications network.
  • nodes include, but are not limited to, access points (APs) (e.g., radio access points), base stations (BSs) (e.g., radio base stations, Node Bs, evolved Node Bs (eNBs) and NR NodeBs (gNBs)).
  • core network functions such as, for example, core network functions in a Fifth Generation Core network (5GC).
  • the node 100 may comprise other components in addition or alternatively to those indicated in Fig. 1.
  • the node 100 may comprise a communications interface.
  • the communications interface may be for use in communicating with other nodes in a communications network, (e.g. such as other physical or virtual nodes).
  • the communications interface may be configured to transmit to and/or receive from other nodes or network functions requests, resources, information, data, signals, or similar.
  • the processor 102 of node 100 may be configured to control such a communications interface to transmit to and/or receive from other nodes or network functions requests, resources, information, data, signals, or similar.
  • the node 100 is for assembling a multi-purpose dataset suitable for use in training different machine learning models to perform different tasks.
  • the node 100 may be configured to obtain a first data sample, and add the first data sample to the dataset if the addition of the first data sample to the dataset increases diversity of the dataset, according to a diversity measure.
  • a dataset compiled in this manner may have many uses: for example, it could be used to train a multi-purpose machine learning model that can serve as a source model in a transfer learning process for a wide variety of target domains, thus optimising the transfer learning process.
  • a dataset compiled according to the principles herein may be used to re-train a source model to perform a second task in a target domain, again optimising the transfer learning process, as one such dataset may contain enough samples with enough diversity to re-train many different models for many different predictive tasks.
  • in transfer learning, learning from a source domain (e.g. a first domain) is used as a starting point for training or refining a model suitable for use in a target domain (e.g. a second domain).
  • the source model trained in the source domain may for example, be a model trained to perform a different but related task.
  • the learnt weights of the source model may be used as the starting point for training the target model. This is particularly useful if there is less data in the target domain.
  • the data that is available can be used to fine-tune the source domain model.
  • the methods described herein may also be used in Domain Adaptation, which is similar to transfer learning, but where the target has many unlabelled samples and little or no labelled data.
  • Fig. 2 shows a method 200 that may be performed by the node 100 described above.
  • the method 200 is a computer implemented method.
  • the method 200 is for assembling a multi-purpose dataset suitable for use in training different machine learning models to perform different tasks.
  • the method comprises, in a first step 202, obtaining a first data sample; and, in a second step 204, adding the first data sample to the dataset if the addition of the first data sample to the dataset increases diversity of the dataset, according to a diversity measure.
  • the dataset may comprise a plurality of data samples obtained from a plurality of nodes (e.g. a plurality of data sources) in a distributed computing system.
  • Each data sample may comprise a plurality of measurements and/or features measured by a respective one of the plurality of nodes in the distributed computing system.
  • each of the plurality of nodes may generally have similar or overlapping feature spaces.
  • the data samples will have common features or parameters.
  • the method 200 is performed by a node in a communications network.
  • the method may be for assembling a multi-purpose dataset of measurements and/or features in a communications network, suitable for training different machine learning models to perform different network operation tasks in the communications network.
  • the plurality of nodes may comprise edge cloud nodes and/or base stations in a communications network.
  • the method 200 may be for use in selecting a diverse (or maximally diverse) multi-purpose dataset, of fixed size, from a great many data samples available from a plurality of edge devices and/or base stations.
  • the method 200 may enable re-use of a machine learning model based system for different network operation tasks in the communications network by providing a dataset that can be used in a transfer learning process to further train a source model for use in a target domain.
  • the plurality of nodes comprise user devices and the plurality of data samples comprise measurements of wireless connectivity performance.
  • the multi-purpose dataset is for use in training different machine learning models to perform different optimisation or orchestration tasks in the communications network.
  • network operation tasks include but are not limited to: anomaly detection, Key Performance Indicator (KPI) prediction, orchestration and network automation tasks.
  • a first data sample is obtained.
  • the first data sample may be a new or previously unseen data sample.
  • the purpose of the method 200 is then to determine whether to add the first data sample to the dataset or whether to discard it.
  • the first data sample may be obtained from one of the plurality of nodes in a distributed computing system, as described above.
  • step 202 may comprise the first node receiving a first message from a second node in the communications system, the first message comprising the data sample.
  • the second node may be referred to herein as the source node of the first data sample.
  • the first message may be sent by the second node, e.g. when the data sample is received or compiled by the second node.
  • the first node may send a request to the second node for data samples and the second node may send the first data sample to the first node in response to such a request.
  • data pre-processing may be performed, such as filtering for outliers/erroneous datapoints.
  • filtering may be performed using a sequence of manually configured thresholds or conditions, e.g. to reject data samples containing NaNs.
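A minimal sketch of such threshold-based pre-filtering (the `limits` encoding and function name are illustrative assumptions, not from the application):

```python
import math

def prefilter(samples, limits):
    """Reject data samples containing NaNs or values outside manually
    configured thresholds. `limits` maps a feature index to an
    allowed (lo, hi) range -- a hypothetical encoding of the
    manually configured conditions mentioned above."""
    kept = []
    for s in samples:
        if any(math.isnan(x) for x in s):
            continue  # reject samples containing NaNs
        if any(not (lo <= s[i] <= hi) for i, (lo, hi) in limits.items()):
            continue  # reject out-of-range (outlier/erroneous) samples
        kept.append(s)
    return kept

prefilter([[1.0, 2.0], [float("nan"), 2.0], [1.0, 99.0]], {1: (0.0, 10.0)})
```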
  • in step 204, the method comprises adding the first data sample to the dataset if the addition of the first data sample to the dataset increases diversity of the dataset, according to (e.g. based on) a diversity measure.
  • in other words, the first data sample is added to the dataset if it increases the diversity compared to the dataset without the first data sample.
  • the diversity measure is a metric or parameter that can be used to provide an estimate of the diversity of the samples in the dataset.
  • the diversity measure is used to determine if the addition of the first data sample to the dataset increases the diversity of the dataset. For example, in some embodiments, the diversity measure may be used to estimate the diversity of the dataset in the absence of the first data sample (e.g. before the first data sample is added to it). The diversity measure may then be updated or re-estimated for the dataset including the first data sample (e.g. after the first data sample is added to it). If the diversity is increased as a result of adding the first data sample to the dataset, then the first data sample is added to the dataset. An example is shown in Fig. 3, which shows an initial distribution 302 of a dataset. In step 204, if a (new) first data sample changes the distribution to that shown in graph 304, then it may be admitted to the dataset in preference to another data sample that would change the distribution to that shown in graph 306.
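An illustrative sketch of the before/after comparison in steps 202-204, assuming one-dimensional samples and taking the differential entropy of a single fitted Gaussian as the diversity measure (all names and the choice of measure are assumptions for brevity, not the claimed embodiments):

```python
import math

def gaussian_entropy(samples):
    """Differential entropy of one Gaussian fitted to 1-D samples:
    h = 0.5 * log(2 * pi * e * variance)."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n
    return 0.5 * math.log(2 * math.pi * math.e * var)

def maybe_admit(dataset, candidate):
    """Add `candidate` to `dataset` only if doing so increases the
    diversity measure (entropy after > entropy before)."""
    before = gaussian_entropy(dataset)
    after = gaussian_entropy(dataset + [candidate])
    if after > before:
        dataset.append(candidate)
        return True
    return False

data = [1.0, 1.1, 0.9, 1.05]
maybe_admit(data, 5.0)  # an outlying sample widens the distribution -> admitted
maybe_admit(data, 1.0)  # a sample near the bulk narrows it -> rejected
```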
  • the diversity measure may generally be any measure of spread of the distribution of data samples in the dataset.
  • the diversity measure may be a statistical measure of diversity.
  • the diversity measure may be based on entropy.
  • the diversity is measured using a differential entropy or Shannon entropy.
  • the diversity measure is policy-based or rule-based. For example, there may be predefined rules that describe whether a sample increases diversity.
  • An example of a rule is that if the sample originates from a new (e.g. previously unseen/or under-represented) environment e.g., new hardware, or software version, then this should be added to the dataset as it increases the diversity in terms of number of environments we have seen samples from.
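This rule could be sketched as follows; the `environment` metadata field attached to each sample is a hypothetical representation of the hardware/software version information mentioned above:

```python
def rule_based_increases_diversity(dataset, sample):
    """Policy/rule-based diversity check: a sample originating from a
    previously unseen execution environment (e.g. new hardware or a
    new software version) is deemed to increase diversity."""
    seen_envs = {s["environment"] for s in dataset}
    return sample["environment"] not in seen_envs

dataset = [{"environment": "vm-small", "cpu": 0.4},
           {"environment": "vm-small", "cpu": 0.5}]
rule_based_increases_diversity(dataset, {"environment": "vm-large", "cpu": 0.7})
```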
  • the diversity may be based on a diversity metric specified in a Service Level Agreement (SLA).
  • SLAs may be taken into account when a target requests the source model/dataset.
  • the target may request a diversity above a threshold, as specified in a SLA.
  • if the source model/dataset meets the requested diversity requirements, the source model may be shared; if it does not meet the requirements, the source model may not be shared.
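A minimal sketch of this SLA-gated sharing decision; the dict encoding of the SLA and the function name are assumptions for illustration:

```python
def handle_request(source_diversity, sla):
    """Decide whether to share the source dataset/model with a
    requesting target. `sla` is a hypothetical representation of the
    Service Level Agreement, e.g. {"min_diversity": 2.0}; if the SLA
    specifies no diversity requirement, the dataset is shared."""
    if source_diversity >= sla.get("min_diversity", float("-inf")):
        return "share"
    return "refuse"

handle_request(2.5, {"min_diversity": 2.0})  # meets the SLA threshold
```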
  • the method comprises: modelling the dataset using a generative model; updating the generative model with the first data sample; and determining whether the diversity of a distribution of the dataset, as modelled by the generative model, has increased as a result of updating the generative model with the first data sample.
  • generative models are a type of machine learning model that can be trained to model a real life data distribution and generate new data samples from said distribution.
  • the object of a generative model is to generate samples that are indistinguishable from real samples taken from the real distribution.
  • a generative model may thus be used to model the distribution of the dataset, and provide an estimation of how likely any given data sample is to have been selected from the modelled distribution.
  • Examples of generative models that may be used herein include but are not limited to: Gaussian mixture models (GMMs), Variational auto-encoders, Long Short-Term Memory Networks, State Space Models and/or Hidden Markov Models.
  • Gaussian mixture models are advantageous as they can generally be easily applied straight to the features. Long Short-Term Memory Networks, State Space Models or Hidden Markov Models may be useful for sequential data.
  • the method may comprise estimating the differential entropy or Shannon entropy using Monte Carlo sampling of the distribution of the dataset as modelled by the generative model.
  • one way of estimating the diversity is to use the differential Shannon entropy h(X_S) as the diversity measure for the source domain D_S, which is calculated as: h(X_S) = -∫ p_S(x) log p_S(x) dx, where p_S(·) is a probability density function for the source domain trace X_S.
  • p_S(·) can be estimated for the dataset by fitting a distribution.
  • the distribution may be obtained from the generative model (such as a Gaussian Mixture Model) fitted to the dataset.
  • the integral (which is also an expected value) can be estimated by simple Monte Carlo sampling.
  • h(X_S) may be estimated using Monte Carlo sampling of the distribution output by the generative model.
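As a sketch of this Monte Carlo estimate, assuming for simplicity a one-dimensional Gaussian mixture whose parameters have already been fitted (all function names are illustrative):

```python
import math
import random

def gmm_pdf(x, weights, means, stds):
    """Density p(x) of a 1-D Gaussian mixture (stands in for a fitted GMM)."""
    return sum(w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
               for w, m, s in zip(weights, means, stds))

def gmm_sample(weights, means, stds):
    """Draw one sample from the mixture."""
    i = random.choices(range(len(weights)), weights=weights)[0]
    return random.gauss(means[i], stds[i])

def mc_entropy(weights, means, stds, n=20000):
    """Monte Carlo estimate of differential entropy:
    h(X) = -E[log p(X)] ~= -(1/n) * sum_i log p(x_i), with x_i ~ p."""
    total = 0.0
    for _ in range(n):
        x = gmm_sample(weights, means, stds)
        total += math.log(gmm_pdf(x, weights, means, stds))
    return -total / n

# Sanity check: a unit Gaussian has h = 0.5*log(2*pi*e) ~ 1.4189 nats.
random.seed(0)
est = mc_entropy([1.0], [0.0], [1.0])
```

A wider, multi-modal mixture yields a higher entropy estimate, matching the intuition that a more spread-out dataset is more diverse.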
  • a size threshold (or maximum size) may be imposed on the dataset.
  • the size threshold may be based on computational or storage constraints. In such embodiments, if the dataset reaches the size threshold, then admittance of the first data sample may be dependent on removal of another data sample from the dataset that contributes less (or least) to the diversity of the dataset.
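The eviction behaviour at the size threshold could be sketched as follows, with any measure of spread over a list of samples standing in for the diversity measure (here the population standard deviation of one-dimensional samples); the function name and swap strategy are illustrative:

```python
from statistics import pstdev

def admit_with_eviction(dataset, candidate, max_size, diversity):
    """Admit `candidate` when below the size threshold; otherwise admit
    it only if swapping out some existing sample for the candidate
    yields a more diverse dataset than the current one."""
    if len(dataset) < max_size:
        dataset.append(candidate)
        return True
    current = diversity(dataset)
    # Try replacing each existing sample with the candidate; keep the
    # swap giving the most diverse dataset, if it beats the current one.
    options = [dataset[:i] + dataset[i + 1:] + [candidate]
               for i in range(len(dataset))]
    best = max(options, key=diversity)
    if diversity(best) > current:
        dataset[:] = best
        return True
    return False

ds = [1.0, 1.0, 1.0, 2.0]
admit_with_eviction(ds, 10.0, 4, pstdev)  # outlier admitted, a bulk sample evicted
```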
  • Synthetic data may also be used to increase the size of the dataset.
  • if the dataset is below a first size threshold (which may be, e.g., a target size threshold), the method 200 can further comprise using the generative model to generate new synthetic data samples that increase the diversity of the dataset.
  • the skilled person will be familiar with methods of using generative models to generate new samples. For example, using techniques such as CycleGAN or histogram equalization augmentation techniques as described in the paper entitled: “Data augmentation using generative adversarial networks (CycleGAN) to improve generalizability in CT segmentation tasks” by Sandfort, V., Yan, K., Pickhardt, P.J. et al. Sci Rep 9, 16884 (2019).
  • the method 200 described above may be used to determine whether a synthetic data sample adds to the diversity of the dataset.
  • the dataset may be supplemented with synthetic data generated using another data augmentation process.
  • the skilled person will be familiar with other data augmentation processes that can be used to supplement training datasets, but as an example, real numerical data samples may be smoothed, over-sampled, offset with random offsets, or have noise added to produce synthetic examples; image data (such as photographic data) may be transformed (e.g. rotated, cropped, enlarged or flipped), smoothed, or have contrast changes applied, so as to provide additional synthetic data samples.
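One such augmentation for numerical samples (a random offset plus additive Gaussian noise) might be sketched as follows; the parameter values and function name are illustrative, not from the application:

```python
import random

def augment_numeric(sample, noise_std=0.01, offset_range=0.05):
    """Produce a synthetic variant of a numeric data sample by applying
    a shared random offset and per-feature Gaussian noise."""
    offset = random.uniform(-offset_range, offset_range)
    return [x + offset + random.gauss(0.0, noise_std) for x in sample]

random.seed(1)
augment_numeric([1.0, 2.0, 3.0])  # a slightly perturbed copy of the sample
```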
  • the synthetic samples may be removed, e.g. before the ‘real’ data samples.
  • the method 200 may further comprise removing a second data sample from the dataset, wherein the second data sample was previously generated by the generative model and contributes least to the diversity of the dataset compared to other data samples previously generated by the generative model in the dataset.
  • the diversity measure may be used, in the manner described above, to determine which sample or samples to remove.
  • the first and second size thresholds may be the same threshold (e.g. set to the same value).
  • the method may further comprise: removing a third data sample from the dataset that contributes least to the diversity of the dataset compared to other data samples in the dataset.
  • the third size threshold may be the same as the first and/or second size thresholds (e.g. set to the same value).
  • the diversity measure may be used, in the manner described above, to determine which sample or samples to remove.
  • the third data sample e.g. the sample contributing less or least to the diversity and which should be selected for removal
  • the third data sample has a likelihood in the distribution of the dataset, above a first likelihood threshold.
  • having a high likelihood, or being part of a mode with a high weighting indicates a sample that is derived from the main part of the distribution and thus is less likely to add diversity to the dataset.
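This likelihood-based selection of a removal candidate can be sketched as follows, assuming a one-dimensional fitted density (names are illustrative):

```python
import math

def normal_pdf(x, mean=0.0, std=1.0):
    """Density of a Gaussian, standing in for the modelled distribution."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def least_diverse_sample(dataset, pdf):
    """The removal candidate is the sample with the highest likelihood
    under the modelled distribution: it lies in the bulk of the
    distribution and so contributes least to diversity."""
    return max(dataset, key=pdf)

least_diverse_sample([-2.5, 0.1, 3.0], normal_pdf)  # the sample nearest the mode
```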
  • the method 200 may be repeated, e.g. in a continuous manner, on different data samples in order to build up a dataset of sufficient size and diversity. The method 200 may be used to assemble a maximally diverse dataset from the available data samples.
  • a dataset produced using the method 200 may have various uses. For example, it may be used to train a general purpose machine learning model (e.g. to perform a generic predictive task). Such a general purpose model may be used as a source model in a transfer learning process.
  • the method 200 may comprise training a multi-purpose machine learning model using the dataset, and using the multi-purpose machine learning model as a source model in a transfer learning process. In this way, a multi-purpose, generic source model may be obtained that is optimised for subsequent transfer learning.
  • the advantage of having a multi-purpose dataset and training a multi-purpose model for transfer learning is that it is not necessary to perform "source model selection" and compare different source models/datasets (for example, by comparing the similarity of different sources to the target), which is time consuming and computationally expensive. There is also no need to store multiple source datasets/models for each ML task, nor for target samples to be available when choosing the source model. Thus, embodiments herein may result in reduced computational resource requirements, time saving, and/or storage savings (e.g. compared to saving every data sample).
  • the method 200 may be used in ML-based processes and services in a communications network.
  • the method 200 may be for enabling re-use of a machine learning-model based system for different network operation tasks in the communications network.
  • the method 200 may further comprise steps of obtaining a first model trained to perform a first network operation task, and performing further training on the first model to obtain a second model.
  • the second model is trained to perform a second network operation task, and the further training is performed using the dataset as training data.
  • the dataset produced using the method 200 may be optimally diverse and, as such, the dataset may be used in a transfer learning process in this way, to perform extra training to train a source model to perform a different task.
  • the first network operation task and/or the second network operation task may be related to anomaly detection, Key Performance Indicator, KPI, prediction, or network automation.
  • the dataset may comprise infrastructure data and/or data from an orchestrator obtained in a communications network.
  • data samples in the dataset may comprise KubernetesTM data.
  • KubernetesTM is an open-source orchestrator for managing applications running in containers.
  • each data sample in the dataset may comprise parameters such as: Collective CPU usage
  • a dataset comprising the above-mentioned parameters may be used to train different models, such as service placement, scaling, and/or routing models. These are models that decide where to place a service (service placement), when and how to scale (scaling), or how to route requests between services (routing). Models like this would typically be trained using reinforcement learning, but an initial model can be transferred from a related task, such as KPI prediction, with a collected dataset compiled according to the method 200. One could also construct a base model to be transferred using unsupervised methods (for instance autoencoders). A dataset comprising data samples with the parameters above may also be used to train models to perform network tasks such as anomaly detection, KPI prediction and/or network automation in a communications network.
  • the methods herein may be used to compile a dataset of features related to Internet of Things (IoT) devices used for manufacturing.
  • IoT Internet of Things
  • the data samples in the dataset may comprise parameters such as:
  • Such a dataset may be used to train different ML models such as:
  • Anomaly detection models, models for predicting product quality, troubleshooting models, and/or models for predicting customer complaints.
  • the methods herein may be used to compile a dataset of features related to Automated guided vehicles in factories.
  • the data samples in the dataset may comprise, amongst other parameters: Image data
  • Such a dataset may be used to train different ML models, such as: different Reinforcement Learning (RL) agents for deciding actions that should be performed in order for the vehicle to drive.
  • RL Reinforcement Learning
  • the skilled person will appreciate that these are merely examples; other data parameters may be collected, and the compiled dataset may be used to train other types of predictive models than those described above.
  • the first data sample is obtained from a second node in a communications network and the method further comprises sending a message to the second node, indicating whether the first data sample was added to the dataset.
  • the first node can give feedback to the second node, and the second node may use this feedback when determining whether to forward a data sample to the first node for consideration to be added to the dataset.
  • FIG. 4 illustrates an apparatus 400 according to some embodiments herein.
  • the apparatus may be part of the node 100 described above.
  • the apparatus 400 comprises the following 3 components:
  • the main component of the apparatus is the global data set component.
  • This component consists of two sub-components: The actual dataset and a model of the data set.
  • This model could for instance be, but is not limited to, a Gaussian mixture model.
  • new samples are obtained (according to step 202 described above) and investigated using various statistical approaches: if a new sample adds to the diversity of the global dataset (e.g. according to step 204 described above), it is admitted into the global dataset; otherwise it is rejected. Filtering out of anomalies, erroneous values or NaNs may also be performed by this module, since true outliers always increase the diversity of a dataset but may not produce a good dataset. Thresholds for such outliers must be manually configured before deployment of the system.
  • the purpose of the dataset update module is twofold: i. Remove unnecessary samples from the global dataset. This is done when the size of the dataset exceeds a maximum size (which may or may not be pre-defined), by removing the samples that contribute the least to the diversity, for instance the samples with the highest likelihood, or samples from modes with the highest weights. ii. Add synthetic samples. If the dataset is smaller than a pre-defined threshold, the dataset model can be used to generate samples that are realistic but under-represented, so that the diversity of the dataset is increased.
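Part (ii) of the update module might be sketched as follows, assuming a 1-D dataset and a single fitted Gaussian standing in for the dataset model (the text suggests e.g. a Gaussian mixture model). The under-representation test used here — keeping only draws more than one standard deviation from the mode — is an illustrative simplification.

```python
import random
import statistics

def augment(dataset, target_size, seed=0):
    """Grow `dataset` to `target_size` with synthetic samples drawn from a
    Gaussian fitted to the data, keeping only draws that fall in the
    under-represented tails, i.e. "realistic but under-represented".
    """
    rng = random.Random(seed)
    mu = statistics.fmean(dataset)
    sigma = statistics.pstdev(dataset) or 1.0
    while len(dataset) < target_size:
        candidate = rng.gauss(mu, sigma)
        # Keep draws away from the mode, where coverage is already thin.
        if abs(candidate - mu) > sigma:
            dataset.append(candidate)
    return dataset
```

A real implementation would refit the model as samples are added and would sample from low-weight mixture components rather than thresholding on distance from a single mean.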
  • the apparatus 400 may perform the method 500 shown in Fig. 5.
  • the method 500 has the following steps:
  • Step 1: the apparatus receives one or more samples from a worker (e.g. in the manner of step 202, described above).
  • the sample/samples is/are accepted and added to the global data set (step 204 of the method 200 above). The fact that the sample/samples are accepted is reported back to the source nodes of the data.
  • Fig. 6 shows a signal diagram showing the signals sent between an apparatus 602 performing the method 500 on data received from a data source 604 and a data user 606.
  • the node data source 1 604 sends samples that increase the diversity of the global data set.
  • the following signals are sent: S1 : data source 1 sends data sample(s) to apparatus 602.
  • S2: apparatus 602 indicates that one or more of the samples were accepted into the dataset.
  • apparatus 602 sends global dataset to user for use in e.g. a transfer learning problem.
  • Fig. 7 shows a signal diagram showing the signals sent between the apparatus 602 performing the method 500 and a second data source 608.
  • the node data source 2 608 sends samples that do not increase the diversity of the global data set.
  • the following signals are sent: S5: data source 2 sends data sample(s) to apparatus 602.
  • apparatus 602 indicates that one or more of the samples were rejected.
  • FIG. 8 shows an example environment with n Edge Datacenters (DC1, DC2, DC3) continuously measuring a set of features that are sent to a source manager 802.
  • DC1, DC2, DC3 Edge Datacenters
  • These features are used as inputs for a set of ML models in order to help with network and edge DC management tasks. Examples could be for instance KPI prediction or anomaly detection.
  • the achieved technical effect is to have a single augmented data source that manages collected data from multiple sources. This achieves lower storage overhead (less memory required) and removes the need to select which source model to transfer from: the method collects one universally good source dataset, which can be used for transfer learning with either an existing task or a new task.
  • the source manager 802 performs the method 500 as illustrated in Fig. 5 on the data samples sent to the source manager by the datacentres DC1, DC2, DC3.
  • the steps outlined below are made with reference to the steps shown in Fig. 5:
  • Step 1 The samples from the different sources are sent to the source manager (and received by the source manager according to step 202 of the method 200 above). In this embodiment, this could be done once per day, sending all collected samples X, where X is an (a, b) matrix, a being the number of measurements being sent and b the number of measured features.
  • Step 2 Remove samples where feature values are missing or obviously wrong; what counts as “obviously wrong” can be defined by a domain expert. For instance, a sample containing a feature corresponding to temperature showing a value below 0 K or above 100 degrees Celsius will be filtered out.
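The domain-expert filtering of Step 2 could look like the following sketch; the feature names and validity ranges are hypothetical placeholders for expert-configured values.

```python
import math

# Domain-expert validity ranges per feature, as in the temperature example
# above (0 K = -273.15 degrees Celsius); feature names are illustrative.
VALID_RANGES = {"temperature_c": (-273.15, 100.0), "cpu_load": (0.0, 1.0)}

def is_plausible(sample):
    """Reject samples with missing (NaN/None) or obviously-wrong values."""
    for feature, value in sample.items():
        if value is None or (isinstance(value, float) and math.isnan(value)):
            return False
        low, high = VALID_RANGES.get(feature, (float("-inf"), float("inf")))
        if not (low <= value <= high):
            return False
    return True
```

Samples failing `is_plausible` would be dropped before the GMM of Step 3 is ever trained on them.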
  • Step 3 In the source manager there is a Gaussian mixture model (GMM), which is an example of a generative model. This model is trained using the samples that already exist in the sample dataset (the source dataset) and the new samples combined. If the dataset is empty, then the GMM is trained using only the newly incoming samples.
  • GMM Gaussian mixture model
  • Step 5a If the diversity has increased with the new samples, the samples are accepted and added to the global data set.
  • the model manager sends back information to the edge data center from which the new samples came, that the information was useful. This could then potentially be used in the edge node for decisions on what samples to send in the future.
  • Step 5b If not, the sample(s) are rejected, the generative model is reverted to its previous state, and the sample(s) are not added to the global data set.
  • the model manager sends back information to the edge data center from which the new samples came, that the information was not useful. This can potentially be used in the edge node for future filtering.
  • Step 6 Check if the global dataset is larger than the size limit
  • Step 7a If the dataset is larger than the size limit: remove the samples that are least useful for the dataset diversity. This can be done in the following way: compute the likelihood of all samples in the dataset and remove the ones with the highest likelihood. This way, samples are removed from regions where there are already “enough” samples in the dataset.
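Step 7a can be sketched as below, again with a single 1-D Gaussian standing in for the GMM of the text: `prune` repeatedly drops the sample with the highest likelihood under the model fitted to the current dataset.

```python
import math
import statistics

def gauss_pdf(x, mu, sigma):
    """Likelihood of x under a 1-D Gaussian (stand-in for a GMM density)."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (
        sigma * math.sqrt(2 * math.pi)
    )

def prune(dataset, max_size):
    """While over the size limit, drop the highest-likelihood sample,
    i.e. a sample from the already well-covered part of the distribution.
    """
    while len(dataset) > max_size:
        mu = statistics.fmean(dataset)
        sigma = statistics.pstdev(dataset) or 1.0
        dataset.remove(max(dataset, key=lambda x: gauss_pdf(x, mu, sigma)))
    return dataset
```

Note that the model is refitted after each removal; with a true GMM one could equivalently drop samples from the mode with the highest weight.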
  • a source manager, such as that illustrated in Fig. 8, could be included in a function for measuring wireless connectivity performance of mobile devices (e.g. user equipments, UEs).
  • the connected device sends performance measurement readings to a central database in the cloud, to which an analytics engine is connected.
  • the collected data can be used as an input for optimizing the network and for added functionality such as predictive mobility, etc.
  • the source manager could be applied at the interface between a data streamer and a big-data database.
  • the mobile devices send data samples to the data streamer, and admission control to (and continuous maintenance of) the big-data database is performed by the source manager. If used like this, the technical effect is reducing the size of the data stored and possibly reducing the sending of unnecessary samples, if the feedback mechanism described in step 5 of Fig. 5 is performed.
  • the Open Radio Access Network (O-RAN) is an open standard for next generation radio access networks, driven mostly by telecom operators.
  • An Intelligent Management and Orchestration system within the framework of O-RAN has been proposed, which specifically highlights the importance of AI model management, data analytics, and training capabilities.
  • the AI model management component is a functionality for life cycle management (LCM) of models and source domains.
  • the proposed source manager described in Fig. 4 above could reside as a function within an AI model management module, supporting the general model management and LCM activities.
  • the method accepts samples that increase the diversity of the source dataset, optionally uses a generative model during sample admission to enhance the dataset further, and keeps the size of the dataset below a pre-defined maximum size.
  • This allows for the creation of a source dataset in a target domain-, model- and task-agnostic way. This is enabled by maximizing the diversity of the global dataset, which lets the dataset be constructed before the need arises, saving time and overhead, as a single source dataset can be used for multiple purposes.
  • the method enables creation of source datasets of a certain data quality.
  • a computer program product comprising a computer readable medium, the computer readable medium having computer readable code embodied therein, the computer readable code being configured such that, on execution by a suitable computer or processor, the computer or processor is caused to perform the method or methods described herein.
  • the disclosure also applies to computer programs, particularly computer programs on or in a carrier, adapted to put embodiments into practice.
  • the program may be in the form of a source code, an object code, a code intermediate source and an object code such as in a partially compiled form, or in any other form suitable for use in the implementation of the method according to the embodiments described herein.
  • a program code implementing the functionality of the method or system may be sub-divided into one or more sub-routines.
  • the sub-routines may be stored together in one executable file to form a self-contained program.
  • Such an executable file may comprise computer-executable instructions, for example, processor instructions and/or interpreter instructions (e.g. Java interpreter instructions).
  • one or more or all of the sub-routines may be stored in at least one external library file and linked with a main program either statically or dynamically, e.g. at runtime.
  • the main program contains at least one call to at least one of the sub-routines.
  • the subroutines may also comprise function calls to each other.
  • the carrier of a computer program may be any entity or device capable of carrying the program.
  • the carrier may include a data storage, such as a ROM, for example, a CD ROM or a semiconductor ROM, or a magnetic recording medium, for example, a hard disk.
  • the carrier may be a transmissible carrier such as an electric or optical signal, which may be conveyed via electric or optical cable or by radio or other means.
  • the carrier may be constituted by such a cable or other device or means.
  • the carrier may be an integrated circuit in which the program is embedded, the integrated circuit being adapted to perform, or used in the performance of, the relevant method.
  • a computer program may be stored/distributed on a suitable medium, such as an optical storage medium or a solid-state medium supplied together with or as part of other hardware, but may also be distributed in other forms, such as via the Internet or other wired or wireless telecommunication systems. Any reference signs in the claims should not be construed as limiting the scope.


Abstract

A computer implemented method of assembling a multi-purpose dataset suitable for use in training different machine learning models to perform different tasks, the method comprising obtaining a first data sample, and adding the first data sample to the dataset if the addition of the first data sample to the dataset increases the diversity of the dataset, according to a diversity measure.

Description

ASSEMBLING A MULTI-PURPOSE DATASET
Technical Field
This disclosure relates to methods and nodes in distributed computing systems such as communications networks. More particularly but non-exclusively, the disclosure relates to assembling multi-purpose datasets suitable for use in training different machine learning models to perform different tasks.
Background
Management of telecoms systems is challenging due to component, infrastructure, and service complexity, heterogeneity, scale, and dynamicity. Promising management approaches based on machine learning (ML) have been developed in academia and industry. However, a key challenge in data-driven model creation is the difficulty in maintaining the accuracy of a model over time, as well as how best to reuse knowledge learnt for one type of execution environment.
In recent years, transfer learning has received considerable attention, specifically in areas such as image, video, and sound recognition. In traditional machine learning, each task is learnt from scratch using training data obtained from a domain, and the respective model is trained to make predictions for new data from the same domain. However, sometimes there is not a sufficient amount of data for training in the domain of interest. In such cases, transfer learning can be used to transfer knowledge from a domain where sufficient training data is available to the domain of interest, in order to improve the accuracy of the machine learning task.
Transfer learning is defined as follows. Given a source domain DS and learning task TS, and a target domain DT and learning task TT, transfer learning aims to help improve the learning of the target predictive function fT(·) in DT using the knowledge in DS and TS, where DS ≠ DT or TS ≠ TT.
Transfer learning methods can be divided into two main categories: homogeneous and heterogeneous. In homogeneous transfer learning the feature spaces in the source and target domains are the same, while in heterogeneous transfer learning the source and target domains can have different feature spaces.
In telecoms/edge cloud environments, a source domain may refer to an ML model trained for a specific type of execution environment (e.g. a VM executing with a specific configuration), whereas the target domain corresponds to a scaled or migrated version of the same environment. In distributed systems such as telecom/edge clouds there are typically many source domains available at the same time, from different execution environments.
In certain applications, there may be limited understanding of the target domain due to the lack of availability of data samples that are representative of the domain, for example because of difficulties in collecting data, limitations in storing data, and the dynamic nature of the execution environment in the target domain.
The disclosure herein aims to address some of these issues amongst others.
Summary
As described above, transfer learning is an approach that aims to address certain issues associated with training machine learning models in target domains for which there is limited data with which to train the model. Transfer learning addresses the problem by incorporating knowledge gained from other source domains into the target domain. The training task is then reduced to one of fine-tuning in the target domain (as opposed to complete training of a completely new model). The transferred knowledge from other sources should be relevant to the target domain, and in the setting of the distributed cloud of 5G, there are typically multiple deployments of infrastructure similar to the target.
Methods for selection of data samples in the source domain have been studied before. In the paper by Yan, Acuna & Fidler (2020) entitled: “Neural Data Server: A Large-Scale Search Engine for Transfer Learning Data”, https://arxiv.org/abs/2001.02799, a large-scale search engine for selecting the most useful source dataset for transfer learning is presented. The motivation for this work is that pre-training on selected relevant data samples in the source domain is important for achieving good performance in the target domain. In the proposed solution the Neural Data Server has access to a large dataset and the clients have limited target data, and the data and ML architecture of the client are not shared with the server due to privacy concerns. The server represents the source dataset using a mixture of experts (MoE), which partitions the source dataset into mutually exclusive subsets and trains a classifier for each subset. The client downloads the experts to evaluate the performance of each expert on the target data. The performance information is sent to the server so that the most relevant samples in the source can be selected. The selected data samples are then sent to the client, so that it can use them for training its model.
The method in Yan, Acuna & Fidler (2020) requires all data to be stored, and sample selection is done by looking at the performance of the different source models on the target task. This implies the need to store all source data separately for all sources, which causes overhead and removes the option of having a source model already available when the need for transfer learning arises; that is, the source and target models have to be trained at the same time. So the options are to store everything and wait until there is a target task, or to use the data for just a single task. In other words, all the selection here is done only after a target task has been identified, which is slow.
In the paper by Jamshidi, Christian & Siegmund (2018) entitled: “Learning to Sample: Exploiting Similarities Across Environments to Learn Performance Models for Configurable Systems”, a guided sampling strategy is proposed. The sampling strategy exploits knowledge from various source domains similar to the target domain.
There have also been studies that looked into selection of samples in the target domain. The paper by Khan, Hon & Abraham (2019) entitled: “Transfer Learning with intelligent training data selection for prediction of Alzheimer’s Disease”, describes a method for selecting better samples in the target environment based on information from the source environment. The proposed method uses entropy to select the most informative images (slices from MRI data) to select training samples in the target domain (not in the source domain).
The citations above explicitly use the target data for determining the good samples in the source. This imposes the limitation that the process of building a source dataset and training the source model cannot happen until the need for it arises (e.g. until the target is defined). This limits the applicability of these methods for dynamically changing environments, such as the Cloud, where the target domain can change frequently. Furthermore, the processes have to be repeated for every new target model, which results in high overhead.
The disclosure herein aims to improve on some of the problems associated with data collection in transfer learning.
According to a first aspect herein there is a computer implemented method of assembling a multi-purpose dataset suitable for use in training different machine learning models to perform different tasks. The method comprises: obtaining a first data sample; and adding the first data sample to the dataset if the addition of the first data sample to the dataset increases diversity of the dataset, according to a diversity measure.
According to a second aspect herein there is a computing node for assembling a multi-purpose dataset suitable for use in training different machine learning models to perform different tasks, the node being configured to: obtain a first data sample; and add the first data sample to the dataset if the addition of the first data sample to the dataset increases diversity of the dataset, according to a diversity measure.
According to a third aspect herein there is a computing node for assembling a multi-purpose dataset suitable for use in training different machine learning models to perform different tasks, the node comprises: a memory comprising instruction data representing a set of instructions; and a processor configured to communicate with the memory and to execute the set of instructions. The set of instructions, when executed by the processor, cause the processor to: obtain a first data sample; and add the first data sample to the dataset if the addition of the first data sample to the dataset increases diversity of the dataset, according to a diversity measure.
According to a fourth aspect there is a computer program comprising instructions which, when executed on at least one processor, cause the at least one processor to carry out a method according to the first aspect.
According to a fifth aspect there is a carrier containing a computer program according to the fourth aspect, wherein the carrier comprises one of an electronic signal, optical signal, radio signal or computer readable storage medium.
According to a sixth aspect there is a computer program product comprising non-transitory computer readable media having stored thereon a computer program according to the fourth aspect.
Brief Description of the Drawings
For a better understanding and to show more clearly how embodiments herein may be carried into effect, reference will now be made, by way of example only, to the accompanying drawings, in which:
Fig. 1 shows a node according to some embodiments herein;
Fig. 2 shows a method according to some embodiments herein;
Fig. 3 shows example distributions;
Fig. 4 shows an example apparatus architecture;
Fig. 5 shows an example method;
Fig. 6 shows an example signalling diagram;
Fig. 7 shows an example signalling diagram; and
Fig. 8 shows example data centres interacting with a source manager.
Detailed Description
The disclosure herein relates to a communications network (or telecommunications network). A communications network may comprise any one, or any combination of: a wired link (e.g. ADSL) or a wireless link such as Global System for Mobile Communications (GSM), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), New Radio (NR), WiFi, Bluetooth or future wireless technologies. The skilled person will appreciate that these are merely examples and that the communications network may comprise other types of links. A wireless network may be configured to operate according to specific standards or other types of predefined rules or procedures. Thus, particular embodiments of the wireless network may implement communication standards, such as Global System for Mobile Communications (GSM), Universal Mobile Telecommunications System (UMTS), Long Term Evolution (LTE), and/or other suitable 2G, 3G, 4G, or 5G standards; wireless local area network (WLAN) standards, such as the IEEE 802.11 standards; and/or any other appropriate wireless communication standard, such as the Worldwide Interoperability for Microwave Access (WiMax), Bluetooth, Z-Wave and/or ZigBee standards.
Fig. 1 shows a node (e.g. a computing node, or computer node) according to some embodiments herein. The node 100 is configured (e.g. adapted, operative, or programmed) to perform any of the embodiments of the method 200 as described below. It will be appreciated that the node 100 may comprise one or more virtual machines running different software and/or processes. The node 100 may therefore comprise one or more servers, switches and/or storage devices and/or may comprise cloud computing infrastructure or infrastructure configured to perform in a distributed manner, that runs the software and/or processes.
The node 100 may comprise a processor (e.g. processing circuitry or logic) 102. The processor 102 may control the operation of the node 100 in the manner described herein. The processor 102 can comprise one or more processors, processing units, multi-core processors or modules that are configured or programmed to control the node 100 in the manner described herein. In particular implementations, the processor 102 can comprise a plurality of software and/or hardware modules that are each configured to perform, or are for performing, individual or multiple steps of the functionality of the node 100 as described herein.
The node 100 may comprise a memory 104. In some embodiments, the memory 104 of the node 100 can be configured to store program code or instructions 106 that can be executed by the processor 102 of the node 100 to perform the functionality described herein. Alternatively or in addition, the memory 104 of the node 100, can be configured to store any requests, resources, information, data, signals, or similar that are described herein. The processor 102 of the node 100 may be configured to control the memory 104 of the node 100 to store any requests, resources, information, data, signals, or similar that are described herein.
The node 100 may generally be any computing node or computer device suitable for performing the functionality herein. In some embodiments, the node 100 is a network node in a communications network, as described above. Generally, node 100 may comprise any component or network function (e.g. any hardware or software module) in the communications network suitable for performing the functions described herein. For example, a node may comprise equipment capable, configured, arranged and/or operable to communicate directly or indirectly with a UE (such as a wireless device) and/or with other network nodes or equipment in the communications network to enable and/or provide wireless or wired access to the UE and/or to perform other functions (e.g., administration) in the communications network. Examples of nodes include, but are not limited to, access points (APs) (e.g., radio access points), base stations (BSs) (e.g., radio base stations, Node Bs, evolved Node Bs (eNBs) and NR NodeBs (gNBs)). Further examples of nodes include but are not limited to core network functions such as, for example, core network functions in a Fifth Generation Core network (5GC).
It will be appreciated that the node 100 may comprise other components in addition or alternatively to those indicated in Fig. 1. For example, in some embodiments, the node 100 may comprise a communications interface. The communications interface may be for use in communicating with other nodes in a communications network, (e.g. such as other physical or virtual nodes). For example, the communications interface may be configured to transmit to and/or receive from other nodes or network functions requests, resources, information, data, signals, or similar. The processor 102 of node 100 may be configured to control such a communications interface to transmit to and/or receive from other nodes or network functions requests, resources, information, data, signals, or similar.
Briefly, the node 100 is for assembling a multi-purpose dataset suitable for use in training different machine learning models to perform different tasks. The node 100 may be configured to obtain a first data sample, and add the first data sample to the dataset if the addition of the first data sample to the dataset increases diversity of the dataset, according to a diversity measure.
In this manner, over time, a (maximally) diverse dataset can be compiled. A dataset compiled in this manner may have many uses: for example, it could be used to train a multi-purpose machine learning model that can be used as a source model in a transfer learning process for a wide variety of target domains, thus optimising the transfer learning process. Alternatively or additionally, a dataset compiled according to the principles herein may be used to re-train a source model to perform a second task in a target domain, again optimising the transfer learning process, as one such dataset may contain enough samples with enough diversity to re-train many different models for many different predictive tasks.
Thus, presented herein are methods, nodes, computer programs, computer program products and computer carriers that can combine measurements from different source domains to create a single, maximally diverse, general-purpose source dataset, which can be used for transfer learning to different target domains. The samples in the source domain are selected independently of the target domain, in an efficient manner, allowing a single multi-purpose dataset to be created in a computationally efficient and storage efficient manner, thus providing improved source management for transfer learning.
The skilled person will be familiar with machine learning and, more particularly, transfer learning, which is described in the paper by Pan & Yang (2010) entitled, “A Survey on Transfer Learning” IEEE Transactions on Knowledge and Data Engineering (Volume: 22, Issue: 10, Oct. 2010). In transfer learning, learning from a source domain (e.g. a first domain) is used as a starting point for training or refining of a model suitable for use in a target domain (e.g. a second domain). In practice, the source model trained in the source domain may, for example, be a model trained to perform a different but related task. The learnt weights of the source model may be used as the starting point for training the target model. This is particularly useful if there is less data in the target domain, in which case the data that is available can be used to fine-tune the source domain model. The methods described herein may also be used in Domain Adaptation, which is similar to transfer learning, but where there are many unlabelled samples in the target domain, and no or little labelled data.
Fig. 2 shows a method 200 that may be performed by the node 100 described above. The method 200 is a computer implemented method. The method 200 is for assembling a multi-purpose dataset suitable for use in training different machine learning models to perform different tasks. Briefly, the method comprises in a first step 202, obtaining a first data sample; and in a second step 204 the method comprises: adding the first data sample to the dataset if the addition of the first data sample to the dataset increases diversity of the dataset, according to a diversity measure.
The dataset may comprise a plurality of data samples obtained from a plurality of nodes (e.g. a plurality of data sources) in a distributed computing system. Each data sample may comprise a plurality of measurements and/or features measured by a respective one of the plurality of nodes in the distributed computing system.
The data produced by (or obtained from) each of the plurality of nodes (or data sources) may generally have similar or overlapping feature spaces. In other words, the data samples will have common features or parameters.
In some embodiments, the method 200 is performed by a node in a communications network. As such, the method may be for assembling a multi-purpose dataset of measurements and/or features in a communications network, suitable for training different machine learning models to perform different network operation tasks in the communications network.
As an example, the plurality of nodes may comprise edge cloud nodes and/or base stations in a communications network. In scenarios where data is in over-abundance and where it is computationally inefficient to store all the possible data samples that are available, the method 200 may be for use in selecting a diverse (or maximally diverse) multi-purpose dataset, of fixed size, from a great many data samples available from a plurality of edge devices and/or base stations.
The method 200 may enable re-use of a machine learning model based system for different network operation tasks in the communications network by providing a dataset that can be used in a transfer learning process to further train a source model for use in a target domain. This is described in more detail below. In one example, the plurality of nodes comprise user devices and the plurality of data samples comprise measurements of wireless connectivity performance. In such an example, the multi-purpose dataset is for use in training different machine learning models to perform different optimisation or orchestration tasks in the communications network.
More generally, network operation tasks include but are not limited to: anomaly detection, Key Performance Indicator (KPI) prediction, orchestration and network automation tasks.
In step 202, a first data sample is obtained. The first data sample may be a new or previously unseen data sample. The purpose of the method 200 is then to determine whether to add the first data sample to the dataset or whether to discard it.
The first data sample may be obtained from one of the plurality of nodes in a distributed computing system, as described above. For example, if the method 200 is performed by a first node in a communications system, step 202 may comprise the first node receiving a first message from a second node in the communications system, the first message comprising the data sample. The second node may be referred to herein as the source node of the first data sample.
The first message may be sent by the second node, e.g. when the data sample is received or compiled by the second node. In other examples, the first node may send a request to the second node for data samples and the second node may send the first data sample to the first node in response to such a request.
Following step 202, data pre-processing may be performed, such as filtering out outliers or erroneous datapoints. Such filtering may be performed using a sequence of manually configured thresholds or conditions, e.g. to reject data samples containing NaNs, for example.
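Such manually configured filtering may be sketched as follows; the feature names and bounds are illustrative assumptions, not part of the method itself:

```python
import math

# Illustrative, manually configured plausible bounds per feature.
# The specific features and limits here are assumptions for this sketch;
# in practice they would be set by a domain expert before deployment.
FEATURE_BOUNDS = {
    "temperature_c": (-273.15, 100.0),  # reject physically implausible readings
    "cpu_load": (0.0, 1.0),
}

def passes_filter(sample: dict) -> bool:
    """Reject samples with missing values, NaNs, or values outside the
    manually configured range for each feature."""
    for name, (lo, hi) in FEATURE_BOUNDS.items():
        value = sample.get(name)
        if value is None or math.isnan(value) or not (lo <= value <= hi):
            return False
    return True
```

Samples failing the filter would be discarded before the diversity check of step 204 is applied.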
In step 204 the method comprises adding the first data sample to the dataset if the addition of the first data sample to the dataset increases diversity of the dataset, according to (e.g. based on) a diversity measure. That is, the first data sample is added to the dataset if it increases the diversity compared to if the first data sample were not included in the dataset.
The diversity measure is a metric or parameter that can be used to provide an estimate of the diversity of the samples in the dataset. The diversity measure is used to determine if the addition of the first data sample to the dataset increases the diversity of the dataset. For example, in some embodiments, the diversity measure may be used to estimate the diversity of the dataset in the absence of the first data sample (e.g. before the first data sample is added to it). The diversity measure may then be updated or re-estimated for the dataset including the first data sample (e.g. after the first data sample is added to it). If the diversity is increased as a result of adding the first data sample to the dataset, then the first data sample is added to the dataset. An example is shown in Fig. 3, which shows an initial distribution 302 of a dataset. In step 204, if a (new) first data sample changes the distribution to that shown in graph 304, it may be admitted to the dataset in preference to another data sample that would change the distribution to that shown in graph 306.
The diversity measure may generally be any measure of spread of the distribution of data samples in the dataset. The diversity measure may be a statistical measure of diversity. For example, the diversity measure may be based on entropy. For example, the diversity is measured using a differential entropy or Shannon entropy.
In another example, the diversity measure is policy-based or rule-based. For example, there may be predefined rules that describe whether a sample increases diversity. An example of a rule is that if the sample originates from a new (e.g. previously unseen/or under-represented) environment e.g., new hardware, or software version, then this should be added to the dataset as it increases the diversity in terms of number of environments we have seen samples from.
In another example, the diversity may be based on a diversity metric specified in a Service Level Agreement (SLA). For example, SLAs may be taken into account when a target requests the source model/dataset. For example, the target may request a diversity above a threshold, as specified in a SLA. In such an example, if the diversity of the dataset meets the requirements in the SLA, then the source model may be shared; if it does not meet the requirements, the source model may not be shared.
In some embodiments, the method comprises: modelling the dataset using a generative model; updating the generative model with the first data sample; and determining whether the diversity of a distribution of the dataset, as modelled by the generative model, has increased as a result of updating the generative model with the first data sample.
The skilled person will be familiar with generative models which are a type of machine learning model that can be trained to model a real life data distribution and generate new data samples from said distribution. The object of a generative model is to generate samples that are indistinguishable from real samples taken from the real distribution. A generative model may thus be used to model the distribution of the dataset, and provide an estimation of how likely any given data sample is to have been selected from the modelled distribution.
Examples of generative models that may be used herein include but are not limited to: Gaussian mixture models (GMMs), Variational auto-encoders, Long Short-Term Memory Networks, State Space Models and/or Hidden Markov Models. The skilled person will be familiar with these types of generative models. GMMs, for example, are described in the paper by Nasios, N., & Bors, A. (2006). Variational learning for Gaussian mixture models. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 36, 849-862. Gaussian mixture models are advantageous as they can generally be easily applied straight to the features. Long Short-Term Memory Networks, State Space Models or Hidden Markov Models may be useful for sequential data.
Generally, in step 204 the method may comprise estimating the differential entropy or Shannon entropy using Monte Carlo sampling of the distribution of the dataset as modelled by the generative model.
For example, one way of estimating the diversity is to use the differential Shannon entropy h(Xs) as the diversity measure for the source domain D_S, which is calculated as:

h(Xs) = - ∫ ps(x) log ps(x) dx

where ps( ) is a probability density function for the source domain trace Xs. In practice, ps( ) can be estimated for the dataset by fitting a distribution. For example, the distribution may be obtained from the generative model (such as a Gaussian Mixture Model) fitted to the dataset. The integral (which is also an expected value) can be estimated by simple Monte Carlo sampling.
In another example, the differential entropy may be used as the diversity measure, which can be calculated according to: h(x) = -Ex~p(x) log p(x), where p(x) is the distribution of the dataset, e.g. as output from the generative model (which may, e.g., be a GMM). Again, h(x) may be estimated using Monte Carlo sampling of the distribution output from the generative model.
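The Monte Carlo entropy estimate can be sketched as follows, assuming a scikit-learn Gaussian mixture as the generative model; the function name is an illustrative assumption:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def differential_entropy(gmm, n_samples=10000):
    """Monte Carlo estimate of h(x) = -E_{x~p(x)}[log p(x)], where p(x) is
    the density modelled by a fitted scikit-learn GaussianMixture."""
    x, _ = gmm.sample(n_samples)         # draw x_i ~ p(x)
    return -gmm.score_samples(x).mean()  # average of -log p(x_i)
```

For instance, a GMM fitted to widely spread data yields a larger estimate than one fitted to tightly clustered data, so the value can be compared before and after a candidate sample is added.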
The skilled person will appreciate that these are merely examples and that diversity may be estimated in other ways to those described above.
In some examples, a size threshold (or maximum size) may be imposed on the dataset. The size threshold may be based on computational or storage constraints. In such embodiments, if the dataset reaches the size threshold, then admittance of the first data sample may be dependent on removal of another data sample from the dataset that contributes less (or least) to the diversity of the dataset.
Synthetic data may also be used to increase the size of the dataset. Thus, if the size of the dataset is below a first size threshold (which may be e.g. a target size threshold), the method 200 can further comprise using the generative model to generate new synthetic data samples that increase the diversity of the dataset. The skilled person will be familiar with methods of using generative models to generate new samples. For example, using techniques such as CycleGAN or histogram equalization augmentation techniques as described in the paper entitled: “Data augmentation using generative adversarial networks (CycleGAN) to improve generalizability in CT segmentation tasks” by Sandfort, V., Yan, K., Pickhardt, P.J. et al. Sci Rep 9, 16884 (2019). The method 200 described above may be used to determine whether a synthetic data sample adds to the diversity of the dataset.
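A minimal generate-and-test sketch of this idea, assuming a scikit-learn Gaussian mixture as the generative model; the function names and parameters are illustrative assumptions, and the entropy estimate is the Monte Carlo approach described earlier:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def entropy(gmm, n=5000):
    """Monte Carlo estimate of the differential entropy of a fitted GMM."""
    x, _ = gmm.sample(n)
    return -gmm.score_samples(x).mean()

def generate_if_diverse(dataset, n_candidates=20, n_components=2, seed=0):
    """Draw candidate synthetic samples from a GMM fitted to the dataset and
    keep only those whose addition increases the estimated entropy of a
    refitted model, i.e. the method 200 check applied to synthetic samples."""
    def fit(d):
        return GaussianMixture(n_components=n_components, random_state=seed).fit(d)
    base_model = fit(dataset)
    base_h = entropy(base_model)
    candidates, _ = base_model.sample(n_candidates)
    accepted = [c for c in candidates
                if entropy(fit(np.vstack([dataset, c[None, :]]))) > base_h]
    return np.array(accepted)
```

Refitting the model per candidate is expensive; an implementation might instead batch candidates or update the model incrementally.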
As another example, the dataset may be supplemented with synthetic data generated using another data augmentation process. The skilled person will be familiar with other data augmentation processes that can be used to supplement training datasets but, as an example, real numerical data samples may be smoothed, over-sampled, offset with random offsets, or have noise added to produce synthetic examples; image data (such as photographic data) may be transformed (e.g. rotated, cropped, enlarged or flipped), smoothed or have contrast changes applied, so as to provide additional synthetic data samples. These are merely examples, however, and it will be appreciated that many other methods may equally be used to produce synthetic data examples in order to increase the size of a training dataset.
In the event that the dataset becomes too large, then the synthetic samples may be removed, e.g. before the ‘real’ data samples. For example, if the size of the dataset is above a second size threshold, the method 200 may further comprise removing a second data sample from the dataset, wherein the second data sample was previously generated by the generative model and contributes least to the diversity of the dataset compared to other data samples previously generated by the generative model in the dataset. In such an example, the diversity measure may be used, in the manner described above, to determine which sample or samples to remove. It will be appreciated that the first and second size thresholds may be the same threshold (e.g. set to the same value).
If there are no synthetic samples then a (real) data sample that contributes least to the diversity may be removed instead. For example, if the size of the dataset is above a third size threshold, the method may further comprise: removing a third data sample from the dataset that contributes least to the diversity of the dataset compared to other data samples in the dataset. It will be appreciated that the third size threshold may be the same as the first and/or second size thresholds (e.g. set to the same value). In such an example, the diversity measure may be used, in the manner described above, to determine which sample or samples to remove. For example, the third data sample (e.g. the sample contributing less or least to the diversity and which should be selected for removal) may be selected from a mode of the generative model with a weight above a weight threshold. E.g. the mode with the highest weight. As another example, according to the generative model, the third data sample has a likelihood in the distribution of the dataset, above a first likelihood threshold. In both examples, having a high likelihood, or being part of a mode with a high weighting indicates a sample that is derived from the main part of the distribution and thus is less likely to add diversity to the dataset. It will be appreciated that the method 200 may be repeated, e.g. in a continuous manner, on different data samples in order to build up a dataset of sufficient size and diversity. The method 200 may be used to assemble a maximally diverse dataset from the available data samples.
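The likelihood-based removal described above can be sketched as follows, assuming a scikit-learn Gaussian mixture as the generative model; the function name and defaults are illustrative assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def prune_least_diverse(dataset, n_remove=1, n_components=1, seed=0):
    """Remove the n_remove samples with the highest likelihood under a GMM
    fitted to the dataset; such samples lie in the densest part of the
    modelled distribution and so contribute least to its diversity.
    (An alternative criterion, per the text, is to select samples from the
    mixture mode with the highest weight.)"""
    gmm = GaussianMixture(n_components=n_components, random_state=seed).fit(dataset)
    log_lik = gmm.score_samples(dataset)                   # log p(x) per sample
    keep = np.argsort(log_lik)[: len(dataset) - n_remove]  # drop highest-likelihood rows
    return dataset[np.sort(keep)]
```

Low-likelihood samples, such as rare outliers, survive the pruning, which preserves the spread of the distribution while the dataset is kept under its size limit.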
A dataset produced using the method 200 may have various uses. For example, it may be used to train a general purpose machine learning model (e.g. to perform a generic predictive task). Such a general purpose model may be used as a source model in a transfer learning process. Put another way, in some embodiments, the method 200 may comprise training a multi-purpose machine learning model using the dataset, and using the multi-purpose machine learning model as a source model in a transfer learning process. In this way, a multi-purpose, generic source model may be obtained that is optimised for subsequent transfer learning. The advantage of having a multi-purpose dataset and training a multi-purpose model for transfer learning is that it is not necessary to perform “source model selection” and compare different source models/datasets, for example by comparing their similarity to the target, which is time-consuming and computationally expensive. There is also no need to store multiple source datasets/models for each ML task. Moreover, there is no need for target samples to be available for choosing the source model. Thus, embodiments herein may result in reduced computational resource requirements, time savings, and/or storage savings (e.g. compared to saving every data sample).
As noted above, the method 200 may be used in ML-based processes and services in a communications network. For example, as noted above, the method 200 may be for enabling re-use of a machine learning-model based system for different network operation tasks in the communications network. As such, the method 200 may further comprise steps of obtaining a first model trained to perform a first network operation task, and performing further training on the first model to obtain a second model. In such embodiments, the second model is trained to perform a second network operation task, and the further training is performed using the dataset as training data. The diversity of the dataset produced using the method 200 may be optimally diverse and as such, the dataset may be used in a transfer learning process in this way, to perform extra training to train a source model to perform a different task.
The first network operation task and/or the second network operation task may be related to anomaly detection, Key Performance Indicator, KPI, prediction, or network automation.
As an example, the dataset may comprise infrastructure data and/or data from an orchestrator obtained in a communications network. For example, data samples in the dataset may comprise Kubernetes™ data. Kubernetes™ is an open-source orchestrator for managing applications running in containers. As an example, each data sample in the dataset may comprise parameters such as: Collective CPU usage
Individual CPU statistics
Memory used and available
Swap space used and available
Overall I/O activities of the system
Individual device I/O activities
Context switch statistics
Run queue and load average data
Network statistics
By compiling the dataset using the method 200 herein, a dataset comprising the above-mentioned parameters may be used to train different models, such as service placement, scaling, and/or routing models. These are models that decide where to place a service (service placement), when and how to scale (scaling), or how to route requests between services (routing). Models like this would typically be trained using reinforcement learning, but an initial model can be transferred from a related task, such as KPI prediction, with a collected dataset compiled according to the method 200. One could also construct a base model to be transferred using unsupervised methods (for instance autoencoders). A dataset comprising data samples with the parameters above may also be used to train models to perform network tasks such as anomaly detection, KPI prediction and/or network automation in a communications network.
As another example, the methods herein may be used to compile a dataset of features related to Internet of Things (IoT) devices used in manufacturing.
In such an example the data samples in the dataset may comprise parameters such as:
Sensor data, examples:
Temperature, vibration, torque, humidity, etc.
Such a dataset may be used to train different ML models such as:
Anomaly detection models, models for predicting product quality, troubleshooting models, and/or models for predicting customer complaints.
As another example, the methods herein may be used to compile a dataset of features related to automated guided vehicles in factories.
In such an example, the data samples in the dataset may comprise, amongst other parameters: Image data
Such a dataset may be used to train different ML models, such as different Reinforcement Learning (RL) agents for deciding actions that should be performed in order for the vehicle to drive. The skilled person will appreciate that these are merely examples and that other data parameters may be collected and the compiled dataset may be used to train other types of predictive models than those described above.
In another embodiment in a communications network, the first data sample is obtained from a second node in a communications network and the method further comprises sending a message to the second node, indicating whether the first data sample was added to the dataset. In this way, the first node can give feedback to the second node, and the second node may use this feedback when determining whether to forward a data sample to the first node for consideration to be added to the database.
Turning now to Fig. 4, which illustrates an apparatus 400 according to some embodiments herein. The apparatus may be part of the node 100 described above. The apparatus 400 comprises the following three components:
404: Global dataset and model
The main component of the apparatus is the global data set component. This component consists of two sub-components: The actual dataset and a model of the data set. This model could for instance be, but is not limited to, a Gaussian mixture model.
406: Sample admission controller
In the sample admission controller, new samples are obtained (according to step 202 described above) and investigated using various statistical approaches: if a new sample adds to the diversity of the global data set (e.g. according to step 204 described above), it is admitted into the global dataset, otherwise it is rejected. Filtering out of anomalies, erroneous values or NaNs could also be done by this module, as true outliers always increase diversity of a data set, but they may not produce a good dataset. Thresholds for such outliers must be manually configured before deployment of the system.
402: Dataset update module
The purpose of the dataset update module is twofold: i. Remove un-necessary samples from the Global dataset. This is done when the size of the dataset exceeds a maximum size, which may or may not be pre-defined, and it is done by removing samples that contribute the least to the diversity. This could be done for instance by removing the samples with the highest likelihood, or samples from modes with the highest weights. ii. Add synthetic samples. If the dataset is smaller than a pre-defined threshold, the dataset model can be used to generate samples that are realistic but under-represented so that the diversity of the dataset would increase.
The apparatus 400 may perform the method 500 shown in Fig. 5. The method 500 has the following steps: 1: The apparatus receives one or more samples from a worker (e.g. in the manner of step 202, described above).
2: Filtering out of outliers. If there is an outlier, the sample is rejected, and the rejection is reported back to the source node.
3: Update the dataset model with the new sample/samples.
4: Check if the diversity has increased
5a: If the diversity has increased with the new sample/samples, the sample/samples is/are accepted and added to the global data set (step 204 of the method 200 above). The fact that the sample/samples are accepted is reported back to the source nodes of the data.
5b: If not it/they are rejected, and the generative model is reverted to the previous state. The fact that the sample/samples are rejected is reported back to the source nodes of the data.
6: Check if the dataset is larger than the size limit
7a: If the dataset is larger than the size limit: remove samples that are least useful for the dataset diversity.
7b: Otherwise (optionally) generate more samples. These could be samples that increase the diversity of the data set, generated from the generative model, or simply produced using existing techniques for oversampling/data augmentation.
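The admission loop of steps 1 to 7 above can be sketched as a single pass as follows; this is an illustrative assumption of an implementation, using a scikit-learn Gaussian mixture as the generative model, with the size limit, component count and function names chosen for the sketch:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

MAX_SIZE = 1000  # illustrative size limit for step 6

def _fit(data):
    return GaussianMixture(n_components=2, random_state=0).fit(data)

def _entropy(gmm, n=5000):
    """Monte Carlo estimate of the differential entropy of a fitted GMM."""
    x, _ = gmm.sample(n)
    return -gmm.score_samples(x).mean()

def admit(dataset, sample, bounds):
    """One pass of the admission loop: filter (step 2), refit and compare
    diversity (steps 3-4), accept or reject (steps 5a/5b), then enforce the
    size limit (steps 6-7a). Returns the (possibly updated) dataset and a
    flag that can be reported back to the source node."""
    lo, hi = bounds
    # Step 2: reject NaNs and values outside manually configured thresholds.
    if np.any(np.isnan(sample)) or np.any(sample < lo) or np.any(sample > hi):
        return dataset, False
    # Steps 3-4: update the model with the candidate and compare entropies.
    candidate = np.vstack([dataset, sample[None, :]])
    if _entropy(_fit(candidate)) <= _entropy(_fit(dataset)):
        return dataset, False            # step 5b: reject, keep the old model
    dataset = candidate                  # step 5a: accept the sample
    # Steps 6-7a: if over the limit, drop the highest-likelihood sample.
    if len(dataset) > MAX_SIZE:
        gmm = _fit(dataset)
        dataset = np.delete(dataset, np.argmax(gmm.score_samples(dataset)), axis=0)
    return dataset, True
```

Refitting the model from scratch on each candidate is the simplest formulation; an implementation could instead maintain the model incrementally and revert it on rejection, as in step 5b.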
Turning now to Fig. 6 which shows a signal diagram showing the signals sent between an apparatus 602 performing the method 500 on data received from a data source 604 and a data user 606. In this example, the node data source 1 604 sends samples that increase the diversity of the global data set. In this example, the following signals are sent: S1 : data source 1 sends data sample(s) to apparatus 602.
S2: apparatus 602 indicates that one or more of the samples were accepted into the dataset.
S3: user requests dataset
S4: apparatus 602 sends global dataset to user for use in e.g. a transfer learning problem.
Turning now to Fig. 7, which shows a signal diagram showing the signals sent between apparatus 602 performing the method 500 on data received from a second data source 608. In this example, the node data source 2 608 sends samples that do not increase the diversity of the global data set. In this example, the following signals are sent: S5: data source 2 sends data sample(s) to apparatus 602.
S6: apparatus 602 indicates that one or more of the samples were rejected.
Turning to Fig. 8, which shows an example with an environment with n Edge Datacenters (DC1, DC2, DC3) continuously measuring a set of features that are sent to a source manager 802. These features are used as inputs for a set of ML models in order to help with network and edge DC management tasks. Examples could be, for instance, KPI prediction or anomaly detection. The achieved technical effect is to have a single augmented data source that manages collected data from multiple sources. This achieves lower storage overhead (less memory required) and removes the need to select which source model to transfer from - this collects one universally good source data set, which can be used for transfer learning with either an existing task or a new task.
The source manager 802 performs the method 500 as illustrated in Fig. 5 on the data samples sent to the source manager by the datacentres DC1 ; DC2; DC3. The steps outlined below are made with reference to the steps shown in Fig. 5:
Step 1: The samples from the different sources are sent to the source manager (and received by the source manager according to step 202 of the method 200 above). In this embodiment, this could be done once per day, where all collected samples X are sent, where X is an (a, b) matrix, a being the number of measurements being sent and b the number of measured features.
Step 2: Remove samples where feature values are missing or obviously wrong; what counts as “obviously” wrong can be defined by a domain expert. For instance, a sample containing a feature corresponding to temperature showing a value below 0 K or above 100 degrees Celsius will be filtered out.
Step 3: In the source manager there is a Gaussian mixture model (GMM), which is an example of a generative model. This model is trained using the samples that exist in the sample data set (the source data set) and the new samples combined. If the database is empty, then the GMM is trained using only the newly incoming samples.
Step 4: Check if the diversity of the data set with the newly added samples is larger than the diversity of the old data set (without the newly added samples). For example, using the differential entropy h(x) as the diversity metric: h(x) = -Ex~p(x) log p(x), where p(x) is the GMM and Ex~p(x) is the mathematical expectation, which takes the expected value of its argument. h(x) can be estimated using Monte Carlo sampling.
Step 5a: If the diversity has increased with the new samples, the samples are accepted and added to the global data set. The model manager sends back information to the edge data center from which the new samples came, that the information was useful. This could then potentially be used in the edge node for decisions on what samples to send in the future.
Step 5b: If not it/they are rejected, and the generative model is reverted to the previous state and they are not added to the global data set. The model manager sends back information to the edge data center from which the new samples came, that the information was not useful. This can potentially be used in the edge node for future filtering.
Step 6: Check if the global dataset is larger than the size limit. Step 7a: If the dataset is larger than the size limit, remove samples that are least useful for the dataset diversity. This can be done in the following way: compute the likelihood of all samples in the data set and remove the ones with the highest likelihood. This way, samples are removed where there are already “enough” samples in the dataset.
Turning now to other embodiments, in one embodiment, a source manager such as that illustrated in Fig, 8, could be included in a function for measuring wireless connectivity performance of mobile devices (e.g. user equipments, UEs). In this example, the connected device sends performance measurement readings to a central database in the cloud, to which an analytics engine is connected. The collected data can be used as an input for optimizing the network and for added functionality such as predictive mobility, etc.
The source manager could be applied at the interface between a data streamer and a big data database. The mobile devices send data samples to the data streamer, and admission control to (and continuous maintenance of) the Big Data database is performed by the source manager. If used like this, the technical effect is reducing the size of the data stored and possibly reducing the sending of unnecessary samples, if the feedback mechanism described in step 5 of Fig. 5 is performed.
Turning now to another embodiment, the Open Radio Access Network (O-RAN) is an open standard for next generation radio access networks driven mostly by telecom operators. An Intelligent Management and Orchestration system within the framework of O-RAN has been proposed, which specifically highlights the importance of AI model management, data analytics, and training capabilities. Specifically, the AI model management component is a functionality for life cycle management (LCM) of models and source domains. The proposed source manager described in Fig. 4 above could reside as a function within an AI model management module, supporting the general model management and LCM activities.
Thus, there is described herein, systems and methods to update and maintain a source dataset with samples from multiple distributed data sources, for instance edge data centres or radio base stations. The method accepts samples that increase the diversity of the source dataset and has the option to use the generative model used for sample admission to enhance the dataset further, while keeping the size of the data set below a pre-defined maximum size. This allows for creation of a source dataset in a target domain-, model- and task-agnostic way. This is enabled by maximizing the diversity of the global dataset, which lets the data set be constructed before the need arises, saving time and overhead, as a single source data set can be used for multiple uses. The method enables creation of source datasets of a certain data quality. The admission or rejection of a data sample is based on its statistical properties, and hence the method can reduce the number of repeated samples/samples with overlapping feature space in the dataset. In another embodiment, there is provided a computer program product comprising a computer readable medium, the computer readable medium having computer readable code embodied therein, the computer readable code being configured such that, on execution by a suitable computer or processor, the computer or processor is caused to perform the method or methods described herein.
Thus, it will be appreciated that the disclosure also applies to computer programs, particularly computer programs on or in a carrier, adapted to put embodiments into practice. The program may be in the form of source code, object code, a code intermediate between source and object code such as a partially compiled form, or in any other form suitable for use in the implementation of the method according to the embodiments described herein.
It will also be appreciated that such a program may have many different architectural designs. For example, a program code implementing the functionality of the method or system may be sub-divided into one or more sub-routines. Many different ways of distributing the functionality among these sub-routines will be apparent to the skilled person. The sub-routines may be stored together in one executable file to form a self-contained program. Such an executable file may comprise computer-executable instructions, for example, processor instructions and/or interpreter instructions (e.g. Java interpreter instructions). Alternatively, one or more or all of the sub-routines may be stored in at least one external library file and linked with a main program either statically or dynamically, e.g. at runtime. The main program contains at least one call to at least one of the sub-routines. The sub-routines may also comprise function calls to each other.
The carrier of a computer program may be any entity or device capable of carrying the program. For example, the carrier may include a data storage, such as a ROM, for example, a CD ROM or a semiconductor ROM, or a magnetic recording medium, for example, a hard disk. Furthermore, the carrier may be a transmissible carrier such as an electric or optical signal, which may be conveyed via electric or optical cable or by radio or other means. When the program is embodied in such a signal, the carrier may be constituted by such a cable or other device or means. Alternatively, the carrier may be an integrated circuit in which the program is embedded, the integrated circuit being adapted to perform, or used in the performance of, the relevant method.
Variations to the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claimed invention, from a study of the drawings, the disclosure and the appended claims. In the claims, the word "comprising" does not exclude other elements or steps, and the indefinite article "a" or "an" does not exclude a plurality. A single processor or other unit may fulfil the functions of several items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage. A computer program may be stored/distributed on a suitable medium, such as an optical storage medium or a solid-state medium supplied together with or as part of other hardware, but may also be distributed in other forms, such as via the Internet or other wired or wireless telecommunication systems. Any reference signs in the claims should not be construed as limiting the scope.
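By way of a non-limiting example, the reuse of the multi-purpose dataset for transfer learning (training a source model on the dataset and then performing further training for a second task) may be sketched as follows; the toy tasks and all names are illustrative, and scikit-learn's `warm_start` option stands in for a generic fine-tuning mechanism:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Diverse multi-purpose "source" dataset: features X and a first-task target.
X_src = rng.normal(size=(300, 4))
y_src = X_src.sum(axis=1)              # stand-in for task 1 (e.g. KPI prediction)

# Smaller dataset for a second network-operation task in the same feature space.
X_tgt = rng.normal(size=(40, 4))
y_tgt = X_tgt[:, 0] - X_tgt[:, 1]      # stand-in for task 2

# Train the source model, then continue training (fine-tune) on the new task;
# warm_start=True makes the second fit() resume from the learned weights.
model = MLPRegressor(hidden_layer_sizes=(16,), warm_start=True,
                     max_iter=500, random_state=0)
model.fit(X_src, y_src)                # source training on the multi-purpose dataset
model.fit(X_tgt, y_tgt)                # further training for the second task
```

Because the source dataset is assembled in a task-agnostic way, the same source model can be fine-tuned for several different second tasks.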

Claims

1. A computer implemented method (200) of assembling a multi-purpose dataset suitable for use in training different machine learning models to perform different tasks, the method (200) comprising: obtaining (202) a first data sample; and adding (204) the first data sample to the dataset if the addition of the first data sample to the dataset increases diversity of the dataset, according to a diversity measure.
2. A method (200) as in claim 1 further comprising: modelling the dataset using a generative model; updating the generative model with the first data sample; and determining whether the diversity of a distribution of the dataset, as modelled by the generative model, has increased as a result of updating the generative model with the first data sample.
3. A method (200) as in claim 2 wherein the diversity is measured using differential entropy or Shannon entropy.
4. A method (200) as in claim 3 wherein the differential entropy or Shannon entropy is estimated by Monte Carlo sampling of the distribution of the dataset as modelled by the generative model.
5. A method (200) as in any one of claims 2 to 4 wherein, if a size of the dataset is below a first size threshold, the method (200) further comprises: using the generative model to generate new synthetic data samples that increase the diversity of the dataset.
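By way of a non-limiting illustration of the synthetic-augmentation step of claim 5, the following sketch (again assuming scikit-learn's GaussianMixture as the generative model and a Monte Carlo entropy estimate; all names are illustrative) grows a dataset only with generated samples that increase the modelled diversity, without exceeding a maximum size:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def _entropy(gmm, n=2000):
    # Monte Carlo estimate of the differential entropy of the fitted model.
    X, _ = gmm.sample(n)
    return -np.mean(gmm.score_samples(X))

def augment(dataset, max_size, n_components=2, seed=0, max_tries=100):
    # Grow the dataset with model-generated samples, keeping only those
    # whose addition increases the modelled diversity (entropy), and never
    # exceeding the pre-defined maximum size.
    gmm = GaussianMixture(n_components, random_state=seed).fit(dataset)
    best = _entropy(gmm)
    tries = 0
    while len(dataset) < max_size and tries < max_tries:
        tries += 1
        candidate, _ = gmm.sample(1)
        trial = np.vstack([dataset, candidate])
        gmm_trial = GaussianMixture(n_components, random_state=seed).fit(trial)
        ent = _entropy(gmm_trial)
        if ent > best:           # keep only diversity-increasing samples
            dataset, best = trial, ent
    return dataset
```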
6. A method (200) as in claim 5 wherein, if the size of the dataset is above a second size threshold, the method (200) further comprises: removing a second data sample from the dataset, wherein the second data sample was previously generated by the generative model and contributes least to the diversity of the dataset compared to other data samples previously generated by the generative model in the dataset.
7. A method (200) as in any one of claims 2 to 6 wherein, if the size of the dataset is above a third size threshold, the method (200) further comprises: removing a third data sample from the dataset that contributes least to the diversity of the dataset compared to other data samples in the dataset.
8. A method (200) as in claim 7 wherein: the third data sample is from a mode of the generative model with a weight above a weight threshold; or according to the generative model, the third data sample has a likelihood in the distribution of the dataset, above a first likelihood threshold.
9. A method (200) as in any one of claims 2-8 wherein the generative model is a Gaussian mixture model, Variational auto-encoder, Long Short-Term Memory Network, State Space Model or Hidden Markov Model.
10. A method (200) as in any one of the preceding claims further comprising supplementing the dataset with synthetic data generated using another data augmentation process.
11. A method (200) as in any one of the preceding claims further comprising: training a multi-purpose machine learning model, using the dataset; and using the multi-purpose machine learning model as a source model in a transfer learning process.
12. A method (200) as in any one of the preceding claims wherein the method (200) is performed by a first node in a communications network.
13. A method (200) as in claim 12 wherein the first data sample is obtained from a second node in a communications network; and wherein the method (200) further comprises sending a message to the second node, indicating whether the first data sample was added to the dataset.
14. A method (200) as in any one of the preceding claims wherein the method (200) is for assembling a multi-purpose dataset of measurements and/or features in a communications network, suitable for training different machine learning models to perform different network operation tasks in the communications network.
15. A method (200) as in claim 14 wherein the method (200) is for enabling re-use of a machine learning-model based system for different network operation tasks in the communications network, and wherein the method (200) further comprises: obtaining a first model trained to perform a first network operation task; and performing further training on the first model to obtain a second model wherein the second model is trained to perform a second network operation task, and wherein the further training is performed using the dataset as training data.
16. A method (200) as in claim 15 wherein the first network operation task and/or the second network operation task is related to: anomaly detection; Key Performance Indicator, KPI, prediction; or network automation.
17. A method (200) as in claim 14, 15, or 16 wherein the dataset comprises a plurality of data samples obtained from a plurality of different nodes in the communications network; and wherein the method (200) is for assembling a diverse multi-purpose dataset from data samples from the plurality of different nodes.
18. A method (200) as in claim 17 wherein the plurality of nodes comprise edge cloud nodes and/or base stations in a communications network.
19. A method (200) as in claim 17 wherein the plurality of nodes comprise user devices; and the plurality of data samples comprise measurements of wireless connectivity performance.
20. A computing node (100) for assembling a multi-purpose dataset suitable for use in training different machine learning models to perform different tasks, the node (100) being configured to: obtain a first data sample; and add the first data sample to the dataset if the addition of the first data sample to the dataset increases diversity of the dataset, according to a diversity measure.
21. A node (100) as in claim 20 further configured to perform the method (200) of any one of claims 2-19.

22. A computing node (100) for assembling a multi-purpose dataset suitable for use in training different machine learning models to perform different tasks, the node (100) comprising: a memory (104) comprising instruction data representing a set of instructions (106); and a processor (102) configured to communicate with the memory (104) and to execute the set of instructions (106), wherein the set of instructions (106), when executed by the processor (102), cause the processor (102) to: obtain a first data sample; and add the first data sample to the dataset if the addition of the first data sample to the dataset increases diversity of the dataset, according to a diversity measure.

23. A node (100) as in claim 22 further configured to perform the method (200) of any one of claims 2-19.

24. A computer program comprising instructions which, when executed on at least one processor, cause the at least one processor to carry out a method (200) according to any of claims 1 to 19.

25. A carrier containing a computer program according to claim 24, wherein the carrier comprises one of an electronic signal, optical signal, radio signal or computer readable storage medium.

26. A computer program product comprising non transitory computer readable media having stored thereon a computer program according to claim 24.
PCT/EP2022/055012 2022-02-28 2022-02-28 Assembling a multi-purpose dataset WO2023160827A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/EP2022/055012 WO2023160827A1 (en) 2022-02-28 2022-02-28 Assembling a multi-purpose dataset


Publications (1)

Publication Number Publication Date
WO2023160827A1 true WO2023160827A1 (en) 2023-08-31

Family

ID=80952306

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2022/055012 WO2023160827A1 (en) 2022-02-28 2022-02-28 Assembling a multi-purpose dataset

Country Status (1)

Country Link
WO (1) WO2023160827A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200364561A1 (en) * 2019-04-23 2020-11-19 Sciencelogic, Inc. Distributed learning anomaly detector
US20210117718A1 (en) * 2019-10-21 2021-04-22 Adobe Inc. Entropy Based Synthetic Data Generation For Augmenting Classification System Training Data
WO2022013264A1 (en) * 2020-07-16 2022-01-20 Koninklijke Philips N.V. Selecting a training dataset with which to train a model


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Nasios, N.; Bors, A.: "Variational learning for Gaussian mixture models", IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 36, 2006, pages 849-862
Pan, S.J.; Yang, Q.: "A Survey on Transfer Learning", IEEE Transactions on Knowledge and Data Engineering, vol. 22, October 2010
Sandfort, V.; Yan, K.; Pickhardt, P.J. et al.: "Data augmentation using generative adversarial networks (CycleGAN) to improve generalizability in CT segmentation tasks", Sci Rep, vol. 9, 2019, 16884


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22712874

Country of ref document: EP

Kind code of ref document: A1