WO2023160827A1 - Assembling a multi-purpose dataset - Google Patents

Assembling a multi-purpose dataset

Info

Publication number
WO2023160827A1
Authority
WO
WIPO (PCT)
Prior art keywords
dataset
data
model
data sample
diversity
Prior art date
Application number
PCT/EP2022/055012
Other languages
French (fr)
Inventor
Hannes LARSSON
Jalil TAGHIA
Andreas Johnsson
Farnaz MORADI
Xiaoyu LAN
Masoumeh EBRAHIMI
Original Assignee
Telefonaktiebolaget Lm Ericsson (Publ)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Telefonaktiebolaget Lm Ericsson (Publ)
Priority to PCT/EP2022/055012
Publication of WO2023160827A1

Classifications

    • G06N20/00 Machine learning
    • G06N3/092 Reinforcement learning
    • G06N3/096 Transfer learning
    • H04L41/16 Maintenance, administration or management of data switching networks using machine learning or artificial intelligence
    • G06N3/0442 Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G06N3/047 Probabilistic or stochastic networks
    • G06N3/0475 Generative networks
    • G06N7/01 Probabilistic graphical models, e.g. probabilistic networks
    • H04L41/0895 Configuration of virtualised networks or elements, e.g. virtualised network function or OpenFlow elements
    • H04L41/142 Network analysis or design using statistical or mathematical methods
    • H04L41/5009 Determining service level performance parameters or violations of service level contracts, e.g. violations of agreed response time or mean time between failures [MTBF]

Definitions

  • This disclosure relates to methods and nodes in distributed computing systems such as communications networks. More particularly but non-exclusively, the disclosure relates to assembling multi-purpose datasets suitable for use in training different machine learning models to perform different tasks.
  • Transfer learning has received considerable attention, specifically in areas such as image, video, and sound recognition.
  • Conventionally, each task is learnt from scratch using training data obtained from a domain, and the respective model is trained to make predictions for new data from the same domain.
  • Transfer learning can be used to transfer knowledge from a domain where sufficient training data is available to the domain of interest, in order to improve the accuracy of the machine learning task.
  • Transfer learning is defined as follows. Given a source domain D_S and learning task T_S, and a target domain D_T and learning task T_T, transfer learning aims to help improve the learning of the target predictive function f_T(·) in D_T using the knowledge in D_S and T_S, where D_S ≠ D_T or T_S ≠ T_T.
  • Transfer learning methods can be divided into two main categories; homogeneous and heterogeneous. In homogeneous transfer learning the feature space in the source and target domains are the same, while in heterogeneous transfer learning the source and target domains can have different feature spaces.
  • a source domain may refer to an ML model trained for a specific type of execution environment (e.g. a VM executing with a specific configuration), whereas the target domain corresponds to a scaled or migrated version of the same environment.
  • the disclosure herein aims to address some of these issues amongst others.
  • transfer learning is an approach that aims to address certain issues associated with training machine learning models in target domains for which there is limited data with which to train the model.
  • Transfer learning addresses the problem by incorporating knowledge gained from other source domains into the target domain.
  • the training task is then reduced to one of fine-tuning in the target domain (as opposed to complete training of a completely new model).
  • the transferred knowledge from other sources should be relevant to the target domain; in the setting of the distributed 5G cloud, there are typically multiple deployments of infrastructure similar to that of the target.
  • the server represents the source dataset using a mixture of experts (MoE) which partitions the source dataset into mutually exclusive subsets and trains a classifier for each subset.
  • the client downloads the experts to evaluate the performance of each expert on the target data.
  • the performance information is sent to the server so that the most relevant samples in the source can be selected.
  • the selected data samples are then sent to the client, so that it can use them for training its model.
  • the disclosure herein aims to improve on some of the problems associated with data collection in transfer learning.
  • a computer implemented method of assembling a multi-purpose dataset suitable for use in training different machine learning models to perform different tasks comprises: obtaining a first data sample; and adding the first data sample to the dataset if the addition of the first data sample to the dataset increases diversity of the dataset, according to a diversity measure.
  • a computing node for assembling a multi-purpose dataset suitable for use in training different machine learning models to perform different tasks, the node being configured to: obtain a first data sample; and add the first data sample to the dataset if the addition of the first data sample to the dataset increases diversity of the dataset, according to a diversity measure.
  • the node comprises: a memory comprising instruction data representing a set of instructions; and a processor configured to communicate with the memory and to execute the set of instructions.
  • the set of instructions when executed by the processor, cause the processor to: obtain a first data sample; and add the first data sample to the dataset if the addition of the first data sample to the dataset increases diversity of the dataset, according to a diversity measure.
  • a computer program comprising instructions which, when executed on at least one processor, cause the at least one processor to carry out a method according to the first aspect.
  • a carrier containing a computer program according to the fourth aspect wherein the carrier comprises one of an electronic signal, optical signal, radio signal or computer readable storage medium.
  • in a sixth aspect there is a computer program product comprising non-transitory computer readable media having stored thereon a computer program according to the fourth aspect.
  • Fig. 1 shows a node according to some embodiments herein;
  • Fig. 2 shows a method according to some embodiments herein;
  • Fig. 3 shows example distributions;
  • Fig. 4 shows an example apparatus architecture;
  • Fig. 5 shows an example method;
  • Fig. 6 shows an example signalling diagram;
  • Fig. 7 shows an example signalling diagram; and
  • Fig. 8 shows example data centres interacting with a source manager.
  • a communications network may comprise any one, or any combination of: a wired link (e.g. ADSL) or a wireless link such as Global System for Mobile Communications (GSM), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), New Radio (NR), WiFi, Bluetooth or future wireless technologies.
  • wireless network may implement communication standards, such as Global System for Mobile Communications (GSM), Universal Mobile Telecommunications System (UMTS), Long Term Evolution (LTE), and/or other suitable 2G, 3G, 4G, or 5G standards; wireless local area network (WLAN) standards, such as the IEEE 802.11 standards; and/or any other appropriate wireless communication standard, such as the Worldwide Interoperability for Microwave Access (WiMax), Bluetooth, Z-Wave and/or ZigBee standards.
  • Fig. 1 shows a node (e.g. a computing node, or computer node) according to some embodiments herein.
  • the node 100 is configured (e.g. adapted, operative, or programmed) to perform any of the embodiments of the method 200 as described below.
  • the node 100 may comprise one or more virtual machines running different software and/or processes.
  • the node 100 may therefore comprise one or more servers, switches and/or storage devices and/or may comprise cloud computing infrastructure or infrastructure configured to perform in a distributed manner, that runs the software and/or processes.
  • the node 100 may comprise a processor (e.g. processing circuitry or logic) 102.
  • the processor 102 may control the operation of the node 100 in the manner described herein.
  • the processor 102 can comprise one or more processors, processing units, multi-core processors or modules that are configured or programmed to control the node 100 in the manner described herein.
  • the processor 102 can comprise a plurality of software and/or hardware modules that are each configured to perform, or are for performing, individual or multiple steps of the functionality of the node 100 as described herein.
  • the node 100 may comprise a memory 104.
  • the memory 104 of the node 100 can be configured to store program code or instructions 106 that can be executed by the processor 102 of the node 100 to perform the functionality described herein.
  • the memory 104 of the node 100 can be configured to store any requests, resources, information, data, signals, or similar that are described herein.
  • the processor 102 of the node 100 may be configured to control the memory 104 of the node 100 to store any requests, resources, information, data, signals, or similar that are described herein.
  • the node 100 may generally be any computing node or computer device suitable for performing the functionality herein.
  • the node 100 is a network node in a communications network, as described above.
  • node 100 may comprise any component or network function (e.g. any hardware or software module) in the communications network suitable for performing the functions described herein.
  • a node may comprise equipment capable, configured, arranged and/or operable to communicate directly or indirectly with a UE (such as a wireless device) and/or with other network nodes or equipment in the communications network to enable and/or provide wireless or wired access to the UE and/or to perform other functions (e.g., administration) in the communications network.
  • nodes include, but are not limited to, access points (APs) (e.g., radio access points), base stations (BSs) (e.g., radio base stations, Node Bs, evolved Node Bs (eNBs) and NR NodeBs (gNBs)).
  • core network functions such as, for example, core network functions in a Fifth Generation Core network (5GC).
  • the node 100 may comprise other components in addition or alternatively to those indicated in Fig. 1.
  • the node 100 may comprise a communications interface.
  • the communications interface may be for use in communicating with other nodes in a communications network, (e.g. such as other physical or virtual nodes).
  • the communications interface may be configured to transmit to and/or receive from other nodes or network functions requests, resources, information, data, signals, or similar.
  • the processor 102 of node 100 may be configured to control such a communications interface to transmit to and/or receive from other nodes or network functions requests, resources, information, data, signals, or similar.
  • the node 100 is for assembling a multi-purpose dataset suitable for use in training different machine learning models to perform different tasks.
  • the node 100 may be configured to obtain a first data sample, and add the first data sample to the dataset if the addition of the first data sample to the dataset increases diversity of the dataset, according to a diversity measure.
  • a dataset compiled in this manner may have many uses: for example, it could be used to train a multi-purpose machine learning model that can serve as a source model in a transfer learning process for a wide variety of target domains, thus optimising the transfer learning process.
  • a dataset compiled according to the principles herein may be used to re-train a source model to perform a second task in a target domain, again optimising the transfer learning process, as one such dataset may contain enough samples with enough diversity to re-train many different models for many different predictive tasks.
  • in transfer learning, learning from a source domain (e.g. a first domain) is used as a starting point for training or refining a model suitable for use in a target domain (e.g. a second domain).
  • the source model trained in the source domain may for example, be a model trained to perform a different but related task.
  • the learnt weights of the source model may be used as the starting point for training the target model. This is particularly useful if there is less data in the target domain.
  • the data that is available can be used to fine-tune the source domain model.
  • the methods described herein may also be used in Domain Adaptation, which is similar to transfer learning, but where the target has many unlabelled samples and little or no labelled data.
  • Fig. 2 shows a method 200 that may be performed by the node 100 described above.
  • the method 200 is a computer implemented method.
  • the method 200 is for assembling a multi-purpose dataset suitable for use in training different machine learning models to perform different tasks.
  • the method comprises, in a first step 202, obtaining a first data sample; and, in a second step 204, adding the first data sample to the dataset if the addition of the first data sample to the dataset increases diversity of the dataset, according to a diversity measure.
  • the dataset may comprise a plurality of data samples obtained from a plurality of nodes (e.g. a plurality of data sources) in a distributed computing system.
  • Each data sample may comprise a plurality of measurements and/or features measured by a respective one of the plurality of nodes in the distributed computing system.
  • each of the plurality of nodes may generally have similar or overlapping feature spaces.
  • the data samples will have common features or parameters.
  • the method 200 is performed by a node in a communications network.
  • the method may be for assembling a multi-purpose dataset of measurements and/or features in a communications network, suitable for training different machine learning models to perform different network operation tasks in the communications network.
  • the plurality of nodes may comprise edge cloud nodes and/or base stations in a communications network.
  • the method 200 may be for use in selecting a diverse (or maximally diverse) multi-purpose dataset, of fixed size, from a great many data samples available from a plurality of edge devices and/or base stations.
  • the method 200 may enable re-use of a machine learning model based system for different network operation tasks in the communications network by providing a dataset that can be used in a transfer learning process to further train a source model for use in a target domain.
  • the plurality of nodes comprise user devices and the plurality of data samples comprise measurements of wireless connectivity performance.
  • the multi-purpose dataset is for use in training different machine learning models to perform different optimisation or orchestration tasks in the communications network.
  • network operation tasks include but are not limited to: anomaly detection, Key Performance Indicator (KPI) prediction, orchestration and network automation tasks.
  • a first data sample is obtained.
  • the first data sample may be a new or previously unseen data sample.
  • the purpose of the method 200 is then to determine whether to add the first data sample to the dataset or whether to discard it.
  • the first data sample may be obtained from one of the plurality of nodes in a distributed computing system, as described above.
  • step 202 may comprise the first node receiving a first message from a second node in the communications system, the first message comprising the data sample.
  • the second node may be referred to herein as the source node of the first data sample.
  • the first message may be sent by the second node, e.g. when the data sample is received or compiled by the second node.
  • the first node may send a request to the second node for data samples and the second node may send the first data sample to the first node in response to such a request.
  • data pre-processing may be performed, such as filtering for outliers/erroneous datapoints.
  • filtering may be performed using a sequence of manually configured thresholds or conditions, e.g. to reject data samples containing NaNs.
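A minimal sketch of such threshold-based pre-filtering (the `limits` encoding and function name are illustrative assumptions, not from the application):

```python
import math

def prefilter(samples, limits):
    """Reject data samples containing NaNs or values outside manually
    configured thresholds. `limits` maps a feature index to an
    allowed (lo, hi) range -- a hypothetical encoding of the
    manually configured conditions mentioned above."""
    kept = []
    for s in samples:
        if any(math.isnan(x) for x in s):
            continue  # reject samples containing NaNs
        if any(not (lo <= s[i] <= hi) for i, (lo, hi) in limits.items()):
            continue  # reject out-of-range (outlier/erroneous) samples
        kept.append(s)
    return kept

prefilter([[1.0, 2.0], [float("nan"), 2.0], [1.0, 99.0]], {1: (0.0, 10.0)})
```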
  • in step 204, the method comprises adding the first data sample to the dataset if the addition of the first data sample to the dataset increases diversity of the dataset, according to (e.g. based on) a diversity measure.
  • in other words, the first data sample is added to the dataset if it increases the diversity compared to the dataset without the first data sample.
  • the diversity measure is a metric or parameter that can be used to provide an estimate of the diversity of the samples in the dataset.
  • the diversity measure is used to determine if the addition of the first data sample to the dataset increases the diversity of the dataset. For example, in some embodiments, the diversity measure may be used to estimate the diversity of the dataset in the absence of the first data sample (e.g. before the first data sample is added to it). The diversity measure may then be updated or re-estimated for the dataset including the first data sample (e.g. after the first data sample is added to it). If the diversity is increased as a result of adding the first data sample to the dataset, then the first data sample is added to the dataset. An example is shown in Fig. 3, which shows an initial distribution 302 of a dataset. In step 204, if a (new) first data sample changes the distribution to that shown in graph 304, then it may be admitted to the dataset in preference to another data sample that would change the distribution to that shown in graph 306.
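An illustrative sketch of the before/after comparison in steps 202-204, assuming one-dimensional samples and taking the differential entropy of a single fitted Gaussian as the diversity measure (all names and the choice of measure are assumptions for brevity, not the claimed embodiments):

```python
import math

def gaussian_entropy(samples):
    """Differential entropy of one Gaussian fitted to 1-D samples:
    h = 0.5 * log(2 * pi * e * variance)."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n
    return 0.5 * math.log(2 * math.pi * math.e * var)

def maybe_admit(dataset, candidate):
    """Add `candidate` to `dataset` only if doing so increases the
    diversity measure (entropy after > entropy before)."""
    before = gaussian_entropy(dataset)
    after = gaussian_entropy(dataset + [candidate])
    if after > before:
        dataset.append(candidate)
        return True
    return False

data = [1.0, 1.1, 0.9, 1.05]
maybe_admit(data, 5.0)  # an outlying sample widens the distribution -> admitted
maybe_admit(data, 1.0)  # a sample near the bulk narrows it -> rejected
```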
  • the diversity measure may generally be any measure of spread of the distribution of data samples in the dataset.
  • the diversity measure may be a statistical measure of diversity.
  • the diversity measure may be based on entropy.
  • the diversity is measured using a differential entropy or Shannon entropy.
  • the diversity measure is policy-based or rule-based. For example, there may be predefined rules that describe whether a sample increases diversity.
  • An example of a rule is that if the sample originates from a new (e.g. previously unseen/or under-represented) environment e.g., new hardware, or software version, then this should be added to the dataset as it increases the diversity in terms of number of environments we have seen samples from.
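This rule could be sketched as follows; the `environment` metadata field attached to each sample is a hypothetical representation of the hardware/software version information mentioned above:

```python
def rule_based_increases_diversity(dataset, sample):
    """Policy/rule-based diversity check: a sample originating from a
    previously unseen execution environment (e.g. new hardware or a
    new software version) is deemed to increase diversity."""
    seen_envs = {s["environment"] for s in dataset}
    return sample["environment"] not in seen_envs

dataset = [{"environment": "vm-small", "cpu": 0.4},
           {"environment": "vm-small", "cpu": 0.5}]
rule_based_increases_diversity(dataset, {"environment": "vm-large", "cpu": 0.7})
```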
  • the diversity may be based on a diversity metric specified in a Service Level Agreement (SLA).
  • SLAs may be taken into account when a target requests the source model/dataset.
  • the target may request a diversity above a threshold, as specified in a SLA.
  • if the source model/dataset meets the requested diversity requirements, the source model may be shared; if it does not meet the requirements, the source model may not be shared.
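A minimal sketch of this SLA-gated sharing decision; the dict encoding of the SLA and the function name are assumptions for illustration:

```python
def handle_request(source_diversity, sla):
    """Decide whether to share the source dataset/model with a
    requesting target. `sla` is a hypothetical representation of the
    Service Level Agreement, e.g. {"min_diversity": 2.0}; if the SLA
    specifies no diversity requirement, the dataset is shared."""
    if source_diversity >= sla.get("min_diversity", float("-inf")):
        return "share"
    return "refuse"

handle_request(2.5, {"min_diversity": 2.0})  # meets the SLA threshold
```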
  • the method comprises: modelling the dataset using a generative model; updating the generative model with the first data sample; and determining whether the diversity of a distribution of the dataset, as modelled by the generative model, has increased as a result of updating the generative model with the first data sample.
  • generative models are a type of machine learning model that can be trained to model a real life data distribution and generate new data samples from said distribution.
  • the object of a generative model is to generate samples that are indistinguishable from real samples taken from the real distribution.
  • a generative model may thus be used to model the distribution of the dataset, and provide an estimation of how likely any given data sample is to have been selected from the modelled distribution.
  • Examples of generative models that may be used herein include but are not limited to: Gaussian mixture models (GMMs), Variational auto-encoders, Long Short-Term Memory Networks, State Space Models and/or Hidden Markov Models.
  • Gaussian mixture models are advantageous as they can generally be easily applied straight to the features. Long Short-Term Memory Networks, State Space Models or Hidden Markov Models may be useful for sequential data.
  • the method may comprise estimating the differential entropy or Shannon entropy using Monte Carlo sampling of the distribution of the dataset as modelled by the generative model.
  • one way of estimating the diversity is to use the differential Shannon entropy h(X_S) as the diversity measure for the source domain D_S, which is calculated as: h(X_S) = -∫ p_S(x) log p_S(x) dx, where p_S(·) is a probability density function for the source domain trace X_S.
  • p_S(·) can be estimated for the dataset by fitting a distribution.
  • the distribution may be obtained from the generative model (such as a Gaussian Mixture Model) fitted to the dataset.
  • the integral (which is also an expected value) can be estimated by simple Monte Carlo sampling.
  • h(X_S) may be estimated using Monte Carlo sampling of the distribution output by the generative model.
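As a sketch of this Monte Carlo estimate, assuming for simplicity a one-dimensional Gaussian mixture whose parameters have already been fitted (all function names are illustrative):

```python
import math
import random

def gmm_pdf(x, weights, means, stds):
    """Density p(x) of a 1-D Gaussian mixture (stands in for a fitted GMM)."""
    return sum(w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
               for w, m, s in zip(weights, means, stds))

def gmm_sample(weights, means, stds):
    """Draw one sample from the mixture."""
    i = random.choices(range(len(weights)), weights=weights)[0]
    return random.gauss(means[i], stds[i])

def mc_entropy(weights, means, stds, n=20000):
    """Monte Carlo estimate of differential entropy:
    h(X) = -E[log p(X)] ~= -(1/n) * sum_i log p(x_i), with x_i ~ p."""
    total = 0.0
    for _ in range(n):
        x = gmm_sample(weights, means, stds)
        total += math.log(gmm_pdf(x, weights, means, stds))
    return -total / n

# Sanity check: a unit Gaussian has h = 0.5*log(2*pi*e) ~ 1.4189 nats.
random.seed(0)
est = mc_entropy([1.0], [0.0], [1.0])
```

A wider, multi-modal mixture yields a higher entropy estimate, matching the intuition that a more spread-out dataset is more diverse.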
  • a size threshold (or maximum size) may be imposed on the dataset.
  • the size threshold may be based on computational or storage constraints. In such embodiments, if the dataset reaches the size threshold, then admittance of the first data sample may be dependent on removal of another data sample from the dataset that contributes less (or least) to the diversity of the dataset.
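The eviction behaviour at the size threshold could be sketched as follows, with any measure of spread over a list of samples standing in for the diversity measure (here the population standard deviation of one-dimensional samples); the function name and swap strategy are illustrative:

```python
from statistics import pstdev

def admit_with_eviction(dataset, candidate, max_size, diversity):
    """Admit `candidate` when below the size threshold; otherwise admit
    it only if swapping out some existing sample for the candidate
    yields a more diverse dataset than the current one."""
    if len(dataset) < max_size:
        dataset.append(candidate)
        return True
    current = diversity(dataset)
    # Try replacing each existing sample with the candidate; keep the
    # swap giving the most diverse dataset, if it beats the current one.
    options = [dataset[:i] + dataset[i + 1:] + [candidate]
               for i in range(len(dataset))]
    best = max(options, key=diversity)
    if diversity(best) > current:
        dataset[:] = best
        return True
    return False

ds = [1.0, 1.0, 1.0, 2.0]
admit_with_eviction(ds, 10.0, 4, pstdev)  # outlier admitted, a bulk sample evicted
```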
  • Synthetic data may also be used to increase the size of the dataset.
  • if the dataset is below a first size threshold (which may be, e.g., a target size threshold), the method 200 can further comprise using the generative model to generate new synthetic data samples that increase the diversity of the dataset.
  • the skilled person will be familiar with methods of using generative models to generate new samples. For example, using techniques such as CycleGAN or histogram equalization augmentation techniques as described in the paper entitled: “Data augmentation using generative adversarial networks (CycleGAN) to improve generalizability in CT segmentation tasks” by Sandfort, V., Yan, K., Pickhardt, P.J. et al. Sci Rep 9, 16884 (2019).
  • the method 200 described above may be used to determine whether a synthetic data sample adds to the diversity of the dataset.
  • the dataset may be supplemented with synthetic data generated using another data augmentation process.
  • the skilled person will be familiar with other data augmentation processes that can be used to supplement training datasets, but as an example, real numerical data samples may be smoothed, over-sampled, offset with random offsets, or have noise added to produce synthetic examples; image data (such as photographic data) may be transformed (e.g. rotated, cropped, enlarged or flipped), smoothed, or have contrast changes applied, so as to provide additional synthetic data samples.
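One such augmentation for numerical samples (a random offset plus additive Gaussian noise) might be sketched as follows; the parameter values and function name are illustrative, not from the application:

```python
import random

def augment_numeric(sample, noise_std=0.01, offset_range=0.05):
    """Produce a synthetic variant of a numeric data sample by applying
    a shared random offset and per-feature Gaussian noise."""
    offset = random.uniform(-offset_range, offset_range)
    return [x + offset + random.gauss(0.0, noise_std) for x in sample]

random.seed(1)
augment_numeric([1.0, 2.0, 3.0])  # a slightly perturbed copy of the sample
```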
  • the synthetic samples may be removed, e.g. before the ‘real’ data samples.
  • the method 200 may further comprise removing a second data sample from the dataset, wherein the second data sample was previously generated by the generative model and contributes least to the diversity of the dataset compared to other data samples previously generated by the generative model in the dataset.
  • the diversity measure may be used, in the manner described above, to determine which sample or samples to remove.
  • the first and second size thresholds may be the same threshold (e.g. set to the same value).
  • the method may further comprise: removing a third data sample from the dataset that contributes least to the diversity of the dataset compared to other data samples in the dataset.
  • the third size threshold may be the same as the first and/or second size thresholds (e.g. set to the same value).
  • the diversity measure may be used, in the manner described above, to determine which sample or samples to remove.
  • the third data sample e.g. the sample contributing less or least to the diversity and which should be selected for removal
  • the third data sample has a likelihood in the distribution of the dataset, above a first likelihood threshold.
  • having a high likelihood, or being part of a mode with a high weighting indicates a sample that is derived from the main part of the distribution and thus is less likely to add diversity to the dataset.
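This likelihood-based selection of a removal candidate can be sketched as follows, assuming a one-dimensional fitted density (names are illustrative):

```python
import math

def normal_pdf(x, mean=0.0, std=1.0):
    """Density of a Gaussian, standing in for the modelled distribution."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def least_diverse_sample(dataset, pdf):
    """The removal candidate is the sample with the highest likelihood
    under the modelled distribution: it lies in the bulk of the
    distribution and so contributes least to diversity."""
    return max(dataset, key=pdf)

least_diverse_sample([-2.5, 0.1, 3.0], normal_pdf)  # the sample nearest the mode
```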
  • the method 200 may be repeated, e.g. in a continuous manner, on different data samples in order to build up a dataset of sufficient size and diversity. The method 200 may be used to assemble a maximally diverse dataset from the available data samples.
  • a dataset produced using the method 200 may have various uses. For example, it may be used to train a general purpose machine learning model (e.g. to perform a generic predictive task). Such a general purpose model may be used as a source model in a transfer learning process.
  • the method 200 may comprise training a multi-purpose machine learning model using the dataset, and using the multi-purpose machine learning model as a source model in a transfer learning process. In this way, a multi-purpose, generic source model may be obtained that is optimised for subsequent transfer learning.
  • the advantage of having a multi-purpose dataset and training a multi-purpose model for transfer learning is that it is not necessary to perform "source model selection" and compare different source models/datasets (for example, by comparing the similarity of different sources to the target), which is time consuming and computationally expensive. There is also no need to store multiple source datasets/models for each ML task, nor for target samples to be available when choosing the source model. Thus, embodiments herein may result in reduced computational resource requirements, time saving, and/or storage savings (e.g. compared to saving every data sample).
  • the method 200 may be used in ML-based processes and services in a communications network.
  • the method 200 may be for enabling re-use of a machine learning-model based system for different network operation tasks in the communications network.
  • the method 200 may further comprise steps of obtaining a first model trained to perform a first network operation task, and performing further training on the first model to obtain a second model.
  • the second model is trained to perform a second network operation task, and the further training is performed using the dataset as training data.
  • the dataset produced using the method 200 may be optimally diverse and, as such, the dataset may be used in a transfer learning process in this way, to perform extra training to train a source model to perform a different task.
  • the first network operation task and/or the second network operation task may be related to anomaly detection, Key Performance Indicator, KPI, prediction, or network automation.
  • the dataset may comprise infrastructure data and/or data from an orchestrator obtained in a communications network.
  • data samples in the dataset may comprise KubernetesTM data.
  • KubernetesTM is an open-source orchestrator for managing applications running in containers.
  • each data sample in the dataset may comprise parameters such as: Collective CPU usage
  • a dataset comprising the above-mentioned parameters may be used to train different models, such as service placement, scaling, and/or routing models. These are models that decide where to place a service (service placement), when and how to scale (scaling), or how to route requests between services (routing). Models like this would typically be trained using reinforcement learning, but an initial model can be transferred from a related task, such as KPI prediction, with a collected dataset compiled according to the method 200. One could also construct a base model to be transferred using unsupervised methods (for instance autoencoders). A dataset comprising data samples with the parameters above may also be used to train models to perform network tasks such as anomaly detection, KPI prediction and/or network automation in a communications network.
  • the methods herein may be used to compile a dataset of features related to Internet of Things (IoT) devices used for manufacturing.
  • IoT Internet of Things
  • the data samples in the dataset may comprise parameters such as:
  • Such a dataset may be used to train different ML models such as:
  • Anomaly detection models, models for predicting product quality, troubleshooting models, and/or models for predicting customer complaints.
  • the methods herein may be used to compile a dataset of features related to Automated guided vehicles in factories.
  • the data samples in the dataset may comprise, amongst other parameters: Image data
  • Such a dataset may be used to train different ML models, such as: different Reinforcement Learning (RL) agents for deciding actions that should be performed in order for the vehicle to drive.
  • RL Reinforcement Learning
  • the skilled person will appreciate that these are merely examples; other data parameters may be collected, and the compiled dataset may be used to train other types of predictive models than those described above.
  • the first data sample is obtained from a second node in a communications network and the method further comprises sending a message to the second node, indicating whether the first data sample was added to the dataset.
  • the first node can give feedback to the second node, and the second node may use this feedback when determining whether to forward a data sample to the first node for consideration to be added to the dataset.
  • FIG. 4 illustrates an apparatus 400 according to some embodiments herein.
  • the apparatus may be part of the node 100 described above.
  • the apparatus 400 comprises the following 3 components:
  • the main component of the apparatus is the global data set component.
  • This component consists of two sub-components: The actual dataset and a model of the data set.
  • This model could for instance be, but is not limited to, a Gaussian mixture model.
  • new samples are obtained (according to step 202 described above) and investigated using various statistical approaches: if a new sample adds to the diversity of the global dataset (e.g. according to step 204 described above), it is admitted into the global dataset; otherwise it is rejected. Filtering out of anomalies, erroneous values or NaNs may also be performed by this module, since true outliers always increase the diversity of a dataset but may not produce a good dataset. Thresholds for such outliers must be manually configured before deployment of the system.
  • the purpose of the dataset update module is twofold: i. Remove unnecessary samples from the global dataset. This is done when the size of the dataset exceeds a maximum size (which may or may not be pre-defined), by removing the samples that contribute the least to the diversity, for instance the samples with the highest likelihood, or samples from modes with the highest weights. ii. Add synthetic samples. If the dataset is smaller than a pre-defined threshold, the dataset model can be used to generate samples that are realistic but under-represented, so that the diversity of the dataset is increased.
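Part (ii) of the update module might be sketched as follows, assuming a 1-D dataset and a single fitted Gaussian standing in for the dataset model (the text suggests e.g. a Gaussian mixture model). The under-representation test used here — keeping only draws more than one standard deviation from the mode — is an illustrative simplification.

```python
import random
import statistics

def augment(dataset, target_size, seed=0):
    """Grow `dataset` to `target_size` with synthetic samples drawn from a
    Gaussian fitted to the data, keeping only draws that fall in the
    under-represented tails, i.e. "realistic but under-represented".
    """
    rng = random.Random(seed)
    mu = statistics.fmean(dataset)
    sigma = statistics.pstdev(dataset) or 1.0
    while len(dataset) < target_size:
        candidate = rng.gauss(mu, sigma)
        # Keep draws away from the mode, where coverage is already thin.
        if abs(candidate - mu) > sigma:
            dataset.append(candidate)
    return dataset
```

A real implementation would refit the model as samples are added and would sample from low-weight mixture components rather than thresholding on distance from a single mean.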
  • the apparatus 400 may perform the method 500 shown in Fig. 5.
  • the method 500 has the following steps:
  • Step 1: the apparatus receives one or more samples from a worker (e.g. in the manner of step 202, described above).
  • the sample/samples is/are accepted and added to the global data set (step 204 of the method 200 above). The fact that the sample/samples are accepted is reported back to the source nodes of the data.
  • Fig. 6 shows a signal diagram showing the signals sent between an apparatus 602 performing the method 500 on data received from a data source 604 and a data user 606.
  • the node data source 1 604 sends samples that increase the diversity of the global data set.
  • the following signals are sent: S1 : data source 1 sends data sample(s) to apparatus 602.
  • S2: apparatus 602 indicates that one or more of the samples were accepted into the dataset.
  • apparatus 602 sends global dataset to user for use in e.g. a transfer learning problem.
  • Fig. 7 shows a signal diagram showing the signals sent between the apparatus 602 performing the method 500 and a second data source 608.
  • the node data source 2 608 sends samples that do not increase the diversity of the global data set.
  • the following signals are sent: S5: data source 2 sends data sample(s) to apparatus 602.
  • apparatus 602 indicates that one or more of the samples were rejected.
  • FIG. 8 shows an example environment with n Edge Datacenters (DC1, DC2, DC3) continuously measuring a set of features that are sent to a source manager 802.
  • DC1, DC2, DC3 Edge Datacenters
  • These features are used as inputs for a set of ML models in order to help with network and edge DC management tasks. Examples could be for instance KPI prediction or anomaly detection.
  • the achieved technical effect is to have a single augmented data source that manages collected data from multiple sources. This achieves lower storage overhead (less memory required) and removes the need to select which source model to transfer from: the method collects one universally good source dataset, which can be used for transfer learning with either an existing task or a new task.
  • the source manager 802 performs the method 500 as illustrated in Fig. 5 on the data samples sent to the source manager by the datacentres DC1, DC2, DC3.
  • the steps outlined below are made with reference to the steps shown in Fig. 5:
  • Step 1 The samples from the different sources are sent to the source manager (and received by the source manager according to step 202 of the method 200 above). In this embodiment, this could be done once per day, sending all collected samples X, where X is an (a, b) matrix, a being the number of measurements being sent and b the number of measured features.
  • Step 2 Remove samples where feature values are missing or obviously wrong; what counts as “obviously wrong” can be defined by a domain expert. For instance, a sample containing a feature corresponding to temperature showing a value below 0 K or above 100 degrees Celsius will be filtered out.
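The domain-expert filtering of Step 2 could look like the following sketch; the feature names and validity ranges are hypothetical placeholders for expert-configured values.

```python
import math

# Domain-expert validity ranges per feature, as in the temperature example
# above (0 K = -273.15 degrees Celsius); feature names are illustrative.
VALID_RANGES = {"temperature_c": (-273.15, 100.0), "cpu_load": (0.0, 1.0)}

def is_plausible(sample):
    """Reject samples with missing (NaN/None) or obviously-wrong values."""
    for feature, value in sample.items():
        if value is None or (isinstance(value, float) and math.isnan(value)):
            return False
        low, high = VALID_RANGES.get(feature, (float("-inf"), float("inf")))
        if not (low <= value <= high):
            return False
    return True
```

Samples failing `is_plausible` would be dropped before the GMM of Step 3 is ever trained on them.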
  • Step 3 In the source manager there is a Gaussian mixture model (GMM), which is an example of a generative model. This model is trained using the samples that already exist in the sample dataset (the source dataset) and the new samples combined. If the dataset is empty, then the GMM is trained using only the newly incoming samples.
  • GMM Gaussian mixture model
  • Step 5a If the diversity has increased with the new samples, the samples are accepted and added to the global data set.
  • the model manager sends back information to the edge data center from which the new samples came, that the information was useful. This could then potentially be used in the edge node for decisions on what samples to send in the future.
  • Step 5b If not, the sample(s) are rejected, the generative model is reverted to its previous state, and the sample(s) are not added to the global data set.
  • the model manager sends back information to the edge data center from which the new samples came, that the information was not useful. This can potentially be used in the edge node for future filtering.
  • Step 6 Check if the global dataset is larger than the size limit
  • Step 7a If the dataset is larger than the size limit: remove the samples that are least useful for the dataset diversity. This can be done in the following way: compute the likelihood of all samples in the dataset and remove the ones with the highest likelihood. This way, samples are removed from regions where there are already “enough” samples in the dataset.
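Step 7a can be sketched as below, again with a single 1-D Gaussian standing in for the GMM of the text: `prune` repeatedly drops the sample with the highest likelihood under the model fitted to the current dataset.

```python
import math
import statistics

def gauss_pdf(x, mu, sigma):
    """Likelihood of x under a 1-D Gaussian (stand-in for a GMM density)."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (
        sigma * math.sqrt(2 * math.pi)
    )

def prune(dataset, max_size):
    """While over the size limit, drop the highest-likelihood sample,
    i.e. a sample from the already well-covered part of the distribution.
    """
    while len(dataset) > max_size:
        mu = statistics.fmean(dataset)
        sigma = statistics.pstdev(dataset) or 1.0
        dataset.remove(max(dataset, key=lambda x: gauss_pdf(x, mu, sigma)))
    return dataset
```

Note that the model is refitted after each removal; with a true GMM one could equivalently drop samples from the mode with the highest weight.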
  • a source manager, such as that illustrated in Fig. 8, could be included in a function for measuring wireless connectivity performance of mobile devices (e.g. user equipments, UEs).
  • the connected device sends performance measurement readings to a central database in the cloud, to which an analytics engine is connected.
  • the collected data can be used as an input for optimizing the network and for added functionality such as predictive mobility, etc.
  • the source manager could be applied at the interface between a data streamer and a big-data database.
  • the mobile devices send data samples to the data streamer, and admission control to (and continuous maintenance of) the big-data database is performed by the source manager. If used like this, the technical effect is reducing the size of the data stored and possibly reducing the sending of unnecessary samples, if the feedback mechanism described in step 5 of Fig. 5 is performed.
  • the Open Radio Access Network (O-RAN) is an open standard for next generation radio access networks, driven mostly by telecom operators.
  • An Intelligent Management and Orchestration system within the framework of O-RAN has been proposed, which specifically highlights the importance of AI model management, data analytics, and training capabilities.
  • the AI model management component is a functionality for life cycle management (LCM) of models and source domains.
  • the proposed source manager described in Fig. 4 above could reside as a function within an AI model management module, supporting the general model management and LCM activities.
  • the method accepts samples that increase the diversity of the source dataset, optionally uses a generative model during sample admission to enhance the dataset further, and keeps the size of the dataset below a pre-defined maximum size.
  • This allows for the creation of a source dataset in a target domain-, model- and task-agnostic way. This is enabled by maximizing the diversity of the global dataset, which lets the dataset be constructed before the need arises, saving time and overhead, as a single source dataset can be used for multiple purposes.
  • the method enables creation of source datasets of a certain data quality.
  • a computer program product comprising a computer readable medium, the computer readable medium having computer readable code embodied therein, the computer readable code being configured such that, on execution by a suitable computer or processor, the computer or processor is caused to perform the method or methods described herein.
  • the disclosure also applies to computer programs, particularly computer programs on or in a carrier, adapted to put embodiments into practice.
  • the program may be in the form of a source code, an object code, a code intermediate source and an object code such as in a partially compiled form, or in any other form suitable for use in the implementation of the method according to the embodiments described herein.
  • a program code implementing the functionality of the method or system may be sub-divided into one or more sub-routines.
  • the sub-routines may be stored together in one executable file to form a self-contained program.
  • Such an executable file may comprise computer-executable instructions, for example, processor instructions and/or interpreter instructions (e.g. Java interpreter instructions).
  • one or more or all of the sub-routines may be stored in at least one external library file and linked with a main program either statically or dynamically, e.g. at runtime.
  • the main program contains at least one call to at least one of the sub-routines.
  • the subroutines may also comprise function calls to each other.
  • the carrier of a computer program may be any entity or device capable of carrying the program.
  • the carrier may include a data storage, such as a ROM, for example, a CD ROM or a semiconductor ROM, or a magnetic recording medium, for example, a hard disk.
  • the carrier may be a transmissible carrier such as an electric or optical signal, which may be conveyed via electric or optical cable or by radio or other means.
  • the carrier may be constituted by such a cable or other device or means.
  • the carrier may be an integrated circuit in which the program is embedded, the integrated circuit being adapted to perform, or used in the performance of, the relevant method.
  • a computer program may be stored/distributed on a suitable medium, such as an optical storage medium or a solid-state medium supplied together with or as part of other hardware, but may also be distributed in other forms, such as via the Internet or other wired or wireless telecommunication systems. Any reference signs in the claims should not be construed as limiting the scope.


Abstract

A computer implemented method of assembling a multi-purpose dataset suitable for use in training different machine learning models to perform different tasks, the method comprising obtaining a first data sample, and adding the first data sample to the dataset if the addition of the first data sample to the dataset increases the diversity of the dataset, according to a diversity measure.

Description

ASSEMBLING A MULTI-PURPOSE DATASET
Technical Field
This disclosure relates to methods and nodes in distributed computing systems such as communications networks. More particularly but non-exclusively, the disclosure relates to assembling multi-purpose datasets suitable for use in training different machine learning models to perform different tasks.
Background
Management of telecoms systems is challenging due to component, infrastructure, and service complexity, heterogeneity, scale, and dynamicity. Promising management approaches based on machine learning (ML) have been developed in academia and industry. However, a key challenge in data-driven model creation is the difficulty in maintaining the accuracy of a model over time, as well as how best to reuse knowledge learnt for one type of execution environment.
In recent years, transfer learning has received considerable attention, specifically in areas such as image, video, and sound recognition. In traditional machine learning, each task is learnt from scratch using training data obtained from a domain, and the respective model is trained to make predictions for new data from the same domain. However, sometimes there is not a sufficient amount of data for training in the domain of interest. In such cases, transfer learning can be used to transfer knowledge from a domain where sufficient training data is available to the domain of interest, in order to improve the accuracy of the machine learning task.
Transfer learning is defined as follows. Given a source domain DS and learning task TS, and a target domain DT and learning task TT, transfer learning aims to help improve the learning of the target predictive function fT(·) in DT using the knowledge in DS and TS, where DS ≠ DT or TS ≠ TT.
Transfer learning methods can be divided into two main categories: homogeneous and heterogeneous. In homogeneous transfer learning the feature spaces in the source and target domains are the same, while in heterogeneous transfer learning the source and target domains can have different feature spaces.
In telecoms/edge cloud environments, a source domain may refer to an ML model trained for a specific type of execution environment (e.g. a VM executing with a specific configuration), whereas the target domain corresponds to a scaled or migrated version of the same environment. In distributed systems such as telecom/edge clouds there are typically many source domains available at the same time, from different execution environments.
In certain applications, there may be limited understanding of the target domain due to the lack of availability of data samples that are representative of the domain, for example because of difficulties in collecting data, limitations in storing data, and the dynamic nature of the execution environment in the target domain.
The disclosure herein aims to address some of these issues amongst others.
Summary
As described above, transfer learning is an approach that aims to address certain issues associated with training machine learning models in target domains for which there is limited data with which to train the model. Transfer learning addresses the problem by incorporating knowledge gained from other source domains into the target domain. The training task is then reduced to one of fine-tuning in the target domain (as opposed to complete training of a completely new model). The transferred knowledge from other sources should be relevant to the target domain, and in the setting of the distributed cloud of 5G, there are typically multiple deployments of infrastructure similar to the target.
Methods for selection of data samples in the source domain have been studied before. In the paper by Yan, Acuna & Fidler (2020) entitled: “Neural Data Server: A Large-Scale Search Engine for Transfer Learning Data”, https://arxiv.org/abs/2001.02799, a large-scale search engine for selecting the most useful source dataset for transfer learning is presented. The motivation for this work is that pre-training on selected relevant data samples in the source domain is important for achieving good performance in the target domain. In the proposed solution the Neural Data Server has access to a large dataset and the clients have limited target data, and the data and ML architecture of the client are not shared with the server due to privacy concerns. The server represents the source dataset using a mixture of experts (MoE), which partitions the source dataset into mutually exclusive subsets and trains a classifier for each subset. The client downloads the experts to evaluate the performance of each expert on the target data. The performance information is sent to the server so that the most relevant samples in the source can be selected. The selected data samples are then sent to the client, so that it can use them for training its model.
The method in Yan, Acuna & Fidler (2020) requires all data to be stored, and sample selection is done by looking at the performance of the different source models on the target task. This implies the need to store all source data separately for all sources, which causes overhead and removes the option of having a source model already available when the need for transfer learning arises; that is, the source and target models have to be trained at the same time. So the options are to store everything and wait until there is a target task, or to use the data for just a single task. In other words, all the selection here is done only after a target task has been identified, which is slow.
In the paper by Jamshidi, Christian & Siegmund (2018) entitled: “Learning to Sample: Exploiting Similarities Across Environments to Learn Performance Models for Configurable Systems”, a guided sampling strategy is proposed. The sampling strategy exploits knowledge from various source domains similar to the target domain.
There have also been studies that looked into selection of samples in the target domain. The paper by Khan, Hon & Abraham (2019) entitled: “Transfer Learning with intelligent training data selection for prediction of Alzheimer’s Disease”, describes a method for selecting better samples in the target environment based on information from the source environment. The proposed method uses entropy to select the most informative images (slices from MRI data) to select training samples in the target domain (not in the source domain).
The citations above explicitly use the target data for determining the good samples in the source. This imposes the limitation that the process of building a source dataset and training the source model cannot happen until the need for it arises (e.g. until the target is defined). This limits the applicability of these methods for dynamically changing environments, such as the Cloud, where the target domain can change frequently. Furthermore, the processes have to be repeated for every new target model, which results in high overhead.
The disclosure herein aims to improve on some of the problems associated with data collection in transfer learning.
According to a first aspect herein there is a computer implemented method of assembling a multi-purpose dataset suitable for use in training different machine learning models to perform different tasks. The method comprises: obtaining a first data sample; and adding the first data sample to the dataset if the addition of the first data sample to the dataset increases diversity of the dataset, according to a diversity measure.
According to a second aspect herein there is a computing node for assembling a multi-purpose dataset suitable for use in training different machine learning models to perform different tasks, the node being configured to: obtain a first data sample; and add the first data sample to the dataset if the addition of the first data sample to the dataset increases diversity of the dataset, according to a diversity measure.
According to a third aspect herein there is a computing node for assembling a multi-purpose dataset suitable for use in training different machine learning models to perform different tasks, the node comprises: a memory comprising instruction data representing a set of instructions; and a processor configured to communicate with the memory and to execute the set of instructions. The set of instructions, when executed by the processor, cause the processor to: obtain a first data sample; and add the first data sample to the dataset if the addition of the first data sample to the dataset increases diversity of the dataset, according to a diversity measure.
According to a fourth aspect there is a computer program comprising instructions which, when executed on at least one processor, cause the at least one processor to carry out a method according to the first aspect.
According to a fifth aspect there is a carrier containing a computer program according to the fourth aspect, wherein the carrier comprises one of an electronic signal, optical signal, radio signal or computer readable storage medium.
According to a sixth aspect there is a computer program product comprising non-transitory computer readable media having stored thereon a computer program according to the fourth aspect.
Brief Description of the Drawings
For a better understanding and to show more clearly how embodiments herein may be carried into effect, reference will now be made, by way of example only, to the accompanying drawings, in which:
Fig. 1 shows a node according to some embodiments herein;
Fig. 2 shows a method according to some embodiments herein;
Fig. 3 shows example distributions;
Fig. 4 shows an example apparatus architecture;
Fig. 5 shows an example method;
Fig. 6 shows an example signalling diagram;
Fig. 7 shows an example signalling diagram; and
Fig. 8 shows example data centres interacting with a source manager.
Detailed Description
The disclosure herein relates to a communications network (or telecommunications network). A communications network may comprise any one, or any combination of: a wired link (e.g. ADSL) or a wireless link such as Global System for Mobile Communications (GSM), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), New Radio (NR), WiFi, Bluetooth or future wireless technologies. The skilled person will appreciate that these are merely examples and that the communications network may comprise other types of links. A wireless network may be configured to operate according to specific standards or other types of predefined rules or procedures. Thus, particular embodiments of the wireless network may implement communication standards, such as Global System for Mobile Communications (GSM), Universal Mobile Telecommunications System (UMTS), Long Term Evolution (LTE), and/or other suitable 2G, 3G, 4G, or 5G standards; wireless local area network (WLAN) standards, such as the IEEE 802.11 standards; and/or any other appropriate wireless communication standard, such as the Worldwide Interoperability for Microwave Access (WiMax), Bluetooth, Z-Wave and/or ZigBee standards.
Fig. 1 shows a node (e.g. a computing node, or computer node) according to some embodiments herein. The node 100 is configured (e.g. adapted, operative, or programmed) to perform any of the embodiments of the method 200 as described below. It will be appreciated that the node 100 may comprise one or more virtual machines running different software and/or processes. The node 100 may therefore comprise one or more servers, switches and/or storage devices and/or may comprise cloud computing infrastructure or infrastructure configured to perform in a distributed manner, that runs the software and/or processes.
The node 100 may comprise a processor (e.g. processing circuitry or logic) 102. The processor 102 may control the operation of the node 100 in the manner described herein. The processor 102 can comprise one or more processors, processing units, multi-core processors or modules that are configured or programmed to control the node 100 in the manner described herein. In particular implementations, the processor 102 can comprise a plurality of software and/or hardware modules that are each configured to perform, or are for performing, individual or multiple steps of the functionality of the node 100 as described herein.
The node 100 may comprise a memory 104. In some embodiments, the memory 104 of the node 100 can be configured to store program code or instructions 106 that can be executed by the processor 102 of the node 100 to perform the functionality described herein. Alternatively or in addition, the memory 104 of the node 100, can be configured to store any requests, resources, information, data, signals, or similar that are described herein. The processor 102 of the node 100 may be configured to control the memory 104 of the node 100 to store any requests, resources, information, data, signals, or similar that are described herein.
The node 100 may generally be any computing node or computer device suitable for performing the functionality herein. In some embodiments, the node 100 is a network node in a communications network, as described above. Generally, node 100 may comprise any component or network function (e.g. any hardware or software module) in the communications network suitable for performing the functions described herein. For example, a node may comprise equipment capable, configured, arranged and/or operable to communicate directly or indirectly with a UE (such as a wireless device) and/or with other network nodes or equipment in the communications network to enable and/or provide wireless or wired access to the UE and/or to perform other functions (e.g., administration) in the communications network. Examples of nodes include, but are not limited to, access points (APs) (e.g., radio access points), base stations (BSs) (e.g., radio base stations, Node Bs, evolved Node Bs (eNBs) and NR NodeBs (gNBs)). Further examples of nodes include but are not limited to core network functions such as, for example, core network functions in a Fifth Generation Core network (5GC).
It will be appreciated that the node 100 may comprise other components in addition or alternatively to those indicated in Fig. 1. For example, in some embodiments, the node 100 may comprise a communications interface. The communications interface may be for use in communicating with other nodes in a communications network, (e.g. such as other physical or virtual nodes). For example, the communications interface may be configured to transmit to and/or receive from other nodes or network functions requests, resources, information, data, signals, or similar. The processor 102 of node 100 may be configured to control such a communications interface to transmit to and/or receive from other nodes or network functions requests, resources, information, data, signals, or similar.
Briefly, the node 100 is for assembling a multi-purpose dataset suitable for use in training different machine learning models to perform different tasks. The node 100 may be configured to obtain a first data sample, and add the first data sample to the dataset if the addition of the first data sample to the dataset increases diversity of the dataset, according to a diversity measure.
In this manner, over time, a (maximally) diverse dataset can be compiled. A dataset compiled in this manner may have many uses: for example, it could be used to train a multi-purpose machine learning model that can be used as a source model in a transfer learning process for a wide variety of target domains, thus optimising the transfer learning process. Alternatively or additionally, a dataset compiled according to the principles herein may be used to re-train a source model to perform a second task in a target domain, again optimising the transfer learning process, as one such dataset may contain enough samples with enough diversity to re-train many different models for many different predictive tasks.
Thus, presented herein are methods, nodes, computer programs, computer program products and computer carriers that can combine measurements from different source domains to create a single, maximally diverse, general-purpose source dataset, which can be used for transfer learning to different target domains. The samples in the source domain are selected independently of the target domain, in an efficient manner, allowing a single multi-purpose dataset to be created in a computationally efficient and storage efficient manner, thus providing improved source management for transfer learning.
The skilled person will be familiar with machine learning and, more particularly, transfer learning, which is described in the paper by Pan & Yang (2010) entitled, “A Survey on Transfer Learning” IEEE Transactions on Knowledge and Data Engineering (Volume: 22, Issue: 10, Oct. 2010). In transfer learning, learning from a source domain (e.g. a first domain) is used as a starting point for training or refining of a model suitable for use in a target domain (e.g. a second domain). In practice, the source model trained in the source domain may, for example, be a model trained to perform a different but related task. The learnt weights of the source model may be used as the starting point for training the target model. This is particularly useful if there is less data in the target domain, in which case the data that is available can be used to fine-tune the source domain model. The methods described herein may also be used in Domain Adaptation, which is similar to transfer learning, but where there are many unlabelled samples in the target domain, and no or little labelled data.
Fig. 2 shows a method 200 that may be performed by the node 100 described above. The method 200 is a computer implemented method. The method 200 is for assembling a multi-purpose dataset suitable for use in training different machine learning models to perform different tasks. Briefly, the method comprises in a first step 202, obtaining a first data sample; and in a second step 204 the method comprises: adding the first data sample to the dataset if the addition of the first data sample to the dataset increases diversity of the dataset, according to a diversity measure.
The dataset may comprise a plurality of data samples obtained from a plurality of nodes (e.g. a plurality of data sources) in a distributed computing system. Each data sample may comprise a plurality of measurements and/or features measured by a respective one of the plurality of nodes in the distributed computing system.
The data produced by (or obtained from) each of the plurality of nodes (or data sources) may generally have similar or overlapping feature spaces. In other words, the data samples will have common features or parameters.
In some embodiments, the method 200 is performed by a node in a communications network. As such, the method may be for assembling a multi-purpose dataset of measurements and/or features in a communications network, suitable for training different machine learning models to perform different network operation tasks in the communications network.
As an example, the plurality of nodes may comprise edge cloud nodes and/or base stations in a communications network. In scenarios where data is in over-abundance and where it is computationally inefficient to store all the possible data samples that are available, the method 200 may be for use in selecting a diverse (or maximally diverse) multi-purpose dataset, of fixed size, from a great many data samples available from a plurality of edge devices and/or base stations.
The method 200 may enable re-use of a machine learning model based system for different network operation tasks in the communications network by providing a dataset that can be used in a transfer learning process to further train a source model for use in a target domain. This is described in more detail below. In one example, the plurality of nodes comprise user devices and the plurality of data samples comprise measurements of wireless connectivity performance. In such an example, the multi-purpose dataset is for use in training different machine learning models to perform different optimisation or orchestration tasks in the communications network.
More generally, network operation tasks include but are not limited to: anomaly detection, Key Performance Indicator (KPI) prediction, orchestration and network automation tasks.
In step 202, a first data sample is obtained. The first data sample may be a new or previously unseen data sample. The purpose of the method 200 is then to determine whether to add the first data sample to the dataset or whether to discard it.
The first data sample may be obtained from one of the plurality of nodes in a distributed computing system, as described above. For example, if the method 200 is performed by a first node in a communications system, step 202 may comprise the first node receiving a first message from a second node in the communications system, the first message comprising the data sample. The second node may be referred to herein as the source node of the first data sample.
The first message may be sent by the second node, e.g. when the data sample is received or compiled by the second node. In other examples, the first node may send a request to the second node for data samples and the second node may send the first data sample to the first node in response to such a request.
Following step 202, data pre-processing may be performed, such as filtering out outliers or erroneous datapoints. Such filtering may be performed using a sequence of manually configured thresholds or conditions, e.g. to reject data samples containing NaNs, for example.
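Such manually configured filtering may be sketched as follows; the feature names and bounds are illustrative assumptions, not part of the method itself:

```python
import math

# Illustrative, manually configured plausible bounds per feature.
# The specific features and limits here are assumptions for this sketch;
# in practice they would be set by a domain expert before deployment.
FEATURE_BOUNDS = {
    "temperature_c": (-273.15, 100.0),  # reject physically implausible readings
    "cpu_load": (0.0, 1.0),
}

def passes_filter(sample: dict) -> bool:
    """Reject samples with missing values, NaNs, or values outside the
    manually configured range for each feature."""
    for name, (lo, hi) in FEATURE_BOUNDS.items():
        value = sample.get(name)
        if value is None or math.isnan(value) or not (lo <= value <= hi):
            return False
    return True
```

Samples failing the filter would be discarded before the diversity check of step 204 is applied.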
In step 204 the method comprises adding the first data sample to the dataset if the addition of the first data sample to the dataset increases diversity of the dataset, according to (e.g. based on) a diversity measure. That is, the first data sample is added to the dataset if it increases the diversity compared to if the first data sample were not included in the dataset.
The diversity measure is a metric or parameter that can be used to provide an estimate of the diversity of the samples in the dataset. The diversity measure is used to determine if the addition of the first data sample to the dataset increases the diversity of the dataset. For example, in some embodiments, the diversity measure may be used to estimate the diversity of the dataset in the absence of the first data sample (e.g. before the first data sample is added to it). The diversity measure may then be updated or re-estimated for the dataset including the first data sample (e.g. after the first data sample is added to it). If the diversity is increased as a result of adding the first data sample to the dataset, then the first data sample is added to the dataset. An example is shown in Fig. 3, which shows an initial distribution 302 of a dataset. In step 204, if a (new) first data sample changes the distribution to that shown in graph 304, it may be admitted to the dataset in preference to another data sample that would change the distribution to that shown in graph 306.
The diversity measure may generally be any measure of spread of the distribution of data samples in the dataset. The diversity measure may be a statistical measure of diversity. For example, the diversity measure may be based on entropy. For example, the diversity is measured using a differential entropy or Shannon entropy.
In another example, the diversity measure is policy-based or rule-based. For example, there may be predefined rules that describe whether a sample increases diversity. An example of a rule is that if the sample originates from a new (e.g. previously unseen/or under-represented) environment e.g., new hardware, or software version, then this should be added to the dataset as it increases the diversity in terms of number of environments we have seen samples from.
In another example, the diversity may be based on a diversity metric specified in a Service Level Agreement (SLA). For example, SLAs may be taken into account when a target requests the source model/dataset. For example, the target may request a diversity above a threshold, as specified in a SLA. In such an example, if the diversity of the dataset meets the requirements in the SLA, then the source model may be shared; if it does not meet the requirements, the source model may not be shared.
In some embodiments, the method comprises: modelling the dataset using a generative model; updating the generative model with the first data sample; and determining whether the diversity of a distribution of the dataset, as modelled by the generative model, has increased as a result of updating the generative model with the first data sample.
The skilled person will be familiar with generative models which are a type of machine learning model that can be trained to model a real life data distribution and generate new data samples from said distribution. The object of a generative model is to generate samples that are indistinguishable from real samples taken from the real distribution. A generative model may thus be used to model the distribution of the dataset, and provide an estimation of how likely any given data sample is to have been selected from the modelled distribution.
Examples of generative models that may be used herein include but are not limited to: Gaussian mixture models (GMMs), Variational auto-encoders, Long Short-Term Memory Networks, State Space Models and/or Hidden Markov Models. The skilled person will be familiar with these types of generative models. GMMs, for example, are described in the paper by Nasios, N., & Bors, A. (2006). Variational learning for Gaussian mixture models. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 36, 849-862. Gaussian mixture models are advantageous as they can generally be easily applied straight to the features. Long Short-Term Memory Networks, State Space Models or Hidden Markov Models may be useful for sequential data.
Generally, in step 204 the method may comprise estimating the differential entropy or Shannon entropy using Monte Carlo sampling of the distribution of the dataset as modelled by the generative model.
For example, one way of estimating the diversity is to use the differential Shannon entropy h(Xs) as the diversity measure for the source domain D_S, which is calculated as:

h(Xs) = - ∫ ps(x) log ps(x) dx

where ps( ) is a probability density function for the source domain trace Xs. In practice, ps( ) can be estimated for the dataset by fitting a distribution. For example, the distribution may be obtained from the generative model (such as a Gaussian Mixture Model) fitted to the dataset. The integral (which is also an expected value) can be estimated by simple Monte Carlo sampling.
In another example, the differential entropy may be used as the diversity measure, which can be calculated according to: h(x) = -Ex~p(x) log p(x), where p(x) is the distribution of the dataset, e.g. as output from the generative model (which may, e.g., be a GMM). Again, h(x) may be estimated using Monte Carlo sampling of the distribution output from the generative model.
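The Monte Carlo entropy estimate can be sketched as follows, assuming a scikit-learn Gaussian mixture as the generative model; the function name is an illustrative assumption:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def differential_entropy(gmm, n_samples=10000):
    """Monte Carlo estimate of h(x) = -E_{x~p(x)}[log p(x)], where p(x) is
    the density modelled by a fitted scikit-learn GaussianMixture."""
    x, _ = gmm.sample(n_samples)         # draw x_i ~ p(x)
    return -gmm.score_samples(x).mean()  # average of -log p(x_i)
```

For instance, a GMM fitted to widely spread data yields a larger estimate than one fitted to tightly clustered data, so the value can be compared before and after a candidate sample is added.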
The skilled person will appreciate that these are merely examples and that diversity may be estimated in other ways to those described above.
In some examples, a size threshold (or maximum size) may be imposed on the dataset. The size threshold may be based on computational or storage constraints. In such embodiments, if the dataset reaches the size threshold, then admittance of the first data sample may be dependent on removal of another data sample from the dataset that contributes less (or least) to the diversity of the dataset.
Synthetic data may also be used to increase the size of the dataset. Thus, if the size of the dataset is below a first size threshold (which may be e.g. a target size threshold), the method 200 can further comprise using the generative model to generate new synthetic data samples that increase the diversity of the dataset. The skilled person will be familiar with methods of using generative models to generate new samples. For example, using techniques such as CycleGAN or histogram equalization augmentation techniques as described in the paper entitled: “Data augmentation using generative adversarial networks (CycleGAN) to improve generalizability in CT segmentation tasks” by Sandfort, V., Yan, K., Pickhardt, P.J. et al. Sci Rep 9, 16884 (2019). The method 200 described above may be used to determine whether a synthetic data sample adds to the diversity of the dataset.
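A minimal generate-and-test sketch of this idea, assuming a scikit-learn Gaussian mixture as the generative model; the function names and parameters are illustrative assumptions, and the entropy estimate is the Monte Carlo approach described earlier:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def entropy(gmm, n=5000):
    """Monte Carlo estimate of the differential entropy of a fitted GMM."""
    x, _ = gmm.sample(n)
    return -gmm.score_samples(x).mean()

def generate_if_diverse(dataset, n_candidates=20, n_components=2, seed=0):
    """Draw candidate synthetic samples from a GMM fitted to the dataset and
    keep only those whose addition increases the estimated entropy of a
    refitted model, i.e. the method 200 check applied to synthetic samples."""
    def fit(d):
        return GaussianMixture(n_components=n_components, random_state=seed).fit(d)
    base_model = fit(dataset)
    base_h = entropy(base_model)
    candidates, _ = base_model.sample(n_candidates)
    accepted = [c for c in candidates
                if entropy(fit(np.vstack([dataset, c[None, :]]))) > base_h]
    return np.array(accepted)
```

Refitting the model per candidate is expensive; an implementation might instead batch candidates or update the model incrementally.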
As another example, the dataset may be supplemented with synthetic data generated using another data augmentation process. The skilled person will be familiar with other data augmentation processes that can be used to supplement training datasets but, as an example, real numerical data samples may be smoothed, over-sampled, offset with random offsets, or have noise added to produce synthetic examples; image data (such as photographic data) may be transformed (e.g. rotated, cropped, enlarged or flipped), smoothed or have contrast changes applied, so as to provide additional synthetic data samples. These are merely examples, however, and it will be appreciated that many other methods may equally be used to produce synthetic data examples in order to increase the size of a training dataset.
In the event that the dataset becomes too large, then the synthetic samples may be removed, e.g. before the ‘real’ data samples. For example, if the size of the dataset is above a second size threshold, the method 200 may further comprise removing a second data sample from the dataset, wherein the second data sample was previously generated by the generative model and contributes least to the diversity of the dataset compared to other data samples previously generated by the generative model in the dataset. In such an example, the diversity measure may be used, in the manner described above, to determine which sample or samples to remove. It will be appreciated that the first and second size thresholds may be the same threshold (e.g. set to the same value).
If there are no synthetic samples then a (real) data sample that contributes least to the diversity may be removed instead. For example, if the size of the dataset is above a third size threshold, the method may further comprise: removing a third data sample from the dataset that contributes least to the diversity of the dataset compared to other data samples in the dataset. It will be appreciated that the third size threshold may be the same as the first and/or second size thresholds (e.g. set to the same value). In such an example, the diversity measure may be used, in the manner described above, to determine which sample or samples to remove. For example, the third data sample (e.g. the sample contributing less or least to the diversity and which should be selected for removal) may be selected from a mode of the generative model with a weight above a weight threshold. E.g. the mode with the highest weight. As another example, according to the generative model, the third data sample has a likelihood in the distribution of the dataset, above a first likelihood threshold. In both examples, having a high likelihood, or being part of a mode with a high weighting indicates a sample that is derived from the main part of the distribution and thus is less likely to add diversity to the dataset. It will be appreciated that the method 200 may be repeated, e.g. in a continuous manner, on different data samples in order to build up a dataset of sufficient size and diversity. The method 200 may be used to assemble a maximally diverse dataset from the available data samples.
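The likelihood-based removal described above can be sketched as follows, assuming a scikit-learn Gaussian mixture as the generative model; the function name and defaults are illustrative assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def prune_least_diverse(dataset, n_remove=1, n_components=1, seed=0):
    """Remove the n_remove samples with the highest likelihood under a GMM
    fitted to the dataset; such samples lie in the densest part of the
    modelled distribution and so contribute least to its diversity.
    (An alternative criterion, per the text, is to select samples from the
    mixture mode with the highest weight.)"""
    gmm = GaussianMixture(n_components=n_components, random_state=seed).fit(dataset)
    log_lik = gmm.score_samples(dataset)                   # log p(x) per sample
    keep = np.argsort(log_lik)[: len(dataset) - n_remove]  # drop highest-likelihood rows
    return dataset[np.sort(keep)]
```

Low-likelihood samples, such as rare outliers, survive the pruning, which preserves the spread of the distribution while the dataset is kept under its size limit.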
A dataset produced using the method 200 may have various uses. For example, it may be used to train a general purpose machine learning model (e.g. to perform a generic predictive task). Such a general purpose model may be used as a source model in a transfer learning process. Put another way, in some embodiments, the method 200 may comprise training a multi-purpose machine learning model using the dataset, and using the multi-purpose machine learning model as a source model in a transfer learning process. In this way, a multi-purpose, generic source model may be obtained that is optimised for subsequent transfer learning. The advantage of having a multi-purpose dataset and training a multi-purpose model for transfer learning is that it is not necessary to perform “source model selection” and compare different source models/datasets, for example by comparing their similarity to the target, which is time-consuming and computationally expensive. There is also no need to store multiple source datasets/models for each ML task. Moreover, there is no need for target samples to be available for choosing the source model. Thus, embodiments herein may result in reduced computational resource requirements, time savings, and/or storage savings (e.g. compared to saving every data sample).
As noted above, the method 200 may be used in ML-based processes and services in a communications network. For example, as noted above, the method 200 may be for enabling re-use of a machine learning-model based system for different network operation tasks in the communications network. As such, the method 200 may further comprise steps of obtaining a first model trained to perform a first network operation task, and performing further training on the first model to obtain a second model. In such embodiments, the second model is trained to perform a second network operation task, and the further training is performed using the dataset as training data. The diversity of the dataset produced using the method 200 may be optimally diverse and as such, the dataset may be used in a transfer learning process in this way, to perform extra training to train a source model to perform a different task.
The first network operation task and/or the second network operation task may be related to anomaly detection, Key Performance Indicator, KPI, prediction, or network automation.
As an example, the dataset may comprise infrastructure data and/or data from an orchestrator obtained in a communications network. For example, data samples in the dataset may comprise Kubernetes™ data. Kubernetes™ is an open-source orchestrator for managing applications running in containers. As an example, each data sample in the dataset may comprise parameters such as: Collective CPU usage
Individual CPU statistics
Memory used and available
Swap space used and available
Overall I/O activities of the system
Individual device I/O activities
Context switch statistics
Run queue and load average data
Network statistics
By compiling the dataset using the method 200 herein, a dataset comprising the above-mentioned parameters may be used to train different models, such as service placement, scaling, and/or routing models. These are models that decide where to place a service (service placement), when and how to scale (scaling), or how to route requests between services (routing). Models like this would typically be trained using reinforcement learning, but an initial model can be transferred from a related task, such as KPI prediction, with a collected dataset compiled according to the method 200. One could also construct a base model to be transferred using unsupervised methods (for instance autoencoders). A dataset comprising data samples with the parameters above may also be used to train models to perform network tasks such as anomaly detection, KPI prediction and/or network automation in a communications network.
As another example, the methods herein may be used to compile a dataset of features related to Internet of Things (IoT) devices used in manufacturing.
In such an example the data samples in the dataset may comprise parameters such as:
Sensor data, examples:
Temperature, vibration, torque, humidity, etc.
Such a dataset may be used to train different ML models such as:
Anomaly detection models, models for predicting product quality, troubleshooting models, and/or models for predicting customer complaints.
As another example, the methods herein may be used to compile a dataset of features related to automated guided vehicles in factories.
In such an example, the data samples in the dataset may comprise, amongst other parameters: Image data
Such a dataset may be used to train different ML models, such as different Reinforcement Learning (RL) agents for deciding actions that should be performed in order for the vehicle to drive. The skilled person will appreciate that these are merely examples and that other data parameters may be collected and the compiled dataset may be used to train other types of predictive models than those described above.
In another embodiment in a communications network, the first data sample is obtained from a second node in a communications network and the method further comprises sending a message to the second node, indicating whether the first data sample was added to the dataset. In this way, the first node can give feedback to the second node, and the second node may use this feedback when determining whether to forward a data sample to the first node for consideration to be added to the database.
Turning now to Fig. 4, which illustrates an apparatus 400 according to some embodiments herein. The apparatus may be part of the node 100 described above. The apparatus 400 comprises the following three components:
404: Global dataset and model
The main component of the apparatus is the global data set component. This component consists of two sub-components: The actual dataset and a model of the data set. This model could for instance be, but is not limited to, a Gaussian mixture model.
406: Sample admission controller
In the sample admission controller, new samples are obtained (according to step 202 described above) and investigated using various statistical approaches: if a new sample adds to the diversity of the global data set (e.g. according to step 204 described above), it is admitted into the global dataset, otherwise it is rejected. Filtering out of anomalies, erroneous values or NaNs could also be done by this module, as true outliers always increase diversity of a data set, but they may not produce a good dataset. Thresholds for such outliers must be manually configured before deployment of the system.
402: Dataset update module
The purpose of the dataset update module is twofold: i. Remove un-necessary samples from the Global dataset. This is done when the size of the dataset exceeds a maximum size, which may or may not be pre-defined, and it is done by removing samples that contribute the least to the diversity. This could be done for instance by removing the samples with the highest likelihood, or samples from modes with the highest weights. ii. Add synthetic samples. If the dataset is smaller than a pre-defined threshold, the dataset model can be used to generate samples that are realistic but under-represented so that the diversity of the dataset would increase.
The apparatus 400 may perform the method 500 shown in Fig. 5. The method 500 has the following steps: 1: The apparatus receives one or more samples from a worker (e.g. in the manner of step 202, described above).
2: Filtering out of outliers. If there is an outlier, the sample is rejected, and the rejection is reported back to the source node.
3: Update the dataset model with the new sample/samples.
4: Check if the diversity has increased
5a: If the diversity has increased with the new sample/samples, the sample/samples is/are accepted and added to the global data set (step 204 of the method 200 above). The fact that the sample/samples are accepted is reported back to the source nodes of the data.
5b: If not it/they are rejected, and the generative model is reverted to the previous state. The fact that the sample/samples are rejected is reported back to the source nodes of the data.
6: Check if the dataset is larger than the size limit
7a: If the dataset is larger than the size limit: remove samples that are least useful for the dataset diversity.
7b: Otherwise (optionally) generate more samples. These could be samples that increase the diversity of the data set, generated from the generative model, or simply produced using existing techniques for oversampling/data augmentation.
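The admission loop of steps 1 to 7 above can be sketched as a single pass as follows; this is an illustrative assumption of an implementation, using a scikit-learn Gaussian mixture as the generative model, with the size limit, component count and function names chosen for the sketch:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

MAX_SIZE = 1000  # illustrative size limit for step 6

def _fit(data):
    return GaussianMixture(n_components=2, random_state=0).fit(data)

def _entropy(gmm, n=5000):
    """Monte Carlo estimate of the differential entropy of a fitted GMM."""
    x, _ = gmm.sample(n)
    return -gmm.score_samples(x).mean()

def admit(dataset, sample, bounds):
    """One pass of the admission loop: filter (step 2), refit and compare
    diversity (steps 3-4), accept or reject (steps 5a/5b), then enforce the
    size limit (steps 6-7a). Returns the (possibly updated) dataset and a
    flag that can be reported back to the source node."""
    lo, hi = bounds
    # Step 2: reject NaNs and values outside manually configured thresholds.
    if np.any(np.isnan(sample)) or np.any(sample < lo) or np.any(sample > hi):
        return dataset, False
    # Steps 3-4: update the model with the candidate and compare entropies.
    candidate = np.vstack([dataset, sample[None, :]])
    if _entropy(_fit(candidate)) <= _entropy(_fit(dataset)):
        return dataset, False            # step 5b: reject, keep the old model
    dataset = candidate                  # step 5a: accept the sample
    # Steps 6-7a: if over the limit, drop the highest-likelihood sample.
    if len(dataset) > MAX_SIZE:
        gmm = _fit(dataset)
        dataset = np.delete(dataset, np.argmax(gmm.score_samples(dataset)), axis=0)
    return dataset, True
```

Refitting the model from scratch on each candidate is the simplest formulation; an implementation could instead maintain the model incrementally and revert it on rejection, as in step 5b.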
Turning now to Fig. 6 which shows a signal diagram showing the signals sent between an apparatus 602 performing the method 500 on data received from a data source 604 and a data user 606. In this example, the node data source 1 604 sends samples that increase the diversity of the global data set. In this example, the following signals are sent: S1 : data source 1 sends data sample(s) to apparatus 602.
S2: apparatus 602 indicates that one or more of the samples were accepted into the dataset.
S3: user requests dataset
S4: apparatus 602 sends global dataset to user for use in e.g. a transfer learning problem.
Turning now to Fig. 7, which shows a signal diagram showing the signals sent between apparatus 602 performing the method 500 on data received from a second data source 608. In this example, the node data source 2 608 sends samples that do not increase the diversity of the global data set. In this example, the following signals are sent: S5: data source 2 sends data sample(s) to apparatus 602.
S6: apparatus 602 indicates that one or more of the samples were rejected.
Turning to Fig. 8, which shows an example with an environment with n Edge Datacenters (DC1, DC2, DC3) continuously measuring a set of features that are sent to a source manager 802. These features are used as inputs for a set of ML models in order to help with network and edge DC management tasks. Examples could be, for instance, KPI prediction or anomaly detection. The achieved technical effect is to have a single augmented data source that manages collected data from multiple sources. This achieves lower storage overhead (less memory required) and removes the need to select which source model to transfer from - this collects one universally good source data set, which can be used for transfer learning with either an existing task or a new task.
The source manager 802 performs the method 500 as illustrated in Fig. 5 on the data samples sent to the source manager by the datacentres DC1 ; DC2; DC3. The steps outlined below are made with reference to the steps shown in Fig. 5:
Step 1: The samples from the different sources are sent to the source manager (and received by the source manager according to step 202 of the method 200 above). In this embodiment, this could be done once per day, where all collected samples X are sent, where X is an (a, b) matrix, a being the number of measurements being sent and b the number of measured features.
Step 2: Remove samples where feature values are missing or obviously wrong; what counts as “obviously” wrong can be defined by a domain expert. For instance, a sample containing a feature corresponding to temperature showing a value below 0 K or above 100 degrees Celsius will be filtered out.
Step 3: In the source manager there is a Gaussian mixture model (GMM), which is an example of a generative model. This model is trained using the samples that exist in the sample data set (the source data set) and the new samples combined. If the database is empty, then the GMM is trained using only the newly incoming samples.
Step 4: Check if the diversity of the data set with the newly added samples is larger than the diversity of the old data set (without the newly added samples). For example, using the differential entropy h(x) as the diversity metric: h(x) = -Ex~p(x) log p(x), where p(x) is the GMM and Ex~p(x) is the mathematical expectation, which takes the expected value of its argument. h(x) can be estimated using Monte Carlo sampling.
Step 5a: If the diversity has increased with the new samples, the samples are accepted and added to the global data set. The model manager sends back information to the edge data center from which the new samples came, that the information was useful. This could then potentially be used in the edge node for decisions on what samples to send in the future.
Step 5b: If not it/they are rejected, and the generative model is reverted to the previous state and they are not added to the global data set. The model manager sends back information to the edge data center from which the new samples came, that the information was not useful. This can potentially be used in the edge node for future filtering.
Step 6: Check if the global dataset is larger than the size limit. Step 7a: If the dataset is larger than the size limit, remove samples that are least useful for the dataset diversity. This can be done in the following way: compute the likelihood of all samples in the data set and remove the ones with the highest likelihood. This way, samples are removed where there are already “enough” samples in the dataset.
Turning now to other embodiments, in one embodiment, a source manager such as that illustrated in Fig, 8, could be included in a function for measuring wireless connectivity performance of mobile devices (e.g. user equipments, UEs). In this example, the connected device sends performance measurement readings to a central database in the cloud, to which an analytics engine is connected. The collected data can be used as an input for optimizing the network and for added functionality such as predictive mobility, etc.
The source manager could be applied at the interface between a data streamer and a big data database. The mobile devices send data samples to the data streamer, and admission control to (and continuous maintenance of) the Big Data database is performed by the source manager. If used like this, the technical effect is reducing the size of the data stored and possibly reducing the sending of unnecessary samples, if the feedback mechanism described in step 5 of Fig. 5 is performed.
Turning now to another embodiment, the Open Radio Access Network (O-RAN) is an open standard for next generation radio access networks driven mostly by telecom operators. An Intelligent Management and Orchestration system within the framework of O-RAN has been proposed, which specifically highlights the importance of AI model management, data analytics, and training capabilities. Specifically, the AI model management component is a functionality for life cycle management (LCM) of models and source domains. The proposed source manager described in Fig. 4 above could reside as a function within an AI model management module, supporting the general model management and LCM activities.
Thus, there is described herein, systems and methods to update and maintain a source dataset with samples from multiple distributed data sources, for instance edge data centres or radio base stations. The method accepts samples that increase the diversity of the source dataset and has the option to use the generative model used for sample admission to enhance the dataset further, while keeping the size of the data set below a pre-defined maximum size. This allows for creation of a source dataset in a target domain-, model- and task-agnostic way. This is enabled by maximizing the diversity of the global dataset, which lets the data set be constructed before the need arises, saving time and overhead, as a single source data set can be used for multiple uses. The method enables creation of source datasets of a certain data quality. The admission or rejection of a data sample is based on its statistical properties, and hence the method can reduce the number of repeated samples/samples with overlapping feature space in the dataset. In another embodiment, there is provided a computer program product comprising a computer readable medium, the computer readable medium having computer readable code embodied therein, the computer readable code being configured such that, on execution by a suitable computer or processor, the computer or processor is caused to perform the method or methods described herein.
Thus, it will be appreciated that the disclosure also applies to computer programs, particularly computer programs on or in a carrier, adapted to put embodiments into practice. The program may be in the form of source code, object code, a code intermediate between source and object code such as a partially compiled form, or in any other form suitable for use in the implementation of the method according to the embodiments described herein.
It will also be appreciated that such a program may have many different architectural designs. For example, a program code implementing the functionality of the method or system may be sub-divided into one or more sub-routines. Many different ways of distributing the functionality among these sub-routines will be apparent to the skilled person. The sub-routines may be stored together in one executable file to form a self-contained program. Such an executable file may comprise computer-executable instructions, for example, processor instructions and/or interpreter instructions (e.g. Java interpreter instructions). Alternatively, one or more or all of the sub-routines may be stored in at least one external library file and linked with a main program either statically or dynamically, e.g. at runtime. The main program contains at least one call to at least one of the sub-routines. The sub-routines may also comprise function calls to each other.
The carrier of a computer program may be any entity or device capable of carrying the program. For example, the carrier may include a data storage, such as a ROM, for example, a CD ROM or a semiconductor ROM, or a magnetic recording medium, for example, a hard disk. Furthermore, the carrier may be a transmissible carrier such as an electric or optical signal, which may be conveyed via electric or optical cable or by radio or other means. When the program is embodied in such a signal, the carrier may be constituted by such a cable or other device or means. Alternatively, the carrier may be an integrated circuit in which the program is embedded, the integrated circuit being adapted to perform, or used in the performance of, the relevant method.
Variations to the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claimed invention, from a study of the drawings, the disclosure and the appended claims. In the claims, the word "comprising" does not exclude other elements or steps, and the indefinite article "a" or "an" does not exclude a plurality. A single processor or other unit may fulfil the functions of several items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage. A computer program may be stored/distributed on a suitable medium, such as an optical storage medium or a solid-state medium supplied together with or as part of other hardware, but may also be distributed in other forms, such as via the Internet or other wired or wireless telecommunication systems. Any reference signs in the claims should not be construed as limiting the scope.
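By way of a non-limiting example, the reuse of the multi-purpose dataset for transfer learning (training a source model on the dataset and then performing further training for a second task) may be sketched as follows; the toy tasks and all names are illustrative, and scikit-learn's `warm_start` option stands in for a generic fine-tuning mechanism:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Diverse multi-purpose "source" dataset: features X and a first-task target.
X_src = rng.normal(size=(300, 4))
y_src = X_src.sum(axis=1)              # stand-in for task 1 (e.g. KPI prediction)

# Smaller dataset for a second network-operation task in the same feature space.
X_tgt = rng.normal(size=(40, 4))
y_tgt = X_tgt[:, 0] - X_tgt[:, 1]      # stand-in for task 2

# Train the source model, then continue training (fine-tune) on the new task;
# warm_start=True makes the second fit() resume from the learned weights.
model = MLPRegressor(hidden_layer_sizes=(16,), warm_start=True,
                     max_iter=500, random_state=0)
model.fit(X_src, y_src)                # source training on the multi-purpose dataset
model.fit(X_tgt, y_tgt)                # further training for the second task
```

Because the source dataset is assembled in a task-agnostic way, the same source model can be fine-tuned for several different second tasks.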

Claims

1. A computer implemented method (200) of assembling a multi-purpose dataset suitable for use in training different machine learning models to perform different tasks, the method (200) comprising: obtaining (202) a first data sample; and adding (204) the first data sample to the dataset if the addition of the first data sample to the dataset increases diversity of the dataset, according to a diversity measure.
2. A method (200) as in claim 1 further comprising: modelling the dataset using a generative model; updating the generative model with the first data sample; and determining whether the diversity of a distribution of the dataset, as modelled by the generative model, has increased as a result of updating the generative model with the first data sample.
3. A method (200) as in claim 2 wherein the diversity is measured using differential entropy or Shannon entropy.
4. A method (200) as in claim 3 wherein the differential entropy or Shannon entropy is estimated by Monte Carlo sampling of the distribution of the dataset as modelled by the generative model.
5. A method (200) as in any one of claims 2 to 4 wherein, if a size of the dataset is below a first size threshold, the method (200) further comprises: using the generative model to generate new synthetic data samples that increase the diversity of the dataset.
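By way of a non-limiting illustration of the synthetic-augmentation step of claim 5, the following sketch (again assuming scikit-learn's GaussianMixture as the generative model and a Monte Carlo entropy estimate; all names are illustrative) grows a dataset only with generated samples that increase the modelled diversity, without exceeding a maximum size:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def _entropy(gmm, n=2000):
    # Monte Carlo estimate of the differential entropy of the fitted model.
    X, _ = gmm.sample(n)
    return -np.mean(gmm.score_samples(X))

def augment(dataset, max_size, n_components=2, seed=0, max_tries=100):
    # Grow the dataset with model-generated samples, keeping only those
    # whose addition increases the modelled diversity (entropy), and never
    # exceeding the pre-defined maximum size.
    gmm = GaussianMixture(n_components, random_state=seed).fit(dataset)
    best = _entropy(gmm)
    tries = 0
    while len(dataset) < max_size and tries < max_tries:
        tries += 1
        candidate, _ = gmm.sample(1)
        trial = np.vstack([dataset, candidate])
        gmm_trial = GaussianMixture(n_components, random_state=seed).fit(trial)
        ent = _entropy(gmm_trial)
        if ent > best:           # keep only diversity-increasing samples
            dataset, best = trial, ent
    return dataset
```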
6. A method (200) as in claim 5 wherein, if the size of the dataset is above a second size threshold, the method (200) further comprises: removing a second data sample from the dataset, wherein the second data sample was previously generated by the generative model and contributes least to the diversity of the dataset compared to other data samples previously generated by the generative model in the dataset.
7. A method (200) as in any one of claims 2 to 6 wherein, if the size of the dataset is above a third size threshold, the method (200) further comprises: removing a third data sample from the dataset that contributes least to the diversity of the dataset compared to other data samples in the dataset.
8. A method (200) as in claim 7 wherein: the third data sample is from a mode of the generative model with a weight above a weight threshold; or according to the generative model, the third data sample has a likelihood in the distribution of the dataset, above a first likelihood threshold.
9. A method (200) as in any one of claims 2-8 wherein the generative model is a Gaussian mixture model, Variational auto-encoder, Long Short-Term Memory Network, State Space Model or Hidden Markov Model.
10. A method (200) as in any one of the preceding claims further comprising supplementing the dataset with synthetic data generated using another data augmentation process.
11. A method (200) as in any one of the preceding claims further comprising: training a multi-purpose machine learning model, using the dataset; and using the multi-purpose machine learning model as a source model in a transfer learning process.
12. A method (200) as in any one of the preceding claims wherein the method (200) is performed by a first node in a communications network.
13. A method (200) as in claim 12 wherein the first data sample is obtained from a second node in a communications network; and wherein the method (200) further comprises sending a message to the second node, indicating whether the first data sample was added to the dataset.
14. A method (200) as in any one of the preceding claims wherein the method (200) is for assembling a multi-purpose dataset of measurements and/or features in a communications network, suitable for training different machine learning models to perform different network operation tasks in the communications network.
15. A method (200) as in claim 14 wherein the method (200) is for enabling re-use of a machine learning-model based system for different network operation tasks in the communications network, and wherein the method (200) further comprises: obtaining a first model trained to perform a first network operation task; and performing further training on the first model to obtain a second model wherein the second model is trained to perform a second network operation task, and wherein the further training is performed using the dataset as training data.
16. A method (200) as in claim 15 wherein the first network operation task and/or the second network operation task is related to: anomaly detection; Key Performance Indicator, KPI, prediction; or network automation.
17. A method (200) as in claim 14, 15, or 16 wherein the dataset comprises a plurality of data samples obtained from a plurality of different nodes in the communications network; and wherein the method (200) is for assembling a diverse multi-purpose dataset from data samples from the plurality of different nodes.
18. A method (200) as in claim 17 wherein the plurality of nodes comprise edge cloud nodes and/or base stations in a communications network.
19. A method (200) as in claim 17 wherein the plurality of nodes comprise user devices; and the plurality of data samples comprise measurements of wireless connectivity performance.
20. A computing node (100) for assembling a multi-purpose dataset suitable for use in training different machine learning models to perform different tasks, the node (100) being configured to: obtain a first data sample; and add the first data sample to the dataset if the addition of the first data sample to the dataset increases diversity of the dataset, according to a diversity measure.
21. A node (100) as in claim 20 further configured to perform the method (200) of any one of claims 2-19.

22. A computing node (100) for assembling a multi-purpose dataset suitable for use in training different machine learning models to perform different tasks, the node (100) comprising: a memory (104) comprising instruction data representing a set of instructions (106); and a processor (102) configured to communicate with the memory (104) and to execute the set of instructions (106), wherein the set of instructions (106), when executed by the processor (102), cause the processor (102) to: obtain a first data sample; and add the first data sample to the dataset if the addition of the first data sample to the dataset increases diversity of the dataset, according to a diversity measure.

23. A node (100) as in claim 22 further configured to perform the method (200) of any one of claims 2-19.

24. A computer program comprising instructions which, when executed on at least one processor, cause the at least one processor to carry out a method (200) according to any of claims 1 to 19.

25. A carrier containing a computer program according to claim 24, wherein the carrier comprises one of an electronic signal, optical signal, radio signal or computer readable storage medium.

26. A computer program product comprising non transitory computer readable media having stored thereon a computer program according to claim 24.
PCT/EP2022/055012 2022-02-28 2022-02-28 Assembling a multi-purpose dataset WO2023160827A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/EP2022/055012 WO2023160827A1 (en) 2022-02-28 2022-02-28 Assembling a multi-purpose dataset


Publications (1)

Publication Number Publication Date
WO2023160827A1 true WO2023160827A1 (en) 2023-08-31

Family

ID=80952306

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2022/055012 WO2023160827A1 (en) 2022-02-28 2022-02-28 Assembling a multi-purpose dataset

Country Status (1)

Country Link
WO (1) WO2023160827A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200364561A1 (en) * 2019-04-23 2020-11-19 Sciencelogic, Inc. Distributed learning anomaly detector
US20210117718A1 (en) * 2019-10-21 2021-04-22 Adobe Inc. Entropy Based Synthetic Data Generation For Augmenting Classification System Training Data
WO2022013264A1 (en) * 2020-07-16 2022-01-20 Koninklijke Philips N.V. Selecting a training dataset with which to train a model


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Nasios, N.; Bors, A.: "Variational learning for Gaussian mixture models", IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 36, 2006, pages 849-862
Pan, S.J.; Yang, Q.: "A Survey on Transfer Learning", IEEE Transactions on Knowledge and Data Engineering, vol. 22, October 2010
Sandfort, V.; Yan, K.; Pickhardt, P.J. et al.: "Data augmentation using generative adversarial networks (CycleGAN) to improve generalizability in CT segmentation tasks", Sci Rep, vol. 9, 2019, 16884


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22712874

Country of ref document: EP

Kind code of ref document: A1