WO2023187444A1 - Classification and model retraining detection in machine learning - Google Patents

Classification and model retraining detection in machine learning

Info

Publication number
WO2023187444A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
training
data points
classification
drift
Prior art date
Application number
PCT/IB2022/052917
Other languages
French (fr)
Inventor
Mohamed NAILI
Karthikeyan Premkumar
Original Assignee
Telefonaktiebolaget Lm Ericsson (Publ)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Telefonaktiebolaget Lm Ericsson (Publ) filed Critical Telefonaktiebolaget Lm Ericsson (Publ)
Priority to PCT/IB2022/052917 priority Critical patent/WO2023187444A1/en
Publication of WO2023187444A1 publication Critical patent/WO2023187444A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0475Generative networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/094Adversarial learning

Definitions

  • the disclosure generally relates to machine learning models, and in particular to detecting data drift in machine learning models.
  • Machine learning models are often utilized to make predictions or decisions based on real data, without being explicitly programmed to do so. As opposed to methods or circuitry implemented by fixed program instructions, machine learning methods and circuitry derive knowledge (or “learn”) from example inputs of real data (e.g., training data set) and rely on patterns and inferences to make predictions.
  • the disclosure includes methods and systems for training a generative adversarial network to detect a data drift of a machine learning model.
  • the present disclosure provides a computer-implemented method for training a generative adversarial network to detect a data drift of a machine learning model.
  • the method uses supervised learning to generate a plurality of data points based at least on training data of the machine learning model.
  • the method uses supervised learning to classify each data point of the plurality of data points and generate a classification probability distribution for the plurality of data points.
  • the method detects the data drift of the machine learning model based on a deviation in the classification probability distribution. The detected data drift can enable triggering of a corrective action.
  • the method detects the data drift by calculating a data drift score based on the classification probability distribution.
  • generating the plurality of data points includes generating a different data point based at least on maximizing the reconstruction loss.
  • generating the plurality of data points includes generating a similar data point based at least on minimizing the reconstruction loss.
  • detecting the data drift includes calculating a data drift score and determining whether the data drift score is above a predetermined threshold within a predetermined period of time.
  • the present disclosure further provides a non-transitory computer readable medium or media containing instructions for executing a method for training a generative adversarial network to detect a data drift of a machine learning model.
  • the method uses supervised learning to generate a plurality of data points based at least on training data of the machine learning model.
  • the method uses supervised learning to classify each data point of the plurality of data points and generate a classification probability distribution for the plurality of data points.
  • the method detects the data drift of the machine learning model based on a deviation in the classification probability distribution. The detected data drift can enable triggering of a corrective action.
  • the present disclosure provides a system for executing a method for training a generative adversarial network to detect a data drift of a machine learning model.
  • the system includes a database connected to a network, configured for receiving and storing training data.
  • the system includes one or more processors and memory.
  • the memory contains instructions executable by the one or more processors whereby the system is operative to use supervised learning to generate a plurality of data points based at least on training data of the machine learning model.
  • the system is operative to use supervised learning to classify each data point of the plurality of data points and generate a classification probability distribution for the plurality of data points.
  • the system is operative to detect the data drift of the machine learning model based on a deviation in the classification probability distribution. The detected data drift can enable triggering of a corrective action.
  • FIG. 1 is a flowchart illustrating a method for training a generative adversarial network to detect a data drift of a machine learning model in accordance with some embodiments
  • FIG. 2 is a block diagram illustrating an example data drift detecting system in accordance with some embodiments
  • FIG. 3 is an example data diagram related to training a GAN to detect a data drift of a machine learning model in accordance with some embodiments
  • FIG. 4 is a flowchart illustrating a method for training a GAN to detect a data drift of a machine learning model in accordance with some embodiments
  • FIG. 5 is a block diagram illustrating a system for training a GAN to detect a data drift of a machine learning model in accordance with some embodiments
  • FIG. 6 is a block diagram illustrating a system for training a GAN to detect a data drift of a machine learning model in accordance with some embodiments
  • FIG. 7 is a flowchart illustrating a method for training a GAN to detect a data drift of a machine learning model in accordance with some embodiments
  • FIG. 8 is a flowchart illustrating a method for training a GAN to detect a data drift of a machine learning model in accordance with some embodiments
  • FIG. 9 is a block diagram illustrating a system for a discriminator-classifier model for training a generative adversarial network to detect a data drift of a machine learning model in accordance with some embodiments
  • FIG. 10 is a block diagram illustrating a system for a different data generator model for training a generative adversarial network to detect a data drift of a machine learning model in accordance with some embodiments;
  • FIG. 11 is a block diagram illustrating a system for a similar data generator model for training a generative adversarial network to detect a data drift of a machine learning model in accordance with some embodiments;
  • FIG. 12 is a block diagram illustrating an architecture for similar and different data generators in accordance with some embodiments.
  • FIG. 13 is a block diagram illustrating an architecture for learning a similarity metric in accordance with some embodiments;
  • FIG. 14 is a block diagram illustrating a system for a data drift detector model for training a generative adversarial network to detect a data drift of a machine learning model in accordance with some embodiments;
  • FIG. 15 is a block diagram illustrating an exemplary computer system configurable by a computer program product to carry out embodiments of the present disclosure.
  • FIG. 16 is a block diagram illustrating a virtualization environment in which functions implemented by some embodiments of the present disclosure may be virtualized.
  • the concept may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects.
  • the following detailed description is, therefore, not to be taken in a limiting sense.
  • GAN generative adversarial network
  • when a data drift is detected, model retraining is likely needed.
  • a data drift is indicated when the target variable being predicted keeps changing beyond an acceptable threshold because of the model drift caused by changes of underlying data.
  • the existing model will no longer be able to make the same generalizations.
  • the predictions the model makes are no longer as accurate as they were at the time of training.
  • Adversarial network and classification models have many important applications in pattern recognition, anomaly detection, system failure detection, and the like. Thus, the cost of wrong predictions or classifications in production is high. Systems and methods for monitoring a model for data drift and continuously determining the retraining interval are therefore provided in the present disclosure.
  • the present disclosure provides detection of change in data distribution on many levels from real data (or identical to real data) to very different data.
  • the detection of the level of data distribution change facilitates deciding if the model needs re-training without the need for user feedback.
  • the classification of a similar data generator’s output helps in better classification performance.
  • the present disclosure provides imbalanced dataset mitigation through pairwise comparison (e.g., similarity) of data points.
  • generator and discriminator architectures may be used in addition to, or instead of, the above models, where the two models, generator and discriminator, compete against each other.
  • the discriminator tries to learn how to classify a real data point as real, and how to classify a data point generated by the generator as fake.
  • the generator tries to learn how to generate a data point that would be classified as real by the discriminator.
  • FIG. 1 is a flow diagram illustrating a method 100 for training a generative adversarial network to detect a data drift of a machine learning model.
  • method 100 for training a generative adversarial network to detect a data drift of a machine learning model begins with step 101.
  • the method receives training data of the machine learning model.
  • the method can also include preprocessing the training data of the machine learning model. The preprocessing includes, for example, rebalancing a set of tuples of the training data, cleaning the training data, scaling the training data, or a combination thereof.
  • method 100 uses supervised learning to generate a plurality of data points based at least on training data of the machine learning model.
  • generating the plurality of data points can include generating a different data point based at least on maximizing the reconstruction loss.
  • reconstruction loss represents the difference between the ground truth (real data) and another data point (generated data) produced by a variational autoencoder (VAE)-based data generator, and may be used for many purposes, such as anomaly detection or generating more data from a learned data distribution.
  • VAEs are expressive latent variable models that can be used to learn complex probability distributions from training data.
  • generating the plurality of data points can include generating a similar data point based at least on minimizing the reconstruction loss. In some embodiments, generating the plurality of data points can include generating a similar to very different data point based at least on minimizing the reconstruction loss. In some embodiments, after preprocessing the data set, the method generates a combination of tuples in the format (data point x, data point y), regardless of whether labels for the data exist.
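  • As a hedged illustration of the two generator objectives above (a minimal sketch; the filing publishes no code, so the PyTorch architecture, layer sizes, and use of mean-squared error below are assumptions), the similar and different data generators can differ only in the sign of the reconstruction objective:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Generator(nn.Module):
    """Minimal autoencoder-style generator (assumed architecture)."""
    def __init__(self, dim, latent=8):
        super().__init__()
        self.enc = nn.Linear(dim, latent)
        self.dec = nn.Linear(latent, dim)

    def forward(self, x):
        return self.dec(torch.tanh(self.enc(x)))

def reconstruction_objective(gen, x, similar=True):
    recon_loss = F.mse_loss(gen(x), x)
    # Similar data generator: minimize the reconstruction loss.
    # Different data generator: maximize it, i.e. descend on its negative.
    return recon_loss if similar else -recon_loss
```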
  • the method uses supervised learning to classify each data point of the plurality of data points and generate a classification probability distribution for the plurality of data points.
  • the method can also include using unsupervised learning to enable anomaly detection based on the classification probability distribution.
  • using supervised learning to generate a classification probability distribution can include training a discriminator-classifier generative adversarial network adapted to generate the classification probability distribution.
  • training the discriminator-classifier can include generating a data point tuple based on the plurality of data points. In some embodiments, training the discriminator-classifier can include calculating a classification score and a classification loss for each data point in the plurality of data points. In some embodiments, training the discriminator-classifier can include calculating a similarity score and a similarity loss for two data points in the plurality of data points. In some embodiments, training the discriminator-classifier can include calculating classification values and similarity values. In some embodiments, training the discriminator-classifier can include calculating mean losses for the plurality of data points based on the classification values and the similarity values.
  • method 100 detects the data drift of the machine learning model based on a deviation in the classification probability distribution, wherein the detected data drift enables triggering a corrective action.
  • detecting the data drift can include calculating a data drift score based on the classification probability distribution.
  • detecting the data drift can include calculating a data drift score and determining whether the data drift score is above a predetermined threshold within a predetermined period of time.
  • a corrective action can include at least one of retraining of the machine learning model and ensemble classification.
  • the method simultaneously trains a discriminator-classifier GAN adapted to generate the classification probability distribution and trains at least one data generator adapted to generate the plurality of data points.
  • the method trains at least one data generator adapted to generate the plurality of data points and calculates at least one of classification values, similarity values and a reconstruction value.
  • the method trains at least one data generator adapted to generate the plurality of data points and calculates mean losses for the plurality of data points based on at least one of classification values, similarity values, and a reconstruction value.
  • Generative models in machine learning can be trained using an unlabeled dataset and are capable of generating new data points after training is completed. As generating new content requires a good understanding of the training data at hand, such models are often regarded as a key ingredient to unsupervised learning.
  • Adversarial aspects include simultaneously training two models, the generator and the discriminator, with competing objectives.
  • the generator captures the data distribution, and the discriminator estimates the probability that a sample came from the training data rather than the generator.
  • the training target for the generator is to maximize the probability of the discriminator making a mistake.
  • the generative model competes against an adversary: a discriminative model that learns to determine whether a sample is from the model distribution or the data distribution. By iterating the adversarial learning process between the generator and the discriminator, the generator will eventually be able to generate data points that successfully confuse the discriminator.
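  • A minimal sketch of this alternating adversarial training, assuming standard binary cross-entropy objectives and a discriminator that outputs a probability (optimizers, shapes, and layer choices are illustrative, not from the filing):

```python
import torch
import torch.nn.functional as F

def gan_step(gen, disc, real, opt_g, opt_d, latent_dim=8):
    """One adversarial update; disc is assumed to output a probability in (0, 1)."""
    batch = real.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # Discriminator step: classify real data as real and generated data as fake.
    fake = gen(torch.randn(batch, latent_dim)).detach()
    d_loss = (F.binary_cross_entropy(disc(real), ones) +
              F.binary_cross_entropy(disc(fake), zeros))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: maximize the probability of the discriminator being fooled.
    g_loss = F.binary_cross_entropy(disc(gen(torch.randn(batch, latent_dim))), ones)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```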
  • FIG. 2 shows an example data drift detecting system 200 in accordance with some embodiments.
  • data drift detecting system 200 for training a GAN to detect a data drift of a machine learning model includes model 210, data drift detector 220, generator 230, and discriminator 240.
  • Data drift detecting system 200 facilitates detecting data drift and, consequently, triggering the re-training process of model 210. Based on the input and/or output of model 210, system 200 can generate a data drift score that can be used to determine whether model X 210 needs to be re-trained. Additionally, system 200 can be used for other (optional) tasks, such as classification. In some embodiments, model X 210 is a machine learning model operating in production and being monitored for a data drift based on its input, output, and prediction abilities. System 200 determines that model X 210 needs retraining based on the following.
  • system 200 determines that the data distribution has been changed and model X 210 needs to be retrained on new data.
  • the re-training score is calculated based on outputs provided by discriminator 240.
  • System 200 determines that model X 210 needs better classification based on the following. Discriminator-classifier 240 will be able to better classify new data as it classifies data as being from a real data distribution, similar data distribution, different data distribution, or very different data distribution.
  • system 200 performs an optional task such as classification, where discriminator-classifier 240 classifies data points to a set of classes (e.g., target classes, a “general” class, and an “unknown” class).
  • system 200 processes an imbalanced dataset, by relying on pairwise comparisons between data points.
  • System 200 augments the training dataset size and applies other techniques such as an under-sampling technique to improve the balance of the newly generated training dataset.
  • FIG. 3 shows an example data diagram 300 related to training a GAN to detect a data drift of a machine learning model in accordance with some embodiments.
  • diagram 300 includes real data (training data) 310, similar to real data 320, different data 330, and very different data 340.
  • Data diagram 300 illustrates data categorization distributions for real data 310, similar data 320, different data 330, and very different data 340.
  • Real data 310 is the training data of any model being monitored for a data drift. Similar to real data 320 is generated based on a distribution of real data 310, yet still very close to it. In other words, a data point that is “similar” to a real data point will have very close feature values and would be classified the same as the real data point if the latter belongs to a known class. Different data 330 and very different data 340 are generated from other data distributions where generated data points will not be classified the same as data points in real data 310.
  • the method for training a GAN to detect a data drift of a machine learning model avoids false re-training triggers by training the two generator models described herein:
  • a similar data generator model that can generate similar to real data 320, and a different data generator that generates different data 330.
  • FIG. 4 is a flowchart illustrating a method 400 for training a generative adversarial network to detect a data drift of a machine learning model.
  • method 400 for training a GAN to detect a data drift of a machine learning model begins with step 401.
  • the method collects and preprocesses data.
  • preprocessing data includes rebalancing a set of tuples of the training data, cleaning the training data, scaling the training data, or a combination thereof.
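  • A minimal sketch of such preprocessing, assuming NaN-row removal for cleaning, feature standardization for scaling, and random pairing for tuple generation (all illustrative choices; the filing does not prescribe specific techniques):

```python
import numpy as np

def preprocess(X):
    """Clean (drop rows with NaNs) and scale (standardize each feature)."""
    X = X[~np.isnan(X).any(axis=1)]
    mean, std = X.mean(axis=0), X.std(axis=0)
    return (X - mean) / np.where(std == 0, 1.0, std)

def make_tuples(X, rng=np.random.default_rng(0)):
    """Pair each data point x with a randomly drawn partner y,
    giving tuples in the format (data point x, data point y)."""
    partners = rng.permutation(len(X))
    return list(zip(X, X[partners]))
```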
  • step 402 the method determines whether labels for the data points of the collected and preprocessed data are provided. If the answer is yes, the method proceeds to step 404. If the answer is no, the method proceeds to step 403. If some (or all) input data labels are provided (e.g., in supervised or semi-supervised learning), task 2 will classify data points into a known class or target class(es) (for real data points with a label and for data points that are similar to the labeled real data points), a general class (for real data points without a label and for data points that are similar to the unlabeled real data points), and an unknown class (for different and very different data points). Task 2 is described below in more detail with respect to task 1, which classifies data into real, similar to real, different, and very different categories.
  • step 403 the method does not consider task 2, because task 2 depends on having labels for the data points.
  • step 404 the method performs training, validation and testing. In some embodiments, during the training period, each of the data drift detecting models, including the discriminator, the different data generator, and the similar data generator, is trained on the real data points over their respective objective functions and loss functions.
  • the method begins making inferences.
  • the discriminator produces classification probabilities.
  • the classification probabilities may then be used for classification and to calculate the data drift (re-training) score, to decide whether the monitored model is appropriately or accurately processing incoming data or needs to be retrained.
  • the method calculates a retraining score.
  • the retraining score is calculated based on outputs provided by the discriminator. For example, while the monitored model is in production, during a predefined time window, the method calculates, for each data point that comes in, a score. The mean score for all the data points is calculated and used to determine the retraining score.
  • the method determines whether the retraining score is greater than the retraining threshold. If the answer is yes, the method proceeds to step 408. If the answer is no, the method proceeds back to step 401 to collect and preprocess more data.
  • the retraining threshold is determined based on the particulars of the model being monitored. Some models have frequent data drifts, for example, models for crash detection. Initially, an arbitrary retraining threshold may be set during a model’s testing phase to determine whether it is low enough to detect a data drift within a predetermined time interval, such as six months. Then, during the calibration phase of the model, the threshold and time-interval parameters may be tuned to identify appropriate parameters for the particular model, to be applied during the production phase of the model.
  • the method retrains the monitored model which has been trained on old data.
  • Model retraining is needed when the target variable being predicted keeps changing beyond an acceptable threshold because of the underlying data changes, causing data drift.
  • model drift is a misnomer since it is not the model that is changing. Rather, it is the environment or the data that is changing.
  • when the training data set, compared to a similar set of new data, shows a significant deviation, the existing model will no longer be able to make predictions as accurate as those it made at the time of training.
  • the method is performed continuously to monitor for data drift and determine the retraining interval.
  • step 405 various equations may be used to calculate the final classification probabilities.
  • the discriminator’s classification outputs may be used to decide labels.
  • the method could be used for many use cases, including but not limited to crash detection or anomaly detection.
  • crash detection may be achieved with supervised or semi-supervised learning.
  • training data points may be completely labeled or partially labeled as being crashed or not.
  • the method considers task 2 for the discriminator and trains all the models as described above.
  • unsupervised learning may be used for anomaly detection. If real data points have no labels, task 2 is not performed. The method trains the models accordingly, as described above.
  • the method decides that a data drift exists based on the discriminator classification results. For each data point received within a predetermined window of time, a score is calculated from the discriminator’s classification probabilities using tunable weight parameters (the weight parameters, the data drift threshold, and the size of the window of time are all parameters to be tuned).
  • the retraining score is equal to the mean of the per-data-point scores over the window of time.
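  • A hedged sketch of this windowed scoring; the mapping of tunable weights onto the four classification probabilities is an assumption for illustration, since the filing states only that the weights, threshold, and window size are parameters to be tuned:

```python
import numpy as np

# ALPHA and BETA are illustrative tunable weights over the discriminator's
# four class probabilities [real, similar, different, very_different];
# the filing does not publish the actual weighting.
ALPHA, BETA = 1.0, 2.0

def point_score(probs):
    p_real, p_similar, p_different, p_very_different = probs
    # real and similar probabilities contribute zero weight in this sketch
    return ALPHA * p_different + BETA * p_very_different

def retraining_score(window_probs):
    """Mean per-point score over a predetermined window of time."""
    return float(np.mean([point_score(p) for p in window_probs]))

def needs_retraining(window_probs, threshold=0.5):
    return retraining_score(window_probs) > threshold
```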
  • FIG. 5 is a block diagram illustrating a system 500 for training a GAN to detect a data drift of a machine learning model.
  • system 500 for training a GAN to detect a data drift of a machine learning model includes training period 510 followed by inference period 520.
  • Training period 510 includes training different data generator 511, similar data generator 512, and discriminator-classifier 513.
  • Inference period 520 includes inferring classification probability 521, and data drift score 522.
  • System 500 is adapted for training a GAN to detect a data drift of a machine learning model in at least two phases, including a training period 510 and an inference period 520.
  • each of the models, including the discriminator-classifier 513 (or simply discriminator), different data generator 511, and similar data generator 512, is trained on the real data points over their respective objective functions and loss functions.
  • discriminator 513 produces classification probability 521 that may be used for classification and to calculate data drift score 522 (or re-training score).
  • classification probabilities indicate a deviation in data distribution from the real data (training data), and when the data drift scores are over a threshold, system 500 may determine that the model being monitored is inaccurately processing incoming data and needs to be retrained.
  • FIG. 6 is a block diagram illustrating a system 600 for training a GAN to detect a data drift of a machine learning model.
  • system 600 for training a GAN to detect a data drift of a machine learning model includes hyperparameters 610.
  • Hyperparameters 610 include classification weights 611, similarity weights 612, and reconstruction weights 613.
  • System 600 includes hyperparameters 610 which need to be defined and tuned.
  • Classification weights 611 are listed in Table 1. Similarity weights 612 are listed in Table 2.
  • Reconstruction weights 613 are listed in Table 3.
  • FIG. 7 is a flowchart illustrating a method 700 for training a generative adversarial network to detect a data drift of a machine learning model.
  • method 700 for training a GAN to detect a data drift of a machine learning model begins with step 710.
  • the method generates a combination of input tuples for each data point in a preprocessed dataset.
  • Step 710 generates input data 715, for example in the format (data point x, data point y).
  • After preprocessing the data set, including for example cleaning the data set and scaling, the method generates a combination “combox” of tuples “tupleS” in the format (data point x, data point y), regardless of whether their labels exist in the preprocessed data set.
  • the method may under-sample a (balanced) subset.
  • step 720 the method generates a “similar data point” for each input tuple using the similar data generator model.
  • Step 730 the method generates a “different data point” for each input tuple using the different data generator model.
  • Steps 720 and 730 generate input data 725, for example, in the format (data point x, data point y, “similar data point x”, “similar data point y”, “different data point x”, “different data point y”).
  • For each data point in tupleS, the method generates a different data point and a similar data point using the different data generator and the similar data generator, respectively.
  • Step 740 the method generates a “similar to different data point” for each “different data point” using the similar data generator model.
  • Step 740 generates input data 745, for example, in the format (“similar to different data point x”, “similar to different data point y”). Similar to different data point may also be referred to as a very different data point, generated for each “different data point” generated at step 730.
  • the method performs classification, similarity analysis, and reconstruction for each generated data point in (data point x, data point y, “similar data point x”, “similar data point y”, “different data point x”, “different data point y”, “similar to different data point x”, “similar to different data point y”).
  • Step 750 generates output data 755, including, for example, classification score, classification loss, similarity score, similarity loss, and reconstruction loss.
  • Step 760 the method calculates mean losses for X data points, for each model.
  • Step 760 produces output data 765, including, for example, discriminator’s loss, similar data generator’s loss, and different data generator’s loss.
  • the method feeds each data point to the discriminator-classifier to detect if it is real, similar, different or similar to different.
  • the output will be a probability distribution with four probabilities. Then, the method calculates classification 1 loss.
  • the method determines to which target class it belongs (if task 2 is considered). The output will be a probability distribution with a probability for each class. Then, the method calculates classification 2 loss. The method then calculates the classification losses for that data point.
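  • A minimal sketch of the two classification losses, assuming cross-entropy for both the four-way task 1 and the optional task 2 (the loss choice and signatures are assumptions, not from the filing):

```python
import torch
import torch.nn.functional as F

# Task 1: four-way classification {real, similar, different, very_different}.
# Task 2 (optional): target class(es) + general class + unknown class.
def classification_losses(task1_logits, task1_target,
                          task2_logits=None, task2_target=None):
    loss1 = F.cross_entropy(task1_logits, task1_target)     # classification 1 loss
    loss2 = torch.tensor(0.0)
    if task2_logits is not None:                            # only if labels exist
        loss2 = F.cross_entropy(task2_logits, task2_target) # classification 2 loss
    return loss1, loss2
```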
  • To calculate the similarity score and loss in output data 755, the method generates all possible combinations “comblist” of two elements from the list: [real data point x, real data point y, “similar data point x”, “similar data point y”, “different data point x”, “different data point y”, “similar to different data point x”, and “similar to different data point y”].
  • weighted similarity loss = λsimilarity[data_point_1_type, data_point_2_type] × similarity loss
  • discriminator similarity loss = Σ weighted similarity loss, summed over the pairs in comblist
  • the method calculates the mean losses for X data points (X to tune). For discriminator’s loss, the method calculates:
  • discriminator mean loss = (Σ discriminator loss) / X, i.e., the average of the discriminator’s losses over the X data points
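  • A hedged sketch of this aggregation; the λ lookup mirrors the similarity weights 612, but all values shown are placeholders rather than the tuned values of Tables 1-3:

```python
import numpy as np

# Placeholder similarity weights keyed by (data_point_1_type, data_point_2_type);
# the tuned values would come from similarity weights 612 (Table 2).
LAMBDA_SIMILARITY = {
    ("real", "similar"): 1.0,
    ("real", "different"): 1.0,
    ("similar", "different"): 1.0,
}  # illustrative entries only

def weighted_similarity_loss(type_1, type_2, similarity_loss):
    lam = LAMBDA_SIMILARITY.get((type_1, type_2),
                                LAMBDA_SIMILARITY.get((type_2, type_1), 1.0))
    return lam * similarity_loss

def discriminator_similarity_loss(pair_losses):
    """Sum of weighted similarity losses over the pairs in comblist."""
    return sum(weighted_similarity_loss(t1, t2, l) for (t1, t2, l) in pair_losses)

def discriminator_mean_loss(per_point_losses):
    """Mean of the discriminator's losses over the X data points (X to tune)."""
    return float(np.mean(per_point_losses))
```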
  • FIG. 8 is a flowchart illustrating a method 800 for training a GAN to detect a data drift of a machine learning model.
  • method 800 for training a GAN to detect a data drift of a machine learning model begins with step 810.
  • the method determines that a corrective action is triggered.
  • a model drift results from changes of feature distributions and changes of targets as compared to the distributions of the features and targets of the training data previously used for training the model.
  • the method retrains the model with a new dataset.
  • retraining includes at least one of finding new parameters for the monitored model, changing hyperparameters, generating new training data, or a combination thereof.
  • the method classifies data points using data classes 831.
  • Data classes 831 include, for example, real data, similar to real data, different data, and very different data.
  • classification may include using data classes 832, including target class(es), general class, and unknown class.
  • the method rebalances the dataset, which may include techniques for pairwise-comparing data points 841, augmenting training dataset 842, and applying under-sampling technique 843.
  • Under-sampling technique 843 includes balancing uneven datasets by keeping all of the data in the minority class and decreasing the size of the majority class.
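  • A minimal sketch of such random under-sampling (the random selection strategy is one illustrative choice among several balancing schemes):

```python
import numpy as np

def undersample(X, y, rng=np.random.default_rng(0)):
    """Keep all minority-class rows; shrink every other class to match."""
    classes, counts = np.unique(y, return_counts=True)
    minority_count = counts.min()
    keep = []
    for c in classes:
        idx = np.flatnonzero(y == c)
        keep.append(rng.choice(idx, size=minority_count, replace=False))
    keep = np.concatenate(keep)
    return X[keep], y[keep]
```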
  • architectures based on Generative Adversarial Network (GAN) and Variational Auto Encoder Generative Adversarial Network (VAEGAN) may be used for unsupervised learning or semi-supervised learning for addressing imbalanced data sets.
  • GAN Generative Adversarial Network
  • VAEGAN Variational Auto Encoder Generative Adversarial Network
  • GAN models are based on optimizing a min-max function; in the standard formulation, the generator G minimizes and the discriminator D maximizes the objective E_x[log D(x)] + E_z[log(1 − D(G(z)))].
  • the Wasserstein loss function may be used in the case of the GAN model.
  • a Siamese network (e.g., a Siamese generative adversarial network) may be used to learn a similarity metric.
  • the input could be two data points, for which the model outputs a value, a similarity measure, that represents how similar or different these two data points are.
  • the Siamese model is trained using a similarity-based loss function, such as the contrastive loss function referenced as equation (2).
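  • Equation (2) is not reproduced in this text; as a hedged stand-in, the standard contrastive loss has the following form, with y = 1 marking a similar pair, d the distance between the two encodings, and m a margin:

```python
import torch

def contrastive_loss(d, y, margin=1.0):
    """Standard contrastive loss: pull similar pairs (y=1) together,
    push dissimilar pairs (y=0) at least `margin` apart."""
    similar_term = y * d.pow(2)
    dissimilar_term = (1 - y) * torch.clamp(margin - d, min=0).pow(2)
    return 0.5 * (similar_term + dissimilar_term).mean()
```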
  • FIG. 9 is a block diagram illustrating a system 900 for a discriminatorclassifier model for training a generative adversarial network to detect a data drift of a machine learning model.
  • system 900 for training a GAN to detect a data drift of a machine learning model includes input data point 901, another data point 902, discriminator-classifier 903, similarity measure calculation 904, similarity measure 905, target classes 906, and classes 907.
  • discriminator-classifier 903 is designed with two competing models, a generator (not shown) and a discriminator, where the generator produces fake data and tries to fool the discriminator. Discriminator-classifier 903 tries to distinguish between real data and fake data.
  • discriminator-classifier 903 can deal with that as described below and avoid labeling real data as fake data. Discriminator-classifier 903 classifies input data as real, similar, different, or similar to different (very different), as opposed to classifying the input data as real or fake. In some embodiments, discriminator-classifier 903 performs a similarity measure calculation 904 and provides a similarity measure 905.
  • task 2 will include performing classification into target classes 906 including class 1 through class N, a general class, and an unknown class.
  • Classes 1-N are known classes where real data and similar data should have the same label (e.g., same class). For example, if system 900 is adapted for crash detection, the two target classes include class 1 which indicates a crash and class 2 which indicates no crash.
  • a general class includes real data points and similar data points that are without a label.
  • An unknown class includes different data points and very different data points which are labeled as unknown class.
  • task 2 909 will not be performed. In this case, real data points and similar data points will be considered similar to each other, and different data points and very different data points will be considered similar to each other.
  • discriminator-classifier 903 is implemented as a single model adapted to perform multiple tasks, including for example task 1 908, task 2 909, and similarity measure calculation 904. In some embodiments, discriminator-classifier 903 is implemented as multiple models, each adapted to perform at least one task, including for example, one model adapted to perform task 1 908, a second model adapted to perform task 2 909, and a similarity network model adapted to perform similarity measure calculation 904.
  • the data drift detector determines whether data distribution has been changed. If the data distribution has been changed, a data drift is detected, and retraining is triggered.
  • FIG. 10 is a block diagram illustrating a system 1000 for a different data generator model for training a generative adversarial network to detect a data drift of a machine learning model.
  • system 1000 for training a GAN to detect a data drift of a machine learning model includes real data point 1001, different data generator 1002, discriminator-classifier 1003, different data 1004, classification losses calculation 1005, similarity loss calculation 1006, and reconstruction loss calculation 1007.
  • the dotted lines in FIG. 10 represent feedback lines and/or backpropagation.
  • Different data generator 1002 (configured as an AE-based, VAE-based, or any other architecture) receives the real data points 1001 as input and generates different data points 1004 as its output. Different data generator 1002 is trained with the goal of making discriminator 1003 unable to label different data 1004 as different. The different data generator 1002 and the discriminator-classifier 1003 are trained simultaneously.
  • Different data generator 1002 will work on the objective that the generated data 1004 distribution should be as different as possible from the input data 1001 distribution. This is achieved by maximizing the reconstruction loss, for example.
  • Generated data 1004 should be as different as possible from input data 1001, with the objective of leading discriminator 1003 to not classify the generated data 1004 as different and unknown, as it should be classified. Similarity loss and classification loss may be used to reach this objective, for example.
  • FIG. 11 is a block diagram illustrating a system 1100 for a similar data generator model for training a generative adversarial network to detect a data drift of a machine learning model.
  • system 1100 for training a GAN to detect a data drift of a machine learning model includes real/different data point 1101, similar data generator 1102, discriminator-classifier 1103, similar data 1104, classification loss calculation 1105, similarity loss calculation 1106, and reconstruction loss calculation 1107.
  • the dotted lines in FIG. 11 represent feedback lines and/or backpropagation.
  • Similar data generator 1102 takes real data points/different data points 1101 as input and generates similar to real/similar to different (very different) data 1104, respectively.
  • Similar data 1104 are generated using similar data generator 1102 designed as an AE-based or VAE-based architecture, or any other architecture.
  • similar data generator 1102 works on the objective that the discriminator 1103 labels the data 1104 as similar to real or similar to different (very different) according to input 1101 of similar data generator 1102.
  • similar data generator 1102 will try to make the generated data as similar as possible to input data 1101, using a similarity-based loss function, for example a contrastive loss function.
  • Similarity loss calculation 1106 using contrastive loss takes the output of discriminator 1103 for a positive example and calculates its distance to an example of the same class and contrasts that with the distance to negative examples. The loss is low if positive samples are encoded to similar (closer) representations and negative examples are encoded to different (farther) representations.
  • the generated data 1104 distribution should be as close as possible to the input data 1101 distribution, for example, achieved by minimizing a reconstruction loss.
  • FIG. 12 is a block diagram illustrating an architecture for similar and different data generators according to some embodiments.
  • variational autoencoder (VAE) 1200 is a black-box inference model using a variational autoencoder architecture.
  • VAE includes encoder 1220 and decoder 1230.
  • Input 1201 includes real data points from the training data.
  • Noise 1202 and 1206 include randomly-generated noise data.
  • Encoder 1220 takes input data points 1201 and noise data points 1202 as input and produces latent code z 1204.
  • Decoder 1230 (that can be a neural network) takes latent code z as input, applies function 1205, adds noise data points 1206 (which can include noise from a normal distribution), applies function 1207 (which can include addition or multiplication of z and the noise data) to produce reconstructed output 1208.
  • Noise data points 1202 and 1206 are included as additional input to the inference model 1200 instead of adding them at the end, thereby allowing the inference network to learn complex probability distributions.
  • the reconstruction loss, which can represent the difference between the ground truth and other data generated by VAE 1200, is used to generate similar and different data from the real data distribution learned by VAE 1200.
  • Information about the data distribution is stored in two places, code z 1204, and the weights of the network to transform code z 1204 into reconstructed x 1208.
  • Variational autoencoders provide a principled framework for learning deep latent-variable models and corresponding inference models.
  • VAEs include an encoder that produces a mean code μ and a standard deviation code σ. The actual code is then sampled randomly from, for example, a Gaussian distribution with mean μ and standard deviation σ. It is understood that other distributions may also be used for sampling.
  • the VAEs also include a decoder that takes the actual code and decodes it normally to match outputs to inputs.
  • the encoder or recognition network converts the inputs to an internal representation (code) and the decoder (or generative network) converts the internal representation (code) to the outputs.
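  • A minimal sketch of this encoder and its sampling step, using the reparameterization z = μ + σ·ε with ε drawn from a standard normal (layer shapes and dimensions are illustrative):

```python
import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    """Encoder that produces mean (mu) and standard deviation (sigma) codes."""
    def __init__(self, dim, latent=8):
        super().__init__()
        self.mu = nn.Linear(dim, latent)
        self.log_sigma = nn.Linear(dim, latent)

    def forward(self, x):
        mu = self.mu(x)
        sigma = torch.exp(self.log_sigma(x))       # keep sigma positive
        z = mu + sigma * torch.randn_like(sigma)   # sample the actual code
        return z, mu, sigma
```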
  • Latent code z 1204 is learned using a self-supervised learning principle, in which first a discrete autoencoder (encoder 1220) is trained on the output sequences, and then the resulting latent codes 1204 are used as intermediate targets for the end-to-end sequence prediction task.
  • Data generators, such as the similar and different data generators according to some embodiments, can have a VAE architecture, which helps in generating more data.
  • Variational autoencoders are built using machine learning data architectures, such as neural networks, and, for example, can include encoders and decoders which are trained over a number of epochs to generate outputs that can match or represent a similar probability distribution as a set of input data samples.
  • the training can be based on various loss functions, and minimization thereof across training epochs.
  • the VAE can learn parameters of a probability distribution representing the input data, and, accordingly, can be usable to generate new input data samples.
  • the generator (VAE 1200) will learn latent code z 1204 through its encoder 1220 and then generate a new data point 1208 using the decoder 1230. Based on the reconstruction loss between the input x 1201 and the generated output x 1208, the generator will learn how to generate data points similar to the input (if the reconstruction loss is minimized), different from the input (if the reconstruction loss is maximized), or for a specific purpose such as generating a data point that will not be classified (by a classifier) as a generated data point (which involves reconstruction loss, classification loss, and similarity loss).
  • FIG. 13 is a block diagram illustrating an architecture for learning a similarity metric according to some embodiments.
  • Siamese model 1300 is a similarity detection model using a Siamese neural network architecture.
  • Siamese model 1300 includes inputs 1301a and 1301b, neural networks 1302a and 1302b, weights 1303, neural networks outputs 1304a and 1304b, distance 1305 (between 1304a and 1304b), and output 1306.
  • Weights 1303 represent a shared parameter vector that is subject to learning. In some embodiments, a single model can be used twice, with one neural network, and thus reducing the need for saving shared weights 1303.
  • 1304a and 1304b represent the outputs of the neural network after receiving inputs X1 1301a and X2 1301b, and can represent an encoding of the inputs.
  • the inputs are two data points, input 1301a and input 1301b, for which Siamese model 1300 outputs a value 1306, a similarity measure representing how similar or different the two inputs 1301a and 1301b are.
  • a similarity measure or score between two data points may be determined using a Siamese neural network.
  • a data point tuple is passed through the Siamese network to obtain a similarity score. While classification helps in mapping a data point to the class that the data point belongs to, the similarity score helps in measuring how different and/or similar two data points are. Having the classifier learn similarity between its inputs and their classifications results in Siamese model 1300 learning and encoding the latent features of the data better, which leads to better classification.
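  • A hedged sketch of scoring a data point tuple with a shared-weight (Siamese) encoder; the network architecture and the distance-to-similarity mapping exp(−d) are assumptions for illustration:

```python
import torch
import torch.nn as nn

class SiameseEncoder(nn.Module):
    """One network applied twice, realizing the shared weights 1303."""
    def __init__(self, dim, emb=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, emb))

    def similarity(self, x1, x2):
        g1, g2 = self.net(x1), self.net(x2)   # outputs 1304a and 1304b
        d = torch.norm(g1 - g2, dim=-1)       # distance 1305
        return torch.exp(-d)                  # similarity score in (0, 1]
```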
  • FIG. 14 is a block diagram illustrating a system 1400 for a data drift detector model for training a generative adversarial network to detect a data drift of a machine learning model.
  • system 1400 for training a generative adversarial network to detect a data drift of a machine learning model includes model X implemented in cloud 1410 and data drift detector 1420.
  • Cloud computing may be integrated with networks for training a generative adversarial network to detect a data drift of a machine learning model to facilitate resource delivery.
  • Cloud computing refers to an implementation where resources (e.g., processing power, data storage, network logic, protocols, algorithm logic, etc.) are provided to a local client on an on-demand basis, usually by means of the Internet. Resource intensive tasks (e.g., machine learning, monitoring, corrective action) are performed on the cloud systems.
  • models may reside on different servers (e.g., server 1405), as training may need a significant amount of memory and computation resources.
  • discriminator and similar data generator could be running on different servers for better resource allocation.
  • system 1400 may be implemented utilizing edge computing.
  • Edge computing extends cloud computing and services to the edge of a network, for example, using computing nodes deployed inside access networks, mobile devices, or IoT end devices such as sensors and actuators.
  • Edge computing provides data, computing, storage, and application services at the network edge using methods similar to cloud computing in remote data centers.
  • some or all components in whole or in part may be implemented in the edge nodes utilizing edge gateways for performing the resource intensive tasks.
  • the edge nodes and gateways are intermediary to the cloud 1410.
  • cloud or edge computing can be applied to any device described herein, or components thereof, and relates to an implementation in which at least a portion of the functionality is implemented as one or more cloud or edge components.
  • FIG. 15 illustrates an exemplary computer system configurable by a computer program product to carry out embodiments of the present disclosure.
  • computer system 1500 may provide one or more of the components of training a generative adversarial network to detect a data drift of a machine learning model.
  • Computer system 1500 executes instruction code contained in a computer program product 1560 (which may, for example, be part of the training a generative adversarial network to detect a data drift of a machine learning model as discussed herein).
  • Computer program product 1560 comprises executable code in an electronically readable medium that may instruct one or more computers such as computer system 1500 to perform processing that accomplishes the exemplary method steps performed by the embodiments referenced herein.
  • the electronically readable medium may be any non-transitory medium that stores information electronically and may be accessed locally or remotely, for example, via a network connection.
  • the medium may be transitory.
  • the medium may include a plurality of geographically dispersed media, each configured to store different parts of the executable code at different locations or at different times.
  • the executable instruction code in an electronically readable medium directs the illustrated computer system 1500 to carry out various exemplary tasks described herein.
  • the executable code for directing the carrying out of tasks described herein would typically be realized in software.
  • computers or other electronic devices might utilize code realized in hardware to perform many or all the identified tasks without departing from the present disclosure.
  • Those skilled in the art will understand that many variations on executable code may be found that implement exemplary methods within the spirit and the scope of the present disclosure.
  • the code or a copy of the code contained in computer program product 1560 may reside in one or more storage persistent media (not separately shown) communicatively coupled to computer system 1500 for loading and storage in persistent storage device 1570 and/or memory 1510 for execution by processor 1520.
  • Computer system 1500 also includes I/O subsystem 1530 and peripheral devices 1540. I/O subsystem 1530, peripheral devices 1540, processor 1520, memory 1510, and persistent storage device 1570 are coupled via bus 1550.
  • memory 1510 is a non-transitory media (even if implemented as a typical volatile computer memory device).
  • memory 1510 and/or persistent storage device 1570 may be configured to store the various data elements referenced and illustrated herein.
  • computer system 1500 illustrates just one example of a system in which a computer program product in accordance with an embodiment of the present disclosure may be implemented.
  • storage and execution of instructions contained in a computer program product in accordance with an embodiment of the present disclosure may be distributed over multiple computers, such as, for example, over the computers of a distributed computing network.
  • FIG. 16 is a block diagram illustrating a virtualization environment 1600 in which functions implemented by some embodiments may be virtualized.
  • virtualizing means creating virtual versions of apparatuses or devices which may include virtualizing hardware platforms, storage devices and networking resources.
  • virtualization can be applied to any device described herein, or components thereof, and relates to an implementation in which at least a portion of the functionality is implemented as one or more virtual components.
  • the one or more VMs may be implemented in one or more virtual environments 1600 hosted by one or more of hardware nodes, such as a hardware computing device that operates as a network node, user equipment (UE), core network node, host, web server, application server, virtual server or the like.
  • the virtual node does not require radio connectivity (e.g., a core network node or host)
  • training data drift detecting models may be implemented as virtual components executed by one or more virtual machines (VMs) implemented in one or more virtual environments 1600 hosted by one or more of hardware nodes.
  • central units, distributed nodes, and the data drift detecting model may be implemented as virtual components executed by one or more virtual machines (VMs) implemented in one or more virtual environments 1600 hosted by one or more hardware nodes.
  • VMs virtual machines
  • Applications 1602 (which may alternatively be called software instances, virtual appliances, network functions, virtual nodes, virtual network functions, etc.) are run in the virtualization environment 1600 to implement some of the features, functions, and/or benefits of some of the embodiments disclosed herein, including, for example, systems and methods for data drift detection, classification, loss calculation, and training a GAN network.
  • Hardware 1604 includes processing circuitry, memory that stores software and/or instructions executable by hardware processing circuitry, and/or other hardware devices as described herein, such as a network interface, input/output interface, and so forth.
  • Software may be executed by the processing circuitry to instantiate one or more virtualization layers 1606 (also referred to as hypervisors or virtual machine monitors (VMMs)), provide VMs 1608a and 1608b (one or more of which may be generally referred to as VMs 1608), and/or perform any of the functions, features, and/or benefits described in relation to some embodiments described herein, including, for example, systems and methods for data drift detection, classification, loss calculation, and training a GAN network.
  • the virtualization layer 1606 may present a virtual operating platform that appears like networking hardware to the VMs 1608.
  • the VMs 1608 comprise virtual processing, virtual memory, virtual networking or interface and virtual storage, and may be run by a corresponding virtualization layer 1606.
  • a virtualization layer 1606 may be implemented on one or more of VMs 1608, and the implementations may be made in different ways.
  • Virtualization of the hardware is in some contexts referred to as network function virtualization (NFV).
  • NFV network function virtualization
  • NFV may be used to consolidate many network equipment types onto industry standard high volume server hardware, physical switches, and physical storage, which can be located in data centers, and customer premise equipment.
  • a VM 1608 may be a software implementation of a physical machine that runs programs as if they were executing on a physical, non- virtualized machine.
  • Each of the VMs 1608, together with the part of hardware 1604 that executes that VM, forms a separate virtual network element.
  • a virtual network function is responsible for handling specific network functions that run in one or more VMs 1608 on top of the hardware 1604 and corresponds to the application 1602.
  • Hardware 1604 may be implemented in a standalone network node with generic or specific components. Hardware 1604 may implement some functions via virtualization. Alternatively, hardware 1604 may be part of a larger cluster of hardware (e.g., such as in a data center or customer premises equipment (CPE)) where many hardware nodes work together and are managed via management and orchestration 1610, which, among others, oversees lifecycle management of applications 1602. In some embodiments, hardware 1604 is coupled to one or more radio units that each include one or more transmitters and one or more receivers that may be coupled to one or more antennas.
  • CPE customer premises equipment
  • Radio units may communicate directly with other hardware nodes via one or more appropriate network interfaces and may be used in combination with the virtual components to provide a virtual node with radio capabilities, such as a radio access node or a base station.
  • some signaling can be provided with the use of a control system 1612 which may alternatively be used for communication between hardware nodes and radio units.
  • inventive subject matter is considered to include all possible combinations of the disclosed elements. As such, if one embodiment comprises elements A, B, and C, and another embodiment comprises elements B and D, then the inventive subject matter is also considered to include other remaining combinations of A, B, C, or D, even if not explicitly discussed herein.
  • transitional term “comprising” means to have as parts or members, or to be those parts or members. As used herein, the transitional term “comprising” is inclusive or open-ended and does not exclude additional, unrecited elements or method steps.
  • Any process described herein may be performed in any order and may omit any of the steps in the process. Processes may also be combined with other processes or steps of other processes. Although steps or operations may be described as a sequential process, some of the steps or operations may in fact be performed in parallel, concurrently, and/or in a distributed environment, and with program code stored locally or remotely for access by single or multi-processor machines. In addition, in some embodiments the order of steps or operations may be rearranged without departing from the spirit of the disclosed subject matter.
  • a server can include one or more computers operating as a web server, database server, or other type of computer server in a manner to fulfill described roles, responsibilities, or functions.
  • the various servers, systems, databases, or interfaces can exchange data using standardized protocols or algorithms, possibly based on Hypertext Transfer Protocol (HTTP), secure Hypertext Transfer Protocol (HTTPS), Advanced Encryption Standard (AES), public-private key exchanges, web service Application programming interfaces (APIs), known financial transaction protocols, or other electronic information exchanging methods.
  • Data exchanges can be conducted over a packet-switched network, a circuit-switched network, the Internet, Local area network (LAN), wide area network (WAN), virtual private network (VPN), or other type of network.
  • a system, server, device, model, or other computing element being configured to perform or execute functions on data in a memory, where the meaning of “configured to” or “programmed to” is defined as one or more processors or cores of the computing element being programmed by a set of software instructions stored in the memory of the computing element to execute the set of functions on target data or data objects stored in the memory.
  • any language directed to a computing device should be read to include any suitable combination of computing devices, including servers, interfaces, systems, databases, agents, peers, engines, controllers, modules, or other types of computing devices operating individually or collectively.
  • the computing devices comprise a processor configured to execute software instructions stored on a tangible, non-transitory computer readable storage medium (e.g., hard drive, field programmable gate array (FPGA), programmable logic array (PLA), solid state drive, RAM, flash, ROM, etc.).
  • the software instructions configure or program the computing device to provide the roles, responsibilities, or other functionality as discussed below with respect to the disclosed apparatus.
  • the disclosed technologies can be embodied as a computer program product that includes a non-transitory computer readable medium storing the software instructions that causes a processor to execute the disclosed steps associated with implementations of computer-based algorithms, processes, methods, or other instructions.
  • the various servers, systems, databases, or interfaces exchange data using standardized protocols or algorithms, possibly based on HTTP, HTTPS, AES, public-private key exchanges, web service APIs, known financial transaction protocols, or other electronic information exchanging methods.
  • Data exchanges among devices can be conducted over a packet-switched network, the Internet, LAN, WAN, VPN, or other type of packet-switched network; a circuit-switched network; a cell-switched network; or other type of network.
  • Systems, devices, and methods described herein may be implemented using a computer program product tangibly embodied in an information carrier, e.g., in a non-transitory machine-readable storage device, for execution by a programmable processor; and the method steps described herein, including for example one or more of the steps of FIGs. 1, 4, 7 and 8 may be implemented using one or more computer programs that are executable by such a processor.
  • a computer program is a set of computer program instructions that can be used, directly or indirectly, in a computer to perform a certain activity or bring about a certain result.
  • a computer program can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
  • computing devices described herein may include the illustrated combination of hardware components
  • computing devices may comprise multiple different physical components that make up a single illustrated component, and functionality may be partitioned between separate components.
  • a communication interface may be configured to include any of the components described herein, and/or the functionality of the components may be partitioned between the processing circuitry and the communication interface.
  • non-computationally intensive functions of any of such components may be implemented in software or firmware and computationally intensive functions may be implemented in hardware.
  • processing circuitry executing instructions stored in memory, which in certain embodiments may be a computer program product in the form of a non-transitory computer-readable storage medium.
  • some or all of the functionality may be provided by the processing circuitry without executing instructions stored on a separate or discrete device-readable storage medium, such as in a hard-wired manner.
  • the processing circuitry can be configured to perform the described functionality. The benefits provided by such functionality are not limited to the processing circuitry alone or to other components of the computing device but are enjoyed by the computing device as a whole, and/or by end users and a wireless network generally.
  • the disclosed technology is designed to be compatible with and operable by any computing device, including, for example, a desktop computer, a mobile device, a smart phone, an Internet of Things device, an Augmented Reality or Virtual Reality device, personal digital assistant (PDA), gaming console or device, playback appliance, wearable terminal device, mobile station, tablet, laptop, or a combination thereof.
  • “training” herein is not necessarily limited to a supervised, unsupervised, or semi-supervised approach.
  • Supervised machine learning is the machine learning task of inferring a function from supervised (labeled) training data.
  • Unsupervised learning is the machine learning task of finding hidden structure (a function) in unlabeled data.
  • Semi-supervised machine learning includes training with labeled and unlabeled data.
  • backpropagation refers to updating weights of nodes constituting the learning network according to a calculated loss.
  • calculating a loss is not limited to a specific scheme, and for example, hinge loss, square loss, Softmax loss, cross-entropy loss, absolute loss, insensitive loss, or the like may be used.
  • “Coupled to” is intended to include both direct coupling (in which two elements that are coupled to each other contact each other) and indirect coupling (in which at least one additional element is located between the two elements). Therefore, the terms “coupled to” and “coupled with” are used synonymously. Within the context of a networked environment where two or more components or devices are able to exchange data, the terms “coupled to” and “coupled with” are also used to mean “communicatively coupled with”, possibly via one or more intermediary devices.

Abstract

A computer-implemented method for training a generative adversarial network to detect a data drift of a machine learning model is described. The method uses supervised learning to generate a plurality of data points based on training data of the machine learning model. The method uses supervised learning to classify each data point of the plurality of data points and generate a classification probability distribution for the plurality of data points. The method detects the data drift of the machine learning model based on a deviation in the classification probability distribution. The detected data drift enables triggering a corrective action.

Description

CLASSIFICATION AND MODEL RETRAINING DETECTION IN MACHINE LEARNING
Inventors: Mohamed Naili
Karthikeyan Premkumar
TECHNICAL FIELD
[0001] The disclosure generally relates to machine learning models, and in particular to detecting data drift in machine learning models.
BACKGROUND
[0002] Machine learning models are often utilized to make predictions or decisions based on real data, without being explicitly programmed to do so. As opposed to methods or circuitry implemented by fixed program instructions, machine learning methods and circuitry derive knowledge (or “learn”) from example inputs of real data (e.g., training data set) and rely on patterns and inferences to make predictions.
[0003] However, real data, much like the real world, continue to change and evolve with time. When incoming real data change sufficiently, relative to the original training data set, the prediction or inference accuracy of the machine learning model declines. The decline in predictive power due to changes in the environment is caused by a data drift. A data drift causes misclassification of new data when the real data distribution changes relative to the distribution of the original training data set or when there is a class imbalance in the data. Thus, a solution is needed to detect a data drift based on a change in data.
SUMMARY
[0004] Conventional approaches fail to detect a data drift in a timely manner, before the model becomes inaccurate. Therefore, a solution is needed for timely detecting a data drift and triggering the model retraining process, without user feedback, for models in production. The present disclosure solves the above technical problems and provides a technical solution that trains and utilizes a generative adversarial network to detect a data drift of the machine learning model in production.
[0005] To address these challenges, the disclosure includes methods and systems for training a generative adversarial network to detect a data drift of a machine learning model.
[0006] The present disclosure provides a computer-implemented method for training a generative adversarial network to detect a data drift of a machine learning model. The method uses supervised learning to generate a plurality of data points based at least on training data of the machine learning model. The method uses supervised learning to classify each data point of the plurality of data points and generate a classification probability distribution for the plurality of data points. The method detects the data drift of the machine learning model based on a deviation in the classification probability distribution. The detected data drift can enable triggering of a corrective action.
[0007] In some embodiments, the method detects the data drift by calculating a data drift score based on the classification probability distribution. In some embodiments, generating the plurality of data points includes generating a different data point based at least on maximizing the reconstruction loss. In some embodiments, generating the plurality of data points includes generating a similar data point based at least on minimizing the reconstruction loss. In some embodiments, detecting the data drift includes calculating a data drift score and determining whether the data drift score is above a predetermined threshold within a predetermined period of time.
[0008] The present disclosure further provides a non-transitory computer readable medium or media containing instructions for executing a method for training a generative adversarial network to detect a data drift of a machine learning model. The method uses supervised learning to generate a plurality of data points based at least on training data of the machine learning model. The method uses supervised learning to classify each data point of the plurality of data points and generate a classification probability distribution for the plurality of data points. The method detects the data drift of the machine learning model based on a deviation in the classification probability distribution. The detected data drift can enable triggering of a corrective action.
[0009] The present disclosure provides a system for executing a method for training a generative adversarial network to detect a data drift of a machine learning model. The system includes a database connected to a network, configured for receiving and storing training data. The system includes one or more processors and memory. The memory contains instructions executable by the one or more processors whereby the system is operative to use supervised learning to generate a plurality of data points based at least on training data of the machine learning model. The system is operative to use supervised learning to classify each data point of the plurality of data points and generate a classification probability distribution for the plurality of data points. The system is operative to detect the data drift of the machine learning model based on a deviation in the classification probability distribution. The detected data drift can enable triggering of a corrective action.
BRIEF DESCRIPTION OF THE DRAWINGS
[0010] Exemplary embodiments will be described with references to the accompanying figures, wherein:
[0011] FIG. 1 is a flowchart illustrating a method for training a generative adversarial network to detect a data drift of a machine learning model in accordance with some embodiments;
[0012] FIG. 2 is a block diagram illustrating an example data drift detecting system in accordance with some embodiments;
[0013] FIG. 3 is an example data diagram related to training a GAN to detect a data drift of a machine learning model in accordance with some embodiments;
[0014] FIG. 4 is a flowchart illustrating a method for training a GAN to detect a data drift of a machine learning model in accordance with some embodiments;
[0015] FIG. 5 is a block diagram illustrating a system for training a GAN to detect a data drift of a machine learning model in accordance with some embodiments;
[0016] FIG. 6 is a block diagram illustrating a system for training a GAN to detect a data drift of a machine learning model in accordance with some embodiments;
[0017] FIG. 7 is a flowchart illustrating a method for training a GAN to detect a data drift of a machine learning model in accordance with some embodiments;
[0018] FIG. 8 is a flowchart illustrating a method for training a GAN to detect a data drift of a machine learning model in accordance with some embodiments;
[0019] FIG. 9 is a block diagram illustrating a system for a discriminator-classifier model for training a generative adversarial network to detect a data drift of a machine learning model in accordance with some embodiments;
[0020] FIG. 10 is a block diagram illustrating a system for a different data generator model for training a generative adversarial network to detect a data drift of a machine learning model in accordance with some embodiments;
[0021] FIG. 11 is a block diagram illustrating a system for a similar data generator model for training a generative adversarial network to detect a data drift of a machine learning model in accordance with some embodiments;
[0022] FIG. 12 is a block diagram illustrating an architecture for similar and different data generators in accordance with some embodiments;
[0023] FIG. 13 is a block diagram illustrating an architecture for learning a similarity metric in accordance with some embodiments;
[0024] FIG. 14 is a block diagram illustrating a system for a data drift detector model for training a generative adversarial network to detect a data drift of a machine learning model in accordance with some embodiments;
[0025] FIG. 15 is a block diagram illustrating an exemplary computer system configurable by a computer program product to carry out embodiments of the present disclosure; and
[0026] FIG. 16 is a block diagram illustrating a virtualization environment in which functions implemented by some embodiments of the present disclosure may be virtualized.
[0027] While the concept is described with reference to the above drawings, the drawings are intended to be illustrative, and the disclosure contemplates other embodiments within the spirit of the concept.
DETAILED DESCRIPTION
[0028] The concept will now be described more fully hereinafter with reference to the accompanying drawings which show, by way of illustration, specific embodiments. The concept may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the concept to those skilled in the art. Among other things, the concept may be embodied as devices or methods.
Accordingly, the concept may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. The following detailed description is, therefore, not to be taken in a limiting sense.
[0029] In some embodiments, methods and systems for training a generative adversarial network (GAN) to detect a data drift of a machine learning model are provided. When a data drift occurs, model retraining is likely needed. For example, a data drift is indicated when the target variable being predicted keeps changing beyond an acceptable threshold because of the model drift caused by changes of underlying data. When comparing the training data set and a similar set of new data shows a significant deviation, the existing model will no longer be able to make the same generalizations. For example, the predictions the model makes are no longer as accurate as they were at the time of training. Adversarial network and classification models have many important applications in pattern recognition, anomalies detection, systems’ failure detection, and the like. Thus, the cost of wrong predictions or classifications in production is high. Systems and methods for monitoring a model for data drift and continuously determining the retraining interval are thus provided in the present disclosure.
[0030] In some embodiments, the present disclosure provides detection of change in data distribution on many levels from real data (or identical to real data) to very different data. The detection of the level of data distribution change facilitates deciding if the model needs re-training without the need for user feedback. Further, the classification of a similar data generator’s output helps in better classification performance. Finally, the present disclosure provides imbalanced dataset mitigation through pairwise comparison (e.g., similarity) of data points.
[0031] In some embodiments, generator and discriminator architectures may be used in addition to, or instead of the above models, where the two models, generator and discriminator, compete against each other. The discriminator tries to learn how to classify a real data point as real, and how to classify a data point generated by the generator as fake. The generator tries to learn how to generate a data point that would be classified as real by the discriminator.
[0032] FIG. 1 is a flow diagram illustrating a method 100 for training a generative adversarial network to detect a data drift of a machine learning model. In one embodiment, method 100 for training a generative adversarial network to detect a data drift of a machine learning model begins with step 101. At step 101, the method receives training data of the machine learning model. In some embodiments, the method can also include preprocessing the training data of the machine learning model. The preprocessing includes, for example, rebalancing a set of tuples of the training data, cleaning the training data, scaling the training data, or a combination thereof.
[0033] At step 102, method 100 uses supervised learning to generate a plurality of data points based at least on training data of the machine learning model. In some embodiments, generating the plurality of data points can include generating a different data point based at least on maximizing the reconstruction loss. In some embodiments, the reconstruction loss represents the difference between the ground truth (real data) and another data point (generated data) produced by a variational autoencoder (VAE)-based data generator, and may be used for many purposes, such as anomalies detection or generating more data from a learned data distribution. Variational autoencoders (VAEs) are expressive latent variable models that can be used to learn complex probability distributions from training data.
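To make the role of the reconstruction loss concrete, consider the following minimal sketch (not part of the original disclosure; the mean squared error and all names are illustrative assumptions):

```python
import numpy as np

def reconstruction_loss(real_point: np.ndarray, generated_point: np.ndarray) -> float:
    """Mean squared error between a real data point and its reconstruction.

    A "different" data generator would be trained to maximize this value,
    while a "similar" data generator would minimize it.
    """
    return float(np.mean((real_point - generated_point) ** 2))

real = np.array([0.2, 0.5, 0.9])
similar = np.array([0.21, 0.48, 0.92])
different = np.array([0.9, 0.1, 0.3])
print(reconstruction_loss(real, similar))    # small loss -> similar data
print(reconstruction_loss(real, different))  # large loss -> different data
```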
[0034] In some embodiments, generating the plurality of data points can include generating a similar data point based at least on minimizing the reconstruction loss. In some embodiments, generating the plurality of data points can include generating a similar to very different data point based at least on minimizing the reconstruction loss. In some embodiments, after preprocessing the data set, the method generates a combination of tuples in the format (data point x, data point y), regardless of whether labels for the data exist.
[0035] At step 103, the method uses supervised learning to classify each data point of the plurality of data points and generate a classification probability distribution for the plurality of data points. In some embodiments, the method can also include using unsupervised learning to enable anomalies detection based on the classification probability distribution. In some embodiments, using supervised learning to generate a classification probability distribution can include training a discriminator-classifier generative adversarial network adapted to generate the classification probability distribution.
[0036] In some embodiments, training the discriminator-classifier can include generating a data point tuple based on the plurality of data points. In some embodiments, training the discriminator- classifier can include calculating a classification score and a classification loss for each data point in the plurality of data points. In some embodiments, training the discriminator-classifier can include calculating a similarity score and a similarity loss for two data points in the plurality of data points. In some embodiments, training the discriminator-classifier can include calculating classification values and similarity values. In some embodiments, training the discriminator-classifier can include calculating mean losses for the plurality of data points based on the classification values and the similarity values.
[0037] At step 104, method 100 detects the data drift of the machine learning model based on a deviation in the classification probability distribution, wherein the detected data drift enables triggering a corrective action. In some embodiments, detecting the data drift can include calculating a data drift score based on the classification probability distribution. In some embodiments, detecting the data drift can include calculating a data drift score and determining whether the data drift score is above a predetermined threshold within a predetermined period of time.
[0038] At step 105, the method triggers a corrective action based on the detected data drift. In some embodiments, a corrective action can include at least one of retraining of the machine learning model and ensemble classification.
[0039] In some embodiments, the method simultaneously trains a discriminator-classifier GAN adapted to generate the classification probability distribution and trains at least one data generator adapted to generate the plurality of data points.
[0040] In some embodiments, the method trains at least one data generator adapted to generate the plurality of data points and calculates at least one of classification values, similarity values and a reconstruction value.
[0041] In some embodiments, the method trains at least one data generator adapted to generate the plurality of data points and calculates mean losses for the plurality of data points based on at least one of classification values, similarity values, and a reconstruction value.
[0042] Generative models in machine learning can be trained using an unlabeled dataset and are capable of generating new data points after training is completed. As generating new content requires a good understanding of the training data at hand, such models are often regarded as a key ingredient to unsupervised learning.
[0043] Adversarial aspects include simultaneously training two models, the generator and the discriminator, with competing objectives. The generator captures the data distribution, and the discriminator estimates the probability that a sample came from the training data rather than the generator. The training target for the generator is to maximize the probability of the discriminator making a mistake. Thus, the generative model competes against an adversary: a discriminative model that learns to determine whether a sample is from the model distribution or the data distribution. By iterating the adversarial learning process between the generator and the discriminator, the generator will eventually be able to generate data points that successfully confuse the discriminator.
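The alternating optimization described above can be sketched as follows; this is an illustrative PyTorch loop under assumed toy architectures and hyperparameters, not the disclosed implementation:

```python
import torch
import torch.nn as nn

# Toy stand-ins; the actual generator/discriminator architectures are
# not specified at this level of the description.
generator = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
discriminator = nn.Sequential(nn.Linear(4, 16), nn.ReLU(),
                              nn.Linear(16, 1), nn.Sigmoid())
bce = nn.BCELoss()
g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)

for step in range(100):
    real = torch.randn(32, 4)   # placeholder for real training data
    noise = torch.randn(32, 8)

    # Discriminator step: push D(real) toward 1 and D(G(z)) toward 0.
    d_opt.zero_grad()
    fake = generator(noise).detach()
    d_loss = (bce(discriminator(real), torch.ones(32, 1))
              + bce(discriminator(fake), torch.zeros(32, 1)))
    d_loss.backward()
    d_opt.step()

    # Generator step: maximize the probability that D is fooled.
    g_opt.zero_grad()
    g_loss = bce(discriminator(generator(noise)), torch.ones(32, 1))
    g_loss.backward()
    g_opt.step()
```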
[0044] FIG. 2 shows an example data drift detecting system 200 in accordance with some embodiments. In the example, data drift detecting system 200 for training a GAN to detect a data drift of a machine learning model includes model 210, data drift detector 220, generator 230, and discriminator 240.
[0045] Data drift detecting system 200 facilitates detecting data drift and consequently, triggering the re-training process of model 210. Based on the input and/or output of model 210, system 200 can generate a data drift score that can be used to determine if model X 210 needs to be re-trained or not. Additionally, system 200 can be used for other (optional) tasks, such as classification. In some embodiments, model X 210 is a machine learning model operating in production and being monitored for a data drift based on its input, output, and prediction abilities.
[0046] System 200 determines that model X 210 needs retraining based on the following. If, within a particular or predetermined time window or time interval, the data drift (re-training) score becomes greater than a predefined threshold, system 200 determines that the data distribution has been changed and model X 210 needs to be retrained on new data. The re-training score is calculated based on outputs provided by discriminator 240.
[0047] System 200 determines that model X 210 needs better classification based on the following. Discriminator-classifier 240 will be able to better classify new data as it classifies data as being from a real data distribution, similar data distribution, different data distribution, or very different data distribution.
[0048] In some embodiments, system 200 performs an optional task such as classification, where discriminator-classifier 240 classifies data points to a set of classes (e.g., target classes, a “general” class, and an “unknown” class).
[0049] In some embodiments, system 200 processes an imbalanced dataset, by relying on pairwise comparisons between data points. System 200 augments the training dataset size and applies other techniques such as an under-sampling technique to improve the balance of the newly generated training dataset.
[0050] FIG. 3 shows an example data diagram 300 related to training a GAN to detect a data drift of a machine learning model in accordance with some embodiments. In the example, diagram 300 includes real data (training data) 310, similar to real data 320, different data 330, and very different data 340.
[0051] Regardless of the application for training a GAN to detect a data drift of a machine learning model, whether it is for pattern recognition, anomalies detection, failure detection, among others, when incoming data distribution is different from training data distribution (real data 310), false classification or false prediction results. As such, a model retraining may be triggered. Data diagram 300 illustrates data categorization distributions for real data 310, similar data 320, different data 330, and very different data 340.
[0052] Real data 310 is the training data of any model being monitored for a data drift. Similar to real data 320 is generated based on a distribution of real data 310, yet still very close to it. In other words, a data point that is “similar” to a real data point will have very close feature values and would be classified the same as the real data point if the latter belongs to a known class.
[0053] Different data 330 and very different data 340 are generated from other data distributions where generated data points will not be classified the same as data points in real data 310.
[0054] The method for training a GAN to detect a data drift of a machine learning model avoids false re-training triggers by training the two generator models described herein: a similar data generator model that can generate similar to real data 320, and a different data generator that generates different data 330.
[0055] FIG. 4 is a flowchart illustrating a method 400 for training a generative adversarial network to detect a data drift of a machine learning model. In one embodiment, method 400 for training a GAN to detect a data drift of a machine learning model begins with step 401. At step 401, the method collects and preprocesses data. In some embodiments, preprocessing data includes rebalancing a set of tuples of the training data, cleaning the training data, scaling the training data, or a combination thereof.
[0056] At step 402, the method determines whether labels for the data points of the collected and preprocessed data are provided. If the answer is yes, the method proceeds to step 404. If the answer is no, the method proceeds to step 403.
[0057] If some (or all) input data labels are provided (e.g., in supervised or semi-supervised learning), task 2 will classify data points into a known class or target class(es) (for real data points with a label and for data points that are similar to the labeled real data points), a general class (for real data points without a label and for data points that are similar to the unlabeled real data points), and an unknown class for different and very different data points. Task 2 is described below in more detail with respect to task 1, which classifies data into real, similar to real, different, and very different categories.
[0058] If input data labels are not provided (unsupervised learning), task 2 will not be performed. However, real data points and similar data points will be classified as similar to each other, and different data points and very different data points will be classified as similar to each other.
[0059] At step 403, the method does not consider task 2, because task 2 depends on having labels for the data points. At step 404, the method performs training, validation and testing. In some embodiments, during the training period, each of the data drift detecting models, including the discriminator, the different data generator, and the similar data generator, is trained on the real data points over their respective objective functions and loss functions.
[0060] At step 405, the method begins making inferences. During the inference period, the discriminator produces classification probabilities. The classification probabilities may then be used for classification and to calculate the data drift (re-training) score to decide if the monitored model is appropriately or accurately processing incoming data or whether the monitored model needs to be retrained or not.
[0061] At step 406, the method calculates a retraining score. The retraining score is calculated based on outputs provided by the discriminator. For example, while the monitored model is in production, during a predefined time window, the method calculates, for each data point that comes in, a score. The mean score for all the data points is calculated and used to determine the retraining score.
[0062] At step 407, the method determines whether the retraining score is greater than the retraining threshold. If the answer is yes, the method proceeds to step 408. If the answer is no, the method proceeds back to step 401 to collect and preprocess more data. The retraining threshold is determined based on the particulars of the model being monitored. Some models have frequent data drifts, for example, models that detect crashes. Initially, a random training threshold may be set during a model’s testing phase to determine if it is low enough to detect a data drift within a predetermined time interval, such as six months. Then, during the calibration phase of the model, the threshold and time interval parameters may be tuned to identify appropriate parameters for the particular model, to be applied during the production phase of the model.
[0063] At step 408, the method retrains the monitored model, which has been trained on old data. Model retraining is needed when the target variable being predicted keeps changing beyond an acceptable threshold because of the underlying data changes, causing data drift. Sometimes this is referred to as model drift, which is a misnomer since it is not the model that is changing; rather, it is the environment or the data that is changing. When comparing the training data set to a similar set of new data shows a significant deviation, the existing model will no longer be able to make predictions as accurate as those made at the time of training. Thus, the method needs to be performed to continuously monitor for data drift and determine the retraining interval.
[0064] During the inference period (step 405), various equations may be used to calculate the final classification probabilities. On testing, for a given input data point, the discriminator classification outputs may decide labels.
[0065] The method could be used for many use cases, including but not limited to crashes detection or anomalies detection.
[0066] For example, crashes detection may be achieved with supervised or semi-supervised learning. During training, training data points may be completely labeled or partially labeled as being crashed or not. Thus, the method considers task 2 for the discriminator and trains all the models as described above.
[0067] During inference, the probabilities provided by the discriminator model can be used to decide if the data point represents a crash or not (P(Class = crash)).
[0068] Also, a weighted decision could be considered for the same purpose, where the class's probability is weighted by the probabilities of being real and similar to real (μ and ε are parameters to be tuned and γ is a very small positive number):
[0069] Class's weighted probability = (μ × P(real) + ε × P(similar to real) + γ) × P(Class)
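In code, the weighting of paragraph [0069] is a one-line computation; the parameter values below are illustrative placeholders, not tuned values from the disclosure:

```python
def class_weighted_probability(p_real: float, p_similar: float, p_class: float,
                               mu: float = 0.5, eps: float = 0.5,
                               gamma: float = 1e-6) -> float:
    """(mu * P(real) + eps * P(similar to real) + gamma) * P(Class).

    mu and eps are tunable weights; gamma is a very small positive number.
    """
    return (mu * p_real + eps * p_similar + gamma) * p_class

# Example: a confident class prediction on data that looks real.
print(class_weighted_probability(p_real=0.8, p_similar=0.15, p_class=0.9))
```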
[0070] Furthermore, the method leverages the similar data point generated by the similar data generator (based on the input data point) to generate a weighted decision, where the class's probability is weighted by the probabilities of being real and similar to real:
[0071] Class's weighted probability = [equation not recoverable from the source images; it extends the weighting of paragraph [0069] with the probabilities computed for the generated similar data point]
[0072] As another example, unsupervised learning may be used for anomalies detection. If real data points have no labels, task 2 is not performed. The method trains models accordingly as described above.
[0073] During inference, the method defines a weighted probability for an input to be considered an anomaly as follows:
Anomaly's weighted probability = (μ″ × P(different) + ε″ × P(very different) + γ) × P(Unknown)
[0074] Using the similar data generator's output:
Anomaly's weighted probability = [equation not recoverable from the source images; it extends the weighting of paragraph [0073] with the probabilities computed for the generated similar data point]
[0075] The method decides that a data drift exists based on the discriminator classification results. For each data point received within a predetermined window of time (ν, ρ, θ, the data drift threshold, and the size of the window of time are parameters to be tuned):
Data point's retraining score = ν × P(similar to different) + ρ × P(different) + θ × P(unknown)
The retraining score is equal to:
Retraining score = MEAN over all data points i in the window of time (data point's retraining score_i)
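The per-point score and the windowed mean above, together with the threshold test of paragraph [0076] below, can be sketched as follows (the weights, threshold, and example probabilities are assumptions for illustration):

```python
import numpy as np

def point_retraining_score(p_sim_diff, p_diff, p_unknown,
                           nu=1.0, rho=1.0, theta=1.0):
    """nu * P(similar to different) + rho * P(different) + theta * P(unknown)."""
    return nu * p_sim_diff + rho * p_diff + theta * p_unknown

def needs_retraining(window_probs, threshold):
    """window_probs: (P(similar to different), P(different), P(unknown))
    tuples collected for each data point within the window of time."""
    scores = [point_retraining_score(*p) for p in window_probs]
    return np.mean(scores) > threshold

window = [(0.10, 0.60, 0.20), (0.20, 0.50, 0.20), (0.05, 0.70, 0.20)]
print(needs_retraining(window, threshold=0.5))  # True -> trigger retraining
```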
[0076] If the retraining score is larger than a retraining threshold within the predetermined window of time, then the model processing this data may need to be retrained.
[0077] FIG. 5 is a block diagram illustrating a system 500 for training a GAN to detect a data drift of a machine learning model.
[0078] In one embodiment, system 500 for training a GAN to detect a data drift of a machine learning model includes training period 510 followed by inference period 520. Training period 510 includes training different data generator 511, similar data generator 512, and discriminator-classifier 513. Inference period 520 includes inferring classification probability 521, and data drift score 522.
[0079] System 500 is adapted for training a GAN to detect a data drift of a machine learning model in at least two phases, including a training period 510 and an inference period 520. During training period 510, each of the models, including the discriminator-classifier 513 (or simply discriminator), different data generator 511, and similar data generator 512, is trained on the real data points over their respective objective functions and loss functions.
[0080] During inference period 520, discriminator 513 produces classification probability 521 that may be used for classification and to calculate data drift score 522 (or re-training score). When the classification probabilities indicate a deviation in data distribution from the real data (training data) and when the data drift scores are over a threshold, system 500 may determine that the model being monitored is inaccurately processing incoming data and needs to be retrained.
[0081] FIG. 6 is a block diagram illustrating a system 600 for training a GAN to detect a data drift of a machine learning model.
[0082] In one embodiment, system 600 for training a GAN to detect a data drift of a machine learning model includes hyperparameters 610. Hyperparameters 610 include classification weights 611, similarity weights 612, and reconstruction weights 613.
[0083] System 600 includes hyperparameters 610 which need to be defined and tuned. Classification weights 611 are listed in Table 1. Similarity weights 612 are listed in Table 2. Reconstruction weights 613 are listed in Table 3.
Table 1: [classification weights; the table is not reproduced in this text]
Table 2: [similarity weights; the table is not reproduced in this text]
Table 3: [reconstruction weights; the table is not reproduced in this text]
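Because the tables themselves are not reproduced here, the sketch below only illustrates one plausible way such weights could be organized in code; every key and value is an assumption:

```python
# Hypothetical layout of the tunable weights referenced by Tables 1-3.
hyperparameters = {
    "classification_weights": {            # Table 1
        "alpha_c1": 1.0,                   # weight of classification loss 1
        "alpha_c2": 1.0,                   # weight of classification loss 2 (task 2)
    },
    "similarity_weights": {                # Table 2, keyed by data-point types
        ("real", "similar"): 1.0,
        ("real", "different"): 1.0,
    },
    "reconstruction_weights": {            # Table 3, one per generator
        "similar_generator": 1.0,
        "different_generator": 1.0,
    },
}
```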
[0084] FIG. 7 is a flowchart illustrating a method 700 for training a generative adversarial network to detect a data drift of a machine learning model. In one embodiment, method 700 for training a GAN to detect a data drift of a machine learning model begins with step 710.
[0085] At step 710, the method generates a combination of input tuples for each data point in a preprocessed dataset. Step 710 generates input data 715, for example in the format (data point x, data point y). After preprocessing the data set, including for example cleaning the data set and scaling, the method generates a combination “combox” of tuples “tupleS” (data point x, data point y), regardless of whether their labels exist in the preprocessed data set. In some embodiments, if the size of combox is too large (not enough resources to manage it) or the dataset is highly imbalanced (in case some labels have been provided), the method may undersample a (balanced) subset.
[0086] At step 720, the method generates a “similar data point” for each input tuple using the similar data generator model.
[0087] At step 730, the method generates a “different data point” for each input tuple using the different data generator model.
[0088] Steps 720 and 730 generate input data 725, for example, in the format (data point x, data point y, “similar data point x”, “similar data point y”, “different data point x”, “different data point y”). For each data point in tupleS, the method generates a different data point and a similar data point using the different data generator and the similar data generator, respectively.
[0089] At step 740, the method generates a “similar to different data point” for each “different data point” using the similar data generator model. Step 740 generates input data 745, for example, in the format (“similar to different data point x”, “similar to different data point y”). A similar to different data point may also be referred to as a very different data point, generated for each “different data point” generated at step 730.
[0100] At step 750, the method performs classification, similarity analysis, and reconstruction for each generated data point in (data point x, data point y, “similar data point x”, “similar data point y”, “different data point x”, “different data point y”, “similar to different data point x”, “similar to different data point y”).
[0101] Step 750 generates output data 755, including, for example, classification score, classification loss, similarity score, similarity loss, and reconstruction loss.
[0102] At step 760, the method calculates mean losses for X data points, for each model. Step 760 produces output data 765, including, for example, discriminator’s loss, similar data generator’s loss, and different data generator’s loss.
[0103] To calculate classification score and loss in output data 755, the method feeds each data point to the discriminator-classifier to detect if it is real, similar, different or similar to different. The output will be a probability distribution with four probabilities. Then, the method calculates classification 1 loss.
[0104] The method determines to which target class it belongs (if task 2 is considered). The output will be a probability distribution with a probability for each class. Then, the method calculates classification 2 loss.
[0105] The method calculates the classification losses for that data point.
[0106] If Task 2 is considered:
[0107] Discriminator classification loss = αc1 × classification loss 1 + αc2 × classification loss 2
[0108] Otherwise:
[0109] Discriminator classification loss = αc1 × classification loss 1
[0110] If data_type == Similar:
[0111] SimilarDataGenerator classification loss = Discriminator classification loss
[0112] Otherwise:
[0113] SimilarDataGenerator classification loss = 0
[0114] If data_type == Different or similar to different:
[0115] if Discriminator classification loss != 0:
[0116] DifferentDataGenerator classification loss = −Discriminator classification loss
[0117] else:
[0118] DifferentDataGenerator classification loss = −minimum Discriminator classification loss
[0119] Otherwise:
[0120] DifferentDataGenerator classification loss = 0
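The branching above reduces to a short routine; the following sketch captures only the routing logic, with illustrative names and no claim to match the disclosed implementation:

```python
def classification_losses(disc_loss: float, data_type: str,
                          min_disc_loss: float) -> dict:
    """Route the discriminator's classification loss to each generator."""
    similar_gen_loss = disc_loss if data_type == "similar" else 0.0

    if data_type in ("different", "similar_to_different"):
        # Adversarial objective: the different data generator improves
        # when the discriminator's loss is high, hence the negation.
        different_gen_loss = -disc_loss if disc_loss != 0 else -min_disc_loss
    else:
        different_gen_loss = 0.0

    return {"similar_generator": similar_gen_loss,
            "different_generator": different_gen_loss}

print(classification_losses(0.7, "different", min_disc_loss=0.01))
```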
[0121] To calculate similarity score and loss in output data 755, the method generates all possible combinations “comblist” of two elements from the list: [real data point x, real data point y, “similar data point x”, “similar data point y”, “different data point x”, “different data point y”, “similar to different data point x” and “similar to different data point y”].
[0122] For each tuple in comblist:
[0123] Feed the two data points to the discriminator (two data points of the same target class are considered similar):
[0124] Calculate the similarity measure Dw (e.g., Dw could be considered as the cosine distance).
[0125] Calculate the corresponding loss using a similarity loss function (e.g., a contrastive loss function).
[0126] weighted similarity loss = βsimilarity[data_point_1_type, data_point_2_type] × similarity loss
[0127] Discriminator similarity loss = weighted similarity loss
[0128] If one of (data_point_1, data_point_2) is generated by the similar data generator based on the other one:
[0129] SimilarDataGenerator similarity loss = weighted similarity loss
[0130] Otherwise:
[0131] SimilarDataGenerator similarity loss = 0
[0132] If one of (data_point_1, data_point_2) is generated by the different data generator based on the other one:
[0133] if weighted similarity loss != 0:
[0134] DifferentDataGenerator similarity loss = −weighted similarity loss
[0135] else:
[0136] DifferentDataGenerator similarity loss = −minimum weighted similarity loss
[0137] Otherwise:
[0138] DifferentDataGenerator similarity loss = 0
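As a sketch of the pairwise pass above (the weight lookup and the stand-in distance are assumptions), the weighted similarity losses over all pairs can be computed as:

```python
from itertools import combinations
import numpy as np

def weighted_similarity_losses(points, beta, similarity_loss_fn):
    """points maps a data-point type (e.g. "real x") to a vector; beta maps
    an unordered pair of types to its similarity weight (default 1.0)."""
    losses = {}
    for (t1, p1), (t2, p2) in combinations(points.items(), 2):
        weight = beta.get((t1, t2), beta.get((t2, t1), 1.0))
        losses[(t1, t2)] = weight * similarity_loss_fn(p1, p2)
    return losses

pairs = weighted_similarity_losses(
    {"real x": np.array([0.0, 1.0]), "similar x": np.array([0.1, 0.9])},
    beta={("real x", "similar x"): 1.0},
    similarity_loss_fn=lambda a, b: float(np.linalg.norm(a - b)),  # stand-in
)
print(pairs)
```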
[0139] To calculate reconstruction loss, for each similar data point generated based on an input data point, the method calculates:
"Similar" Data Generator' s reconstruction loss =
Figure imgf000021_0001
reconstruction loss
[0140] To calculate reconstruction loss, for each different data point generated based on an input data point, the method calculates, utilizing the different data generator:
[0141] If reconstruction loss != 0 [condition partially lost in the source image]:
[0142] “Different” DataGenerator's reconstruction loss = −reconstruction loss
[0143] Else:
[0144] DifferentDataGenerator's reconstruction loss = DifferentDataGenerator's minimum reconstruction loss
[0145] The method calculates the mean losses for X data points (X to tune). For discriminator’s loss, the method calculates:
[0146] Discriminator mean loss = αDiscriminator × MEAN over X data points [Discriminator classification loss] + βDiscriminator × MEAN over X data points [Discriminator similarity loss]
[0147] For the similar data generator's loss, the method calculates:
[0148] SimilarDataGenerator mean loss = αSimilarDataGenerator × MEAN over X data points [SimilarDataGenerator's classification loss] + βSimilarDataGenerator × MEAN over X data points [SimilarDataGenerator's similarity loss] + ΥSimilarDataGenerator × MEAN over X data points [SimilarDataGenerator's reconstruction loss]
[0149] For the different data generator's loss, the method calculates:
[0150] DifferentDataGenerator mean loss = αDifferentDataGenerator × MEAN over X data points [DifferentDataGenerator's classification loss] + βDifferentDataGenerator × MEAN over X data points [DifferentDataGenerator's similarity loss] + ΥDifferentDataGenerator × MEAN over X data points [DifferentDataGenerator's reconstruction loss]
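In code, the three weighted mean losses are simple reductions; the sketch below assumes the per-point losses have already been collected into arrays (all names are illustrative):

```python
import numpy as np

def discriminator_mean_loss(cls_losses, sim_losses, alpha, beta):
    """Weighted mean of the discriminator's per-point losses over X points."""
    return alpha * np.mean(cls_losses) + beta * np.mean(sim_losses)

def generator_mean_loss(cls_losses, sim_losses, rec_losses,
                        alpha, beta, upsilon):
    """Weighted mean loss for either data generator over X points."""
    return (alpha * np.mean(cls_losses)
            + beta * np.mean(sim_losses)
            + upsilon * np.mean(rec_losses))
```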
[0151] FIG. 8 is a flowchart illustrating a method 800 for training a GAN to detect a data drift of a machine learning model. In one embodiment, method 800 for training a GAN to detect a data drift of a machine learning model begins with step 810. At step 810, the method determines that a corrective action is triggered. When the machine learning model is no longer generating accurate predictions and needs to be corrected, a corrective action may be triggered. In some embodiments, a model drift results from changes of feature distributions and changes of targets as compared to the distributions of the features and targets of the training data previously used for training the model.
[0152] At step 820, the method retrains the model with a new dataset. In some embodiments, retraining includes at least one of finding new parameters for the monitored model, changing hyperparameters, generating new training data, or a combination thereof.
[0153] At step 830, the method classifies data points using data classes 831. Data classes 831 include, for example, real data, similar to real data, different data, and very different data. Optionally, classification may include using data classes 832, including target class(es), general class, and unknown class.
[0154] At step 840, the method rebalances the dataset, which may include techniques for pairwise-comparing data points 841, augmenting training dataset 842, and applying under-sampling technique 843. Under-sampling technique 843 includes balancing uneven datasets by keeping all of the data in the minority class and decreasing the size of the majority class.
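One concrete reading of under-sampling technique 843 is sketched below (the random selection strategy is an assumption): the minority class is kept whole and every other class is randomly shrunk to the same size:

```python
import numpy as np

def undersample(features: np.ndarray, labels: np.ndarray, seed: int = 0):
    """Balance a dataset by shrinking majority classes to the minority size."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(labels, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_keep = counts.min()

    keep = np.flatnonzero(labels == minority)      # keep all minority points
    for c in classes:
        if c == minority:
            continue
        idx = np.flatnonzero(labels == c)
        keep = np.concatenate([keep, rng.choice(idx, n_keep, replace=False)])
    rng.shuffle(keep)
    return features[keep], labels[keep]
```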
[0155] In some embodiments, architectures based on Generative Adversarial Network (GAN) and Variational Auto Encoder Generative Adversarial Network (VAEGAN) may be used for unsupervised learning or semi-supervised learning for addressing imbalanced data sets. In this case, the data points of the targeted classes, for example anomalies in anomalies detection systems, failures in systems’ failure detection systems, or patterns in pattern recognition, are very sparse. GAN models are based on optimizing a min-max function. In some embodiments, the Wasserstein loss function may be used in the case of the GAN model. For example, if x is a real data point coming from a real data distribution p(x), and z is a fake data point coming from another data distribution p(z) learned by the generator, then the optimization min-max function is depicted in equation (1):
[0156] minG maxD V(D, G) = Ex∼p(x)[log D(x)] + Ez∼p(z)[log(1 − D(G(z)))]   (1)
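Equation (1) translates directly into the two losses below (a sketch: the discriminator ascends V while the generator descends it; the eps term is an added numerical-stability assumption):

```python
import torch

def value_function_losses(d_real: torch.Tensor, d_fake: torch.Tensor):
    """d_real = D(x) and d_fake = D(G(z)), both in (0, 1)."""
    eps = 1e-8  # avoids log(0)
    d_loss = -(torch.log(d_real + eps).mean()
               + torch.log(1 - d_fake + eps).mean())  # D maximizes V
    g_loss = torch.log(1 - d_fake + eps).mean()       # G minimizes V
    return d_loss, g_loss
```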
[0157] To deal with an imbalance in data and to enhance classification performances, other similarity-based architectures may be used, such as a Siamese Network or a Siamese Generative Adversarial Network. In a Siamese Network, the input could be two data points, for which the model outputs a value, a similarity measure, that represents how similar or different these two data points are. The Siamese model is trained using a similarity-based loss function such as a contrastive loss function, depicted in equation (2):
L(Y, Dw) = (1 − Y) × ½ × Dw² + Y × ½ × (max(0, m − Dw))²   (2)
where Y = 0 if the two data points are similar and 1 otherwise, Dw = sim(Zi, Zj) is a similarity measure for the encoded representations Zi and Zj, and m is a margin parameter.
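Equation (2) fits in a few lines; in this sketch the Euclidean distance stands in for Dw (the disclosure mentions, e.g., cosine distance) and the margin m is an assumed hyperparameter:

```python
import numpy as np

def contrastive_loss(z_i: np.ndarray, z_j: np.ndarray, y: int,
                     margin: float = 1.0) -> float:
    """y = 0 for a similar pair, 1 otherwise; d_w is the pair distance."""
    d_w = float(np.linalg.norm(z_i - z_j))
    return (1 - y) * 0.5 * d_w ** 2 + y * 0.5 * max(0.0, margin - d_w) ** 2
```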
[0158] FIG. 9 is a block diagram illustrating a system 900 for a discriminator-classifier model for training a generative adversarial network to detect a data drift of a machine learning model.
[0159] In one embodiment, system 900 for training a GAN to detect a data drift of a machine learning model includes input data point 901, another data point 902, discriminator-classifier 903, similarity measure calculation 904, similarity measure 905, target classes 906, and classes 907.
[0160] In one embodiment, discriminator-classifier 903 is designed with two competing models, a generator (not shown) and a discriminator, where the generator produces fake data and tries to fool the discriminator. Discriminator-classifier 903 tries to distinguish between real data and fake data.
[0161] If the real data distribution changes, discriminator-classifier 903 can deal with that as described below and avoid labeling real data as fake data.
[0162] Discriminator-classifier 903 classifies input data as real, similar, different, or similar to different (very different), as opposed to classifying the input data as real or fake. In some embodiments, discriminator-classifier 903 performs a similarity measure calculation 904 and provides a similarity measure 905.
[0163] If at least some labels for input data 901 are provided, task 2 will include performing classification into target classes 906 including class 1 through class N, a general class, and an unknown class.
[0164] Classes 1-N are known classes where real data and similar data should have the same label (e.g., same class). For example, if system 900 is adapted for crash detection, the two target classes include class 1 which indicates a crash and class 2 which indicates no crash.
[0165] A general class includes real data points and similar data points that are without a label. An unknown class includes different data points and very different data points which are labeled as unknown class.
[0166] If input data labels are not provided, task 2 909 will not be performed. In this case, real data points and similar data points will be considered similar to each other, and different data points and very different data points will be considered similar to each other.
[0167] In some embodiments, discriminator-classifier 903 is implemented as a single model adapted to perform multiple tasks, including for example task 1 908, task 2 909 and similarity measure calculation 904. In some embodiments, discriminator-classifier 903 is implemented as multiple models, each adapted to perform at least one task, including for example, one model adapted to perform task 1 908, a second model adapted to perform task 2 909, and a similarity network model adapted to perform similarity measure calculation 904.
[0168] When model X (not shown) is in production, using at least the model X input data points 901, and based on discriminator-classifier 903 output during a predetermined or particular window of time or time interval, the data drift detector according to some embodiments determines whether the data distribution has been changed. If the data distribution has been changed, a data drift is detected, and retraining is triggered.
[0169] FIG. 10 is a block diagram illustrating a system 1000 for a different data generator model for training a generative adversarial network to detect a data drift of a machine learning model. In one embodiment, system 1000 for training a GAN to detect a data drift of a machine learning model includes real data point 1001, different data generator 1002, discriminator-classifier 1003, different data 1004, classification losses calculation 1005, similarity loss calculation 1006, and reconstruction loss calculation 1007. The dotted lines in FIG. 10 represent feedback lines and/or backpropagation.
[0170] Different data generator 1002 (configured with an AE-based, VAE-based, or any other architecture) receives the real data points 1001 as input and generates different data points 1004 as its output. Different data generator 1002 is trained with the objective of making discriminator 1003 unable to label different data 1004 as different. The different data generator 1002 and the discriminator-classifier 1003 are trained simultaneously.
[0171] As stated above, different data points 1004 will be used as inputs for training the discriminator 1003 so that it labels them as different, instead of real, similar to real, or very different.
[0172] Different data generator 1002 will work on the objective that the generated data 1004 distribution should be as different as possible from the input data 1001 distribution. This is achieved by maximizing the reconstruction loss, for example.
[0173] Generated data 1004 should be as different as possible from input data 1001, with the objective of leading the discriminator 1003 to not classify the generated data 1004 as different and unknown, as it should be classified. Similarity loss and classification loss may be used to reach this objective, for example.
[0174] FIG. 11 is a block diagram illustrating a system 1100 for a similar data generator model for training a generative adversarial network to detect a data drift of a machine learning model.
[0175] In one embodiment, system 1100 for training a GAN to detect a data drift of a machine learning model includes real/different data point 1101, similar data generator 1102, discriminator-classifier 1103, similar data 1104, classification loss calculation 1105, similarity loss calculation 1106, and reconstruction loss calculation 1107. The dotted lines in FIG. 11 represent feedback lines and/or backpropagation.
[0176] Similar data generator 1102 takes real data points/different data points 1101 as input and generates similar to real/similar to different (very different) data 1104, respectively.
[0177] Similar data 1104 are generated using similar data generator 1102, designed as an AE-based or VAE-based architecture, or any other architecture. [0178] For each similar data point 1104, similar data generator 1102 works toward the objective that the discriminator 1103 labels the data 1104 as similar to real or similar to different (very different), according to the input 1101 of similar data generator 1102. To generate such similar data, which is not identical to real data yet still similar, similar data generator 1102 tries to make the generated data as similar as possible to input data 1101, using a similarity-based loss function, for example a contrastive loss function.
[0179] Similarity loss calculation 1106 using contrastive loss takes the output of discriminator 1103 for a positive example, calculates its distance to an example of the same class, and contrasts that with its distance to negative examples. The loss is low if positive samples are encoded to similar (closer) representations and negative examples are encoded to different (farther) representations.
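One common form of the contrastive loss described above is sketched below; the margin value and the use of Euclidean distance over embeddings are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(emb1, emb2, same_class, margin=1.0):
    """Sketch of a contrastive loss: same_class is 1.0 for positive pairs
    and 0.0 for negative pairs. Positive pairs are pulled to closer
    representations; negative pairs are pushed beyond the margin."""
    d = F.pairwise_distance(emb1, emb2)                    # Euclidean distance
    pos = same_class * d.pow(2)                            # pull positives together
    neg = (1.0 - same_class) * F.relu(margin - d).pow(2)   # push negatives apart
    return 0.5 * (pos + neg).mean()
```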
[0180] The distribution of the generated data 1104 should be as close as possible to the distribution of the input data 1101, which may be achieved, for example, by minimizing a reconstruction loss.
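Combining paragraphs [0178]-[0180], the similar-data-generator objective could be sketched as a weighted sum of a minimized reconstruction loss with the similarity and classification losses; the weights and the mean-squared reconstruction term are illustrative assumptions.

```python
import torch.nn.functional as F

def similar_generator_loss(x, x_gen, similarity_loss, classification_loss,
                           w_recon=1.0, w_sim=1.0, w_cls=1.0):
    """Sketch: keep the generated distribution close to the input
    distribution (minimized reconstruction loss) while also satisfying
    the similarity and classification objectives."""
    recon = F.mse_loss(x_gen, x)
    return w_recon * recon + w_sim * similarity_loss + w_cls * classification_loss
```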
[0181] FIG. 12 is a block diagram illustrating an architecture for similar and different data generators according to some embodiments. [0182] In this example, variational autoencoder (VAE) 1200 is a black-box inference model using a variational autoencoder architecture. In this example, VAE 1200 includes encoder 1220 and decoder 1230. Input 1201 includes real data points from the training data. Noise 1202 and 1206 include randomly-generated noise data. [0183] Encoder 1220 receives input data points 1201 and noise data points 1202. When input 1201 and noise 1202 are applied to a learned encoding function 1203, the encoder 1220 produces latent code z 1204.
[0184] Decoder 1230 (which can be a neural network) takes latent code z as input, applies function 1205, adds noise data points 1206 (which can include noise from a normal distribution), and applies function 1207 (which can include addition or multiplication of z and the noise data) to produce reconstructed output 1208.
[0185] Noise data points 1202 and 1206 are included as additional input to the inference model 1200 instead of adding them at the end, thereby allowing the inference network to learn complex probability distributions.
[0186] The reconstruction loss, which can represent the difference between the ground truth and other data generated by VAE 1200, is used to generate, via VAE 1200, data that is similar to or different from the real data distribution. Information about the data distribution is stored in two places: code z 1204, and the weights of the network that transform code z 1204 into reconstructed x 1208.
[0187] Variational autoencoders provide a principled framework for learning deep latent-variable models and corresponding inference models. VAEs include an encoder that produces a mean code μ and a standard deviation code σ. The actual code is then sampled randomly from, for example, a Gaussian distribution with mean μ and standard deviation σ. It is understood that other distributions may also be used for sampling. The VAEs also include a decoder that takes the actual code and decodes it normally to match outputs to inputs. Thus, the encoder (or recognition network) converts the inputs to an internal representation (code) and the decoder (or generative network) converts the internal representation (code) to the outputs. [0188] Latent code z 1204 is learned using a self-supervised learning principle, in which first a discrete autoencoder (encoder 1220) is trained on the output sequences, and then the resulting latent codes 1204 are used as intermediate targets for the end-to-end sequence prediction task.
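The encoder/decoder arrangement described above could be sketched as a standard reparameterized VAE, as shown below. This is a simplified illustration: it samples the code from a Gaussian via reparameterization rather than reproducing the exact noise-injection points 1202 and 1206 of FIG. 12, and the layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    """Minimal VAE sketch: the encoder produces a mean code and a standard
    deviation code, code z is sampled from a Gaussian, and the decoder
    reconstructs the input from z."""
    def __init__(self, in_dim=64, latent_dim=8):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU())
        self.mu = nn.Linear(32, latent_dim)         # mean code
        self.log_sigma = nn.Linear(32, latent_dim)  # log standard-deviation code
        self.dec = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(),
                                 nn.Linear(32, in_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, sigma = self.mu(h), self.log_sigma(h).exp()
        z = mu + sigma * torch.randn_like(sigma)    # sample code z
        return self.dec(z), mu, sigma
```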
[0189] Data generators, such as the similar and different data generators according to some embodiments, can have a VAE architecture, which helps in generating more data. Variational autoencoders are built using machine learning data architectures, such as neural networks, and, for example, can include encoders and decoders which are trained over a number of epochs to generate outputs that can match or represent a similar probability distribution as a set of input data samples. The training can be based on various loss functions, and their minimization across training epochs. The VAE can learn the parameters of a probability distribution representing the input data and, accordingly, can be used to generate new input data samples.
[0190] For example, for a given data point x 1201, the generator (VAE 1200) will learn latent z 1204 through its encoder 1220 and then generate a new data point 1208 using the decoder 1230. Based on the reconstruction loss between the input x 1201 and the generated output x 1208, the generator will learn how to generate data points similar to the input (if the reconstruction loss is minimized), different from the input (if the reconstruction loss is maximized), or tailored to a specific purpose, such as generating a data point that will not be classified (by a classifier) as a generated data point (which involves the reconstruction loss, classification loss and similarity loss).
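Using the VAE sketch above, one training step for either generator role could look like the following; the standard Gaussian KL regularizer and the mode switch are assumptions of this sketch rather than requirements of the disclosure.

```python
import torch.nn.functional as F

def generator_step(vae, x, mode="similar"):
    """Sketch: minimize the reconstruction loss to generate data similar
    to the input, or maximize it (negated term) to generate different
    data. Assumes the VAE sketch above, returning (x_gen, mu, sigma)."""
    x_gen, mu, sigma = vae(x)
    recon = F.mse_loss(x_gen, x)
    # Standard Gaussian KL divergence regularizer.
    kl = 0.5 * (sigma.pow(2) + mu.pow(2) - 2.0 * sigma.log() - 1.0).sum(-1).mean()
    loss = (recon if mode == "similar" else -recon) + kl
    return loss, x_gen
```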
[0191] FIG. 13 is a block diagram illustrating an architecture for learning a similarity metric according to some embodiments.
[0192] In this example, Siamese model 1300 is a similarity detection model using a Siamese neural network architecture. Siamese model 1300 includes inputs 1301a and 1301b, neural networks 1302a and 1302b, weights 1303, neural network outputs 1304a and 1304b, distance 1305 (between 1304a and 1304b), and output 1306. [0193] Weights 1303 represent a shared parameter vector that is subject to learning. In some embodiments, a single model can be used twice, with one neural network, thereby reducing the need for saving shared weights 1303.
[0194] 1304a and 1304b represent the output of the neural network after receiving inputs X1 1301a and X2 1301b, and can represent an encoding of the inputs.
[0195] The inputs are two data points, input 1301a and input 1301b, for which Siamese model 1300 outputs a value 1306, a similarity measure representing how similar or different these two data point inputs 1301a and 1301b are.
[0196] In this example, a similarity measure or score between two data points may be determined using a Siamese neural network. A data point tuple is passed through the Siamese network to obtain a similarity score. While classification helps in mapping a data point to the class that the data point belongs to, the similarity score helps in measuring how different and/or similar two data points are. Having the classifier learn the similarity between its inputs alongside their classifications results in Siamese model 1300 learning and encoding the latent features of the data better, which leads to better classification.
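A Siamese similarity model of the kind described above could be sketched as follows; the encoder sizes and the mapping from distance to a similarity score are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseNet(nn.Module):
    """Sketch of Siamese model 1300: one shared encoder applied to both
    inputs, with the distance between the two encodings mapped to a
    similarity score."""
    def __init__(self, in_dim=64, emb_dim=16):
        super().__init__()
        # A single network used twice, so weights 1303 are shared by construction.
        self.encoder = nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU(),
                                     nn.Linear(32, emb_dim))

    def forward(self, x1, x2):
        e1, e2 = self.encoder(x1), self.encoder(x2)  # outputs 1304a, 1304b
        d = F.pairwise_distance(e1, e2)              # distance 1305
        return torch.exp(-d)                         # output 1306: similarity in (0, 1]
```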
[0197] FIG. 14 is a block diagram illustrating a system 1400 for a data drift detector model for training a generative adversarial network to detect a data drift of a machine learning model.
[0198] In one embodiment, system 1400 for training a generative adversarial network to detect a data drift of a machine learning model includes model X implemented in cloud 1410 and data drift detector 1420.
[0199] Cloud computing may be integrated with networks for training a generative adversarial network to detect a data drift of a machine learning model to facilitate resource delivery. Cloud computing refers to an implementation where resources (e.g., processing power, data storage, network logic, protocols, algorithm logic, etc.) are provided to a local client on an on-demand basis, usually by means of the Internet. Resource-intensive tasks (e.g., machine learning, monitoring, corrective action) are performed on the cloud systems.
[0200] In some embodiments, during training, models may reside on different servers (e.g., server 1405), as training may need a significant amount of memory and computation resources. In production, the discriminator and the similar data generator could run on different servers for better resource allocation.
[0201] Although not shown, system 1400 may be implemented utilizing edge computing. Edge computing extends cloud computing and services to the edge of a network, for example, using computing nodes deployed inside access networks, mobile devices, or IoT end devices such as sensors and actuators. Edge computing provides data, computing, storage, and application services at the network edge using methods similar to cloud computing in remote data centers. In this example, some or all components, in whole or in part, may be implemented in the edge nodes, utilizing edge gateways for performing the resource-intensive tasks. The edge nodes and gateways are intermediary to the cloud 1410.
[0202] As used herein, cloud or edge computing can be applied to any device described herein, or components thereof, and relates to an implementation in which at least a portion of the functionality is implemented as one or more cloud or edge components.
[0203] FIG. 15 illustrates an exemplary computer system configurable by a computer program product to carry out embodiments of the present disclosure.
[0204] In the example, computer system 1500 may provide one or more of the components for training a generative adversarial network to detect a data drift of a machine learning model. Computer system 1500 executes instruction code contained in a computer program product 1560 (which may, for example, be part of a system for training a generative adversarial network to detect a data drift of a machine learning model, as discussed herein). Computer program product 1560 comprises executable code in an electronically readable medium that may instruct one or more computers such as computer system 1500 to perform processing that accomplishes the exemplary method steps performed by the embodiments referenced herein. The electronically readable medium may be any non-transitory medium that stores information electronically and may be accessed locally or remotely, for example, via a network connection. In alternative embodiments, the medium may be transitory. The medium may include a plurality of geographically dispersed media, each configured to store different parts of the executable code at different locations or at different times. The executable instruction code in an electronically readable medium directs the illustrated computer system 1500 to carry out various exemplary tasks described herein. The executable code for directing the carrying out of tasks described herein would typically be realized in software. However, it will be appreciated by those skilled in the art that computers or other electronic devices might utilize code realized in hardware to perform many or all of the identified tasks without departing from the present disclosure. Those skilled in the art will understand that many variations on executable code may be found that implement exemplary methods within the spirit and the scope of the present disclosure.
[0205] The code or a copy of the code contained in computer program product 1560 may reside in one or more persistent storage media (not separately shown) communicatively coupled to computer system 1500 for loading and storage in persistent storage device 1570 and/or memory 1510 for execution by processor 1520. Computer system 1500 also includes I/O subsystem 1530 and peripheral devices 1540. I/O subsystem 1530, peripheral devices 1540, processor 1520, memory 1510, and persistent storage device 1570 are coupled via bus 1550. Like persistent storage device 1570 and any other persistent storage that might contain computer program product 1560, memory 1510 is a non-transitory medium (even if implemented as a typical volatile computer memory device). Moreover, those skilled in the art will appreciate that in addition to storing computer program product 1560 for carrying out the processing described herein, memory 1510 and/or persistent storage device 1570 may be configured to store the various data elements referenced and illustrated herein.
[0206] Those skilled in the art will appreciate computer system 1500 illustrates just one example of a system in which a computer program product in accordance with an embodiment of the present disclosure may be implemented. To cite but one example of an alternative embodiment, storage and execution of instructions contained in a computer program product in accordance with an embodiment of the present disclosure may be distributed over multiple computers, such as, for example, over the computers of a distributed computing network.
[0207] FIG. 16 is a block diagram illustrating a virtualization environment 1600 in which functions implemented by some embodiments may be virtualized. In the present context, virtualizing means creating virtual versions of apparatuses or devices which may include virtualizing hardware platforms, storage devices and networking resources. As used herein, virtualization can be applied to any device described herein, or components thereof, and relates to an implementation in which at least a portion of the functionality is implemented as one or more virtual components.
[0208] Some or all of the functions described herein, including, for example, data drift detection, classification, loss calculation, training a GAN network, etc., may be implemented as virtual components executed by one or more virtual machines (VMs), resulting in a decrease in time delay and energy consumption, and an increase in task accuracy. The one or more VMs may be implemented in one or more virtual environments 1600 hosted by one or more hardware nodes, such as a hardware computing device that operates as a network node, user equipment (UE), core network node, host, web server, application server, virtual server or the like. [0209] Further, in embodiments in which the virtual node does not require radio connectivity (e.g., a core network node or host), the node may be entirely virtualized. In some embodiments, training data drift detecting models may be implemented as virtual components executed by one or more virtual machines (VMs) implemented in one or more virtual environments 1600 hosted by one or more hardware nodes.
[0210] In some embodiments, central units, distributed nodes, and the data drift detecting model may be implemented as virtual components executed by one or more virtual machines (VMs) implemented in one or more virtual environments 1600 hosted by one or more hardware nodes.
[0211] Applications 1602 (which may alternatively be called software instances, virtual appliances, network functions, virtual nodes, virtual network functions, etc.) are run in the virtualization environment 1600 to implement some of the features, functions, and/or benefits of some of the embodiments disclosed herein, including, for example, systems and methods for data drift detection, classification, loss calculation, training a GAN network, etc.
[0212] Hardware 1604 includes processing circuitry, memory that stores software and/or instructions executable by hardware processing circuitry, and/or other hardware devices as described herein, such as a network interface, input/output interface, and so forth. Software may be executed by the processing circuitry to instantiate one or more virtualization layers 1606 (also referred to as hypervisors or virtual machine monitors (VMMs)), provide VMs 1608a and 1608b (one or more of which may be generally referred to as VMs 1608), and/or perform any of the functions, features and/or benefits described in relation with some embodiments described herein, including, for example, systems and methods for data drift detection, classification, loss calculation, training a GAN network, etc. The virtualization layer 1606 may present a virtual operating platform that appears like networking hardware to the VMs 1608.
[0213] The VMs 1608 comprise virtual processing, virtual memory, virtual networking or interface and virtual storage, and may be run by a corresponding virtualization layer 1606. Different embodiments of the instance of a virtual appliance 1602 may be implemented on one or more of VMs 1608, and the implementations may be made in different ways. Virtualization of the hardware is in some contexts referred to as network function virtualization (NFV). NFV may be used to consolidate many network equipment types onto industry standard high volume server hardware, physical switches, and physical storage, which can be located in data centers, and customer premise equipment. [0214] In the context of NFV, a VM 1608 may be a software implementation of a physical machine that runs programs as if they were executing on a physical, non- virtualized machine. Each of the VMs 1608, and that part of hardware 1604 that executes that VM, be it hardware dedicated to that VM and/or hardware shared by that VM with others of the VMs, forms separate virtual network elements. Still in the context of NFV, a virtual network function is responsible for handling specific network functions that run in one or more VMs 1608 on top of the hardware 1604 and corresponds to the application 1602.
[0215] Hardware 1604 may be implemented in a standalone network node with generic or specific components. Hardware 1604 may implement some functions via virtualization. Alternatively, hardware 1604 may be part of a larger cluster of hardware (e.g., such as in a data center or customer premises equipment (CPE)) where many hardware nodes work together and are managed via management and orchestration 1610, which, among others, oversees lifecycle management of applications 1602. In some embodiments, hardware 1604 is coupled to one or more radio units that each include one or more transmitters and one or more receivers that may be coupled to one or more antennas. Radio units may communicate directly with other hardware nodes via one or more appropriate network interfaces and may be used in combination with the virtual components to provide a virtual node with radio capabilities, such as a radio access node or a base station. In some embodiments, some signaling can be provided with the use of a control system 1612 which may alternatively be used for communication between hardware nodes and radio units.
[0216] Although some of the various embodiments presented herein constitute a single combination of inventive elements, it should be appreciated that the inventive subject matter is considered to include all possible combinations of the disclosed elements. As such, if one embodiment comprises elements A, B, and C, and another embodiment comprises elements B and D, then the inventive subject matter is also considered to include other remaining combinations of A, B, C, or D, even if not explicitly discussed herein. Further, the transitional term “comprising” means to have as parts or members, or to be those parts or members. As used herein, the transitional term “comprising” is inclusive or open-ended and does not exclude additional, unrecited elements or method steps.
[0217] Any process described herein may be performed in any order and may omit any of the steps in the process. Processes may also be combined with other processes or steps of other processes. Although steps or operations may be described as a sequential process, some of the steps or operations may in fact be performed in parallel, concurrently, and/or in a distributed environment, and with program code stored locally or remotely for access by single or multi-processor machines. In addition, in some embodiments the order of steps or operations may be rearranged without departing from the spirit of the disclosed subject matter.
[0218] Throughout the discussion herein, numerous references are made regarding clouds, servers, services, devices, platforms, frameworks, cyber physical systems, or other systems formed from computing devices. It should be appreciated that the use of such terms is deemed to represent at least one or more computing devices having at least one processor (e.g., application-specific integrated circuit (ASIC), field-programmable gate array (FPGA), digital signal processor (DSP), x86, reduced instruction set computer architecture (ARM), ColdFire, graphics processing unit (GPU), multi-core processors, etc.) configured to execute software instructions stored on a computer readable tangible, non-transitory medium (e.g., hard drive, solid state drive, random-access memory (RAM), flash, read only memory (ROM), etc.), among other components. For example, a server can include one or more computers operating as a web server, database server, or other type of computer server in a manner to fulfill described roles, responsibilities, or functions. One should further appreciate the disclosed computer-based algorithms, processes, methods, or other types of instruction sets can be embodied as a computer program product comprising a non-transitory, tangible computer readable medium storing the instructions that cause a processor to execute the disclosed steps. The various servers, systems, databases, or interfaces can exchange data using standardized protocols or algorithms, possibly based on Hypertext Transfer Protocol (HTTP), secure Hypertext Transfer Protocol (HTTPS), Advanced Encryption Standard (AES), public-private key exchanges, web service Application programming interfaces (APIs), known financial transaction protocols, or other electronic information exchanging methods. Data exchanges can be conducted over a packet- switched network, a circuit-switched network, the Internet, Local area network (LAN), wide area network (WAN), virtual private network (VPN), or other type of network.
[0219] A system, server, device, model, or other computing element according to some embodiments is configured to perform or execute functions on data in a memory, where the meaning of “configured to” or “programmed to” is defined as one or more processors or cores of the computing element being programmed by a set of software instructions stored in the memory of the computing element to execute the set of functions on target data or data objects stored in the memory.
[0220] It should be noted that any language directed to a computing device should be read to include any suitable combination of computing devices, including servers, interfaces, systems, databases, agents, peers, engines, controllers, modules, or other types of computing devices operating individually or collectively. One should appreciate that the computing devices comprise a processor configured to execute software instructions stored on a tangible, non-transitory computer readable storage medium (e.g., hard drive, field programmable gate array (FPGA), programmable logic array (PLA), solid state drive, RAM, flash, ROM, etc.). The software instructions configure or program the computing device to provide the roles, responsibilities, or other functionality as discussed herein with respect to the disclosed apparatus. Further, the disclosed technologies can be embodied as a computer program product that includes a non-transitory computer readable medium storing the software instructions that cause a processor to execute the disclosed steps associated with implementations of computer-based algorithms, processes, methods, or other instructions. In some embodiments, the various servers, systems, databases, or interfaces exchange data using standardized protocols or algorithms, possibly based on HTTP, HTTPS, AES, public-private key exchanges, web service APIs, known financial transaction protocols, or other electronic information exchanging methods. Data exchanges among devices can be conducted over a packet-switched network, the Internet, LAN, WAN, VPN, or other type of packet switched network; a circuit switched network; cell switched network; or other type of network.
[0221] Systems, devices, and methods described herein may be implemented using a computer program product tangibly embodied in an information carrier, e.g., in a non-transitory machine-readable storage device, for execution by a programmable processor; and the method steps described herein, including for example one or more of the steps of FIGs. 1, 4, 7 and 8 may be implemented using one or more computer programs that are executable by such a processor. A computer program is a set of computer program instructions that can be used, directly or indirectly, in a computer to perform a certain activity or bring about a certain result. A computer program can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
[0222] Although the computing devices described herein (e.g., network nodes, cloud-based models, virtual machines) may include the illustrated combination of hardware components, other embodiments may comprise computing devices with different combinations of components. It is to be understood that these computing devices may comprise any suitable combination of hardware and/or software needed to perform the tasks, features, functions and methods disclosed herein. Determining, calculating, obtaining or similar operations described herein may be performed by processing circuitry, which may process information by, for example, converting the obtained information into other information, comparing the obtained information or converted information to information stored in the network node, and/or performing one or more operations based on the obtained information or converted information, and as a result of said processing making a determination. Moreover, while components are depicted as single boxes located within a larger box, or nested within multiple boxes, in practice, computing devices may comprise multiple different physical components that make up a single illustrated component, and functionality may be partitioned between separate components. For example, a communication interface may be configured to include any of the components described herein, and/or the functionality of the components may be partitioned between the processing circuitry and the communication interface. In another example, non-computationally intensive functions of any of such components may be implemented in software or firmware and computationally intensive functions may be implemented in hardware.
[0223] In certain embodiments, some or all of the functionality described herein may be provided by processing circuitry executing instructions stored in memory, which in certain embodiments may be a computer program product in the form of a non-transitory computer-readable storage medium. In alternative embodiments, some or all of the functionality may be provided by the processing circuitry without executing instructions stored on a separate or discrete device-readable storage medium, such as in a hard-wired manner. In any of those particular embodiments, whether executing instructions stored on a non-transitory computer-readable storage medium or not, the processing circuitry can be configured to perform the described functionality. The benefits provided by such functionality are not limited to the processing circuitry alone or to other components of the computing device but are enjoyed by the computing device as a whole, and/or by end users and a wireless network generally.
[0224] The disclosed technology is designed to be compatible with and operable by any computing device, including, for example, a desktop computer, a mobile device, a smart phone, an Internet of Things device, an Augmented Reality or Virtual Reality device, a personal digital assistant (PDA), a gaming console or device, a playback appliance, a wearable terminal device, a mobile station, a tablet, a laptop, or a combination thereof.
[0225] While some examples described herein may refer to functions performed by given actors such as “users,” “systems,” and/or other entities, it should be understood that this is for purposes of explanation only. The claims should not be interpreted to require action by any such example actor unless explicitly required by the language of the claims themselves.
[0226] Many of the details, dimensions, angles and other features shown in the Figures are merely illustrative of particular embodiments of the disclosed technology. Accordingly, other embodiments can have other details, dimensions, angles and features without departing from the spirit or scope of the disclosure. In addition, those of ordinary skill in the art will appreciate that further embodiments of the various disclosed technologies can be practiced without several of the details described above.
[0227] Throughout the specification and claims, the following terms take the meanings explicitly associated herein, unless the context clearly dictates otherwise: [0228] The phrase “in one embodiment” as used herein does not necessarily refer to the same embodiment, though it may. Thus, as described below, various embodiments of the disclosure may be readily combined, without departing from the scope or spirit of the disclosure.
[0229] As used herein, the term “or” is an inclusive “or” operator and is equivalent to the term “and/or,” unless the context clearly dictates otherwise.
[0230] The term “based on” is not exclusive and allows for being based on additional factors not described unless the context clearly dictates otherwise.
[0231] The term “training” herein is not necessarily limited to a supervised, unsupervised, or semi-supervised approach. Supervised machine learning is the machine learning task of inferring a function from supervised (labeled) training data. Unsupervised learning is the machine learning task of finding hidden structure (a function) in unlabeled data. Semi-supervised machine learning includes training with labeled and unlabeled data. The term “backpropagation” refers to updating the weights of the nodes constituting the learning network according to a calculated loss.
[0232] As used herein, and unless the context dictates otherwise, calculating a loss is not limited to a specific scheme, and for example, hinge loss, square loss, Softmax loss, cross-entropy loss, absolute loss, insensitive loss, or the like may be used.
[0233] As used herein, and unless the context dictates otherwise, the term “coupled to” is intended to include both direct coupling (in which two elements that are coupled to each other contact each other) and indirect coupling (in which at least one additional element is located between the two elements). Therefore, the terms “coupled to” and “coupled with” are used synonymously. Within the context of a networked environment where two or more components or devices are able to exchange data, the terms “coupled to” and “coupled with” are also used to mean “communicatively coupled with”, possibly via one or more intermediary devices.
[0234] In addition, throughout the specification, the meaning of “a”, “an”, and “the” includes plural references, and the meaning of “in” includes “in” and “on”.
[0235] While certain illustrative embodiments are described herein, those embodiments are presented by way of example only, and not limitation. While the embodiments have been particularly shown and described, it will be understood that various changes in form and detail may be made. Although various embodiments have been described as having features and/or combinations of components, other embodiments are possible having a combination of any features and/or components from any of embodiments as discussed above.


CLAIMS

What is claimed is:
1. A computer-implemented method for training a generative adversarial network to detect a data drift of a machine learning model, the method comprising:
    using supervised learning to generate a plurality of data points based at least on training data of the machine learning model;
    using supervised learning to classify each data point of the plurality of data points and generate a classification probability distribution for the plurality of data points; and
    detecting the data drift of the machine learning model based on a deviation in the classification probability distribution, wherein the detected data drift enables triggering a corrective action.
2. The method of claim 1, wherein detecting the data drift further comprises calculating a data drift score based on the classification probability distribution.
3. The method of any of claims 1-2, wherein generating the plurality of data points further comprises generating a different data point based at least on maximizing the reconstruction loss.
4. The method of any of claims 1-2, wherein generating the plurality of data points further comprises generating a similar data point based at least on minimizing the reconstruction loss.
5. The method of any of claims 1-4, wherein detecting the data drift further comprises:
    calculating a data drift score; and
    determining whether the data drift score is above a predetermined threshold within a predetermined period of time.
6. The method of any of claims 1-5, wherein the corrective action includes at least one of retraining of the machine learning model and ensemble classification.
7. The method of any of claims 1-6, further comprising using unsupervised learning to enable anomalies detection based on the classification probability distribution.
8. The method of any of claims 1-7, wherein using supervised learning to generate a classification probability distribution comprises training a discriminator-classifier generative adversarial network adapted to generate the classification probability distribution.
9. The method of claim 8, wherein training the discriminator-classifier further comprises generating a data point tuple based on the plurality of data points.
10. The method of claim 9, wherein training the discriminator-classifier further comprises calculating a classification score and a classification loss for each data point in the plurality of data points.
11. The method of claim 9, wherein training the discriminator-classifier further comprises calculating a similarity score and a similarity loss for two data points in the plurality of data points.
12. The method of claim 9, wherein training the discriminator-classifier further comprises calculating classification values and similarity values.
13. The method of claim 12, wherein training the discriminator-classifier further comprises calculating mean losses for the plurality of data points based on the classification values and the similarity values.
14. The method of any of claims 1-13, wherein generating the plurality of data points further comprises generating a similar to different data point based at least on minimizing the reconstruction loss.
15. The method of any of claims 1-14, further comprising preprocessing the training data of the machine learning model, wherein preprocessing comprises rebalancing a set of tuples of the training data.
16. The method of any of claims 1-15, further comprising training a discriminator-classifier generative adversarial network adapted to generate the classification probability distribution simultaneously with training at least one data generator adapted to generate the plurality of data points.
17. The method of any of claims 1-16, further comprising training at least one data generator adapted to generate the plurality of data points.
18. The method of claim 17, wherein training the at least one data generator further comprises calculating similarity values.
19. The method of claim 17, wherein training the at least one data generator further comprises calculating a reconstruction loss for each data point in the plurality of data points.
20. The method of any of claims 1-19, further comprising training at least one data generator adapted to generate the plurality of data points and calculating mean losses for the plurality of data points based on at least one of classification values, similarity values, and a reconstruction value.
21. A non-transitory computer readable medium or media containing instructions for executing a method for training a generative adversarial network to detect a data drift of a machine learning model, the method comprising:
    using supervised learning to generate a plurality of data points based at least on training data of the machine learning model;
    using supervised learning to classify each data point of the plurality of data points and generate a classification probability distribution for the plurality of data points; and
    detecting the data drift of the machine learning model based on a deviation in the classification probability distribution, wherein the detected data drift enables triggering a corrective action.
22. The non-transitory computer readable medium or media of claim 21, wherein detecting the data drift further comprises calculating a data drift score based on the classification probability distribution.
23. The non-transitory computer readable medium or media of any of claims 21-22, wherein generating the plurality of data points further comprises generating a different data point based at least on maximizing the reconstruction loss.
24. The non-transitory computer readable medium or media of any of claims 21-22, wherein generating the plurality of data points further comprises generating a similar data point based at least on minimizing the reconstruction loss.
25. The non-transitory computer readable medium or media of any of claims 21-24, wherein detecting the data drift further comprises:
    calculating a data drift score; and
    determining whether the data drift score is above a predetermined threshold within a predetermined period of time.
26. The non-transitory computer readable medium or media of any of claims 21-25, wherein the corrective action includes at least one of retraining of the machine learning model and ensemble classification.
27. The non-transitory computer readable medium or media of any of claims 21-26, further comprising using unsupervised learning to enable anomalies detection based on the classification probability distribution.
28. The non-transitory computer readable medium or media of any of claims 21-27, wherein using supervised learning to generate a classification probability distribution further comprises training a discriminator-classifier generative adversarial network adapted to generate the classification probability distribution.
29. The non-transitory computer readable medium or media of claim 28, wherein training the discriminator-classifier further comprises generating a data point tuple based on the plurality of data points.
30. The non-transitory computer readable medium or media of claim 29, wherein training the discriminator-classifier further comprises calculating a classification score and a classification loss for each data point in the plurality of data points.
31. The non-transitory computer readable medium or media of claim 29, wherein training the discriminator-classifier further comprises calculating a similarity score and a similarity loss for two data points in the plurality of data points.
32. The non-transitory computer readable medium or media of claim 29, wherein training the discriminator-classifier further comprises calculating classification values and similarity values.
33. The non-transitory computer readable medium or media of claim 32, wherein training the discriminator-classifier further comprises calculating mean losses for the plurality of data points based on the classification values and the similarity values.
34. The non-transitory computer readable medium or media of any of claims 21-33, wherein generating the plurality of data points further comprises generating a similar to different data point based at least on minimizing the reconstruction loss.
35. The non-transitory computer readable medium or media of any of claims 21-34, further comprising preprocessing the training data of the machine learning model, wherein preprocessing comprises rebalancing a set of tuples of the training data.
36. The non-transitory computer readable medium or media of any of claims 21-35, further comprising training a discriminator-classifier generative adversarial network adapted to generate the classification probability distribution simultaneously with training at least one data generator adapted to generate the plurality of data points.
37. The non-transitory computer readable medium or media of any of claims 21-36, further comprising training at least one data generator adapted to generate the plurality of data points.
38. The non-transitory computer readable medium or media of claim 37, wherein training the at least one data generator further comprises calculating similarity values.
39. The non-transitory computer readable medium or media of claim 37, wherein training the at least one data generator further comprises calculating a reconstruction loss for each data point in the plurality of data points.
40. The non-transitory computer readable medium or media of any of claims 21-39, further comprising training at least one data generator adapted to generate the plurality of data points and calculating mean losses for the plurality of data points based on at least one of classification values, similarity values and a reconstruction value.
41. A system for training a generative adversarial network to detect a data drift of a machine learning model, the system comprising:
    a database connected to a network, configured for receiving and storing training data;
    one or more processors and memory, said memory containing instructions executable by said one or more processors whereby the system is operative to:
        use supervised learning to generate a plurality of data points based at least on training data of the machine learning model;
        use supervised learning to classify each data point of the plurality of data points and generate a classification probability distribution for the plurality of data points; and
        detect the data drift of the machine learning model based on a deviation in the classification probability distribution, wherein the detected data drift enables triggering a corrective action.
42. The system of claim 41, wherein detecting the data drift further comprises calculating a data drift score based on the classification probability distribution.
43. The system of any of claims 41 and 42, wherein generating the plurality of data points further comprises generating a different data point based at least on maximizing the reconstruction loss.
44. The system of any of claims 41-43, wherein generating the plurality of data points further comprises generating a similar data point based at least on minimizing the reconstruction loss.
45. The system of any of claims 41-44, wherein detecting the data drift further comprises:
    calculating a data drift score; and
    determining whether the data drift score is above a predetermined threshold within a predetermined period of time.
46. The system of any of claims 41-45, wherein the corrective action includes at least one of retraining of the machine learning model and ensemble classification.
47. The system of any of claims 41-46, further comprising using unsupervised learning to enable anomalies detection based on the classification probability distribution.
48. The system of any of claims 41-47, wherein using supervised learning to generate a classification probability distribution further comprises training a discriminator-classifier generative adversarial network adapted to generate the classification probability distribution.
49. The system of claim 48, wherein training the discriminator-classifier further comprises generating a data point tuple based on the plurality of data points.
50. The system of claim 49, wherein training the discriminator-classifier further comprises calculating a classification score and a classification loss for each data point in the plurality of data points.
51. The system of claim 49, wherein training the discriminator-classifier further comprises calculating a similarity score and a similarity loss for two data points in the plurality of data points.
52. The system of claim 49, wherein training the discriminator-classifier further comprises calculating classification values and similarity values.
53. The system of claim 52, wherein training the discriminator-classifier further comprises calculating mean losses for the plurality of data points based on the classification values and the similarity values.
54. The system of any of claims 41-53, wherein generating the plurality of data points further comprises generating a similar to different data point based at least on minimizing the reconstruction loss.
55. The system of any of claims 41-54, further comprising preprocessing the training data of the machine learning model, wherein preprocessing comprises rebalancing a set of tuples of the training data.
56. The system of any of claims 41-55, further comprising training a discriminator-classifier generative adversarial network adapted to generate the classification probability distribution simultaneously with training at least one data generator adapted to generate the plurality of data points.
57. The system of any of claims 41-56, further comprising training at least one data generator adapted to generate the plurality of data points.
58. The system of claim 57, wherein training the at least one data generator further comprises calculating similarity values.
59. The system of claim 57, wherein training the at least one data generator further comprises calculating a reconstruction loss for each data point in the plurality of data points.
60. The system of any of claims 41-59, further comprising training at least one data generator adapted to generate the plurality of data points and calculating mean losses for the plurality of data points based on at least one of classification values, similarity values and a reconstruction value.