US11599746B2 - Label shift detection and adjustment in predictive modeling - Google Patents
- Publication number: US11599746B2
- Authority
- US
- United States
- Prior art keywords
- data
- observed
- time
- training
- shift
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active, expires
Classifications
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/211—Selection of the most significant subset of features
- G06F18/2113—Selection of the most significant subset of features by ranking or filtering the set of features, e.g. using a measure of variance or of feature cross-correlation
- G06F18/2193—Validation; Performance evaluation; Active pattern learning techniques based on specific statistical tests
- G06F18/285—Selection of pattern recognition techniques, e.g. of classifiers in a multi-classifier system
- G06K9/6256
- G06K9/6227
- G06K9/623
- G06N20/00—Machine learning
- G06V10/751—Comparing pixel values or logical combinations thereof, or feature values having positional relevance, e.g. template matching
Definitions
- the present disclosure relates generally to machine learning and, more particularly, to automatically detecting shift in output labels and automatically adjusting labels in training data based on the detection.
- Machine learning is the study and construction of algorithms that can learn from, and make predictions on, data. Such algorithms operate by building a model from inputs in order to make data-driven predictions or decisions. Thus, a machine learning technique is used to generate a statistical model that is trained based on a history of attribute values associated with one or more objects. The statistical model is trained based on multiple attributes described herein. In machine learning parlance, such attributes are referred to as “features.” To generate and train a statistical model, a set of features is specified and a set of training data is identified.
- the accuracy of a machine-learned model largely depends on the quality and quantity of the training data. For example, if there are not enough training instances in the training data, then the model will not be able to make accurate predictions for inputs that are similar (but not identical) to the training instances. As another example, if the training instances do not reflect real-world scenarios, then the resulting model will not be able to make accurate predictions.
- a cloud service that monitors performance of, and resource consumption by, cloud applications may implement a model to predict how many computer resources of one or more types to allocate to each cloud application based on the cloud application's performance.
- Cloud application performance may change over time in response to changes in how the cloud application is used (e.g., what features are being leveraged), how frequently it is being relied upon by other applications and/or users, and the number of machines that are available for the cloud application to execute on.
- Labels refer to not only the labels of training instances, but also to real-world results, irrespective of the output (predictions) of a machine-learned model.
- Input labels are labels that are part of the training data while output labels are actual labels as observed in historical results.
- For example, a machine-learned model is trained to predict whether an entity will perform a particular action in response to one or more events occurring. About twenty entities typically perform the particular action each week, but only ten entities actually performed that action in the most recent week. Thus, there is an (output) label shift from twenty to ten.
- a shift in label distribution results in a decrease in the accuracy of the machine-learned model.
- a refresh of the machine-learned model is sufficient.
- a refresh involves generating new training instances based on recent data and retraining the machine-learned model based on the new training instances and older training instances.
- model refreshment might not work well because a dramatic change in label distribution probably indicates a large change in feature weights or coefficients to derive the correct label from the feature set.
- the model learned from the historical data is likely to provide incorrect predictions.
- refreshing the model may still result in inaccurate predictions on newly measured scoring data.
- FIG. 1 is a block diagram that depicts an example model training system for detecting label shift and adjusting training instances, in an embodiment
- FIG. 2 is a flow diagram that depicts an example process for label shift detection and adjustment, in an embodiment
- FIG. 3 is an example data plot that depicts historical data and forecast data, in an embodiment
- FIG. 4 is a flow diagram that depicts an example process for adjusting training instances on a segment-wise basis, in an embodiment
- FIG. 5 is a block diagram that illustrates a computer system upon which an embodiment of the invention may be implemented.
- a system and method for automatically adjusting training data in response to a detection of shift in labels are provided.
- historical data is automatically analyzed to generate and train a forecasting model.
- the forecasting model is used to predict an aggregate value of a particular metric.
- the predicted aggregate value is compared to an actual or observed aggregate value of the particular metric. If the difference between the two aggregate values is significant, then a shift in labels is detected and triggers an adjustment of training data upon which a machine-learned model was trained.
- the training data is divided based on segments and the training instances of different segments are adjusted differently. For example, the importance weights of instances in one segment may be adjusted positively while the importance weights of instances in another segment may be adjusted negatively.
- Embodiments improve computer-related technology by automatically and, in a data-driven and scientific way, adjusting instances in training data to improve the accuracy of models in light of significant shifts in label distribution.
- Embodiments involve a novel model treatment system that comprises two main components, where model label shift detection provides directional guidance to model label shift adjustment, and model label shift adjustment is a follow-up step to model label shift detection.
- embodiments leverage forecasting models to auto-detect model label shift even though the main purpose of forecasting has been producing an accurate prediction in order to take a prompt action, such as weather forecasting and economic forecasting.
- the forecasting model(s) described herein serve as a powerful and scientific tool to detect target label distribution shift.
- the forecasting results are just an intermediate step. Using the forecasting results to further identify the model label shift and adjust the model accordingly is a primary goal, which is inherently different from prior usages of forecasting.
- embodiments leverage a novel segment-wise variation of model label shift adjustment.
- existing model label shift adjustment approaches do not take segmentation factor into consideration.
- segmentation factor matters greatly since the extent of label shift could vary significantly across different segments.
- FIG. 1 is a block diagram that depicts an example model training system 100 for detecting label shift and adjusting training instances, in an embodiment.
- Model training system 100 includes historical training data 110 , a model trainer 120 , a machine-learned model 130 , a historical scoring data set 140 , historical results 150 , a label shift detector 160 , a label shift adjustor 170 , and a future scoring data set 180 .
- Model trainer 120 , label shift detector 160 , and label shift adjustor 170 are implemented in software, hardware, or any combination of software and hardware.
- Each of model trainer 120 , label shift detector 160 , and label shift adjustor 170 may be implemented on a single computing device, individually on multiple computing devices, or distributed on multiple computing devices where different functionality of each component is implemented on different computing devices.
- Model trainer 120 takes historical training data 110 as input to train machine-learned model 130 .
- Each training instance in historical training data 110 includes a set of feature values and a label.
- the features and label of machine-learned model 130 vary depending on what is being predicted. For example, machine-learned model 130 may predict whether computer resource utilization of a cloud system is about to exceed capacity of the cloud system (in which case current resource utilization statistics and current capacity statistics may be features of the model), whether a user is going to perform a particular (e.g., online) action in response to certain stimuli (in which case attributes of the user and attributes of the stimuli are features of the model), or whether a weather event is going to occur given certain known conditions (in which case current weather conditions such as temperature, wind, humidity, barometric pressure may be features of the model).
- Machine-learned model 130 may be a binary classification model (that predicts whether a certain entity or event belongs to one of two classes), a multi-class classification model (that predicts whether a certain entity or event belongs to one of multiple classes), or another type of model, such as a regression model that outputs a continuous quantity, such as a specific dollar value for which a house is predicted to sell.
- Historical scoring data set 140 comprises multiple scoring instances, each comprising a set of feature values that is input into machine-learned model 130 .
- machine-learned model 130 computes a predicted label or a score reflecting a prediction, whether the predicted label is a classification label or a regression label.
- the predicted labels computed by machine-learned model 130 may be recorded in the appropriate scoring instances in historical scoring data set 140 .
- Model training system 100 also records actual or observed labels in historical results 150 .
- the observed labels in historical results 150 are different than the predicted labels that were generated by machine-learned model 130 based on input from historical scoring data set 140 .
- Historical results 150 indicate observed or actual events or (e.g., user) behavior.
- Each observed label corresponds to a scoring instance in historical scoring data set 140 (or a training instance in historical training data 110 ).
- a predicted label of a particular scoring instance (in historical scoring data set 140 ) is a score indicating a likelihood that a particular user will perform a particular online action in response to a notification or message and the particular user did not perform the particular online action
- the observed action is recorded, in historical results 150 , as a value indicating a negative result, such as a ‘0.’
- the particular user did perform the particular online action then the observed action is recorded, in historical results 150 , as a value indicating a positive result, such as a ‘ 1.’
- Observed labels may be automatically generated by one or more processes that determine whether a certain event or action occurred. In some scenarios, the observed label is generated based on what is not found in a data set. For example, if there is no record of a user responding to a notification within two days of receiving the notification, then an observed label indicating that the event did not occur is generated and recorded.
- Observed labels are automatically associated with a scoring instance that was used to generate a predicted label. For example, if the event being predicted is a user action, then the user is associated with a scoring instance identifier or with a user identifier and a model identifier. Each scoring instance is associated with a scoring instance identifier or a combination of a model identifier and a user identifier. In this way, observed labels in historical results 150 are mapped to (or associated with) scoring instances in historical scoring data set 140 .
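The mapping from observed results to scoring instances described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the instance keys, field names, and two-day response window are hypothetical assumptions taken from the example in the preceding bullets.

```python
from datetime import datetime, timedelta

# Hypothetical scoring instances keyed by (model identifier, user identifier).
scoring_instances = {
    ("model_a", "user_1"): {"sent_at": datetime(2020, 3, 1)},
    ("model_a", "user_2"): {"sent_at": datetime(2020, 3, 1)},
}

# Observed responses; user_2 never responded, so no record exists.
responses = {("model_a", "user_1"): datetime(2020, 3, 2)}

def observed_label(key, window=timedelta(days=2)):
    # The absence of a response record within the window is itself
    # treated as a negative observation ('0').
    responded_at = responses.get(key)
    if responded_at is None:
        return 0
    return 1 if responded_at - scoring_instances[key]["sent_at"] <= window else 0

# Observed labels mapped to scoring instances, as in historical results 150.
historical_results = {key: observed_label(key) for key in scoring_instances}
```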
- At least some of the observed labels in historical results 150 may be for scoring instances that are not yet reflected in training instances in historical training data 110 .
- at least a portion of historical results 150 may be newer or “fresher” data than training instances found in historical training data 110 .
- historical training data 110 may include observed labels that were generated between January and December of one year while historical results 150 may include observed labels that were generated between October of the same year and June of the following year.
- the observed labels in historical training data 110 may be a strict subset of the observed labels in historical results 150 .
- Label shift detector 160 analyzes historical results 150 to detect shift in the distribution of observed labels, which detecting is described in more detail below.
- Label shift detector 160 includes a forecast model generator 162 and a forecasting model 164 that forecast model generator 162 generates. Although only a single forecasting model is depicted, forecast model generator 162 may generate multiple forecasting models based on historical data, such as one for each segment of multiple segments.
- label shift adjustor 170 (also described in more detail herein) adjusts or modifies importance weights of training instances in historical training data 110 to generate adjusted training data 112 .
- Model trainer 120 trains a new model 132 based on adjusted training data 112 .
- the new model 132 is applied to each scoring instance in future scoring data set 180 (for which labels are not yet known at the time of label shift detection and adjustment) in order to generate output labels or predictions therefor.
- FIG. 2 is a flow diagram that depicts an example process 200 for label shift detection and adjustment, in an embodiment.
- Process 200 may be implemented by different components of model training system 100 .
- label shift is detected (e.g., by label shift detector 160 ) based on historical results 150 .
- Label shift may be detected using one or more forecasting models that are trained based on observed labels, some of which may be reflected in historical results 150 .
- Label shift may be considered “significant” if an aggregate output value is outside a certain range of values or if a shift measure is above a particular threshold, for example, if an aggregate is outside a 95% confidence interval. If the determination is negative, then process 200 proceeds to block 230 , where machine-learned model 130 is refreshed based on historical scoring data set 140 and historical results 150 . If the determination is positive, then process 200 proceeds to block 240 .
- a segment is a grouping of one or more entities (e.g., people) that share one or more characteristics in common.
- Block 240 may be performed by label shift detector 160 or by another component of model training system 100 .
- Segment-wise discrepancy refers to a situation in which the shift in label distribution among different segments of entities is substantially different. For example, if, based on historical results 150 , overall label shift is outside a 95% confidence interval and the magnitude of label shift of each segment within historical results 150 is similar, then there is unlikely to be significant segment-wise discrepancy. On the other hand, if, based on historical results 150 , overall label shift is outside a 95% confidence interval and the label shift of half of the segments within historical results 150 is not outside the 95% confidence interval, then there is segment-wise discrepancy.
- process 200 proceeds to block 250 where all (or most) training instances in historical training data 110 are adjusted or modified, regardless of segment. If the determination in block 240 is positive, then process 200 proceeds to block 260 where training instances in historical training data 110 are adjusted on a segment-wise basis. For example, training instances corresponding to one segment are adjusted a first amount while training instances corresponding to another segment are adjusted a second amount.
- Label shift detector 160 detects shifts in the distribution of observed labels over time. Detecting such a shift may be performed in one or more ways. For example, if the ratio of values of observed labels is relatively constant over time (i.e., with very little variation), then a simple difference may be computed between (1) the ratio of values of observed labels during a first time period and (2) the ratio of values of observed labels during a second (subsequent) time period. A shift metric may be defined based on the difference, depending on the possible values of the output labels. For example, in a binary classification scenario, a distribution of 30/70 compared to a distribution of 60/40 represents a 30-point shift. Any shift over 15 points may be considered significant.
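The simple shift metric above can be sketched as follows; the distribution pairs are the example values from the text, and the 15-point threshold is the example threshold:

```python
def shift_points(dist_a, dist_b):
    """Difference, in percentage points, between the positive shares of
    two binary label distributions given as (positive, negative) pairs."""
    return abs(dist_a[0] - dist_b[0])

# A 30/70 distribution compared to a 60/40 distribution is a 30-point shift.
shift = shift_points((30, 70), (60, 40))
significant = shift > 15  # significant under the example 15-point threshold
```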
- one or more forecasting models are trained based on a portion of historical training data 110 and/or a portion of historical results 150 .
- the data upon which a forecasting model is trained is time series data comprising multiple data points, each corresponding to a different period of time and corresponding to an aggregate of observed labels (in historical training data 110 and/or historical results 150 ) that occurred in the corresponding period of time.
- observed labels may be aggregated on a daily basis, a weekly basis, or a monthly basis.
- the aggregation may be a sum, such as a daily sum or a weekly sum, or an average/median value, such as a daily average on a weekly basis or a weekly average on a monthly basis.
- each data point in the time series data reflects an aggregate value.
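The aggregation step described above might be sketched with pandas, one possible tool choice not named in the text; the timestamps and label values are hypothetical:

```python
import pandas as pd

# Hypothetical observed labels, one row per event, indexed by timestamp.
labels = pd.DataFrame(
    {"observed": [1, 0, 1, 1, 0, 1]},
    index=pd.to_datetime(
        ["2020-01-01", "2020-01-03", "2020-01-08",
         "2020-01-10", "2020-01-15", "2020-01-16"]
    ),
)

# Weekly sums: each point of the time series aggregates one week of
# observed labels, as in the weekly-sum example above.
weekly_sums = labels["observed"].resample("W").sum()
```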
- the one or more forecasting models take into account historical (and presumably “natural”) trends reflected in changes in distribution of observed labels.
- the portions of historical training data 110 and/or historical results 150 upon which the forecasting models are trained reflect a period of time before a particular point of time, referred to herein as a “candidate shift point in time.”
- a candidate shift point in time refers to a point in time that may correspond to a start in a potential shift in label distribution.
- a candidate shift point in time may be identified based on input from a user, such as a developer of machine-learned model 130 or a data scientist. For example, a user may guess, based on preliminary reports or data, that a significant shift in label distribution has begun. As another example, a user, reading news reports about a global event, may anticipate that machine-learned model 130 will start performing poorly.
- label shift detector 160 automatically identifies multiple candidate shift points in time. For example, each day in the past may act as a candidate shift point in time. Thus, label shift detector 160 may perform shift detection on a daily basis where, for each day it executes, label shift detector 160 uses a week before the current day as the candidate shift point in time.
- a forecasting model is trained based on observed labels generated prior to the candidate shift point in time, the forecasting model is leveraged to produce a forecast or a prediction of one or more labels after the candidate shift point in time.
- Input into the forecasting model may be a number, indicating a number of forecasted values. For example, if data upon which the forecasting model is trained is a weekly sum over the last fourteen months, then an input value of three indicates that the forecasting model is to produce three forecasted values, each representing a weekly sum and one for each of three weeks after the candidate shift point in time.
- Label shift detector 160 compares the forecast to observed values that are based on observed labels that were generated (or that reflect events or activity that occurred) after the candidate shift point in time. Like forecast values, observed values may reflect aggregated data, except that the data that is aggregated is from historical results 150 . For example, if each forecast value is a daily sum, then an observed value is also a daily sum.
- label shift detector 160 determines that a significant shift occurred.
- a measure of significance may vary from one implementation to another. For example, if an observed value is greater than 20% different from a forecast value, then the shift is significant.
- a user (such as an administrator of model training system 100 ) may define the significance measure.
- the measure of significance depends on how accurate the forecasting model is. For example, if the error of the forecasting model against historical data representing events that occurred prior to the candidate point in time is relatively small, then even a relatively small difference between an observed value and a forecast value may trigger the detection of a significant event. Conversely, if the error of the forecasting model is relatively large, then the difference between an observed value and a forecast value must be relatively large in order to trigger a detection of a significant event.
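One way to sketch this error-dependent significance test is below; the multiplier k and the use of a single scalar error measure are illustrative choices, not specified by the text:

```python
def significant_shift(observed, forecast, model_error, k=2.0):
    # A shift is flagged when the observed aggregate deviates from the
    # forecast by more than k times the forecasting model's historical
    # error, so a more accurate model triggers on smaller differences.
    return abs(observed - forecast) > k * model_error

# The same observed-vs-forecast gap is significant for an accurate model
# but not for an inaccurate one.
small_error_trigger = significant_shift(observed=80, forecast=100, model_error=5)
large_error_trigger = significant_shift(observed=80, forecast=100, model_error=15)
```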
- label shift detector 160 performs label shift detection on a per-segment basis.
- a segment corresponds to a portion of scoring instances and/or training instances that share one or more feature values in common or that share other characteristics (that are related to one or more model features) in common. For example, if a training instance corresponds to a specific user, then one segment may be all users who live in North America and another segment may be all users who live in South America. However, the only possible values for the geography feature may be countries. Therefore, even though no scoring instance or training instance indicates North America as a geographic feature value, instances that indicate a country in North America are grouped together if there is a mapping from the country to North America.
- one segment may be all applications that comprise two or more stateful operations, another segment may be all applications that comprise only one stateful operation, and another segment may be all applications that do not comprise any stateful operations.
- the one or more features are of the entity or event for which a prediction is being made, such as a user, a software application, an organization, a country, or a weather phenomenon.
- Example features for users and/or organizations include geography, industry, job function, employment status, seniority level. and job title.
- a forecasting model is generated for each segment.
- the data upon which each forecasting model is based is limited to observed labels that correspond to the segment that corresponds to the forecasting model. For example, all observed labels in historical results 150 corresponding to users in North America are analyzed to generate a time series of daily sums over a period of time. Such a time series of daily sums is used to train a forecasting model for the North America segment. Similarly, all observed labels in historical results 150 corresponding to users in South America are analyzed to generate a time series of daily sums over a (same) period of time. Such a time series of daily sums is used to train a forecasting model for the South America segment.
- label shift detector 160 (or another component of model training system 100 ) implements an exponential smoothing algorithm in order to generate a set of forecasting models.
- Each forecasting model in the set is a state space model and may be represented in a component form that includes three different components: error, trend, and seasonal. Each component has a finite number of variations.
- the error component has two possible variations: Additive (A) and Multiplicative (M).
- the trend component has five possible variations: None (N), Additive (A), Additive damped (Ad), Multiplicative (M) and Multiplicative damped (Md).
- the seasonal component has three possible variations: None (N), Additive (A) and Multiplicative (M).
- ETS(Error,Trend,Seasonal) may be used to denote the thirty possible models (two error variations × five trend variations × three seasonal variations). This notation helps in remembering the order in which the components are specified; e.g., model ETS(A,Ad,M) denotes the model with additive errors, additive damped trend, and multiplicative seasonality.
- the thirty possible models share a general component form.
- {ε_t} are independent and identically distributed Gaussian variables with mean 0 and variance σ²; l_t denotes the level of the series at time t; b_t denotes the slope (or growth) of the series at time t; s_t, s_{t−1}, . . . , s_{t−m} are seasonal components; and m is the length of seasonality.
- w(·), r(·), f(·), and g(·) depend on the component variations.
- the simplest model in exponential smoothing methods is Simple Exponential Smoothing ETS(A,N,N).
- the likelihood of the state space model is relatively straightforward to compute and the maximum likelihood estimates of the model parameters may be obtained.
- a model is selected by minimizing one or more selection criteria.
- selection criteria include AIC (Akaike's Information Criterion), AICc (AIC corrected for small sample bias), and BIC (Bayesian Information Criterion).
- FIG. 3 is an example data plot 300 that depicts time series data (specifically, aggregated statistics over time) along with confidence intervals. Some of the time series data pertain to points in time that are prior to a candidate shift point in time; other data points are forecasted values after the candidate shift point in time (i.e., line 305 ); and still other data points (i.e., the point below outer shaded region 320 ) are based on observed labels and also pertain to points in time after the candidate shift point in time.
- the forecasting model that generated the forecast values in data plot 300 is denoted as ETS(M,N,M).
- the x-axis is time and is divided into months, while the y-axis is an aggregated statistic that represents a number of events that occurred in a monthly period. While this forecasting model was generated on monthly data, the forecasting model may have instead been generated on weekly or daily data. However, aggregating the events on a monthly basis removes significant variation in such finer granularity data and reduces the effect of outliers, which, if used to train the forecasting model, might make the forecasting model relatively inaccurate, widening the confidence intervals and, therefore, reducing the ability to detect significant label shift.
- the candidate shift point in time is February 2020 and there are three forecast values (making up line 305 ): one for February of 2020, one for March of 2020, and one for April of 2020.
- Data plot 300 also shows two aggregated statistics, each of which is based on observed labels that pertain to events associated with February of 2020 (i.e., in inner shaded region 310 ) or March of 2020 (i.e., below outer shaded region 320 ).
- data plot 300 depicts three shaded regions beginning with the candidate shift point in time.
- the inner shaded region 310 indicates a confidence interval of 80%, indicating that, statistically speaking, the forecasting model is 80% confident that an observed (e.g., aggregated) value will fall within inner shaded region 310 .
- the outer shaded regions 320 indicate a confidence level of 95%, indicating that, statistically speaking, the forecasting model is 95% confident that an observed (e.g., aggregated) value will fall within outer shaded regions 320 or inner shaded region 310 .
- label shift detector 160 determines that there is significant label shift, which triggers label shift adjustor 170 .
- the second aggregated statistic (corresponding to March of 2020) after the candidate shift point in time is outside outer shaded regions 320 , indicating that the second aggregate statistic represents significant label shift, or an anomaly.
- label shift detector 160 determines that there is significant label shift. For example, not one of the aggregated statistics falls outside a larger confidence interval (e.g., 95%), but two consecutive aggregated statistics fall outside a smaller (though still relatively large) confidence interval (e.g., 80%). In such a scenario, label shift adjustor 170 may be triggered.
- the direction of the label shift may dictate whether any label shift adjustment should be made. For example, if an aggregated value is outside a particular confidence interval and is greater than a corresponding forecast value, then no label shift adjustment is triggered. On the other hand, if an aggregated value is outside a particular confidence interval and is less than a corresponding forecast value, then label shift adjustment is triggered.
- forecast model generator 162 generates a different forecasting model for each segment of multiple segments.
- the forecasting model for one segment may have different ETS components than the forecasting model for another segment.
- a forecasting model for a first segment may be denoted as ETS(M,N,M) while a forecasting model for a second segment may be denoted as ETS(A,N,N).
- thirty possible forecasting models are generated for the first segment (based on the training instances that correspond to the first segment) and the forecasting model denoted as ETS(M,N,M) is ultimately selected for the first segment based on the described selection criteria.
- thirty possible forecasting models are generated for the second segment (based on the training instances that correspond to the second segment) and the forecasting model denoted as ETS(A,N,N) is ultimately selected for the second segment based on the same selection criteria.
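For illustration only, the thirty-model candidate space and the selection step can be sketched as follows. This is a toy sketch, not the embodiment's implementation: the thirty ETS specifications arise from two error types, five trend types (including damped variants), and three seasonal types, and only the ETS(A,N,N) fit is implemented here, with a crude AIC as the selection criterion; in practice a forecasting library would fit all thirty candidates.

```python
import itertools
import math

# The thirty candidate ETS specifications: 2 error types x 5 trend types
# (none, additive, damped additive, multiplicative, damped multiplicative)
# x 3 seasonal types.
ERRORS = ["A", "M"]
TRENDS = ["N", "A", "Ad", "M", "Md"]
SEASONALS = ["N", "A", "M"]
CANDIDATES = ["ETS({},{},{})".format(e, t, s)
              for e, t, s in itertools.product(ERRORS, TRENDS, SEASONALS)]

def fit_ets_ann(series, alphas=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Fit only ETS(A,N,N) (simple exponential smoothing) by grid search
    over the smoothing parameter alpha; returns (alpha, aic). A full
    implementation would fit all thirty candidates and keep the one with
    the best selection criterion (e.g., lowest AIC)."""
    n = len(series)
    best = None
    for alpha in alphas:
        level = series[0]
        sse = 0.0
        for y in series[1:]:
            sse += (y - level) ** 2            # one-step-ahead error
            level += alpha * (y - level)       # update the level state
        aic = n * math.log(sse / n) + 2 * 2    # crude Gaussian AIC, 2 params
        if best is None or aic < best[1]:
            best = (alpha, aic)
    return best
```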
- label shift adjuster 170 adjusts training instances in historical training data 110 in response to label shift detector 160 detecting large or significant label shift in at least a portion of historical results 150 .
- “Adjusting” or modifying a training instance may involve modifying an importance weight of the training instance or modifying a label of the training instance.
- An importance weight of a training instance indicates how much coefficients or weights of features are adjusted during training of a machine-learned model based on the training instance. The higher the importance weight, the greater the adjustment of the coefficients or weights of the features of the model. Conversely, the lower the importance weight, the lesser the adjustment of the coefficients or weights of the features of the model.
- non-zero labels may be modified. For example, if a positive label is 1, then a new value for the positive label is 1·w, where 0 &lt; w &lt; 1. If a negative label is 0, then the negative label remains unmodified. Alternatively, the negative label may be modified to become a negative number.
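The two adjustment modes above can be illustrated with a short sketch; the dict-based instance representation and the function name are hypothetical, not part of the described embodiments.

```python
def adjust_training_instance(instance, w, mode="weight"):
    """Apply one of the two adjustment modes described above to a
    training instance, represented here as a dict with 'label' and
    'weight' keys (a hypothetical representation; 0 < w < 1)."""
    adjusted = dict(instance)
    if mode == "weight":
        # modify the importance weight of the instance
        adjusted["weight"] = instance["weight"] * w
    else:
        # modify the label; only non-zero (positive) labels change
        if instance["label"] != 0:
            adjusted["label"] = instance["label"] * w
    return adjusted
```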
- a ratio of (1) an aggregated statistic that is based on observed labels that were generated after the candidate shift point in time to (2) a forecast value that corresponds to the same time period as the aggregated statistic is computed and applied to importance weights in the training instances.
- the importance weight of each training instance is assigned the value of 54/97.
- such an adjustment is not statistically or mathematically sound.
- X is the feature vector and Y is the label, where X and Y have a joint distribution p(X, Y) in the historical data set and q(X, Y) is the joint distribution in the scoring data set (e.g., future scoring data set 180), and l is a loss function defined as l: Y × Y → R+.
- l is a loss function that takes its input from a two-dimensional space Y ⁇ Y, and its output is in a one-dimensional space R + (i.e., non-negative real number space).
- an example of loss function l is l(ƒ(X), Y) = (ƒ(X) − Y)², where l takes the two values ƒ(X) and Y from the two-dimensional space Y × Y as the input and produces the non-negative real number (ƒ(X) − Y)² as the output, where ƒ(X) stands for the predicted label via model ƒ.
- the objective of predictive modeling is to learn a model ƒ: X → Y that minimizes E_{X,Y∼p} l(ƒ(X), Y), where E_{X,Y∼p} l(ƒ(X), Y) is the expectation of the loss function l(ƒ(X), Y), given that X and Y are subject to the joint distribution p.
- E stands for “expectation”
- ⁇ stands for “subject to a distribution.”
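In practice, the expectation is approximated empirically: with samples drawn from p, E_{X,Y∼p} l(ƒ(X), Y) becomes an average of per-sample losses. A small sketch using the squared loss defined above (the function name is illustrative):

```python
def empirical_risk(model, samples):
    """Approximate E_{X,Y~p}[(f(X) - Y)^2] by the average squared loss
    over (x, y) pairs drawn from the joint distribution p."""
    return sum((model(x) - y) ** 2 for x, y in samples) / len(samples)
```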
- the pre-trained model ⁇ is still valid in the scoring data set (e.g., future scoring data set 180 ). However, the pre-trained model ⁇ is not valid in the scenario when the label shift issue exists.
- the proportion of positive labels could be much larger in the historical dataset than in the scoring data set, which leads to the potential logic change of the label derivation from the features.
- in the label shift scenario, p(Y|X) ≠ q(Y|X), which leads to p(X, Y) ≠ q(X, Y).
- the optimal model ⁇ tilde over ( ⁇ ) ⁇ for the scoring data set is the minimizer of E X, Y ⁇ q l( ⁇ tilde over ( ⁇ ) ⁇ (X), Y), which should be different from the model ⁇ learned from the historical dataset.
- Black-Box Shift Estimation is one such technique.
- a key assumption in BBSE is called the label shift assumption: p(X|Y) = q(X|Y).
- the BBSE approach is extended to account for different segments.
- the label shift assumption under segmentation is: p(X_c | Y, X_s) = q(X_c | Y, X_s), where X_s denotes the segment-defining feature(s) and X_c denotes the remaining features.
- a confusion matrix is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning one. Each row of the matrix represents the instances in a predicted class while each column represents the instances in an actual class (or vice versa).
- the name “confusion matrix” stems from the fact that the matrix makes it easy to see if a system is confusing two classes (i.e. commonly mislabeling one as another).
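A confusion matrix following the row/column convention above can be computed with a few lines of code; this is a generic sketch (classes are assumed to be encoded as integers 0..k−1):

```python
def confusion_matrix(y_pred, y_true, k):
    """Build a k x k confusion matrix: rows index the predicted class,
    columns the actual class. Classes are integers 0..k-1."""
    m = [[0] * k for _ in range(k)]
    for p, t in zip(y_pred, y_true):
        m[p][t] += 1
    return m
```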
- for segment s, there are (a) n_s samples in the historical dataset {(x_{1,s}, y_{1,s}), . . . , (x_{n_s,s}, y_{n_s,s})} drawn from p(X, Y) and (b) m_s samples in the scoring dataset {x′_{1,s}, . . . , x′_{m_s,s}} drawn from q(X).
- the k-dimensional weight vector for segment s is estimated as ŵ_s(Y) = Ĉ_{p,s}(ƒ(X), Y)⁻¹ q̂_s(ƒ(X)), where Ĉ_{p,s} is the confusion matrix of predicted versus observed labels estimated on historical data within segment s.
- model ⁇ tilde over ( ⁇ ) ⁇ is obtained by minimizing a weighted sum of loss functions
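The per-segment weight estimate ŵ_s = Ĉ_{p,s}⁻¹ q̂_s can be sketched in code. This is an illustrative reconstruction under the usual BBSE recipe (the function name and data layout are hypothetical): the confusion matrix is estimated on held-out historical data, the predicted-label distribution on scoring data, and the weights come from solving the resulting linear system.

```python
import numpy as np

def bbse_weights(val_pred, val_true, scoring_pred, k):
    """Estimate per-class weights w_hat = C^{-1} q_hat for one segment.
    val_pred/val_true: predicted and observed labels on held-out data
    drawn from p; scoring_pred: predicted labels on data drawn from q;
    classes are encoded 0..k-1."""
    n = len(val_pred)
    C = np.zeros((k, k))
    for p_lab, t_lab in zip(val_pred, val_true):
        C[p_lab, t_lab] += 1.0 / n          # estimate of p(f(X)=i, Y=j)
    q_hat = np.bincount(scoring_pred, minlength=k) / len(scoring_pred)
    return np.linalg.solve(C, q_hat)        # w_hat(Y) = C^{-1} q_hat(f(X))
```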
- FIG. 4 is a flow diagram that depicts an example process 400 for adjusting training instances on a segment-wise basis, in an embodiment.
- Process 400 may be implemented by label shift adjuster 170 .
- a machine-learned model (e.g., machine-learned model 130) is trained, using one or more machine learning techniques, based on training data (e.g., historical training data 110).
- a segment from a set of segments is selected.
- the set of segments may include all possible segments. For example, if the segments are defined based on the geography feature and there are five possible values for the geography feature, then there are initially five segments at the beginning of process 400 .
- ⁇ circumflex over (q) ⁇ s is an estimate of the predicted label distribution q s .
- ⁇ s is the estimated weights for k classes (and each dimension corresponds to one class) applied on the training instances within segment s.
- a proportion of the selected segment s in the validation/testing dataset ⁇ circumflex over (p) ⁇ (s) and in the scoring dataset ⁇ circumflex over (q) ⁇ (s) is estimated.
- p̂(s) is the proportion of instances within segment s in the validation/testing dataset (i.e., the number of validation/testing instances in segment s divided by the total number of validation/testing instances).
- q̂(s) is the proportion of instances within segment s in the scoring dataset (i.e., m_s divided by the total number of scoring instances across all segments).
- training instances in the training data that correspond to the selected segment are adjusted by ŵ_s·(q̂(s)/p̂(s)).
- a portion of historical training data 110 that corresponds to the selected segment is modified by the product ŵ_s·(q̂(s)/p̂(s)).
- Such modification may involve multiplying an importance weight of each training instance associated with the selected segment s by the above product.
- all training instances in segment s may be weighted according to the k-dimensional vector ŵ_s·(q̂(s)/p̂(s)); that is, each training instance in segment s with its label Y taking value v (v is one of the k values in the label set {1, . . . , k}) will be assigned the weight ŵ_s(v)·(q̂(s)/p̂(s)).
- Block 470 may involve including the modified training instances in adjusted training data 112 .
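Blocks 450-470 reduce to a single weight update per segment. A minimal sketch (the dict-based instance layout is hypothetical):

```python
def adjust_segment(instances, w_s, q_s, p_s):
    """Scale the importance weight of each training instance in segment
    s by w_s[y] * (q_s / p_s), where y is the instance's label (0..k-1),
    w_s is the segment's k-dimensional estimated weight vector, and
    q_s/p_s are the segment's estimated proportions in the scoring and
    validation/testing datasets."""
    ratio = q_s / p_s
    return [dict(inst, weight=inst["weight"] * w_s[inst["label"]] * ratio)
            for inst in instances]
```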
- if the determination in block 480 is positive, then process 400 proceeds to block 420, where another segment is selected. If the determination in block 480 is negative, then process 400 proceeds to block 490. When process 400 proceeds to block 490, all (or potentially all) training instances have been modified.
- a new model is trained based on the adjusted or modified training data.
- model trainer 120 trains new model 132 based on adjusted training data 112 .
- the new model may have the same set of features as the machine-learned model in block 410 or may have a different set of features. For example, some features may have been added to, or removed from, the set of features upon which machine-learned model 130 was trained. Scoring instances from future scoring data set 180 may then be input into new model 132 to generate a score or prediction for each.
- the techniques described herein are implemented by one or more special-purpose computing devices.
- the special-purpose computing devices may be hard-wired to perform the techniques, or may include digital electronic devices such as one or more application-specific integrated circuits (ASICs) or field programmable gate arrays (FPGAs) that are persistently programmed to perform the techniques, or may include one or more general purpose hardware processors programmed to perform the techniques pursuant to program instructions in firmware, memory, other storage, or a combination.
- Such special-purpose computing devices may also combine custom hard-wired logic, ASICs, or FPGAs with custom programming to accomplish the techniques.
- the special-purpose computing devices may be desktop computer systems, portable computer systems, handheld devices, networking devices or any other device that incorporates hard-wired and/or program logic to implement the techniques.
- FIG. 5 is a block diagram that illustrates a computer system 500 upon which an embodiment of the invention may be implemented.
- Computer system 500 includes a bus 502 or other communication mechanism for communicating information, and a hardware processor 504 coupled with bus 502 for processing information.
- Hardware processor 504 may be, for example, a general purpose microprocessor.
- Computer system 500 also includes a main memory 506 , such as a random access memory (RAM) or other dynamic storage device, coupled to bus 502 for storing information and instructions to be executed by processor 504 .
- Main memory 506 also may be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 504 .
- Such instructions when stored in non-transitory storage media accessible to processor 504 , render computer system 500 into a special-purpose machine that is customized to perform the operations specified in the instructions.
- Computer system 500 further includes a read only memory (ROM) 508 or other static storage device coupled to bus 502 for storing static information and instructions for processor 504 .
- a storage device 510 such as a magnetic disk, optical disk, or solid-state drive is provided and coupled to bus 502 for storing information and instructions.
- Computer system 500 may be coupled via bus 502 to a display 512 , such as a cathode ray tube (CRT), for displaying information to a computer user.
- An input device 514 is coupled to bus 502 for communicating information and command selections to processor 504 .
- Another type of user input device is cursor control 516, such as a mouse, a trackball, or cursor direction keys for communicating direction information and command selections to processor 504 and for controlling cursor movement on display 512.
- This input device typically has two degrees of freedom in two axes, a first axis (e.g., x) and a second axis (e.g., y), that allows the device to specify positions in a plane.
- Computer system 500 may implement the techniques described herein using customized hard-wired logic, one or more ASICs or FPGAs, firmware and/or program logic which in combination with the computer system causes or programs computer system 500 to be a special-purpose machine. According to one embodiment, the techniques herein are performed by computer system 500 in response to processor 504 executing one or more sequences of one or more instructions contained in main memory 506 . Such instructions may be read into main memory 506 from another storage medium, such as storage device 510 . Execution of the sequences of instructions contained in main memory 506 causes processor 504 to perform the process steps described herein. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions.
- Non-volatile media includes, for example, optical disks, magnetic disks, or solid-state drives, such as storage device 510 .
- Volatile media includes dynamic memory, such as main memory 506 .
- storage media include, for example, a floppy disk, a flexible disk, hard disk, solid-state drive, magnetic tape, or any other magnetic data storage medium, a CD-ROM, any other optical data storage medium, any physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, NVRAM, any other memory chip or cartridge.
- Storage media is distinct from but may be used in conjunction with transmission media.
- Transmission media participates in transferring information between storage media.
- transmission media includes coaxial cables, copper wire and fiber optics, including the wires that comprise bus 502 .
- transmission media can also take the form of acoustic or light waves, such as those generated during radio-wave and infra-red data communications.
- Various forms of media may be involved in carrying one or more sequences of one or more instructions to processor 504 for execution.
- the instructions may initially be carried on a magnetic disk or solid-state drive of a remote computer.
- the remote computer can load the instructions into its dynamic memory and send the instructions over a telephone line using a modem.
- a modem local to computer system 500 can receive the data on the telephone line and use an infra-red transmitter to convert the data to an infra-red signal.
- An infra-red detector can receive the data carried in the infra-red signal and appropriate circuitry can place the data on bus 502 .
- Bus 502 carries the data to main memory 506 , from which processor 504 retrieves and executes the instructions.
- the instructions received by main memory 506 may optionally be stored on storage device 510 either before or after execution by processor 504 .
- Computer system 500 also includes a communication interface 518 coupled to bus 502 .
- Communication interface 518 provides a two-way data communication coupling to a network link 520 that is connected to a local network 522 .
- communication interface 518 may be an integrated services digital network (ISDN) card, cable modem, satellite modem, or a modem to provide a data communication connection to a corresponding type of telephone line.
- communication interface 518 may be a local area network (LAN) card to provide a data communication connection to a compatible LAN.
- Wireless links may also be implemented.
- communication interface 518 sends and receives electrical, electromagnetic or optical signals that carry digital data streams representing various types of information.
- Network link 520 typically provides data communication through one or more networks to other data devices.
- network link 520 may provide a connection through local network 522 to a host computer 524 or to data equipment operated by an Internet Service Provider (ISP) 526 .
- ISP 526 in turn provides data communication services through the world wide packet data communication network now commonly referred to as the “Internet” 528 .
- Internet 528 uses electrical, electromagnetic or optical signals that carry digital data streams.
- the signals through the various networks and the signals on network link 520 and through communication interface 518 which carry the digital data to and from computer system 500 , are example forms of transmission media.
- Computer system 500 can send messages and receive data, including program code, through the network(s), network link 520 and communication interface 518 .
- a server 530 might transmit a requested code for an application program through Internet 528 , ISP 526 , local network 522 and communication interface 518 .
- the received code may be executed by processor 504 as it is received, and/or stored in storage device 510 , or other non-volatile storage for later execution.
Abstract
Techniques for detecting label shift and adjusting training data of predictive models in response are provided. In an embodiment, a first machine-learned model is used to generate a predicted label for each of multiple scoring instances. The first machine-learned model is trained using one or more machine learning techniques based on a plurality of training instances, each of which includes an observed label. In response to detecting a shift in observed labels, for each segment of one or more segments in multiple segments, a portion of training data that corresponds to the segment is identified. For each training instance in a subset of the portion of training data, the training instance is adjusted. The adjusted training instance is added to a final set of training data. The machine learning technique(s) are used to train a second machine-learned model based on the final set of training data.
Description
The present disclosure relates generally to machine learning and, more particularly, to automatically detecting shift in output labels and automatically adjusting labels in training data based on the detection.
Machine learning is the study and construction of algorithms that can learn from, and make predictions on, data. Such algorithms operate by building a model from inputs in order to make data-driven predictions or decisions. Thus, a machine learning technique is used to generate a statistical model that is trained based on a history of attribute values associated with one or more objects. The statistical model is trained based on multiple attributes described herein. In machine learning parlance, such attributes are referred to as “features.” To generate and train a statistical model, a set of features is specified and a set of training data is identified.
The accuracy of a machine-learned model largely depends on the quality and quantity of the training data. For example, if there are not enough training instances in the training data, then the model will not be able to make accurate predictions for inputs that are similar (but not identical) to the training instances. As another example, if the training instances do not reflect real-world scenarios, then the resulting model will not be able to make accurate predictions.
Changes in an environment for which predictions are made are natural and common. For example, a cloud service that monitors performance of, and resource consumption by, cloud applications may implement a model to predict how many computer resources of one or more types to allocate to each cloud application based on the cloud application's performance. Cloud application performance may change over time in response to changes in how the cloud application is used (e.g., what features are being leveraged), how frequently it is being relied upon by other applications and/or users, and the number of machines that are available for the cloud application to execute on.
Usually, changes in the environment cause minor shifts in the output labels. This is referred to as a shift in label distribution. “Labels” refer to not only the labels of training instances, but also to real-world results, irrespective of the output (predictions) of a machine-learned model. Input labels are labels that are part of the training data while output labels are actual labels as observed in historical results. For example, a machine-learned model is trained to predict whether an entity will perform a particular action in response to one or more events occurring. Also, about twenty entities typically perform the particular action each week, but only ten entities actually perform that action in the most recent week. Thus, there is a (output) label shift from twenty to ten. A shift in label distribution results in a decrease in the accuracy of the machine-learned model. For minor shifts in label distribution, a refresh of the machine-learned model is sufficient. A refresh involves generating new training instances based on recent data and retraining the machine-learned model based on the new training instances and older training instances.
However, for significant shifts in label distribution, model refreshment might not work well because a dramatic change in label distribution probably indicates a large change in feature weights or coefficients to derive the correct label from the feature set. Thus, the model learned from the historical data is likely to provide incorrect predictions. However, to completely rebuild the model, there is not sufficient recent data to generate new training instances, since most of the data was collected before the factor(s) that led to the significant label shift. Thus, refreshing the model may still result in inaccurate predictions on newly measured scoring data.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.
In the drawings:
In the following description, for the purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It will be apparent, however, that the present invention may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to avoid unnecessarily obscuring the present invention.
A system and method for automatically adjusting training data in response to a detection of shift in labels are provided. In one technique, historical data is automatically analyzed to generate and train a forecasting model. The forecasting model is used to predict an aggregate value of a particular metric. The predicted aggregate value is compared to an actual or observed aggregate value of the particular metric. If the difference between the two aggregate values is significant, then a shift in labels is detected and triggers an adjustment of training data upon which a machine-learned model was trained. In a related technique, the training data is divided based on segments and the training instances of different segments are adjusted differently. For example, the importance weights of instances in one segment may be adjusted positively while the importance weights of instances in another segment may be adjusted negatively.
Embodiments improve computer-related technology by automatically and, in a data-driven and scientific way, adjusting instances in training data to improve the accuracy of models in light of significant shifts in label distribution. Embodiments involve a novel model treatment system that comprises two main components, where model label shift detection provides directional guidance to model label shift adjustment, and model label shift adjustment is a follow-up step to model label shift detection.
Additionally, embodiments leverage forecasting models to auto-detect model label shift even though the main purpose of forecasting has been producing an accurate prediction in order to take a prompt action, such as weather forecasting and economic forecasting. In contrast, the forecasting model(s) described herein serve as a powerful and scientific tool to detect target label distribution shift. The forecasting results are just an intermediate step. Using the forecasting results to further identify the model label shift and adjust the model accordingly is a primary goal, which is inherently different from prior usages of forecasting.
Furthermore, embodiments leverage a novel segment-wise variation of model label shift adjustment. In contrast, existing model label shift adjustment approaches do not take the segmentation factor into consideration. However, in many real-world problems, the segmentation factor matters greatly, since the extent of label shift could vary significantly across different segments.
Machine-learned model 130 may be a binary classification model (that predicts whether a certain entity or event belongs to one of two classes), a multi-class classification model (that predicts whether a certain entity or event belongs to one of multiple classes), or another type of model, such as a regression model that outputs a continuous quantity, such as a specific dollar value for which a house is predicted to sell.
Historical scoring data set 140 comprises multiple scoring instances, each comprising a set of feature values that is input into machine-learned model 130. For each scoring instance, machine-learned model 130 computes a predicted label or a score reflecting a prediction, whether the predicted label is a classification label or a regression label. The predicted labels computed by machine-learned model 130 may be recorded in the appropriate scoring instances in historical scoring data set 140.
Observed labels may be automatically generated by one or more processes that determine whether a certain event or action occurred. In some scenarios, the observed label is generated based on what is not found in a data set. For example, if there is no record of a user responding to a notification within two days of receiving the notification, then an observed label indicating that the event did not occur is generated and recorded.
Observed labels are automatically associated with a scoring instance that was used to generate a predicted label. For example, if the event being predicted is a user action, then the user is associated with a scoring instance identifier or with a user identifier and a model identifier. Each scoring instance is associated with a scoring instance identifier or a combination of a model identifier and a user identifier. In this way, observed labels in historical results 150 are mapped to (or associated with) scoring instances in historical scoring data set 140.
At least some of the observed labels in historical results 150 may be for scoring instances that are not yet reflected in training instances in historical training data 110. In other words, at least a portion of historical results 150 may be newer or "fresher" data than training instances found in historical training data 110. For example, historical training data 110 may include observed labels that were generated between January and December of one year while historical results 150 may include observed labels that were generated between October of the same year and June of the following year. Alternatively, the observed labels in historical training data 110 may be a strict subset of the observed labels in historical results 150.
Label shift detector 160 analyzes historical results 150 to detect shift in the distribution of observed labels, as described in more detail below. Label shift detector 160 includes a forecast model generator 162 and a forecasting model 164 that forecast model generator 162 generates. Although only a single forecasting model is depicted, forecast model generator 162 may generate multiple forecasting models based on historical data, such as one for each segment of multiple segments.
If, after analyzing historical results 150, label shift detector 160 detects significant label shift, then label shift adjustor 170 (also described in more detail herein) adjusts or modifies importance weights of training instances in historical training data 110 to generate adjusted training data 112. Model trainer 120 trains a new model 132 based on adjusted training data 112. The new model 132 is applied to each scoring instance in future scoring data set 180 (for which labels are not yet known at the time of label shift detection and adjustment) in order to generate output labels or predictions therefor.
At block 210, label shift is detected (e.g., by label shift detector 160) based on historical results 150. Label shift may be detected using one or more forecasting models that are trained based on observed labels, some of which may be reflected in historical results 150.
At block 220, it is determined whether the label shift is significant. Label shift may be considered “significant” if an aggregate output value is outside a certain range of values or if a shift measure is above a particular threshold, for example, if an aggregate is outside a 95% confidence interval. If the determination is negative, then process 200 proceeds to block 230, where machine-learned model 130 is refreshed based on historical scoring data set 140 and historical results 150. If the determination is positive, then process 200 proceeds to block 240.
At block 240, it is determined whether there is segment-wise discrepancy. A segment is a grouping of one or more entities (e.g., people) that share one or more characteristics in common. A segment may be defined or influenced by a set of one or more values for a set of one or more features of machine-learned model 130. For example, if the feature that defines a segment is geography and there are five possible values for geography, then there are five segments, or groups of people that live in the corresponding geographic location. As another example, if the set of features that define a segment include industry and geography and there are five possible values for industry and two possible values for geography, then there are 2×5=10 segments, or groups of people, each group sharing a unique pair of industry-geography values in common.
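The segment count is simply the cross-product of the possible feature values. A short illustration (the feature values below are hypothetical):

```python
import itertools

industries = ["tech", "finance", "health", "retail", "energy"]
geographies = ["US", "EU"]

# Each segment is a unique industry-geography pair: 2 x 5 = 10 segments.
segments = list(itertools.product(industries, geographies))
```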
If the determination in block 240 is negative, then process 200 proceeds to block 250 where all (or most) training instances in historical training data 110 are adjusted or modified, regardless of segment. If the determination in block 240 is positive, then process 200 proceeds to block 260 where training instances in historical training data 110 are adjusted on a segment-wise basis. For example, training instances corresponding to one segment are adjusted a first amount while training instances corresponding to another segment are adjusted a second amount.
However, the ratio of values of observed labels typically varies significantly over time. Therefore, a simple comparison between two values will, in many cases, be insufficient to detect significant label shift.
Thus, in an embodiment, one or more forecasting models are trained based on a portion of historical training data 110 and/or a portion of historical results 150. The data upon which a forecasting model is trained is time series data comprising multiple data points, each corresponding to a different period of time and to an aggregate of observed labels (in historical training data 110 and/or historical results 150) that occurred in the corresponding period of time. For example, observed labels may be aggregated on a daily basis, a weekly basis, or a monthly basis. The aggregation may be a sum, such as a daily sum or a weekly sum, or an average/median value, such as a daily average on a weekly basis or a weekly average on a monthly basis. Thus, each data point in the time series data reflects an aggregate value.
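The aggregation step can be sketched with standard-library code; the example below produces weekly sums keyed by ISO week (the function name and data layout are illustrative, not part of the described embodiments):

```python
from collections import defaultdict
from datetime import date

def weekly_sums(labeled_events):
    """Aggregate observed labels into one weekly-sum data point per ISO
    week. labeled_events: iterable of (date, observed_label) pairs."""
    sums = defaultdict(float)
    for d, label in labeled_events:
        iso = d.isocalendar()
        sums[(iso[0], iso[1])] += label     # key: (ISO year, ISO week)
    return [total for _, total in sorted(sums.items())]
```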
The one or more forecasting models take into account historical (and presumably “natural”) trends reflected in changes in distribution of observed labels. The portions of historical training data 110 and/or historical results 150 upon which the forecasting models are trained reflect a period of time before a particular point of time, referred to herein as a “candidate shift point in time.”
A candidate shift point in time refers to a point in time that may correspond to a start in a potential shift in label distribution. A candidate shift point in time may be identified based on input from a user, such as a developer of machine-learned model 130 or a data scientist. For example, a user may guess, based on preliminary reports or data, that a significant shift in label distribution has begun. As another example, a user, reading news reports about a global event, may anticipate that machine-learned model 130 will start performing poorly. Additionally or alternatively, label shift detector 160 automatically identifies multiple candidate shift points in time. For example, each day in the past may act as a candidate shift point in time. Thus, label shift detector 160 may perform shift detection on a daily basis where, for each day it executes, label shift detector 160 uses a week before the current day as the candidate shift point in time.
Once a forecasting model is trained based on observed labels generated prior to the candidate shift point in time, the forecasting model is leveraged to produce a forecast or a prediction of one or more labels after the candidate shift point in time. Input into the forecasting model may be a number, indicating a number of forecasted values. For example, if data upon which the forecasting model is trained is a weekly sum over the last fourteen months, then an input value of three indicates that the forecasting model is to produce three forecasted values, each representing a weekly sum and one for each of three weeks after the candidate shift point in time.
If one or more of the forecast values are significantly different than the corresponding observed label(s), then label shift detector 160 determines that a significant shift occurred. A measure of significance may vary from one implementation to another. For example, if an observed value is greater than 20% different from a forecast value, then the shift is significant. A user (such as an administrator of model training system 100) may define the significance measure.
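A minimal sketch of such a threshold-based significance check (the 20% default mirrors the example above; the function name and signature are hypothetical):

```python
def significant_shift(observed, forecast, threshold=0.20):
    """Return True when the observed aggregate differs from the
    forecast by more than `threshold`, relative to the forecast."""
    if forecast == 0:
        return observed != 0
    return abs(observed - forecast) / abs(forecast) > threshold

flagged = significant_shift(54, 97)       # ~44% below the forecast -> True
not_flagged = significant_shift(100, 97)  # ~3% difference -> False
```

An administrator-defined significance measure would replace the `threshold` default.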
In a related embodiment, the measure of significance depends on how accurate the forecasting model is. For example, if the error of the forecasting model against historical data representing events that occurred prior to the candidate point in time is relatively small, then even a relatively small difference between an observed value and a forecast value may trigger the detection of a significant event. Conversely, if the error of the forecasting model is relatively large, then the difference between an observed value and a forecast value must be relatively large in order to trigger a detection of a significant event.
In an embodiment, label shift detector 160 performs label shift detection on a per-segment basis. A segment corresponds to a portion of scoring instances and/or training instances that share one or more feature values in common or that share other characteristics (that are related to one or more model features) in common. For example, if a training instance corresponds to a specific user, then one segment may be all users who live in North America and another segment may be all users who live in South America. However, the only possible values for the geography feature may be countries. Therefore, even though no scoring instance or training instance indicates North America as a geographic feature value, instances that indicate a country in North America are grouped together if there is a mapping between the country and North America. As another example, if a training instance corresponds to a software application, then one segment may be all applications that comprise two or more stateful operations, another segment may be all applications that comprise only one stateful operation, and another segment may be all applications that do not comprise any stateful operations. The one or more features are of the entity or event for which a prediction is being made, such as a user, a software application, an organization, a country, or a weather phenomenon. Example features for users and/or organizations include geography, industry, job function, employment status, seniority level, and job title.
In order to perform label shift detection on a per-segment basis, a forecasting model is generated for each segment. The data upon which each forecasting model is based is limited to observed labels that correspond to the segment that corresponds to the forecasting model. For example, all observed labels in historical results 150 corresponding to users in North America are analyzed to generate a time series of daily sums over a period of time. Such a time series of daily sums is used to train a forecasting model for the North America segment. Similarly, all observed labels in historical results 150 corresponding to users in South America are analyzed to generate a time series of daily sums over a (same) period of time. Such a time series of daily sums is used to train a forecasting model for the South America segment.
In an embodiment, label shift detector 160 (or another component of model training system 100) implements an exponential smoothing algorithm in order to generate a set of forecasting models. Each forecasting model in the set is a state space model and may be represented in a component form that includes three different components: error, trend, and seasonal. Each component has finite variations.
The error component has two possible variations: Additive (A) and Multiplicative (M). The trend component has five possible variations: None (N), Additive (A), Additive damped (Ad), Multiplicative (M) and Multiplicative damped (Md). The seasonal component has three possible variations: None (N), Additive (A) and Multiplicative (M). By considering the variations in the combinations of all three components, there are thirty possible forecasting models in total.
Notation ETS(⋅,⋅,⋅) may be used to denote the thirty possible models. This notation helps in remembering the order in which the components are specified; e.g., model ETS(A,Ad,M) denotes the model with additive errors, additive damped trend, and multiplicative seasonality. The thirty possible models share a general component form. The general component form involves a state vector xt=(lt, bt, st, st−1, . . . , st−m+1) and state space equations of the form
y t =w(x t−1)+r(x t−1)εt,
x t =f(x t−1)+g(x t−1)εt,
where y1, y2, . . . , yt are observed time series data; {εt} are independent and identically distributed Gaussian variables with mean 0 and variance σ2; lt denotes the level of the series at time t; bt denotes the slope (or growth) of the series at time t; st, st−1, . . . , st−m are seasonal components; and m is the length of seasonality. The state vector xt is unknown, the initial state x0=(l0, b0, s0, s−1, . . . , s−m+1) is considered as an unknown parameter of the model, and state vector xt is estimated through the state space equations. The formulation of w(⋅), r(⋅), f(⋅) and g(⋅) depends on the component variations. The simplest model in exponential smoothing methods is Simple Exponential Smoothing, ETS(A,N,N). The component form of the model is
y t =l t−1+εt,
l t =l t−1+αεt,
where w(xt−1)=f(xt−1)=lt−1, r(xt−1)=1, g(xt−1)=α, and α is an unknown parameter.
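The ETS(A,N,N) recursion above can be sketched directly: the level is updated as lt=lt−1+αεt with εt=yt−lt−1, and every h-step-ahead forecast equals the final level. In the illustrative code below, α and the initial level are supplied directly, whereas in practice they would be estimated by maximum likelihood:

```python
def ses_forecast(ys, alpha, level0, horizon):
    """Simple exponential smoothing, ETS(A,N,N).

    Applies l_t = l_{t-1} + alpha * (y_t - l_{t-1}) over the observed
    series, then returns `horizon` flat forecasts at the final level.
    """
    level = level0
    for y in ys:
        level += alpha * (y - level)
    return [level] * horizon

fc = ses_forecast([10.0, 12.0, 11.0], alpha=0.5, level0=10.0, horizon=3)
# fc == [11.0, 11.0, 11.0]
```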
Once the model is specified, the likelihood of the state space model is relatively straightforward to compute and the maximum likelihood estimates of the model parameters may be obtained.
After all or a subset of the thirty models are generated, a model is selected by minimizing one or more selection criteria. Examples of selection criteria include AIC (Akaike's Information Criterion), AICc (AIC corrected for small sample bias), and BIC (Bayesian Information Criterion). Given a collection of models, each selection criterion estimates the quality of each model, relative to each of the other models.
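Selection by a criterion such as AIC (2k−2 ln L, where k is the number of parameters and L the maximized likelihood) may be sketched as follows; the log-likelihood values below are hypothetical placeholders for fitted models, not results from the disclosure:

```python
def aic(log_likelihood, n_params):
    """Akaike's Information Criterion: 2k - 2*ln(L)."""
    return 2 * n_params - 2 * log_likelihood

def select_model(candidates):
    """candidates: {name: (log_likelihood, n_params)}. Lower AIC wins."""
    return min(candidates, key=lambda name: aic(*candidates[name]))

fitted = {
    "ETS(A,N,N)": (-120.0, 2),   # hypothetical fit results
    "ETS(M,N,M)": (-100.0, 14),
}
best = select_model(fitted)  # -> "ETS(M,N,M)" (AIC 228 vs 244)
```

AICc and BIC would substitute their own penalty terms into the same selection loop.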
One attribute of some forecasting models (such as ETS models) is the ability to compute a confidence interval for each forecasted value. The confidence interval may increase for subsequent (in time) forecast values. FIG. 3 is an example data plot 300 that depicts, along with confidence intervals, time series data (specifically, aggregated statistics over time), where some of the time series data pertain to points in time that are prior to a candidate shift point in time, other of the time series data are forecasted values (that are after the candidate shift point in time) (i.e., line 305), and other of the time series data (i.e., the point below outer shaded region 320) are based on observed labels and also pertain to points in time that are after the candidate shift point in time.
The forecasting model that generated the forecast values in data plot 300 is denoted as ETS(M,N,M). The x-axis is time and is divided into months, while the y-axis is an aggregated statistic that represents a number of events that occurred in a monthly period. While this forecasting model may have been generated on monthly data, the forecasting model may have instead been generated on weekly or daily data. However, averaging the events on a monthly basis removes significant variation present in such finer-granularity data and reduces the effect of outliers, which, if used to train the forecasting model, might make the forecasting model relatively inaccurate, widening the confidence intervals and, therefore, reducing the ability to detect significant label shift.
In this depicted example, the candidate shift point in time is February 2020 and there are three forecast values (making up line 305): one for February of 2020, one for March of 2020, and one for April of 2020. Data plot 300 also shows two aggregated statistics, each of which is based on observed labels that pertain to events associated with February of 2020 (i.e., in inner shaded region 310) or March of 2020 (i.e., below outer shaded region 320).
As partially noted, data plot 300 depicts three shaded regions beginning with the candidate shift point in time. The inner shaded region 310 indicates a confidence interval of 80%, indicating that, statistically speaking, the forecasting model is 80% confident that an observed (e.g., aggregated) value will fall within inner shaded region 310. The outer shaded regions 320 indicate a confidence level of 95%, indicating that, statistically speaking, the forecasting model is 95% confident that an observed (e.g., aggregated) value will fall within outer shaded regions 320 or inner shaded region 310.
In an embodiment, if an aggregated statistic based on observed labels falls outside a particular confidence interval (e.g., 95%), then label shift detector 160 determines that there is significant label shift, which triggers label shift adjustor 170. In the example of data plot 300, the second aggregated statistic (corresponding to March of 2020) after the candidate shift point in time is outside outer shaded regions 320, indicating that the second aggregate statistic represents significant label shift, or an anomaly.
In a related embodiment, if multiple (e.g., consecutive) aggregated statistics based on observed labels fall outside one or more confidence levels, then label shift detector 160 determines that there is significant label shift. For example, not one of the aggregated statistics falls outside a larger confidence interval (e.g., 95%), but two consecutive aggregated statistics fall outside a smaller (though still relatively large) confidence interval (e.g., 80%). In such a scenario, label shift adjustor 170 may be triggered.
Also, which side of the forecast value an aggregated value may fall on (e.g., greater than or less than the forecast value) may dictate whether any label shift adjustment should be made. For example, if an aggregated value is outside a particular confidence interval and is greater than a corresponding forecast value, then no label shift adjustment is triggered. On the other hand, if an aggregated value is outside a particular confidence interval and is less than a corresponding forecast value, then label shift adjustment is triggered.
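Combining the interval check with the one-sided rule above (the function name and interval bounds are illustrative assumptions):

```python
def triggers_adjustment(observed, forecast, lower, upper):
    """Trigger label shift adjustment only when the observed aggregate
    falls outside the confidence interval AND is below the forecast;
    values above the forecast are ignored per the one-sided variant."""
    outside = observed < lower or observed > upper
    return outside and observed < forecast

below = triggers_adjustment(54, 97, lower=80, upper=115)   # -> True
above = triggers_adjustment(120, 97, lower=80, upper=115)  # -> False
```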
In an embodiment, a forecaster 162 generates a different forecasting model for each segment of multiple segments. In the example of ETS models, the forecasting model for one segment may have different ETS components than the forecasting model for another segment. For example, a forecasting model for a first segment may be denoted as ETS(M,N,M) while a forecasting model for a second segment may be denoted as ETS(A,N,N). In other words, thirty possible forecasting models are generated for the first segment (based on the training instances that correspond to the first segment) and the forecasting model denoted as ETS(M,N,M) is ultimately selected for the first segment based on the described selection criteria. Meanwhile thirty possible forecasting models are generated for the second segment (based on the training instances that correspond to the second segment) and the forecasting model denoted as ETS(A,N,N) is ultimately selected for the second segment based on the same selection criteria.
In an embodiment, label shift adjustor 170 adjusts training instances in historical training data 110 in response to label shift detector 160 detecting large or significant label shift in at least a portion of historical results 150. “Adjusting” or modifying a training instance may involve modifying an importance weight of the training instance or modifying a label of the training instance. An importance weight of a training instance indicates how much coefficients or weights of features are adjusted during training of a machine-learned model based on the training instance. The higher the importance weight, the greater the adjustment of the coefficients or weights of the features of the model. Conversely, the lower the importance weight, the lesser the adjustment of the coefficients or weights of the features of the model.
In an embodiment where labels are modified, only non-zero labels may be modified. For example, if a positive label is 1, then a new value for the positive label is 1*w, where 0&lt;w&lt;1. If a negative label is 0, then the negative label remains unmodified. Alternatively, the negative label may be modified to become a negative number.
There are multiple ways to adjust or modify training instances in historical training data 110. For example, a ratio of (1) an aggregated statistic that is based on observed labels that were generated after the candidate shift point in time to (2) a forecast value that corresponds to the same time period as the aggregated statistic is computed and applied to importance weights in the training instances. As a specific example, if 54 is the aggregated statistic and the forecast value is 97, then the importance weight of each training instance is assigned the value of 54/97. However, such an adjustment is not statistically or mathematically sound.
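The simple ratio adjustment described above (which, as noted, is not statistically sound) may be sketched as follows; instance dictionaries and names are hypothetical:

```python
def naive_ratio_weights(instances, observed_agg, forecast_agg):
    """Assign every training instance the same importance weight,
    observed/forecast, regardless of segment or label."""
    w = observed_agg / forecast_agg
    return [dict(inst, weight=w) for inst in instances]

adjusted = naive_ratio_weights([{"id": 1}, {"id": 2}], 54, 97)
# every instance gets importance weight 54/97
```

The segment-aware BBSE-based adjustment described below replaces this single global ratio with per-segment, per-class weights.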
The following mathematical formulas and notations are used to formulate the label shift problem. X is the feature vector and Y is the label, where X and Y have a joint distribution p(X, Y) in the historical data set and q(X, Y) is the joint distribution in the scoring data set (e.g., future scoring data set 180), and l is a loss function defined as l:Y×Y→R+. l is a loss function that takes its input from a two-dimensional space Y×Y, and its output is in a one-dimensional space R+ (i.e., non-negative real number space). One example of the loss function l is l(ƒ(X), Y)=(ƒ(X)−Y)2, where l takes two values ƒ(X) and Y from the two-dimensional space Y×Y as the input and produces a non-negative real number (ƒ(X)−Y)2 as the output, where ƒ(X) stands for the predicted label via model ƒ.
The objective of predictive modeling is to learn a model ƒ:X→Y that minimizes EX, Y ˜p l(ƒ(X), Y), where EX, Y ˜p l(ƒ(X), Y) is the expectation of the loss function (ƒ(X), Y), given X and Y subject to a joint distribution p. E stands for “expectation” and ˜ stands for “subject to a distribution.” In the ideal case where p(X, Y)=q(X, Y), the pre-trained model ƒ is still valid in the scoring data set (e.g., future scoring data set 180). However, the pre-trained model ƒ is not valid in the scenario when the label shift issue exists. As detected by one or more forecasting models, it is possible that the proportion of positive labels could be much larger in the historical dataset than in the scoring data set, which leads to the potential logic change of the label derivation from the features. In other words, p(Y|X)!=q(Y|X), which leads to p(X, Y)!=q(X, Y). In this case, the optimal model {tilde over (ƒ)} for the scoring data set is the minimizer of EX, Y ˜q l({tilde over (ƒ)}(X), Y), which should be different from the model ƒ learned from the historical dataset.
A challenge in minimizing EX, Y ˜q l({tilde over (ƒ)}(X), Y) is the lack of information on the distribution of Y in the scoring data set, since there are no observations of Y in practice. However, observations from the historical data can be leveraged to help estimate the distribution of Y in the scoring data set, and use the following formula:
E X,Y˜q l({tilde over (ƒ)}(X),Y)=E X,Y ˜p (q(X,Y)/p(X,Y))l({tilde over (ƒ)}(X),Y) (1)
to obtain the optimal model {tilde over (ƒ)} for the scoring dataset.
There are systematic and mathematical techniques that may be used to determine how much the importance weights in training instances should be adjusted or modified. Black-Box Shift Estimation (BBSE) is one such technique. A key assumption in BBSE is called the label shift assumption: p(X|Y)=q(X|Y). This implies that the logic of feature derivation from the labels is consistent between the historical dataset and the scoring dataset. While this assumption looks reasonable in many use cases, there is a potential drawback: the label shift assumption may not hold globally across all segments, where each training instance is assigned to one of multiple segments. Indeed, the change of label distribution might vary significantly across different segments. For example, in some geographic regions, in response to a significant global change, people's behavior may change significantly while people's behavior in other geographic regions might not change significantly.
In an embodiment, the BBSE approach is extended to account for different segments. This segment-wise label shift assumption (i.e., that different segments may shift differently) is expressed using the following mathematical expressions. Assume the feature vector X=(Xc, Xs), where Xs stands for the one or more features that correspond to a segment (e.g., geographic region, or geographic region and industry) and takes values from a discrete set S={1, . . . , s}, and where Xc stands for the remaining features in machine-learned model 130. The label shift assumption under segmentation is: p(Xc|Y, Xs)=q(Xc|Y, Xs). Plugging this assumption into formula (1) leads to
E X,Y˜q l({tilde over (ƒ)}(X),Y)=E X,Y ˜p (q(X,Y)/p(X,Y))l({tilde over (ƒ)}(X),Y)=E X,Y ˜p (q(X s ,Y)/p(X s ,Y))l({tilde over (ƒ)}(X),Y)=E X,Y ˜p [(q(Y|X s)q(X s))/(p(Y|X s)p(X s))]l({tilde over (ƒ)}(X),Y) (2)
and the key to obtaining the optimal model {tilde over (ƒ)} is to estimate ws(Y):=q(Y|Xs)/p(Y|Xs). From q(ƒ(X)|Xs), the following may be derived (assuming Y takes values from a discrete set K={1, . . . , k}, which means this is a multi-class classification problem with k classes):
q(ƒ(X)|X s)=ΣY∈K q(ƒ(X)|Y,X s)q(Y|X s)
=ΣY∈K p(ƒ(X)|Y,X s)q(Y|X s)
=ΣY∈K p(ƒ(X),Y|X s)w s(Y).
Also, we denote qs(ƒ(X)):=q(ƒ(X)|Xs) and Cp,s(ƒ(X), Y):=[p(ƒ(X)=i, Y=j|Xs)]k×k; then ws(Y)=Cp,s(ƒ(X), Y)−1qs(ƒ(X)). Note that Cp,s(ƒ(X), Y) is a confusion matrix (of size k×k) of model ƒ under distribution p within segment s, and qs(ƒ(X)) is a k-dimensional vector and is the predicted label distribution of model ƒ under distribution q within segment s. The value of k indicates the number of classifications predicted from machine-learned model 130. Thus, if machine-learned model 130 is a binary classification model, then k=2. A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning one. Each row of the matrix represents the instances in a predicted class while each column represents the instances in an actual class (or vice versa). The name “confusion matrix” stems from the fact that the matrix makes it easy to see if a system is confusing two classes (i.e., commonly mislabeling one as another). Although typically used for visualization, a confusion matrix is used here for calculating ws(Y), where ws(Y)=Cp,s(ƒ(X), Y)−1qs(ƒ(X)).
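For the binary case (k=2), ws(Y)=Cp,s(ƒ(X), Y)−1qs(ƒ(X)) reduces to inverting a 2×2 matrix. The sketch below uses hypothetical confusion-matrix entries and a hypothetical predicted-label distribution:

```python
def bbse_weights_2class(confusion, q_pred):
    """Solve w = C^{-1} q for k = 2.

    confusion: 2x2 list where confusion[i][j] = p(f(X)=i, Y=j) within
    the segment; q_pred: predicted-label distribution under q.
    """
    (a, b), (c, d) = confusion
    det = a * d - b * c  # confusion matrix assumed non-singular
    inv = [[d / det, -b / det], [-c / det, a / det]]
    return [inv[i][0] * q_pred[0] + inv[i][1] * q_pred[1] for i in (0, 1)]

# Hypothetical per-segment estimates (rows: predicted class, cols: actual).
w = bbse_weights_2class([[0.6, 0.1], [0.1, 0.2]], [0.5, 0.5])
```

As a sanity check, multiplying the confusion matrix by the returned weights recovers q_pred, which is exactly the summation identity derived above.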
It is assumed that in segment s there are (a) ns samples in the historical dataset {(x1,s, y1,s), . . . , (xn s ,s, yn s ,s)} drawn from p(X, Y) and (b) ms samples in the scoring dataset {x1,s′, . . . , xm s ,s′} drawn from q(X). Then Ĉp,s(ƒ(X),Y)=[Σl=1 n s 1{ƒ(xl,s)=i & yl,s=j}/ns]k×k, and {circumflex over (q)}s(ƒ(X))=[Σl=1 m s 1{ƒ(xl,s′)=i}/ms]k×1. Thus ŵs(Y)=Ĉp,s(ƒ(X),Y)−1{circumflex over (q)}s(ƒ(X)). The symbol ‘{circumflex over ( )}’ stands for “the estimate.” For example, Z is an unknown variable and {circumflex over (Z)} is an estimate of Z based on some observed samples. Finally, model {tilde over (ƒ)} is obtained by minimizing a weighted sum of loss functions according to formula (2), where {circumflex over (p)}(Xs) and {circumflex over (q)}(Xs) are estimated as the proportion of segment s in the historical dataset and in the scoring dataset, respectively.
Therefore, to compute an amount to adjust a training instance (in historical training data 110) that is associated with segment s, the following are inputs to label shift adjustor 170: 1) historical training data 110; 2) a validation/testing data set in each segment s (s=1, . . . , S): {(x1,s, y1,s), . . . , (xn s ,s, yn s ,s)}, yi,s∈{1, . . . , k}; and 3) a scoring dataset in each segment s (s=1, . . . , S): {x1,s′, . . . , xm s ,s′}.
At block 410, a machine-learned model (e.g., machine-learned model 130) is trained using one or more machine learning techniques based on training data (e.g., historical training data 110).
At block 420, a segment from a set of segments is selected. Initially, at the first iteration of block 420, the set of segments may include all possible segments. For example, if the segments are defined based on the geography feature and there are five possible values for the geography feature, then there are initially five segments at the beginning of process 400.
At block 430, a k×k confusion matrix Ĉs is generated where [Ĉs]ij=Σl=1 n s 1{ƒ(xl,s)=i & yl,s=j}/ns.
At block 440, a k-dimensional predicted label distribution vector {circumflex over (q)}s is generated where [{circumflex over (q)}s]i=Σl=1 m s 1{ƒ(xl,s′)=i}/ms. {circumflex over (q)}s is an estimate of the predicted label distribution qs.
At block 450, a k-dimensional weight vector ŵs=Ĉs −1·{circumflex over (q)}s is generated. ŵs is the estimated weights for k classes (and each dimension corresponds to one class) applied on the training instances within segment s.
At block 460, a proportion of the selected segment s in the validation/testing dataset, {circumflex over (p)}(s), and in the scoring dataset, {circumflex over (q)}(s), is estimated. {circumflex over (p)}(s) is the proportion of instances within segment s in the validation/testing dataset (i.e., ns/Σl=1 S nl, where ns is the number of instances within segment s in the validation/testing dataset and Σl=1 S nl is the total number of instances in the validation/testing dataset). {circumflex over (q)}(s) is the proportion of instances within segment s in the scoring dataset (i.e., ms/Σl=1 S ml, where ms is the number of instances within segment s in the scoring dataset and Σl=1 S ml is the total number of instances in the scoring dataset). It is not recommended that {circumflex over (p)}(s) be estimated from the training dataset because the weight vector ŵs is estimated from the validation/testing dataset and the estimation process should be consistent.
At block 470, training instances in the training data that correspond to the selected segment s are adjusted by ŵs·({circumflex over (q)}(s)/{circumflex over (p)}(s)). For example, a portion of historical training data 110 that corresponds to the selected segment is modified by the product ŵs·({circumflex over (q)}(s)/{circumflex over (p)}(s)). Such modification may involve multiplying an importance weight of each training instance associated with the selected segment s by the above product.
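Block 470 can be sketched as follows. Labels are assumed here to be 0-indexed (class v maps to index v of the weight vector), and the instance dictionaries and names are hypothetical:

```python
def adjust_segment(instances, w_s, q_hat_s, p_hat_s):
    """Scale each instance's importance weight by w_s[label] * (q/p)
    for the selected segment s (blocks 450-470)."""
    factor = q_hat_s / p_hat_s
    out = []
    for inst in instances:
        scaled = inst.get("weight", 1.0) * w_s[inst["label"]] * factor
        out.append(dict(inst, weight=scaled))
    return out

segment = [{"label": 0}, {"label": 1}]
adjusted = adjust_segment(segment, w_s=[0.5, 2.0], q_hat_s=0.3, p_hat_s=0.2)
# importance weights become approximately [0.75, 3.0]
```

Repeating this per segment, then retraining on the reweighted data, corresponds to the loop through blocks 420-490.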
Thus, all training instances in segment s may be weighted according to the k-dimensional vector ŵs·({circumflex over (q)}(s)/{circumflex over (p)}(s)). For example, each training instance in segment s with its label Y taking value v (v is one of the k values in the label set {1, . . . , k}) will be assigned the weight of the v-th element in the k-dimensional vector ŵs·({circumflex over (q)}(s)/{circumflex over (p)}(s)).
At block 480, it is determined whether there are any more segments that have not yet been selected. If so, then process 400 proceeds to block 420 where another segment is selected. If the determination in block 480 is negative, then process 400 proceeds to block 490. When process 400 proceeds to block 490, all (or potentially fewer than all) training instances have been modified.
At block 490, a new model is trained based on the adjusted or modified training data. For example, model trainer 120 trains new model 132 based on adjusted training data 112. The new model may have the same set of features as the machine-learned model in block 410 or may have a different set of features as the machine-learned model. For example, some features may have been added or removed to the set of features upon which machine-learned model 130 was trained. Scoring instances from future scoring data set 180 may then be input into new model 132 to generate a score or prediction for each.
According to one embodiment, the techniques described herein are implemented by one or more special-purpose computing devices. The special-purpose computing devices may be hard-wired to perform the techniques, or may include digital electronic devices such as one or more application-specific integrated circuits (ASICs) or field programmable gate arrays (FPGAs) that are persistently programmed to perform the techniques, or may include one or more general purpose hardware processors programmed to perform the techniques pursuant to program instructions in firmware, memory, other storage, or a combination. Such special-purpose computing devices may also combine custom hard-wired logic, ASICs, or FPGAs with custom programming to accomplish the techniques. The special-purpose computing devices may be desktop computer systems, portable computer systems, handheld devices, networking devices or any other device that incorporates hard-wired and/or program logic to implement the techniques.
For example, FIG. 5 is a block diagram that illustrates a computer system 500 upon which an embodiment of the invention may be implemented. Computer system 500 includes a bus 502 or other communication mechanism for communicating information, and a hardware processor 504 coupled with bus 502 for processing information. Hardware processor 504 may be, for example, a general purpose microprocessor.
The term “storage media” as used herein refers to any non-transitory media that store data and/or instructions that cause a machine to operate in a specific fashion. Such storage media may comprise non-volatile media and/or volatile media. Non-volatile media includes, for example, optical disks, magnetic disks, or solid-state drives, such as storage device 510. Volatile media includes dynamic memory, such as main memory 506. Common forms of storage media include, for example, a floppy disk, a flexible disk, hard disk, solid-state drive, magnetic tape, or any other magnetic data storage medium, a CD-ROM, any other optical data storage medium, any physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, NVRAM, or any other memory chip or cartridge.
Storage media is distinct from but may be used in conjunction with transmission media. Transmission media participates in transferring information between storage media. For example, transmission media includes coaxial cables, copper wire and fiber optics, including the wires that comprise bus 502. Transmission media can also take the form of acoustic or light waves, such as those generated during radio-wave and infra-red data communications.
Various forms of media may be involved in carrying one or more sequences of one or more instructions to processor 504 for execution. For example, the instructions may initially be carried on a magnetic disk or solid-state drive of a remote computer. The remote computer can load the instructions into its dynamic memory and send the instructions over a telephone line using a modem. A modem local to computer system 500 can receive the data on the telephone line and use an infra-red transmitter to convert the data to an infra-red signal. An infra-red detector can receive the data carried in the infra-red signal and appropriate circuitry can place the data on bus 502. Bus 502 carries the data to main memory 506, from which processor 504 retrieves and executes the instructions. The instructions received by main memory 506 may optionally be stored on storage device 510 either before or after execution by processor 504.
Network link 520 typically provides data communication through one or more networks to other data devices. For example, network link 520 may provide a connection through local network 522 to a host computer 524 or to data equipment operated by an Internet Service Provider (ISP) 526. ISP 526 in turn provides data communication services through the world wide packet data communication network now commonly referred to as the “Internet” 528. Local network 522 and Internet 528 both use electrical, electromagnetic or optical signals that carry digital data streams. The signals through the various networks and the signals on network link 520 and through communication interface 518, which carry the digital data to and from computer system 500, are example forms of transmission media.
The received code may be executed by processor 504 as it is received, and/or stored in storage device 510, or other non-volatile storage for later execution.
In the foregoing specification, embodiments of the invention have been described with reference to numerous specific details that may vary from implementation to implementation. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense. The sole and exclusive indicator of the scope of the invention, and what is intended by the applicants to be the scope of the invention, is the literal and equivalent scope of the set of claims that issue from this application, in the specific form in which such claims issue, including any subsequent correction.
Claims (18)
1. A method comprising:
for each scoring instance in multiple scoring instances in a scoring data set, using a first machine-learned model to generate a predicted label for said each scoring instance, wherein the first machine-learned model is trained using one or more machine learning techniques based on a plurality of training instances, each of which includes an observed label;
generating a forecasting model based on time-series data that is based on first observed label data that corresponds to a first period of time;
using the forecasting model to generate a forecast;
performing a comparison between the forecast and second observed data that corresponds to a second period of time that is subsequent to the first period of time;
detecting a shift in observed labels based on the comparison;
in response to detecting the shift in observed labels, for each segment of one or more segments in a plurality of segments:
identifying a portion of training data that corresponds to said each segment;
for each training instance in a subset of the portion of training data:
adjusting said each training instance to generate an adjusted training instance;
adding the adjusted training instance to a final set of training data;
using the one or more machine learning techniques to train a second machine-learned model based on the final set of training data;
wherein the method is performed by one or more computing devices.
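The pipeline recited in claim 1 can be sketched compactly. The sketch below is an illustrative assumption, not the patented implementation: it forecasts the observed positive-label rate with a simple linear trend, flags a label shift when newly observed rates leave the forecast's residual band (cf. the confidence interval of claim 3), and reweights training instances by the ratio of new to old label rates before retraining. All function names and the trend forecaster are hypothetical.

```python
import numpy as np

def fit_forecaster(history):
    """Fit a linear trend to a 1-D series of observed label rates."""
    history = np.asarray(history, dtype=float)
    t = np.arange(len(history))
    slope, intercept = np.polyfit(t, history, 1)
    resid_std = np.std(history - (slope * t + intercept))
    return slope, intercept, resid_std

def detect_label_shift(history, observed, z=3.0):
    """Return True if any newly observed rate falls outside the forecast band."""
    slope, intercept, resid_std = fit_forecaster(history)
    t_future = np.arange(len(history), len(history) + len(observed))
    forecast = slope * t_future + intercept
    band = z * max(resid_std, 1e-9)
    return bool(np.any(np.abs(np.asarray(observed) - forecast) > band))

def reweight_training_instances(labels, old_rate, new_rate):
    """Importance weights p_new(y) / p_old(y) for binary labels."""
    labels = np.asarray(labels)
    w_pos = new_rate / old_rate
    w_neg = (1 - new_rate) / (1 - old_rate)
    return np.where(labels == 1, w_pos, w_neg)

# First period: stable positive-label rate around 0.30; second period jumps to ~0.60.
history = [0.30, 0.31, 0.29, 0.30, 0.32, 0.30]
observed = [0.58, 0.61, 0.60]
shifted = detect_label_shift(history, observed)
weights = reweight_training_instances([1, 0, 1, 0], old_rate=0.30, new_rate=0.60)
```

The reweighted instances would then form the final training set used to train the second machine-learned model.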
2. The method of claim 1, wherein adjusting said each training instance to generate the adjusted training instance comprises modifying an importance weight of said each training instance.
3. The method of claim 1, further comprising:
determining a confidence interval for the forecast using the forecasting model;
wherein performing the comparison comprises determining whether a portion of the second observed data is outside the confidence interval;
wherein detecting the shift in observed labels is based, at least in part, on determining that the portion of the second observed data is outside the confidence interval.
4. The method of claim 1, wherein:
generating the forecasting model comprises generating a plurality of forecasting models based on a plurality of time series data, wherein each forecasting model in the plurality of forecasting models is based on different time series data in the plurality of time series data, wherein each time series data in the plurality of time series data corresponds to a different segment of the plurality of segments;
using the forecasting model to generate the forecast comprises using the plurality of forecasting models to generate a plurality of forecasts;
performing the comparison comprises performing a plurality of comparisons, each between a different forecast of the plurality of forecasts and time series data of the plurality of time series data;
detecting the shift in observed labels comprises detecting shift in observed labels of a first segment of the plurality of segments and detecting no shift in observed labels of a second segment of the plurality of segments.
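The per-segment detection of claim 4 can be illustrated with one forecaster per segment, so shift is flagged in one segment while another stays stable. The segment names and the mean-plus-band forecaster below are assumptions for illustration, not taken from the patent.

```python
import numpy as np

def segment_shift(history, observed, z=3.0):
    """Flag shift when observed rates leave a z-sigma band around the historical mean."""
    mu, sigma = np.mean(history), np.std(history)
    band = z * max(sigma, 1e-9)
    return bool(np.any(np.abs(np.asarray(observed) - mu) > band))

def detect_shifted_segments(series_by_segment):
    """Map each segment to a shift flag using that segment's own time series."""
    return {
        seg: segment_shift(hist, obs)
        for seg, (hist, obs) in series_by_segment.items()
    }

series_by_segment = {
    # segment: (first-period label rates, second-period label rates)
    "region_a": ([0.20, 0.21, 0.19, 0.20], [0.45, 0.47]),  # shifted
    "region_b": ([0.50, 0.51, 0.49, 0.50], [0.50, 0.51]),  # stable
}
flags = detect_shifted_segments(series_by_segment)
```

Only the training data for segments flagged True would then be adjusted and retrained.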
5. The method of claim 4, wherein generating the forecasting model comprises:
generating a plurality of forecasting models;
selecting the forecasting model from among the plurality of forecasting models based on accuracy of each forecasting model, in the plurality of forecasting models, relative to the time series data.
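The selection step of claim 5 can be sketched as a holdout comparison: fit several candidate forecasters on the head of the series and keep the one with the lowest error on the held-out tail. The two candidates here (a constant mean and a linear trend) are illustrative assumptions.

```python
import numpy as np

def mean_forecaster(train, horizon):
    # Forecast the historical mean for every future step.
    return np.full(horizon, np.mean(train))

def trend_forecaster(train, horizon):
    # Extrapolate a fitted linear trend.
    t = np.arange(len(train))
    slope, intercept = np.polyfit(t, train, 1)
    t_future = np.arange(len(train), len(train) + horizon)
    return slope * t_future + intercept

def select_forecaster(series, candidates, holdout=3):
    """Return the name of the candidate with lowest MSE on the held-out tail."""
    train, test = series[:-holdout], series[-holdout:]
    errors = {
        name: float(np.mean((fn(train, holdout) - test) ** 2))
        for name, fn in candidates.items()
    }
    return min(errors, key=errors.get)

candidates = {"mean": mean_forecaster, "trend": trend_forecaster}
best = select_forecaster(np.arange(10, dtype=float), candidates)  # strongly trending series
```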
6. The method of claim 1, further comprising:
identifying a particular point in time;
identifying the first observed label data based on the particular point in time, wherein data within the first observed label data is associated with a time that is before the particular point in time;
identifying the second observed data based on the particular point in time, wherein data within the second observed data is associated with a time that is after the particular point in time.
7. The method of claim 6, wherein the particular point in time is specified in user input or is automatically determined not based on user input.
8. The method of claim 1, wherein adjusting said each training instance comprises using a black-box shift estimation technique.
9. The method of claim 1, wherein the plurality of segments are based on one or more of geographic region, industry, employment status, job function, seniority level, or job title.
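The black-box shift estimation technique referenced in claim 8 (Lipton et al., listed in the non-patent citations) can be sketched as follows; the function name and toy data are assumptions for illustration. The estimator recovers importance weights w(y) = p_target(y) / p_source(y) from a classifier's hard predictions alone, by solving C w = mu_hat, where C is the classifier's confusion matrix on held-out source data and mu_hat is its predicted-label distribution on the unlabeled target data.

```python
import numpy as np

def bbse_weights(y_val, y_val_pred, y_target_pred, n_classes):
    # C[i, j] = P(predicted class i, true class j) on held-out source data.
    C = np.zeros((n_classes, n_classes))
    for pred, true in zip(y_val_pred, y_val):
        C[pred, true] += 1.0
    C /= len(y_val)
    # mu_hat[i] = P(predicted class i) on the unlabeled target data.
    mu_hat = np.bincount(y_target_pred, minlength=n_classes) / len(y_target_pred)
    # Solve C w = mu_hat; clip because estimation noise can yield
    # small negative entries.
    w = np.linalg.solve(C, mu_hat)
    return np.clip(w, 0.0, None)

# Toy example with a perfect classifier: source data is 75% class 0,
# while target predictions suggest 75% class 1.
y_val = np.array([0, 0, 0, 1])
y_target_pred = np.array([0, 1, 1, 1])
w = bbse_weights(y_val, y_val, y_target_pred, n_classes=2)
```

Each training instance with label y would then receive importance weight w[y], matching the weight adjustment of claim 2.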
10. One or more storage media storing instructions which, when executed by one or more processors, cause:
for each scoring instance in multiple scoring instances in a scoring data set, using a first machine-learned model to generate a predicted label for said each scoring instance, wherein the first machine-learned model is trained using one or more machine learning techniques based on a plurality of training instances, each of which includes an observed label;
generating a forecasting model based on time-series data that is based on first observed label data that corresponds to a first period of time;
using the forecasting model to generate a forecast;
performing a comparison between the forecast and second observed data that corresponds to a second period of time that is subsequent to the first period of time;
detecting a shift in observed labels based on the comparison;
in response to detecting the shift in observed labels, for each segment of one or more segments in a plurality of segments:
identifying a portion of training data that corresponds to said each segment;
for each training instance in a subset of the portion of training data:
adjusting said each training instance to generate an adjusted training instance;
adding the adjusted training instance to a final set of training data;
using the one or more machine learning techniques to train a second machine-learned model based on the final set of training data.
11. The one or more storage media of claim 10, wherein adjusting said each training instance to generate the adjusted training instance comprises modifying an importance weight of said each training instance.
12. The one or more storage media of claim 10, wherein the instructions, when executed by the one or more processors, further cause:
determining a confidence interval for the forecast using the forecasting model;
wherein performing the comparison comprises determining whether a portion of the second observed data is outside the confidence interval;
wherein detecting the shift in observed labels is based, at least in part, on determining that the portion of the second observed data is outside the confidence interval.
13. The one or more storage media of claim 10, wherein:
generating the forecasting model comprises generating a plurality of forecasting models based on a plurality of time series data, wherein each forecasting model in the plurality of forecasting models is based on different time series data in the plurality of time series data, wherein each time series data in the plurality of time series data corresponds to a different segment of the plurality of segments;
using the forecasting model to generate the forecast comprises using the plurality of forecasting models to generate a plurality of forecasts;
performing the comparison comprises performing a plurality of comparisons, each between a different forecast of the plurality of forecasts and time series data of the plurality of time series data;
detecting the shift in observed labels comprises detecting shift in observed labels of a first segment of the plurality of segments and detecting no shift in observed labels of a second segment of the plurality of segments.
14. The one or more storage media of claim 13, wherein generating the forecasting model comprises:
generating a plurality of forecasting models;
selecting the forecasting model from among the plurality of forecasting models based on accuracy of each forecasting model, in the plurality of forecasting models, relative to the time series data.
15. The one or more storage media of claim 10, wherein the instructions, when executed by the one or more processors, further cause:
identifying a particular point in time;
identifying the first observed label data based on the particular point in time, wherein data within the first observed label data is associated with a time that is before the particular point in time;
identifying the second observed data based on the particular point in time, wherein data within the second observed data is associated with a time that is after the particular point in time.
16. The one or more storage media of claim 15, wherein the particular point in time is specified in user input or is automatically determined not based on user input.
17. The one or more storage media of claim 10, wherein adjusting said each training instance comprises using a black-box shift estimation technique.
18. The one or more storage media of claim 10, wherein the plurality of segments are based on one or more of geographic region, industry, employment status, job function, seniority level, or job title.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/916,706 US11599746B2 (en) | 2020-06-30 | 2020-06-30 | Label shift detection and adjustment in predictive modeling |
CN202110718169.0A CN113869342A (en) | 2020-06-30 | 2021-06-28 | Mark offset detection and adjustment in predictive modeling |
Publications (2)
Publication Number | Publication Date |
---|---|
US20210406598A1 US20210406598A1 (en) | 2021-12-30 |
US11599746B2 true US11599746B2 (en) | 2023-03-07 |
Family
ID=78990042
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/916,706 Active 2041-04-22 US11599746B2 (en) | 2020-06-30 | 2020-06-30 | Label shift detection and adjustment in predictive modeling |
Country Status (2)
Country | Link |
---|---|
US (1) | US11599746B2 (en) |
CN (1) | CN113869342A (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11609970B1 (en) * | 2022-06-10 | 2023-03-21 | Snowflake Inc. | Enhanced time series forecasting |
CN117746193B (en) * | 2024-02-21 | 2024-05-10 | 之江实验室 | Label optimization method and device, storage medium and electronic equipment |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10209974B1 (en) * | 2017-12-04 | 2019-02-19 | Banjo, Inc | Automated model management methods |
US20200134504A1 (en) * | 2018-10-29 | 2020-04-30 | Acer Cyber Security Incorporated | System and method of training behavior labeling model |
US20200184278A1 (en) * | 2014-03-18 | 2020-06-11 | Z Advanced Computing, Inc. | System and Method for Extremely Efficient Image and Pattern Recognition and Artificial Intelligence Platform |
US20200258182A1 (en) * | 2019-02-12 | 2020-08-13 | TrekTech | Artificial intelligence tracking system and method |
US20200279113A1 (en) * | 2017-10-30 | 2020-09-03 | Panasonic Intellectual Property Management Co., Ltd. | Shelf label detection device, shelf label detection method, and shelf label detection program |
US11295163B1 (en) * | 2020-04-01 | 2022-04-05 | Scandit Ag | Recognition of optical patterns in images acquired by a robotic device |
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10373068B2 (en) * | 2014-11-10 | 2019-08-06 | International Business Machines Corporation | Weight adjusted composite model for forecasting in anomalous environments |
CN105373606A (en) * | 2015-11-11 | 2016-03-02 | 重庆邮电大学 | Unbalanced data sampling method in improved C4.5 decision tree algorithm |
CN107958287A (en) * | 2017-11-23 | 2018-04-24 | 清华大学 | Towards the confrontation transfer learning method and system of big data analysis transboundary |
CN109086791A (en) * | 2018-06-25 | 2018-12-25 | 阿里巴巴集团控股有限公司 | A kind of training method, device and the computer equipment of two classifiers |
US20200034692A1 (en) * | 2018-07-30 | 2020-01-30 | National Chengchi University | Machine learning system and method for coping with potential outliers and perfect learning in concept-drifting environment |
CN109146080A (en) * | 2018-09-14 | 2019-01-04 | 苏州正载信息技术有限公司 | The method of model realization framework based on supervision class machine learning algorithm |
CN109345302B (en) * | 2018-09-27 | 2023-04-18 | 腾讯科技(深圳)有限公司 | Machine learning model training method and device, storage medium and computer equipment |
CN110688471B (en) * | 2019-09-30 | 2022-09-09 | 支付宝(杭州)信息技术有限公司 | Training sample obtaining method, device and equipment |
CN111291618B (en) * | 2020-01-13 | 2024-01-09 | 腾讯科技(深圳)有限公司 | Labeling method, labeling device, server and storage medium |
- 2020-06-30 US US16/916,706 patent/US11599746B2/en active Active
- 2021-06-28 CN CN202110718169.0A patent/CN113869342A/en active Pending
Non-Patent Citations (2)
Title |
---|
Hyndman, et al., "Forecasting with Exponential Smoothing: The State Space Approach", Published By Springer, 2008, 3 Pages. |
Lipton, et al., "Detecting and Correcting for Label Shift with Black Box Predictors", In Proceedings of the 35th International Conference on Machine Learning, Jul. 10, 2018, 11 Pages. |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11893507B1 (en) * | 2020-07-24 | 2024-02-06 | Amperity, Inc. | Predicting customer lifetime value with unified customer data |
US20240152782A1 (en) * | 2020-07-24 | 2024-05-09 | Amperity, Inc. | Predicting customer lifetime value with unified customer data |
Also Published As
Publication number | Publication date |
---|---|
US20210406598A1 (en) | 2021-12-30 |
CN113869342A (en) | 2021-12-31 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11599746B2 (en) | Label shift detection and adjustment in predictive modeling | |
US10846643B2 (en) | Method and system for predicting task completion of a time period based on task completion rates and data trend of prior time periods in view of attributes of tasks using machine learning models | |
US20230237329A1 (en) | Method and System Using a Neural Network for Prediction of Stocks and/or Other Market Instruments Price Volatility, Movements and Future Pricing | |
US12106223B2 (en) | Data valuation using reinforcement learning | |
US20200034716A1 (en) | Global optimal particle filtering method and global optimal particle filter | |
Berrar | Introduction to the Non-Parametric Bootstrap. | |
US11836582B2 (en) | System and method of machine learning based deviation prediction and interconnected-metrics derivation for action recommendations | |
KR20210017342A (en) | Time series prediction method and apparatus based on past prediction data | |
US11651271B1 (en) | Artificial intelligence system incorporating automatic model updates based on change point detection using likelihood ratios | |
US11861664B2 (en) | Keyword bids determined from sparse data | |
CN103197983A (en) | Service component reliability online time sequence predicting method based on probability graph model | |
US20230342606A1 (en) | Training method and apparatus for graph neural network | |
Bidyuk et al. | Methods for forecasting nonlinear non-stationary processes in machine learning | |
US20220383145A1 (en) | Regression and Time Series Forecasting | |
CN112668238B (en) | Rainfall processing method, rainfall processing device, rainfall processing equipment and storage medium | |
US11636377B1 (en) | Artificial intelligence system incorporating automatic model updates based on change point detection using time series decomposing and clustering | |
Bidyuk et al. | Forecasting nonlinear nonstationary processes in machine learning task | |
TW202133089A (en) | Method for optimally promoting decisions and computer program product thereof | |
Guo | Integrating genetic algorithm with ARIMA and reinforced random forest models to improve agriculture economy and yield forecasting | |
US12117820B2 (en) | Generating forecasted emissions value modifications and monitoring for physical emissions sources utilizing machine-learning models | |
Dehghani et al. | Crude oil price forecasting: a biogeography-based optimization approach | |
US20200380446A1 (en) | Artificial Intelligence Based Job Wages Benchmarks | |
US20230259762A1 (en) | Machine learning with instance-dependent label noise | |
US20230342664A1 (en) | Method and system for detection and mitigation of concept drift | |
JP6233432B2 (en) | Method and apparatus for selecting mixed model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| FEPP | Fee payment procedure | ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
| STPP | Information on status: patent application and granting procedure in general | NON FINAL ACTION MAILED |
| STPP | Information on status: patent application and granting procedure in general | RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
| STPP | Information on status: patent application and granting procedure in general | NOTICE OF ALLOWANCE MAILED |
| STPP | Information on status: patent application and granting procedure in general | AWAITING TC RESP., ISSUE FEE NOT PAID |
| STCF | Information on status: patent grant | PATENTED CASE |