US20220180250A1 - Processing dynamic data within an adaptive oracle-trained learning system using dynamic data set distribution optimization - Google Patents

Processing dynamic data within an adaptive oracle-trained learning system using dynamic data set distribution optimization Download PDF

Info

Publication number
US20220180250A1
US20220180250A1 (U.S. application Ser. No. 17/528,514)
Authority
US
United States
Prior art keywords
data
instance
reservoir
training
computers
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
US17/528,514
Inventor
Shawn Ryan Jeffery
David Alan Johnston
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Groupon Inc
Original Assignee
Groupon Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Groupon Inc filed Critical Groupon Inc
Priority to US17/528,514 priority Critical patent/US20220180250A1/en
Assigned to GROUPON, INC. reassignment GROUPON, INC. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: JOHNSTON, DAVID ALAN, JEFFERY, SHAWN RYAN
Publication of US20220180250A1 publication Critical patent/US20220180250A1/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning
    • G06N 5/00: Computing arrangements using knowledge-based models
    • G06N 5/04: Inference or reasoning models

Definitions

  • Embodiments of the invention relate, generally, to an adaptive system for building and maintaining machine learning models.
  • a system that automatically identifies new businesses based on data sampled from a data stream representing data collected from a variety of online sources is an example of a system that processes dynamic data. Analysis of such dynamic data typically is based on data-driven models that depend on consistent data, yet dynamic data are inherently inconsistent in both content and quality.
  • embodiments of the present invention provide herein systems, methods and computer readable media for building and maintaining machine learning models that process dynamic data.
  • Data quality fluctuations may affect the performance of a data-driven model, and, in some cases when the data quality and/or statistical distribution of the data has changed over time, the model may have to be replaced by a different model that more closely fits the changed data.
  • Obtaining a set of accurately distributed, high-quality training data instances for derivation of a model is difficult, time-consuming, and/or expensive.
  • high-quality training data instances are data that accurately represent the task being modeled, and that have been verified and labeled by at least one reliable source of truth (an oracle, hereinafter) to ensure their accuracy.
  • the framework enables end-users to declare exactly what they want (i.e., high-quality data) without having to understand how to produce such data.
  • the systems and methods described herein are therefore configured to implement an adaptive oracle-trained learning framework for automatically building and maintaining models that are developed using machine learning algorithms.
  • the framework leverages at least one oracle (e.g., a crowd) for automatic generation of high-quality training data to use in deriving a model.
  • the framework monitors the performance of the model and, in embodiments, leverages active learning and the oracle to generate feedback about the changing data for modifying training data sets while maintaining data quality to enable incremental adaptation of the model.
  • FIG. 1 illustrates a first embodiment of an example system that can be configured to implement an adaptive oracle-trained learning framework for automatically building and maintaining a predictive machine learning model in accordance with some embodiments discussed herein;
  • FIG. 2 is a flow diagram of an example method for automatically generating an initial predictive model and a high-quality training data set used to derive the model within an adaptive oracle-trained learning framework in accordance with some embodiments discussed herein;
  • FIG. 3 illustrates an exemplary process for automatically determining whether an input multi-dimensional data instance is an optimal choice for labeling and inclusion in at least one initial training data set using an adaptive oracle-trained learning framework in accordance with some embodiments discussed herein;
  • FIG. 4 is a flow diagram of an example method for determining whether an input multi-dimensional data instance is an optimal choice for labeling and inclusion in at least one initial training data set in accordance with some embodiments discussed herein;
  • FIG. 5 is a flow diagram of an example method 500 for adaptive processing of input data by an adaptive learning framework in accordance with some embodiments discussed herein;
  • FIG. 6 illustrates a second embodiment of an example system that can be configured to implement an adaptive oracle-trained learning framework for automatically building and maintaining a predictive machine learning model in accordance with some embodiments discussed herein;
  • FIG. 7 is a flow diagram of an example method for adaptive maintenance of a predictive model for optimal processing of dynamic data in accordance with some embodiments discussed herein;
  • FIG. 8 is a flow diagram of an example method for dynamically updating a model core group of clusters along a single dimension k in accordance with some embodiments discussed herein;
  • FIG. 9 is a flow diagram of an example method for dynamically updating a cluster along a single dimension k in accordance with some embodiments discussed herein;
  • FIG. 10 illustrates a diagram in which an exemplary dynamic data quality assessment system is configured as a quality assurance component within an adaptive oracle-trained learning framework in accordance with some embodiments discussed herein;
  • FIG. 11 is a flow diagram of an example method for automatic dynamic data quality assessment of dynamic input data being analyzed using an adaptive predictive model in accordance with some embodiments discussed herein;
  • FIG. 12 is a flow diagram of an example method for using active learning for processing potential training data for a machine-learning algorithm in accordance with some embodiments discussed herein;
  • FIG. 13 is an illustration of various different effects of active learning and dynamic data quality assessment on selection of new data samples to be added to an exemplary training data set for a binary classification model in accordance with some embodiments discussed herein;
  • FIG. 14 illustrates a third embodiment of an example system that can be configured to implement an adaptive oracle-trained learning framework for automatically building and maintaining a predictive machine learning model in accordance with some embodiments discussed herein;
  • FIG. 15 illustrates an example system that can be configured to implement dynamic optimization of a data set distribution in accordance with some embodiments discussed herein;
  • FIG. 16 illustrates a schematic block diagram of circuitry that can be included in a computing device, such as an adaptive learning system, in accordance with some embodiments discussed herein.
  • system components can be communicatively coupled to one another. Though the components are described as being separate or distinct, two or more of the components may be combined into a single process or routine.
  • the component functional descriptions provided herein including separation of responsibility for distinct functions is by way of example. Other groupings or other divisions of functional responsibilities can be made as necessary or in accordance with design preferences.
  • the terms “data,” “content,” “information” and similar terms may be used interchangeably to refer to data capable of being captured, transmitted, received, displayed and/or stored in accordance with various example embodiments. Thus, use of any such terms should not be taken to limit the spirit and scope of the disclosure.
  • a computing device is described herein to receive data from another computing device, the data may be received directly from the another computing device or may be received indirectly via one or more intermediary computing devices, such as, for example, one or more servers, relays, routers, network access points, base stations, and/or the like.
  • the data may be sent directly to the another computing device or may be sent indirectly via one or more intermediary computing devices, such as, for example, one or more servers, relays, routers, network access points, base stations, and/or the like.
  • Data being continuously sampled from a data stream representing data collected from a variety of online sources is an example of dynamic data.
  • a system that automatically performs email fraud identification based on data sampled from a data stream is an example of a system that processes dynamic data. Analysis of such dynamic data typically is based on data-driven models that can be generated using machine learning.
  • One type of machine learning is supervised learning, in which a statistical predictive model is derived based on a training data set of examples representing the modeling task to be performed.
  • the statistical distribution of the set of training data instances should be an accurate representation of the distribution of data that will be input to the model for processing. Additionally, the composition of a training data set should be structured to provide as much information as possible to the model. However, dynamic data is inherently inconsistent.
  • the quality of the data sources may vary, the quality of the data collection methods may vary, and, in the case of data being collected continuously from a data stream, the overall quality and statistical distribution of the data itself may vary over time.
  • Data quality fluctuations may affect the performance of a data-driven model, and, in some cases when the data quality and/or statistical distribution of the data has changed over time, the model may have to be replaced by a different model that more closely fits the changed data.
  • Obtaining a set of accurately distributed, high-quality training data instances for derivation of a model is difficult, time-consuming, and/or expensive.
  • high-quality training data instances are data that accurately represent the task being modeled, and that have been verified and labeled by at least one oracle to ensure their accuracy.
  • the systems and methods described herein are therefore configured to implement an adaptive oracle-trained learning framework for automatically building and maintaining machine learning models that are developed using machine learning algorithms.
  • the framework leverages at least one oracle (e.g., a crowd) for automatic generation of high-quality training data to use in deriving a model.
  • the framework monitors the performance of the model and, in embodiments, leverages active learning and the oracle to generate feedback about the changing data for modifying training data sets while maintaining data quality to enable incremental adaptation of the model.
  • the framework is designed to provide high-quality data for less cost than current state-of-the-art machine learning algorithms/processes across many real-world data sets. No initial training/testing phase is needed to generate a model. No expert human involvement is needed to initially construct and, over time, maintain the training set and retrain the model.
  • the framework continues to provide high quality output data even if the input data change, since the framework determines how and when to adjust the training data set for incremental re-training of the model, and the framework can rely on verified data from an oracle (e.g., crowd sourced data) while the model is being re-trained.
  • the framework has the ability to utilize any high-quality/oracle-provided data, regardless of how the data was generated (e.g., the framework can make use of data that was not collected as part of the training process, such as a separate process in an organization using an oracle to collect correct categories for business).
  • the framework enables end-users to declare exactly what they want (i.e., high-quality data) without having to understand how to produce such data.
  • the system takes care of not only training the model transparently (as described above), but also deciding for every input data instance if the system should get the answer from the oracle or from a model. All of the details of machine learning models and the accessing of an oracle (e.g., crowd-sourcing) are hidden from the user—the system may not even utilize a full-scale machine learning model or an oracle as long as it can meet its quality requirements.
  • FIG. 1 illustrates a first embodiment of an example system that can be configured to implement an adaptive oracle-trained learning framework 100 for automatically building and maintaining a predictive machine learning model.
  • an adaptive oracle-trained learning framework 100 comprises a predictive model 130 (e.g., a classifier) that has been generated using machine learning based on a set of training data 120 , and that is configured to generate a judgment about unlabeled input data 105 in response to receiving a feature representation of the input data 105 ; an input data analysis component 110 for generating a feature representation of the input data 105 ; an accuracy assessment component 135 for providing an estimated assessment of the accuracy of the judgment of the input data and/or the quality of the input data 105 ; an active labeler 140 to facilitate the generation and maintenance of optimized training data 120 by identifying possible updates to the training data 120 ; at least one oracle 150 (e.g., a crowd, a flat file of data verification results previously received from one or more oracles, and/or data verification software) for providing true labels for selected data instances; a labeled data reservoir 155 for storing labeled data instances from which updates to the training data 120 may be selected; and a quality assurance component 160 for monitoring the accuracy of processed data 165 output from the framework 100 .
  • the predictive model 130 is a trainable model that is derived from the training data 120 using supervised learning.
  • An exemplary trainable model (e.g., a trainable classifier) may be derived to perform a particular task (e.g., a binary classification task in which a classifier model returns a judgment as to which of two groups an input data instance 105 most likely belongs). In embodiments, each training example in a training data set from which the classifier is derived may represent an input to the classifier that is labeled representing the group to which the input data instance belongs.
  • Supervised learning is considered to be a data-driven process, because the efficiency and accuracy of deriving a model from a set of training data is dependent on the quality and composition of the set of training data.
  • obtaining a set of accurately distributed, high-quality training data instances typically is difficult, time-consuming, and/or expensive.
  • the training data set examples for a classification task should be balanced to ensure that all class labels are adequately represented in the training data.
  • Credit card fraud detection is an example of a classification task in which examples of fraudulent transactions may be rare in practice, and thus verified instances of these examples are more difficult to collect for training data.
  • an initial predictive model and a high-quality training data set used to derive the model via supervised learning may be generated automatically within an adaptive oracle-trained learning framework (e.g., framework 100 ) by processing a stream of unlabeled dynamic data.
  • FIG. 2 is a flow diagram of an example method 200 for automatically generating an initial predictive model and a high-quality training data set used to derive the model within an adaptive oracle-trained learning framework.
  • the method 200 will be described with respect to a system that includes one or more computing devices and performs the method 200 .
  • the method 200 will be described with respect to processing of dynamic data by an adaptive oracle-trained learning framework 100 .
  • a framework 100 is configured initially 205 to include an untrained predictive model 130 and an empty training data set 120 .
  • the framework 100 is assigned 210 an input configuration parameter describing a desired accuracy A for processed data 165 to be output from the framework 100 .
  • the desired accuracy A may be a minimum accuracy threshold to be satisfied for each processed data instance 165 to be output from the framework while, in some alternative embodiments, the desired accuracy A may be an average accuracy to be achieved for a set of processed data 165 .
  • the values chosen to describe the desired accuracy A for sets of processed data across various embodiments may vary.
  • an initially configured adaptive oracle-trained learning framework 100 that includes an untrained model and empty training data set may be “cold started” 215 by streaming unlabeled input data instances 105 into the system for processing.
  • the model 130 and training data 120 are then adaptively updated 230 by the framework 100 until the processed data instances 165 produced by the model 130 consistently achieve 225 the desired accuracy A as specified by the single input configuration parameter (i.e., the process ends 235 when the system reaches a “steady state”).
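  • The following is a minimal sketch, in Python, of this cold-start loop of method 200 under stated assumptions: the framework begins with an untrained classifier and an empty training set, relies on the oracle while the model is being trained, and stops adapting once a moving window of processed instances meets the desired accuracy A. The `featurize` and `oracle` callables, the logistic-regression model, and the moving-window check are illustrative assumptions, not the patent's specified implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def cold_start(stream, oracle, featurize, desired_accuracy=0.95, window=200):
    """Yield (instance, label) pairs; stop adapting once accuracy A is met consistently."""
    model = LogisticRegression()        # untrained predictive model (130)
    train_X, train_y = [], []           # empty training data set (120)
    recent = []                         # moving window of per-instance correctness
    for instance in stream:
        x = featurize(instance)
        true_label = oracle(instance)   # rely on oracle-verified data while the model is trained
        if len(set(train_y)) >= 2:      # model can be (re)fit once both classes have been seen
            model.fit(np.array(train_X), np.array(train_y))
            recent.append(model.predict([x])[0] == true_label)
            recent = recent[-window:]
        train_X.append(x)
        train_y.append(true_label)      # adaptively update the training data (230)
        yield instance, true_label
        if len(recent) == window and np.mean(recent) >= desired_accuracy:
            break                       # "steady state": desired accuracy A achieved (225/235)
```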
  • one or more high-quality initial training data sets may be generated automatically from a pool of unlabeled data instances.
  • the unlabeled data instances are dynamic data that have been collected previously from at least one data stream during at least one time window.
  • the collected data instances are multi-dimensional data, where each data instance is assumed to be described by a set of attributes (i.e., features hereinafter).
  • the input data analysis component 110 performs a distribution-based feature analysis of the collected data.
  • the feature analysis includes clustering the collected data instances into homogeneous groups across multiple dimensions using an unsupervised learning approach that is dependent on the distribution of the input data as described, for example, in U.S. patent application Ser. No. 14/038,661.
  • the clustered data instances are sampled uniformly across the different homogeneous groups, and the sampled data instances are sent to an oracle 150 (as shown in FIG. 1 ) for labeling.
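  • As a rough illustration of this bootstrap step, the sketch below clusters a pool of collected, unlabeled instances into homogeneous groups and samples uniformly across the groups to build the set sent to the oracle 150 for labeling. The use of k-means, the number of groups, and the per-group sample size are assumptions; the patent defers the clustering details to application Ser. No. 14/038,661.

```python
import numpy as np
from sklearn.cluster import KMeans

def bootstrap_labeling_pool(X, n_groups=10, per_group=5, seed=0):
    """Return indices of instances to send to the oracle for labeling."""
    rng = np.random.default_rng(seed)
    groups = KMeans(n_clusters=n_groups, n_init=10, random_state=seed).fit_predict(X)
    selected = []
    for g in range(n_groups):
        members = np.flatnonzero(groups == g)
        take = min(per_group, members.size)            # uniform sample per homogeneous group
        selected.extend(rng.choice(members, size=take, replace=False))
    return np.array(selected)

# Example: indices = bootstrap_labeling_pool(pool_features); send pool[indices] to the oracle
```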
  • FIGS. 3 and 4 respectively illustrate and describe a flowchart for an exemplary method 400 for automatically determining whether an input multi-dimensional data instance is an optimal choice for labeling and inclusion in at least one initial training data set using an adaptive oracle-trained learning framework 100 .
  • the depicted method 400 is described with respect to a system that includes one or more computing devices and performs the method 400 .
  • the system receives an input multi-dimensional data instance having k attributes 405 . Determining whether an input multi-dimensional data instance is a preferred choice for labeling and inclusion in at least one initial training data set 420 is based in part on an operator estimation score and/or on a global estimation score assigned to the data instance.
  • an input multi-dimensional data instance having k attributes is represented by a feature vector x 305 having k elements (x 1 , x 2 , . . . , x k ), where each element in feature vector x represents the value of a corresponding attribute.
  • Each of the elements is assigned to a particular cluster/distribution of the corresponding attribute using a clustering/distribution algorithm 320 (e.g., dynamic clustering as described in U.S. patent application Ser. No. 14/038,661).
  • an operator estimate 302 is calculated 410 (as shown in FIG. 4 ) for each feature.
  • An operator represents a single data cleaning manipulation action (e.g., normalization) applied to a feature. In some embodiments, an operator estimate 302 may include multiple operators chained together.
  • Using an input from a clustering/distribution algorithm 320 respectively associated with each operator estimate, a classifier 330 , implementing a per operator estimator trained on the distribution, then determines a per operator estimate confidence value estimating a probability P n (x) for the data instance.
  • the data instance is assigned an operator estimation score representing the values of the set of per operator estimates 360 .
  • a higher operator estimation score indicates that the data instance would be assigned to one of the two classes by a binary classifier with a greater degree of confidence/certainty because the data instance is at a greater distance from the decision boundary of the classification task.
  • a lower operator estimation score indicates that the assignment of the data instance to one of the classes by the binary classifier would be at a lower degree of confidence/certainty because the data instance is located close to or at the decision boundary for the classification task.
  • the data instance, represented by feature vector x 305 , is assigned to each of a group of N global datasets 310 containing data instances of the same type as the input data instance, and an estimated distribution 312 is calculated for each dataset.
  • the group of N global datasets 310 have varying timeline-based sizes (e.g., each dataset respectively represents a set of data instances collected during a weekly, monthly, or quarterly time window).
  • Using an input from a clustering/distribution algorithm 340 respectively associated with each of the group of datasets, a classifier 350 , implementing a per dataset estimator trained on each distribution, determines a per dataset global estimate confidence value estimating a probability P G (x) that the data instance belongs to the corresponding global dataset distribution.
  • the input data instance is assigned 415 a global estimation score representing the values of the set of per dataset global estimates 370 .
  • a data instance having a higher global estimation score is more likely to belong to a global distribution of data instances of the same type.
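  • A hedged sketch of how the two scores might be combined when deciding whether an instance is a preferred labeling candidate is shown below: the per operator estimates 360 and per dataset global estimates 370 are aggregated into the two scores, and an instance near the decision boundary (low operator estimation score) that still belongs to the global distribution (high global estimation score) is selected. The averaging and the thresholds are assumptions; the patent does not fix an exact rule.

```python
import numpy as np

def estimation_scores(per_operator_conf, per_dataset_conf):
    operator_score = float(np.mean(per_operator_conf))   # aggregate of per operator estimates (360)
    global_score = float(np.mean(per_dataset_conf))      # aggregate of per dataset estimates (370)
    return operator_score, global_score

def is_labeling_candidate(per_operator_conf, per_dataset_conf,
                          max_operator_score=0.6, min_global_score=0.5):
    # informative (near the decision boundary) yet representative of the global distribution
    op_score, gl_score = estimation_scores(per_operator_conf, per_dataset_conf)
    return op_score <= max_operator_score and gl_score >= min_global_score
```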
  • the framework 100 may further optimize the initial training data 120 by processing the training data set examples using the model 130 , monitoring the performance of the model 130 during the processing, and then adjusting the input data feature representation and/or the composition and/or distribution of the training dataset based on an analysis of the model's performance.
  • a predictive model 130 and training data 120 deployed within an adaptive oracle-trained learning framework 100 for processing dynamic data may be updated incrementally in response to changes in the quality and/or characteristics of the dynamic data to achieve optimal processing of newly received input data 105 .
  • an input data instance 105 may be selected by the framework as a potential training example based on an accuracy assessment determined from the model output generated from processing the input data instance 105 and/or attributes of the input data instance.
  • Selected data instances receive true labels from at least one oracle 150 , and are stored in a labeled data reservoir 155 .
  • the training data 120 are updated using labeled data selected from the labeled data reservoir 155 .
  • FIG. 5 is a flow diagram of an example method 500 for adaptive processing of input data by an adaptive learning framework.
  • the method 500 is described with respect to a system that includes one or more computing devices that process dynamic data by an adaptive oracle-trained learning framework 100 .
  • method 500 will be described for an exemplary system in which the predictive model 130 is a trainable classifier.
  • the system receives 505 model output (i.e., a judgment) from a classifier model (e.g., model 130 ) that has processed an input data instance 105 .
  • model output may be a predicted label representing a category/class to which the input data instance is likely to belong.
  • the judgment includes a confidence value that represents the certainty of the judgment. For example, if the input data instance is very different from any of the training data instances, the model output that is generated from that input data has a low confidence.
  • the confidence value may be defined by any well-known distance metric (e.g., Euclidean distance, cosine, Jaccard distance).
  • an associated judgment confidence value may be a confidence score.
  • the judgment may be based on the model performing a mapping of the input data instance feature set into a binary decision space representing the task parameters, and the associated judgment confidence value may be a confidence score representing the distance in the binary decision space between the mapping of the data instance feature set and a decision boundary at the separation of the two classes in the decision space.
  • a mapping located at a greater distance from the decision boundary may be associated with a higher confidence score, representing a class assignment predicted at a greater confidence/certainty.
  • a mapping that is located close to the decision boundary may be associated with a lower confidence score, representing a class assignment predicted at a lower confidence/certainty.
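  • For a linear binary classifier, the confidence score described above can be sketched as the distance of the mapped feature vector from the separating hyperplane, with greater distance yielding greater certainty. The linear model and the squashing of the distance into a (0.5, 1.0) range are illustrative assumptions.

```python
import numpy as np

def judgment_with_confidence(w, b, x):
    margin = (np.dot(w, x) + b) / np.linalg.norm(w)    # distance from the decision boundary
    label = int(margin >= 0)                           # binary judgment
    confidence = 1.0 / (1.0 + np.exp(-abs(margin)))    # farther from the boundary => more certain
    return label, confidence

# Example: judgment_with_confidence(np.array([0.8, -1.2]), 0.1, np.array([1.0, 0.3]))
```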
  • the system executes 510 an accuracy assessment of the model output and/or the input data instance quality.
  • the accuracy assessment is an accuracy value representing the accuracy of the model judgment.
  • accuracy assessment may include one or a combination of model-dependent and model-independent analytics.
  • in embodiments in which the model output includes a judgment confidence score, accuracy assessment may include that confidence score directly.
  • a second predictive model may be used to estimate the framework model accuracy on a per-instance level. For example, a random sample of data instances labeled by the framework model can be sent to the oracle for verification, and that sample then can be used as training data to train a second model to predict the probability that the framework model judgment is correct.
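  • The per-instance accuracy estimator described in this embodiment might be sketched as follows: a random sample of framework-labeled instances is verified by the oracle, and a second classifier is trained on that sample to predict the probability that the framework model's judgment is correct. The choice of logistic regression and of the input features is an assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_accuracy_estimator(sample_features, framework_labels, oracle_labels):
    # 1 where the framework model agreed with the oracle, 0 where it did not
    correct = (np.asarray(framework_labels) == np.asarray(oracle_labels)).astype(int)
    return LogisticRegression().fit(sample_features, correct)

def estimated_accuracy(estimator, features):
    # probability that the framework model's judgment on each instance is correct
    return estimator.predict_proba(features)[:, 1]
```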
  • accuracy assessment is implemented by a quality assurance component 160 to generate an aggregate/moving window estimate of accuracy.
  • the quality assurance component 160 is configured as a dynamic data quality assessment system described, for example, in U.S. patent application Ser. No. 14/088,247 entitled “Automated Adaptive Data Analysis Using Dynamic Data Quality Assessment,” filed on Nov. 22, 2013, and which is incorporated herein in its entirety.
  • An exemplary dynamic quality assessment system is described in detail with reference to FIG. 10 and method 1100 of FIG. 11 .
  • the system analyzes 515 the assessed model output and input data instance by determining whether the input data instance should be selected for potential inclusion in the training data set 120 . In an instance in which the input data instance is selected 520 as a possible training example, the system sends the instance to an oracle for true labeling.
  • the analysis (“active labeling” hereinafter) includes active learning.
  • Active learning as described, for example, in Settles, Burr (2009), “Active Learning Literature Survey”, Computer Sciences Technical Report 1648, University of Wisconsin—Madison, is a semi-supervised learning process in which the distribution of the training data set instances can be adjusted to optimally represent a machine learning problem.
  • a machine-learning algorithm may achieve greater accuracy with fewer training examples if the selected training data set instances are instances that will provide maximum information to the model about the problem.
  • data instances that may provide maximum information about a classification task are data instances that result in mappings in decision space that are closer to the decision boundary.
  • these data instances may be identified automatically through active labeling analysis because their judgments are associated with lower confidence scores, as previously described.
  • the determination of whether the input data instance should be selected for potential inclusion in the training data set 120 may include a data quality assessment.
  • active labeling analysis may be based on a combination of model prediction accuracy and data quality.
  • the system, in response to receiving a labeled data instance from the oracle, stores 530 the labeled data instance in a labeled data reservoir 155 , from which new training data instances may be selected for updates to training data 120 .
  • the labeled data reservoir grows continuously as labeled data instances are received by the system and then stored.
  • the system outputs 545 the labeled data instance before the process ends 550 .
  • the true label assigned to the data instance by the oracle ensures the accuracy of the output, regardless of the outcome of the accuracy assessment of the model performance and/or the input data instance quality.
  • the system sends 535 the assessed input data instance and the model output for accuracy assurance.
  • accuracy assurance may include determining whether the assessed input data instance and the model output satisfy a desired accuracy A that has been received as a declarative configuration parameter by the system.
  • the system outputs 545 the processed data instance and the process ends 550 .
  • the system sends 525 the input data instance to the oracle for true labeling.
  • the labeled data instance is added 530 to the data reservoir and then output 545 before the process ends 550 , as previously described.
  • FIG. 6 illustrates a second embodiment of an example system that can be configured to implement an adaptive oracle-trained learning framework 600 for automatically building and maintaining a predictive machine learning model.
  • an adaptive oracle-trained learning framework 600 comprises a predictive model 630 (e.g., a classifier) that has been generated using machine learning based on a set of training data 620 , and that is configured to generate a judgment about the input data 605 in response to receiving a feature representation of the input data 605 ; an input data analysis component 610 for generating a feature representation of the input data 605 and maintaining optimized, high-quality training data 620 ; a quality assurance component 660 for assessment of the quality of the input data 605 and of the quality of the judgments of the predictive model 630 ; an active learning component 640 to facilitate the generation and maintenance of optimized training data 620 ; and at least one oracle 650 (e.g., a crowd, a flat file of data verification results previously received from one or more oracles, and/or data verification software) for verifying and/or correcting selected data instances and judgments.
  • new unlabeled data instances 605 are input to the framework 600 for processing by the predictive model 630 .
  • each new data instance 605 may be multi-dimensional data collected from one or more online sources describing a particular business (e.g., a restaurant, a spa), and the predictive model 630 may be a classifier that returns a judgment as to which of a set of categories the business belongs.
  • the predictive model 630 generates a judgment (e.g., an identifier of a category) in response to receiving a feature representation of an unlabeled input data instance 605 .
  • the feature representation is generated during input data analysis 610 using a distribution-based feature analysis, as previously described.
  • the judgment generated by the predictive model 630 includes a confidence value.
  • the confidence value included with a classification judgment is a score representing the distance in decision space of the judgment from the task decision boundary, as previously described with reference to FIG. 3 . Classification judgments that are more certain are associated with higher confidence scores because those judgments are at greater distances in decision space from the task decision boundary.
  • a quality assurance component 660 monitors the quality of the predictive model performance as well as the quality of the input data being processed.
  • the processed data 665 and, in some embodiments, an associated judgment are output from the framework 600 if they are determined to satisfy a quality threshold.
  • FIG. 7 is a flow diagram of an example method 700 for adaptive maintenance of a predictive model for optimal processing of dynamic data.
  • the method 700 will be described with respect to a system that includes one or more computing devices and performs the method 700 .
  • the method 700 will be described with respect to processing of dynamic data by an adaptive oracle-trained learning framework 600 .
  • method 700 will be described for an exemplary system in which the predictive model 630 is a trainable classifier.
  • the system receives 705 a classification judgment about an input data instance from the classifier.
  • the judgment includes a confidence value that represents the certainty of the judgment.
  • the confidence value included with a classification judgment is a score representing the distance in decision space of the judgment from the task decision boundary, as previously described with reference to FIG. 3 .
  • the system sends 710 the judgment and the input data instance to a quality assurance component 660 for quality analysis.
  • quality analysis includes determining 715 whether the judgment confidence value satisfies a confidence threshold.
  • the system outputs 730 the data processed by the modeling task and the process ends 735 .
  • the system sends 720 the input data sample to an oracle for verification.
  • verification by the oracle may include correction of the data, correction of the judgment, and/or labeling the input data.
  • the system optionally may update the training data 620 using the verified data before the process ends 735 .
  • updating the training data may be implemented using the quality assurance component 660 and/or the active learning component 640 , which both are described in more detail with reference to FIGS. 10-12 .
  • the training data set 620 is updated continuously as new input data are processed, so that the training data reflect optimal examples of the current data being processed.
  • the training data examples thus are adapted to fluctuations in quality and composition of the dynamic data, enabling the predictive model 630 to be re-trained.
  • the model 630 may be re-trained using the current training data set periodically or, alternatively, under a re-training schedule. In this way, a predictive model can maintain its functional effectiveness by adapting to the dynamic nature of the data being processed. Incrementally adapting an existing model is less disruptive and resource-intensive than replacing the model with a new model, and also enables a model to evolve with the dynamic data.
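  • One way to sketch the periodic incremental re-training described above is with an online learner that is updated in place from the current training set rather than replaced; SGDClassifier.partial_fit is used here purely as an illustration, since the patent does not prescribe a model family or update rule.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

class IncrementallyRetrainedModel:
    def __init__(self, classes=(0, 1)):
        self.model = SGDClassifier(loss="log_loss")
        self.classes = np.array(classes)

    def retrain(self, training_X, training_y):
        # incremental update with the current optimal training examples (620)
        self.model.partial_fit(training_X, training_y, classes=self.classes)

    def predict(self, X):
        return self.model.predict(X)
```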
  • an adaptive oracle-trained learning framework 600 is further configured to perform two-sample hypothesis testing (A/B testing, hereinafter) to verify the performance of the predictive model 630 after re-training.
  • the system performs a new distribution-based feature analysis of the training data 620 in response to the addition of newly labeled data instances.
  • a new distribution-based feature analysis of the data by dynamic clustering may be performed by the input data analysis component 610 using method 800 , a flow chart of which is illustrated in FIG. 8 , and using method 900 , a flow chart of which is illustrated in FIG. 9 .
  • Method 800 and method 900 are described in detail in U.S. patent application Ser. No. 14/038,661.
  • FIG. 8 is a flow diagram of an example method 800 for dynamically updating a model core group of clusters along a single dimension k.
  • the method 800 will be described with respect to a system that includes one or more computing devices and performs the method 800 .
  • the system receives 805 X k , defined as a model core group of clusters 105 of objects based on a clustering dimension k.
  • clustering dimension k may represent a geographical feature of an object represented by latitude and longitude data.
  • the system receives 810 a new data stream S k representing the objects in X k , where the n-dimensional vector representing each object O i includes the k th dimension.
  • the system classifies 815 each of the objects represented in the new data stream 125 as respectively belonging to one of the clusters within X k .
  • an object is classified by determining, based on a k-means algorithm, C k , the nearest cluster to the object in the k th dimension.
  • classifying an object includes adding that object to the cluster C k .
  • the system determines 820 whether to update X k in response to integrating each of the objects into its respective nearest cluster.
  • FIG. 9 is a flow diagram of an example method 900 for dynamically updating a cluster along a single dimension k.
  • the method 900 will be described with respect to a system that includes one or more computing devices and performs the method 900 .
  • the method 900 will be described with respect to implementation of steps 815 and 820 of method 800 .
  • the system receives 905 a data point from a new data stream S k representing O i k , an instance of clustering dimension k describing a feature of an object being described in new data stream S.
  • the data point may be latitude and longitude representing a geographical feature included in an n-dimensional feature vector describing the object.
  • the system adds 910 the object to the closest cluster C k ∈ S k for O i k and, in response, updates 915 the properties of cluster C k .
  • updating the properties includes calculating σ k , the standard deviation of the objects in cluster C k .
  • the system determines 920 whether to update cluster C k using its updated properties.
  • updating cluster C k may include splitting cluster C k or merging cluster C k with another cluster within the core group of clusters.
  • the system determines 920 whether to update cluster C k using σ k .
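  • A compressed sketch of this single-dimension update in methods 800/900 appears below: each new value along dimension k is added to the nearest cluster, σ k is recomputed, and an over-dispersed cluster is split. The split threshold (and the omission of the merge step) are simplifying assumptions; the actual criteria are those of application Ser. No. 14/038,661.

```python
import numpy as np

class DimensionKCluster:
    def __init__(self, points):
        self.points = list(points)

    @property
    def mean(self):
        return float(np.mean(self.points))

    @property
    def sigma(self):
        return float(np.std(self.points))   # sigma_k, recomputed after each update (915)

def assign_and_update(clusters, value, split_sigma=5.0):
    nearest = min(clusters, key=lambda c: abs(c.mean - value))   # nearest cluster C_k (815)
    nearest.points.append(value)                                 # add the object (910)
    if nearest.sigma > split_sigma:                              # decide whether to update (820/920)
        lo = [p for p in nearest.points if p <= nearest.mean]
        hi = [p for p in nearest.points if p > nearest.mean]
        clusters.remove(nearest)
        clusters.extend([DimensionKCluster(lo), DimensionKCluster(hi)])
    return clusters
```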
  • the system may optimize an initial training data set 120 that has been generated from a pool of unlabeled data by implementing method 300 to process the initial training data set 120 using the predictive model 130 generated from the initial training data and updating the training data set 120 based on the quality assessments of the model judgments of the data instances.
  • the system may repeat implementation of method 300 until the entire training data set meets a pre-determined quality threshold.
  • the quality assurance component 160 is configured as a dynamic data quality assessment system described, for example, in U.S. patent application Ser. No. 14/088,247 entitled “Automated Adaptive Data Analysis Using Dynamic Data Quality Assessment,” filed on Nov. 22, 2013, and which is incorporated herein in its entirety.
  • FIG. 10 illustrates a diagram 1000 , in which an exemplary dynamic data quality assessment system is configured as a quality assurance component 160 within an adaptive oracle-trained learning framework 100 , as described in detail in U.S. patent application Ser. No. 14/088,247.
  • the quality assurance component 160 includes a quality checker 1062 and a quality blocker 1064 , and maintains a data reservoir 1050 within the framework 100 .
  • quality analysis performed by the quality assurance component 160 may include determining the effect of data quality fluctuations on the performance of the predictive model 130 generated from the training data 120 , identifying input data samples that currently best represent examples of the modeled task, and modifying the training data 120 to enable the model to be improved incrementally by being re-trained with a currently optimal set of training data examples.
  • dynamic data quality assessment may be performed automatically by the quality assurance component using method 1100 , a flow chart of which is illustrated in FIG. 11 . Method 1100 is described in detail in U.S. patent application Ser. No. 14/088,247.
  • FIG. 11 is a flow diagram of an example method 1100 for automatic dynamic data quality assessment of dynamic input data being analyzed using an adaptive predictive model.
  • the method 1100 will be described with respect to a system that includes one or more computing devices and performs the method 1100 .
  • method 1100 will be described for a scenario in which the input data sample is a sample of data collected from a data stream, and in which the predictive model is a trainable classifier, adapted based on a set of training data.
  • a data cleaning process has been applied to the input data sample.
  • the classifier is configured to receive a feature vector representing a view of the input data sample and to output a judgment about the input data sample.
  • the system receives 1105 a judgment about an input data sample from a classifier.
  • the judgment includes a confidence value that represents a certainty of the judgment.
  • the confidence value may be a score that represents the distance of the judgment from the decision boundary in decision space for the particular classification problem modeled by the classifier. The confidence score is higher (i.e., the judgment is more certain) for judgments that are further from the decision boundary.
  • the system maintains a data reservoir of data samples that have the same data type as the input data sample and that have been processed previously by the classifier.
  • the system analyzes 1110 the input data sample in terms of the summary statistics of the data reservoir and/or the judgment.
  • analysis of the judgment may include comparing a confidence value associated with the judgment to a confidence threshold and/or determining whether the judgment matches a judgment determined previously for the input sample by a method other than the classifier.
  • the system determines 1115 whether to send a quality verification request for the input data sample to an oracle based on the analysis. For example, in some embodiments, the system may determine to send a quality verification request for the input data sample if the data sample is determined statistically to be an outlier to the data samples in the data reservoir. In another example, the system may determine to send a quality verification request for the input data sample if the judgment is associated with a confidence value that is below a confidence threshold. In a third example, the system may determine to send a quality verification request for the input data sample if the judgment generated by the classifier does not match a judgment generated by another method, even if the confidence value associated with the classifier's judgment is above the confidence threshold.
  • the process ends 1140 .
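  • The three example triggers above can be sketched as a single predicate that decides whether a quality verification request is sent to the oracle: the sample is a statistical outlier relative to the reservoir, the judgment confidence is below threshold, or the judgment disagrees with one obtained by another method. The z-score outlier test and the threshold values are assumptions.

```python
def needs_oracle_verification(sample_value, reservoir_mean, reservoir_std,
                              confidence, confidence_threshold=0.8,
                              judgment=None, other_judgment=None, z_cutoff=3.0):
    # statistical outlier with respect to the reservoir's summary statistics
    is_outlier = reservoir_std > 0 and abs(sample_value - reservoir_mean) / reservoir_std > z_cutoff
    # judgment confidence below the configured threshold
    low_confidence = confidence < confidence_threshold
    # classifier judgment disagrees with a judgment produced by another method
    disagreement = other_judgment is not None and other_judgment != judgment
    return is_outlier or low_confidence or disagreement
```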
  • the system may be configured to send requests to any of a group of different oracles (e.g., a crowd, a flat file of data verification results previously received from one or more oracles, and/or data verification software) and the system may select the oracle to receive the quality verification request based on attributes of the input data sample.
  • the system determines 1125 whether to add the input data sample, its associated judgment, and its data quality estimate to the data reservoir. In some embodiments, the determination may be based on whether the input data sample statistically belongs in the data reservoir. Additionally and/or alternatively, the determination may be based on whether the judgment is associated with a high confidence value and/or matches a judgment made by a method different from the classifier (e.g., the oracle).
  • the process ends 1140 .
  • if the system determines 1125 that the new data sample is to be added to the reservoir, the system optionally updates summary statistics for the reservoir before the process ends 1140 .
  • the generation and maintenance of an optimized training data set 120 for the predictive model 130 component of the framework is facilitated by the active learning component 140 .
  • Active learning as described, for example, in Settles, Burr (2009), “Active Learning Literature Survey”, Computer Sciences Technical Report 1648, University of Wisconsin—Madison, is a semi-supervised learning process in which the distribution of the training data set instances can be adjusted to optimally represent a machine learning problem.
  • FIG. 12 is a flow diagram of an example method 1200 for using active learning for processing potential training data for a machine-learning algorithm.
  • the method 1200 will be described with respect to a system that includes one or more computing devices and performs the method 1200 .
  • the method 1200 will be described with respect to processing of dynamic data by the active learning component 140 of an adaptive oracle-trained learning framework 100 .
  • method 1200 will be described for an exemplary system in which the machine-learning algorithm is a trainable classifier.
  • the system receives 1205 an input data sample and its associated judgment that includes a confidence value determined to not satisfy a confidence threshold.
  • a machine-learning algorithm may achieve greater accuracy with fewer training labels if the training data set instances are chosen to provide maximum information about the problem.
  • data instances that provide maximum information about the classification task are data instances that result in classifier judgments that are closer to the decision boundary. In some embodiments, these data instances may be recognized automatically because their judgments are associated with lower confidence scores, as previously described.
  • the system sends 1210 the input data sample to an oracle for verification.
  • verification by the oracle may include correction of the data, correction of the judgment, and/or labeling the input data.
  • the system optionally may update 1215 the training data 120 using the verified data.
  • the system can leverage the classifier's performance in real time or near real time to adapt the training data set to include a higher frequency of examples that currently result in judgments having the greatest uncertainty.
  • a dynamic data quality assessment system 160 may complement an active learning component 140 to ensure that any modifications of the training data by adding new samples to the training data set do not result in over-fitting the model to the problem.
  • FIG. 13 is an illustration 1300 of the different effects of active learning and dynamic data quality assessment on selection of new data samples to be added to an exemplary training data set for a binary classification model.
  • In the illustration, a model (i.e., a binary classifier) assigns each input data sample a judgment value between 0 and 1. A judgment value of 0.5 represents a situation in which the classification decision was not certain; an input data sample assigned a judgment value close to 0.5 by the classifier represents a judgment that is close to the decision boundary 1315 for the classification task.
  • the dashed curve 1340 represents the relative frequencies of new training data samples that would be added to a training data set for this binary classification problem by an active learning component. To enhance the performance of the classifier in situations where the decision was uncertain, the active learning component would choose the majority of new training data samples from input data that resulted in decisions near the decision boundary 1315 .
  • the solid curve 1330 represents the relative frequencies of new training data samples that would be added to the training data set by dynamic quality assessment.
  • dynamic quality assessment may choose the majority of new training data samples based on whether they statistically belong in the data reservoir. It also may choose to add new training data samples that were classified with certainty (i.e., having a judgment value close to either 0 or 1), but erroneously (e.g., samples in which the judgment result from the classifier did not match the result returned from the oracle).
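  • The two selection behaviors contrasted in FIG. 13 can be sketched as follows for a binary classifier whose judgment value lies in [0, 1]: active learning favors samples near the 0.5 decision boundary, while dynamic quality assessment favors samples that are reservoir outliers or that were classified with certainty but erroneously. The specific cutoffs are illustrative assumptions.

```python
def active_learning_selects(judgment_value, band=0.1):
    # uncertain region around the decision boundary (1315)
    return abs(judgment_value - 0.5) <= band

def quality_assessment_selects(judgment_value, is_reservoir_outlier, oracle_label=None):
    confident = abs(judgment_value - 0.5) > 0.4            # judgment close to 0 or 1
    predicted = int(judgment_value >= 0.5)
    confidently_wrong = (oracle_label is not None and confident and predicted != oracle_label)
    return is_reservoir_outlier or confidently_wrong
```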
  • FIG. 14 illustrates a third embodiment of an example system that can be configured to implement an adaptive oracle-trained learning framework 1400 for automatically building and maintaining a predictive machine learning model.
  • system 1400 may comprise an input data analysis module 1420 for creating an optimal feature representation (e.g., a feature vector 1404 ) of a received input data sample 1402 selected from a data stream 1401 ; a predictive model 1430 that has been generated using machine learning based on a set of training data 1440 , and that is configured to generate a judgment 1406 about the input data sample 1402 in response to receiving a feature vector 1404 representing the input data sample 1402 ; a data set optimizer 1440 for evaluating the input data sample 1402 and its associated judgment 1406 ; and a data reservoir 1450 that includes a set of data bins 1454 maintained by data set optimizer 1440 .
  • the data reservoir 1450 thus is ensured to store fresh, up-to-date data that, in embodiments, may be selected from at least one of the bins 1454 to update the training data 1440 , thus enabling the model to be improved incrementally by being re-trained with a currently optimal set of examples.
  • the configuration of the reservoir data bins 1454 may be used to ensure that the data reservoir stores up-to-date samples in a distribution such that, if samples were selected from the bins and used to update the training data 1440 , those samples potentially would create training data that would improve the performance of the predictive model 1430 .
  • a set of bins 1454 may be used to generate labeling sets that do not match the distribution of the general population.
  • Each of the bins may be used to store data representing one of the possible labels, and a labeling set with equal frequencies of samples of each label may be generated even though at least one of the labels may be rare in the general population distribution.
  • each of the bins may represent one of the sources that have contributed to the data stream 1401 , and training data may be selected from the bins to match a distribution that represents a particular machine learning problem.
  • For example, if each of the data sources is a particular location (e.g., the US, Europe, and Asia), each of the bins stores data samples selected from one of the sources; if the desired training data 1440 distribution should represent 10% US sources, 10% of a labeling sample may be selected from the data bin storing data selected from US sources.
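  • A minimal sketch of drawing a labeling or training set from the data bins 1454 so that it matches a desired distribution (e.g., 10% of the set from the bin holding US-source samples) is shown below; the bin names and proportions are illustrative.

```python
import random

def draw_from_bins(bins, desired_proportions, set_size, seed=0):
    """bins: dict name -> list of samples; desired_proportions: dict name -> fraction of set_size."""
    rng = random.Random(seed)
    selection = []
    for name, fraction in desired_proportions.items():
        take = min(int(round(fraction * set_size)), len(bins[name]))
        selection.extend(rng.sample(bins[name], take))
    return selection

# Example: draw_from_bins({"US": us, "Europe": eu, "Asia": asia},
#                         {"US": 0.10, "Europe": 0.45, "Asia": 0.45}, set_size=200)
```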
  • dynamic data set distribution optimization may be used as an anomaly detection system to support the quality assurance of data flowing through a real time data processing system.
  • system 1400 may be configured to include an anomaly scorer instead of a predictive model 1430 , and the data bins 1454 would be configured to represent a distribution of anomaly scores.
  • dynamic data set distribution optimization may be used to assess the predictive model 1430 calibration.
  • In a perfectly calibrated model, the model predictions exactly match reality.
  • if the model is a probabilistic estimator, the model should predict 50% yes and 50% no for a probability of 0.5; the model should predict 70% yes and 30% no for a probability of 0.7; and the like.
  • the empirical distribution within each data bin may be used to test the extent of the model 1430 calibration. A small sample of data may be pulled out of a bin for analysis, and the distribution of predictions in the sample may be used for the test. For example, if the data in the bin represent a probability of 0.1, the data distribution may be tested to determine if 10% of the predictions in the sample match that probability.
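  • The per-bin calibration test described above might be sketched as drawing a small sample from a bin and comparing the empirical positive rate against the probability the bin represents; the sample size and tolerance here are assumptions.

```python
import random

def bin_is_calibrated(bin_instances, bin_probability, sample_size=50, tolerance=0.05, seed=0):
    """bin_instances: list of (predicted_probability, actual_outcome) pairs, outcome in {0, 1}."""
    if not bin_instances:
        return False
    rng = random.Random(seed)
    sample = rng.sample(bin_instances, min(sample_size, len(bin_instances)))
    positive_rate = sum(outcome for _, outcome in sample) / len(sample)
    # e.g., for a bin representing probability 0.1, roughly 10% of sampled outcomes should be positive
    return abs(positive_rate - bin_probability) <= tolerance
```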
  • dynamic data set distribution optimization may be used to optimize feature modeling.
  • the accuracy of the decisions within a bin may be tested (e.g., in some embodiments, a sample of the decisions within a bin may be sent to a crowd for verification), and the results may be used to adjust the feature modeling performed by the input data analysis module 1420 .
  • FIG. 15 illustrates an example system 1500 that can be configured to implement dynamic optimization of a data set distribution.
  • system 1500 may include a data reservoir 1530 that has been discretized into multiple data bins ( 1534 A, 1534 B, . . . , 1534 X) based on a desired overall statistical distribution of data in the reservoir 1530 ; and a data set optimizer (e.g., data set optimizer 1440 described with reference to FIG. 14 ) that automatically maintains a fresh, up-to-date data reservoir 1530 with the desired distribution by receiving newly collected data and then determining whether to update the data reservoir 1530 using the newly collected data.
  • the system 1500 receives a data set optimization job 1505 that includes input data 1502 and configuration data 1504 .
  • the input data set 1502 may be a data stream, as previously described.
  • the configuration data 1504 may include a description of the discretized data reservoir 1530 (e.g., the configuration of the set of bins and, additionally and/or alternatively, a desired distribution of data across the set of bins).
  • the data set optimization job 1505 also may include an input data evaluator 1514 while, in some alternative embodiments, the input data evaluator 1514 may be a component of system 1500 .
  • the input data evaluator 1514 may be a supervised machine learning algorithm (e.g., a classifier).
  • evaluating an input data instance may include assigning the instance an evaluation value (e.g., a classification prediction confidence value as previously described with reference to FIG. 5 ).
  • each of the input data instances 1512 from the input data set 1502 is processed by the data set optimizer 1440 using the input data evaluator 1514 , and then the system determines whether the evaluated data instance 1522 is to be offered to any of the data bins 1534 in the data reservoir 1530 .
  • the evaluated data instance 1522 includes a prediction and/or prediction confidence value, and the determination is based at least in part on matching the prediction and/or prediction confidence value to attributes of the data that are respectively stored within each data bin 1534 .
  • each of the data bins 1534 is respectively associated with a reservoir sampler 1532 that maintains summary statistics of the distribution of data within the bin and determines whether to update the data bin 1534 based in part on those summary statistics.
  • the summary statistics may include a size capacity for the data bin 1534 (i.e., the maximum number of data instances that can be stored in the data bin) since the set of data bins is selected to represent a discretized overall distribution of the data reservoir 1530 .
  • each of the data bins 1534 may be associated with a particular range of evaluation values.
  • a reservoir sampler 1532 may determine that an evaluated data instance 1522 is to be added to a data bin 1534 if the evaluation value associated with the data instance is within the range of evaluation values associated with the bin and if the current bin size is below the bin size capacity. Additionally and/or alternatively, a reservoir sampler 1532 may determine that adding an evaluated data instance 1522 to a data bin 1534 will replace a data instance that currently is stored in the data bin 1534 .
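  • The per-bin reservoir sampler 1532 behavior described above can be sketched with classic reservoir sampling: an evaluated instance is stored while the bin is below its size capacity, and afterwards it replaces a randomly chosen stored instance with decreasing probability, keeping the bin fresh while preserving the discretized distribution. The evaluation-value range check and replacement rule shown are assumptions consistent with, but not identical to, the patent's description.

```python
import random

class BinReservoirSampler:
    def __init__(self, low, high, capacity, seed=0):
        self.low, self.high = low, high        # evaluation-value range for this bin (1534)
        self.capacity = capacity               # bin size capacity from the summary statistics
        self.items = []
        self.seen = 0
        self.rng = random.Random(seed)

    def offer(self, instance, evaluation_value):
        if not (self.low <= evaluation_value < self.high):
            return False                       # instance belongs to a different bin
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(instance)        # below capacity: always store
            return True
        j = self.rng.randrange(self.seen)      # replace an existing item with prob capacity/seen
        if j < self.capacity:
            self.items[j] = instance
            return True
        return False
```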
  • FIG. 16 shows a schematic block diagram of circuitry 1600 , some or all of which may be included in, for example, an adaptive oracle-trained learning framework 100 .
  • circuitry 1600 can include various means, such as processor 1602 , memory 1604 , communications module 1606 , and/or input/output module 1608 .
  • As used herein, the term "module" includes hardware, software and/or firmware configured to perform one or more particular functions.
  • circuitry 1600 as described herein may be embodied as, for example, circuitry, hardware elements (e.g., a suitably programmed processor, combinational logic circuit, and/or the like), a computer program product comprising computer-readable program instructions stored on a non-transitory computer-readable medium (e.g., memory 1604) that is executable by a suitably configured processing device (e.g., processor 1602), or some combination thereof.
  • Processor 1602 may, for example, be embodied as various means including one or more microprocessors with accompanying digital signal processor(s), one or more processor(s) without an accompanying digital signal processor, one or more coprocessors, one or more multi-core processors, one or more controllers, processing circuitry, one or more computers, various other processing elements including integrated circuits such as, for example, an ASIC (application specific integrated circuit) or FPGA (field programmable gate array), or some combination thereof. Accordingly, although illustrated in FIG. 16 as a single processor, in some embodiments, processor 1602 comprises a plurality of processors. The plurality of processors may be embodied on a single computing device or may be distributed across a plurality of computing devices collectively configured to function as circuitry 1600 .
  • the plurality of processors may be in operative communication with each other and may be collectively configured to perform one or more functionalities of circuitry 1600 as described herein.
  • processor 1602 is configured to execute instructions stored in memory 1604 or otherwise accessible to processor 1602 . These instructions, when executed by processor 1602 , may cause circuitry 1600 to perform one or more of the functionalities of circuitry 1600 as described herein.
  • processor 1602 may comprise an entity capable of performing operations according to embodiments of the present invention while configured accordingly.
  • For example, when processor 1602 is embodied as an ASIC, FPGA or the like, processor 1602 may comprise specifically configured hardware for conducting one or more operations described herein.
  • As another example, when processor 1602 is embodied as an executor of instructions, such as may be stored in memory 1604, the instructions may specifically configure processor 1602 to perform one or more algorithms and operations described herein, such as those discussed in connection with FIGS. 1-12.
  • Memory 1604 may comprise, for example, volatile memory, non-volatile memory, or some combination thereof. Although illustrated in FIG. 16 as a single memory, memory 1604 may comprise a plurality of memory components. The plurality of memory components may be embodied on a single computing device or distributed across a plurality of computing devices. In various embodiments, memory 1604 may comprise, for example, a hard disk, random access memory, cache memory, flash memory, a compact disc read only memory (CD-ROM), digital versatile disc read only memory (DVD-ROM), an optical disc, circuitry configured to store information, or some combination thereof. Memory 1604 may be configured to store information, data (including analytics data), applications, instructions, or the like for enabling circuitry 1600 to carry out various functions in accordance with example embodiments of the present invention.
  • memory 1604 is configured to buffer input data for processing by processor 1602 . Additionally or alternatively, in at least some embodiments, memory 1604 is configured to store program instructions for execution by processor 1602 . Memory 1604 may store information in the form of static and/or dynamic information. This stored information may be stored and/or used by circuitry 1600 during the course of performing its functionalities.
  • Communications module 1606 may be embodied as any device or means embodied in circuitry, hardware, a computer program product comprising computer readable program instructions stored on a computer readable medium (e.g., memory 1604 ) and executed by a processing device (e.g., processor 1602 ), or a combination thereof that is configured to receive and/or transmit data from/to another device, such as, for example, a second circuitry 1600 and/or the like.
  • communications module 1606 (like other components discussed herein) can be at least partially embodied as or otherwise controlled by processor 1602 .
  • communications module 1606 may be in communication with processor 1602 , such as via a bus.
  • Communications module 1606 may include, for example, an antenna, a transmitter, a receiver, a transceiver, network interface card and/or supporting hardware and/or firmware/software for enabling communications with another computing device. Communications module 1606 may be configured to receive and/or transmit any data that may be stored by memory 1604 using any protocol that may be used for communications between computing devices. Communications module 1606 may additionally or alternatively be in communication with the memory 1604 , input/output module 1608 and/or any other component of circuitry 1600 , such as via a bus.
  • Input/output module 1608 may be in communication with processor 1602 to receive an indication of a user input and/or to provide an audible, visual, mechanical, or other output to a user. Some example visual outputs that may be provided to a user by circuitry 1600 are discussed in connection with FIG. 1 .
  • input/output module 1608 may include support, for example, for a keyboard, a mouse, a joystick, a display, a touch screen display, a microphone, a speaker, a RFID reader, barcode reader, biometric scanner, and/or other input/output mechanisms.
  • In some embodiments, such as where circuitry 1600 is embodied as a server or database, aspects of input/output module 1608 may be reduced as compared to embodiments where circuitry 1600 is implemented as an end-user machine or other type of device designed for complex user interactions. In some embodiments, input/output module 1608 (like other components discussed herein) may even be eliminated from circuitry 1600.
  • Alternatively, such as in embodiments where circuitry 1600 is embodied as a server or database, at least some aspects of input/output module 1608 may be embodied on an apparatus used by a user that is in communication with circuitry 1600.
  • Input/output module 1608 may be in communication with the memory 1604 , communications module 1606 , and/or any other component(s), such as via a bus.
  • Although more than one input/output module and/or other component can be included in circuitry 1600, only one is shown in FIG. 16 to avoid overcomplicating the drawing (as with the other components discussed herein).
  • Adaptive learning module 1610 may also or instead be included and configured to perform the functionality discussed herein related to the adaptive oracle-trained learning framework discussed above. In some embodiments, some or all of the functionality of adaptive learning module 1610 may be performed by processor 1602. In this regard, the example processes and algorithms discussed herein can be performed by at least one processor 1602 and/or adaptive learning module 1610.
  • non-transitory computer readable media can be configured to store firmware, one or more application programs, and/or other software, which include instructions and other computer-readable program code portions that can be executed to control each processor (e.g., processor 1602 and/or adaptive learning module 1610 ) of the components of system 400 to implement various operations, including the examples shown above.
  • a series of computer-readable program code portions are embodied in one or more computer program products and can be used, with a computing device, server, and/or other programmable apparatus, to produce machine-implemented processes.
  • Any such computer program instructions and/or other type of code may be loaded onto a computer, processor or other programmable apparatus's circuitry to produce a machine, such that the computer, processor, or other programmable circuitry that executes the code on the machine creates the means for implementing various functions, including those described herein.
  • all or some of the information presented by the example displays discussed herein can be based on data that is received, generated and/or maintained by one or more components of adaptive oracle-trained learning framework 100 .
  • one or more external systems such as a remote cloud computing and/or data storage system may also be leveraged to provide at least some of the functionality discussed herein.
  • embodiments of the present invention may be configured as methods, mobile devices, backend network devices, and the like. Accordingly, embodiments may comprise various means, including entirely hardware or any combination of software and hardware. Furthermore, embodiments may take the form of a computer program product on at least one non-transitory computer-readable storage medium having computer-readable program instructions (e.g., computer software) embodied in the storage medium. Any suitable computer-readable storage medium may be utilized including non-transitory hard disks, CD-ROMs, flash memory, optical storage devices, or magnetic storage devices.
  • These computer program instructions may also be stored in a computer-readable storage device (e.g., memory 1604 ) that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable storage device produce an article of manufacture including computer-readable instructions for implementing the function discussed herein.
  • the computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer-implemented process such that the instructions that execute on the computer or other programmable apparatus provide steps for implementing the functions discussed herein.
  • blocks of the block diagrams and flowchart illustrations support combinations of means for performing the specified functions, combinations of steps for performing the specified functions and program instruction means for performing the specified functions. It will also be understood that each block of the circuit diagrams and process flowcharts, and combinations of blocks in the circuit diagrams and process flowcharts, can be implemented by special purpose hardware-based computer systems that perform the specified functions or steps, or combinations of special purpose hardware and computer instructions.

Abstract

In general, embodiments of the present invention provide systems, methods and computer readable media for an adaptive oracle-trained learning framework for automatically building and maintaining models that are developed using machine learning algorithms. In embodiments, the framework leverages at least one oracle (e.g., a crowd) for automatic generation of high-quality training data to use in deriving a model. Once a model is trained, the framework monitors the performance of the model and, in embodiments, leverages active learning and the oracle to generate feedback about the changing data for modifying training data sets while maintaining data quality to enable incremental adaptation of the model.

Description

    CROSS-REFERENCE TO RELATED APPLICATIONS
  • This application is a continuation of and claims priority to U.S. application Ser. No. 14/578,210, titled “PROCESSING DYNAMIC DATA WITHIN AN ADAPTIVE ORACLE-TRAINED LEARNING SYSTEM USING DYNAMIC DATA SET DISTRIBUTION OPTIMIZATION,” filed Dec. 19, 2014, which claims the benefit of U.S. Provisional Application No. 61/920,251, entitled “PROCESSING DYNAMIC DATA USING AN ADAPTIVE CROWD-TRAINED LEARNING SYSTEM,” and filed Dec. 23, 2013, of U.S. Provisional Application No. 62/039,314, entitled “DYNAMICALLY OPTIMIZING A DATA SET DISTRIBUTION,” and filed Aug. 19, 2014, and of U.S. Provisional Application No. 62/055,958, entitled “DYNAMICALLY OPTIMIZING A DATA SET DISTRIBUTION,” and filed Sep. 26, 2014, the entireties of which are hereby incorporated by reference.
  • FIELD
  • Embodiments of the invention relate, generally, to an adaptive system for building and maintaining machine learning models.
  • BACKGROUND
  • A system that automatically identifies new businesses based on data sampled from a data stream representing data collected from a variety of online sources (e.g., websites, blogs, and social media) is an example of a system that processes dynamic data. Analysis of such dynamic data typically is based on data-driven models that depend on consistent data, yet dynamic data are inherently inconsistent in both content and quality.
  • Current methods for building and maintaining models that process dynamic data exhibit a plurality of problems that make current systems insufficient, ineffective and/or the like. Through applied effort, ingenuity, and innovation, solutions to improve such methods have been realized and are described in connection with embodiments of the present invention.
  • SUMMARY
  • In general, embodiments of the present invention provide herein systems, methods and computer readable media for building and maintaining machine learning models that process dynamic data.
  • Data quality fluctuations may affect the performance of a data-driven model, and, in some cases when the data quality and/or statistical distribution of the data has changed over time, the model may have to be replaced by a different model that more closely fits the changed data. Obtaining a set of accurately distributed, high-quality training data instances for derivation of a model is difficult, time-consuming, and/or expensive. Typically, high-quality training data instances are data that accurately represent the task being modeled, and that have been verified and labeled by at least one reliable source of truth (an oracle, hereinafter) to ensure their accuracy.
  • There is a declarative framework/architecture for clear definition of the end goal for the output data. The framework enables end-users to declare exactly what they want (i.e., high-quality data) without having to understand how to produce such data. Once a model has been derived from an initial training data set, being able to perform real time monitoring of the performance of the model as well as to perform data quality assessments on dynamic data as it is being collected can enable updating of the training data set so that the model may be adapted incrementally to fluctuations of quality and/or statistical distribution of dynamic data. Incremental adaptation of a model reduces the costs involved in repeatedly replacing the model.
  • As such, and according to some example embodiments, the systems and methods described herein are therefore configured to implement an adaptive oracle-trained learning framework for automatically building and maintaining models that are developed using machine learning algorithms. In embodiments, the framework leverages at least one oracle (e.g., a crowd) for automatic generation of high-quality training data to use in deriving a model. Once a model is trained, the framework monitors the performance of the model and, in embodiments, leverages active learning and the oracle to generate feedback about the changing data for modifying training data sets while maintaining data quality to enable incremental adaptation of the model.
  • The details of one or more embodiments of the subject matter described in this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.
  • BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWING(S)
  • Having thus described the invention in general terms, reference will now be made to the accompanying drawings, which are not necessarily drawn to scale, and wherein:
  • FIG. 1 illustrates a first embodiment of an example system that can be configured to implement an adaptive oracle-trained learning framework for automatically building and maintaining a predictive machine learning model in accordance with some embodiments discussed herein;
  • FIG. 2 is a flow diagram of an example method for automatically generating an initial predictive model and a high-quality training data set used to derive the model within an adaptive oracle-trained learning framework in accordance with some embodiments discussed herein;
  • FIG. 3 illustrates an exemplary process for automatically determining whether an input multi-dimensional data instance is an optimal choice for labeling and inclusion in at least one initial training data set using an adaptive oracle-trained learning framework in accordance with some embodiments discussed herein;
  • FIG. 4 is a flow diagram of an example method for determining whether an input multi-dimensional data instance is an optimal choice for labeling and inclusion in at least one initial training data set in accordance with some embodiments discussed herein;
  • FIG. 5 is a flow diagram of an example method 500 for adaptive processing of input data by an adaptive learning framework in accordance with some embodiments discussed herein;
  • FIG. 6 illustrates a second embodiment of an example system that can be configured to implement an adaptive oracle-trained learning framework for automatically building and maintaining a predictive machine learning model in accordance with some embodiments discussed herein;
  • FIG. 7 is a flow diagram of an example method for adaptive maintenance of a predictive model for optimal processing of dynamic data in accordance with some embodiments discussed herein;
  • FIG. 8 is a flow diagram of an example method for dynamically updating a model core group of clusters along a single dimension k in accordance with some embodiments discussed herein;
  • FIG. 9 is a flow diagram of an example method for dynamically updating a cluster along a single dimension k in accordance with some embodiments discussed herein;
  • FIG. 10 illustrates a diagram in which an exemplary dynamic data quality assessment system is configured as a quality assurance component within an adaptive oracle-trained learning framework in accordance with some embodiments discussed herein;
  • FIG. 11 is a flow diagram of an example method for automatic dynamic data quality assessment of dynamic input data being analyzed using an adaptive predictive model in accordance with some embodiments discussed herein;
  • FIG. 12 is a flow diagram of an example method for using active learning for processing potential training data for a machine-learning algorithm in accordance with some embodiments discussed herein;
  • FIG. 13 is an illustration of various different effects of active learning and dynamic data quality assessment on selection of new data samples to be added to an exemplary training data set for a binary classification model in accordance with some embodiments discussed herein;
  • FIG. 14 illustrates a third embodiment of an example system that can be configured to implement an adaptive oracle-trained learning framework for automatically building and maintaining a predictive machine learning model in accordance with some embodiments discussed herein;
  • FIG. 15 illustrates an example system that can be configured to implement dynamic optimization of a data set distribution in accordance with some embodiments discussed herein; and
  • FIG. 16 illustrates a schematic block diagram of circuitry that can be included in a computing device, such as an adaptive learning system, in accordance with some embodiments discussed herein.
  • DETAILED DESCRIPTION
  • The present invention now will be described more fully hereinafter with reference to the accompanying drawings, in which some, but not all embodiments of the inventions are shown. Indeed, these inventions may be embodied in many different forms and should not be construed as being limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will satisfy applicable legal requirements. Like numbers refer to like elements throughout.
  • As described herein, system components can be communicatively coupled to one or more of each other. Though the components are described as being separate or distinct, two or more of the components may be combined into a single process or routine. The component functional descriptions provided herein including separation of responsibility for distinct functions is by way of example. Other groupings or other divisions of functional responsibilities can be made as necessary or in accordance with design preferences.
  • As used herein, the terms “data,” “content,” “information” and similar terms may be used interchangeably to refer to data capable of being captured, transmitted, received, displayed and/or stored in accordance with various example embodiments. Thus, use of any such terms should not be taken to limit the spirit and scope of the disclosure. Further, where a computing device is described herein to receive data from another computing device, the data may be received directly from the another computing device or may be received indirectly via one or more intermediary computing devices, such as, for example, one or more servers, relays, routers, network access points, base stations, and/or the like. Similarly, where a computing device is described herein to send data to another computing device, the data may be sent directly to the another computing device or may be sent indirectly via one or more intermediary computing devices, such as, for example, one or more servers, relays, routers, network access points, base stations, and/or the like.
  • Data being continuously sampled from a data stream representing data collected from a variety of online sources (e.g., websites, blogs, and social media) is an example of dynamic data. A system that automatically performs email fraud identification based on data sampled from a data stream is an example of a system that processes dynamic data. Analysis of such dynamic data typically is based on data-driven models that can be generated using machine learning. One type of machine learning is supervised learning, in which a statistical predictive model is derived based on a training data set of examples representing the modeling task to be performed.
  • The statistical distribution of the set of training data instances should be an accurate representation of the distribution of data that will be input to the model for processing. Additionally, the composition of a training data set should be structured to provide as much information as possible to the model. However, dynamic data is inherently inconsistent. The quality of the data sources may vary, the quality of the data collection methods may vary, and, in the case of data being collected continuously from a data stream, the overall quality and statistical distribution of the data itself may vary over time.
  • Data quality fluctuations may affect the performance of a data-driven model, and, in some cases when the data quality and/or statistical distribution of the data has changed over time, the model may have to be replaced by a different model that more closely fits the changed data. Obtaining a set of accurately distributed, high-quality training data instances for derivation of a model is difficult, time-consuming, and/or expensive. Typically, high-quality training data instances are data that accurately represent the task being modeled, and that have been verified and labeled by at least one oracle to ensure their accuracy. Once a model has been derived from an initial training data set, being able to perform real time monitoring of the performance of the model as well as to perform data quality assessments on dynamic data as it is being collected can enable updating of the training data set so that the model may be adapted incrementally to fluctuations of quality and/or statistical distribution of dynamic data. Incremental adaptation of a model reduces the costs involved in repeatedly replacing the model.
  • As such, and according to some example embodiments, the systems and methods described herein are therefore configured to implement an adaptive oracle-trained learning framework for automatically building and maintaining machine learning models that are developed using machine learning algorithms. In embodiments, the framework leverages at least one oracle (e.g., a crowd) for automatic generation of high-quality training data to use in deriving a model. Once a model is trained, the framework monitors the performance of the model and, in embodiments, leverages active learning and the oracle to generate feedback about the changing data for modifying training data sets while maintaining data quality to enable incremental adaptation of the model.
  • Particular embodiments of the subject matter described in this specification can be implemented so as to realize one or more of the following advantages. The framework is designed to provide high-quality data for less cost than current state-of-the-art machine learning algorithms/processes across many real-world data sets. No initial training/testing phase is needed to generate a model. No expert human involvement is needed to initially construct and over time maintain the training set and retrain the model. The framework continues to provide high quality output data even if the input data change, since the framework determines how and when to adjust the training data set for incremental re-training of the model, and the framework can rely on verified data from an oracle (e.g., crowd sourced data) while the model is being re-trained. The framework has the ability to utilize any high-quality/oracle-provided data, regardless of how the data was generated (e.g., the framework can make use of data that was not collected as part of the training process, such as a separate process in an organization using an oracle to collect correct categories for businesses).
  • There is a declarative framework/architecture for clear definition of the end goal for the output data. The framework enables end-users to declare exactly what they want (i.e., high-quality data) without having to understand how to produce such data. The system takes care of not only training the model transparently (as described above), but also deciding for every input data instance if the system should get the answer from the oracle or from a model. All of the details of machine learning models and the accessing of an oracle (e.g., crowd-sourcing) are hidden from the user—the system may not even utilize a full-scale machine learning model or an oracle as long as it can meet its quality requirements.
  • FIG. 1 illustrates a first embodiment of an example system that can be configured to implement an adaptive oracle-trained learning framework 100 for automatically building and maintaining a predictive machine learning model. In embodiments, an adaptive oracle-trained learning framework 100 comprises a predictive model 130 (e.g., a classifier) that has been generated using machine learning based on a set of training data 120, and that is configured to generate a judgment about unlabeled input data 105 in response to receiving a feature representation of the input data 105; an input data analysis component 110 for generating a feature representation of the input data 105; an accuracy assessment component 135 for providing an estimated assessment of the accuracy of the judgment of the input data and/or the quality of the input data 105; an active labeler 140 to facilitate the generation and maintenance of optimized training data 120 by identifying possible updates to the training data 120; at least one oracle 150 (e.g., a crowd, a flat file of data verification results previously received from one or more oracles, and/or data verification software) for providing a verified true label for input data 105 identified by the active labeler 140; a labeled data reservoir 155 for storing input data 105 that have received true labels from the oracle 150; and an accuracy assurance component 160 for determining whether the system output processed data 165 satisfies an accuracy threshold.
  • In embodiments, the predictive model 130 is a trainable model that is derived from the training data 120 using supervised learning. An exemplary trainable model (e.g., a trainable classifier) is adapted to represent a particular task (e.g., a binary classification task in which a classifier model returns a judgment as to which of two groups an input data instance 105 most likely belongs) using a set of training data 120 that consists of examples of the task being modeled. Referring to the exemplary binary classification task, each training example in a training data set from which the classifier is derived may represent an input to the classifier that is labeled representing the group to which the input data instance belongs.
  • Supervised learning is considered to be a data-driven process, because the efficiency and accuracy of deriving a model from a set of training data is dependent on the quality and composition of the set of training data. As discussed previously, obtaining a set of accurately distributed, high-quality training data instances typically is difficult, time-consuming, and/or expensive. For example, the training data set examples for a classification task should be balanced to ensure that all class labels are adequately represented in the training data. Credit card fraud detection is an example of a classification task in which examples of fraudulent transactions may be rare in practice, and thus verified instances of these examples are more difficult to collect for training data.
  • In some embodiments, an initial predictive model and a high-quality training data set used to derive the model via supervised learning may be generated automatically within an adaptive oracle-trained learning framework (e.g., framework 100) by processing a stream of unlabeled dynamic data.
  • FIG. 2 is a flow diagram of an example method 200 for automatically generating an initial predictive model and a high-quality training data set used to derive the model within an adaptive oracle-trained learning framework. For convenience, the method 200 will be described with respect to a system that includes one or more computing devices and performs the method 200. Specifically, the method 200 will be described with respect to processing of dynamic data by an adaptive oracle-trained learning framework 100.
  • In embodiments, a framework 100 is configured initially 205 to include an untrained predictive model 130 and an empty training data set 120. In some embodiments, at framework setup, the framework 100 is assigned 210 an input configuration parameter describing a desired accuracy A for processed data 165 to be output from the framework 100. In some embodiments, the desired accuracy A may be a minimum accuracy threshold to be satisfied for each processed data instance 165 to be output from the framework while, in some alternative embodiments, the desired accuracy A may be an average accuracy to be achieved for a set of processed data 165. The values chosen to describe the desired accuracy A for sets of processed data across various embodiments may vary.
  • In some embodiments, an initially configured adaptive oracle-trained learning framework 100 that includes an untrained model and empty training data set may be “cold started” 215 by streaming unlabeled input data instances 105 into the system for processing. The model 130 and training data 120 are then adaptively updated 230 by the framework 100 until the processed data instances 165 produced by the model 130 consistently achieve 225 the desired accuracy A as specified by the single input configuration parameter (i.e., the process ends 235 when the system reaches a “steady state”).
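  • The following Python sketch is one hedged reading of this cold-start behavior, assuming a scikit-learn-style model with fit/predict, a callable oracle that returns a verified label, and a rolling window of 100 instances for judging steady state; none of these specifics come from the disclosure, and retraining from scratch on every instance is a deliberate simplification.

```python
def cold_start(stream, oracle, model, desired_accuracy, window=100):
    """Stream unlabeled instances into an initially untrained framework, using the
    oracle for every answer until the model's rolling agreement with the oracle
    reaches the single configured accuracy parameter."""
    training_X, training_y = [], []
    agreement = []                                   # rolling model/oracle agreement

    for x in stream:
        trained = len(set(training_y)) > 1           # model has seen more than one label
        prediction = model.predict([x])[0] if trained else None

        true_label = oracle(x)                       # verified true label
        training_X.append(x)
        training_y.append(true_label)
        if prediction is not None:
            agreement.append(prediction == true_label)

        if len(set(training_y)) > 1:                 # adaptively update the model
            model.fit(training_X, training_y)

        recent = agreement[-window:]
        if len(recent) == window and sum(recent) / window >= desired_accuracy:
            break                                    # steady state reached
    return model
```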
  • In some alternative embodiments, one or more high-quality initial training data sets may be generated automatically from a pool of unlabeled data instances. In some embodiments, the unlabeled data instances are dynamic data that have been collected previously from at least one data stream during at least one time window. In some embodiments, the collected data instances are multi-dimensional data, where each data instance is assumed to be described by a set of attributes (i.e., features hereinafter). In some embodiments, the input data analysis component 110 performs a distribution-based feature analysis of the collected data. In some embodiments, the feature analysis includes clustering the collected data instances into homogeneous groups across multiple dimensions using an unsupervised learning approach that is dependent on the distribution of the input data as described, for example, in U.S. patent application Ser. No. 14/038,661 entitled “Dynamic Clustering for Streaming Data,” filed on Sep. 16, 2013, and which is incorporated herein in its entirety. In some embodiments, the clustered data instances are sampled uniformly across the different homogeneous groups, and the sampled data instances are sent to an oracle 150 (as shown in FIG. 1) for labeling.
  • FIGS. 3 and 4 respectively illustrate and describe a flowchart for an exemplary method 400 for automatically determining whether an input multi-dimensional data instance is an optimal choice for labeling and inclusion in at least one initial training data set using an adaptive oracle-trained learning framework 100. The depicted method 400 is described with respect to a system that includes one or more computing devices and performs the method 400.
  • In embodiments, the system receives an input multi-dimensional data instance having k attributes 405. Determining whether an input multi-dimensional data instance is a preferred choice for labeling and inclusion in at least one initial training data set 420 is based in part on an operator estimation score and/or on a global estimation score assigned to the data instance.
  • Turning to FIG. 3 for illustration, in embodiments, an input multi-dimensional data instance having k attributes is represented by a feature vector x 305 having k elements (x1, x2, . . . , xk), where each element in feature vector x represents the value of a corresponding attribute. Each of the elements is assigned to a particular cluster/distribution of the corresponding attribute using a clustering/distribution algorithm 320 (e.g., dynamic clustering as described in U.S. patent application Ser. No. 14/038,661).
  • In embodiments, an operator estimate 302 is calculated 410 (as shown in FIG. 4) for each feature. An operator represents a single data cleaning manipulation action applied to a feature. Each operator (e.g., normalization) has at most one statistical model to power its cleaning of the data. In some embodiments, an operator estimate 302 may include multiple operators chained together.
  • Using an input from a clustering/distribution algorithm 320 respectively associated with each operator estimate, a classifier 330, implementing a per operator estimator trained on the distribution, then determines a per operator estimate confidence value estimating probability Pn(x|T), a probability based on the operator estimator n that the feature vector x belongs to the cluster/distribution T of multi-dimensional data instance feature vectors to which it has been assigned. The data instance is assigned an operator estimation score representing the values of the set of per operator estimates 360. For example, referring to the exemplary binary classification task, a higher operator estimation score indicates that the data instance would be assigned to one of the two classes by a binary classifier with a greater degree of confidence/certainty because the data instance is at a greater distance from the decision boundary of the classification task. Conversely, a lower operator estimation score indicates that the assignment of the data instance to one of the classes by the binary classifier would be at a lower degree of confidence/certainty because the data instance is located close to or at the decision boundary for the classification task.
  • In some embodiments, the data instance, represented by feature vector x 305, is assigned to each of a group of N global datasets 310 containing data instances of the same type as the input data instance, and an estimated distribution 312 is calculated for each dataset. In some embodiments, the group of N global datasets 310 have varying timeline-based sizes (e.g., each dataset respectively represents a set of data instances collected during a weekly, monthly, or quarterly time window). Using an input from a clustering/distribution algorithm 340 respectively associated with each of the group of datasets, a classifier 350 implementing a per dataset estimator trained on each distribution determines a per dataset global estimate confidence value estimating probability PG(x|DY), a probability that the input data instance belongs to the global distribution represented by its associated dataset Y. The input data instance is assigned 415 a global estimation score representing the values of the set of per dataset global estimates 370. A data instance having a higher global estimation score is more likely to belong to a global distribution of data instances of the same type.
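  • As a concrete and purely illustrative sketch of how the two scores might be computed and used together, the snippet below treats each per-operator estimator and each per-dataset estimator as a callable returning the relevant probability; the averaging and the thresholded selection rule are assumptions, since the disclosure does not fix a particular combination.

```python
import numpy as np

def operator_estimation_score(x, operator_estimators):
    """Aggregate the per-operator estimates P_n(x|T) for feature vector x.
    Each estimator is assumed to return the probability that x belongs to the
    cluster/distribution T to which it has been assigned."""
    return float(np.mean([estimate(x) for estimate in operator_estimators]))

def global_estimation_score(x, dataset_estimators):
    """Aggregate the per-dataset estimates P_G(x|D_Y) over the N global datasets
    (e.g., weekly, monthly, and quarterly collection windows)."""
    return float(np.mean([estimate(x) for estimate in dataset_estimators]))

def is_labeling_candidate(x, operator_estimators, dataset_estimators,
                          operator_threshold=0.5, global_threshold=0.5):
    """One plausible selection rule: a low operator score (instance near the
    decision boundary, hence informative) together with a high global score
    (instance typical of the overall distribution) marks the instance as a
    preferred choice to send to the oracle for labeling."""
    return (operator_estimation_score(x, operator_estimators) <= operator_threshold
            and global_estimation_score(x, dataset_estimators) >= global_threshold)
```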
  • Returning to FIG. 1, once the model 130 is derived, in some embodiments, the framework 100 may further optimize the initial training data 120 by processing the training data set examples using the model 130, monitoring the performance of the model 130 during the processing, and then adjusting the input data feature representation and/or the composition and/or distribution of the training dataset based on an analysis of the model's performance.
  • In some embodiments, a predictive model 130 and training data 120 deployed within an adaptive oracle-trained learning framework 100 for processing dynamic data may be updated incrementally in response to changes in the quality and/or characteristics of the dynamic data to achieve optimal processing of newly received input data 105. In embodiments, an input data instance 105 may be selected by the framework as a potential training example based on an accuracy assessment determined from the model output generated from processing the input data instance 105 and/or attributes of the input data instance. Selected data instances receive true labels from at least one oracle 150, and are stored in a labeled data reservoir 155. In embodiments, the training data 120 are updated using labeled data selected from the labeled data reservoir 155.
  • FIG. 5 is a flow diagram of an example method 500 for adaptive processing of input data by an adaptive learning framework. The method 500 is described with respect to a system that includes one or more computing devices that process dynamic data by an adaptive oracle-trained learning framework 100. For clarity and without limitation, method 500 will be described for an exemplary system in which the predictive model 130 is a trainable classifier.
  • In embodiments, the system receives 505 model output (i.e., a judgment) from a classifier model (e.g., model 130) that has processed an input data instance 105. Exemplary model output may be a predicted label representing a category/class to which the input data instance is likely to belong. In some embodiments, the judgment includes a confidence value that represents the certainty of the judgment. For example, if the input data instance is very different from any of the training data instances, the model output that is generated from that input data has a low confidence. The confidence value may be defined by any well-known distance metric (e.g., Euclidean distance, cosine, Jaccard distance). In some embodiments, an associated judgment confidence value may be a confidence score.
  • Referring to the example in which the classification task is a binary classification task, the judgment may be based on the model performing a mapping of the input data instance feature set into a binary decision space representing the task parameters, and the associated judgment confidence value may be a confidence score representing the distance in the binary decision space between the mapping of the data instance feature set and a decision boundary at the separation of the two classes in the decision space. A mapping located at a greater distance from the decision boundary may be associated with a higher confidence score, representing a class assignment predicted at a greater confidence/certainty. Conversely, a mapping that is located close to the decision boundary may be associated with a lower confidence score, representing a class assignment predicted at a lower confidence/certainty.
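  • A minimal sketch of such a confidence score appears below, assuming a scikit-learn linear classifier whose decision_function gives the signed distance of an instance from the decision boundary (up to scaling by the weight norm); the toy training data are invented for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data for a binary classification task.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])

clf = LogisticRegression().fit(X_train, y_train)

def judge(x):
    """Return (predicted label, confidence score) for one input instance.
    The confidence is the unsigned distance of the instance from the decision
    boundary in the classifier's decision space; larger means more certain."""
    x = np.asarray(x).reshape(1, -1)
    label = int(clf.predict(x)[0])
    margin = clf.decision_function(x)[0]   # signed distance to the decision boundary
    return label, abs(margin)

label, confidence = judge([0.5, 0.55])     # near the boundary, so confidence is low
```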
  • In embodiments, the system executes 510 an accuracy assessment of the model output and/or the input data instance quality. In some embodiments, the accuracy assessment is an accuracy value representing the accuracy of the model judgment.
  • In some embodiments, accuracy assessment may include one or a combination of model-dependent and model-independent analytics. In some embodiments in which the model judgment includes a confidence score, accuracy assessment may include that confidence score directly. In some embodiments, a second predictive model may be used to estimate the framework model accuracy on a per-instance level. For example, a random sample of data instances labeled by the framework model can be sent to the oracle for verification, and that sample then can be used as training data to train a second model to predict the probability that the framework model judgment is correct.
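  • The second-model idea can be sketched as follows, assuming scikit-learn-style models, a callable oracle, and a verified sample that contains both correct and incorrect framework judgments; reusing the instance features unchanged as the estimator's inputs is an assumption, not something specified by the disclosure.

```python
import random
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_accuracy_estimator(framework_model, labeled_pool, oracle, sample_size=200):
    """Train a second model that estimates P(framework judgment is correct) for a
    given instance. `labeled_pool` holds feature vectors the framework model has
    already judged; `oracle` returns the verified true label for an instance."""
    pool = list(labeled_pool)
    sample = random.sample(pool, min(sample_size, len(pool)))
    X = np.array(sample)
    predicted = framework_model.predict(X)
    verified = np.array([oracle(x) for x in sample])
    correct = (predicted == verified).astype(int)   # 1 where the framework was right
    # Assumes the verified sample contains both correct and incorrect judgments.
    estimator = LogisticRegression().fit(X, correct)
    return estimator

# estimator.predict_proba(x)[:, 1] then approximates the per-instance probability
# that the framework model's judgment on x is correct.
```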
  • In some embodiments, accuracy assessment is implemented by a quality assurance component 160 to generate an aggregate/moving window estimate of accuracy. In some embodiments, the quality assurance component 160 is configured as a dynamic data quality assessment system described, for example, in U.S. patent application Ser. No. 14/088,247 entitled “Automated Adaptive Data Analysis Using Dynamic Data Quality Assessment,” filed on Nov. 22, 2013, and which is incorporated herein in its entirety. An exemplary dynamic quality assessment system is described in detail with reference to FIG. 10 and method 700 of FIG. 7.
  • In embodiments, the system analyzes 515 the assessed model output and input data instance by determining whether the input data instance should be selected for potential inclusion in the training data set 120. In an instance in which the input data instance is selected 520 as a possible training example, the system sends the instance to an oracle for true labeling.
  • In some embodiments, the analysis (“active labeling” hereinafter) includes active learning. Active learning, as described, for example, in Settles, Burr (2009), “Active Learning Literature Survey”, Computer Sciences Technical Report 1648, University of Wisconsin—Madison, is a semi-supervised learning process in which the distribution of the training data set instances can be adjusted to optimally represent a machine learning problem. For example, a machine-learning algorithm may achieve greater accuracy with fewer training examples if the selected training data set instances are instances that will provide maximum information to the model about the problem. Referring to the trainable classifier example, data instances that may provide maximum information about a classification task are data instances that result in mappings in decision space that are closer to the decision boundary. In some embodiments, these data instances may be identified automatically through active labeling analysis because their judgments are associated with lower confidence scores, as previously described.
  • Additionally and/or alternatively, in some embodiments, the determination of whether the input data instance should be selected for potential inclusion in the training data set 120 may include a data quality assessment. In some embodiments, active labeling analysis may be based on a combination of model prediction accuracy and data quality.
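  • A hedged sketch of this active-labeling selection follows, combining judgment confidence with a data quality measure; judge and quality_score are assumed callables (the judge() sketch shown earlier would fit), and the thresholds are illustrative.

```python
def select_for_oracle(batch, judge, quality_score,
                      confidence_threshold=0.2, quality_threshold=0.5):
    """Active-labeling sketch: pick instances whose judgments are least certain
    (closest to the decision boundary) and whose measured data quality is
    acceptable, so the oracle's effort buys maximum information for the model."""
    candidates = []
    for x in batch:
        label, confidence = judge(x)
        if confidence < confidence_threshold and quality_score(x) >= quality_threshold:
            candidates.append((confidence, x))
    candidates.sort(key=lambda pair: pair[0])        # most uncertain instances first
    return [x for _, x in candidates]
```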
  • In some embodiments, in response to receiving a labeled data instance from the oracle, the system stores 530 the labeled data instance in a labeled data reservoir 155, from which new training data instances may be selected for updates to training data 120. In some embodiments, the labeled data reservoir grows continuously as labeled data instances are received by the system and then stored.
  • In embodiments, the system outputs 545 the labeled data instance before the process ends 550. The true label assigned to the data instance by the oracle ensures the accuracy of the output, regardless of the outcome of the accuracy assessment of the model performance and/or the input data instance quality.
  • In an instance in which the input data instance is not selected 520 as a possible training example, in embodiments, the system sends 535 the assessed input data instance and the model output for accuracy assurance. In some embodiments, as previously described, accuracy assurance may include determining whether the assessed input data instance and the model output satisfy a desired accuracy A that has been received as a declarative configuration parameter by the system.
  • In an instance in which the desired accuracy is satisfied 540, the system outputs 545 the processed data instance and the process ends 550.
  • In an instance in which the desired accuracy is not satisfied 540, in embodiments, the system sends 525 the input data instance to the oracle for true labeling. In some embodiments, the labeled data instance is added 530 to the data reservoir and then output 545 before the process ends 550, as previously described.
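  • Putting the branches of method 500 together, the following sketch shows one way the per-instance routing could be wired; every callable here (assess_accuracy, select_for_labeling, oracle) is an assumed interface standing in for the corresponding framework component, not an API defined by the disclosure.

```python
def process_instance(x, model, assess_accuracy, select_for_labeling, oracle,
                     labeled_reservoir, desired_accuracy):
    """One-pass sketch of the per-instance flow of method 500."""
    judgment = model.predict([x])[0]
    accuracy = assess_accuracy(x, judgment)          # accuracy assessment step

    if select_for_labeling(x, judgment, accuracy):   # active-labeling decision
        true_label = oracle(x)                       # true labeling by the oracle
        labeled_reservoir.append((x, true_label))    # grow the labeled data reservoir
        return true_label                            # output is accurate by construction

    if accuracy >= desired_accuracy:                 # accuracy assurance passes
        return judgment

    true_label = oracle(x)                           # assurance failed: fall back to oracle
    labeled_reservoir.append((x, true_label))
    return true_label
```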
  • FIG. 6 illustrates a second embodiment of an example system that can be configured to implement an adaptive oracle-trained learning framework 600 for automatically building and maintaining a predictive machine learning model. In embodiments, an adaptive oracle-trained learning framework 600 comprises a predictive model 630 (e.g., a classifier) that has been generated using machine learning based on a set of training data 620, and that is configured to generate a judgment about the input data 605 in response to receiving a feature representation of the input data 605; an input data analysis component 610 for generating a feature representation of the input data 605 and maintaining optimized, high-quality training data 620; a quality assurance component 660 for assessment of the quality of the input data 605 and of the quality of the judgments of the predictive model 630; an active learning component 640 to facilitate the generation and maintenance of optimized training data 620; and at least one oracle 650 (e.g., a crowd, a flat file of data verification results previously received from one or more oracles, and/or data verification software) for providing a verified quality measure for the input data 605 and its associated judgment.
  • In embodiments, new unlabeled data instances 605, sharing the particular type of the examples in the training data set 620, are input to the framework 600 for processing by the predictive model 630. For example, in some embodiments, each new data instance 605 may be multi-dimensional data collected from one or more online sources describing a particular business (e.g., a restaurant, a spa), and the predictive model 630 may be a classifier that returns a judgment as to which of a set of categories the business belongs.
  • In embodiments, the predictive model 630 generates a judgment (e.g., an identifier of a category) in response to receiving a feature representation of an unlabeled input data instance 605. In some embodiments, the feature representation is generated during input data analysis 610 using a distribution-based feature analysis, as previously described. In some embodiments, the judgment generated by the predictive model 630 includes a confidence value. For example, in some embodiments in which the predictive model 630 is performing a classification task, the confidence value included with a classification judgment is a score representing the distance in decision space of the judgment from the task decision boundary, as previously described with reference to FIG. 3. Classification judgments that are more certain are associated with higher confidence scores because those judgments are at greater distances in decision space from the task decision boundary.
  • In some embodiments, a quality assurance component 660 monitors the quality of the predictive model performance as well as the quality of the input data being processed. The processed data 665 and, in some embodiments, an associated judgment are output from the framework 600 if they are determined to satisfy a quality threshold.
  • FIG. 7 is a flow diagram of an example method 700 for adaptive maintenance of a predictive model for optimal processing of dynamic data. For convenience, the method 700 will be described with respect to a system that includes one or more computing devices and performs the method 700. Specifically, the method 700 will be described with respect to processing of dynamic data by an adaptive oracle-trained learning framework 600. For clarity and without limitation, method 700 will be described for an exemplary system in which the predictive model 630 is a trainable classifier.
  • In embodiments, the system receives 705 a classification judgment about an input data instance from the classifier. The judgment includes a confidence value that represents the certainty of the judgment. In some embodiments, the confidence value included with a classification judgment is a score representing the distance in decision space of the judgment from the task decision boundary, as previously described with reference to FIG. 3.
  • In embodiments, the system sends 710 the judgment and the input data instance to a quality assurance component 660 for quality analysis. In some embodiments, quality analysis includes determining 715 whether the judgment confidence value satisfies a confidence threshold.
  • In an instance in which the judgment confidence value satisfies the confidence threshold and the data satisfy a quality threshold, the system outputs 730 the data processed by the modeling task and the process ends 735.
  • In an instance in which the judgment confidence value does not satisfy the confidence threshold, the system sends 720 the input data sample to an oracle for verification. In some embodiments, verification by the oracle may include correction of the data, correction of the judgment, and/or labeling the input data. In response to receiving the verified data from the oracle, the system optionally may update the training data 620 using the verified data before the process ends 735. In some embodiments, updating the training data may be implemented using the quality assurance component 660 and/or the active learning component 640, which both are described in more detail with reference to FIGS. 10-12.
  • In some embodiments, the training data set 620 is updated continuously as new input data are processed, so that the training data reflect optimal examples of the current data being processed. The training data examples thus are adapted to fluctuations in quality and composition of the dynamic data, enabling the predictive model 630 to be re-trained. In some embodiments, the model 630 may be re-trained using the current training data set periodically or, alternatively, under a re-training schedule. In this way, a predictive model can maintain its functional effectiveness by adapting to the dynamic nature of the data being processed. Incrementally adapting an existing model is less disruptive and resource-intensive than replacing the model with a new model, and also enables a model to evolve with the dynamic data. In some embodiments, an adaptive oracle-trained learning framework 600 is further configured to perform two sample hypothesis testing (AB testing, hereinafter) to verify the performance of the predictive model 630 after re-training.
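  • A simplified sketch of scheduled re-training followed by an A/B-style comparison appears below; it assumes a scikit-learn-style model and an oracle-verified holdout set, and it replaces a genuine two-sample hypothesis test with a bare accuracy comparison for brevity.

```python
import copy

def retrain_and_ab_test(current_model, training_X, training_y, holdout_X, holdout_y):
    """Re-train a candidate model on the current training data and keep it only if
    it performs at least as well as the current model on an oracle-verified
    holdout set (a full A/B test would apply a two-sample hypothesis test, such as
    a proportion test, instead of this direct comparison)."""
    candidate = copy.deepcopy(current_model)
    candidate.fit(training_X, training_y)

    def accuracy(model):
        predictions = model.predict(holdout_X)
        return sum(p == y for p, y in zip(predictions, holdout_y)) / len(holdout_y)

    return candidate if accuracy(candidate) >= accuracy(current_model) else current_model
```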
  • In some embodiments, the system performs a new distribution-based feature analysis of the training data 620 in response to the addition of newly labeled data instances. In some embodiments, for example, a new distribution-based feature analysis of the data by dynamic clustering may be performed by the input data analysis component 610 using method 800, a flow chart of which is illustrated in FIG. 8, and using method 900, a flow chart of which is illustrated in FIG. 9. Method 800 and method 900 are described in detail in U.S. patent application Ser. No. 14/038,661.
  • FIG. 8 is a flow diagram of an example method 800 for dynamically updating a model core group of clusters along a single dimension k. For convenience, the method 800 will be described with respect to a system that includes one or more computing devices and performs the method 800.
  • In embodiments, the system receives 805 Xk, defined as a model core group of clusters 105 of objects based on a clustering dimension k. For example, in embodiments, clustering dimension k may represent a geographical feature of an object represented by latitude and longitude data. In embodiments, the system receives 810 a new data stream Sk representing the objects in Xk, where the n-dimensional vector representing each object Oi includes the kth dimension.
  • In embodiments, the system classifies 815 each of the objects represented in the new data stream 125 as respectively belonging to one of the clusters within Xk. In some embodiments, an object is classified by determining, based on a k-means algorithm, Ck, the nearest cluster to the object in the kth dimension. In embodiments, classifying an object includes adding that object to the cluster Ck.
  • In embodiments, the system determines 820 whether to update Xk in response to integrating each of the objects into its respective nearest cluster.
  • FIG. 9 is a flow diagram of an example method 900 for dynamically updating a cluster along a single dimension k. For convenience, the method 900 will be described with respect to a system that includes one or more computing devices and performs the method 900. Specifically, the method 900 will be described with respect to implementation of steps 815 and 820 of method 800.
  • In embodiments, the system receives 905 a data point from a new data stream Sk representing Oi k, an instance of clustering dimension k describing a feature of an object being described in new data stream S. For example, in embodiments, the data point may be latitude and longitude representing a geographical feature included in an n-dimensional feature vector describing the object.
  • In embodiments, the system adds 910 the object to the closest cluster Ck ∈ Xk for Oi k and, in response, updates 915 the properties of cluster Ck. In embodiments, updating the properties includes calculating σk, the standard deviation of the objects in cluster Ck.
  • In embodiments, the system determines 920 whether to update cluster Ck using its updated properties. In some embodiments, updating cluster Ck may include splitting cluster Ck or merging cluster Ck with another cluster within the core group of clusters. In some embodiments, the system determines 920 whether to update cluster Ck using σk.
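  • The single-dimension update of methods 800 and 900 can be caricatured in a few lines of Python; the split and merge thresholds, and the choice to split a cluster at its mean, are illustrative assumptions rather than the specific criteria of the incorporated application.

```python
import statistics

class Cluster1D:
    """Minimal stand-in for a cluster along a single dimension k."""
    def __init__(self, values):
        self.values = list(values)

    @property
    def mean(self):
        return statistics.fmean(self.values)

    @property
    def std(self):
        return statistics.pstdev(self.values) if len(self.values) > 1 else 0.0

def update_core_group(core_group, stream_k, split_std=2.0, merge_gap=0.5):
    """Assign each incoming value to its nearest cluster, recompute that cluster's
    standard deviation, split clusters whose spread grows too large, and merge
    clusters whose centers drift together."""
    for value in stream_k:
        nearest = min(core_group, key=lambda c: abs(c.mean - value))
        nearest.values.append(value)
        if nearest.std > split_std:                  # split an over-wide cluster at its mean
            left = [v for v in nearest.values if v <= nearest.mean]
            right = [v for v in nearest.values if v > nearest.mean]
            core_group.remove(nearest)
            core_group.extend([Cluster1D(left), Cluster1D(right)])
    # Merge any pair of adjacent clusters whose centers are closer than merge_gap.
    core_group.sort(key=lambda c: c.mean)
    merged = [core_group[0]]
    for cluster in core_group[1:]:
        if abs(cluster.mean - merged[-1].mean) < merge_gap:
            merged[-1].values.extend(cluster.values)
        else:
            merged.append(cluster)
    return merged
```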
  • In some embodiments, the system may optimize an initial training data set 120 that has been generated from a pool of unlabeled data by implementing method 300 to process the initial training data set 120 using the predictive model 130 generated from the initial training data and updating the training data set 120 based on the quality assessments of the model judgments of the data instances. The system may repeat implementation of method 300 until the entire training data set meets a pre-determined quality threshold.
  • In some embodiments, the quality assurance component 160 is configured as a dynamic data quality assessment system described, for example, in U.S. patent application Ser. No. 14/088,247 entitled “Automated Adaptive Data Analysis Using Dynamic Data Quality Assessment,” filed on Nov. 22, 2013, and which is incorporated herein in its entirety.
  • FIG. 10 illustrates a diagram 1000, in which an exemplary dynamic data quality assessment system is configured as a quality assurance component 160 within an adaptive oracle-trained learning framework 100, as described in detail in U.S. patent application Ser. No. 14/088,247. The quality assurance component 160 includes a quality checker 1062 and a quality blocker 1064, and maintains a data reservoir 1050 within the framework 100.
  • In some embodiments, quality analysis performed by the quality assurance component 160 may include determining the effect of data quality fluctuations on the performance of the predictive model 130 generated from the training data 120, identifying input data samples that currently best represent examples of the modeled task, and modifying the training data 120 to enable the model to be improved incrementally by being re-trained with a currently optimal set of training data examples. In some embodiments, dynamic data quality assessment may be performed automatically by the quality assurance component using method 1100, a flow chart of which is illustrated in FIG. 11. Method 1100 is described in detail in U.S. patent application Ser. No. 14/088,247.
  • FIG. 11 is a flow diagram of an example method 1100 for automatic dynamic data quality assessment of dynamic input data being analyzed using an adaptive predictive model. For convenience, the method 1100 will be described with respect to a system that includes one or more computing devices and performs the method 1100.
  • For clarity and without limitation, method 1100 will be described for a scenario in which the input data sample is a sample of data collected from a data stream, and in which the predictive model is a trainable classifier, adapted based on a set of training data. In some embodiments, a data cleaning process has been applied to the input data sample. The classifier is configured to receive a feature vector representing a view of the input data sample and to output a judgment about the input data sample.
  • In embodiments, the system receives 1105 a judgment about an input data sample from a classifier. In some embodiments, the judgment includes a confidence value that represents a certainty of the judgment. For example, in some embodiments, the confidence value may be a score that represents the distance of the judgment from the decision boundary in decision space for the particular classification problem modeled by the classifier. The confidence score is higher (i.e., the judgment is more certain) for judgments that are further from the decision boundary.
  • As previously described with reference to FIG. 1, in some embodiments, the system maintains a data reservoir of data samples that have the same data type as the input data sample and that have been processed previously by the classifier. In embodiments, the system analyzes 1110 the input data sample in terms of the summary statistics of the data reservoir and/or the judgment. In some embodiments, analysis of the judgment may include comparing a confidence value associated with the judgment to a confidence threshold and/or determining whether the judgment matches a judgment determined previously for the input sample by a method other than the classifier.
  • In embodiments, the system determines 1115 whether to send a quality verification request for the input data sample to an oracle based on the analysis. For example, in some embodiments, the system may determine to send a quality verification request for the input data sample if the data sample is determined statistically to be an outlier to the data samples in the data reservoir. In another example, the system may determine to send a quality verification request for the input data sample if the judgment is associated with a confidence value that is below a confidence threshold. In a third example, the system may determine to send a quality verification request for the input data sample if the judgment generated by the classifier does not match a judgment generated by another method, even if the confidence value associated with the classifier's judgment is above the confidence threshold.
  • In an instance in which the system determines 1120 that a quality request will not be sent to the oracle, the process ends 1140.
  • In an instance in which the system determines 1120 that a quality request will be sent to the oracle, in some embodiments, the system may be configured to send requests to any of a group of different oracles (e.g., a crowd, a flat file of data verification results previously received from one or more oracles, and/or data verification software) and the system may select the oracle to receive the quality verification request based on attributes of the input data sample.
  • In response to receiving a data quality estimate of the input data sample from the oracle, in embodiments, the system determines 1125 whether to add the input data sample, its associated judgment, and its data quality estimate to the data reservoir. In some embodiments, the determination may be based on whether the input data sample statistically belongs in the data reservoir. Additionally and/or alternatively, the determination may be based on whether the judgment is associated with a high confidence value and/or matches a judgment made by a method different from the classifier (e.g., the oracle).
  • In an instance in which the system determines 1125 that the new data sample is not to be added to the reservoir, the process ends 1140.
  • In an instance in which the system determines 1125 that the new data sample is to be added to the reservoir, before the process ends 1140, the system optionally updates summary statistics for the reservoir.
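  • The following sketch gathers steps 1105-1140 into a single routine; the reservoir summary statistics, the outlier test, and the threshold values are illustrative assumptions rather than the particular analyses described in U.S. patent application Ser. No. 14/088,247.

```python
import statistics
from dataclasses import dataclass, field

@dataclass
class Reservoir:
    samples: list = field(default_factory=list)

    def summary(self):
        # Summary statistics over a single illustrative numeric feature.
        vals = [s["value"] for s in self.samples]
        return {"mean": statistics.fmean(vals), "stdev": statistics.pstdev(vals)} if vals else None

    def is_outlier(self, sample, z=3.0):
        stats = self.summary()
        if not stats or stats["stdev"] == 0:
            return False
        return abs(sample["value"] - stats["mean"]) / stats["stdev"] > z

def assess_quality(sample, judgment, reservoir, oracle, confidence_threshold=0.8):
    # Steps 1110-1115: request verification if the sample is a statistical
    # outlier, the judgment is uncertain, or it disagrees with another method.
    needs_oracle = (
        reservoir.is_outlier(sample)
        or judgment["confidence"] < confidence_threshold
        or judgment.get("other_judgment") not in (None, judgment["label"])
    )
    if not needs_oracle:
        return None                              # step 1140: end
    estimate = oracle(sample)                    # step 1120: quality verification request
    # Step 1125: add the sample, its judgment, and the quality estimate to the
    # reservoir only if the verified sample statistically belongs there.
    if estimate["label"] == judgment["label"] and not reservoir.is_outlier(sample):
        reservoir.samples.append({**sample, "judgment": judgment, "quality": estimate})
    return estimate                              # summary statistics are recomputed on demand
```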
  • In some embodiments, the generation and maintenance of an optimized training data set 120 for the predictive model 130 component of the framework is facilitated by the active learning component 140. Active learning, as described, for example, in Settles, Burr (2009), “Active Learning Literature Survey”, Computer Sciences Technical Report 1648, University of Wisconsin—Madison, is a semi-supervised learning process in which the distribution of the training data set instances can be adjusted to optimally represent a machine learning problem.
  • FIG. 12 is a flow diagram of an example method 1200 for using active learning for processing potential training data for a machine-learning algorithm. For convenience, the method 1200 will be described with respect to a system that includes one or more computing devices and performs the method 1200. Specifically, the method 1200 will be described with respect to processing of dynamic data by the active learning component 140 of an adaptive oracle-trained learning framework 100. For clarity and without limitation, method 1200 will be described for an exemplary system in which the machine-learning algorithm is a trainable classifier.
  • In embodiments, the system receives 1205 an input data sample and its associated judgment that includes a confidence value determined to not satisfy a confidence threshold.
  • A machine-learning algorithm may achieve greater accuracy with fewer training labels if the training data set instances are chosen to provide maximum information about the problem. Referring to the classifier example, data instances that provide maximum information about the classification task are data instances that result in classifier judgments that are closer to the decision boundary. In some embodiments, these data instances may be recognized automatically because their judgments are associated with lower confidence scores, as previously described.
  • In embodiments, the system sends 1210 the input data sample to an oracle for verification. In some embodiments, verification by the oracle may include correction of the data, correction of the judgment, and/or labeling the input data.
  • In embodiments, the system optionally may update 1215 the training data 120 using the verified data. Thus, the system can leverage the classifier's performance in real time or near real time to adapt the training data set to include a higher frequency of examples that currently result in judgments having the greatest uncertainty.
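  • A compact sketch of steps 1205-1215, assuming the judgment carries a numeric confidence value and the oracle returns a corrected label; the field names and the 0.8 threshold are illustrative assumptions.

```python
def active_learning_step(sample, judgment, oracle, training_data, confidence_threshold=0.8):
    # Step 1205: only samples whose judgments fail the confidence threshold arrive here.
    if judgment["confidence"] >= confidence_threshold:
        return training_data
    verified = oracle(sample)                          # step 1210: oracle verifies/labels the sample
    # Step 1215: fold the verified example back into the training data so the next
    # re-training sees more examples near the decision boundary.
    training_data.append({"features": sample["features"], "label": verified["label"]})
    return training_data
```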
  • In embodiments, a dynamic data quality assessment system 160 may complement an active learning component 140 to ensure that modifying the training data set by adding new samples does not result in over-fitting the model to the problem.
  • FIG. 13 is an illustration 1300 of the different effects of active learning and dynamic data quality assessment on selection of new data samples to be added to an exemplary training data set for a binary classification model. A model (i.e., a binary classifier) assigns a judgment value 1310 to each data point; a data point assigned a judgment value that is close to either 0 or 1 has been determined with certainty by the classifier to belong to one or the other of two classes. A judgment value of 0.5 represents a situation in which the classification decision was not certain; an input data sample assigned a judgment value close to 0.5 by the classifier represents a judgment that is close to the decision boundary 1315 for the classification task.
  • The dashed curve 1340 represents the relative frequencies of new training data samples that would be added to a training data set for this binary classification problem by an active learning component. To enhance the performance of the classifier in situations where the decision was uncertain, the active learning component would choose the majority of new training data samples from input data that resulted in decisions near the decision boundary 1315.
  • The solid curve 1330 represents the relative frequencies of new training data samples that would be added to the training data set by dynamic quality assessment. Instead of choosing new training data samples based on the judgment value, in some embodiments, dynamic quality assessment may choose the majority of new training data samples based on whether they statistically belong in the data reservoir. It also may choose to add new training data samples that were classified with certainty (i.e., having a judgment value close to either 0 or 1), but erroneously (e.g., samples in which the judgment result from the classifier did not match the result returned from the oracle).
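  • The two selection behaviors illustrated in FIG. 13 can be caricatured as simple predicates; the bandwidths below are illustrative assumptions chosen only to echo the shapes of curves 1340 and 1330, not parameters of the described embodiments.

```python
def active_learning_selects(judgment_value, band=0.15):
    # Curve 1340: favor samples whose judgment value lies near the 0.5 decision boundary.
    return abs(judgment_value - 0.5) < band

def quality_assessment_selects(judgment_value, oracle_label, belongs_in_reservoir):
    # Curve 1330: favor samples that statistically belong in the data reservoir,
    # plus samples that were classified with certainty but erroneously.
    predicted = 1 if judgment_value >= 0.5 else 0
    confidently_wrong = abs(judgment_value - 0.5) > 0.4 and predicted != oracle_label
    return belongs_in_reservoir or confidently_wrong
```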
  • FIG. 14 illustrates a third embodiment of an example system that can be configured to implement an adaptive oracle-trained learning framework 1400 for automatically building and maintaining a predictive machine learning model. In embodiments, system 1400 may comprise an input data analysis module 1420 for creating an optimal feature representation (e.g., a feature vector 1404) of a received input data sample 1402 selected from a data stream 1401; a predictive model 1430 that has been generated using machine learning based on a set of training data 1440, and that is configured to generate a judgment 1406 about the input data sample 1402 in response to receiving a feature vector 1404 representing the input data sample 1402; a data set optimizer 1440 for evaluating the input data sample 1402 and its associated judgment 1406; and a data reservoir 1450 that includes a set of data bins 1454 maintained by data set optimizer 1440. The data reservoir 1450 is thereby ensured to store fresh, up-to-date data that, in embodiments, may be selected from at least one of the bins 1454 to update the training data 1440, enabling the model to be improved incrementally by being re-trained with a currently optimal set of examples.
  • In embodiments, the configuration of the reservoir data bins 1454 may be used to ensure that the data reservoir stores up-to-date samples in a distribution such that, if samples were selected from the bins and used to update the training data 1440, the resulting training data would be likely to improve the performance of the predictive model 1430. In a first example, a set of bins 1454 may be used to generate labeling sets that do not match the distribution of the general population. Each of the bins may be used to store data representing one of the possible labels, and a labeling set with equal frequencies of samples of each label may be generated even though at least one of the labels may be rare in the general population distribution. In a second example, each of the bins may represent one of the sources that have contributed to the data stream 1401, and training data may be selected from the bins to match a distribution that represents a particular machine learning problem. Thus, if each of the data sources is a particular location (e.g., the US, Europe, and Asia), each of the bins stores data samples selected from one of those sources; if the desired training data 1440 distribution is to include 10% US-sourced data, then 10% of a labeling sample may be selected from the data bin storing data from US sources.
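  • For the second example above, selecting a labeling set whose source mix matches a desired training distribution might look like the following sketch; the bin keys, fractions, and helper name are assumptions introduced here for illustration rather than elements of the described system.

```python
import random

def build_labeling_set(bins, desired_fractions, n):
    # bins: {"US": [...], "Europe": [...], "Asia": [...]} -- one bin per data source.
    # desired_fractions: e.g. {"US": 0.10, "Europe": 0.45, "Asia": 0.45}.
    selected = []
    for source, fraction in desired_fractions.items():
        want = round(n * fraction)
        selected.extend(random.sample(bins[source], min(want, len(bins[source]))))
    return selected

# Example: a 100-sample labeling set with 10% US-sourced data.
# labeling_set = build_labeling_set(bins, {"US": 0.10, "Europe": 0.45, "Asia": 0.45}, 100)
```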
  • In some embodiments, dynamic data set distribution optimization may be used as an anomaly detection system to support the quality assurance of data flowing through a real time data processing system. In these embodiments, for example, system 1400 may be configured to include an anomaly scorer instead of a predictive model 1430, and the data bins 1454 would be configured to represent a distribution of anomaly scores.
  • In some embodiments, dynamic data set distribution optimization may be used to assess the calibration of the predictive model 1430. In a perfectly calibrated model, the model predictions exactly match reality. Thus, for example, if the model is a probabilistic estimator, instances assigned a probability of 0.5 should turn out to be 50% yes and 50% no; instances assigned a probability of 0.7 should turn out to be 70% yes and 30% no; and the like. The empirical distribution within each data bin may be used to test how well the model 1430 is calibrated. A small sample of data may be pulled out of a bin for analysis, and the distribution of outcomes in the sample may be used for the test. For example, if the data in the bin represent a predicted probability of 0.1, the sample may be tested to determine whether approximately 10% of its instances are positive, matching that probability.
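  • A small calibration check of the kind described above might be sketched as follows, assuming each bin gathers instances whose predicted probability falls near the bin's nominal value and that ground-truth labels are available for the sampled instances; the field names are assumptions.

```python
def calibration_error(bin_sample, bin_probability):
    # In a well-calibrated model, the empirical positive rate within a small
    # sample drawn from the bin should match the bin's nominal probability.
    positives = sum(1 for s in bin_sample if s["actual_label"] == 1)
    return abs(positives / len(bin_sample) - bin_probability)

# Example: a bin representing probability 0.1 should hold roughly 10% positives.
sample = [{"actual_label": 1}] + [{"actual_label": 0}] * 9
assert calibration_error(sample, 0.1) < 0.05
```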
  • In some embodiments, dynamic data set distribution optimization may be used to optimize feature modeling. The accuracy of the decisions within a bin may be tested (e.g., in some embodiments, a sample of the decisions within a bin may be sent to a crowd for verification), and the results may be used to adjust the feature modeling performed by the input data analysis module 1420.
  • FIG. 15 illustrates an example system 1500 that can be configured to implement dynamic optimization of a data set distribution. In embodiments, system 1500 may include a data reservoir 1530 that has been discretized into multiple data bins (1534A, 1534B, . . . , 1534X) based on a desired overall statistical distribution of data in the reservoir 1530; and a data set optimizer (e.g., data set optimizer 1440 described with reference to FIG. 14) that automatically maintains a fresh, up-to-date data reservoir 1530 with the desired distribution by receiving newly collected data and then determining whether to update the data reservoir 1530 using the newly collected data.
  • In embodiments, the system 1500 receives a data set optimization job 1505 that includes input data 1502 and configuration data 1504. In some embodiments, the input data set 1502 may be a data stream, as previously described. In some embodiments, the configuration data 1504 may include a description of the discretized data reservoir 1530 (e.g., the configuration of the set of bins and, additionally and/or alternatively, a desired distribution of data across the set of bins). In some embodiments, the data set optimization job 1505 also may include an input data evaluator 1514 while, in some alternative embodiments, the input data evaluator 1514 may be a component of system 1500. In embodiments, the input data evaluator 1514 may be a supervised machine learning algorithm (e.g., a classifier). In some embodiments, evaluating an input data instance may include assigning the instance an evaluation value (e.g., a classification prediction confidence value as previously described with reference to FIG. 5).
  • In some embodiments, each of the input data instances 1512 from the input data set 1502 is processed by the data set optimizer 1440 using the input data evaluator 1514, and then the system determines whether the evaluated data instance 1522 is to be offered to any of the data bins 1534 in the data reservoir 1530. In some embodiments, the evaluated data instance 1522 includes a prediction and/or prediction confidence value, and the determination is based at least in part on matching the prediction and/or prediction confidence value to attributes of the data that are respectively stored within each data bin 1534.
  • In some embodiments, each of the data bins 1534 is respectively associated with a reservoir sampler 1532 that maintains summary statistics of the distribution of data within the bin and determines whether to update the data bin 1534 based in part on those summary statistics. For example, in some embodiments, the summary statistics may include a size capacity for the data bin 1534 (i.e., the maximum number of data instances that can be stored in the data bin) since the set of data bins is selected to represent a discretized overall distribution of the data reservoir 1530. Additionally, each of the data bins 1534 may be associated with a particular range of evaluation values. Thus, in some embodiments, a reservoir sampler 1532 may determine that an evaluated data instance 1522 is to be added to a data bin 1534 if the evaluation value associated with the data instance is within the range of evaluation values associated with the bin and if the current bin size is below the bin size capacity. Additionally and/or alternatively, a reservoir sampler 1532 may determine that adding an evaluated data instance 1522 to a data bin 1534 will replace a data instance that currently is stored in that data bin.
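  • The per-bin reservoir samplers 1532 might be sketched as below; the uniform replacement rule (classic reservoir sampling) and the ten equal-width confidence ranges are assumptions illustrating one way to honor both the evaluation-value ranges and the size capacities described above.

```python
import random

class ReservoirSampler:
    """One sampler per data bin 1534: accepts instances whose evaluation value
    falls within the bin's range, up to a fixed capacity; once full, new
    instances replace stored ones uniformly at random (reservoir sampling)."""

    def __init__(self, low, high, capacity):
        self.low, self.high, self.capacity = low, high, capacity
        self.items, self.seen = [], 0

    def offer(self, instance, evaluation_value):
        if not (self.low <= evaluation_value < self.high):
            return False                      # value outside this bin's range
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(instance)       # bin below capacity: store directly
            return True
        j = random.randrange(self.seen)       # bin full: maybe replace an instance
        if j < self.capacity:
            self.items[j] = instance
            return True
        return False

# A discretized reservoir 1530: ten bins covering evaluation values in [0, 1).
bins = [ReservoirSampler(i / 10, (i + 1) / 10, capacity=100) for i in range(10)]
```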
  • FIG. 16 shows a schematic block diagram of circuitry 1600, some or all of which may be included in, for example, an adaptive oracle-trained learning framework 100. As illustrated in FIG. 16, in accordance with some example embodiments, circuitry 1600 can include various means, such as processor 1602, memory 1604, communications module 1606, and/or input/output module 1608. As referred to herein, “module” includes hardware, software and/or firmware configured to perform one or more particular functions. In this regard, the means of circuitry 1600 as described herein may be embodied as, for example, circuitry, hardware elements (e.g., a suitably programmed processor, combinational logic circuit, and/or the like), a computer program product comprising computer-readable program instructions stored on a non-transitory computer-readable medium (e.g., memory 1604) that is executable by a suitably configured processing device (e.g., processor 1602), or some combination thereof.
  • Processor 1602 may, for example, be embodied as various means including one or more microprocessors with accompanying digital signal processor(s), one or more processor(s) without an accompanying digital signal processor, one or more coprocessors, one or more multi-core processors, one or more controllers, processing circuitry, one or more computers, various other processing elements including integrated circuits such as, for example, an ASIC (application specific integrated circuit) or FPGA (field programmable gate array), or some combination thereof. Accordingly, although illustrated in FIG. 16 as a single processor, in some embodiments, processor 1602 comprises a plurality of processors. The plurality of processors may be embodied on a single computing device or may be distributed across a plurality of computing devices collectively configured to function as circuitry 1600. The plurality of processors may be in operative communication with each other and may be collectively configured to perform one or more functionalities of circuitry 1600 as described herein. In an example embodiment, processor 1602 is configured to execute instructions stored in memory 1604 or otherwise accessible to processor 1602. These instructions, when executed by processor 1602, may cause circuitry 1600 to perform one or more of the functionalities of circuitry 1600 as described herein.
  • Whether configured by hardware, firmware/software methods, or by a combination thereof, processor 1602 may comprise an entity capable of performing operations according to embodiments of the present invention while configured accordingly. Thus, for example, when processor 1602 is embodied as an ASIC, FPGA or the like, processor 1602 may comprise specifically configured hardware for conducting one or more operations described herein. Alternatively, as another example, when processor 1602 is embodied as an executor of instructions, such as may be stored in memory 1604, the instructions may specifically configure processor 1602 to perform one or more algorithms and operations described herein, such as those discussed in connection with FIGS. 1-12.
  • Memory 1604 may comprise, for example, volatile memory, non-volatile memory, or some combination thereof. Although illustrated in FIG. 16 as a single memory, memory 1604 may comprise a plurality of memory components. The plurality of memory components may be embodied on a single computing device or distributed across a plurality of computing devices. In various embodiments, memory 1604 may comprise, for example, a hard disk, random access memory, cache memory, flash memory, a compact disc read only memory (CD-ROM), digital versatile disc read only memory (DVD-ROM), an optical disc, circuitry configured to store information, or some combination thereof. Memory 1604 may be configured to store information, data (including analytics data), applications, instructions, or the like for enabling circuitry 1600 to carry out various functions in accordance with example embodiments of the present invention. For example, in at least some embodiments, memory 1604 is configured to buffer input data for processing by processor 1602. Additionally or alternatively, in at least some embodiments, memory 1604 is configured to store program instructions for execution by processor 1602. Memory 1604 may store information in the form of static and/or dynamic information. This stored information may be stored and/or used by circuitry 1600 during the course of performing its functionalities.
  • Communications module 1606 may be embodied as any device or means embodied in circuitry, hardware, a computer program product comprising computer readable program instructions stored on a computer readable medium (e.g., memory 1604) and executed by a processing device (e.g., processor 1602), or a combination thereof that is configured to receive and/or transmit data from/to another device, such as, for example, a second circuitry 1600 and/or the like. In some embodiments, communications module 1606 (like other components discussed herein) can be at least partially embodied as or otherwise controlled by processor 1602. In this regard, communications module 1606 may be in communication with processor 1602, such as via a bus. Communications module 1606 may include, for example, an antenna, a transmitter, a receiver, a transceiver, network interface card and/or supporting hardware and/or firmware/software for enabling communications with another computing device. Communications module 1606 may be configured to receive and/or transmit any data that may be stored by memory 1604 using any protocol that may be used for communications between computing devices. Communications module 1606 may additionally or alternatively be in communication with the memory 1604, input/output module 1608 and/or any other component of circuitry 1600, such as via a bus.
  • Input/output module 1608 may be in communication with processor 1602 to receive an indication of a user input and/or to provide an audible, visual, mechanical, or other output to a user. Some example visual outputs that may be provided to a user by circuitry 1600 are discussed in connection with FIG. 1. As such, input/output module 1608 may include support, for example, for a keyboard, a mouse, a joystick, a display, a touch screen display, a microphone, a speaker, a RFID reader, barcode reader, biometric scanner, and/or other input/output mechanisms. In embodiments wherein circuitry 1600 is embodied as a server or database, aspects of input/output module 1608 may be reduced as compared to embodiments where circuitry 1600 is implemented as an end-user machine or other type of device designed for complex user interactions. In some embodiments (like other components discussed herein), input/output module 1608 may even be eliminated from circuitry 1600. Alternatively, such as in embodiments wherein circuitry 1600 is embodied as a server or database, at least some aspects of input/output module 1608 may be embodied on an apparatus used by a user that is in communication with circuitry 1600. Input/output module 1608 may be in communication with the memory 1604, communications module 1606, and/or any other component(s), such as via a bus. Although more than one input/output module and/or other component can be included in circuitry 1600, only one is shown in FIG. 16 to avoid overcomplicating the drawing (like the other components discussed herein).
  • Adaptive learning module 1610 may also or instead be included and configured to perform the functionality discussed herein related to the adaptive learning oracle-based framework discussed above. In some embodiments, some or all of the functionality of adaptive learning may be performed by processor 1602. In this regard, the example processes and algorithms discussed herein can be performed by at least one processor 1602 and/or adaptive learning module 1610. For example, non-transitory computer readable media can be configured to store firmware, one or more application programs, and/or other software, which include instructions and other computer-readable program code portions that can be executed to control each processor (e.g., processor 1602 and/or adaptive learning module 1610) of the components of system 400 to implement various operations, including the examples shown above. As such, a series of computer-readable program code portions are embodied in one or more computer program products and can be used, with a computing device, server, and/or other programmable apparatus, to produce machine-implemented processes.
  • Any such computer program instructions and/or other type of code may be loaded onto a computer, processor, or other programmable apparatus's circuitry to produce a machine, such that the computer, processor, or other programmable circuitry that executes the code on the machine creates the means for implementing various functions, including those described herein.
  • It is also noted that all or some of the information presented by the example displays discussed herein can be based on data that is received, generated and/or maintained by one or more components of adaptive oracle-trained learning framework 100. In some embodiments, one or more external systems (such as a remote cloud computing and/or data storage system) may also be leveraged to provide at least some of the functionality discussed herein.
  • As described above in this disclosure, aspects of embodiments of the present invention may be configured as methods, mobile devices, backend network devices, and the like. Accordingly, embodiments may comprise various means including entirely of hardware or any combination of software and hardware. Furthermore, embodiments may take the form of a computer program product on at least one non-transitory computer-readable storage medium having computer-readable program instructions (e.g., computer software) embodied in the storage medium. Any suitable computer-readable storage medium may be utilized including non-transitory hard disks, CD-ROMs, flash memory, optical storage devices, or magnetic storage devices.
  • Embodiments of the present invention have been described above with reference to block diagrams and flowchart illustrations of methods, apparatuses, systems and computer program products. It will be understood that each block of the circuit diagrams and process flow diagrams, and combinations of blocks in the circuit diagrams and process flowcharts, respectively, can be implemented by various means including computer program instructions. These computer program instructions may be loaded onto a general purpose computer, special purpose computer, or other programmable data processing apparatus, such as processor 1602 and/or adaptive learning module 1610 discussed above with reference to FIG. 16, to produce a machine, such that the instructions which execute on the computer or other programmable data processing apparatus create means for implementing the functions specified in the flowchart block or blocks.
  • These computer program instructions may also be stored in a computer-readable storage device (e.g., memory 1604) that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable storage device produce an article of manufacture including computer-readable instructions for implementing the function discussed herein. The computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer-implemented process such that the instructions that execute on the computer or other programmable apparatus provide steps for implementing the functions discussed herein.
  • Accordingly, blocks of the block diagrams and flowchart illustrations support combinations of means for performing the specified functions, combinations of steps for performing the specified functions and program instruction means for performing the specified functions. It will also be understood that each block of the circuit diagrams and process flowcharts, and combinations of blocks in the circuit diagrams and process flowcharts, can be implemented by special purpose hardware-based computer systems that perform the specified functions or steps, or combinations of special purpose hardware and computer instructions.
  • Many modifications and other embodiments of the inventions set forth herein will come to mind to one skilled in the art to which these inventions pertain having the benefit of the teachings presented in the foregoing descriptions and the associated drawings. Therefore, it is to be understood that the inventions are not to be limited to the specific embodiments disclosed and that modifications and other embodiments are intended to be included within the scope of the appended claims. Although specific terms are employed herein, they are used in a generic and descriptive sense only and not for purposes of limitation.

Claims (21)

1-20. (canceled)
21. A system, comprising one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to:
determine feature data indicative of a set of features associated with a training data set for a predictive model;
determine, based on the feature data associated with the training data set, a training distribution representative of a goal for the predictive model;
apply an input data instance to the predictive model to determine a label for the input data instance;
add the input data instance to a data bin of a data reservoir based on the label, wherein the data reservoir comprises candidate training data for the training data set;
in response to a determination that the training distribution representative of the goal for the predictive model comprises goal criteria associated with at least one label that corresponds to at least the label, select at least a portion of the candidate training data from at least the data bin of the data reservoir that comprises the input data instance;
train the predictive model based on at least the portion of the candidate training data to generate an updated predictive model; and
classify data collected from one or more network sources based on the updated predictive model.
22. The system of claim 21, wherein the one or more storage devices store instructions that are operable, when executed by the one or more computers, to further cause the one or more computers to:
determine a confidence value associated with the label for the input data instance; and
add the input data instance to the data bin of the data reservoir in response to the confidence value satisfying a defined confidence value threshold.
23. The system of claim 21, wherein the one or more storage devices store instructions that are operable, when executed by the one or more computers, to further cause the one or more computers to:
configure a set of data bins of the data reservoir based on the training distribution representative of the goal for the predictive model.
24. The system of claim 21, wherein the one or more storage devices store instructions that are operable, when executed by the one or more computers, to further cause the one or more computers to:
configure a set of data bins of the data reservoir based on configuration data related to size capacity for respective data bins.
25. The system of claim 24, wherein the one or more storage devices store instructions that are operable, when executed by the one or more computers, to further cause the one or more computers to:
add the input data instance to a data bin of a data reservoir in response to the data bin satisfying size capacity criterion.
26. The system of claim 24, wherein the one or more storage devices store instructions that are operable, when executed by the one or more computers, to further cause the one or more computers to:
replace a particular input data instance stored in the data bin of the data reservoir with the input data instance in response to the data bin not satisfying size capacity criterion.
27. The system of claim 21, wherein the one or more storage devices store instructions that are operable, when executed by the one or more computers, to further cause the one or more computers to:
allocate respective labels for respective data bins of the data reservoir; and
compare the label for the input data instance to the respective labels for the respective data bins of the data reservoir.
28. A computer-implemented method, comprising:
determining, by a computing device comprising a processor, feature data indicative of a set of features associated with a training data set for a predictive model;
determining, by the computing device and based on the feature data associated with the training data set, a training distribution representative of a goal for the predictive model;
applying, by the computing device, an input data instance to the predictive model to determine a label for the input data instance;
adding, by the computing device, the input data instance to a data bin of a data reservoir based on the label, wherein the data reservoir comprises candidate training data for the training data set;
in response to a determination that the training distribution representative of the goal for the predictive model comprises goal criteria associated with at least one label that corresponds to at least the label, selecting, by the computing device, at least a portion of the candidate training data from at least the data bin of the data reservoir that comprises the input data instance;
training, by the computing device, the predictive model based on at least the portion of the candidate training data to generate an updated predictive model; and
classifying, by the computing device, data collected from one or more network sources based on the updated predictive model.
29. The computer-implemented method of claim 28, further comprising:
determining, by the computing device, a confidence value associated with the label for the input data instance; and
adding, by the computing device, the input data instance to the data bin of the data reservoir in response to the confidence value satisfying a defined confidence value threshold.
30. The computer-implemented method of claim 28, further comprising:
configuring, by the computing device, a set of data bins of the data reservoir based on the training distribution representative of the goal for the predictive model.
31. The computer-implemented method of claim 28, further comprising:
configuring, by the computing device, a set of data bins of the data reservoir based on configuration data related to size capacity for respective data bins.
32. The computer-implemented method of claim 31, further comprising:
adding, by the computing device, the input data instance to a data bin of a data reservoir in response to the data bin satisfying size capacity criterion.
33. The computer-implemented method of claim 31, further comprising:
replacing, by the computing device, a particular input data instance stored in the data bin of the data reservoir with the input data instance in response to the data bin not satisfying size capacity criterion.
34. The computer-implemented method of claim 28, further comprising:
allocating, by the computing device, respective labels for respective data bins of the data reservoir; and
comparing, by the computing device, the label for the input data instance to the respective labels for the respective data bins of the data reservoir.
35. A computer program product, stored on a computer readable medium, comprising instructions that when executed by one or more computers cause the one or more computers to:
determine feature data indicative of a set of features associated with a training data set for a predictive model;
determine, based on the feature data associated with the training data set, a training distribution representative of a goal for the predictive model;
apply an input data instance to the predictive model to determine a label for the input data instance;
add the input data instance to a data bin of a data reservoir based on the label, wherein the data reservoir comprises candidate training data for the training data set;
in response to a determination that the training distribution representative of the goal for the predictive model comprises goal criteria associated with at least one label that corresponds to at least the label, select at least a portion of the candidate training data from at least the data bin of the data reservoir that comprises the input data instance;
train the predictive model based on at least the portion of the candidate training data to generate an updated predictive model; and
classify data collected from one or more network sources based on the updated predictive model.
36. The computer program product of claim 35, further comprising instructions that when executed by the one or more computers cause the one or more computers to:
determine a confidence value associated with the label for the input data instance; and
add the input data instance to the data bin of the data reservoir in response to the confidence value satisfying a defined confidence value threshold.
37. The computer program product of claim 35, further comprising instructions that when executed by the one or more computers cause the one or more computers to:
configure a set of data bins of the data reservoir based on the training distribution representative of the goal for the predictive model.
38. The computer program product of claim 37, further comprising instructions that when executed by the one or more computers cause the one or more computers to:
add the input data instance to a data bin of a data reservoir in response to the data bin satisfying size capacity criterion.
39. The computer program product of claim 37, further comprising instructions that when executed by the one or more computers cause the one or more computers to:
replace a particular input data instance stored in the data bin of the data reservoir with the input data instance in response to the data bin not satisfying size capacity criterion.
40. The computer program product of claim 35, further comprising instructions that when executed by the one or more computers cause the one or more computers to:
allocate respective labels for respective data bins of the data reservoir; and
compare the label for the input data instance to the respective labels for the respective data bins of the data reservoir.
US17/528,514 2013-12-23 2021-11-17 Processing dynamic data within an adaptive oracle-trained learning system using dynamic data set distribution optimization Pending US20220180250A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/528,514 US20220180250A1 (en) 2013-12-23 2021-11-17 Processing dynamic data within an adaptive oracle-trained learning system using dynamic data set distribution optimization

Applications Claiming Priority (5)

Application Number Priority Date Filing Date Title
US201361920251P 2013-12-23 2013-12-23
US201462039314P 2014-08-19 2014-08-19
US201462055958P 2014-09-26 2014-09-26
US14/578,210 US11210604B1 (en) 2013-12-23 2014-12-19 Processing dynamic data within an adaptive oracle-trained learning system using dynamic data set distribution optimization
US17/528,514 US20220180250A1 (en) 2013-12-23 2021-11-17 Processing dynamic data within an adaptive oracle-trained learning system using dynamic data set distribution optimization

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US14/578,210 Continuation US11210604B1 (en) 2013-12-23 2014-12-19 Processing dynamic data within an adaptive oracle-trained learning system using dynamic data set distribution optimization

Publications (1)

Publication Number Publication Date
US20220180250A1 (en) 2022-06-09

Family

ID=79168465

Family Applications (2)

Application Number Title Priority Date Filing Date
US14/578,210 Active 2036-05-29 US11210604B1 (en) 2013-12-23 2014-12-19 Processing dynamic data within an adaptive oracle-trained learning system using dynamic data set distribution optimization
US17/528,514 Pending US20220180250A1 (en) 2013-12-23 2021-11-17 Processing dynamic data within an adaptive oracle-trained learning system using dynamic data set distribution optimization

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US14/578,210 Active 2036-05-29 US11210604B1 (en) 2013-12-23 2014-12-19 Processing dynamic data within an adaptive oracle-trained learning system using dynamic data set distribution optimization

Country Status (1)

Country Link
US (2) US11210604B1 (en)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111742269B (en) * 2017-12-21 2023-05-30 皇家飞利浦有限公司 Computer-implemented method and node implementing said method

Family Cites Families (47)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5212765A (en) * 1990-08-03 1993-05-18 E. I. Du Pont De Nemours & Co., Inc. On-line training neural network system for process control
US5825646A (en) * 1993-03-02 1998-10-20 Pavilion Technologies, Inc. Method and apparatus for determining the sensitivity of inputs to a neural network on output parameters
US8311673B2 (en) * 1996-05-06 2012-11-13 Rockwell Automation Technologies, Inc. Method and apparatus for minimizing error in dynamic and steady-state processes for prediction, control, and optimization
US6092072A (en) 1998-04-07 2000-07-18 Lucent Technologies, Inc. Programmed medium for clustering large databases
US7418431B1 (en) 1999-09-30 2008-08-26 Fair Isaac Corporation Webstation: configurable web-based workstation for reason driven data analysis
US20030130899A1 (en) * 2002-01-08 2003-07-10 Bruce Ferguson System and method for historical database training of non-linear models for use in electronic commerce
US20040205482A1 (en) * 2002-01-24 2004-10-14 International Business Machines Corporation Method and apparatus for active annotation of multimedia content
US7565304B2 (en) 2002-06-21 2009-07-21 Hewlett-Packard Development Company, L.P. Business processes based on a predictive model
US20040049473A1 (en) 2002-09-05 2004-03-11 David John Gower Information analytics systems and methods
US7490071B2 (en) 2003-08-29 2009-02-10 Oracle Corporation Support vector machines processing system
US7512582B2 (en) 2003-12-10 2009-03-31 Microsoft Corporation Uncertainty reduction in collaborative bootstrapping
US7480640B1 (en) 2003-12-16 2009-01-20 Quantum Leap Research, Inc. Automated method and system for generating models from data
JP2008520318A (en) * 2004-11-19 2008-06-19 コーニンクレッカ フィリップス エレクトロニクス エヌ ヴィ System and method for reducing false positives in computer aided detection (CAD) using support vector machine (SVM)
US7599897B2 (en) * 2006-05-05 2009-10-06 Rockwell Automation Technologies, Inc. Training a support vector machine with process constraints
US7925620B1 (en) 2006-08-04 2011-04-12 Hyoungsoo Yoon Contact information management
US7756800B2 (en) 2006-12-14 2010-07-13 Xerox Corporation Method for transforming data elements within a classification system based in part on input from a human annotator/expert
US8392381B2 (en) 2007-05-08 2013-03-05 The University Of Vermont And State Agricultural College Systems and methods for reservoir sampling of streaming data and stream joins
US8112421B2 (en) 2007-07-20 2012-02-07 Microsoft Corporation Query selection for effectively learning ranking functions
US8166013B2 (en) 2007-11-05 2012-04-24 Intuit Inc. Method and system for crawling, mapping and extracting information associated with a business using heuristic and semantic analysis
US8086549B2 (en) 2007-11-09 2011-12-27 Microsoft Corporation Multi-label active learning
US8799773B2 (en) 2008-01-25 2014-08-05 Google Inc. Aspect-based sentiment summarization
US20110191335A1 (en) 2010-01-29 2011-08-04 Lexisnexis Risk Data Management Inc. Method and system for conducting legal research using clustering analytics
US9165051B2 (en) 2010-08-24 2015-10-20 Board Of Trustees Of The University Of Illinois Systems and methods for detecting a novel data class
US9390194B2 (en) 2010-08-31 2016-07-12 International Business Machines Corporation Multi-faceted visualization of rich text corpora
US8402543B1 (en) 2011-03-25 2013-03-19 Narus, Inc. Machine learning based botnet detection with dynamic adaptation
US20120254186A1 (en) 2011-03-31 2012-10-04 Nokia Corporation Method and apparatus for rendering categorized location-based search results
US8533224B2 (en) * 2011-05-04 2013-09-10 Google Inc. Assessing accuracy of trained predictive models
US8688601B2 (en) * 2011-05-23 2014-04-01 Symantec Corporation Systems and methods for generating machine learning-based classifiers for detecting specific categories of sensitive information
US9619494B2 (en) * 2011-05-25 2017-04-11 Qatar Foundation Scalable automatic data repair
US8762299B1 (en) * 2011-06-27 2014-06-24 Google Inc. Customized predictive analytical model training
US8843427B1 (en) 2011-07-01 2014-09-23 Google Inc. Predictive modeling accuracy
US9349103B2 (en) 2012-01-09 2016-05-24 DecisionQ Corporation Application of machine learned Bayesian networks to detection of anomalies in complex systems
US9015080B2 (en) 2012-03-16 2015-04-21 Orbis Technologies, Inc. Systems and methods for semantic inference and reasoning
US9715723B2 (en) 2012-04-19 2017-07-25 Applied Materials Israel Ltd Optimization of unknown defect rejection for automatic defect classification
US20140101544A1 (en) 2012-10-08 2014-04-10 Microsoft Corporation Displaying information according to selected entity type
US9501799B2 (en) 2012-11-08 2016-11-22 Hartford Fire Insurance Company System and method for determination of insurance classification of entities
US20140172767A1 (en) 2012-12-14 2014-06-19 Microsoft Corporation Budget optimal crowdsourcing
US20140279745A1 (en) 2013-03-14 2014-09-18 Sm4rt Predictive Systems Classification based on prediction of accuracy of multiple data models
US9122681B2 (en) 2013-03-15 2015-09-01 Gordon Villy Cormack Systems and methods for classifying electronic information using advanced active learning techniques
US9390378B2 (en) 2013-03-28 2016-07-12 Wal-Mart Stores, Inc. System and method for high accuracy product classification with limited supervision
US9390112B1 (en) 2013-11-22 2016-07-12 Groupon, Inc. Automated dynamic data quality assessment
US9652362B2 (en) 2013-12-06 2017-05-16 Qualcomm Incorporated Methods and systems of using application-specific and application-type-specific models for the efficient classification of mobile device behaviors
US9721212B2 (en) 2014-06-04 2017-08-01 Qualcomm Incorporated Efficient on-device binary analysis for auto-generated behavioral models
EP3155758A4 (en) 2014-06-10 2018-04-11 Sightline Innovation Inc. System and method for network based application development and implementation
US10402746B2 (en) 2014-09-10 2019-09-03 Amazon Technologies, Inc. Computing instance launch time
AU2016261830B2 (en) * 2015-05-12 2019-01-17 Dexcom, Inc. Distributed system architecture for continuous glucose monitoring
US10635939B2 (en) * 2018-07-06 2020-04-28 Capital One Services, Llc System, method, and computer-accessible medium for evaluating multi-dimensional synthetic data using integrated variants analysis

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210399794A1 (en) * 2017-12-29 2021-12-23 Hughes Network Systems, Llc Machine learning models for adjusting communication parameters
US11722213B2 (en) * 2017-12-29 2023-08-08 Hughes Network Systems, Llc Machine learning models for adjusting communication parameters

Also Published As

Publication number Publication date
US11210604B1 (en) 2021-12-28

Similar Documents

Publication Publication Date Title
US20190378044A1 (en) Processing dynamic data within an adaptive oracle-trained learning system using curated training data for incremental re-training of a predictive model
US20200302337A1 (en) Automatic selection of high quality training data using an adaptive oracle-trained learning framework
US11295215B2 (en) Automated dynamic data quality assessment
US20200012963A1 (en) Curating Training Data For Incremental Re-Training Of A Predictive Model
US20200293951A1 (en) Dynamically optimizing a data set distribution
US11016996B2 (en) Dynamic clustering for streaming data
US20210049428A1 (en) Managing missing values in datasets for machine learning models
US8756175B1 (en) Robust and fast model fitting by adaptive sampling
US20190236479A1 (en) Method and apparatus for providing efficient testing of systems by using artificial intelligence tools
US20220180250A1 (en) Processing dynamic data within an adaptive oracle-trained learning system using dynamic data set distribution optimization
US11210368B2 (en) Computational model optimizations
WO2016127218A1 (en) Learning from distributed data
US11687804B2 (en) Latent feature dimensionality bounds for robust machine learning on high dimensional datasets
US20160012318A1 (en) Adaptive featurization as a service
KR20200092989A (en) Production organism identification using unsupervised parameter learning for outlier detection
US20230186150A1 (en) Hyperparameter selection using budget-aware bayesian optimization
US20230316153A1 (en) Dynamically updated ensemble-based machine learning for streaming data
US20230177110A1 (en) Generating task-specific training data
US20220067604A1 (en) Utilizing machine learning models to aggregate applications and users with events associated with the applications
CN115392400A (en) User classification method, and product recommendation method and device based on user classification

Legal Events

Date Code Title Description
AS Assignment

Owner name: GROUPON, INC., ILLINOIS

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:JEFFERY, SHAWN RYAN;JOHNSTON, DAVID ALAN;SIGNING DATES FROM 20150218 TO 20150602;REEL/FRAME:058138/0670

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION