EP3161731A1 - Feature processing tradeoff management - Google Patents

Feature processing tradeoff management

Info

Publication number
EP3161731A1
Authority
EP
European Patent Office
Prior art keywords
mls
client
machine learning
feature processing
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP15739124.4A
Other languages
German (de)
English (en)
Inventor
Leo Parker Dirac
Nicolle M. Correa
Charles Eric Dannaker
Aleksandr Mikhaylovich Ingerman
Sriram Krishnan
Jin Li
Sudhakar Rao Puvvadi
Saman Zarandioon
Rakesh Ramakrishnan
Tianming Zheng
Donghui Zhuo
Tarun Agarwal
Robert Matthias Steele
Jun Qian
Michael Brueckner
Ralf Herbrich
Daniel BLICK
Polly Po Yee Lee
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Amazon Technologies Inc
Original Assignee
Amazon Technologies Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US14/319,880 external-priority patent/US9886670B2/en
Priority claimed from US14/319,902 external-priority patent/US10102480B2/en
Priority claimed from US14/460,312 external-priority patent/US11100420B2/en
Priority claimed from US14/460,314 external-priority patent/US10540606B2/en
Priority claimed from US14/463,434 external-priority patent/US10339465B2/en
Priority claimed from US14/484,201 external-priority patent/US10318882B2/en
Priority claimed from US14/489,449 external-priority patent/US9672474B2/en
Priority claimed from US14/489,448 external-priority patent/US10169715B2/en
Priority claimed from US14/538,723 external-priority patent/US10452992B2/en
Priority claimed from US14/569,458 external-priority patent/US10963810B2/en
Application filed by Amazon Technologies Inc
Publication of EP3161731A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061 Partitioning or combining of resources
    • G06F9/5066 Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/30 Creation or generation of source code
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G06N5/022 Knowledge engineering; Knowledge acquisition
    • G06N5/025 Extracting rules from data

Definitions

  • Machine learning combines techniques from statistics and artificial intelligence to create algorithms that can learn from empirical data and generalize to solve problems in various domains such as natural language processing, financial fraud detection, terrorism threat level detection, human health diagnosis and the like.
  • sources such as sensors of various kinds, web server logs, social media services, financial transaction records, security cameras, and the like.
  • the quality of the results obtained from machine learning algorithms may depend on how well the empirical data used for training the models captures key relationships among different variables represented in the data, and on how effectively and efficiently these relationships can be identified.
  • very large data sets may have to be analyzed in order to be able to make accurate predictions, especially predictions of relatively infrequent but significant events.
  • identifying factors that can be used to label a transaction as fraudulent may potentially require analysis of millions of transaction records, each representing dozens or even hundreds of variables.
  • Constraints on raw input data set size, cleansing or normalizing large numbers of potentially incomplete or error-containing records, and/or on the ability to extract representative subsets of the raw data also represent barriers that are not easy to overcome for many potential beneficiaries of machine learning techniques.
  • transformations may have to be applied on various input data variables before the data can be used effectively to train models.
  • the mechanisms available to apply such transformations may be less than optimal - e.g., similar transformations may sometimes have to be applied one by one to many different variables of a data set, potentially requiring a lot of tedious and error-prone work.
  • FIG. 1 illustrates an example system environment in which various components of a machine learning service may be implemented, according to at least some embodiments.
  • FIG. 2 illustrates an example of a machine learning service implemented using a plurality of network-accessible services of a provider network, according to at least some embodiments.
  • FIG. 3 illustrates an example of the use of a plurality of availability containers and security containers of a provider network for a machine learning service, according to at least some embodiments.
  • FIG. 4 illustrates examples of a plurality of processing plans and corresponding resource sets that may be generated at a machine learning service, according to at least some embodiments.
  • FIG. 5 illustrates an example of asynchronous scheduling of jobs at a machine learning service, according to at least some embodiments.
  • FIG. 6 illustrates example artifacts that may be generated and stored using a machine learning service, according to at least some embodiments.
  • FIG. 7 illustrates an example of automated generation of statistics in response to a client request to instantiate a data source, according to at least some embodiments.
  • FIG. 8 illustrates several model usage modes that may be supported at a machine learning service, according to at least some embodiments.
  • FIG. 9a and 9b are flow diagrams illustrating aspects of operations that may be performed at a machine learning service that supports asynchronous scheduling of machine learning jobs, according to at least some embodiments.
  • FIG. 10a is a flow diagram illustrating aspects of operations that may be performed at a machine learning service at which a set of idempotent programmatic interfaces are supported, according to at least some embodiments.
  • FIG. 10b is a flow diagram illustrating aspects of operations that may be performed at a machine learning service to collect and disseminate information about best practices related to different problem domains, according to at least some embodiments.
  • FIG. 11 illustrates example interactions associated with the use of recipes for data transformations at a machine learning service, according to at least some embodiments.
  • FIG. 12 illustrates example sections of a recipe, according to at least some embodiments.
  • FIG. 13 illustrates an example grammar that may be used to define recipe syntax, according to at least some embodiments.
  • FIG. 14 illustrates an example of an abstract syntax tree that may be generated for a portion of a recipe, according to at least some embodiments.
  • FIG. 15 illustrates an example of a programmatic interface that may be used to search for domain-specific recipes available from a machine learning service, according to at least some embodiments.
  • FIG. 16 illustrates an example of a machine learning service that automatically explores a range of parameter settings for recipe transformations on behalf of a client, and selects acceptable or recommended parameter settings based on results of such explorations, according to at least some embodiments.
  • FIG. 17 is a flow diagram illustrating aspects of operations that may be performed at a machine learning service that supports re-usable recipes for data set transformations, according to at least some embodiments.
  • FIG. 18 illustrates an example procedure for performing efficient in-memory filtering operations on a large input data set by a machine learning service, according to at least some embodiments.
  • FIG. 19 illustrates tradeoffs associated with varying the chunk size used for filtering operation sequences on machine learning data sets, according to at least some embodiments.
  • FIG. 20a illustrates an example sequence of chunk-level filtering operations, including a shuffle followed by a split, according to at least some embodiments.
  • FIG. 20b illustrates an example sequence of in-memory filtering operations that includes chunk-level filtering as well as intra-chunk filtering, according to at least some embodiments.
  • FIG. 21 illustrates examples of alternative approaches to in-memory sampling of a data set, according to at least some embodiments.
  • FIG. 22 illustrates examples of determining chunk boundaries based on the location of observation record boundaries, according to at least some embodiments.
  • FIG. 23 illustrates examples of jobs that may be scheduled at a machine learning service in response to a request for extraction of data records from any of a variety of data source types, according to at least some embodiments.
  • FIG. 24 illustrates example constituent elements of a record retrieval request that may be submitted by a client using a programmatic interface of an I/O (input-output) library implemented by a machine learning service, according to at least some embodiments.
  • I/O input-output
  • FIG. 25 is a flow diagram illustrating aspects of operations that may be performed at a machine learning service that implements an I/O library for in-memory filtering operation sequences on large input data sets, according to at least some embodiments.
  • FIG. 26 illustrates an example of an iterative procedure that may be used to improve the quality of predictions made by a machine learning model, according to at least some embodiments.
  • FIG. 27 illustrates an example of data set splits that may be used for cross-validation of a machine learning model, according to at least some embodiments.
  • FIG. 28 illustrates examples of consistent chunk-level splits of input data sets for cross validation that may be performed using a sequence of pseudo-random numbers, according to at least some embodiments.
  • FIG. 29 illustrates an example of an inconsistent chunk-level split of an input data set that may occur as a result of inappropriately resetting a pseudo-random number generator, according to at least some embodiments.
  • FIG. 30 illustrates an example timeline of scheduling related pairs of training and evaluation jobs, according to at least some embodiments.
  • FIG. 31 illustrates an example of a system in which consistency metadata is generated at a machine learning service in response to a client request, according to at least some embodiments.
  • FIG. 32 is a flow diagram illustrating aspects of operations that may be performed at a machine learning service in response to a request for training and evaluation iterations of a machine learning model, according to at least some embodiments.
  • FIG. 33 illustrates an example of a decision tree that may be generated for predictions at a machine learning service, according to at least some embodiments.
  • FIG. 34 illustrates an example of storing representations of decision tree nodes in a depth-first order at persistent storage devices during a tree-construction pass of a training phase for a machine learning model, according to at least some embodiments.
  • FIG. 35 illustrates an example of predictive utility distribution information that may be generated for the nodes of a decision tree, according to at least some embodiments.
  • FIG. 36 illustrates an example of pruning a decision tree based at least in part on a combination of a run-time memory footprint goal and cumulative predictive utility, according to at least some embodiments.
  • FIG. 37 illustrates an example of pruning a decision tree based at least in part on a prediction time variation goal, according to at least some embodiments.
  • FIG. 38 illustrates examples of a plurality of jobs that may be generated for training a model that uses an ensemble of decision trees at a machine learning service, according to at least some embodiments.
  • FIG. 39 is a flow diagram illustrating aspects of operations that may be performed at a machine learning service to generate and prune decision trees stored to persistent storage in depth-first order, according to at least some embodiments.
  • FIG. 40 illustrates an example of a machine learning service configured to generate feature processing proposals for clients based on an analysis of costs and benefits of candidate feature processing transformations, according to at least some embodiments.
  • FIG. 41 illustrates an example of selecting a feature processing set from several alternatives based on measured prediction speed and prediction quality, according to at least some embodiments.
  • FIG. 42 illustrates example interactions between a client and a feature processing manager of a machine learning service, according to at least some embodiments.
  • FIG. 43 illustrates an example of pruning candidate feature processing transformations using random selection, according to at least some embodiments.
  • FIG. 44 illustrates an example of a greedy technique for identifying recommended sets of candidate feature processing transformations, according to at least some embodiments.
  • FIG. 45 illustrates an example of a first phase of a feature processing optimization technique, in which a model is trained using a first set of candidate processed variables and evaluated, according to at least some embodiments.
  • FIG. 46 illustrates an example of a subsequent phase of the feature processing optimization technique, in which a model is re-evaluated using modified evaluation data sets to determine the impact on prediction quality of using various processed variables, according to at least some embodiments.
  • FIG. 47 illustrates another example phase of the feature processing optimization technique, in which a model is re-trained using a modified set of processed variables to determine the impact on prediction run-time cost of using a processed variable, according to at least some embodiments.
  • FIG. 48 is a flow diagram illustrating aspects of operations that may be performed at a machine learning service that recommends feature processing transformations based on quality vs. run-time cost tradeoffs, according to at least some embodiments.
  • FIG. 49 is an example of a programmatic dashboard interface that may enable clients to view the status of a variety of machine learning model runs, according to at least some embodiments.
  • FIG. 50 illustrates an example procedure for generating and using linear prediction models, according to at least some embodiments.
  • FIG. 51 illustrates an example scenario in which the memory capacity of a machine learning server that is used for training a model may become a constraint on parameter vector size, according to at least some embodiments.
  • FIG. 52 illustrates a technique in which a subset of features for which respective parameter values are stored in a parameter vector during training may be selected as pruning victims, according to at least some embodiments.
  • FIG. 53 illustrates a system in which observation records to be used for learning iterations of a linear model's training phase may be streamed to a machine learning service, according to at least some embodiments.
  • FIG. 54 is a flow diagram illustrating aspects of operations that may be performed at a machine learning service at which, in response to a detection of a triggering condition, parameters corresponding to one or more features may be pruned from a parameter vector to reduce memory consumption during training, according to at least some embodiments.
  • FIG. 55 illustrates a single-pass technique that may be used to obtain quantile boundary estimates of absolute values of weights assigned to features, according to at least some embodiments.
  • FIG. 56 illustrates examples of using quantile binning transformations to capture nonlinear relationships between raw input variables and prediction target variables of a machine learning model, according to at least some embodiments.
  • FIG. 57 illustrates examples of concurrent binning plans that may be generated during a training phase of a model at a machine learning service, according to at least some embodiments.
  • FIG. 58 illustrates examples of concurrent multi-variable quantile binning transformations that may be implemented at a machine learning service, according to at least some embodiments.
  • FIG. 59 illustrates examples of recipes that may be used for representing concurrent binning operations at a machine learning service, according to at least some embodiments.
  • FIG. 60 illustrates an example of a system in which clients may utilize programmatic interfaces of a machine learning service to indicate their preferences regarding the use of concurrent quantile binning, according to at least some embodiments.
  • FIG. 61 is a flow diagram illustrating aspects of operations that may be performed at a machine learning service at which concurrent quantile binning transformations are implemented, according to at least some embodiments.
  • FIG. 62 illustrates an example system environment in which a machine learning service implements an interactive graphical interface enabling clients to explore tradeoffs between various prediction quality metric goals, and to modify settings that can be used for interpreting model execution results, according to at least some embodiments.
  • FIG. 63 illustrates an example view of results of an evaluation run of a binary classification model that may be provided via an interactive graphical interface, according to at least some embodiments.
  • FIG. 64a and 64b collectively illustrate an impact of a change to a prediction interpretation threshold value, indicated by a client via a particular control of an interactive graphical interface, on a set of model quality metrics, according to at least some embodiments.
  • FIG. 65 illustrates examples of advanced metrics pertaining to an evaluation run of a machine learning model for which respective controls may be included in an interactive graphical interface, according to at least some embodiments.
  • FIG. 66 illustrates examples of elements of an interactive graphical interface that may be used to modify classification labels and to view details of observation records selected based on output variable values, according to at least some embodiments.
  • FIG. 67 illustrates an example view of results of an evaluation run of a multi-way classification model that may be provided via an interactive graphical interface, according to at least some embodiments.
  • FIG. 68 illustrates an example view of results of an evaluation run of a regression model that may be provided via an interactive graphical interface, according to at least some embodiments.
  • FIG. 69 is a flow diagram illustrating aspects of operations that may be performed at a machine learning service that implements interactive graphical interfaces enabling clients to modify prediction interpretation settings based on exploring evaluation results, according to at least some embodiments.
  • FIG. 70 illustrates an example duplicate detector that may utilize space-efficient representations of machine learning data sets to determine whether one data set is likely to include duplicate observation records of another data set at a machine learning service, according to at least some embodiments.
  • FIG. 71a and 71b collectively illustrate an example of a use of a Bloom filter for probabilistic detection of duplicate observation records at a machine learning service, according to at least some embodiments.
  • FIG. 72 illustrates examples of alternative duplicate definitions that may be used at a duplicate detector of a machine learning service, according to at least some embodiments.
  • FIG. 73 illustrates an example of a parallelized approach towards duplicate detection for large data sets at a machine learning service, according to at least some embodiments.
  • FIG. 74 illustrates an example of probabilistic duplicate detection within a given machine learning data set, according to at least some embodiments.
  • FIG. 75 is a flow diagram illustrating aspects of operations that may be performed at a machine learning service that implements duplicate detection of observation records, according to at least some embodiments.
  • FIG. 76 is a block diagram illustrating an example computing device that may be used in at least some embodiments.
  • MLS machine learning service
  • APIs application programming interfaces
  • the interfaces may, for example, allow non-experts to rely on default settings or parameters for various aspects of the procedures used for building, training and using machine learning models, where the defaults are derived from the accumulated experience of other practitioners addressing similar types of machine learning problems.
  • MLS clients may be able to extend the built-in capabilities of the service, e.g., by registering their own customized functions with the service.
  • the modules may in some cases be shared with other users of the service, while in other cases the use of the customized modules may be restricted to their implementers/owners.
  • a relatively straightforward recipe language may be supported, allowing MLS users to indicate various feature processing steps that they wish to have applied on data sets.
  • Such recipes may be specified in text format, and then compiled into executable formats that can be re-used with different data sets on different resource sets as needed.
  • the MLS may be implemented at a provider network that comprises numerous data centers with hundreds of thousands of computing and storage devices distributed around the world, allowing machine learning problems with terabyte-scale or petabyte-scale data sets and correspondingly large compute requirements to be addressed in a relatively transparent fashion while still ensuring high levels of isolation and security for sensitive data.
  • Pre-existing services of the provider network such as storage services that support arbitrarily large data objects accessible via web service interfaces, database services, virtual computing services, parallel-computing services, high-performance computing services, load-balancing services, and the like may be used for various machine learning tasks in at least some embodiments.
  • machine learning data e.g., raw input data, transformed/manipulated input data, intermediate results, or final results
  • models may be replicated across different geographical locations or availability containers as described below.
  • MLS control plane may be used herein to refer to a collection of hardware and/or software entities that are responsible for implementing various types of machine learning functionality on behalf of clients of the MLS, and for administrative tasks not necessarily visible to external MLS clients, such as ensuring that an adequate set of resources is provisioned to meet client demands, detecting and recovering from failures, generating bills, and so on.
  • MLS data plane may refer to the pathways and resources used for the processing, transfer, and storage of the input data used for client-requested operations, as well as the processing, transfer and storage of output data produced as a result of client-requested operations.
  • a number of different types of entities related to machine learning tasks may be generated, modified, read, executed, and/or queried/searched via MLS programmatic interfaces.
  • Supported entity types in one embodiment may include, among others, data sources (e.g., descriptors of locations or objects from which input records for machine learning can be obtained), sets of statistics generated by analyzing the input data, recipes (e.g., descriptors of feature processing transformations to be applied to input data for training models), processing plans (e.g., templates for executing various machine learning tasks), models (which may also be referred to as predictors), parameter sets to be used for recipes and/or models, model execution results such as predictions or evaluations, online access points for models that are to be used on streaming or real-time data, and/or aliases (e.g., pointers to model versions that have been "published" for use as described below). Instances of these entity types may be referred to as machine learning artifacts herein - for example,
  • the MLS programmatic interfaces may enable users to submit respective requests for several related tasks of a given machine learning workflow, such as tasks for extracting records from data sources, generating statistics on the records, feature processing, model training, prediction, and so on.
  • a given invocation of a programmatic interface (such as an API) may correspond to a request for one or more operations or tasks on one or more instances of a supported type of entity.
  • Some tasks (and the corresponding APIs) may involve multiple different entity types - e.g., an API requesting a creation of a data source may result in the generation of a data source entity instance as well as a statistics entity instance.
  • Some of the tasks of a given workflow may be dependent on the results of other tasks.
  • an asynchronous approach may be taken to scheduling the tasks, in which MLS clients can submit additional tasks that depend on the output of earlier-submitted tasks without waiting for the earlier-submitted tasks to complete.
  • a client may submit respective requests for tasks T2 and T3 before an earlier-submitted task T1 completes, even though the execution of T2 depends at least partly on the results of T1, and the execution of T3 depends at least partly on the results of T2.
  • the MLS may take care of ensuring that a given task is scheduled for execution only when its dependencies (if any dependencies exist) have been met.
  • a queue or collection of job objects may be used for storing internal representations of requested tasks in some implementations.
  • the term "task", as used herein, refers to a set of logical operations corresponding to a given request from a client, while the term "job" refers to the internal representation of a task within the MLS.
  • a given job object may represent the operations to be performed as a result of a client's invocation of a particular programmatic interface, as well as dependencies on other jobs.
  • the MLS may be responsible for ensuring that the dependencies of a given job have been met before the corresponding operations are initiated.
  • the MLS may also be responsible in such embodiments for generating a processing plan for each job, identifying the appropriate set of resources (e.g., CPUs/cores, storage or memory) for the plan, scheduling the execution of the plan, gathering results, providing/saving the results in an appropriate destination, and at least in some cases for providing status updates or responses to the requesting clients.
  • the MLS may also be responsible in some embodiments for ensuring that the execution of one client's jobs does not affect or interfere with the execution of other clients' jobs.
  • partial dependencies among tasks may be supported - e.g., in a sequence of tasks (T1, T2, T3), T2 may depend on partial completion of T1, and T2 may therefore be scheduled before T1 completes.
  • T1 may comprise two phases or passes P1 and P2 of statistics calculations, and T2 may be able to proceed as soon as phase P1 is completed, without waiting for phase P2 to complete.
  • Partial results of T1 (e.g., at least some statistics computed during phase P1)
  • a single shared queue that includes jobs corresponding to requests from a plurality of clients of the MLS may be used in some implementations, while in other implementations respective queues may be used for different clients.
  • lists or other data structures that can be used to model object collections may be used as containers of to-be-scheduled jobs instead of or in addition to queues.
  • a single API request from a client may lead to the generation of several different job objects by the MLS.
  • not all client API requests may be implemented using jobs - e.g., a relatively short or lightweight task may be performed synchronously with respect to the corresponding request, without incurring the overhead of job creation and asynchronous job scheduling.
  • the APIs implemented by the MLS may in some embodiments allow clients to submit requests to create, query the attributes of, read, update/modify, search, or delete an instance of at least some of the various entity types supported.
  • for the entity type "DataSource", respective APIs similar to "createDataSource", "describeDataSource" (to obtain the values of attributes of the data source), "updateDataSource", "searchForDataSource", and "deleteDataSource" may be supported by the MLS.
  • a similar set of APIs may be supported for recipes, models, and so on.
  • Some entity types may also have APIs for executing or running the entities, such as "executeModel" or "executeRecipe" in various embodiments.
  • the APIs may be designed to be largely easy to learn and self-documenting (e.g., such that the correct way to use a given API is obvious to non-experts), with an emphasis on making it simple to perform the most common tasks without making it too hard to perform more complex tasks.
  • multiple versions of the APIs may be supported: e.g., one version for a wire protocol (at the application level of a networking stack), another version as a Java™ library or SDK (software development kit), another version as a Python library, and so on.
  • API requests may be submitted by clients using HTTP (Hypertext Transfer Protocol), HTTPS (secure HTTP), JavaScript, XML, or the like in various implementations.
  • some machine learning models may be created and trained, e.g., by a group of model developers or data scientists using the MLS APIs, and then published for use by another community of users.
  • the "alias" entity type may be supported in such embodiments.
  • an alias may comprise an immutable name (e.g., "SentimentAnalysisModel1") and a pointer to a model that has already been created and stored in an MLS artifact repository (e.g., "samModel-23adf-2013-12-13-08-06-01", an internal identifier generated for the model by the MLS).
  • the machine learning model exposed via the alias may represent a "black box" tool, already validated by experts, which is expected to provide useful predictions for various input data sets.
  • the business analysts may not be particularly concerned about the internal working of such a model.
  • the model developers may continue to experiment with various algorithms, parameters and/or input data sets to obtain improved versions of the underlying model, and may be able to change the pointer to point to an enhanced version to improve the quality of predictions obtained by the business analysts.
  • the MLS may guarantee that (a) an alias can only point to a model that has been successfully trained and (b) when an alias pointer is changed, both the original model and the new model (i.e., the respective models being pointed to by the old pointer and the new pointer) consume the same type of input and provide the same type of prediction (e.g., binary classification, multi-class classification or regression).
  • a given model may itself be designated as un-modifiable if an alias is created for it - e.g., the model referred to by the pointer "samModel-23adf-2013-12-13-08-06-01" may no longer be modified even by its developers after the alias is created in such an implementation.
  • Such clean separation of roles and capabilities with respect to model development and use may allow larger audiences within a business organization to benefit from machine learning models than simply those skilled enough to develop the models.
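The alias guarantees described above (the target must be trained, a re-pointed alias must keep the same input and prediction types, and an aliased model becomes un-modifiable) can be sketched as follows. The classes `Model` and `Alias`, their fields, and the short model identifiers are hypothetical; the text does not specify how the MLS enforces these rules.

```python
from dataclasses import dataclass


@dataclass
class Model:
    model_id: str
    input_type: str
    prediction_type: str  # e.g. "binary", "multiclass", "regression"
    trained: bool = False
    frozen: bool = False  # set once an alias points at the model


class Alias:
    def __init__(self, name, model):
        self._name = name
        self._model = None
        self.repoint(model)

    @property
    def name(self):
        return self._name  # read-only property: the alias name is immutable

    @property
    def model(self):
        return self._model

    def repoint(self, new_model):
        # Guarantee (a): only successfully trained models may be aliased.
        if not new_model.trained:
            raise ValueError("alias target must be a successfully trained model")
        # Guarantee (b): old and new models share input and prediction types.
        old = self._model
        if old is not None and (
            (old.input_type, old.prediction_type)
            != (new_model.input_type, new_model.prediction_type)
        ):
            raise ValueError("new model must keep the same input and prediction types")
        new_model.frozen = True  # aliased models may no longer be modified
        self._model = new_model


v1 = Model("samModel-v1", "text", "binary", trained=True)
v2 = Model("samModel-v2", "text", "binary", trained=True)
bad = Model("samModel-v3", "text", "regression", trained=True)

alias = Alias("SentimentAnalysisModel1", v1)
alias.repoint(v2)  # allowed: same input and prediction types
try:
    alias.repoint(bad)  # rejected: prediction type differs
    rejected = False
except ValueError:
    rejected = True
```

Business analysts would keep calling the alias by name while developers swap the underlying model behind it.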
  • a number of choices may be available with respect to the manner in which the operations corresponding to a given job are mapped to MLS servers. For example, it may be possible to partition the work required for a given job among many different servers to achieve better performance. As part of developing the processing plan for a job, the MLS may select a workload distribution strategy for the job in some embodiments. The parameters determined for workload distribution in various embodiments may differ based on the nature of the job.
  • Such factors may include, for example, (a) determining a number of passes of processing, (b) determining a parallelization level (e.g., the number of "mappers” and “reducers” in the case of a job that is to be implemented using the Map-Reduce technique), (c) determining a convergence criterion to be used to terminate the job, (d) determining a target durability level for intermediate data produced during the job, or (e) determining a resource capacity limit for the job (e.g., a maximum number of servers that can be assigned to the job based on the number of servers available in MLS server pools, or on the client's budget limit).
  • the actual set of resources to be used may be identified in accordance with the strategy, and the job's operations may be scheduled on the identified resources.
  • a pool of compute servers and/or storage servers may be pre-configured for the MLS, and the resources for a given job may be selected from such a pool.
  • the resources may be selected from a pool assigned to the client on whose behalf the job is to be executed - e.g., the client may acquire resources from a computing service of the provider network prior to submitting API requests, and may provide an indication of the acquired resources to the MLS for job scheduling.
  • when client-provided code (e.g., code that has not necessarily been thoroughly tested by the MLS, and/or is not included in the MLS's libraries) is to be used for a given job, the client may be required to acquire the resources to be used for the job, so that any side effects of running the client-provided code may be restricted to the client's own resources instead of potentially affecting other clients.
  • FIG. 1 illustrates an example system environment in which various components of a machine learning service (MLS) may be implemented, according to at least some embodiments.
  • the MLS may implement a set of programmatic interfaces 161 (e.g., APIs, command-line tools, web pages, or standalone GUIs) that can be used by clients 164 (e.g., hardware or software entities owned by or assigned to customers of the MLS) to submit requests 111 for a variety of machine learning tasks or operations.
  • the administrative or control plane portion of the MLS may include MLS request handler 180, which accepts the client requests 111 and inserts corresponding job objects into MLS job queue 142, as indicated by arrow 112.
  • control plane of the MLS may comprise a plurality of components (including the request handler, workload distribution strategy selectors, one or more job schedulers, metrics collectors, and modules that act as interfaces with other services) which may also be referred to collectively as the MLS manager.
  • the data plane of the MLS may include, for example, at least a subset of the servers of pool(s) 185, storage devices that are used to store input data sets, intermediate results or final results (some of which may be part of the MLS artifact repository), and the network pathways used for transferring client input data and results.
  • each job object may indicate one or more operations that are to be performed as a result of the invocation of a programmatic interface 161, and the scheduling of a given job may in some cases depend upon the successful completion of at least a subset of the operations of an earlier-generated job.
  • job queue 142 may be managed as a first-in-first-out (FIFO) queue, with the further constraint that the dependency requirements of a given job must have been met in order for that job to be removed from the queue.
  • jobs created on behalf of several different clients may be placed in a single queue, while in other embodiments multiple queues may be maintained (e.g., one queue in each data center of the provider network being used, or one queue per MLS customer).
  • the next job whose dependency requirements have been met may be removed from job queue 142 in the depicted embodiment, as indicated by arrow 113, and a processing plan comprising a workload distribution strategy may be identified for it.
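The queue discipline described above, first-in-first-out with the added constraint that a job leaves the queue only once its dependencies are met, can be sketched as follows. The class `JobQueue` and its method names are hypothetical, and skipping over a blocked job to reach the next eligible one is one possible reading of "the next job whose dependency requirements have been met".

```python
from collections import deque


class JobQueue:
    """Hypothetical FIFO job queue with a dependency constraint."""

    def __init__(self):
        self._jobs = deque()  # (job_id, prerequisite job ids), in arrival order
        self._completed = set()

    def insert(self, job_id, depends_on=()):
        self._jobs.append((job_id, frozenset(depends_on)))

    def mark_completed(self, job_id):
        self._completed.add(job_id)

    def remove_next(self):
        """Remove and return the oldest job whose dependencies are all complete."""
        for index, (job_id, deps) in enumerate(self._jobs):
            if deps <= self._completed:
                del self._jobs[index]
                return job_id
        return None  # every queued job is still blocked, or the queue is empty


queue = JobQueue()
queue.insert("J1")
queue.insert("J2", depends_on=["J1"])
queue.insert("J3")

first = queue.remove_next()   # J1 has no dependencies
second = queue.remove_next()  # J2 is blocked on J1, so J3 is taken
third = queue.remove_next()   # only J2 remains, and it is still blocked
queue.mark_completed("J1")
fourth = queue.remove_next()  # J2 is now eligible
```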
  • the workload distribution strategy layer 175, which may also be a component of the MLS control plane as mentioned earlier, may determine the manner in which the lower level operations of the job are to be distributed among one or more compute servers (e.g., servers selected from pool 185), and/or the manner in which the data analyzed or manipulated for the job is to be distributed among one or more storage devices or servers.
  • the job's operations may be scheduled on the resources.
  • Results of some jobs may be stored as MLS artifacts within repository 120 in some embodiments, as indicated by arrow 142.
  • client requests 111 may result in the immediate generation, retrieval, storage, or modification of corresponding artifacts within MLS artifact repository 120 by the MLS request handler 180 (as indicated by arrow 141).
  • the insertion of a job object in job queue 142 may not be required for all types of client requests.
  • a creation or removal of an alias for an existing model may not require the creation of a new job in such embodiments.
  • clients 164 may be able to view at least a subset of the artifacts stored in repository 120, e.g., by issuing read requests 118 via programmatic interfaces 161.
  • a client request 111 may indicate one or more parameters that may be used by the MLS to perform the operations, such as a data source definition 150, a feature processing transformation recipe 152, or parameters 154 to be used for a particular machine learning algorithm.
  • artifacts respectively representing the parameters may also be stored in repository 120.
  • Some machine learning workflows, which may correspond to a sequence of API requests from a client 164, may include the extraction and cleansing of input data records from raw data repositories 130 (e.g., repositories indicated in data source definitions 150) by input record handlers 160 of the MLS, as indicated by arrow 114.
  • This first portion of the workflow may be initiated in response to a particular API invocation from a client 164, and may be executed using a first set of resources from pool 185.
  • the input record handlers may, for example, perform such tasks as splitting the data records, sampling the data records, and so on, in accordance with a set of functions defined in an I/O (input/output) library of the MLS.
  • the input data may comprise data records that include variables of any of a variety of data types, such as, for example, text, a numeric data type (e.g., real or integer), Boolean, a binary data type, a categorical data type, an image processing data type, an audio processing data type, a bioinformatics data type, a structured data type such as a data type compliant with the Unstructured Information Management Architecture (UIMA), and so on.
  • the input data reaching the MLS may be encrypted or compressed, and the MLS input data handling machinery may have to perform decryption or decompression before the input data records can be used for machine learning tasks.
  • MLS clients may have to provide decryption metadata (e.g., keys, passwords, or other credentials) to the MLS to allow the MLS to decrypt data records.
  • an indication of the compression technique used may be provided by the clients in some implementations to enable the MLS to decompress the input data records appropriately.
  • the output produced by the input record handlers may be fed to feature processors 162 (as indicated by arrow 115), where a set of transformation operations may be performed in accordance with recipes 152 using another set of resources from pool 185.
  • any of a variety of feature processing approaches may be used depending on the problem domain: e.g., the recipes typically used for computer vision problems may differ from those used for voice recognition problems, natural language processing, and so on.
  • the output 116 of the feature processing transformations may in turn be used as input for a selected machine learning algorithm 166, which may be executed in accordance with algorithm parameters 154 using yet another set of resources from pool 185.
  • a wide variety of machine learning algorithms may be supported natively by the MLS libraries, including for example random forest algorithms, neural network algorithms, stochastic gradient descent algorithms, and the like.
  • the MLS may be designed to be extensible - e.g., clients may provide or register their own modules (which may be defined as user-defined functions) for input record handling, feature processing, or for implementing machine learning algorithms beyond those supported natively by the MLS.
  • some of the intermediate results (e.g., summarized statistics produced by the input record handlers) of a machine learning workflow may be stored in MLS artifact repository 120.
  • the MLS may maintain knowledge base 122 containing information on best practices for various machine learning tasks. Entries may be added into the best practices KB 122 by various control-plane components of the MLS, e.g., based on metrics collected from server pools 185, feedback provided by clients 164, and so on. Clients 164 may be able to search for and retrieve KB entries via programmatic interfaces 161, as indicated by arrow 117, and may use the information contained in the entries to select parameters (such as specific recipes or algorithms to be used) for their request submissions. In at least some embodiments, new APIs may be implemented (or default values for API parameters may be selected) by the MLS on the basis of best practices identified over time for various types of machine learning practices.
  • FIG. 2 illustrates an example of a machine learning service implemented using a plurality of network-accessible services of a provider network, according to at least some embodiments.
  • Networks set up by an entity such as a company or a public sector organization to provide one or more services (such as various types of multi-tenant and/or single-tenant cloud-based computing or storage services) accessible via the Internet and/or other networks to a distributed set of clients may be termed provider networks herein.
  • a given provider network may include numerous data centers hosting various resource pools, such as collections of physical and/or virtualized computer servers, storage devices, networking equipment and the like, needed to implement, configure and distribute the infrastructure and services offered by the provider.
  • At least some provider networks and the corresponding network-accessible services may be referred to as "public clouds” and “public cloud services” respectively.
  • some data centers may be located in different cities, states or countries than others, and in some embodiments the resources allocated to a given service such as the MLS may be distributed among several such locations to achieve desired levels of availability, fault-resilience and performance, as described below in greater detail with reference to FIG. 3.
  • the MLS utilizes storage service 252, computing service 258, and database service 255 of provider network 202. At least some of these services may also be used concurrently by other customers (e.g., other services implemented at the provider network, and/or external customers outside the provider network) in the depicted embodiment, i.e., the services may not be restricted to MLS use.
  • MLS gateway 222 may be established to receive client requests 210 submitted over external network 206 (such as portions of the Internet) by clients 164.
  • MLS gateway 222 may, for example, be configured with a set of publicly accessible IP (Internet Protocol) addresses that can be used to access the MLS.
  • the client requests may be formatted in accordance with a representational state transfer (REST) API implemented by the MLS in some embodiments.
  • MLS customers may be provided an SDK (software development kit) 204 for local installation at client computing devices, and the requests 210 may be submitted from within programs written in conformance with the SDK.
  • a client may also or instead access MLS functions from a compute server 262 of computing service 258 that has been allocated to the client in various embodiments.
  • Storage service 252 may, for example, implement a web services interface that can be used to create and manipulate unstructured data objects of arbitrary size.
  • Database service 255 may implement either relational or non-relational databases.
  • the storage service 252 and/or the database service 255 may play a variety of roles with respect to the MLS in the depicted embodiment.
  • the MLS may require clients 164 to define data sources within the provider network boundary for their machine learning tasks in some embodiments.
  • clients may first transfer data from external data sources 229 into internal data sources within the provider network, such as internal data source 230A managed by storage service 252, or internal data source 230B managed by database service 255.
  • the clients of the MLS may already be using the provider network services for other applications, and some of the output of those applications (e.g., web server logs or video files), saved at the storage service 252 or the database service 255, may serve as the data sources for MLS workflows.
  • the MLS request handler 180 may generate and store corresponding job objects within a job queue 142, as discussed above.
  • the job queue 142 may itself be represented by a database object (e.g., a table) stored at database service 255.
  • a job scheduler 272 may retrieve a job from queue 142, e.g., after checking that the job's dependency requirements have been met, and identify one or more servers 262 from computing service 258 to execute the job's computational operations. Input data for the computations may be read from the internal or external data sources by the servers 262.
  • the MLS artifact repository 220 may be implemented within the database service 255 (and/or within the storage service 252) in various embodiments. In some embodiments, intermediate or final results of various machine learning tasks may also be stored within the storage service 252 and/or the database service 255.
  • Other services of the provider network may also be used by the MLS in some embodiments.
  • a load balancing service may, for example, be used to automatically distribute computational load among a set of servers 262.
  • a parallel computing service that implements the Map-Reduce programming model may be used for some types of machine learning tasks.
  • Automated scaling services may be used to add or remove servers assigned to a particular long-lasting machine learning task.
  • Authorization and authentication of client requests may be performed with the help of an identity management service of the provider network in some embodiments.
  • a provider network may be organized into a plurality of geographical regions, and each region may include one or more availability containers, which may also be termed "availability zones".
  • An availability container in turn may comprise portions or all of one or more distinct physical premises or data centers, engineered in such a way (e.g., with independent infrastructure components such as power-related equipment, cooling equipment, and/or physical security components) that the resources in a given availability container are insulated from failures in other availability containers.
  • a failure in one availability container may not be expected to result in a failure in any other availability container; thus, the availability profile of a given physical host or server is intended to be independent of the availability profile of other hosts or servers in a different availability container.
  • provider network resources may also be partitioned into distinct security containers in some embodiments.
  • a given security container may include resources managed by several different provider network services, such as a computing service, a storage service, or a database service.
  • FIG. 3 illustrates an example of the use of a plurality of availability containers and security containers of a provider network for a machine learning service, according to at least some embodiments.
  • provider network 302 comprises availability containers 366A, 366B and 366C, each of which may comprise portions or all of one or more data centers.
  • Each availability container 366 has its own set of MLS control-plane components 344: e.g., control plane components 344A-344C in availability containers 366A-366C respectively.
  • the control plane components in a given availability container may include, for example, an instance of an MLS request handler, one or more MLS job queues, a job scheduler, workload distribution components, and so on.
  • the control plane components in different availability containers may communicate with each other as needed, e.g., to coordinate tasks that utilize resources at more than one data center.
  • Each availability container 366 has a respective pool 322 (e.g., 322A-322C) of MLS servers to be used in a multi-tenant fashion.
  • the servers of the pools 322 may each be used to perform a variety of MLS operations, potentially for different MLS clients concurrently.
  • single-tenant server pools that are designated for only a single client's workload may be used, such as single-tenant server pools 330A, 330B and 330C.
  • Pools 330A and 330B belong to security container 390A, while pool 330C is part of security container 390B.
  • Security container 390A may be used exclusively for a customer C1 (e.g., to run customer-provided machine learning modules, or third-party modules specified by the customer), while security container 390B may be used exclusively for a different customer C2 in the depicted example.
  • At least some of the resources used by the MLS may be arranged in redundancy groups that cross availability container boundaries, such that MLS tasks can continue despite a failure that affects MLS resources of a given availability container.
  • a redundancy group RG1 comprising at least one server S1 in availability container 366A, and at least one server S2 in availability container 366B may be established, such that S1's MLS-related workload may be failed over to S2 (or vice versa).
  • the state of a given MLS job may be check-pointed to persistent storage (e.g., at a storage service or a database service of the provider network that is also designed to withstand single-availability-container failures) periodically, so that a failover server can resume a partially-completed task from the most recent checkpoint instead of having to start over from the beginning.
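The checkpoint-and-resume behavior described above can be sketched with a toy job that sums a list and periodically writes its progress to a file. The function `run_with_checkpoints`, the JSON state layout, and the local temporary file standing in for a durable storage or database service are all assumptions made for illustration.

```python
import json
import os
import tempfile


def run_with_checkpoints(items, checkpoint_path, every=2, fail_at=None):
    """Sum `items`, saving progress every `every` items; resume if a checkpoint exists."""
    state = {"next_index": 0, "total": 0}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)  # pick up from the most recent checkpoint
    for i in range(state["next_index"], len(items)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated server failure")
        state["total"] += items[i]
        state["next_index"] = i + 1
        if state["next_index"] % every == 0:
            with open(checkpoint_path, "w") as f:
                json.dump(state, f)
    return state["total"]


path = os.path.join(tempfile.mkdtemp(), "job.ckpt")
data = [1, 2, 3, 4, 5]

try:
    run_with_checkpoints(data, path, fail_at=3)  # first server fails part-way through
except RuntimeError:
    pass

with open(path) as f:
    resumed_from = json.load(f)["next_index"]

total = run_with_checkpoints(data, path)  # failover server resumes from the checkpoint
```

Work done after the last checkpoint (here, the third item) is repeated on the failover server, which is the usual trade-off between checkpoint frequency and recovery cost. A production version would also write the checkpoint atomically.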
  • the storage service and/or the database service of the provider network may inherently provide very high levels of data durability, e.g., using erasure coding or other replication techniques, so the data sets may not necessarily have to be copied in the event of a failure.
  • clients of the MLS may be able to specify the levels of data durability desired for their input data sets, intermediate data sets, artifacts, and the like, as well as the level of compute server availability desired.
  • the MLS control plane may determine, based on the client requirements, whether resources in multiple availability containers should be used for a given task or a given client.
  • the billing amounts that the clients have to pay for various MLS tasks may be based at least in part on their durability and availability requirements.
  • some clients may indicate to the MLS control-plane that they only wish to use resources within a given availability container or a given security container.
  • the costs of transmitting data sets and/or results over long distances may be so high, or the time required for the transmissions may be so long, that the MLS may restrict the tasks to within a single geographical region of the provider network (or even within a single data center).
  • the MLS control plane may be responsible for generating processing plans corresponding to each of the job objects generated in response to client requests in at least some embodiments. For each processing plan, a corresponding set of resources may then have to be identified to execute the plan, e.g., based on the workload distribution strategy selected for the plan, the available resources, and so on.
  • FIG. 4 illustrates examples of various types of processing plans and corresponding resource sets that may be generated at a machine learning service, according to at least some embodiments.
  • MLS job queue 142 comprises five jobs, each corresponding to the invocation of a respective API by a client.
  • Job J1 (shown at the head of the queue) was created in response to an invocation of API1.
  • Jobs J2 through J5 were created respectively in response to invocations of API2 through API5.
  • an input data cleansing plan 422 may be generated, and the plan may be executed using resource set RS1.
  • the input data cleansing plan may include operations to read and validate the contents of a specified data source, fill in missing values, identify and discard (or otherwise respond to) input records containing errors, and so on. In some cases the input data may also have to be decompressed, decrypted, or otherwise manipulated before it can be read for cleansing purposes.
  • a statistics generation plan 424 may be generated, and subsequently executed on resource set RS2.
  • the statistics generation plan may indicate the types of statistics to be generated for each data attribute (e.g., mean, minimum, maximum, standard deviation, quantile binning, and so on for numeric attributes), and the manner in which the statistics are to be generated (e.g., whether all the records generated by the data cleansing plan 422 are to be used for the statistics, or a sub-sample is to be used).
  • the execution of job J2 may be dependent on the completion of job J1 in the depicted embodiment, although the client request that led to the generation of job J2 may have been submitted well before J1 is completed.
  • a recipe-based feature processing plan 426 corresponding to job J3 may be generated, and executed on resource set RS3. Further details regarding the syntax and management of recipes are provided below.
  • Job J4 may result in the generation of a model training plan 428 (which may in turn involve several iterations of training, e.g., with different sets of parameters).
  • the model training may be performed using resource set RS4.
  • Model execution plan 430 may correspond to job J5 (resulting from the client's invocation of API5), and the model may eventually be executed using resource set RS5.
  • the same set of resources may be used for performing several or all of a client's jobs - e.g., the resource sets RS1 - RS5 may not necessarily differ from one another.
  • a client may indicate, e.g., via parameters included in an API call, various elements or properties of a desired processing plan, and the MLS may take such client preferences into account. For example, for a particular statistics generation job, a client may indicate that a randomly-selected sample of 25% of the cleansed input records may be used, and the MLS may generate a statistics generation plan that includes a step of generating a random sample of 25% of the data accordingly.
  • the MLS control plane may be given more freedom to decide exactly how a particular job is to be implemented, and it may consult its knowledge base of best practices to select the parameters to be used.
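The client-specified sampling preference described above (statistics computed on a randomly selected 25% of the cleansed records) can be sketched as a plan step followed by a statistics step. The function `statistics_plan`, its parameters, and the fixed seed are hypothetical; the text does not say how the MLS draws its sample.

```python
import random
import statistics


def statistics_plan(records, sample_fraction=1.0, seed=0):
    """Draw a random sample of the records, then compute basic statistics on it."""
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    sample_size = max(1, round(len(records) * sample_fraction))
    sample = rng.sample(records, sample_size)  # sampling without replacement
    return {
        "sample_size": len(sample),
        "mean": statistics.mean(sample),
        "min": min(sample),
        "max": max(sample),
    }


cleansed_records = list(range(1, 101))  # stand-in for cleansed numeric input
result = statistics_plan(cleansed_records, sample_fraction=0.25)
```

When the client gives no preference, the same function could be called with a fraction chosen by the control plane, for example from its knowledge base of best practices.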
  • FIG. 5 illustrates an example of asynchronous scheduling of jobs at a machine learning service, according to at least some embodiments.
  • a client has invoked four MLS APIs, API1 through API4, and four corresponding job objects J1 through J4 are created and placed in job queue 142.
  • Timelines TL1, TL2, and TL3 show the sequence of events from the perspective of the client that invokes the APIs, the request handler that creates and inserts the jobs in queue 142, and a job scheduler that removes the jobs from the queue and schedules the jobs at selected resources.
  • phase-based dependencies may be handled by splitting a job with N phases into N smaller jobs, thereby converting partial dependencies into full dependencies.
  • J1 has no dependencies of either type in the depicted example.
  • API1 through API4 may be invoked within the time period t0 to t1. Even though some of the operations requested by the client depend on the completion of operations corresponding to earlier-invoked APIs, the MLS may allow the client to submit the dependent operation requests much earlier than the processing of the earlier-invoked APIs' jobs in the depicted embodiment.
  • parameters specified by the client in the API calls may indicate the inter-job dependencies.
  • in response to API1, the client may be provided with a job identifier for J1, and that job identifier may be included as a parameter in API2 to indicate that the results of API1 are required to perform the operations corresponding to API2.
  • the jobs corresponding to each API call may be created and queued shortly after the API is invoked. Thus, all four jobs have been generated and placed within the job queue 142 by a short time after t1.
  • job J1 may be scheduled for execution at time t2.
  • the delay between the insertion of J1 in queue 142 (shortly after t0) and the scheduling of J1 may occur for a number of reasons in the depicted embodiment - e.g., because there may have been other jobs ahead of J1 in the queue 142, or because it takes some time to generate a processing plan for J1 and identify the resources to be used for J1, or because enough resources were not available until t2.
  • J1's execution lasts until t3.
  • when J1 completes, (a) the client is notified and (b) J2 is scheduled for execution.
  • as indicated by J2's dependsOnComplete parameter value, J2 depends on J1's completion, and J2's execution could therefore not have begun until t3, even if J2's processing plan were ready and J2's resource set had been available prior to t3.
  • J3 can be started when a specified phase or subset of J2's work is complete in the depicted example.
  • the portion of J2 upon which J3 depends completes at time t4 in the illustrated example, and the execution of J3 therefore begins (in parallel with the execution of the remaining portion of J2) at t4.
  • the client may be notified at time t4 regarding the partial completion of J2 (e.g., the results of the completed phase of J2 may be provided to the client).
  • J4 also depends on the completion of J2, so J4 cannot be started until J2 completes at t6. J3 continues execution until t8. J4 completes at t7, earlier than t8. The client is notified regarding the completion of each of the jobs corresponding to the respective API invocations API1 - API4 in the depicted example scenario.
  • partial dependencies between jobs may not be supported - instead, as mentioned earlier, in some cases such dependencies may be converted into full dependencies by splitting multi-phase jobs into smaller jobs.
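Converting a partial dependency into full dependencies by splitting a multi-phase job can be sketched as follows. The function `split_into_phase_jobs`, the tuple layout, the sub-job identifier scheme ("J2.1", "J2.2"), and the phase names are hypothetical; the text only states that a job with N phases may be split into N smaller jobs.

```python
def split_into_phase_jobs(job_id, phases, depends_on=()):
    """Return one (job_id, phase, dependencies) tuple per phase, chained in order."""
    jobs = []
    previous = list(depends_on)  # the first phase inherits the original dependencies
    for number, phase in enumerate(phases, start=1):
        phase_job_id = f"{job_id}.{number}"
        jobs.append((phase_job_id, phase, previous))
        previous = [phase_job_id]  # each later phase fully depends on the one before
    return jobs


# J2 has two phases and depends on J1; J3 needs only J2's first phase.
j2_jobs = split_into_phase_jobs("J2", ["phase-1", "phase-2"], depends_on=["J1"])
j3 = ("J3", "feature-processing", [j2_jobs[0][0]])
```

After the split, J3's partial dependency on J2 becomes an ordinary full dependency on the smaller job "J2.1", so a scheduler that understands only full dependencies can still start J3 early.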
  • clients may be able to submit queries to the MLS to determine the status (or the extent of completion) of the operations corresponding to various API calls.
  • an MLS job monitoring web page may be implemented, enabling clients to view the progress of their requests (e.g., via a "percent complete” indicator for each job), expected completion times, and so on.
  • a polling mechanism may be used by clients to determine the progress or completion of the jobs.
  • FIG. 6 illustrates example artifacts that may be generated and stored using a machine learning service, according to at least some embodiments.
  • MLS artifacts may comprise any of the objects that may be stored in a persistent manner as a result of an invocation of an MLS programmatic interface.
  • some API parameters (e.g., text versions of recipes) may also be stored as artifacts.
  • MLS artifacts 601 may include, among others, data sources 602, statistics 603, feature processing recipes 606, model predictions 608, evaluations 610, modifiable or in- development models 630, and published models or aliases 640.
  • the MLS may generate a respective unique identifier for each instance of at least some of the types of artifacts shown and provide the identifiers to the clients.
  • the identifiers may subsequently be used by clients to refer to the artifact (e.g., in subsequent API calls, in status queries, and so on).
  • a client request to create a data source artifact 602 may include, for example, an indication of an address or location from which data records can be read, and some indication of the format or schema of the data records. For example, an indication of a source URI (uniform resource identifier) to which HTTP GET requests can be directed to retrieve the data records, an address of a storage object at a provider network storage service, or a database table identifier may be provided.
  • the format (e.g., the sequence and types of the fields or columns of the data records) may be indicated in some implementations via a separate comma-separated values (CSV) file.
  • the MLS may be able to deduce at least part of the address and/or format information needed to create the data source artifact - e.g., based on the client's identifier, it may be possible to infer the root directory or root URI of the client's data source, and based on an analysis of the first few records, it may be possible to deduce at least the data types of the columns of the schema.
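Deducing column data types from the first few records can be sketched as follows. The functions `_classify` and `infer_column_types`, the type names, the five-row sample size, and the assumption of headerless comma-separated input are all illustrative choices, not details given in the text.

```python
import csv
import io


def _classify(value):
    """Guess the type of a single field value."""
    if value.lower() in ("true", "false"):
        return "boolean"
    for caster, name in ((int, "integer"), (float, "real")):
        try:
            caster(value)
            return name
        except ValueError:
            pass
    return "text"


def infer_column_types(csv_text, sample_rows=5):
    """Infer one type per column from the first `sample_rows` records (no header row)."""
    reader = csv.reader(io.StringIO(csv_text))
    rows = [row for _, row in zip(range(sample_rows), reader)]
    inferred = []
    for column in zip(*rows):
        kinds = {_classify(v) for v in column}
        if len(kinds) == 1:
            inferred.append(kinds.pop())
        elif kinds <= {"integer", "real"}:
            inferred.append("real")  # mixed integers and reals widen to real
        else:
            inferred.append("text")  # anything else falls back to text
    return inferred


sample = "1,3.5,true,alice\n2,4,false,bob\n3,2.25,true,carol\n"
schema = infer_column_types(sample)
```

Because only a prefix of the data is inspected, a later record may contradict the inferred type, so a client-supplied schema would still take precedence where one is available.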
  • the client request to create a data source may also include a request to re-arrange the raw input data, e.g., by sampling or splitting the data records using an I/O library of the MLS.
  • clients may also be required to provide security credentials that can be used by the MLS to access the data records.
  • At least some statistics 603 may be generated automatically for the data records of a data source.
  • the MLS may also or instead enable clients to explicitly request the generation of various types of statistics, e.g., via the equivalent of a createStatistics(dataSourceID, statisticsDescriptor) request in which the client indicates the types of statistics to be generated for a specified data source.
  • the types of statistics artifacts that are generated may vary based on the data types of the input record variables - e.g., for numeric variables, the mean, median, minimum, maximum, standard deviation, quantile bins, number of nulls or "not-applicable" values and the like may be generated.
  • Cross-variable statistics such as correlations may also be generated, either automatically or on demand, in at least some embodiments.
  • Recipes 606 comprising feature processing transformation instructions may be provided by a client (or selected from among a set of available recipes accessible from an MLS recipe collection) in some embodiments.
  • a recipe language allowing clients to define groups of variables, assignments, dependencies upon other artifacts such as models, and transformation outputs may be supported by the MLS in such embodiments, as described below in greater detail.
  • Recipes submitted in text form may be compiled into executable versions and re-used on a variety of data sets in some implementations.
  • At least two types of artifacts representing machine learning models or predictors may be generated and stored in the depicted embodiment.
  • the process of developing and refining a model may take a long time, as the developer may try to improve the accuracy of the predictions using a variety of data sets and a variety of parameters.
  • Some models may be improved over a number of weeks or months, for example. In such scenarios it may be worthwhile to enable other users (e.g., business analysts) to utilize one version of a model, while model developers continue to generate other, improved versions.
  • the artifacts representing models may belong to one of two categories in some embodiments: modifiable models 630, and published models or aliases 640.
  • An alias may comprise an alias name or identifier, and a pointer to a model (e.g., alias 640A points to model 630B, and alias 640B points to model 630D in the depicted embodiment).
  • publishing a model refers to making a particular version of a model executable by a set of users by reference to an alias name or identifier. In some cases, at least some of the users of the set may not be permitted to modify the model or the alias.
  • Non-expert users 678 may be granted read and execute permissions to the aliases, while model developers 676 may also be allowed to modify models 630 (and/or the pointers of the aliases 640) in some embodiments.
  • a set of guarantees may be provided to alias users: e.g., that the format of the input and output of an alias (and the underlying model referred to by the alias) will not change once the alias is published, and that the model developers have thoroughly tested and validated the underlying model pointed to by the alias.
  • a number of other logical constraints may be enforced with respect to aliases in such embodiments. For example, if the alias is created for a model used in online mode (model usage modes are described in further detail below with respect to FIG. 8), the MLS may guarantee that the model pointed to remains online (i.e., the model cannot be unmounted).
  • a distinction may be drawn between aliases that are currently in production mode and those that are in internal-use or test mode, and the MLS may ensure that the underlying model is not deleted or un-mounted for an alias in production mode.
  • a minimum throughput rate of predictions/evaluations may be determined for the alias, and the MLS may ensure that the resources assigned to the model can meet the minimum throughput rate in some embodiments.
  • when model developers 676 improve the accuracy and/or performance characteristics of a newer version of a model 630 relative to an older version for which an alias 640 has been created, they may switch the pointer of the alias so that it now points to the improved version.
  • alias users may be able to submit a query to learn when the underlying model was last changed, or may be notified when they request an execution of an alias that the underlying model has been changed since the last execution.
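The alias behavior described above (a named pointer to one model version, executable by many users but re-pointable only by developers) can be illustrated with a minimal sketch. The class and method names (`AliasRegistry`, `publish`, `resolve`, `repoint`) and the permission model are assumptions made for illustration; they are not interfaces defined by the MLS.

```python
from dataclasses import dataclass, field


@dataclass
class Alias:
    name: str
    model_id: str                                   # pointer to one model version
    developers: set = field(default_factory=set)    # may re-point the alias
    users: set = field(default_factory=set)         # may only read/execute


class AliasRegistry:
    """Hypothetical in-memory registry of published aliases."""

    def __init__(self):
        self._aliases = {}

    def publish(self, name, model_id, developers, users):
        self._aliases[name] = Alias(name, model_id, set(developers), set(users))

    def resolve(self, name, principal):
        """Return the model the alias points to, if the caller may execute it."""
        alias = self._aliases[name]
        if principal not in alias.users | alias.developers:
            raise PermissionError(f"{principal} may not execute alias {name}")
        return alias.model_id

    def repoint(self, name, new_model_id, principal):
        """Switch the alias to another model version (developers only)."""
        alias = self._aliases[name]
        if principal not in alias.developers:
            raise PermissionError(f"{principal} may not modify alias {name}")
        alias.model_id = new_model_id
```

In this sketch a non-expert user keeps calling `resolve("640A", ...)` with the same alias name, while a developer can move the alias from one model version to an improved one without the user changing anything.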
  • Results of model executions such as predictions 608 (values predicted by a model for a dependent variable in a scenario in which the actual values of the dependent variable are not known) and model evaluations 610 (measures of the accuracy of a model, computed when the predictions of the model can be compared to known values of dependent variables) may also be stored as artifacts by the MLS in some embodiments.
  • although dependent variable values may be assumed to depend upon values of one or more independent variables in at least some types of machine learning techniques, this is not meant to imply that any of the independent variables are necessarily statistically independent of any of the other independent variables.
  • other artifact types may also be supported in some embodiments - e.g., objects representing network endpoints that can be used for real-time model execution on streaming data (as opposed to batch-mode execution on a static set of data) may be stored as artifacts in some embodiments, and client session logs (e.g., recordings of all the interactions between a client and the MLS during a given session) may be stored as artifacts in other embodiments.
  • the MLS may support recurring scheduling of related jobs.
  • a client may create an artifact such as a model, and may want that same model to be re-trained and/or re-executed for different input data sets (e.g., using the same configuration of resources for each of the training or prediction iterations) at specified points in time.
  • the points in time may be specified explicitly (e.g., by the client requesting the equivalent of "re-run model M1 on the currently available data set at data source DS1 at 11:00, 15:00 and 19:00 every day").
  • the client may indicate the conditions under which the iterations are to be scheduled (e.g., by the client requesting the equivalent of "re-run model M1 whenever the next set of 1000000 new records becomes available from data source DS1").
  • a respective job may be placed in the MLS job queue for each recurring training or execution iteration.
  • the MLS may implement a set of programmatic interfaces enabling such scheduled recurring operations in some embodiments.
  • a client may specify a set of model/alias/recipe artifacts (or respective versions of the same underlying artifact) to be used for each of the iterations, and/or the resource configurations to be used.
  • Such programmatic interfaces may be referred to as "pipelining APIs" in some embodiments.
  • pipeline artifacts may be stored in the MLS artifact repository in some embodiments, with each instance of a pipeline artifact representing a named set of recurring operations requested via such APIs.
  • a separately-managed data pipelining service implemented at the provider network may be used in conjunction with the MLS for supporting such recurrent operations.
  • the MLS may automatically generate statistics when a data source is created.
  • FIG. 7 illustrates an example of automated generation of statistics in response to a client request to instantiate a data source, according to at least some embodiments.
  • a client 764 submits a data source creation request 712 to the MLS control plane 780 via an MLS API 761.
  • the creation request may specify an address or location from which data records can be retrieved, and optionally a schema or format document indicating the columns or fields of the data records.
  • the MLS control plane 780 may generate and store a data source artifact 702 in the MLS artifact repository.
  • the MLS may also initiate the generation of one or more statistics objects 730 in the depicted embodiment, even if the client request did not explicitly request such statistics. Any combination of a number of different types of statistics may be generated automatically in one of two modes in various embodiments.
  • an initial set of statistics 763 based on a sub-sample (e.g., a randomly-selected subset of the large data set) may be obtained in a first phase, while the generation of full-sample statistics 764 derived from the entire data set may be deferred to a second phase.
  • a multi-phase approach towards statistics generation may be implemented, for example, to allow the client to get a rough or approximate summary of the data set values fairly rapidly in the first phase, so that the client may begin planning subsequent machine learning workflow steps without waiting for a statistical analysis of the complete data set.
  • basic statistics 765 may include the mean, median, minimum, maximum, and standard deviation.
  • Numeric variables may also be binned (categorized into a set of ranges such as quartiles or quintiles); such bins 767 may be used for the construction of histograms that may be displayed to the client. Depending on the nature of the distribution of the variable, either linear or logarithmic bin boundaries may be selected.
  • correlations 768 between different variables may be computed as well.
  • the MLS may utilize the automatically generated statistics (such as the correlation values) to identify candidate groups 769 of variables that may have greater predictive power than others.
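The multi-phase statistics generation described above can be sketched as a generator that first reports statistics for a random sub-sample and then for the full data set. This is only an illustration under stated assumptions: the function names, the returned dictionary layout, the fixed seed, and the use of the population standard deviation are choices made for the example, not details given in the description.

```python
import random
import statistics


def basic_statistics(values):
    """Basic statistics for one numeric variable; None marks a null value."""
    present = [v for v in values if v is not None]
    return {
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "min": min(present),
        "max": max(present),
        "stdev": statistics.pstdev(present),   # population standard deviation
        "nulls": len(values) - len(present),
    }


def two_phase_statistics(values, sample_size, seed=0):
    """Yield approximate sub-sample statistics first, then full-sample ones."""
    rng = random.Random(seed)
    sample = rng.sample(values, min(sample_size, len(values)))
    yield "sub-sample", basic_statistics(sample)    # fast, approximate
    yield "full-sample", basic_statistics(values)   # deferred, exact
```

A caller can consume the first yielded result to start planning later workflow steps and pick up the second result whenever the full pass finishes.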
  • FIG. 8 illustrates several model usage modes that may be supported at a machine learning service, according to at least some embodiments.
  • Model usage modes may be broadly classified into three categories: batch mode, online or real-time mode, and local mode.
  • in batch mode, a given model may be run on a static set of data records.
  • in real-time mode, a network endpoint (e.g., an IP address) may be assigned as a destination to which input data records for a specified model are to be submitted, and model predictions may be generated on groups of streaming data records as the records are received.
  • in local mode, clients may receive executable representations of a specified model that has been trained and validated at the MLS, and the clients may run the models on computing devices of their choice (e.g., at devices located in client networks rather than in the provider network where the MLS is implemented).
  • a client 164 of the MLS may submit a model execution request 812 to the MLS control plane 180 via a programmatic interface 861.
  • the model execution request may specify the execution mode (batch, online or local), the input data to be used for the model run (which may be produced using a specified data source or recipe in some cases), the type of output (e.g., a prediction or an evaluation) that is desired, and/or optional parameters (such as desired model quality targets, minimum input record group sizes to be used for online predictions, and so on).
  • the MLS may generate a plan for model execution and select the appropriate resources to implement the plan.
  • a job object may be generated upon receiving the execution request 812 as described earlier, indicating any dependencies on other jobs (such as the execution of a recipe for feature processing), and the job may be placed in a queue.
  • one or more servers may be identified to run the model.
  • the model may be mounted (e.g., configured with a network address) to which data records may be streamed, and from which results including predictions 868 and/or evaluations 869 can be retrieved.
  • clients may optionally specify expected workload levels for a model that is to be instantiated in online mode, and the set of provider network resources to be deployed for the model may be selected in accordance with the expected workload level.
  • a client may indicate via a parameter of the model execution/creation request that up to 100 prediction requests per day are expected on data sets of 1 million records each, and the servers selected for the model may be chosen to handle the specified request rate.
  • the MLS may package up an executable local version 843 of the model (where the details of the type of executable that is to be provided, such as the type of byte code or the hardware architecture on which the model is to be run, may have been specified in the execution request 812) and transmit the local model to the client.
  • only a subset of the execution modes illustrated may be supported.
  • not all of the combinations of execution modes and output types may be supported - for example, while predictions may be supported for online mode in one implementation, evaluations may not be supported for online mode.
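As a rough illustration of the point that only some combinations of execution mode and output type may be supported, the following sketch models a model execution request and checks it against a supported set. The type names and the particular supported set (which excludes online evaluations, mirroring the example in the text) are assumptions for the example only.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    BATCH = "batch"
    ONLINE = "online"
    LOCAL = "local"


class Output(Enum):
    PREDICTION = "prediction"
    EVALUATION = "evaluation"


# Hypothetical implementation choice: evaluations are not offered online.
SUPPORTED = {
    (Mode.BATCH, Output.PREDICTION),
    (Mode.BATCH, Output.EVALUATION),
    (Mode.ONLINE, Output.PREDICTION),
    (Mode.LOCAL, Output.PREDICTION),
}


@dataclass(frozen=True)
class ModelExecutionRequest:
    model_id: str
    mode: Mode
    output: Output


def is_supported(request, supported=SUPPORTED):
    """Return True if this mode/output combination can be served."""
    return (request.mode, request.output) in supported
```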
  • FIG. 9a and 9b are flow diagrams illustrating aspects of operations that may be performed at a machine learning service that supports asynchronous scheduling of machine learning jobs, according to at least some embodiments.
  • the MLS may receive a request from a client via a programmatic interface (such as an API, a command-line tool, a web page, or a custom GUI) to perform a particular operation on an entity belonging to a set of supported entity types of the MLS.
  • the entity types may include, for example, data sources, statistics, feature processing recipes, models, aliases, predictions, and/or evaluations in the depicted embodiment.
  • the operations requested may include, for example, create, read (or describe the attributes of), modify/update attributes, execute, search, or delete operations. Not all the operation types may apply to all the entity types in some embodiments - e.g., it may not be possible to "execute" a data source.
  • the request may be encrypted or encapsulated by the client, and the MLS may have to extract the contents of the request using the appropriate keys and/or certificates.
  • the request may next be validated in accordance with various rules or policies of the MLS (element 904). For example, in accordance with a security policy, the permissions, roles or capabilities granted to the requesting client may be checked to ensure that the client is authorized to have the requested operations performed.
  • the syntax of the request itself, and/or objects such as recipes passed as request parameters may be checked for some types of requests. In some cases, the types of one or more data variables indicated in the request may have to be checked as well.
  • a decision may be made as to whether a job object is to be created for the request.
  • the amount of work required may be small enough that the MLS may simply be able to perform the requested operation synchronously or "in-line", instead of creating and inserting a job object into a queue for asynchronous execution (at least in scenarios in which the prerequisites or dependencies of the request have already been met, and sufficient resources are available for the MLS to complete the requested work).
  • a job object may be generated, indicating the nature of the lower-level operations to be performed at the MLS as well as any dependencies on other jobs, and the job object may be placed in a queue (element 913).
  • the requesting client may be notified that the request has been accepted for execution (e.g., by indicating to the client that a job has been queued for later execution). The client may submit another programmatic request without waiting for the queued job to be completed (or even begun) in some cases.
  • the requested operation may be performed without creating a job object (element 910) and the results may optionally be provided to the requesting client.
  • Operations corresponding to elements 901-913 may be performed for each request that is received via the MLS programmatic interface.
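The request-handling flow of FIG. 9a (validate, then either perform the work in-line or queue a job and acknowledge the client) might look roughly like the sketch below. The `RequestHandler` class, the `estimated_work` parameter, and the in-line threshold are hypothetical; the description does not say how the MLS estimates the amount of work.

```python
from collections import deque
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Job:
    job_id: int
    operation: str
    depends_on: list = field(default_factory=list)


class RequestHandler:
    INLINE_THRESHOLD = 10   # hypothetical cutoff, in arbitrary work units

    def __init__(self, authorized_clients):
        self.authorized = set(authorized_clients)
        self.queue = deque()
        self._ids = count(1)

    def handle(self, client, operation, estimated_work, depends_on=()):
        # Validation step (element 904): here only an authorization check.
        if client not in self.authorized:
            return {"status": "rejected", "reason": "not authorized"}
        # Small requests with no unmet dependencies are served in-line (element 910).
        if estimated_work <= self.INLINE_THRESHOLD and not depends_on:
            return {"status": "done", "result": f"{operation} completed in-line"}
        # Otherwise a job object is queued (element 913) and the client is told
        # that the request was accepted, without waiting for completion.
        job = Job(next(self._ids), operation, list(depends_on))
        self.queue.append(job)
        return {"status": "queued", "job_id": job.job_id}
```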
  • a job Jk may be identified (e.g., by a job scheduler component of the MLS control plane) as the next job to be implemented (element 951 of FIG. 9b).
  • the scheduler may, for example, start from the head of the queue (the earliest-inserted job that has not yet been executed) and search for jobs whose dependencies (if any are specified) have been met.
  • the MLS may perform validations at various other stages in some embodiments, e.g., with the general goals of (a) informing clients as soon as possible when a particular request is found to be invalid, and (b) avoiding wastage of MLS resources on requests that are unlikely to succeed.
  • one or more types of validation checks may be performed on the job Jk identified in element 951.
  • each client may have a quota or limit on the resources that can be applied to their jobs (such as a maximum number of servers that can be used concurrently for all of a given customer's jobs, or for any given job of the customer).
  • respective quotas may be set for each of several different resource types - e.g., CPUs/cores, memory, disk, network bandwidth and the like.
  • the job scheduler may be responsible for verifying that the quota or quotas of the client on whose behalf the job Jk is to be run have not been exhausted.
  • a quota may be exhausted until at least some of the client's resources are released (e.g., as a result of a completion of other jobs performed on the same client's behalf).
  • Such constraint limits may be helpful in limiting the ability of any given client to monopolize shared MLS resources, and also in minimizing the negative consequences of inadvertent errors or malicious code.
  • other types of run-time validations may be required for at least some jobs - e.g., data type checking may have to be performed on the input data set for jobs that involve feature processing, or the MLS may have to verify that the input data set size is within acceptable bounds.
  • client requests may be validated synchronously (at the time the request is received, as indicated in element 904 of FIG. 9a) as well as asynchronously (as indicated in element 952 of FIG. 9b) in at least some embodiments.
  • a workload distribution strategy and processing plan may be identified for Jk - e.g., the number of processing passes or phases to be used, the degree of parallelism to be used, an iterative convergence criterion to be used for completing Jk (element 954).
  • a number of additional factors may be taken into account when generating the processing plan in some embodiments, such as client budget constraints (if any), the data durability needs of the client, the performance goals of the client, security needs (such as the need to run third-party code or client-provided code in isolation instead of in multi-tenant mode).
  • a set of resources may be identified for Jk (element 957).
  • the resources (which may include compute servers or clusters, storage devices, and the like) may be selected from the MLS-managed shared pools, for example, and/or from customer-assigned or customer-owned pools. Jk's operations may then be performed on the identified resources (element 960), and the client on whose behalf Jk was created may optionally be notified when the operations complete (or in the event of a failure that prevents completion of the operations).
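The job selection and quota check of FIG. 9b can be sketched as a scan from the head of the queue for the first job whose dependencies are complete and whose client still has quota. The dictionary-based job representation and the single "servers" quota are simplifying assumptions; the text mentions several resource types.

```python
def next_runnable_job(queue, completed, servers_in_use, quotas):
    """Return the earliest queued job that can run now, or None.

    queue          -- jobs in insertion order, each a dict with the keys
                      "id", "client", "depends_on" and "servers" (hypothetical)
    completed      -- set of ids of jobs that have already finished
    servers_in_use -- servers currently used per client
    quotas         -- maximum concurrent servers per client
    """
    for job in queue:
        # Skip jobs whose dependencies have not all completed.
        if not set(job["depends_on"]) <= completed:
            continue
        # Skip jobs that would push the client past its server quota.
        client = job["client"]
        if servers_in_use.get(client, 0) + job["servers"] > quotas.get(client, 0):
            continue
        return job
    return None
```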
  • Some of the types of operations requested by MLS clients may be resource-intensive. For example, ingesting a terabyte-scale data set (e.g., in response to a client request to create a data store) or generating statistics on such a data set may take hours or days, depending on the set of resources deployed and the extent of parallelism used. Given the asynchronous manner in which client requests are handled in at least some embodiments, clients may sometimes end up submitting the same request multiple times. In some cases, such multiple submissions may occur because the client is unaware whether the previous submission was accepted or not (e.g., because the client failed to notice an indication that the previous submission was accepted, or because such an indication was lost).
  • a duplicate request may be received because the client has assumed that since the expected results of completing the requested task have not been provided for a long time, the previous request must have failed. If, in response to such a duplicate submission, the MLS actually schedules another potentially large job, resources may be deployed unnecessarily and the client may in some cases be billed twice for a request that was only intended to be serviced once. Accordingly, in order to avoid such problematic scenarios, in at least one embodiment one or more of the programmatic interfaces supported by the MLS may be designed to be idempotent, such that the re-submission of a duplicate request by the same client does not have negative consequences.
  • FIG. 10a is a flow diagram illustrating aspects of operations that may be performed at a machine learning service at which a set of idempotent programmatic interfaces are supported, according to at least some embodiments.
  • a creation interface (e.g., an API similar to "createDataSource" or "createModel")
  • although idempotency may be especially useful for programmatic interfaces that involve creation of artifacts such as data sources and models, idempotent interfaces may also be supported for other types of operations (e.g., deletes or executes) in various embodiments.
  • a request to create a new instance of an entity type ET1 may be received from a client C1 at the MLS via a programmatic interface such as a particular API.
  • the request may indicate an identifier ID1, selected by the client, which is to be used for the new instance.
  • the client may be required to specify the instance identifier, and the identifier may be used as described below to detect duplicate requests. (Allowing the client to select the identifier may have the additional advantage that a client may be able to assign a more meaningful name to entity instances than a name assigned by the MLS.)
  • the MLS may generate a representation IPR1 of the input parameters included in the client's invocation of the programmatic interface (element 1004). For example, the set of input parameters may be supplied as input to a selected hash function, and the output of the hash function may be saved as IPR1.
  • the MLS repository may store the corresponding instance identifier, input parameter representation, and client identifier (i.e., the identifier of the client that requested the creation of the artifact).
  • the MLS may check, e.g., via a lookup in the artifact repository, whether an instance of entity type ET1, with instance identifier ID1 and client identifier C1 already exists in the repository. If no such instance is found (as detected in element 1007), a new instance of type ET1 with the identifier ID1, input parameter representation IPR1 and client identifier C1 may be inserted into the repository (element 1007).
  • a job object may be added to a job queue to perform additional operations corresponding to the client request, such as reading/ingesting a data set, generating a set of statistics, performing feature processing, executing a model, etc.
  • a success response to the client's request (element 1016) may be generated in the depicted embodiment. (It is noted that the success response may be implicit in some implementations - e.g., the absence of an error message may serve as an implicit indicator of success.)
  • the MLS may check whether the input parameter representation of the pre-existing instance also matches IPR1 (element 1013). If the input parameter representations also match, the MLS may assume that the client's request is a (harmless) duplicate, and no new work needs to be performed. Accordingly, the MLS may also indicate success to the client (either explicitly or implicitly) if such a duplicate request is found (element 1016). Thus, if the client had inadvertently resubmitted the same request, the creation of a new job object and the associated resource usage may be avoided.
  • an indication may be provided to the client that the request, while not being designated as an error, was in fact identified as a duplicate. If the input parameter representation of the pre-existing instance does not match that of the client's request, an error message may be returned to the client (element 1019), e.g., indicating that there is a preexisting instance of the same entity type ET1 with the same identifier.
  • a different approach to duplicate detection may be used, such as the use of a persistent log of client requests, or the use of a signature representing the (request, client) combination.
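The idempotent creation flow of FIG. 10a can be summarized in a short sketch: hash the input parameters, look up the (entity type, instance identifier, client identifier) triple, and return success, duplicate, or error accordingly. The repository class, the use of SHA-256 over canonical JSON as the "selected hash function", and the returned status strings are assumptions made for this illustration.

```python
import hashlib
import json


class ArtifactRepository:
    """Hypothetical in-memory stand-in for the MLS artifact repository."""

    def __init__(self):
        self._instances = {}    # (entity_type, instance_id, client_id) -> IPR
        self.job_queue = []     # follow-up work queued for new instances

    @staticmethod
    def _represent(params):
        # Input parameter representation: hash of a canonical encoding.
        canonical = json.dumps(params, sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def create(self, entity_type, instance_id, client_id, params):
        ipr = self._represent(params)
        key = (entity_type, instance_id, client_id)
        existing = self._instances.get(key)
        if existing is None:
            # No such instance: insert it and queue the associated job.
            self._instances[key] = ipr
            self.job_queue.append(key)
            return "created"
        if existing == ipr:
            # Same identifier and same parameters: harmless duplicate, no new job.
            return "duplicate"
        # Same identifier but different parameters: report an error.
        return "error"
```

Because the lookup key includes the client identifier, two different clients can use the same instance identifier without colliding in this sketch.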
  • FIG. 10b is a flow diagram illustrating aspects of operations that may be performed at a machine learning service to collect and disseminate information about best practices related to different problem domains, according to at least some embodiments.
  • At least some of the artifacts (such as recipes and models) generated at the MLS as a result of client requests may be classified into groups based on problem domains - e.g., some artifacts may be used for financial analysis, others for computer vision applications, others for bioinformatics, and so on. Such classification may be performed based on various factors in different embodiments - e.g. based on the types of algorithms used, the names of input and output variables, customer-provided information, the identities of the customers, and so on.
  • the MLS control plane may comprise a set of monitoring agents that collect performance and other metrics from the resources used for the various phases of machine learning operations (element 1054). For example, the amount of processing time it takes to build N trees of a random forest using a server with a CPU rating of C1 and a memory size of M1 may be collected as a metric, or the amount of time it takes to compute a set of statistics as a function of the number of data attributes examined from a data source at a database service may be collected as a metric.
  • the MLS may also collect ratings/rankings or other types of feedback from MLS clients regarding the effectiveness or quality of various approaches or models for the different problem domains.
  • ROC area under receiver operating characteristic
  • respective sets of best practices for various phases of machine learning workflows may be identified (element 1057). Some of the best practices may be specific to particular problem domains, while others may be more generally applicable, and may therefore be used across problem domains. Representations or summaries of the best practices identified may be stored in a knowledge base of the MLS. Access (e.g., via a browser or a search tool) to the knowledge base may be provided to MLS users (element 1060).
  • the MLS may also incorporate the best practices into the programmatic interfaces exposed to users - e.g., by introducing new APIs that are more likely to lead users to utilize best practices, by selecting default parameters based on best practices, by changing the order in which parameter choices in a drop-down menu are presented so that the choices associated with best practices become more likely to be selected, and so on.
  • the MLS may provide a variety of tools and/or templates that can help clients to achieve their machine learning goals. For example, a web-based rich text editor or installable integrated development environment (IDE) may be provided by the MLS, which provides templates and development guidance such as automated syntax error correction for recipes, models and the like.
  • the MLS may provide users with candidate models or examples that have proved useful in the past (e.g., for other clients solving similar problems).
  • the MLS may also maintain a history of the operations performed by a client (or by a set of users associated with the same customer account) across multiple interaction sessions in some implementations, enabling a client to easily experiment with or employ artifacts that the same client generated earlier.
  • FIG. 11 illustrates example interactions associated with the use of recipes for data transformations at a machine learning service, according to at least some embodiments.
  • a recipe language defined by the MLS enables users to easily and concisely specify transformations to be performed on specified sets of data records to prepare the records for use for model training and prediction.
  • the recipe language may enable users to create customized groups of variables to which one or more transformations are to be applied, define intermediate variables and dependencies upon other artifacts, and so on, as described below in further detail.
  • raw data records may first be extracted from a data source (e.g., by input record handlers such as those shown in FIG. 1 with the help of an MLS I/O library), processed in accordance with one or more recipes, and then used as input for training or prediction.
  • the recipe may itself incorporate the training and/or prediction steps (e.g., a destination model or models may be specified within the recipe).
  • Recipes may be applied either to data records that have already been split into training and test subsets, or to the entire data set prior to splitting into training and test subsets.
  • a given recipe may be re-used on several different data sets, potentially for a variety of different machine learning problem domains, in at least some embodiments.
  • the recipe management components of the MLS may enable the generation of easy-to-understand compound models (in which the output of one model may be used as the input for another, or in which iterative predictions can be performed) as well as the sharing and re-use of best practices for data transformations.
  • a pipeline of successive transformations to be performed starting with a given input data set may be indicated within a single recipe.
  • the MLS may perform parameter optimization for one or more recipes - e.g., the MLS may automatically vary such transformation properties as the sizes of quantile bins or the number of root words to be included in an n-gram in an attempt to identify a more useful set of input variables to be used for a particular machine learning algorithm.
  • a text version 1101 of a transformation recipe may be passed as a parameter in a "createRecipe" MLS API call by a client.
  • a recipe validator 1104 may check the text version 1101 of the recipe for lexical correctness, e.g., to ensure that it complies with a grammar 1151 defined by the MLS in the depicted embodiment, and that the recipe comprises one or more sections arranged in a predefined order (an example of the expected structure of a recipe is illustrated in FIG. 12 and described below).
  • the version of the recipe received by the MLS need not necessarily be a text version; instead, for example, a pre-processed or partially-combined version (which may in some cases be in a binary format rather than in plain text) may be provided by the client.
  • the MLS may provide a tool that can be used to prepare recipes - e.g., in the form of a web-based recipe editing tool or a downloadable integrated development environment (IDE).
  • Such a recipe preparation tool may, for example, provide syntax and/or parameter selection guidance, correct syntax errors automatically, and/or perform at least some level of preprocessing on the recipe text on the client side before the recipe (either in text form or binary form) is sent to the MLS service.
  • the recipe may use a number of different transformation functions or methods defined in one or more libraries 1152, such as functions to form Cartesian products of variables, n-grams (for text data), quantile bins (for numeric data variables), and the like.
  • the libraries used for recipe validation may include third-party or client-provided functions or libraries in at least some embodiments, representing custom feature processing extensions that have been incorporated into the MLS to enhance the service's core or natively-supported feature processing capabilities.
  • the recipe validator 1104 may also be responsible for verifying that the functions invoked in the text version 1101 are (a) among the supported functions of the library 1152 and (b) used with the appropriate signatures (e.g., that the input parameters of the functions match the types and sequences of the parameters specified in the library).
  • MLS customers may register additional functions as part of the library, e.g., so that custom "user-defined functions" (UDFs) can also be included in the recipes.
  • Customers that wish to utilize UDFs may be required to provide an indication of a module that can be used to implement the UDFs (e.g., in the form of source code, executable code, or a reference to a third-party entity from which the source or executable versions of the module can be obtained by the MLS) in some embodiments.
  • a number of different programming languages and/or execution environments may be supported for UDFs in some implementations, e.g., including Java™, Python, and the like.
  • the text version of the recipe may be converted into an executable version 1107 in the depicted embodiment.
  • the recipe validator 1104 may be considered analogous to a compiler for the recipe language, with the text version of the recipe analogous to source code and the executable version analogous to the compiled binary or byte code derived from the source code.
  • the executable version may also be referred to as a feature processing plan in some embodiments.
  • both the text version 1101 and the executable version 1107 of a recipe may be stored within the MLS artifact repository 120.
  • a run-time recipe manager 1110 of the MLS may be responsible for the scheduling of recipe executions in some embodiments, e.g., in response to the equivalent of an "executeRecipe" API specifying an input data set.
  • two execution requests 1171A and 1171B for the same recipe R1 are shown, with respective input data sets IDS1 and IDS2.
  • the input data sets may comprise data records whose variables may include instances of any of a variety of data types, such as, for example, text, a numeric data type (e.g., real or integer), Boolean, a binary data type, a categorical data type, an image processing data type, an audio processing data type, a bioinformatics data type, a structured data type such as a particular data type compliant with the Unstructured Information Management Architecture (UIMA), and so on.
  • the run-time recipe manager 1110 may retrieve (or generate) the executable version of R1, perform a set of run-time validations (e.g., to ensure that the requester is permitted to execute the recipe, that the input data appears to be in the correct or expected format, and so on), and eventually schedule the execution of the transformation operations of R1 at respective resource sets 1175A and 1175B.
  • the specific libraries or functions to be used for the transformation may be selected based on the data types of the input records - e.g., instances of a particular structured data type may have to be handled using functions or methods of a corresponding library defined for that data type.
  • Respective outputs 1185A and 1185B may be produced by the application of the recipe R1 on IDS1 and IDS2 in the depicted embodiment.
  • the outputs 1185A may represent either data that is to be used as input for a model, or a result of a model (such as a prediction or evaluation).
  • a recipe may be applied asynchronously with respect to the execution request - e.g., as described earlier, a job object may be inserted into a job queue in response to the execution request, and the execution may be scheduled later.
  • the execution of a recipe may be dependent on other jobs in some cases - e.g., upon the completion of jobs associated with input record handling (decryption, decompression, splitting of the data set into training and test sets, etc.).
  • the validation and/or compilation of a text recipe may also or instead be managed using asynchronously-scheduled jobs.
  • a client request that specifies a recipe in text format and also includes a request to execute the recipe on a specified data set may be received - that is, the static analysis steps and the execution steps shown in FIG. 11 may not necessarily require separate client requests.
  • a client may simply indicate an existing recipe to be executed on a data set, selected for example from a recipe collection exposed programmatically by the MLS, and may not even have to generate a text version of a recipe.
  • the recipe management components of the MLS may examine the set of input data variables, and/or the outputs of the transformations indicated in a recipe, automatically identify groups of variables or outputs that may have a higher predictive capability than others, and provide an indication of such groups to the client.
  • FIG. 12 illustrates example sections of a recipe, according to at least some embodiments.
  • the text of a recipe 1200 may comprise four separate sections - a group definitions section 1201, an assignments section 1204, a dependencies section 1207, and an output/destination section 1210.
  • only the output/destination section may be mandatory; in other implementations, other combinations of the sections may also or instead be mandatory.
  • the sections may have to be arranged in a specified order.
  • a destination model i.e., a machine learning model to which the output of the recipe transformations is to be provided
  • clients may define groups of input data variables, e.g., to make it easier to indicate further on in the recipe that the same transformation operation is to be applied to all the member variables of a group.
  • the recipe language may define a set of baseline groups, such as ALL_INPUT (comprising all the variables in the input data set), ALL_TEXT (all the text variables in the data set), ALL_NUMERIC (all integer and real valued variables in the data set), ALL_CATEGORICAL (all the categorical variables in the data set) and ALL_BOOLEAN (all the Boolean variables in the data set, e.g., variables that can only have the values "true" or "false" (which may be represented as "1" and "0" respectively in some implementations)).
  • the recipe language may allow users to change or "cast" the types of some variables when defining groups - e.g., variables that appear to comprise arbitrary text but are expected to have only a discrete set of values, such as the names of the months of the year, the days of the week, or the states of a country, may be converted to categorical variables instead of being treated as generic text variables.
  • the methods/functions "group" and "group_remove" may be used to combine or exclude variables when defining new groups.
  • a given group definition may refer to another group definition in at least some embodiments.
  • LONGTEXT comprises all the text variables in the input data, except for variables called "title" and "subject".
  • SPECIAL_TEXT includes the text variables "subject" and "title".
  • BOOLCAT includes all the Boolean and categorical variables in the input data. It is noted that at least in some embodiments, the example group definitions shown may be applied to any data set, even if the data set does not contain a "subject" variable, a "title" variable, any Boolean variables, any categorical variables, or even any text variables. If there are no text variables in an input data set, for example, both LONGTEXT and SPECIAL_TEXT would be empty groups with no members with respect to that particular input data set in such an embodiment.
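  • The group definitions above can be sketched in Python as plain set operations. This is only an illustrative sketch, not the MLS implementation: the schema dictionary, the type labels, and the helper names baseline_group and define_groups are hypothetical, and "group" and "group_remove" are modeled simply as set union and set difference.

```python
# Hypothetical schema: variable name -> type label (not part of any MLS API).
SCHEMA = {
    "title": "text",
    "subject": "text",
    "body": "text",
    "age": "numeric",
    "is_member": "boolean",
    "country": "categorical",
}


def baseline_group(schema, type_name):
    """Return all variables of one type, e.g. the members of ALL_TEXT."""
    return {name for name, kind in schema.items() if kind == type_name}


def group(*members):
    """Combine groups (sets) and/or single variable names into one group."""
    result = set()
    for member in members:
        result |= member if isinstance(member, (set, frozenset)) else {member}
    return result


def group_remove(base, *excluded):
    """Exclude variables (or whole groups) from a base group."""
    return set(base) - group(*excluded)


def define_groups(schema):
    all_text = baseline_group(schema, "text")
    return {
        "LONGTEXT": group_remove(all_text, "title", "subject"),
        # Intersecting with ALL_TEXT makes the group empty when the data set
        # has no such text variables, as described above.
        "SPECIAL_TEXT": group("subject", "title") & all_text,
        "BOOLCAT": group(
            baseline_group(schema, "boolean"),
            baseline_group(schema, "categorical"),
        ),
    }
```

  • Because each group is computed from whatever variables the data set actually contains, the same definitions can be applied to a data set that lacks text, Boolean, or categorical variables; the affected groups are then simply empty.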
  • a variable called "binage" is defined in terms of a "quantile_bin" function (which is assumed to be included among the pre-defined library functions of the recipe language in the depicted embodiment) applied to an "age" variable in the input data, with a bin count of "30".
  • a variable called "countrygender" is defined as a Cartesian product of two other variables "country" and "gender" of the input data set, with the "cartesian" function assumed to be part of the pre-defined library.
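  • As a rough illustration of what such assignments compute, the following Python sketch gives one possible reading of the two functions. The boundary-selection rule, the "_" separator, and the list-per-variable data layout are assumptions made for this example; the source does not specify how the library functions are implemented.

```python
from bisect import bisect_right


def quantile_bin(values, bin_count):
    """Map each numeric value to a bin index in 0..bin_count-1.

    Bin boundaries are taken from the sorted values so that the bins hold
    roughly equal numbers of records; equal values share a bin.
    """
    if bin_count < 1:
        raise ValueError("bin_count must be at least 1")
    if not values:
        return []
    ordered = sorted(values)
    n = len(ordered)
    boundaries = [ordered[(k * n) // bin_count] for k in range(1, bin_count)]
    return [bisect_right(boundaries, v) for v in values]


def cartesian(first, second):
    """Combine two variables record by record into one crossed variable."""
    return [f"{a}_{b}" for a, b in zip(first, second)]


ages = [23, 35, 47, 59, 61, 18, 30, 42, 54, 66]
binage = quantile_bin(ages, 2)
countrygender = cartesian(["US", "DE"], ["F", "M"])
```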
  • a user may indicate other artifacts (such as the model referenced as "clustermodel" in the illustrated example, with the MLS artifact identifier "pr-23872-28347-alksdj") upon which the recipe depends.
  • the output of a model that is referenced in the dependencies section of the recipe may be ingested as the input of the recipe, or a portion of the output of the referenced model may be included in the output of the recipe.
  • the dependencies section may, for example, be used by the MLS job scheduler when scheduling recipe-based jobs in the depicted embodiment.
  • Dependencies on any of a variety of artifacts may be indicated in a given recipe in different embodiments, including other recipes, aliases, statistics sets, and so on.
  • a number of transformations are applied to input data variables, groups of variables, intermediate variables defined in earlier sections of the recipe, or the output of an artifact identified in the dependencies section.
  • the transformed data is provided as input to a different model identified as "model1".
  • a term frequency-inverse document frequency (tfidf) statistic is obtained for the variables included in the LONGTEXT group, after punctuation is removed (via the "nopunct" function) and the text of the variables is converted to lowercase (by the "lowercase" function).
  • the tfidf measure may be intended to reflect the relative importance of words within a document in a collection or corpus; the tfidf value for a given word typically is proportional to the number of occurrences of the word in a document, offset by the frequency of the word in the collection as a whole.
  • the tfidf, nopunct and lowercase functions are all assumed to be defined in the recipe language's library.
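  • The chain of transformations described above can be sketched as follows. This is a minimal sketch under stated assumptions: it uses one common tf-idf variant (raw term count multiplied by log(N/df)), whitespace tokenization, and ASCII punctuation removal, none of which is specified by the source; the function names merely mirror the recipe function names.

```python
import math
import string
from collections import Counter


def nopunct(text):
    """Remove ASCII punctuation characters."""
    return text.translate(str.maketrans("", "", string.punctuation))


def lowercase(text):
    return text.lower()


def tfidf(documents):
    """Return one {word: score} dict per document.

    score = (occurrences in the document) * log(N / document frequency),
    so words that appear in every document score 0.
    """
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append(
            {w: c * math.log(n_docs / doc_freq[w]) for w, c in counts.items()}
        )
    return scores


docs = ["The cat sat.", "The dog sat!", "The cat ran"]
scores = tfidf([lowercase(nopunct(d)) for d in docs])
```

  • In this example "the" occurs in all three documents and therefore receives a score of zero, while "dog", which occurs in only one document, receives the highest score in its document.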
  • other transformations indicated in the output section use the osb (orthogonal sparse bigrams) library function, the quantile_bin library function for binning or grouping numeric values, and the Cartesian product function.
  • Some of the outputs indicated in section 1210 may not necessarily involve transformations per se: e.g., the BOOLCAT group's variables in the input data set may simply be included in the output, and the "clusterNum" output variable of "clustermodel" may be included without any change in the output of the recipe as well.
  • the entries listed in the output section may be used to implicitly discard those input data variables that are not listed.
  • If the input data set includes a "taxable-income" numeric variable, it may simply be discarded in the illustrated example since it is not directly or indirectly referred to in the output section.
  • the recipe syntax and section-by-section organization shown in FIG. 12 may differ from those of other embodiments.
  • a wide variety of functions and transformation types (at least some of which may differ from the specific examples shown in FIG. 12) may be supported in different embodiments. For example, date/time related functions "dayofweek", "hourofday", "month", etc. may be supported in the recipe language in some embodiments.
  • Mathematical functions such as "sqrt" (square root), "log" (logarithm) and the like may be supported in at least one embodiment. Functions to normalize numeric values (e.g., map values from a range {-N1 to +N2} into a range {0 to 1}), or to fill in missing values (e.g., "replace_missing_with_mean(ALL_NUMERIC)") may be supported in some embodiments. Multiple references within a single expression to one or more previously-defined group variables, intermediate variables, or dependencies may be allowed in one embodiment: e.g., the recipe fragment "replace_missing(ALL_NUMERIC, mean(ALL_NUMERIC))" may be considered valid. Mathematical expressions involving combinations of variables such as "'income' + 10*'capital_gains'" may also be permitted within recipes in at least some embodiments. Comments may be indicated by delimiters such as "//" in some recipes.
  • FIG. 13 illustrates an example grammar that may be used to define acceptable recipe syntax, according to at least some embodiments.
  • the grammar shown may be formatted in accordance with the requirements of a parser generator such as a version of ANTLR (ANother Tool for Language Recognition).
  • the grammar 1320 defines rules for the syntax of expressions used within a recipe.
  • a tool such as ANTLR may generate a parser that can build an abstract syntax tree from a text version of a recipe, and the abstract syntax tree may then be converted into a processing plan by the MLS control plane.
  • An example tree generated using the grammar 1320 is shown in FIG. 14.
  • an expression "expr" can be one of a "BAREID", a "QUOTEDID", a "NUMBER" or a "functioncall", with each of the latter four entities defined further down in the grammar.
  • a BAREID starts with an upper case or lower case letter and can include numerals.
  • a QUOTEDID can comprise any text within single quotes.
  • NUMBERS comprise real numeric values with or without exponents, as well as integers.
  • a functioncall must include a function name (a BAREID) followed by zero or more parameters within round brackets. Whitespace and comments are ignored when generating an abstract syntax tree in accordance with the grammar 1320, as indicated by the lines ending in " -> skip".
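  • The expression rules described above can be approximated with a small hand-written recursive-descent parser. The sketch below is not the ANTLR grammar 1320 itself (which is not reproduced in this text); the token patterns, the tuple-based tree shape, and the decision to allow underscores in a BAREID are assumptions made for illustration.

```python
import re

# Token patterns approximating the described rules (assumed, not grammar 1320).
TOKEN_RE = re.compile(r"""
    (?P<SKIP>\s+|//[^\n]*)                          # whitespace and comments
  | (?P<NUMBER>-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)    # integers and reals
  | (?P<BAREID>[A-Za-z][A-Za-z0-9_]*)               # starts with a letter
  | (?P<QUOTEDID>'[^']*')                           # any text in single quotes
  | (?P<PUNCT>[(),])
""", re.VERBOSE)


def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at offset {pos}")
        if match.lastgroup != "SKIP":  # skipped tokens never reach the tree
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


def parse_expr(tokens, i=0):
    """Parse one expr starting at tokens[i]; return (tree, next_index)."""
    if i >= len(tokens):
        raise SyntaxError("unexpected end of input")
    kind, value = tokens[i]
    if kind == "NUMBER":
        return ("number", float(value)), i + 1
    if kind == "QUOTEDID":
        return ("quoted_id", value[1:-1]), i + 1
    if kind == "BAREID":
        if i + 1 < len(tokens) and tokens[i + 1] == ("PUNCT", "("):
            return parse_call(value, tokens, i + 2)
        return ("bare_id", value), i + 1
    raise SyntaxError(f"unexpected token {value!r}")


def parse_call(name, tokens, i):
    """Parse zero or more comma-separated parameters up to the closing ')'."""
    args = []
    if i < len(tokens) and tokens[i] == ("PUNCT", ")"):
        return ("call", name, args), i + 1
    while True:
        arg, i = parse_expr(tokens, i)
        args.append(arg)
        if i >= len(tokens):
            raise SyntaxError("missing ')'")
        if tokens[i] == ("PUNCT", ")"):
            return ("call", name, args), i + 1
        if tokens[i] != ("PUNCT", ","):
            raise SyntaxError(f"expected ',' or ')' but found {tokens[i][1]!r}")
        i += 1


def parse(text):
    tokens = tokenize(text)
    tree, i = parse_expr(tokens)
    if i != len(tokens):
        raise SyntaxError("unexpected trailing tokens")
    return tree
```

  • For instance, parse("cartesian(binage, quantile_bin('hours-per-week', 10))") yields a nested "call" tuple whose second parameter is itself a "call" node, which is the kind of tree structure an abstract syntax tree captures.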
  • FIG. 14 illustrates an example of an abstract syntax tree that may be generated for a portion of a recipe, according to at least some embodiments.
  • the example recipe fragment 1410 comprising the text "cartesian(binage, quantile_bin('hours-per-week', 10))" may be translated into abstract syntax tree 1420 in accordance with grammar 1320 (or some other similar grammar) in the depicted embodiment.
  • recipe validator 1104 may ensure that the number and order of the parameters passed to "cartesian" and "quantile_bin" match the definitions of those functions, and that the variables "binage" and "hours_per_week" are defined within the recipe. If any of these conditions are not met, an error message indicating the line number within the recipe at which the "cartesian" fragment is located may be provided to the client that submitted the recipe. Assuming that no validation errors are found in the recipe as a whole, an executable version of the recipe may be generated, of which a portion 1430 may represent the fragment 1410.
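  • A simplified version of such checks might walk the syntax tree and collect error messages. In the sketch below the tree is a hand-built nested tuple, the LIBRARY table (function name to parameter count) is hypothetical, and only the parameter count and the definedness of bare identifiers are checked; parameter types, parameter order, and line-number reporting are omitted.

```python
# Hypothetical library table: function name -> expected number of parameters.
LIBRARY = {"cartesian": 2, "quantile_bin": 2, "lowercase": 1}


def validate(node, defined_vars, library=LIBRARY):
    """Return a list of error strings for one expression tree."""
    errors = []
    kind = node[0]
    if kind == "bare_id":
        if node[1] not in defined_vars:
            errors.append(f"undefined variable: {node[1]}")
    elif kind == "call":
        _, name, args = node
        if name not in library:
            errors.append(f"unsupported function: {name}")
        elif len(args) != library[name]:
            errors.append(
                f"{name} expects {library[name]} parameter(s), got {len(args)}"
            )
        for arg in args:  # validate nested expressions as well
            errors.extend(validate(arg, defined_vars, library))
    return errors


fragment = (
    "call",
    "cartesian",
    [
        ("bare_id", "binage"),
        ("call", "quantile_bin", [("quoted_id", "hours-per-week"), ("number", 10)]),
    ],
)
```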
  • FIG. 15 illustrates an example of a programmatic interface that may be used to search for domain-specific recipes available from a machine learning service, according to at least some embodiments.
  • a web page 1501 may be implemented for a recipe search, which includes a message area 1504 providing high-level guidance to MLS users, and a number of problem domains for which recipes are available.
  • an MLS customer can use a check-box to select from among the problem domains fraud detection 1507, sentiment analysis 1509, image analysis 1511, genome analysis 1513, or voice recognition 1515.
  • a user may also search for recipes associated with other problem domains using search term text block 1517 in the depicted web page.
  • recipes FR1 and FR2 for facial recognition, BTR1 for brain tumor recognition, ODA1 for ocean debris recognition, and AED1 for astronomical event detection. Additional details regarding a given recipe may be obtained by the user by clicking on the recipe's name: for example, in some embodiments, a description of what the recipe does, ratings/rankings of the recipe submitted by other users, comments submitted by other users on the recipe, and so on may be provided.
  • If a user finds a recipe that they wish to use (either unchanged or after modifying the recipe), they may be able to download the text version of the recipe, e.g., for inclusion in a subsequent MLS API invocation.
  • users may also be able to submit their own recipes for inclusion in the collection exposed by the MLS in the depicted embodiment.
  • the MLS may perform some set of validation steps on a submitted recipe (e.g., by checking that the recipe produces meaningful output for various input data sets) before allowing other users access.
  • parameters may typically have to be selected, such as the sizes/boundaries of the bins, the lengths of the ngrams, the removal criteria for sparse words, and so on.
  • the values of such parameters may have a significant impact on the predictions that are made using the recipe outputs.
  • the MLS may support automated parameter exploration.
  • FIG. 16 illustrates an example of a machine learning service that automatically explores a range of parameter settings for recipe transformations on behalf of a client, and selects acceptable or recommended parameter settings based on results of such explorations, according to at least some embodiments.
  • an MLS client 164 may submit a recipe execution request 1601 that includes parameter auto-tune settings 1606.
  • the client 164 may indicate that the bin sizes/boundaries for quantile binning of one or more variables in the input data should be chosen by the service, or that the number of words in an n-gram should be chosen by the service.
  • Parameter exploration and/or auto-tuning may be requested for various clustering-related parameters in some embodiments, such as the number of clusters into which a given data set should be classified, the cluster boundary thresholds (e.g., how far apart two geographical locations can be to be considered part of a set of "nearby" locations), and so on.
  • Various types of image processing parameter settings may be candidates for automated tuning in some embodiments, such as the extent to which a given image should be cropped, rotated, or scaled during feature processing.
  • Automated parameter exploration may also be used for selecting dimensionality values for a vector representation of a text document (e.g., in accordance with the Latent Dirichlet Allocation (LDA) technique) or other natural language processing techniques.
  • the client may also indicate the criteria to be used to terminate exploration of the parameter value space, e.g., to arrive at acceptable parameter values.
  • the client may be given the option of letting the MLS decide the acceptance criteria to be used - such an option may be particularly useful for non-expert users.
  • the client may indicate limits on resources or execution time for parameter exploration.
  • the default setting for an auto-tune setting for at least some output transformations may be "true", e.g., a client may have to explicitly indicate that auto-tuning is not to be performed in order to prevent the MLS from exploring the parameter space for the transformations.
  • the MLS may select a parameter tuning range 1654 for the transformation (e.g., whether the quantile bin counts of 10, 20, 30 and 40 should be explored for a particular numeric variable).
  • the parameter ranges may be selected based on a variety of factors in different embodiments, including best practices known to the MLS for similar transformations, resource constraints, the size of the input data set, and so on.
  • the parameter explorer 1642 may select a respective set of values for each parameter so as to keep the number of combinations that are to be tried below a threshold. Having determined the range of parameter values, the parameter explorer may execute iterations of transformations for each parameter value or combination, storing the iteration results 1656 in at least some implementations in temporary storage. Based on the result sets generated for the different parameter values and the optimization criteria being used, at least one parameter value may be identified as acceptable for each parameter.
  • a results notification 1667 may be provided to the client, indicating the accepted or recommended parameter value or values 1668 for the different parameters being auto-tuned.
  • the MLS may instead identify a set of candidate values {V1, V2, V3, ..., Vn} for a given parameter P, such that all the values of the set provide results of similar quality.
  • the set of candidate values may be provided to the client, enabling the client to choose the specific parameter value to be used, and the client may notify the MLS regarding the selected parameter value.
  • the client may only be provided with an indication of the results of the recipe transformations obtained using the accepted/optimized parameter values, without necessarily being informed about the parameter value settings used.
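  • The exploration loop described above can be sketched as a bounded grid search. Everything named here is hypothetical: the evaluate callback stands in for running the transformation and scoring its result, max_combinations plays the role of the threshold on the number of combinations, and tolerance is one possible way to decide which values give results of similar quality.

```python
from itertools import product


def choose_values(candidates, max_combinations):
    """Trim candidate lists until the number of combinations fits the budget."""
    trimmed = {name: list(values) for name, values in candidates.items()}

    def grid_size():
        size = 1
        for values in trimmed.values():
            size *= len(values)
        return size

    while grid_size() > max_combinations:
        longest = max(trimmed, key=lambda name: len(trimmed[name]))
        if len(trimmed[longest]) == 1:
            break  # cannot shrink any further
        trimmed[longest].pop()
    return trimmed


def explore(candidates, evaluate, max_combinations=16, tolerance=0.0):
    """Score every combination and return the settings within `tolerance`
    of the best score (higher scores are treated as better)."""
    grid = choose_values(candidates, max_combinations)
    names = list(grid)
    results = []
    for combo in product(*(grid[name] for name in names)):
        settings = dict(zip(names, combo))
        results.append((evaluate(settings), settings))
    best = max(score for score, _ in results)
    return [settings for score, settings in results if best - score <= tolerance]


def toy_score(settings):
    # Stand-in quality metric that happens to prefer 30 bins.
    return -abs(settings["bin_count"] - 30)


recommended = explore({"bin_count": [10, 20, 30, 40]}, toy_score)
similar = explore({"bin_count": [10, 20, 30, 40]}, toy_score, tolerance=10)
```

  • With a tolerance of zero a single recommended value is returned; a larger tolerance returns a set of candidate values of similar quality from which a client could choose.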
  • FIG. 17 is a flow diagram illustrating aspects of operations that may be performed at a machine learning service that supports re-usable recipes for data set transformations, according to at least some embodiments.
  • an indication of a text version of a recipe for transformation operations to be performed on input data sets may be received at a network-accessible MLS implemented at a provider network.
  • the recipe text may include one or more of four sections in accordance with a recipe language defined by the MLS: a group definitions section, an assignment section, a dependency section, and an output/destination section (which may also be referred to simply as the output section).
  • one or more sections may be mandatory.
  • the output/destination section may indicate various feature processing transformation operations that are to be performed on entities defined in other sections of the recipe, or directly on input variables of a data set.
  • the group definitions section may be used to define custom groups of input variables (or input data variables combined with other groups, or groups derived from other groups). Such group definitions may make it easier to specify in the output section that a common transformation is to be applied to several variables.
  • a number of built-in or predefined groups may be supported by the recipe language in some embodiments, such as ALL_NUMERIC or ALL_CATEGORICAL, along with functions such as "group_remove" and "group" to allow recipe creators to easily indicate variable exclusions and combinations to be used when defining new groups.
  • the assignment section may be used to define one or more intermediate variables that can be used elsewhere in the recipe.
  • the dependency section may indicate that the recipe depends on another machine learning artifact (such as a model, or another recipe) or on multiple other artifacts stored in an MLS's repository.
  • the output section may indicate not just the specific transformations to be applied to specified input variables, defined groups, intermediate variables or output of the artifacts indicated in the dependency section, but also the destination models to which the transformation results are to be provided as input.
  • the machine learning service may natively support libraries comprising a variety of different transformation operations that can be used in the recipe's output section, such as the types of functions illustrated in FIG. 12.
  • several different libraries may be supported by the MLS.
  • MLS customers may be able to register their own custom functions (called "user-defined functions" or UDFs), third-party functions, or libraries comprising multiple UDFs or third-party functions with the MLS to extend the core feature processing capabilities of the MLS.
  • UDFs may be provided to the MLS by clients in a variety of different formats (e.g., including one or more text formats and/or one or more binary formats) in some embodiments.
  • a number of different programming or scripting languages may be supported for UDFs in such embodiments.
  • An API for registering externally-produced transformation functions or libraries with the MLS may be supported in some embodiments, e.g., enabling a client to indicate whether the newly-registered functions are to be made accessible to other clients or restricted for use by the submitting client.
  • a recipe may comprise an import section in which one or more libraries (e.g., libraries other than a core or standard library of the MLS) whose functions are used in the recipe may be listed.
  • the MLS may impose resource usage restrictions on at least some UDFs - e.g., to prevent runaway consumption of CPU time, memory, disk space and the like, a maximum limit may be set on the time that a given UDF can run. In this way, the negative consequences of executing potentially error-prone UDFs (e.g., a UDF whose logic comprises an infinite loop under certain conditions) may be limited.
  • the recipe text (or a file or URL from which the recipe text can be read) may be passed as a parameter in an API (such as a "createRecipe" API) invoked by an MLS client.
  • the recipe text may be validated at the MLS, e.g., in accordance with a set of syntax rules of a grammar and a set of libraries that define supported transformation methods or functions (element 1704). If syntax errors or unresolvable tokens are identified during the text validation checks, in at least some embodiments error messages that indicate the portion of the text that needs to be corrected (e.g., by indicating the line number and/or the error-inducing tokens) may be provided to the recipe submitter. If no errors are found, or after the errors found are corrected and the recipe is re-submitted, an executable version of the recipe text may be generated (element 1707).
  • One or both versions of the recipe may be stored in an artifact repository of the MLS in the depicted embodiment, e.g., with a unique recipe identifier generated by the MLS being provided to the recipe submitter.
  • the MLS may determine, e.g., in response to a different API invocation or because the initial submission of the recipe included an execution request, that the recipe is to be applied to a particular data set (element 1710).
  • the data set may be checked to ensure that it meets runtime acceptance criteria, e.g., that the input variable names and data types match those indicated in the recipe, and that the data set is of an acceptable size (element 1713).
  • a set of provider network resources e.g., one or more compute servers, configured with appropriate amounts of storage and/or network capacity as determined by the MLS
  • the transformations indicated in the recipe may then be applied to the input data set (element 1719).
  • the MLS may perform parameter explorations in an effort to identify acceptable parameter values for one or more of the transformations.
  • a notification that the recipe's execution is complete may be provided to the client that requested the execution (element 1722) in the depicted embodiment.
  • some machine learning input data sets can be much larger (e.g., on the order of terabytes) than the amount of memory that may be available at any given server of a machine learning service.
  • a number of filtering or input record rearrangement operations may sometimes have to be performed in a sequence on an input data set.
  • the same input data set may have to be split into training and test data sets multiple times, and such split operations may be considered one example of input filtering.
  • Other input filtering operation types may include sampling (obtaining a subset of the data set), shuffling (rearranging the order of the input data objects), or partitioning for parallelism (e.g., dividing a data set into N subsets for a computation implemented using map-reduce or a similar parallel computing paradigm, or for performing multiple parallel training operations for a model). If a data set that takes up several terabytes of space were to be read from and/or written to persistent storage for each filtering operation (such as successive shuffles or splits), the time taken for just the I/O operations alone may become prohibitive, especially if a large fraction of the I/O comprised random reads of individual observation records of the input data set from rotating disk-based storage devices.
  • a technique of mapping large data sets into smaller contiguous chunks that are read once into some number of servers' memories, and then performing sequences of chunk-level filtering operations in place without copying the data set to persistent storage between successive filtering operations may be implemented at a machine learning service.
  • an I/O library may be implemented by the machine learning service, enabling a client to specify, via a single invocation of a data-source-agnostic API, a variety of input filtering operations to be performed on a specified data set.
  • Such a library may be especially useful in scenarios in which the input data sets comprise varying-length observation records stored in files within file system directories rather than in structured database objects such as tables, although the chunking and in-memory filtering technique described below may in general be performed for any of a variety of data source types (including databases) as described below.
  • the I/O library may allow clients to indicate data sources of various types (e.g., single-host file systems, distributed file systems, storage services implemented at a provider network, non-relational databases, relational databases, and so on), and may be considered data-source-agnostic in that the same types of filtering operations may be supported regardless of the type of data source being used. In some cases, respective subsets of a given input data set may be stored in different types of data sources.
  • FIG. 18 illustrates an example procedure for performing efficient in-memory filtering operations on a large input data set by a machine learning service (MLS), according to at least some embodiments.
  • a data source 1802 from which a client of the machine learning service wishes to extract observation records may comprise a plurality of data objects such as files F1, F2, F3 and F4 in the depicted embodiment.
  • the sizes of the files may differ, and/or the number of observation records in any given file may differ from the number of observation records in other files.
  • The term "observation record" may be used synonymously with the term "data record" when referring to input data for machine learning operations.
  • a data record extraction request submitted by the client may indicate the data source 1802, e.g., by referring to locations (e.g., a directory name or a set of URLs) of files F1, F2, F3 and F4.
  • the MLS may ascertain or estimate the size of the data set as a whole (e.g., the combined size of the files) in the depicted embodiment, and determine an order in which the files should be logically concatenated to form a unified address space.
  • data set 1804 may be generated, for example, by logically concatenating the files in the order F1, F2, F3 and F4.
  • the client's data record extraction request may specify the order in which the files of a multi-file data set are to be combined (at least initially), and/or the sizes of the files.
  • the MLS may determine the concatenation order (e.g., based on any combination of various factors such as lexical ordering of the file names, the sizes of the files, and so on). It is noted that although files are used as an example of the data objects in which observation records are stored in FIG. 18 and some subsequent figures, similar techniques for input filtering may be used regardless of the type of the data objects used (e.g., volumes providing a block-level interface, database records, etc.) in various embodiments.
  • the concatenated address space of data set 1804 may then be sub-divided into a plurality of contiguous chunks, as indicated in chunk mapping 1806.
  • the size of a chunk (Cs) may be determined based on any of several factors in different embodiments. For example, in one embodiment, the chunk size may be set such that each chunk can fit into the memory of an MLS server (e.g., a server of pools 185 of FIG. 1) at which at least a portion of the response to the client's data record extraction request is to be generated.
  • if the memory available for data records at an MLS server is denoted Sm, a chunk size Cs such that Cs is less than or equal to Sm may be selected, as shown in FIG. 18.
  • the client request may indicate a chunk sizing preference, or the MLS may define a default chunk size to be used even if different servers have different amounts of memory available for the data records.
  • the chunk size to be used for responding to one record extraction request may differ from that used for another record extraction request; in other embodiments, the same chunk size may be used for a plurality of requests, or for all requests.
  • the sub-division of the concatenated data set 1804 into contiguous chunks may increase the fraction of the data set that can be read in via more efficient sequential reads than the fraction that has to be read via random reads, as illustrated below with respect to FIG. 19.
  • different chunks of a given chunk mapping may have different sizes - e.g., chunk sizes need not necessarily be identical for all the chunks of a given data set.
  • the initial sub-division of the data set into chunks represents a logical operation that may be performed prior to physical I/O operations on the data set.
  • an initial set of candidate chunk boundaries 1808 may be determined, e.g., based on the chunk sizes being used. As shown, candidate chunk boundaries need not be aligned with file boundaries in at least some embodiments. The candidate chunk boundaries may have to be modified somewhat to align chunk boundaries with observation record boundaries in at least some embodiments when the chunks are eventually read, as described below in greater detail with reference to FIG. 22.
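  • The logical concatenation and candidate chunk boundary computation described above may be sketched as follows. This is a minimal illustration, not the service's implementation: the `FileExtent` type, the function names, and the file sizes are hypothetical, and a single uniform chunk size is assumed.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class FileExtent:
    name: str  # hypothetical file identifier
    size: int  # size in bytes (made-up values below)


def candidate_chunk_boundaries(files: List[FileExtent], chunk_size: int) -> List[int]:
    """Return candidate chunk start offsets in the unified address space."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    total = sum(f.size for f in files)
    return list(range(0, total, chunk_size))


def locate(files: List[FileExtent], offset: int) -> Tuple[str, int]:
    """Map a unified-address-space offset back to (file name, offset within file)."""
    for f in files:
        if offset < f.size:
            return f.name, offset
        offset -= f.size
    raise ValueError("offset beyond end of data set")


# Files are logically concatenated in list order; no data is read at this stage.
files = [FileExtent("F1", 100), FileExtent("F2", 250),
         FileExtent("F3", 50), FileExtent("F4", 200)]
boundaries = candidate_chunk_boundaries(files, chunk_size=150)
```

  • In this sketch the boundaries at offsets 150 and 300 both fall inside F2, which reflects the point that candidate chunk boundaries need not be aligned with file boundaries.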
  • a chunk-level filtering plan 1850 may be generated for the chunked data set 1810 in some embodiments, e.g., based on contents of a filtering descriptor (which may also be referred to as a retrieval descriptor) included in the client's request.
  • the chunk-level filtering plan may indicate, for example, the sequence in which a plurality of in-memory filtering operations 1870 (e.g., 1870A, 1870B and 1870N) such as shuffles, splits, samples, or partitioning for parallel computations such as map reduce are to be performed on the chunks of the input data.
  • the machine learning service may support parallelized training of models, in which for example respective (and potentially partially overlapping) subsets of an input data set may be used to train a given model in parallel.
  • the duration of one training operation may overlap at least partly with the duration of another in such a scenario, and the input data set may be partitioned for the parallel training sessions using a chunk-level filtering operation.
  • a chunk-level shuffle for example, may involve rearranging the relative order of the chunks, without necessarily rearranging the relative order of observation records within a given chunk. Examples of various types of chunk-level filtering operations are described below.
  • the client may not necessarily be aware that at least some of the filtering operations will be performed on chunks of the data set rather than at the granularity of individual data records.
  • data transfers 1814 of the contents of the chunks (e.g., the observation records respectively included within C1, C2, C3 and C4) may be performed to load the data set into the memories of one or more MLS servers in accordance with the first filtering operation of the sequence.
  • a set of reads directed to one or more persistent storage devices at which at least some of the chunks are stored may be executed.
  • De-compression and/or decryption may also be required in some embodiments, e.g., prior to one or more operations of the sequence of filtering operations 1870.
  • if the data is stored in compressed form at the persistent storage devices, it may be de-compressed in accordance with de-compression instructions/metadata provided by the client or determined by the MLS.
  • the MLS may decrypt the data (e.g., using keys or credentials provided or indicated by the client).
  • At least a subset of the chunks C1-C4 may be present in MLS server memories. (If the first filtering operation of the sequence involves generating a sample, for example, not all the chunks may even have to be read in.)
  • the remaining filtering operations of plan 1850 may be performed in place in the MLS server memories, e.g., without copying the contents of any of the chunks to persistent storage in the depicted embodiment, and/or without re-reading the content of any of the chunks from the source data location.
  • the in-memory results of the first filtering operation may serve as the input data set for the second filtering operation.
  • the in-memory results of the second filtering operation may serve as the input data set for the third filtering operation.
  • the final output of the sequence of filtering operations may be used as input for record parsing 1818 (i.e., determining the content of various variables of the observation records).
  • the observation records 1880 generated as a result of parsing may then be provided as input to one or more destinations, e.g., to model(s) 1884 and/or feature processing recipe(s) 1882.
  • FIG. 19 illustrates tradeoffs associated with varying the chunk size used for filtering operation sequences on machine learning data sets, according to at least some embodiments.
  • Read operations corresponding to two example chunk mappings are shown for a given data set DS1 in FIG. 19.
  • data set DS1 is assumed to be stored on a single disk, such that a disk read head has to be positioned at a specified offset in order to start a read operation (either a random read or a set of sequential reads) on DS1.
  • in chunk mapping 1904A, a chunk size of S1 is used, and DS1 is consequently subdivided into four contiguous chunks starting at offsets O1, O2, O3 and O4 within the data set address space. (It is noted that the number of chunks in the example mappings shown in FIG. 19 is small; in practice, a data set may comprise hundreds or thousands of chunks.)
  • to read these four chunks, a total of (at least) four read head positioning operations (RHPs) would have to be performed. After positioning a disk read head at offset O1, for example, the first chunk comprising the contents of DS1 with offsets between O1 and O2 may be read in sequentially.
  • This sequential read (SR1) or set of sequential reads may typically be fast relative to random reads, because the disk read head may not have to be repositioned during the sequential reads, and disk read head positioning (also known as "seeking") may often take several milliseconds, which may be of the same order of magnitude as the time taken to sequentially read several megabytes of data.
  • reading the entire data set DS1 as mapped to four chunks may involve a read operations mix 1910A that includes four slow RHPs (RHP1 - RHP4) and four fast sequential reads (SR1 - SR4).
  • if, instead of a chunk size of S1, a chunk size of 2×S1 (twice the size used for mapping 1904A) were used, as in mapping 1904B, only two RHPs would be required (one to offset O1 and one to offset O3) as indicated in read operations mix 1910B, and the data set could be read in via two sequential read sequences SR1 and SR2. Thus, the number of slow operations required to read DS1 would be reduced in inverse proportion to the chunk size used.
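  • The inverse relationship between chunk size and the number of read head positioning operations may be illustrated with a short calculation. The sketch below assumes one positioning operation per chunk, as in the example above; the function names, the 5 ms seek time, the 100 MB/s sequential throughput, and the value chosen for S1 are all hypothetical figures used only to make the arithmetic concrete.

```python
import math


def read_operations_mix(data_set_bytes: int, chunk_bytes: int) -> dict:
    """Count positioning operations and sequential reads for one full pass."""
    if data_set_bytes <= 0 or chunk_bytes <= 0:
        raise ValueError("sizes must be positive")
    n_chunks = math.ceil(data_set_bytes / chunk_bytes)
    # One read head positioning (RHP) followed by one sequential read per chunk.
    return {"rhps": n_chunks, "sequential_reads": n_chunks}


def estimated_read_seconds(data_set_bytes: int, chunk_bytes: int,
                           seek_seconds: float = 0.005,
                           bytes_per_second: float = 100_000_000.0) -> float:
    """Rough cost model: seek cost per chunk plus sequential transfer time."""
    mix = read_operations_mix(data_set_bytes, chunk_bytes)
    return mix["rhps"] * seek_seconds + data_set_bytes / bytes_per_second


S1 = 256 * 1024 * 1024   # hypothetical chunk size for mapping 1904A
DS1 = 4 * S1             # data set size, so that mapping 1904A yields four chunks

mix_1904a = read_operations_mix(DS1, S1)       # four RHPs, four sequential reads
mix_1904b = read_operations_mix(DS1, 2 * S1)   # two RHPs, two sequential reads
```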
  • on the X-axis, chunk size increases from left to right, and on the Y-axis, the change in various metrics that results from the chunk size change is illustrated.
  • partial chunks or subsets of chunks may also be stored at an MLS server - e.g., the number of chunks stored in a given server's memory need not be an integer.
  • intra-chunk and/or cross-chunk filtering operations e.g., at the observation record level
  • the curves shown in graph 1990 are intended to illustrate broad qualitative relationships, not exact mathematical relationships. The rate at which the different metrics change with respect to chunk size may differ from that shown in the graph, and the actual relationships may not necessarily be representable by smooth curves or lines as shown.
  • a chunked data set 2010 comprises ten chunks C1 - C10.
  • a detailed view of chunk C1 at the top of FIG. 20a shows its constituent observation records OR1-1 through OR1-n, with successive observation records being separated by delimiters 2004.
  • the observation records of a data set or a chunk need not be of the same size.
  • in a chunk-level shuffle operation 2015, which may be one of the in-memory chunk-level filtering operations of a plan 1850, the chunks are reordered.
  • after the shuffle, the chunk order may be C5-C2-C7-C9-C10-C6-C8-C3-C1-C4.
  • in a subsequent chunk-level split, 70% of the chunks (e.g., C5-C2-C7-C9-C10-C6-C8) may be placed in a training set, and the remaining 30% of the chunks (C3-C1-C4) may be placed in a test set.
  • the internal ordering of the observation records within a given chunk remains unchanged in the depicted example.
  • the observation records of chunk C1 are in the same relative order (OR1-1, OR1-2, OR1-n) after the shuffle and split as they were before the shuffle and split filtering operations were performed. It is noted that for at least some types of filtering operations, in addition to avoiding copies to persistent storage, the chunk contents may not even have to be moved from one memory location to another in the depicted embodiment.
  • pointers to the chunks may be modified, such that the pointer that indicates the first chunk points to C5 instead of C1 after the shuffle, and so on.
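  • A chunk-level shuffle followed by a 70-30 split may be sketched as operations on a list of chunk identifiers, which plays the role of the pointers mentioned above; the record contents themselves are never moved. The function names, the seed, and the three-record chunks are hypothetical, and the shuffled order produced here will generally differ from the order shown in the figure.

```python
import random
from typing import Dict, List, Sequence, Tuple


def chunk_level_shuffle(chunk_ids: Sequence[str], seed: int) -> List[str]:
    """Return a new ordering of chunk identifiers; chunk contents are untouched."""
    order = list(chunk_ids)
    random.Random(seed).shuffle(order)
    return order


def chunk_level_split(order: Sequence[str], train_fraction: float) -> Tuple[List[str], List[str]]:
    """Split an ordering of chunks into a training part and a test part."""
    cut = round(len(order) * train_fraction)
    return list(order[:cut]), list(order[cut:])


# Ten in-memory chunks, each holding a few observation records (made-up data).
chunks: Dict[str, List[str]] = {
    f"C{i}": [f"OR{i}-{j}" for j in range(1, 4)] for i in range(1, 11)
}

order = chunk_level_shuffle(list(chunks), seed=7)
train_ids, test_ids = chunk_level_split(order, train_fraction=0.7)

# Records are reached through the reordered identifiers, in their original
# intra-chunk order.
training_records = [rec for cid in train_ids for rec in chunks[cid]]
```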
  • filtering at the observation record level may also be supported by the MLS.
  • a client's record extraction request may comprise descriptors for both chunk-level filtering and record-level filtering.
  • FIG. 20b illustrates an example sequence of in-memory filtering operations that includes chunk-level filtering as well as intra-chunk filtering, according to at least some embodiments.
  • the same set of chunk-level filtering operations are performed as those illustrated in FIG. 20a - i.e., a chunk-level shuffle 2015 is performed on data set 2004, followed by a 70-30 split 2020 into training set 2022 and test set 2024.
  • an intra-chunk shuffle 2040 is also performed, resulting in the re-arrangement of the observation records within some or all of the chunks.
  • the observation records of chunk C1 may be provided as input in the order OR1-5, OR1-n, OR1-4, OR1-1, OR1-2, to a model or feature processing recipe (or to a subsequent filtering operation), for example, which differs from the original order of the observation records prior to the chunk-level shuffle.
  • Observation records of the other chunks (e.g., C2 - C10) may be shuffled in a similar manner.
  • cross-chunk record-level filtering operations may also be supported. For example, consider a scenario in which at least two chunks Cj and Ck are read into the memory of a given MLS server S1. In a cross-chunk shuffle, at least some of the observation records of Cj may be shuffled or re-ordered with some of the observation records of Ck in S1's memory. Other types of record-level filtering operations (e.g., sampling, splitting, or partitioning) may also be performed across chunks that are co-located in a given server's memory in such embodiments.
  • multiple servers may cooperate with one another to perform cross-chunk operations.
  • only a single chunk-level filtering operation may be performed before the result set of the chunk-level operation is fed to a recipe for feature processing or to a model for training - that is, a sequence of multiple chunk-level operations may not be required.
  • Other types of operations (such as aggregation/collection of observation records or applying aggregation functions to values of selected variables of observation records) may also be performed subsequent to one or more chunk-level operations in at least some embodiments.
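  • The difference between an intra-chunk shuffle and a cross-chunk shuffle may be sketched as follows. This is an illustration under simplifying assumptions: chunks are modeled as Python lists held in one server's memory, the function names are hypothetical, and the cross-chunk variant keeps each chunk's record count unchanged, which the description does not require.

```python
import random
from typing import Dict, List


def intra_chunk_shuffle(chunks: Dict[str, List[str]], seed: int) -> Dict[str, List[str]]:
    """Reorder records inside each chunk; no record leaves its chunk."""
    rng = random.Random(seed)
    shuffled = {}
    for chunk_id, records in chunks.items():
        reordered = list(records)
        rng.shuffle(reordered)
        shuffled[chunk_id] = reordered
    return shuffled


def cross_chunk_shuffle(chunks: Dict[str, List[str]], co_located: List[str],
                        seed: int) -> Dict[str, List[str]]:
    """Shuffle records across the chunks that share one server's memory."""
    rng = random.Random(seed)
    pooled = [rec for cid in co_located for rec in chunks[cid]]
    rng.shuffle(pooled)
    result = dict(chunks)
    start = 0
    for cid in co_located:
        count = len(chunks[cid])          # keep each chunk's size unchanged
        result[cid] = pooled[start:start + count]
        start += count
    return result


chunks = {
    "C1": ["OR1-1", "OR1-2", "OR1-3", "OR1-4"],
    "C2": ["OR2-1", "OR2-2", "OR2-3"],
    "C3": ["OR3-1", "OR3-2"],
}
intra = intra_chunk_shuffle(chunks, seed=1)
cross = cross_chunk_shuffle(chunks, co_located=["C1", "C2"], seed=1)
```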
  • FIG. 21 illustrates examples of alternative approaches to in-memory sampling of a data set, according to at least some embodiments.
  • a 60% sample of a chunked data set 2110 comprising ten chunks C1 - C10 is to be obtained - that is, approximately 60% of the observation records of the data set are to be retained, while approximately 40% of the observation records are to be excluded from the output of the sampling operation.
  • chunk-level sampling 2112 of the chunks may be implemented, e.g., resulting in the selection of chunks C1, C2, C4, C6, C8 and C10 as the desired sample.
  • a combination of chunk-level and intra-chunk sampling may be used. For example, as indicated by the arrow labeled "2", in a first step, 80% of the chunks may be selected (resulting in the retention of chunks C1, C2, C3, C5, C6, C7, C8 and C9) using chunk-level sampling 2114.
  • in an intra-chunk sampling step 2116, 75% of the observation records of each of the retained chunks may be selected, resulting in a final output of approximately 60% of the observation records (since 75% of 80% is 60%).
  • 60% of each chunk's observation records may be sampled in a single intra-chunk sampling step 2118. Similar alternatives and combinations for achieving a given input filtering goal may also be supported for other types of filtering operations in at least some embodiments.
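  • The three alternatives for obtaining a 60% sample may be compared with a small sketch. It assumes ten chunks of twenty records each (a made-up layout chosen so that the fractions divide evenly); the function names and the seed are hypothetical, and the particular chunks selected will generally differ from those listed above.

```python
import random
from typing import Dict, List


def sample_chunks(chunk_ids: List[str], fraction: float, rng: random.Random) -> List[str]:
    """Chunk-level sampling: keep a fraction of whole chunks, in original order."""
    k = round(len(chunk_ids) * fraction)
    picked = set(rng.sample(chunk_ids, k))
    return [c for c in chunk_ids if c in picked]


def sample_records(records: List[str], fraction: float, rng: random.Random) -> List[str]:
    """Intra-chunk sampling: keep a fraction of one chunk's records, in order."""
    k = round(len(records) * fraction)
    keep = set(rng.sample(range(len(records)), k))
    return [r for i, r in enumerate(records) if i in keep]


def total_records(chunks: Dict[str, List[str]]) -> int:
    return sum(len(records) for records in chunks.values())


rng = random.Random(0)
data = {f"C{i}": [f"OR{i}-{j}" for j in range(1, 21)] for i in range(1, 11)}
ids = list(data)

# Approach 1: chunk-level sampling only (60% of the chunks).
approach_1 = {c: data[c] for c in sample_chunks(ids, 0.6, rng)}
# Approach 2: 80% of the chunks, then 75% of the records in each retained chunk.
approach_2 = {c: sample_records(data[c], 0.75, rng) for c in sample_chunks(ids, 0.8, rng)}
# Approach 3: intra-chunk sampling only (60% of every chunk's records).
approach_3 = {c: sample_records(data[c], 0.6, rng) for c in ids}
```

  • All three approaches retain 120 of the 200 records in this layout, but approach 1 touches only six chunks, which may matter when chunks that are not selected do not have to be read into memory at all.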
  • candidate chunk boundaries may have to be adjusted in order to ensure that individual observation records are not split, and to ensure consistency in the manner that observation records are assigned to chunks.
  • Data set 2202A comprises observation records OR1 - OR7 (which may vary in size) separated by record delimiters such as delimiter 2265.
  • newline characters such as "\n" may be used as record delimiters.
  • the candidate chunk boundaries happen to fall within the bodies of the observation records in data set 2202A.
  • Candidate chunk boundary (CCB) 2204A falls within observation record OR2 in the depicted example, CCB 2204B falls within OR4, and CCB 2204C falls within OR6.
  • the following approach may be used to identify the actual chunk boundaries (ACBs).
  • for each candidate chunk boundary, the first observation record delimiter found at or after the boundary is selected as the ending ACB for the chunk.
  • the position of the delimiter between OR2 and OR3 is identified as the actual chunk boundary 2214A corresponding to CCB 2204A.
  • ACB 2214B corresponds to the delimiter between OR4 and OR5
  • ACB 2214C corresponds to the delimiter between OR6 and OR7.
  • chunk C1 comprises OR1 and OR2
  • chunk C2 comprises OR3 and OR4
  • chunk C3 comprises OR5 and OR6
  • chunk C4 comprises OR7.
  • CCB 2204K happens to be aligned with the delimiter separating OR2 and OR3
  • CCB 2204L coincides with the delimiter separating OR4 and OR5
  • CCB 2204M coincides with the delimiter separating OR6 and OR7.
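  • The adjustment of candidate chunk boundaries to actual chunk boundaries may be sketched as a forward scan for the next record delimiter. The sketch assumes newline-delimited records held in a single byte string and represents each ACB as the offset just past the delimiter (that is, the start of the next chunk); the function names and sample data are hypothetical.

```python
from typing import List


def actual_chunk_boundaries(data: bytes, candidates: List[int],
                            delimiter: bytes = b"\n") -> List[int]:
    """For each candidate boundary, return the offset just past the first
    delimiter found at or after it."""
    boundaries: List[int] = []
    for ccb in candidates:
        pos = data.find(delimiter, ccb)
        end = len(data) if pos == -1 else pos + len(delimiter)
        # Skip duplicates that arise when two candidates share one delimiter.
        if not boundaries or end > boundaries[-1]:
            boundaries.append(end)
    return boundaries


def split_at(data: bytes, boundaries: List[int]) -> List[bytes]:
    """Cut the data at the given boundaries, dropping any empty tail."""
    starts = [0] + boundaries
    ends = boundaries + [len(data)]
    return [data[s:e] for s, e in zip(starts, ends) if s < e]


# Seven variable-length records standing in for OR1 - OR7.
data = b"r1\nrec2\nr3\nrecord4\nr5\nrec6\nr7\n"

# Candidates that fall inside OR2, OR4 and OR6 respectively.
acbs = actual_chunk_boundaries(data, [5, 14, 24])
chunks = split_at(data, acbs)

# Candidates that already coincide with delimiters yield the same result.
aligned_acbs = actual_chunk_boundaries(data, [7, 18, 26])
```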
  • FIG. 23 illustrates examples of jobs that may be scheduled at a machine learning service in response to a request for extraction of data records from any of a variety of data source types, according to at least some embodiments.
  • a set of programming interfaces 2361 enabling clients 164 to submit observation record extraction/retrieval requests 2310 in a data-source-agnostic manner may be implemented by the machine learning service.
  • Several different types 2310 of data sources may be supported by the MLS, such as an object storage service 2302 that may present a web-services interface to data objects, a block storage service 2304 that implements volumes presenting a block-device interface, any of a variety of distributed file systems 2306 (such as the Hadoop Distributed File System or HDFS), as well as single-host file systems 2308 (such as variants of Ext3 that may be supported by Linux-based operating systems).
  • databases may also be supported data sources.
  • Data objects e.g., files
  • data objects that are implemented using any of the supported types of data sources may be referred to in the retrieval requests, as indicated by the arrows labeled 2352A and 2352B.
  • a single client request may refer to input data objects such as files that are located in several different types of data sources, and/or in several different instances of one or more data source types.
  • different subsets of a given input data set may comprise files located at two different single-host file systems 2308, while respective subsets of another input data set may be located at an object storage service and the block-storage service.
  • An MLS request handler 180 may receive a record extraction request 2310 indicating a sequence of filtering operations that are to be performed on a specified data set located at one or more data sources, such as some combination of shuffling, splitting, sampling, partitioning (e.g., for parallel computations such as map-reduce computations, or for model training operations/sessions that overlap with each other in time and may overlap with each other in the training sets used), and the like.
  • a filtering plan generator 2380 may generate a chunk mapping of the specified data set, and a plurality of jobs to accomplish the requested sequence of filtering operations (either at the chunk level, the record level, or both levels) in the depicted embodiment, and insert the jobs in one or more MLS job queues 142.
  • one or more chunk read jobs 2311 may be generated to read in the data from the data source. If needed, separate jobs may be created to de-compress the chunks (such as jobs 2312) and/or decrypt the data (jobs 2313).
  • jobs 2314 may be generated for chunk-level filtering operations
  • jobs 2315 may be generated for observation record-level filtering operations. Filtering operations at the observation record level may comprise intra-chunk operations (e.g., shuffles of records within a given chunk) and/or cross-chunk operations (e.g., shuffles of records of two or more different chunks that may be co-located in the memory of a given MLS server) in the depicted embodiment.
  • respective jobs may be created for each type of operation for each chunk - thus, for example, if the chunk mapping results in 100 chunks, 100 jobs may be created for reading in one chunk respectively, 100 jobs may be created for the first chunk-level filtering operation, and so on.
  • a given job may be created for an operation involving multiple chunks, e.g., a separate job may not be required for each chunk.
  • the splitting of a data set into a training set and a test set may be implemented as separate jobs - one for the training set and one for the test set.
  • a given job may indicate dependencies on other jobs, and such dependencies may be used to ensure that the filtering tasks requested by the client are performed in the correct order.
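  • The use of job dependencies to enforce the order of filtering tasks may be sketched with a topological sort. The job names and the particular dependency structure below are hypothetical; the sketch uses the standard library's `graphlib.TopologicalSorter` (Python 3.9 or later) purely for illustration and does not represent the MLS job queue itself.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def build_job_graph(chunk_ids: List[str]) -> Dict[str, Set[str]]:
    """Map each job to the set of jobs it depends on."""
    deps: Dict[str, Set[str]] = {}
    read_jobs = [f"read:{cid}" for cid in chunk_ids]
    for job in read_jobs:
        deps[job] = set()                        # one chunk read job per chunk
    deps["chunk_shuffle"] = set(read_jobs)       # needs every chunk in memory
    deps["chunk_split"] = {"chunk_shuffle"}
    # Separate record-level jobs for the training set and the test set.
    deps["record_shuffle:train"] = {"chunk_split"}
    deps["record_shuffle:test"] = {"chunk_split"}
    return deps


graph = build_job_graph(["C1", "C2", "C3"])
# static_order() yields every job after all of the jobs it depends on.
execution_order = list(TopologicalSorter(graph).static_order())
```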
  • FIG. 24 illustrates example constituent elements of a record extraction request that may be submitted by a client using a programmatic interface of an I/O (input-output) library implemented by a machine learning service, according to at least some embodiments.
  • observation record (OR) extraction request 2401 may include a source data set indicator 2402 specifying the location(s) or address(es) from which the input data set is to be retrieved.
  • one or more URLs (uniform resource locators) or URIs (uniform resource identifiers) may be specified; for files, some combination of one or more file server host names, one or more directory names, and/or one or more file names may be provided as the indicator 2402.
  • a client may include instructions for logical concatenation of the objects of the data set to form a unified address space (e.g., the logical equivalent of "combine files of directory dl in alphabetical order by file name, then files of directory d2 in alphabetical order").
  • an expected format 2404 or schema for the observation records may be included in the OR extraction request, e.g., indicating the names of the variables or fields of the ORs, the inter- variable delimiters (e.g., commas, colons, semicolons, tabs, or other characters) and the OR delimiters, the data types of the variables, and so on.
  • the MLS may assign default data types (e.g., "string" or "character") to variables for which data types are not indicated by the client.
  • the OR extraction request 2401 may include compression metadata 2406, indicating for example the compression algorithm used for the data set, the sizes of the units or blocks in which the compressed data is stored (which may differ from the sizes of the chunks on which chunk-level in-memory filtering operations are to be performed), and other information that may be necessary to correctly de-compress the data set.
  • Decryption metadata 2408 such as keys, credentials, and/or an indication of the encryption algorithm used on the data set may be included in a request 2401 in some embodiments.
  • Authorization/authentication metadata 2410 to be used to obtain read access to the data set may be provided by the client in request 2401 in some implementations and for certain types of data sources.
  • Such metadata may include, for example, an account name or user name and a corresponding set of credentials, or an identifier and password for a security container (similar to the security containers 390 shown in FIG. 3).
  • OR extraction request 2401 may include one or more filtering descriptors 2412 in the depicted embodiment, indicating for example the types of filtering operations (shuffle, split, sample, etc.) that are to be performed at the chunk level and/or at the OR level, and the order in which the filtering operations are to be implemented.
  • one or more descriptors 2452 may be included for chunk-level filtering operations
  • one or more descriptors 2454 may be included for record-level (e.g., intra-chunk and/or cross-chunk) filtering operations.
  • Each such descriptor may indicate parameters for the corresponding filtering operation - e.g., the split ratio for split operations, the sampling ratio for sampling operations, the number of partitions into which the data set is to be subdivided for parallel computations or parallel training sessions, the actions to be taken if a record's schema is found invalid, and so on.
  • the OR extraction request 2401 may include chunking preferences 2414 indicating, for example, a particular acceptable chunk size or a range of acceptable chunk sizes.
  • the destination(s) to which the output of the filtering operation sequence is to be directed (e.g., a feature processing recipe or a model) may also be indicated in the request.
  • a client may indicate performance goals 2418 for the filtering operations, such as a "complete-by" time, which may be used by the MLS to select the types of servers to be used, or to generate a filtering sequence plan that is intended to achieve the desired goals.
  • not all of the constituent elements shown in FIG. 24 may be included within a record extraction request - for example, the compression and/or decryption related fields may only be included for data sets that are stored in a compressed and/or encrypted form.
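  • The constituent elements of an OR extraction request may be modeled as a simple data structure. The class names, field names, value formats, and the `validate` helper below are hypothetical and are not an API of the service; optional elements default to empty, matching the note that not every element need be present.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

ALLOWED_OPERATIONS = {"shuffle", "split", "sample", "partition"}
ALLOWED_LEVELS = {"chunk", "record"}


@dataclass
class FilteringDescriptor:
    operation: str                      # e.g. "shuffle", "split", "sample"
    level: str                          # "chunk" or "record"
    params: Dict[str, Any] = field(default_factory=dict)


@dataclass
class ORExtractionRequest:
    source_data_set: List[str]                              # locations or addresses
    record_format: Optional[Dict[str, Any]] = None          # expected format or schema
    compression_metadata: Optional[Dict[str, Any]] = None
    decryption_metadata: Optional[Dict[str, Any]] = None
    auth_metadata: Optional[Dict[str, Any]] = None
    filtering_descriptors: List[FilteringDescriptor] = field(default_factory=list)
    chunking_preferences: Optional[Dict[str, Any]] = None
    destinations: List[str] = field(default_factory=list)
    performance_goals: Optional[Dict[str, Any]] = None

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the request looks usable."""
        problems = []
        if not self.source_data_set:
            problems.append("source data set indicator is required")
        for d in self.filtering_descriptors:
            if d.operation not in ALLOWED_OPERATIONS:
                problems.append(f"unknown operation: {d.operation}")
            if d.level not in ALLOWED_LEVELS:
                problems.append(f"unknown level: {d.level}")
        return problems


request = ORExtractionRequest(
    source_data_set=["d1/", "d2/"],     # hypothetical directory names
    record_format={"field_delimiter": ",", "record_delimiter": "\n"},
    filtering_descriptors=[
        FilteringDescriptor("shuffle", "chunk"),
        FilteringDescriptor("split", "chunk", {"ratio": 0.7}),
        FilteringDescriptor("shuffle", "record"),
    ],
    chunking_preferences={"max_chunk_bytes": 64 * 1024 * 1024},
)
bad_request = ORExtractionRequest(
    source_data_set=[],
    filtering_descriptors=[FilteringDescriptor("sort", "chunk")],
)
```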
  • FIG. 25 is a flow diagram illustrating aspects of operations that may be performed at a machine learning service that implements an I/O library for in-memory filtering operation sequences on large input data sets, according to at least some embodiments.
  • An I/O library that enables clients to submit observation record extraction requests similar to those illustrated in FIG. 24 may be implemented.
  • the I/O library may be agnostic with respect to the type of data store at which the input data set is stored - e.g., a common set of programmatic interfaces may be provided for extracting records stored at any combination of several different data store types.
  • Such an OR extraction request may be received (element 2501), indicating a source data set that may be too large to fit into the available memory of an MLS server.
  • the OR extraction request may include one or more descriptors indicating a sequence of filtering operations that are to be performed on the input data set.
  • a chunk size to be used for transferring contiguous subsets of the input data set into the memories of one or more MLS servers may be determined (element 2504), e.g., based on any of various factors such as the memory capacity constraints of the MLS servers, a preference indicated by the requesting client via parameters of the request, a default setting of the MLS, the estimated or actual size of the input data set, and so on. In some implementations several different chunk sizes may be selected - e.g., some MLS servers may have a higher memory capacity than others, so the chunks for the servers with more memory may be larger.
  • the objects may be logically concatenated to form a single unified address space (element 2507) in some embodiments.
  • the sequence in which the objects are concatenated may be determined, for example, based on instructions or guidance provided in the request, based on alphanumeric ordering of the object names, in order of file size, in random order, or in some other order selected by the MLS.
  • a chunk mapping may be generated for the data set (element 2510), indicating a set of candidate chunk boundaries based on the selected chunk size(s) and the unified address space.
  • the positions or offsets of the candidate chunk boundaries within the data object or objects of the input data set may be computed as part of the mapping generation process.
  • a plan for a sequence of chunk-level filtering operations corresponding to the filtering descriptor(s) in the OR extraction request may be created (element 2513).
  • the plan may include record-level filtering operations (e.g., intra-chunk or cross-chunk operations), in addition to or instead of chunk-level filtering operations, in some embodiments.
  • Cross-chunk operations may, for example, be performed on observation records of several chunks that are co-located in the memory of a given MLS server in some embodiments. In other embodiments, cross-chunk operations may also or instead be performed on chunks that have been read into the memories of different MLS servers.
  • the types of filtering operations supported may include sampling, splitting, shuffling, and/or partitioning. Based at least in part on the first filtering operation of the plan, a data transfer of at least a subset of the chunks of the data set from persistent storage to MLS server memories may be performed (element 2516).
  • the data transfer process may include decryption and/or decompression in addition to read operations in some embodiments.
  • the client may request the MLS to encrypt and/or compress the data prior to transferring the chunks from the source locations to the MLS servers, and then to perform the reverse operation (decryption and/or decompression) once the encrypted/compressed data reaches the MLS servers.
  • the remaining filtering operations may be performed in place in the depicted embodiment, e.g., without copying the chunks to persistent storage or re-reading the chunks from their original source locations (element 2519).
  • respective jobs may be generated and placed in an MLS job queue for one or more of the filtering operations.
  • a record parser may be used to obtain the observation records from the output of the sequence of filtering operations performed (element 2522).
  • the ORs may be provided programmatically to the requesting client (e.g., as an array or collection returned in response to the API call representing the OR extraction request), and/or to a specified destination such as a model or a feature processing recipe (element 2525).
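  • The later stages of this flow, in which each filtering operation consumes the in-memory output of the previous one and a record parser runs on the final output, may be sketched as follows. The function names, the two example filtering operations, and the comma-separated record layout are hypothetical simplifications.

```python
from typing import Callable, Dict, List

Chunk = bytes
FilterOp = Callable[[List[Chunk]], List[Chunk]]


def apply_filter_plan(chunks: List[Chunk], plan: List[FilterOp]) -> List[Chunk]:
    """Run the filtering operations in sequence, entirely in memory."""
    current = chunks
    for op in plan:
        # The output of one operation is the input of the next; nothing is
        # written to persistent storage or re-read from the source.
        current = op(current)
    return current


def parse_records(chunks: List[Chunk], field_names: List[str],
                  field_delim: str = ",", record_delim: str = "\n") -> List[Dict[str, str]]:
    """Split chunk contents into observation records with named variables."""
    records = []
    for chunk in chunks:
        for line in chunk.decode("utf-8").split(record_delim):
            if line:
                records.append(dict(zip(field_names, line.split(field_delim))))
    return records


chunks = [b"1,a\n2,b\n", b"3,c\n", b"4,d\n5,e\n"]
plan: List[FilterOp] = [
    lambda cs: list(reversed(cs)),   # stand-in for a chunk-level shuffle
    lambda cs: cs[:2],               # stand-in for a chunk-level sample
]
records = parse_records(apply_filter_plan(chunks, plan), ["id", "label"])
```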
  • FIG. 26 illustrates an example of an iterative procedure that may be used to improve the quality of predictions made by a machine learning model, according to at least some embodiments.
  • the procedure may include re-splitting or re-shuffling the input data set for each of several cross-validation iterations, for example, as described below.
  • An input data set comprising labeled observation records (i.e., observation records for which the values or "labels" of dependent variables are known)
  • An in-memory chunk-level split operation 2604 may be performed to obtain a training set 2610 and a test set 2615. For example, 80% of the chunks may be included in the training set 2610 in one scenario, and the remaining 20% of the chunks may be included in the test set 2615.
  • a candidate model 2620 may be trained in a training run 2618 (e.g., for a linear regression model, candidate coefficients to be assigned to the various independent/input variables of the data set may be determined). The candidate model 2620 may then be used to make predictions on the test set, and the evaluation results 2625 of the model may be obtained (e.g., indicating how accurately the model was able to generate predictions for the dependent variables of the records of the test set using the candidate coefficients).
  • RMSE (root mean square error)
  • RMSD (root mean square deviation)
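  • As one example of an evaluation metric, the root mean square error over a test set may be computed as the square root of the mean of the squared differences between predicted and actual values. The function name and the sample values below are hypothetical.

```python
import math
from typing import Sequence


def rmse(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Root mean square error between predictions and known labels."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


perfect = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])   # predictions match the labels
off_by_three = rmse([3.0, 3.0], [0.0, 0.0])        # every prediction is off by 3
```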
  • Model tuning 2672 may comprise modifying the set of independent or input variables being used for the predictions, changing model execution parameters (such as a minimum bucket size or a maximum tree depth for tree-based classification models), and so on, and executing additional training runs 2618. Model tuning may be performed iteratively using the same training and test sets, varying some combination of input variables and parameters in each iteration in an attempt to enhance the accuracy or quality of the results.
  • changes 2674 may be made to the training and test data sets for successive training-and-evaluation iterations.
  • the input data set may be shuffled (e.g., at the chunk level and/or at the observation record level), and a new pair of training/test sets may be obtained for the next round of training.
  • the quality of the data may be improved by, for example, identifying observation records whose variable values appear to be invalid or outliers, and deleting such observation records from the data set.
  • One common approach for model improvement may involve cross-validating a candidate model using a specified number of distinct training and test sets extracted from the same underlying data, as described below with reference to FIG. 27.
  • data set changes 2674 may also be performed iteratively in some embodiments, e.g., until either a desired level of quality/accuracy is obtained, until resources or time available for model improvement are exhausted, or until the changes being tried no longer lead to much improvement in the quality or accuracy of the model.
  • FIG. 27 illustrates an example of data set splits that may be used for cross-validation of a machine learning model, according to at least some embodiments.
  • a data set comprising labeled observation records 2702 is split five different ways to obtain respective training sets 2720 (e.g., 2720A - 2720E) each comprising 80% of the data, and corresponding test sets 2710 (e.g., 2710A-2710E) comprising the remaining 20% of the data.
  • Each of the training sets 2720 may be used to train a model, and the corresponding test set 2710 may then be used to evaluate the model.
  • the model may be trained using training set 2720A and then evaluated using test set 2710A.
  • a different training set 2720B (shown in two parts, part 1 and part 2 in FIG. 27) comprising 80% of the input data may be used, and a different test set 2710B may be used for evaluating the model.
  • the MLS may implement an API allowing a client to request k-fold cross validation in some embodiments, where k is an API parameter indicating the number of distinct training sets (and corresponding test sets) to be generated for training a specified model using the same underlying input data set.
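  • The k-fold idea can be sketched as follows. This is not the MLS API; the function name `k_fold_splits` and its parameters are hypothetical, and the folds here are contiguous purely for readability.

```python
def k_fold_splits(records, k):
    """Yield k (training, test) pairs; each record is in exactly one test set."""
    if not 2 <= k <= len(records):
        raise ValueError("k must be between 2 and the number of records")
    n = len(records)
    for i in range(k):
        start, end = i * n // k, (i + 1) * n // k
        test = records[start:end]
        training = records[:start] + records[end:]
        yield training, test

# Ten hypothetical chunk names split five ways (80% training, 20% test).
chunks = [f"C{i}" for i in range(1, 11)]
folds = list(k_fold_splits(chunks, k=5))
```

  • With k = 5 each fold holds out a different 20% of the data, matching the five-way arrangement of FIG. 27. A randomized variant would shuffle the records before slicing.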
  • the labeled observation records are distributed among eight chunks C1 - C8 in the example shown in FIG. 27.
  • the chunk sizes and boundaries may be determined based on any of various factors, including memory size limits at MLS servers, client preferences, and so on.
  • the split ratio desired (such as the 80-20 split illustrated in FIG. 27) may result in the observation records of a given chunk having to be distributed across a training set and the corresponding test set. That is, partial chunks may have to be included in training and test sets in some cases.
  • Some observation records of chunk C2 may be included in test set 2710A, while other observation records of chunk C2 may be included in training set 2720A, for example.
  • training sets may appear to comprise contiguous portions of the input data set in FIG. 27, in practice the training and test data sets may be obtained using random selection (e.g., either at the chunk level, at the observation record level, or at both levels) in at least some embodiments.
  • the quality of the predictions made may in general improve, as the effect of localized non-uniformity of the input variable values in different subsets of the input data set may be reduced.
  • FIG. 28 illustrates examples of consistent chunk-level splits of input data sets for cross validation that may be performed using a sequence of pseudo-random numbers, according to at least some embodiments.
  • a random number based split algorithm 2804 is used to divide data set chunks C1-C10 into training and test sets for successive training-evaluation iterations (TEIs). Each TEI may, for example, represent a particular cross-validation iteration such as those illustrated in FIG. 27, although such training and evaluation iterations may also be performed independently of whether cross-validation is being attempted.
  • a pseudo-random number generator (PRNG) 2850 may be used to obtain a sequence 2872 of pseudo-random numbers.
  • the PRNG 2850 may be implemented, for example, as a utility function or method of an MLS library or a programming language library accessible from a component of the MLS.
  • the state of PRNG 2850 may be deterministically initialized or reset using a seed value S (e.g., a real number or string) in the depicted embodiment, such that the sequence of pseudo-random numbers that is produced after resetting the state with a given seed S is repeatable (e.g., if the PRNG is reset using the same seed multiple times, the same sequence of PRNs would be provided after each such state reset).
  • the number of chunks of the input data set (10) and the split ratio (80-20) have been chosen such that an integer number of chunks is placed into the training set and the test set - i.e., observation records of a given chunk do not have to be distributed between both a training set and a test set.
  • the pseudo-random numbers (PRNs) of the sequence 2872 produced by the PRNG may be used to select members of the training and test sets. For example, using the first PRN 2874 (produced after resetting the state of the PRNG), which has a value of 84621356, chunk C7 may be selected for inclusion in the training set 2854A to be used for TEI 2890A.
  • chunk C2 may be selected for the training set 2854A, and so on.
  • the random-number based split algorithm 2804 may rely on certain statistical characteristics of the PRN sequence to correctly designate each chunk of the input data set into either the training set or the test set in the depicted example scenario.
  • the statistical characteristics may include the property that a very large number of distinct pseudo-random numbers (or distinct sub-sequences of some length N) are expected to be produced in any given sequence (e.g., before a given PRN is repeated in the sequence, or before a sub-sequence of length N is repeated).
  • the sequence of PRNs 2872 generated may ensure that each chunk of the input data is mapped to either the training set or the test set, and no chunk is mapped to both the training set and the test set.
  • Such a split operation in which each object (e.g., chunk or observation record) of the source data set is placed in exactly one split result set (e.g., a training set or the corresponding test set), may be referred to as a "consistent" or "valid" split.
  • a split operation in which one or more objects of the input data set are either (a) not placed in any of the split result sets, or (b) placed in more than one of the split result sets may be termed an "inconsistent" or "invalid" split.
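  • The two definitions above can be expressed as a small check. This is a sketch only; the function name `is_consistent_split` is hypothetical, and it assumes that the source objects are unique and hashable (for example, chunk identifiers).

```python
from collections import Counter

def is_consistent_split(source, result_sets):
    """Return True if every source object is in exactly one result set."""
    placements = Counter(obj for result in result_sets for obj in result)
    placed_once = all(count == 1 for count in placements.values())
    return placed_once and set(placements) == set(source)

chunks = [f"C{i}" for i in range(1, 11)]
training = ["C7", "C2", "C4", "C5", "C9", "C1", "C10", "C8"]
valid = is_consistent_split(chunks, [training, ["C3", "C6"]])
invalid = is_consistent_split(chunks, [training, ["C7", "C2"]])
```

  • The second call fails the check for both reasons named in the definition: C7 and C2 are placed twice, and C3 and C6 are not placed at all.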
  • the sequence of the PRNs used for each of the two split mappings (the mapping to the training set and the mapping to the test set), and hence the state of the PRN source, may influence the probability of producing inconsistent splits in at least some embodiments.
  • the use of inconsistent splits for training and evaluation may result in poorer prediction quality and/or poorer accuracy than if consistent splits are used.
  • intra-chunk shuffles may be implemented within the training set and/or the test set, e.g., based on contents of a client request in response to which the TEIs are being implemented.
  • the observation records within a given chunk (e.g., C7) of training set 2854A may be re-ordered in memory (without copying the records to persistent storage) relative to one another before they are provided as input to the model being trained.
  • the observation records of a given chunk (e.g., C3) of test set 2856A may be shuffled in memory before the model is evaluated using the test set.
  • the first TEI 2890A may be implemented with a training set 2854A of chunks (C7,C2,C4,C5,C9,C1,C10,C8) and a test set 2856A of chunks (C3,C6).
  • the same PRNG 2850 may also be used (e.g., without re-initialization or resetting), to split the input data set for the next TEI 2890B. It is noted that for some models and/or applications, only one TEI may be implemented in various embodiments.
  • training set 2854B of TEI 2890B comprises chunks (C8,C3,C5,C6,C10,C2,C1,C9) and the corresponding test set 2856B comprises chunks (C4,C7).
  • Both the splits illustrated in FIG. 28 are consistent/valid according to the definitions provided above. It is noted that although the splitting of the data is illustrated at the chunk level in FIG. 28, the same type of relationship between the PRNG state and the consistency of the split may apply to splits at the observation record level (or splits involving partial chunks) in at least some embodiments.
  • a split involving partial chunks may be implemented in some embodiments as a chunk-level split in which a non-integer number of chunks is placed in each split result set, followed by an intra-chunk split for those chunks whose records are distributed across multiple split result sets.
  • the PRN-based approach to splitting a data set may also be used for N-way splits (where N > 2).
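  • A minimal sketch of a seeded, chunk-level split for successive TEIs follows. It is an assumption-laden simplification: both result sets are cut from a single shuffled order, so consistency holds by construction, whereas the text describes separate mappings to the training and test sets. The function name `split_chunks`, the seed value and the fractions are all hypothetical.

```python
import random

def split_chunks(chunks, rng, fractions=(0.8, 0.2)):
    """Partition chunks into one result set per fraction, using PRNs from rng."""
    order = list(chunks)
    rng.shuffle(order)                      # consumes PRNs from rng
    boundaries, position = [], 0
    for fraction in fractions[:-1]:
        position += round(fraction * len(order))
        boundaries.append(position)
    boundaries.append(len(order))           # last set takes the remainder
    result_sets, start = [], 0
    for end in boundaries:
        result_sets.append(order[start:end])
        start = end
    return result_sets

chunks = [f"C{i}" for i in range(1, 11)]
prng = random.Random(42)                        # seed S; the value is arbitrary
train_a, test_a = split_chunks(chunks, prng)    # first TEI
train_b, test_b = split_chunks(chunks, prng)    # next TEI, PRNG not reset
three_way = split_chunks(chunks, random.Random(42), (0.6, 0.2, 0.2))
```

  • Because the PRNG is not reset between the two calls, the second TEI draws the numbers that follow those used by the first. Passing three fractions gives an N-way split with N = 3.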
  • FIG. 29 illustrates an example of an inconsistent chunk-level split of an input data set that may occur as a result of inappropriately resetting a pseudo-random number generator, according to at least some embodiments.
  • a PRNG 2850 is initialized using a seed S.
  • the PRN sequence 2972A is used by the split algorithm 2804 to produce the training set 2954A comprising the same set of chunks of data set 2844A that were included in training set 2854A of FIG. 28 (C7,C2,C4,C5,C9,C1,C10,C8).
  • the PRNG is re-initialized.
  • the sequence of pseudo-random numbers generated is repeated - e.g., the first PRN generated after the reset is once again 84621356, the second PRN is once again 56383672, and so on.
  • the split algorithm chooses chunks C7 and C2 for inclusion in test set 2956A as a result of the repetition of PRNs in the depicted example. Such a split may be deemed invalid or inconsistent because C2 and C7 are in both the training set and the test set (and because chunks C3 and C6 are in neither the training set nor the test set).
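  • The failure mode of FIG. 29 can be reproduced with a short sketch. The selection routine `select` is a hypothetical stand-in for the split algorithm, and the seed is arbitrary; the point is only that re-seeding before test set selection replays the draws already used for the training set.

```python
import random

def select(chunks, rng, count):
    """Pick `count` distinct chunks, one pseudo-random draw per pick."""
    remaining = list(chunks)
    picked = []
    for _ in range(count):
        picked.append(remaining.pop(rng.randrange(len(remaining))))
    return picked

chunks = [f"C{i}" for i in range(1, 11)]
seed = 7                                  # arbitrary seed S

prng = random.Random(seed)
training = select(chunks, prng, 8)

prng.seed(seed)                           # inappropriate reset
test = select(chunks, prng, 2)            # replays the first two draws

in_both = set(training) & set(test)
in_neither = set(chunks) - set(training) - set(test)
```

  • Whatever the seed, the test set equals the first two chunks chosen for the training set, so two chunks land in both result sets and two land in neither.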
  • a PRNG may not be invoked in real time for each placement of a given chunk or record into a training set or a test set.
  • a list of pseudo-random numbers or random numbers may be generated beforehand (e.g., using a PRNG), and the numbers in the pre-generated list may be used one by one for the split placements.
  • split consistency may be achieved in at least some embodiments.
  • respective mechanisms may be implemented to (a) save a current state of a PRNG and (b) to re-set a PRNG to a saved state in one embodiment.
  • an API "save_state(PRNG)" can be invoked to save the internal state of a PRNG to an object "stateAfterTraining" after the training set of a TEI has been generated, and a different API "set_state(PRNG, stateAfterTraining)" can be invoked to reset the state of the PRNG (or a different PRNG) to the saved state just before starting the selection of the test set of the TEI.
  • the same sequence of PRNs may be obtained as would be obtained if all the PRNs were obtained without saving/re-setting the PRNG state.
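  • The save-and-restore mechanism can be sketched with Python's standard `random` module, whose `getstate()` and `setstate()` methods are used here as stand-ins for the "save_state" and "set_state" calls named above. The seed and the counts of numbers drawn are arbitrary assumptions.

```python
import random

prng = random.Random(2024)                          # arbitrary seed
training_prns = [prng.random() for _ in range(8)]   # used for the training set
state_after_training = prng.getstate()              # stand-in for save_state(PRNG)

# What the original PRNG yields if it simply keeps going.
uninterrupted = [prng.random() for _ in range(2)]

other_prng = random.Random()                        # a different PRNG instance
other_prng.setstate(state_after_training)           # stand-in for set_state(...)
test_prns = [other_prng.random() for _ in range(2)]
```

  • The restored generator produces the same numbers as the uninterrupted one, which is the property the test set selection relies on.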
  • different PRN sources may be used for the training set selection of a given TEI than are used for the test set selection, as described below with respect to FIG. 30, and the state of such PRN sources may be synchronized to help achieve consistent splits.
  • the selection of a test set from a given input data set may occur asynchronously with respect to (and in some cases much later than) the selection of the corresponding training set.
  • separate jobs may be inserted in the MLS job queue for the selection of a training set and the selection of the corresponding test set, and the jobs may be scheduled independently of each other in a manner similar to that described earlier.
  • the MLS may maintain state information pertaining to the selection of the training set in some embodiments, which can then be used to help generate the test set.
  • FIG. 30 illustrates an example timeline of scheduling related pairs of training and evaluation jobs, according to at least some embodiments. Four events that occur during a period of approximately four hours (from 11:00 to 15:00 on a particular day) of a job scheduler's timeline are shown.
  • Job J1 is scheduled at a set of servers SS1 of the MLS, and may include the selection of a training set, e.g., either at the chunk-level, at the observation record level, or at both levels.
  • a pseudo-random number source PRNS 3002A (such as a function or method that returns a sequence of PRNs, or a list of pre-generated PRNs) may be used to generate the training set for Job J1.
  • a training job J2 may be scheduled at a server set SS2, for a training-and-evaluation iteration TEI2 for a different model M2.
  • the training set for job J2 may be obtained using pseudo-random numbers obtained from a different PRNS 3002B.
  • a test job J3 for the evaluation phase of TEI1 is scheduled, more than two hours later than job J1.
  • the scheduling of J3 may be delayed until J1 completes, for example, and the size of the data set being used for J1/J3 may be so large that it takes more than two hours to complete the training phase in the depicted example.
  • J3 may be scheduled at a different set of servers SS3 than were used for J1.
  • a different PRNS 3002C may be available at server set SS3 than was available at server set SS1.
  • PRNS 3002C may be synchronized with PRNS 3002A in the depicted embodiment.
  • a seed value Seed1 was used to initialize PRNS 3002A, and 1000 pseudo-random numbers were obtained from PRNS 3002A during job J1.
  • the same seed value Seed1 may be used to initialize a logically equivalent PRNS 3002C, and 1000 pseudo-random numbers may be acquired from PRNS 3002C before the pseudo-random numbers to be used for test set selection are acquired.
  • Equivalents of the "save_state()" and "set_state()" calls discussed above may be used in some embodiments to synchronize PRNS 3002C with PRNS 3002A.
  • the MLS may ensure that (a) the same list is used for J1 and J3 and (b) the first PRN in the list that is used for J3 is in a position immediately after the position of the last PRN used for J1.
  • Other synchronization techniques may be used in various embodiments to ensure that the sequence of pseudo-random numbers used for test set determination is such that a valid and consistent split is achieved for jobs J1 and J3.
  • PRNS 3002D may be synchronized with PRNS 3002B.
  • the numbers used in J3 may have to be coordinated with respect to the numbers used in J1, and the numbers used in J4 may have to be coordinated with respect to the numbers used in J2.
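  • The seed-and-skip synchronization described for PRNS 3002A and PRNS 3002C can be sketched as follows, using two independent `random.Random` instances to stand in for sources on different server sets. The seed value and variable names are assumptions.

```python
import random

SEED_1 = 12345        # stands in for Seed1; the value is arbitrary
USED_BY_J1 = 1000     # count of PRNs consumed by training job J1

# Job J1 on server set SS1 draws its numbers from the first source.
prns_a = random.Random(SEED_1)
j1_numbers = [prns_a.random() for _ in range(USED_BY_J1)]
expected_next = [prns_a.random() for _ in range(5)]   # what would follow J1's numbers

# Job J3 on server set SS3 builds a logically equivalent source later.
prns_c = random.Random(SEED_1)
for _ in range(USED_BY_J1):
    prns_c.random()                                   # acquire and discard
j3_numbers = [prns_c.random() for _ in range(5)]
```

  • Counting works here because both jobs draw numbers with the same method. If the two sides consumed the underlying generator differently, saving and restoring the state would be the safer form of synchronization.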
  • FIG. 31 illustrates an example of a system in which consistency metadata is generated at a machine learning service in response to a client request, according to at least some embodiments.
  • the consistency metadata may be retained or shared across related jobs (e.g., a training job and a corresponding evaluation job) to achieve the kinds of coordination/synchronization discussed with respect to FIG. 30.
  • a client 164 of an MLS may submit a split request 3110 via a data-source-agnostic programmatic interface 3161 of an MLS I/O library.
  • the split request may be part of a cross-validation request, or part of a request to perform a specified number of training-and-evaluation iterations.
  • the split request may represent a variant of the type of observation record extraction request 2401 shown in FIG. 24.
  • the split request may include, for example, one or more client-specified seed values 3120 that may be used for obtaining the pseudo-random numbers for the requested split operations, although such seed values may not have to be provided by the client in at least one embodiment.
  • the split request 3110 may include an indication (e.g., file names, paths or identifiers) of the input data set 3122.
  • Split parameters 3124 may indicate one or more training-to-test ratios (e.g., the 80-20 split ratio illustrated in FIG. 29).
  • the desired iteration count 3126 may be included in the client request.
  • a request handler component 180 of the MLS may pass on the request 3110 to a plan generator 3180 in the depicted embodiment.
  • the plan generator may determine a set of consistency metadata 3152, e.g., metadata that may be shared among related jobs that are inserted in the MLS job queue for the requested split iterations.
  • the metadata 3152 may comprise the client-provided seed values 3120, for example.
  • the plan generator 3180 may determine a set of one or more seed values if a client-provided seed value is not available (e.g., because the API 3161 used for the client request does not require a seed to be provided, or because the client failed to provide a valid seed value).
  • Such MLS-selected seed values may be based, for example, on some combination of input data set IDs 3122 (e.g., a hash value corresponding to a file name or directory name of the input data set may be used as a seed), client identifier, the time at which the request 3110 was received, the IP address from which the request 3110 was received, and so on.
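  • One way such a service-selected seed might be derived is sketched below. The choice of SHA-256, the field separator, the 64-bit width and every sample value are assumptions made for illustration; the text does not specify how the attributes are combined.

```python
import hashlib

def derive_seed(data_set_id, client_id, request_time, source_ip):
    """Derive a 64-bit integer seed from request attributes (illustrative)."""
    material = "|".join([data_set_id, client_id, request_time, source_ip])
    digest = hashlib.sha256(material.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

# Hypothetical request attributes.
seed = derive_seed("datasets/input.csv", "client-164",
                   "2014-06-30T11:00:00Z", "192.0.2.10")
```

  • Because the request time is part of the input, the seed cannot be re-derived later from the data set alone, which is one reason to record it in the consistency metadata shared by related jobs.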
  • the MLS may have several sources of pseudo-random numbers available, such as PRNGs or lists of pre-generated PRNs, and an identifier of one or more PRN sources may be included in the consistency metadata 3152.
  • a pointer to the last-used PRN within a specified list may be used, such that each entity that uses the list (e.g., an MLS job executor) updates the pointer after it has used some number of the list's PRNs.
  • a state record of a PRNG may be included in the metadata. The state record may be updated by each entity (e.g., an MLS job executor) that used the PRNG, e.g., so that the next entity that uses the PRNG can set its state appropriately to obtain PRNs that can be used to perform a consistent split.
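  • The list-and-pointer form of consistency metadata might look like the following sketch. The class name `ConsistencyMetadata`, its fields and the `take` method are hypothetical, and a real service would also need to persist the record and guard against concurrent updates.

```python
import random
from dataclasses import dataclass

@dataclass
class ConsistencyMetadata:
    """Hypothetical record shared by related split jobs."""
    prn_list: list          # pre-generated pseudo-random numbers
    next_index: int = 0     # position just past the last-used PRN

    def take(self, count):
        """Return the next `count` PRNs and advance the shared pointer."""
        end = self.next_index + count
        if end > len(self.prn_list):
            raise ValueError("pre-generated PRN list exhausted")
        numbers = self.prn_list[self.next_index:end]
        self.next_index = end
        return numbers

generator = random.Random(99)      # arbitrary seed used to pre-generate the list
metadata = ConsistencyMetadata([generator.random() for _ in range(20)])
training_prns = metadata.take(8)   # used by the training-set job
test_prns = metadata.take(2)       # used later by the test-set job
```

  • The test-set job starts at the position immediately after the last number used by the training-set job, so no number is used for both selections.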
  • the plan generator 3180 may generate respective jobs 3155 for selecting the split result sets. For example, for a given training-and-evaluation iteration, one job may be created for selecting the training set and another job may be generated for selecting the test set.
  • a job object created by the plan generator 3180 may include a reference or pointer to the consistency metadata to be used for that job.
  • at least a portion of the consistency metadata 3152 may be included within a job object. When a job is executed, the metadata 3152 may be used to ensure that the input data set is split consistently.
  • a single job may be created that includes both training and test set selection.
  • a similar approach towards consistency or repeatability may be taken for other types of input filtering operations, such as sampling or shuffling, in at least some embodiments.
  • a client may wish to ensure shuffle repeatability (i.e., that the results of one shuffle request can be re-obtained if a second shuffle request with the same input data and same request parameters is made later) or sample repeatability (i.e., that the same observation records or chunks are retrievable from a data set as a result of repeated sample requests). If the filtering operation involves a use of pseudo-random numbers, saving seed values and/or the other types of consistency metadata shown in FIG.
  • a repeated shuffle may be obtained starting with the same input data set and re-initializing a PRNG with the same seed value as was used for an initial shuffle. Similarly, re-using the same seed may also result in a repeatable sample.
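  • Shuffle and sample repeatability can be sketched with a freshly seeded generator per request. The helper names `shuffled` and `sampled` are hypothetical, and the results are only guaranteed to repeat when the same generator implementation is used for both requests.

```python
import random

def shuffled(records, seed):
    """Return a shuffled copy; the same seed gives the same order."""
    result = list(records)
    random.Random(seed).shuffle(result)
    return result

def sampled(records, seed, count):
    """Return `count` records; the same seed gives the same sample."""
    return random.Random(seed).sample(list(records), count)

records = list(range(100))
first_shuffle = shuffled(records, seed=31)
second_shuffle = shuffled(records, seed=31)
first_sample = sampled(records, seed=31, count=10)
second_sample = sampled(records, seed=31, count=10)
```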
  • consistent splits may be performed at the chunk level, at the observation record level, or at some combination of chunk and record levels, using consistency metadata of the kind described above.
  • the records of the individual chunks in the training set or the test set may be shuffled prior to use for training/evaluating a model.
  • FIG. 32 is a flow diagram illustrating aspects of operations that may be performed at a machine learning service in response to a request for training and evaluation iterations of a machine learning model, according to at least some embodiments.
  • a request to perform one or more TEIs may be received via a programmatic interface such as an MLS I/O library API.
  • a set of consistency metadata may be generated for the iteration(s), e.g., comprising one or more initialization parameter values (such as a value V1) for pseudo-random number sources (PRNSs).
  • the metadata may comprise a seed value to be used to initialize or reset a state of a PRNG, for example, or a pointer to a particular offset within a list of pre-generated pseudo-random numbers.
  • the client may include at least a portion of the metadata in the TEI request.
  • the consistency metadata may include, for example, an identifier of a PRNS, a representation of a state of a PRNS, and/or a pointer into a list of pseudo-random numbers.
  • the files/objects may be logically concatenated to form a unified address space for the input data.
  • the address space of the input data set may be sub-divided into contiguous chunks (element 3207), e.g., with the chunk sizes/boundaries being selected based on client preferences, memory constraints at MLS servers, and/or other factors.
  • One or more chunks of the input data set may be read in from persistent storage to respective memories at one or more MLS servers, e.g., such that at least a portion of chunk C1 is stored in memory at server S1 and at least a portion of chunk C2 is stored in memory at server S2 (element 3210).
  • a first training set Trn1 of the input data may be selected (element 3213), e.g., including at least some observation records of chunk C1.
  • the training set may be selected at the chunk level, the observation record level, or some combination of chunk level and observation record level.
  • Partial chunks may be included in the training set Trn1 in at least some embodiments (that is, some observation records of a given chunk may be included in the training set while others may eventually be included in the corresponding test set).
  • an initialization parameter value V1 may be used to obtain a first set of pseudo-random numbers from a source that provides deterministic sequences of such numbers based on the source's initial state, and the first set of pseudo-random numbers may in turn be used to select the training set Trn1 used to train a targeted machine learning model M1.
  • a test set Tst1 may be determined using the consistency metadata (element 3216) (e.g., using a set of pseudo-random numbers obtained from the same source, or from a source whose state has been synchronized with that of the source used for selecting Trn1).
  • the consistency metadata may indicate a seed Seed1 and a count N1 of pseudo-random numbers that are obtained from a PRNG for generating Trn1.
  • an equivalent PRNG may be initialized with Seed1, and the first N1 pseudo-random numbers generated from the equivalent PRNG may be discarded before using the succeeding pseudo-random numbers (starting from the (N1+1)th number) for selecting Tst1.
  • the algorithm used for selecting Trn1 and Tst1 may be designed in such a way that the same sequence of pseudo-random numbers can be used to select Trn1 and Tst1 while still meeting the consistency criteria described earlier.
  • the same seed value may be used to initialize a PRNG for Tst1, and no pseudo-random numbers may have to be skipped to select Tst1.
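  • One algorithm with the property just described is a per-record threshold test, sketched below under stated assumptions: the i-th pseudo-random number decides the placement of the i-th record, so the training and test selections can each re-initialize a PRNG with the same seed and skip nothing. The function name and parameters are hypothetical.

```python
import random

def select_by_threshold(record_count, seed, want_training, train_fraction=0.8):
    """Return indices of the records assigned to one side of the split."""
    rng = random.Random(seed)
    chosen = []
    for index in range(record_count):
        in_training = rng.random() < train_fraction   # i-th PRN decides record i
        if in_training == want_training:
            chosen.append(index)
    return chosen

# Two independent selections that share only the seed value.
trn1 = select_by_threshold(1000, seed=5, want_training=True)
tst1 = select_by_threshold(1000, seed=5, want_training=False)
```

  • The two result sets are complementary by construction. The trade-off is that the 80-20 ratio is only met approximately, since each record is placed independently.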
  • Model M1 may be tested/evaluated (e.g., the accuracy/quality of the model's predictions may be determined) using test set Tst1.
  • the training and test sets for the next iteration may be identified in place, without copying any of the chunk contents to other locations in the depicted embodiment (element 3222).
  • the consistency metadata that was used to generate Trn1 and Tst1 may be used for selecting the training set and the test set for subsequent TEIs as well.
  • respective sets of consistency metadata may be used for respective TEIs.
  • the observation records within individual chunks of the training set may be shuffled in memory (i.e., an intra-chunk shuffle may be performed without any additional I/O to persistent storage) prior to using the observation records to train the model.
  • intra-chunk shuffles may be performed on test sets in some embodiments before the test sets are used for evaluation.
  • FIG. 33 illustrates an example of a decision tree that may be generated for predictions at a machine learning service, according to at least some embodiments.
  • a training set 3302 comprising a plurality of observation records (ORs) such as OR 3304A, OR 3304B and OR 3304C is to be used for training a model to predict the value of a dependent variable DV.
  • Each OR in the training set 3302 contains values for some number of independent variables (IVs), such as IV1, IV2, IV3, IVn (for example, in OR 3304A, IV1's value is x, IV2's value is y, IV3's value is k, IV4's value is m, and IVn's value is q) as well as a value of the dependent variable DV (whose value is X in the case of OR 3304A).
  • independent variables may also be referred to herein as input variables, and the dependent variable may be referred to as an output variable.
  • not all the ORs 3304 need have values for all of the independent variables in at least some embodiments; for example, some values may not be available from the source from which the observation records are obtained.
  • the dependent variable, which may also be referred to as the "label" or the "target variable" (since it is the variable whose value the model is to predict), takes on one of two values, X or Y.
  • Any given independent variable as well as the dependent variable may take on any number of different values, and may be of any desired data type such as numerical, categorical, Boolean, character, and so on.
  • one or more decision trees 3320 may be constructed, e.g., by a model generator component or model manager component of the machine learning service described above, to make predictions for the value of DV based on the values of at least some of the IVs of an observation record.
  • Each non-leaf node of a decision tree 3320 such as root node 3322, may indicate one or more conditions or predicates to be evaluated on one or more independent variables, and the results of evaluating the predicate may determine the path to be taken next towards a leaf node of the tree at which a prediction for the DV is made for the OR.
  • the root node indicates that the value of independent variable IV2 is to be compared with k. If IV2 is less than k for a given observation record for which a prediction is to be made, the path to intermediate node 3323 should be taken, as indicated by the edge labeled "y" (for "yes" in answer to the evaluation of "IV2 < k"). If IV2 is greater than or equal to k in the observation record being analyzed, the path labeled "n" (for "no") would be taken. Similar decisions would be taken at various non-leaf nodes until a leaf node is reached, at which point a value for DV would be predicted based on the combination of predicates checked along the path.
  • a similar traversal would be performed for all the records of a test data set 3330 by a decision tree based model 3335, resulting in a set of predictions 3340 of DV values.
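  • The traversal described above can be sketched for a small binary tree. The `Node` structure, the thresholds and the tree shape are hypothetical; only the pattern of following the "yes" or "no" edge until a leaf is reached comes from the description.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    """Binary decision tree node; a leaf carries only a prediction."""
    variable: Optional[str] = None      # independent variable tested here
    threshold: Optional[float] = None   # predicate: record[variable] < threshold
    yes: Optional["Node"] = None        # child taken when the predicate holds
    no: Optional["Node"] = None         # child taken otherwise
    prediction: Optional[str] = None    # value of DV predicted at a leaf

def predict(node, record):
    """Walk from the given node to a leaf and return its prediction."""
    while node.prediction is None:
        node = node.yes if record[node.variable] < node.threshold else node.no
    return node.prediction

# Hypothetical tree: the root tests IV2 < 10, an intermediate node tests IV1 < 3.
tree = Node("IV2", 10.0,
            yes=Node("IV1", 3.0, yes=Node(prediction="X"), no=Node(prediction="Y")),
            no=Node(prediction="Y"))
predictions = [predict(tree, r) for r in ({"IV1": 1, "IV2": 5},
                                          {"IV1": 4, "IV2": 5},
                                          {"IV1": 1, "IV2": 10})]
```

  • Variables that no node tests are simply never read, which corresponds to independent variables that are not represented in the tree.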
  • one or more of the independent variables may not necessarily be represented in a decision tree - for example, if independent variable IVn is not significant with respect to predicting DV, none of the nodes included in the tree 3320 may include a condition that refers to IVn.
  • the model generator component of the machine learning service may be responsible for identifying efficient ways of predicting DV values accurately using some subset of the independent variables, and encoding such efficient ways in the form of one or more decision trees. A number of factors which may contribute to prediction quality and efficiency are discussed below.
  • a simple binary classification example is illustrated in FIG. 33 to simplify the presentation.
  • Decision trees may also be used for multi-way classification and/or regression in various embodiments.
  • a given node of a decision tree may have more than two child nodes (i.e., more than two outgoing paths towards the leafs) in some embodiments - that is, more complex multi-result conditions may be evaluated at each node than the simple binary tests shown in FIG. 33.
  • each node may be represented by a corresponding descriptor indicating the predicates/conditions to be checked, the number and identity of its child nodes, etc., so that the tree as a whole may be represented as a collection of node descriptors.
  • the size and shape of a decision tree 3320 that is generated may depend on various factors such as the number of independent variables that are found to be significant for predictions, the order in which the tree-generation algorithm analyzes the observation records of the training set, and so on.
  • Some models (such as Random Forest models and adaptive boosting models) may require or rely on ensembles or collections of many different trees, e.g., respective trees obtained using respective subsets of the training data set.
  • the costs (e.g., in terms of resources used or time required) for making decision-tree based predictions may be broadly categorized into two categories: training costs and execution/prediction costs. Execution/prediction costs may also be called run-time costs herein. Training costs refer to the resources used to construct the trees and train the model using the training data set, while the execution costs refer to the resources used when the models make predictions on new data (or test data) that was not used for the training phase. In at least some embodiments, as described below, tradeoffs may be possible between the training costs and the quality of the predictions made on new data. By expending more resources and/or time during training, better (e.g., more accurate and/or faster) predictions may be made possible for at least some types of problems.
  • decision trees may be constructed in depth-first order, with the descriptors for the nodes being streamed immediately to disk or some other form of persistent storage as they are being created, instead of requiring the tree-construction procedure to be limited to the amount of main memory available at a given server.
  • Such a depth-first and persistent-storage-based tree construction pass may result in a number of benefits relative to breadth-first memory-constrained approaches, such as better prediction accuracies for observation record classes with small populations, better processor cache utilization (e.g., at level 2 or level 1 hardware caches associated with the CPUs or cores being used at MLS servers), and so on.
  • the trees may be pruned intelligently during a second pass of the training phase, e.g., to remove a subset of the nodes based on one or more run-time optimization goals.
  • run-time optimization goals may be used herein to refer to objectives associated with executing a trained model to make predictions, such as reducing the time it takes to generate predictions for a test data set or a production data set, reducing the amount of CPU or other resources consumed for such predictions, and so on.
  • clients of the MLS may also or instead have training time goals pertaining to the resources or time used for training the model.
  • Pruned trees that can fit within memory constraints may then be used to make high-quality predictions on non-training data sets. Details regarding the manner in which the decision trees may be generated and pruned in different embodiments are provided below.
  • FIG. 34 illustrates an example of storing representations of decision tree nodes in a depth-first order at persistent storage devices during a tree-construction pass of a training phase for a machine learning model, according to at least some embodiments.
  • training data 3432 may be read into training set memory buffers 3340 (e.g., at one or more MLS servers) prior to construction of one or more decision trees 3433.
  • the entire training set need not be read into memory - for example, in one implementation, pointers to the observation records may be retained in memory instead of the entire records.
  • the rearrangement of the training set records may be performed in memory (i.e., without I/O to disk or other persistent storage devices) in at least some embodiments. As lower levels of the tree are reached, smaller subsets of the training set may have to be rearranged, thereby potentially improving hardware cache utilization levels in at least some embodiments.
  • Tree 3433 may be constructed in depth-first order in the depicted embodiment. Although the pre-order version of depth-first traversal/construction is illustrated in FIG. 34, in-order or post-order depth-first traversals/construction may be employed in some embodiments.
  • the labels "N<#>" for the nodes indicate the sequence in which they are generated, and the order in which corresponding descriptors 3430 are written from memory to persistent storage device(s) such as various disk-based devices accessible at the MLS servers at which the model generator or model manager runs. Thus, node N1 is created first, and written to persistent storage first, followed by N2, N3, as indicated by arrows 3435.
  • the first leaf node created in the depth-first sequence is N6, followed by N7, N8, N9, N10 and N12.
  • the descriptors 3430 e.g., 3430A - 3430L for nodes N1-N12 respectively
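  • As an illustration only, the pre-order, stream-to-storage construction described above might be sketched as follows. The JSON-lines descriptor layout, the use of Gini gain as the stored utility value, the row format (feature values followed by a label) and all function names are assumptions made for this sketch and are not taken from the described embodiments.

```python
import io
import json

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(rows):
    """Return (gain, feature_index, threshold) of the best binary split, or None."""
    parent = gini([r[-1] for r in rows])
    best = None
    for f in range(len(rows[0]) - 1):
        for t in sorted({r[f] for r in rows}):
            left = [r[-1] for r in rows if r[f] <= t]
            right = [r[-1] for r in rows if r[f] > t]
            if not left or not right:
                continue
            w = len(left) / len(rows)
            gain = parent - (w * gini(left) + (1 - w) * gini(right))
            if best is None or gain > best[0]:
                best = (gain, f, t)
    return best

def build_depth_first(rows, out, counter=None):
    """Write one descriptor per node to `out` in pre-order and return the node id.

    Each descriptor is written as soon as the node is created, so the whole
    tree never has to be held in memory (hypothetical descriptor format).
    """
    counter = counter if counter is not None else [0]
    counter[0] += 1
    node_id = counter[0]
    split = best_split(rows)
    if split is None or split[0] <= 0.0:
        labels = [r[-1] for r in rows]
        majority = max(set(labels), key=labels.count)
        desc = {"id": node_id, "children": 0, "label": majority}
        out.write(json.dumps(desc) + "\n")
        return node_id
    gain, f, t = split
    desc = {"id": node_id, "children": 2, "feature": f, "threshold": t, "pum": gain}
    out.write(json.dumps(desc) + "\n")
    # Depth-first: finish the whole left subtree before starting the right one.
    build_depth_first([r for r in rows if r[f] <= t], out, counter)
    build_depth_first([r for r in rows if r[f] > t], out, counter)
    return node_id
```

  • In this sketch `out` may be any writable text file object, e.g. a file opened on a disk-based device; because descriptors are in pre-order and carry a child count, the tree shape can be recovered by reading the file sequentially.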
  • a respective predictive utility metric (PUM) 3434 may also be generated for some or all of the nodes of tree 3433 in the depicted embodiment and stored in persistent storage - e.g., PUM 3434A may be computed and stored for node N1, PUM 3434B for node N2, and so on.
  • the PUM of a given node may be indicative of the relative contribution or usefulness of that node with respect to the predictions that can be made using all the nodes.
  • Different measures may be used as predictive utility metrics in different embodiments, e.g., based on the type of machine learning problem being solved, the specific algorithm being used for the tree's construction, and so on.
  • a Gini impurity value may be used as the PUM or as part of the PUM, or an entropy-based measure of information gain, or some other measure of information gain may be used.
  • some measure of predictive utility or benefit of a predicate may have to be computed in any case during tree construction for at least some of the nodes to be added to the tree, and the PUM assigned to the node may simply represent such a benefit.
  • PUM values may not be identified for one or more nodes of a tree - that is, having PUM values available for a subset of the nodes may suffice for tree pruning purposes.
  • a histogram or similar distribution indicator of the PUM values with respect to the tree nodes may be created and/or written to persistent storage, e.g., together with the node descriptors and PUM values.
  • a histogram may, for example, take much less memory than an exhaustive list of the tree's nodes and corresponding PUM values.
  • FIG. 35 illustrates an example of predictive utility distribution information that may be generated for the nodes of a decision tree, according to at least some embodiments.
  • PUM values increase from left to right on the X-axis of the PUM histogram 3510, and the number of decision tree nodes that fall within each PUM value bucket is indicated by the height of the corresponding bar of the histogram.
  • bucket 3520A representing relatively low-value nodes may be identified, indicating how many nodes have low PUM values
  • bucket 3520B indicating the number of high-value nodes may be identified, for example.
  • the low value nodes may be deemed better candidates for removal from the tree during pruning than the high value nodes.
  • identifiers of at least some of the nodes belonging to one or more of the buckets of the histogram 3510 may be stored in persistent storage to assist in the pruning phase.
  • the identifiers of nodes within two levels from a leaf node may be stored for one or more low-value buckets in one implementation, and such a list may be used to identify pruning candidate nodes.
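  • A minimal sketch of such distribution information is shown below, assuming fixed-width buckets over a known PUM range and a plain dictionary from node identifier to PUM value; the function name, parameters and bucket scheme are hypothetical.

```python
def pum_histogram(node_pums, num_buckets, low=0.0, high=1.0):
    """Bucket PUM values into fixed-width buckets.

    node_pums maps node id -> PUM value. Returns (counts, ids_by_bucket),
    where bucket 0 holds the lowest-value nodes. Values outside [low, high]
    are clamped into the first or last bucket.
    """
    if num_buckets < 1 or high <= low:
        raise ValueError("need num_buckets >= 1 and high > low")
    width = (high - low) / num_buckets
    counts = [0] * num_buckets
    ids_by_bucket = [[] for _ in range(num_buckets)]
    for node_id, value in node_pums.items():
        idx = int((value - low) / width)
        idx = min(max(idx, 0), num_buckets - 1)
        counts[idx] += 1
        ids_by_bucket[idx].append(node_id)
    return counts, ids_by_bucket
```

  • The identifier lists of the lowest buckets can then serve as a list of pruning candidates, while the counts alone give a compact summary that is much smaller than the full list of nodes and PUM values.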
  • the tree-construction pass of a training phase may be followed by a pruning pass in at least some embodiments, in which the tree representations are reduced in size by eliminating selected nodes in view of one or more run-time optimization goals or criteria.
  • several separate periods of tree-construction interspersed with periods of tree-pruning may be implemented, so that the entire tree need not necessarily be generated before some of its nodes are pruned (which might help reduce the total number of nodes generated).
  • a number of different goals may be taken into consideration in different embodiments for pruning.
  • FIG. 36 illustrates an example of pruning a decision tree based at least in part on a combination of a run-time memory footprint goal and cumulative predictive utility, according to at least some embodiments.
  • run-time memory footprint may be used herein to indicate the amount of main memory required for an execution of the model at a given server or a combination of servers, e.g., after the model's training phase is completed. Tradeoffs between two conflicting run-time goals may be considered in the depicted embodiment: the amount of memory it takes to store the tree during model execution, and the accuracy or quality of the prediction. In at least some implementations, both the memory footprint or usage (for which lower values are better) and the accuracy/quality (for which higher values are better) may increase with the number of retained nodes (i.e., the nodes that are not removed/pruned from the initial decision tree generated using the depth-first stream-to-persistent-storage technique described above).
  • a runtime memory footprint goal may be translated into a "max-nodes" value 3610, indicating the maximum number of nodes that can be retained.
  • the quality or accuracy of the pruned tree may be expressed in terms of the cumulative retained predictive utility 3620, for example, which may be computed by summing the PUM values of the retained nodes, or by some other function that takes the PUM values of retained nodes as inputs.
  • Nodes may be identified for removal using a variety of approaches in different embodiments.
  • In a greedy pruning technique 3650, the unpruned tree 3604 may be analyzed in a top-down fashion, selecting the path that leads to the node with the highest PUM value at each split in the tree.
  • the cumulative PUM values of the nodes encountered during the greedy top-down traversal may be tracked, as well as the total number of nodes encountered. When the total number of nodes encountered equals the max-nodes value, the nodes that have been encountered thus far may be retained and the other nodes may be discarded or removed.
  • a modified or pruned version 3608 of the tree 3604 may be stored (e.g., in persistent storage) separately from the un-pruned version, so that, for example, re-pruning may be attempted using a different pruning approach if necessary. In other embodiments, only the pruned version 3608 may be retained.
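  • One possible reading of the greedy top-down technique is a best-first expansion from the root that always visits the highest-PUM node among those reachable so far and stops once max-nodes nodes have been retained. The sketch below follows that reading; the `children` and `pum` dictionaries and the function name are hypothetical, and other interpretations of the greedy traversal are equally plausible.

```python
import heapq

def greedy_prune(children, pum, root, max_nodes):
    """Retain up to max_nodes nodes, preferring high-PUM nodes, top-down.

    children maps node id -> list of child ids; pum maps node id -> PUM value.
    Returns (retained_ids_in_visit_order, cumulative_retained_pum).
    """
    retained = []
    # Max-heap emulated with negated PUM values; holds nodes whose parent
    # has already been retained, so the retained set stays connected.
    frontier = [(-pum[root], root)]
    while frontier and len(retained) < max_nodes:
        _, node = heapq.heappop(frontier)
        retained.append(node)
        for child in children.get(node, []):
            heapq.heappush(frontier, (-pum[child], child))
    return retained, sum(pum[n] for n in retained)
```

  • A retained node whose children were all discarded would have to act as a leaf in the pruned tree; how its prediction is derived is outside the scope of this sketch.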
  • a bottom-up approach may be used as indicated by arrow 3660, in which leaf nodes are analyzed first, and nodes are removed if their contribution to the quality/accuracy of the model is below a threshold until the max-nodes constraint 3610 is met.
  • the PUM distribution information (such as a histogram similar to that illustrated in FIG.
  • the MLS may have to prioritize the conflicting goals relative to each other.
  • the max-nodes goal shown in FIG. 36 may be considered a higher priority than the goal of accumulating predictive utility.
  • at least some nodes may be selected for pruning using a random selection procedure, e.g., without using a strictly top-down or bottom-up approach while still adhering to the run-time goals and quality objectives.
  • FIG. 37 illustrates an example of pruning a decision tree based at least in part on a prediction time variation goal, according to at least some embodiments.
  • a decision tree such as un-pruned decision tree 3704 may be very unbalanced. That is, some paths between the root node and leaf nodes may be much longer than others.
  • leaf node N8 of tree 3704 may be reached from root node N1 via a decision path 3704A that traverses eight nodes (including N1 and N8), while leaf node N17 may be reached via a decision path 3704B that includes only three nodes.
  • the time taken (and the CPU resources consumed) to make a prediction for a given observation record's dependent variable may be at least approximately proportional to the length of the decision path, as indicated in graph 3786.
  • the variation in the time taken to make predictions for different observation records or test sets may be considered an important indicator of the quality of the model, with less variation typically being preferred to more variation.
  • the maximum variation in prediction time 3710 may be an important run-time optimization goal in such embodiments, and some number of nodes may be removed from the tree 3704 so as to reduce the maximum variation in possible decision paths.
  • nodes N6, N7, N8, N9, N10 and N11 may be removed from tree 3704, so that the maximum decision path length in the modified/pruned tree 3608 is reduced from eight to five.
  • a primary goal of minimizing variation in prediction time may be combined with a secondary goal of maximizing cumulative retained predictive utility. For example, when choices for pruning are to be made that affect the lengths of decision paths equally, the PUM values of the alternative pruning target nodes may be compared and the node with the greater PUM value may be retained.
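  • A simplified sketch of pruning for prediction-time variation is given below. It assumes that prediction time is proportional to decision path length and simply caps the path length, turning any node at the cap into a leaf; the secondary PUM-based tie-breaking described above is not modeled. The `children` dictionary and both function names are hypothetical.

```python
def leaf_path_lengths(children, root):
    """Number of nodes on each root-to-leaf path, keyed by leaf id."""
    lengths = {}
    stack = [(root, 1)]
    while stack:
        node, depth = stack.pop()
        kids = children.get(node, [])
        if not kids:
            lengths[node] = depth
        for kid in kids:
            stack.append((kid, depth + 1))
    return lengths

def cap_path_length(children, root, max_len):
    """Return a pruned copy of `children` in which no path exceeds max_len nodes."""
    pruned = {}
    stack = [(root, 1)]
    while stack:
        node, depth = stack.pop()
        kids = children.get(node, [])
        # A node at the cap keeps no children and therefore becomes a leaf.
        if kids and depth < max_len:
            pruned[node] = list(kids)
            stack.extend((kid, depth + 1) for kid in kids)
    return pruned
```

  • Comparing `max(...) - min(...)` of the path lengths before and after capping gives a rough indicator of how much the spread of possible prediction times has been reduced.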
  • business goals may also be considered when pruning decision trees. For example, consider a scenario in which a group of potential customers of a service is being classified into segments S1, S2, ..., Sn, such that the customers that are classified as belonging to segment S6 are expected to spend substantially higher amounts on the service than customers belonging to other segments. In such a scenario, nodes along the decision paths that lead to classification of S6 customers may be retained during pruning in preference to nodes along decision paths that lead to other segments.
  • a combination of memory footprints/constraints, quality/accuracy goals, absolute execution-time (prediction-time) goals, prediction-time variation goals, business/revenue goals, and/or other goals may be used, with application-specific prioritization of the different goals.
  • a programmatic interface of the MLS may allow clients to indicate one or more run-time optimization goals of the kinds described above, e.g., by ranking the relative importance to a client of the different types of goals for a given model or problem.
  • information regarding best practices for decision tree pruning e.g., which pruning methodologies are most useful
  • FIG. 38 illustrates examples of a plurality of jobs that may be generated for training a model that uses an ensemble of decision trees at a machine learning service, according to at least some embodiments.
  • respective training samples 3805A, 3805B and 3805C may be obtained from a larger training set 3802 (e.g., using any of a variety of sampling methodologies such as random sampling with replacement), and each such sample may be used to create a respective decision tree using the depth-first approach described above.
  • training sample 3805A may be used to generate and store an un-pruned decision tree (UDT) 3810A in depth-first order at persistent storage during tree-creation pass 3812 of training phase 3820
  • training sample 3805B may be used for UDT 3810B
  • UDT 3810C may be generated using training sample 3805C.
  • Respective jobs J1, J2 and J3 may be inserted into an MLS job queue or collection for the construction of UDTs 3810A, 3810B and 3810C in some embodiments.
  • the jobs of the tree-creation pass may be performed in parallel in at least some embodiments, e.g., using respective servers of an MLS server pool, or using multiple threads of execution (or processes) at the same MLS server.
  • Each UDT may be pruned in accordance with applicable run-time optimization goals to produce a corresponding pruned decision tree (PDT) 3818 in the pruning pass 3814 of the training phase in the depicted embodiment.
  • Jobs J4, J5 and J6 may be implemented for pruning UDTs 3810A - 3810C respectively, producing PDTs 3818A - 3818C.
  • jobs J7, J8 and J9 respectively may be scheduled to execute the model using the three PDTs 3818A - 3818C using some specified test set (or production data set) in the depicted embodiment, resulting in prediction results 3850A - 3850C.
  • the results 3850 obtained from the different PDTs may be combined in any desired fashion (e.g., by identifying an average or median value for the predictions for each test set observation record) to produce aggregated prediction results 3860 during a prediction or test phase of the machine learning algorithm being used.
  • a prediction phase may differ from a test phase, for example, in that the values of the dependent variables may not be known for the data set in the prediction phase, while values for the dependent variables may be known for the data set used for testing the model.
  • an additional job J10 may be scheduled for the aggregation of the results.
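  • The aggregation step can be illustrated with a short sketch, assuming numeric predictions and one equally long list of predictions per pruned tree; the function name and the `how` parameter are hypothetical.

```python
from statistics import mean, median

def aggregate_predictions(per_tree_predictions, how="mean"):
    """Combine per-tree predictions record by record.

    per_tree_predictions is a list with one prediction list per tree;
    position i in every list refers to the same observation record.
    """
    if how not in ("mean", "median"):
        raise ValueError("how must be 'mean' or 'median'")
    combine = mean if how == "mean" else median
    return [combine(values) for values in zip(*per_tree_predictions)]
```

  • For classification problems a majority vote would be a natural alternative to the average or median used here.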
  • any of the jobs J1 - J10 may be performed in parallel with other jobs, as long as the applicable job dependencies are met - e.g., job J4 may have to be initiated after J1 completes, and J7 may be initiated after J4 completes. Note, however, that J7 may be begun even before J2 completes, as J7 does not depend on J2 - thus, in at least some embodiments, the prediction/test phase 3830 may overlap with the training phase if sufficient resources are available. For some tree ensemble-based algorithms such as Random Forest, hundreds of UDTs and PDTs may be generated for a given training set, and the use of parallelism may reduce both the training time and the execution time substantially relative to sequential approaches.
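  • The dependency structure of jobs J1 - J10 can be sketched with Python's standard `graphlib` module (Python 3.9 or later). The dependency table below is inferred from the description (each pruning job follows its construction job, each execution job follows its pruning job, and J10 follows all three execution jobs) and is an assumption of this sketch, as is the wave-based scheduling helper.

```python
from graphlib import TopologicalSorter

# Maps each job to the set of jobs that must complete before it can start.
DEPS = {
    "J1": set(), "J2": set(), "J3": set(),      # tree construction
    "J4": {"J1"}, "J5": {"J2"}, "J6": {"J3"},   # pruning
    "J7": {"J4"}, "J8": {"J5"}, "J9": {"J6"},   # model execution
    "J10": {"J7", "J8", "J9"},                  # aggregation
}

def schedule_in_waves(deps):
    """Group jobs into waves; all jobs within a wave may run in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)  # simplification: the whole wave finishes together
    return waves
```

  • Waves are a simplification; a scheduler that calls `done` as each individual job finishes would make J7 ready as soon as J4 completes, even while J2 is still running, which corresponds to the overlap of training and prediction phases described above.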
  • different run-time optimization goals may be applied to pruning different UDTs, while in other embodiments, the same set of run-time optimization goals may be applied to all the trees of an ensemble. Jobs for any of the different tasks illustrated (e.g., tree generation, tree pruning or model execution) that have met their dependencies may be executed in parallel at the thread level (e.g., different threads of execution may be used for the jobs on the same server), the process level (e.g., respective processes may be launched for multiple jobs to be run concurrently on the same server or different servers), or the server level (e.g., each job of a set of concurrently-schedulable jobs may be executed at a different thread/process at a respective MLS server) in various embodiments.
  • Combinations of thread-level, process-level and server-level parallelism may be used in some embodiments - e.g., of four jobs to be run in parallel, two may be run at respective threads/processes at one MLS server, while two may be run at another MLS server.
  • FIG. 39 is a flow diagram illustrating aspects of operations that may be performed at a machine learning service to generate and prune decision trees stored to persistent storage in depth-first order, according to at least some embodiments.
  • a set of run-time optimization goals may be identified for a prediction-tree based model M1 to be trained using a training data set TDS and executed at a machine learning service.
  • a variety of goals may be determined and/or prioritized in different embodiments, including for example memory usage or footprint goals, utilization goals for other resources such as CPUs, prediction-time goals (e.g., the elapsed time for a prediction run of the model), prediction-time variation goals (e.g., reducing the differences between model prediction times for different observation records), prediction accuracy/quality goals, budget goals (e.g., the total amount that a client wishes to spend on model execution, which may be proportional to the CPU utilization of the model execution or to utilization levels of other resources), revenue/profit goals of the kind described above, and so on.
  • the training data set and/or indications of some or all of the optimization goals may be provided by an MLS client programmatically, e.g., via one or more MLS APIs.
  • an API to create a decision tree based model may be invoked by a client, with respective request parameters indicating the data set and one or more run-time goals.
  • At least some of the goals may be qualitative instead of being expressed in exact quantities in some embodiments - e.g., it may not always be possible to indicate a precise target value for cumulative predictive utility, but a goal of maximizing cumulative predictive utility to the extent possible may still be used to guide pruning in some scenarios.
  • a tree-construction pass of M1's training phase may be initiated using some selected subset of all of the training data set.
  • the training data (or at least pointers to the observation records of the training data) may be loaded into memory prior to the construction of the tree, and rearranged in memory based on the predicates evaluated at the nodes of the tree as the nodes are generated.
  • the nodes of a decision tree may be generated in depth-first order in the depicted embodiment (element 3904), and node information such as the predicates being tested and the child node count or pointers to the child nodes may be streamed to persistent storage (e.g., rotating-disk based storage) in depth-first order.
  • a predictive utility metric (PUM) value may be stored for at least some of the nodes, indicative of the contribution or utility of the nodes towards the predictions made by the model.
  • Any of several types of statistical measures may be used as PUM values in different implementations, such as Gini impurity values, entropy measures, information gain measures, and so on.
  • the PUM values may be used, for example in a subsequent tree-pruning pass of the training phase, to determine an order in which nodes can be pruned or removed from the tree without affecting the quality of the model predictions significantly.
  • a histogram or a similar representation of the distribution of PUM among the tree's nodes may be generated during the tree construction pass.
  • the distribution information may be collected in a separate traversal of the tree.
  • the terms "tree construction" and "tree creation" may be used as synonyms herein.
  • the constructed tree may be analyzed, e.g., in either a top-down greedy approach or a bottom-up approach, to identify some number of nodes that should be removed in view of the run-time optimization goals and/or the nodes' PUM values in the depicted embodiment (element 3907).
  • the tree-pruning phase need not be performed, e.g., if the un-pruned tree already meets desired optimization goals.
  • the modified or pruned version of the decision tree may be stored (element 3910), e.g., in a separate location than the un-pruned tree, for use later during a test phase and/or production-level prediction runs of the model.
  • multiple trees may have to be constructed in some cases. If more trees are required (as determined in element 3913), a different sample of the training data set may be generated and the construction and pruning operations of elements 3904 onwards may be repeated. Although parallelism is not explicitly illustrated in FIG. 39, in some embodiments, as mentioned earlier, multiple trees may be constructed and/or pruned in parallel. In the depicted embodiment, after all the trees have been constructed and pruned, the model may be executed using the pruned tree(s) to obtain one or more sets of predictions (element 3916). Prediction runs corresponding to multiple pruned trees may be performed in parallel in some implementations.
  • Metrics that can be used to determine whether the optimization goals were achieved during the prediction run(s) may be obtained in some embodiments. If all the goals were met to an adequate extent, as detected in element 3919, the training and execution phases of the model may be considered complete (element 3928). If some goals (such as a desired level of accuracy) were not met, and if additional resources such as more memory are available (as detected in element 3922), in some embodiments the training and/or execution phases may be retried using additional resources (element 3925). Such retries may be repeated in some embodiments until the goals are met or no additional resources are available.
  • tree generation and tree pruning may be performed iteratively, e.g., with several different periods of tree generation and several different periods of tree pruning interspersed with each other during the training phase of the model. In such a scenario, some number of nodes may be generated and stored in depth-first order in a first tree-generation period.
  • tree generation may be paused, the created nodes may be examined for pruning (e.g., based on their PUM values and on the optimization goals) in a first tree-pruning period, and some nodes may be removed based on the analysis. More nodes may be generated for the resulting tree in the next tree-generation period, followed by removal of zero or more nodes during the next tree-pruning period, and so on.
  • Such iterative generation and pruning may help eliminate nodes with low utility from the tree earlier than in an approach in which the entire tree is generated before any nodes are pruned.
  • a number of different components of the machine learning service may collectively perform the operations associated with decision tree optimizations.
  • a client request for the training or creation of a tree-based model (e.g., either a model based on a single tree, or a model using an ensemble of trees), submitted via one or more APIs may be received at a request/response handler, which may determine the nature of the request and pass on the client request (or an internal representation of the client request) to a model generator or model manager.
  • each pass of the training phase may be performed by a respective MLS component - e.g., one or more tree generator components may create the trees in depth-first order and stream the node descriptors to persistent storage at one or more MLS servers, while one or more tree reducers may be responsible for pruning trees.
  • one or more training servers of the MLS may be used for training tree-based models, while one or more prediction servers may be used for the actual predictions.
  • a job manager may be responsible for maintaining a collection or queue of outstanding jobs and for scheduling jobs as resources become available and job dependencies are met.
  • Responses may be provided to the client by the front-end request/response handler in some embodiments.
  • some or all of these components may comprise specialized, tuned, or task-optimized hardware and/or software.
  • a machine learning service implemented at a provider network may support a wide variety of feature processing transformations (which may be referred to as FPTs), such as quantile binning, generation of a Cartesian product of values of one or more variables, n-gram generation, and so on.
  • Each FPT (or group of related FPTs) may have its own set of costs for various phases of a model's lifecycle, which may be expressible in any of a variety of units such as elapsed times, resource consumption, and so on.
  • the additional or marginal costs (e.g., memory, CPU, network or storage costs) may all have to be considered in some embodiments when determining whether the FPT is worthwhile.
  • the MLS may be configured to provide recommendations to clients regarding possible sets of feature processing transformations, e.g., based on automated cost-benefit analyses in view of goals indicated by the clients.
  • FIG. 40 illustrates an example of a machine learning service configured to generate feature processing proposals for clients based on an analysis of costs and benefits of candidate feature processing transformations, according to at least some embodiments.
  • a feature processing (FP) manager 4080 of the machine learning service may comprise a candidate generator 4082 and an optimizer 4084.
  • the FP manager 4080 may receive an indication of a training data set 4004 comprising values for a set of raw or unprocessed independent variables 4006 and one or more target variables 4007 whose values are to be predicted by a model.
  • the model may be trainable using variables derived from the training data set using one or more FPTs.
  • the FP manager 4080 may also determine one or more prediction quality metrics 4012, and one or more run-time goals 4016 for the predictions.
  • quality metrics 4012 may be determined in different embodiments and for different types of models, such as ROC (receiver operating characteristics) AUC (area under curve) measures for binary classification problems, mean square error metrics for regression problems, and so on.
  • a client may indicate one or more constraints 4014 (such as one or more required or mandatory FPTs, and/or one or more prohibited FPTs) for training the model, and the FP manager may attempt to meet the specified constraints.
  • the goals 4016 may include elapsed time goals for producing predictions on a data set of a specified size, goals for an amount of memory not to be exceeded when making such predictions, budget goals regarding the maximum billing costs per prediction, and so on.
  • the FP manager may also be provided with a set of training phase goals, such as the maximum amount of time to be consumed to train the model, a budget not to be exceeded for training the model, or a time or budget limit for the MLS to provide a feature processing proposal to the client.
  • the candidate generator 4082 may be responsible for identifying an initial candidate FPT set 4052.
  • the initial candidate FPT set may be represented at least internally within the MLS as an acyclic graph of possible transformations in some implementations, such as the illustrated graph comprising FPT1 - FPT10.
  • the acyclic graph representation may indicate, for example, a recommended sequence in which the different FPTs should be performed, and/or dependencies between different FPTs.
  • the depicted representation of FPT set 4052 may indicate that FPT9 depends on a result of FPT7, FPT7 depends on a result of FPT3, and so on.
  • the candidate generator 4082 may include a large number (e.g., dozens or hundreds) of candidate FPTs.
  • the initial set 4052 of candidate FPTs may comprise a relatively small subset of the feasible candidate transformations.
  • the initial set 4052 may include any FPTs that are specified (e.g., in constraints 4014) as being mandatory, and exclude any FPTs that were prohibited.
  • the optimizer 4084 may be responsible for generating one or more FP proposals such as 4062A and 4062B.
  • the FP proposals may typically be versions of the candidate set 4052 from which some number of candidate FPTs have been removed or pruned, e.g., based on a cost-benefit analysis performed by the optimizer. If a client had indicated mandatory feature processing transformations via constraints 4014, such transformations may be retained in the FP proposals.
  • the cost benefit analysis may comprise the scheduling of a plurality of jobs as described below in various embodiments, e.g., jobs that involve training and evaluating a model with results of the initial set of candidate FPTs, re-evaluating the model with modified evaluation sets to estimate the impact of various FPTs on prediction quality, and/or re-training the model with modified sets of processed variables to estimate the impact of various FPTs on prediction run-time metrics.
  • proposal 4062A is obtained from initial FPT candidate set 4052 by removing FPT5, FPT8, FPT9 and FPT10
  • proposal 4062B results from the elimination of FPT4, FPT7, FPT8, FPT9 and FPT10 from FPT candidate set 4052.
  • a variety of techniques may be used in different embodiments for selecting the FPTs that are eliminated in different proposals, such as random removals, greedy algorithms, and so on, as described below in further detail.
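  • The interplay between FPT dependencies, mandatory FPTs and removals can be sketched as follows. The sketch assumes that removing an FPT also removes every FPT that consumes its result, and that a removal which would eliminate a mandatory FPT is rejected; these rules, the function name and the dictionary layout are assumptions. Only the edges FPT9 -> FPT7 and FPT7 -> FPT3 are stated in the description; the remaining FPTs are treated as independent here for illustration.

```python
def prune_candidates(deps, remove, mandatory=frozenset()):
    """Return the FPTs that remain after removing `remove` and its dependents.

    deps maps each FPT name to the set of FPTs whose results it consumes.
    Raises ValueError if a mandatory FPT would have to be removed.
    """
    removed = set(remove)
    changed = True
    while changed:  # propagate removals to transitive dependents
        changed = False
        for fpt, needs in deps.items():
            if fpt not in removed and needs & removed:
                removed.add(fpt)
                changed = True
    blocked = removed & set(mandatory)
    if blocked:
        raise ValueError(f"cannot remove mandatory FPTs: {sorted(blocked)}")
    return set(deps) - removed

# Candidate set with only the two dependencies named in the description.
CANDIDATES = {f"FPT{i}": set() for i in range(1, 11)}
CANDIDATES["FPT7"] = {"FPT3"}
CANDIDATES["FPT9"] = {"FPT7"}
```

  • An optimizer could call such a helper repeatedly, e.g. with random or greedily chosen removal sets, and evaluate the cost and prediction quality of each resulting proposal.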
  • One of the advantages of pruning (e.g., removing) FPTs from the candidate set is that clients may not have to go to the trouble of including some independent variables in their training and testing data sets. For example, if FPT5 is the only transformation in the candidate set 4052 that applies to a given independent variable 4006, and the FP manager determines that FPT5 is not required to meet the objectives of the client, the client need not collect values of the independent variable 4006 for future training and/or test/evaluation data. Since collecting, storing and providing training data to the MLS may have a significant impact on the client's overall costs of obtaining solutions to machine learning problems, such training-data-reduction optimizations may be especially valuable.
  • one or more FP proposals 4062 may be provided programmatically to a client of the MLS, e.g., in the form of a catalog or menu from which the client may approve a specific proposal or multiple proposals.
  • an iterative process may be used to arrive at a final approved FP plan, e.g., with a given iteration comprising the MLS providing a proposal to the client, followed by a proposal change request from the client.
  • the FP manager may transmit a requirements reconsideration request to the client, in effect requesting the client to prioritize/modify at least some of the goals or quality metrics, or relax some of the constraints.
  • the client may respond to the reconsideration request by indicating relative priorities for some or all of the goals and metrics.
  • the MLS may implement the proposal on behalf of the client, e.g., using the results of approved FPTs as input to train a model and then obtaining predictions/evaluations on specified non-training data.
  • Such optimization based on feature processing cost-benefit tradeoffs may be used for a variety of model types, including for example classification models, regression models, clustering models, natural language processing models and the like, and for a variety of problem domains in different embodiments.
  • a client may indicate that a recipe written using a recipe language of the kind described earlier is to be used for generating processed variables for training their model.
  • the MLS may analyze the FPTs indicated in the recipe, and may ascertain whether some (or all) of the FPTs in the recipe should be replaced or eliminated when generating the FP proposal to be provided to the client. That is, an FP manager may be configured to suggest or recommend modifications to a client-specified FP recipe in such embodiments if better alternatives appear to be available.
  • one or more programmatic interfaces may be made available to clients to enable them to submit requests for FP optimizations, e.g., indicating their training data, target variables, run-time goals, prediction quality metrics, and so on.
  • the MLS may utilize various internal APIs to provide the requested recommendations, e.g., respective jobs may be scheduled using lower-level APIs to read the training data using the chunked approach described above, to perform feature processing, training, evaluation, re-training and/or re-evaluation.
  • programmatic interfaces e.g., web-based dashboards
  • FIG. 41 illustrates an example of selecting a feature processing set from several alternatives based on measured prediction speed and prediction quality, according to at least some embodiments.
  • the prediction speed (for a given data set size for which predictions are expected to be made after training) increases from left to right along the X-axis.
  • Each point 4110 (e.g., any of the twelve points 4110A-4110N) represents a prediction run of a model with a corresponding set of FPTs being used for training the model.
  • the client on whose behalf the model is being trained and executed has indicated a target prediction speed goal PSG and a target prediction quality goal PQG.
  • FPT set 4110G is selected as the best alternative, as it meets both of the client's criteria.
  • not all the client's objectives may be simultaneously achievable.
  • a client may desire prediction times to be less than X seconds, and also desire prediction quality to exceed some measure Q1, such that the MLS is not necessarily able to meet both goals.
  • the client may be requested to prioritize the goals, so that the MLS can try to optimize for one goal in preference to others.
  • at least some clients may not have to specify quality goals (or may not specify quality goals even if such goals can be specified), and may rely instead on the MLS to select appropriate prediction quality criteria that should be targeted for optimization.
  • the MLS may even select and/or prioritize the run-time goals that should be targeted on behalf of a given client.
  • Clients that are more knowledgeable with respect to machine learning may be allowed to provide as much detailed guidance regarding FP tradeoff management as they wish to in some embodiments, e.g., using values for optional API parameters when interacting with the MLS.
  • the MLS may be able to handle a variety of client expertise levels with respect to managing tradeoffs between feature processing costs and benefits.
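The selection illustrated in FIG. 41 can be sketched as a filter over measured prediction runs, with a fallback for the case in which both goals cannot be met at once. This is a minimal sketch under stated assumptions: the `PredictionRun` type, the `select_fpt_set` function, the tie-breaking rule, and all numeric values are hypothetical and are not taken from the source.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PredictionRun:
    fpt_set: str    # label of the FPT set used to train the model
    speed: float    # measured prediction speed (higher is faster)
    quality: float  # measured prediction quality (higher is better)


def select_fpt_set(runs: List[PredictionRun], psg: float, pqg: float,
                   prefer: str = "quality") -> Optional[PredictionRun]:
    """Return a run that meets both goals, else the best run for the preferred goal."""
    feasible = [r for r in runs if r.speed >= psg and r.quality >= pqg]
    if feasible:
        # Assumed tie-break: highest quality first, then highest speed.
        return max(feasible, key=lambda r: (r.quality, r.speed))
    if not runs:
        return None
    if prefer == "quality":
        return max(runs, key=lambda r: (r.quality, r.speed))
    return max(runs, key=lambda r: (r.speed, r.quality))


# Hypothetical measurements for three candidate FPT sets.
runs = [
    PredictionRun("A", speed=10.0, quality=0.95),
    PredictionRun("G", speed=40.0, quality=0.90),
    PredictionRun("N", speed=80.0, quality=0.70),
]
best = select_fpt_set(runs, psg=30.0, pqg=0.85)
print(best.fpt_set)
```

The `prefer` argument stands in for the goal prioritization that the client (or the MLS on the client's behalf) may supply when not all objectives are simultaneously achievable.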
  • FIG. 42 illustrates example interactions between a client and a feature processing manager of a machine learning service, according to at least some embodiments.
  • a client 164 of the machine learning service implemented in system 4200 may submit a model creation request 4210 via a programmatic interface 4262.
  • the model creation request 4210 may indicate, for example, some combination of the following elements: one or more training sets 4220 (which include an indication of the target variables to be predicted), one or more test or evaluation sets 4222, one or more model quality metrics 4224 of interest to the client, goals 4225 (such as prediction run-time goals and/or training goals), and in some cases, one or more optional feature processing recipes 4226 formatted in accordance with the MLS's recipe language specification.
  • a client may also optionally indicate one or more constraints 4227, such as a mandatory feature processing transformation that has to be performed on behalf of the client or a prohibited transformation that must not be performed. Not all the elements shown in FIG. 42 may be included in the model creation request 4210 in some embodiments; for example, if no specific model quality metrics are indicated, the FP manager may select certain metrics for optimization based on the nature of the machine learning problem being solved.
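The elements of a model creation request listed above could be represented roughly as follows. This is only an illustrative sketch: the class name, field names, the validation rules, and the sample values are assumptions, and the source does not define a concrete request schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ModelCreationRequest:
    training_sets: List[str]
    target_variables: List[str]
    evaluation_sets: List[str] = field(default_factory=list)
    # May be left empty; the FP manager may then select metrics itself.
    quality_metrics: List[str] = field(default_factory=list)
    goals: Dict[str, float] = field(default_factory=dict)
    recipes: List[str] = field(default_factory=list)
    mandatory_fpts: List[str] = field(default_factory=list)
    prohibited_fpts: List[str] = field(default_factory=list)

    def validate(self) -> None:
        # Assumed rule: a request needs at least one training set.
        if not self.training_sets:
            raise ValueError("at least one training set is required")
        # Assumed rule: a transformation cannot be both mandatory and prohibited.
        conflict = set(self.mandatory_fpts) & set(self.prohibited_fpts)
        if conflict:
            raise ValueError(f"FPTs both mandatory and prohibited: {sorted(conflict)}")


request = ModelCreationRequest(
    training_sets=["training-set-1"],
    target_variables=["DV"],
    goals={"max_prediction_seconds": 2.0},
    mandatory_fpts=["FPT1"],
    prohibited_fpts=["FPT9"],
)
request.validate()
```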
  • the model creation request 4210 may be received by a front-end request/response handler 4280 of the MLS, and an internal representation of the request may be handed off to the
  • Model creation requests may also be referred to as model training requests herein.
  • the FP manager 4080 may generate a candidate set of feature processing transformations, and then prune that candidate set to identify proposals based on the quality metrics, goals and/or constraints identified for the model.
  • a number of different jobs may be generated and scheduled during this process, including, for example one or more feature processing jobs 4255, one or more model evaluation jobs 4258, and/or one or more training or re-training jobs 4261.
  • the FP manager may take the recipe as a starting point for its exploration of feature processing options, without necessarily restricting the set of transformations considered to those indicated in the recipe.
  • the FP manager may consult the MLS's knowledge base of best practices to identify candidate transformations in some embodiments, e.g., based on the problem domain being addressed by the model to be created or trained.
  • some subset of the transformations may be removed or pruned from the set in each of several optimization iterations, and different variants of the model may be trained and/or evaluated using the pruned FPT sets.
  • the model variants 4268 may be stored within the MLS artifact repository in at least some embodiments.
  • the client request includes training time goals or deadlines by which the MLS is required to provide FP proposals
  • goals/deadlines may influence the specific pruning techniques that are used by the FP manager 4080 - for example, a greedy pruning technique such as that illustrated below may be used with strict training time deadlines.
  • the MLS may set its own training time goals in scenarios in which clients do not specify such goals, e.g., so as to keep training-time resource consumption within reasonable bounds.
  • the client may be billed a fixed fee for the generation of FP proposals, in which case the experimentation/testing of different FPT options by the FP manager may be constrained by the resource usage limits corresponding to the fixed fee.
  • the FP manager 4080 may eventually terminate its analysis of alternative transformation sets and provide one or more FP proposals 4272 to the client 164 in the depicted embodiment (e.g., via an API response generated by the request/response handler 4280).
  • the FP proposal may indicate one or more changes to the client's recipe(s) that are recommended based on the analysis performed by the MLS, or entirely different recipes may be indicated.
  • the FP proposal(s) may be formatted in accordance with the MLS's recipe language, while in other embodiments a different representation of the proposed feature processing transformations may be provided.
  • the client 164 may either approve one or more of the proposals, or may request changes to the proposal(s), e.g., via FP change requests 4278.
  • an iterative negotiation may occur between the MLS and the client, in which the client submits suggestions for changes and the MLS performs additional evaluations or re-training operations to try out the changes.
  • the number of such iterations that are performed before the negotiation ends may also be based at least partly on billing in some embodiments - e.g., the client may be charged a fee based on the amount of time or resources consumed for each iteration of re-testing.
  • the client may approve a particular FP proposal and submit a model execution request 4254, e.g., via an MLS API.
  • a production-level model execution manager 4232 may then implement production run(s) 4258 of the model corresponding to the approved FP proposal.
  • the client may request additional changes based on the results achieved in the production runs, e.g., by submitting additional change requests 4278 and/or requesting re-training or re-creation of the model based on new training data.
  • FIG. 43 illustrates an example of pruning candidate feature processing transformations using random selection, according to at least some embodiments.
  • one or more FPTs of the initial candidate FPT set 4302 may be selected for removal at random, and the impact of such a removal on the model's quality metrics and the goals may be estimated.
  • FP mutation 4320A may result from the removal of FPT11 from candidate FPT set 4302, for example, while FP mutation 4320B may result from the removal of FPT6, FPT7 and FPT13.
  • a selection of one particular node of an FPT set as a pruning victim may result in the removal of one or more other nodes as well. For example, if FPT13 and FPT7 depend on (e.g., use the output of) FPT6, the selection of FPT6 as a victim may also result in the pruning of FPT7 and FPT13.
  • the estimates of the costs and benefits of removing the victim FPTs may be determined, e.g., by re-evaluating the model using dummy or statistically selected replacement values for the features produced by the victims to determine the impact on the prediction quality metrics, and/or by re-training the model with a smaller set of features to determine the impact on run-time performance metrics.
  • the FP manager may store the pruning results for each FP mutation 4320 in the depicted embodiment, e.g., as artifacts in the MLS artifact repository.
  • Pruning results 4390 may include an estimate of prediction quality contribution 4333 of the removed FPTs (FPT6, FPT7 and FPT13), as well as an estimate of the contribution 4334 of the removed FPTs to prediction run-time costs. Such estimates for different mutations may be used to generate the proposals to be provided to the client by the FP manager.
  • the randomized pruning approach may be especially useful if the different candidate FPTs are not expected to differ significantly in their cost and quality contributions, or if the FP manager cannot predict (e.g., based on best practices) whether different candidates are likely to have significantly different cost or quality contributions.
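The randomized pruning of FIG. 43, including the cascading removal of dependent FPTs, can be sketched as follows. This is a minimal sketch: the function names and the dictionary layout (each FPT mapped to the FPTs whose output it uses) are assumptions, and the estimation of quality and cost impact for each mutation is left out.

```python
import random


def dependents_closure(victim, depends_on):
    """Return the victim plus every FPT that directly or transitively uses its output."""
    removed = {victim}
    changed = True
    while changed:
        changed = False
        for fpt, prereqs in depends_on.items():
            if fpt not in removed and removed & set(prereqs):
                removed.add(fpt)
                changed = True
    return removed


def random_mutation(candidates, depends_on, rng):
    """Pick one victim at random and return (victim, remaining FPT set)."""
    victim = rng.choice(sorted(candidates))
    removed = dependents_closure(victim, depends_on)
    return victim, set(candidates) - removed


# FPT7 and FPT13 use the output of FPT6, as in the example above.
depends_on = {"FPT7": ["FPT6"], "FPT13": ["FPT6"]}
candidates = {"FPT1", "FPT6", "FPT7", "FPT11", "FPT13"}

print(sorted(dependents_closure("FPT6", depends_on)))
victim, remaining = random_mutation(candidates, depends_on, random.Random(0))
print(victim, sorted(remaining))
```

Passing an explicit `random.Random` instance keeps each mutation reproducible, which may be useful if pruning results are stored as artifacts and compared later.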
  • the FP manager's optimizer may identify specific FPTs that are expected to provide a significant positive contribution to model quality.
  • the FP manager may then develop proposals based on the positions of such highly beneficial FPTs in the candidate FPT graph, e.g., proposals that include the beneficial FPTs and their neighbors.
  • FIG. 44 illustrates an example of such a greedy technique for identifying recommended sets of candidate feature processing transformations, according to at least some embodiments.
  • the FP manager has identified node 4410 (corresponding to FPT14) as the particular node with the highest contribution to model quality (or at least the highest contribution among the nodes whose quality contributions have been evaluated).
  • Node 4410 has accordingly been selected as the starting node for constructing a graph of FPTs to be included in a proposal of recommended FPTs to be provided to a client.
  • its prerequisite nodes may also be included in the proposal. For example, in order to perform the transformation indicated by FPT14, results of FPT10, FPT3, FPT2 and FPT1 may be required in the depicted example.
  • the contributions and costs of other neighboring nodes of the already-selected nodes may then be determined using re-evaluations and re-training iterations, until the desired quality and/or cost goals are met.
  • the resulting FPT graph (with other candidate FPTs removed) may be included in the FP proposal 4432 transmitted to the client.
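A simplified sketch of the greedy construction follows. It makes several assumptions that the source does not state: quality contributions are treated as additive numbers that are already known, candidates are visited in descending order of contribution rather than by graph adjacency, and FPT14's prerequisites form a simple chain. In the described technique the contributions would instead be measured through re-evaluation and re-training iterations.

```python
def prerequisites(node, depends_on):
    """Return the node plus every FPT it transitively depends on."""
    needed, stack = set(), [node]
    while stack:
        current = stack.pop()
        if current not in needed:
            needed.add(current)
            stack.extend(depends_on.get(current, []))
    return needed


def greedy_proposal(quality, depends_on, quality_goal):
    """Add the highest-contribution FPTs, with prerequisites, until the goal is met."""
    selected, total = set(), 0.0
    for node in sorted(quality, key=quality.get, reverse=True):
        if total >= quality_goal:
            break
        for fpt in prerequisites(node, depends_on) - selected:
            selected.add(fpt)
            total += quality.get(fpt, 0.0)
    return selected, total


# Hypothetical contribution estimates and an assumed dependency chain.
quality = {"FPT14": 0.30, "FPT8": 0.10, "FPT10": 0.05, "FPT3": 0.05,
           "FPT2": 0.02, "FPT1": 0.02, "FPT9": 0.01}
depends_on = {"FPT14": ["FPT10"], "FPT10": ["FPT3"],
              "FPT3": ["FPT2"], "FPT2": ["FPT1"]}

selected, total = greedy_proposal(quality, depends_on, quality_goal=0.5)
print(sorted(selected), round(total, 2))
```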
  • a model may first be generated/trained using the entire set of candidate FPTs identified initially. Statistics on the values of certain candidate processed variables (PVs) may be obtained and later used for determining the specific contributions of the PVs and their corresponding FPTs to model prediction quality.
  • FIG. 45 illustrates an example of a first phase of a feature processing optimization technique, in which a model is trained using a first set of candidate processed variables and evaluated, according to at least some embodiments.
  • an original set of processed variables (PVs) 4560 may be obtained from an un-processed training set 4502 in the depicted embodiment.
  • the un-processed training set 4502 may include some number of independent variables IV1, IV2, ..., and a dependent or target variable DV.
  • the PV training set 4560 may include some number of PVs such as PV1 (obtained from feature processing transformation FPT1), PV2 (obtained via FPT2) and PV3 (obtained via FPT3). It is noted that while in general, a training set may include one or more un-processed variables as well as some number of processed variables, to simplify the presentation only three processed variables are shown in the example training set 4560.
  • Respective sets of statistics may be generated in the depicted embodiment for some or all of the PVs, such as PV1 stats, PV2 stats, and PV3 stats.
  • categorical variables of the unprocessed training data may be converted or mapped to numerical or Boolean values, and in some cases numerical values may be normalized (e.g., mapped to real numbers in the range -1 to 1).
  • a model 4510 may be trained using the original PV training set 4560 at some training cost TC.
  • TC may be expressed in a variety of units, such as CPU-seconds on a machine with memory size M1, or the corresponding billing amounts.
  • the model may be evaluated using a PV set 4562 derived from an un-processed evaluation set (or several such sets) 4504 in the depicted embodiment.
  • the evaluation set values for PV1, PV2 and PV3 may be obtained by applying the same types of transformations to the un-processed evaluation set(s) 4504.
  • the cost (EC) of evaluating the trained model may at least in some cases be smaller than TC, the cost of training the model with results of all the candidate FPTs (e.g., because identifying various coefficients to be used for predictions may be more compute-intensive than simply applying the coefficients during test/evaluation runs).
  • the original evaluation results 4536, obtained without pruning any of the candidate FPTs, may be saved in a persistent repository (e.g., to be used later as described below to determine the respective quality contributions of different FPTs).
  • the original prediction run-time metrics 4537 (e.g., elapsed time, CPU-seconds used, memory used, etc.)
  • the prediction quality of the model may be higher when more FPTs are used for training. Differences or deltas to the model's prediction quality metrics, corresponding to different pruning selections, may then be obtained in later phases of the feature processing technique as described below.
  • In modified evaluation set 4662A, the original PV1 values are replaced by PV1's mean value (from the PV1 statistics obtained earlier), while the original values of PV2 and PV3 are retained.
  • In modified evaluation set 4662B, the original PV2 values are replaced by random values selected in the range between the minimum and maximum values for PV2 from the statistics generated using the original candidate training set.
  • In modified evaluation set 4662C, the original PV3 values are replaced by the median PV3 value in the PV3 statistics obtained from the original candidate training set.
  • Each of the modified evaluation sets is then provided as input to model 4510 which was trained using the original PV training set 4560 to obtain a respective set of predictions.
  • Using modified evaluation set 4662A, PV1-pruned evaluation results 4636A may be obtained (indicative of, or approximating, the results that may have been achieved had PV1 not been included in the training set of model 4510).
  • a measure of the contribution of PV1 to the model's quality (termed FPT1-quality-delta in FIG. 46) may be obtained.
  • PV2-pruned evaluation results 4636B may be used to estimate FPT2-quality-delta, the contribution of FPT2 or PV2 to the quality of the model prediction result, and PV3-pruned evaluation results 4636C may be used to estimate FPT3-quality-delta.
  • the relative contributions of several different FPTs towards the quality of the model's predictions may be estimated, and such contribution estimates may be used to generate the FP proposals for the client.
  • the costs (e.g., in terms of resource consumption or time) of estimating the quality contributions such as FPT1-quality-delta, FPT2-quality-delta and FPT3-quality-delta using the modified evaluation sets may be similar to the evaluation costs EC, which may be smaller than the costs of re-training the model TC and then re-evaluating the model.
  • the particular statistic or values to be used to generate the modified PV evaluation set may differ for different types of PVs and/or for different types of models or problem domains.
  • the mean value may be used (as in the case of PV1 in FIG. 46) as the default substitution, while in other cases random values may be assigned, or the median or mode value may be used based on earlier results achieved for similar types of problems.
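The substitution-based estimate of a quality delta can be sketched with a toy linear model. Everything concrete here is an assumption for illustration: the weights, the evaluation records, the use of mean squared error as the quality metric, and the helper names. The point is only that the already-trained model is re-evaluated on a modified evaluation set, with no re-training.

```python
import random
import statistics


def predict(weights, record):
    """Weighted sum of processed variable values (a toy linear model)."""
    return sum(weights[name] * value for name, value in record.items())


def mean_squared_error(weights, records, targets):
    errors = [(predict(weights, r) - t) ** 2 for r, t in zip(records, targets)]
    return statistics.fmean(errors)


def quality_delta(weights, records, targets, pv, replacement):
    """Increase in error when the values of `pv` are replaced by `replacement()`."""
    modified = [{**r, pv: replacement()} for r in records]
    return (mean_squared_error(weights, modified, targets)
            - mean_squared_error(weights, records, targets))


# Hypothetical trained weights and evaluation data; targets match the model exactly.
weights = {"PV1": 2.0, "PV2": 0.0, "PV3": 1.0}
records = [{"PV1": 1.0, "PV2": 5.0, "PV3": 0.0},
           {"PV1": 3.0, "PV2": 7.0, "PV3": 1.0}]
targets = [2.0, 7.0]

rng = random.Random(0)
delta_pv1 = quality_delta(weights, records, targets, "PV1", lambda: 2.0)  # mean
delta_pv2 = quality_delta(weights, records, targets, "PV2",
                          lambda: rng.uniform(5.0, 7.0))                   # random in [min, max]
delta_pv3 = quality_delta(weights, records, targets, "PV3", lambda: 0.5)  # median
print(delta_pv1, delta_pv2, delta_pv3)
```

In this toy setup PV2 has zero weight, so replacing it leaves the error unchanged, which is the kind of signal that may mark its FPT as a pruning candidate.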
  • FIG. 47 illustrates another example phase of the feature processing optimization technique, in which a model is re-trained using a modified set of processed variables to determine the impact on prediction run-time cost of using a processed variable, according to at least some embodiments.
  • a pruned PV training set 4760 may be obtained from the PV training set 4560 that was generated in an earlier phase of the optimization process, e.g., by simply omitting the values of PV2.
  • a pruned PV evaluation set may be obtained from the original PV evaluation set 4562, e.g., by omitting the PV2 values.
  • the pruned PV training set 4760 and/or the pruned PV evaluation set 4762 may have to be obtained from the un-processed training and evaluation sets.
  • the model 4710 may be trained using the pruned PV training set 4760 and evaluated using the pruned PV evaluation set 4762.
  • FPT2-cost-delta, a measure of the contribution of FPT2 to prediction run-time costs, may be computed as the difference between the prediction run-time metrics 4736 (corresponding to the pruning of FPT2 or PV2) and the original run-time metrics 4537 (which were obtained using a model trained/evaluated with all the candidate FPTs).
  • the cost TC2 of re-training the model may be similar to the cost TC (shown in FIG. 45) of training the model with all the FPTs included, while the cost EC2 of re-evaluating the model may be smaller.
  • the FP manager may attempt to do more re-evaluations than re-trainings - e.g., many FPTs may be analyzed for their quality contributions, and then a smaller subset may be analyzed for their cost contributions.
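The cost delta described above is a per-metric difference between two sets of run-time measurements. A minimal sketch follows, with an assumed sign convention (pruned minus original, so a negative value means that removing the FPT saves resources) and hypothetical metric names and values.

```python
def cost_delta(original_metrics, pruned_metrics):
    """Per-metric difference: pruned run minus original run (assumed sign convention)."""
    return {name: pruned_metrics[name] - original_metrics[name]
            for name in original_metrics}


# Hypothetical run-time metrics for the full FPT set and for the set without FPT2.
original = {"elapsed_s": 12.0, "cpu_s": 30.0, "memory_mb": 512.0}
pruned_fpt2 = {"elapsed_s": 9.0, "cpu_s": 22.0, "memory_mb": 448.0}

fpt2_cost_delta = cost_delta(original, pruned_fpt2)
print(fpt2_cost_delta)
```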
  • FIG. 48 is a flow diagram illustrating aspects of operations that may be performed at a machine learning service that recommends feature processing transformations based on quality vs. run-time cost tradeoffs, according to at least some embodiments.
  • a component of an MLS such as a feature processing manager
  • a client may indicate constraints, such as one or more mandatory feature processing transformations or one or more prohibited feature processing transformations.
  • some or all of these parameters may be indicated in a client's request submitted to the MLS, e.g., via a programmatic interface such as an API (application programming interface), a web-based console, a standalone GUI (graphical user interface), or a command-line tool.
  • the client may indicate one or more training-time goals, e.g., in addition to run-time goals for prediction runs.
  • Any combination of a variety of prediction quality metrics may be identified by the MLS component for different types of machine learning problems, such as an AUC (area under curve) metric, an accuracy metric, a recall metric, a sensitivity metric, a true positive rate, a specificity metric, a true negative rate, a precision metric, a false positive rate, a false negative rate, an F1 score, a coverage metric, an absolute percentage error metric, or a squared error metric.
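Several of the metrics named above follow directly from the four counts of a binary confusion matrix. The sketch below uses the standard definitions; the function name and the sample counts are illustrative assumptions, and zero denominators are mapped to 0.0 as a simplifying choice.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)          # also called sensitivity or true positive rate
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),   # true negative rate
        "false_positive_rate": ratio(fp, fp + tn),
        "false_negative_rate": ratio(fn, fn + tp),
        "f1": ratio(2 * precision * recall, precision + recall),
    }


metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```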
  • goals for training may be determined in some embodiments.
  • goals may be specified in absolute terms (e.g., that the model execution time must be less than X seconds) or in terms of distributions or percentiles (e.g., that 90% of the model execution times must be less than X seconds).
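The two goal styles can be checked against a list of measured execution times as sketched here. The function names and the sample timings are hypothetical, and treating an empty measurement list as a failed percentile goal is an assumption.

```python
def meets_absolute_goal(times, limit):
    """Every execution must finish in less than `limit` seconds."""
    return all(t < limit for t in times)


def meets_percentile_goal(times, limit, fraction):
    """At least `fraction` of executions must finish in less than `limit` seconds."""
    if not times:
        return False  # assumed: no measurements means the goal is not demonstrated
    within = sum(1 for t in times if t < limit)
    return within / len(times) >= fraction


# Hypothetical execution times in seconds for ten prediction runs.
times = [0.8, 0.9, 1.1, 0.7, 0.95, 0.85, 0.6, 0.75, 0.9, 2.5]

print(meets_absolute_goal(times, limit=1.2))
print(meets_percentile_goal(times, limit=1.2, fraction=0.9))
print(meets_percentile_goal(times, limit=1.0, fraction=0.9))
```

The same fraction computation could also back the per-model success rates that a client may wish to view after approving a proposal.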
  • Clients may request the creation, training or re-training of a wide variety of models in different embodiments, including for example classification models (e.g., binary or n-way classification models), regression models, natural language processing (NLP) models, clustering models and the like.
  • the MLS may identify a set of candidate feature processing transformations (FPTs) that can be used to obtain processed variables or features from the raw training data, such that the features may in turn be used to predict values of the target variable(s) (element 4804).
  • one or more of the un-processed independent variables may also be included in the candidate sets of variables to be used for training; that is, not all the variables in a training set need be the results of FPTs.
  • any of a wide variety of FPT candidates may be selected, such as quantile binning, Cartesian product generation, bi-gram generation, n-gram generation, orthogonal sparse bigram generation, a calendar-related transformation, an image processing function, an audio processing function, a bio-informatics processing function, or a natural language processing function.
  • Although the MLS may generally try to come up with a large list of candidates, in some embodiments the number of different FPT candidates may be restricted based on one or more constraints, such as explicit or implicit goals for training time or training resources.
  • at least some of the FPT candidates may be dependent upon each other, e.g., the output of one FPT may be used as the input of another, and one or more directed graphs of FPT candidates may be generated in some cases to represent such relationships.
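When FPT candidates feed one another, a directed graph of their dependencies determines a valid execution order. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later); the specific dependency edges and the `execution_order` helper are illustrative assumptions, not a description of how the MLS schedules its jobs.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(depends_on):
    """Return FPTs ordered so that each runs after the FPTs whose output it uses."""
    # TopologicalSorter expects a mapping from each node to its predecessors.
    return list(TopologicalSorter(depends_on).static_order())


# Assumed example graph: FPT7 and FPT13 consume the output of FPT6.
depends_on = {
    "FPT6": set(),
    "FPT7": {"FPT6"},
    "FPT13": {"FPT6"},
}
order = execution_order(depends_on)
print(order)

# A cyclic dependency cannot be scheduled and is reported as an error.
try:
    execution_order({"FPT1": {"FPT2"}, "FPT2": {"FPT1"}})
    cycle_detected = False
except CycleError:
    cycle_detected = True
print(cycle_detected)
```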
  • respective estimates of the contribution of the FPT to the prediction quality of the model, and/or respective estimates of the effects of the FPT on metrics that impact the run-time goals may be determined (element 4807).
  • the model may first be trained and evaluated using the complete set of candidate FPTs to obtain a best-case prediction quality measure and corresponding run-time metrics. Then, to obtain quality contributions, the model may be re-evaluated using modified evaluation data sets, e.g., evaluation data sets in which the values of a given processed variable are replaced by a mean value (or some other statistically derived replacement value) for that processed variable in the un-modified training set in a manner similar to that illustrated in FIG. 46.
  • models may have to be retrained with pruned training data (i.e., training data from which one or more processed variables of the candidate set have been removed) in some embodiments.
  • respective jobs may be generated for the re-evaluations and/or the re-trainings.
  • the MLS may produce one or more feature processing proposals to be presented programmatically to the client (element 4810), e.g., without violating any explicit or implicit training time constraints or goals. If the client indicates an approval of a particular proposal FP1 (as detected in element 4813), that proposal may be implemented for subsequent runs (e.g., post-training production runs of the model) on behalf of the client (element 4816).
  • If the client does not approve of any proposal put forth by the MLS (as also detected in element 4813), different combinations of FPTs may be selected for further training/testing (element 4819), and the operations corresponding to elements 4807 onwards may be repeated for the new combinations until either a proposal is accepted or a decision to abandon the optimization iterations is reached by the MLS or the client.
  • the client may be given the option of utilizing the full (un-optimized) candidate set of FPTs - that is, the MLS may retain a model variant that was trained using all the candidate FPTs that were identified prior to pruning.
  • the MLS may have to prioritize among the goals indicated by the client - e.g., fast prediction execution times may be incompatible with low memory usage goals.
  • the MLS may indicate such prioritizations to the client and obtain the client's approval for the selected ordering of goals.
  • the client may indicate or suggest a recipe of FPTs to be used, and the MLS may analyze at least some of the FPTs indicated in the recipe for possible inclusion in the candidate FPT set.
  • the MLS may provide the FP proposal in the form of a recipe formatted in the MLS recipe language discussed earlier.
  • the proposals (or recipes corresponding to the proposals) may be stored as artifacts in the MLS artifact repository in at least some embodiments.
  • After an FP proposal is approved by a client, it may be used for subsequent executions of the model (i.e., processed variables produced using the FP proposal may be used as input variables used to train the model and to make predictions using the model), potentially for many different production-mode data sets.
  • a given client may submit several different model creation requests to the service, approve respective FP proposals for each model, and then utilize the approved models for a while.
  • clients may wish to view the success rate with respect to their prediction run-time goals for various models after they are approved.
  • FIG. 49 is an example of a programmatic dashboard interface that may enable clients to view the status of a variety of machine learning model runs, according to at least some embodiments.
  • the dashboard may be incorporated within a web page 4901 in the depicted example, comprising a message area 4904 and respective entries for some subset or all of a client's approved models.
  • the client may change the time period covered by the dashboard, e.g., by clicking on link 4908.
  • the client for whom the example dashboard shown in FIG. 49 is displayed has three models that were run in the covered time period of 24 hours: a brain tumor detection model BTM1, a hippocampus atrophy detection model HADM1 and a motor cortex damage detection model MCDD1.
  • the quality metric selected by the client for BTM1 is ROC AUC
  • the run-time performance goal is that the prediction be completed in less than X seconds
  • 95% of the prediction runs in the last 24 hours have met that goal.
  • the quality metric is the false positive rate
  • the runtime performance goal is a memory footprint no greater than Y
  • the achieved success rate is 97%.
  • the prediction quality metric is also the false positive rate
  • the run-time performance goal is a cost goal per prediction run of less than Z
  • the achieved success rate is 92%.
  • feature identifier may refer to a unique identifier for a property derived from observation records of a data set to be used to train a model.
  • feature set may refer to a set of feature identifiers for which (a) feature values are observable while training the model and (b) feature parameters are known or inferred from the training data.
  • feature may refer to a value (e.g., either a single numerical, categorical, or binary value, or an array of such values) of a property of an observation record indexed by a feature identifier.
  • feature vector may refer to a set of pairs or tuples of (feature identifiers, feature values), which may, for example, be stored in a key-value structure (such as a hash map) or a compressed vector.
  • feature parameter or “parameter” may refer to a value of a parameter corresponding to a property indexed by the feature identifier.
  • a real number representing a weight is one example of a parameter that may be used in some embodiments, although for some types of machine learning techniques more complex parameters (e.g., parameters that comprise multiple numerical values or probability distributions) may be used.
  • parameter vector may refer to a set of pairs or tuples (feature identifier, parameter), which may also be stored in a key-value structure such as a hash map or a compressed vector.
  • a feature vector may be considered a transient structure (created for example for a given observation record that is examined during a learning iteration) that is used primarily to update the parameter vector and then discarded.
  • the parameter vector may be retained for the duration of the training phase of the model, although as described below the parameter vector may grow and shrink during the training phase.
  • Although key-value structures may be used for parameter vectors and/or feature vectors in some embodiments, other types of representations of parameter vectors and/or feature vectors may be employed in various embodiments.
  • FIG. 50 illustrates an example procedure for generating and using linear prediction models, according to at least some embodiments.
  • an unprocessed or raw training data set 5002 to be used to train a linear model may comprise some number of observation records (ORs) 5004, such as ORs 5004A, 5004B, and 5004B.
  • ORs 5004A, 5004B, and 5004B may in turn comprise values of some number of input variables (IVs), such as IV1, IV2, IV3, ..., IVn, and a value of at least one dependent variable DV.
  • Dependent variables may also be referred to as "output" variables.
  • not all the observation records may be available before model training has to begin - e.g., as described below in further detail, in some cases observation records may be streamed to a machine learning service as they become available from one or more online data sources.
  • the MLS may be responsible for training a model iteratively, e.g., with each iteration representing an attempt to improve the quality of the model's predictions based on the ORs analyzed up to that point.
  • Such training iterations that are based on analysis of respective sets of observation records may also be termed "learning iterations" herein.
  • a model generator component of the MLS may require that input variables to be used for generating features (that can then be used for training a linear model) meet certain data-type constraints.
  • the model generator may require that the raw values of categorical IVs of the training data be converted into numerical values and/or normalized (e.g., by mapping the numerical values to real numbers between -1 and 1).
  • Such type transformations may be performed during an initial data preparation phase 5010, producing a set of modified or prepared observation records 5015.
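  • As a rough illustration of the data preparation phase, the following Python sketch maps a categorical input variable to integer codes and rescales numeric values into the range -1 to 1. The function names, the first-seen coding scheme and the min-max scaling rule are assumptions made for this example; the description does not prescribe a particular encoding or normalization.

```python
def encode_categories(values):
    """Map each distinct categorical value to an integer code (assumed scheme)."""
    codes = {}
    encoded = []
    for v in values:
        if v not in codes:
            codes[v] = len(codes)  # first-seen order determines the code
        encoded.append(codes[v])
    return encoded, codes


def scale_to_unit_range(values):
    """Linearly rescale numbers so the minimum maps to -1 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to spread out
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]
```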
  • the linear model may then be trained iteratively in the depicted embodiment, e.g., using a plurality of learning iterations 5020.
  • an empty parameter vector 5025 may be created.
  • the parameter vector 5025 may be used to store parameters (e.g., real numbers that represent respective weights) assigned to a collection of features or processed variable values, where the features are derived from the observation record contents using one or more feature processing transformations (FPTs) of the types described earlier.
  • a linear model may compute the weighted sum of the features whose weights are included in the parameter vector in some implementations.
  • a key-value structure such as a hash map may be used for the parameter vector 5025, with feature identifiers (assigned by the model generator) as keys, and the parameters as respective values stored for each key.
  • Parameters W1, W2, and Wm shown in FIG. 50 are assigned respectively to features with feature identifiers F1, F2, and Fm.
  • one or more prepared ORs 5015 may be examined by the model generator (which may also be referred to as a model trainer). Based on the examination of the input variables in the prepared OR, and/or the accuracy of a prediction for the dependent variables of the prepared OR by the model in its current state, respective parameters or weights may be identified for a new set of one or more processed variables. In at least some implementations, the previously-stored parameters or weights may be updated if needed in one or more learning iterations, e.g., using a stochastic gradient descent technique or some similar optimization approach. As more and more observation records are examined, more and more (feature identifier, parameter) key-value pairs may be added into the parameter vector.
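  • A minimal Python sketch of a hash-map parameter vector that grows as observation records are examined is shown here. It assumes a squared-error loss and a fixed learning rate, neither of which is specified in the description; `predict` and `learning_step` are hypothetical names.

```python
def predict(params, features):
    """Weighted sum of feature values; features without an entry count as weight 0."""
    return sum(params.get(fid, 0.0) * value for fid, value in features.items())


def learning_step(params, features, target, learning_rate=0.1):
    """One stochastic gradient descent step on 0.5 * error**2 for one record."""
    error = predict(params, features) - target
    for fid, value in features.items():
        # The first update for an unseen feature adds a new
        # (feature identifier, parameter) entry to the parameter vector.
        params[fid] = params.get(fid, 0.0) - learning_rate * error * value
    return error
```

  • Starting from an empty dictionary, a call such as `learning_step(params, {"F1": 1.0, "F2": 0.5}, target=1.0)` adds entries for F1 and F2, and later calls adjust the stored weights or add further entries.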
  • this growth of the parameter vector may eventually lead to a scenario in which the memory available at an MLS server being used for the model generator is exhausted and an out-of-memory error may end the training phase of the model prematurely.
  • A technique for pruning selected parameters (i.e., removing entries for selected features from the parameter vector) may be employed in some embodiments.
  • When certain triggering conditions are met (e.g., when the number of features for which parameters are stored in the parameter vector exceeds a threshold), a fraction of the features that contribute least to the model's predictions may be identified as pruning victims (i.e., features whose entries are removed or "pruned" from the parameter vector).
  • An efficient in-memory technique to estimate quantile boundary values (e.g., the boundary below which lie the 20% of the features that contribute the least to the model's predictions) may be used to identify the pruning victims in some embodiments.
  • the importance or contribution of a given feature to the predictive performance of the model may be determined by the deviation of the corresponding parameter value from an "a-priori parameter value" in at least some embodiments.
  • the efficient in-memory technique described below for estimating quantile boundary values may represent one specific example of using such deviations to select pruning victims, relevant in scenarios in which a scalar weight value is used as a parameter value, the a priori parameter value is zero, and the relative contributions correspond to the absolute values of the weights (the respective "distances" of the weights from zero).
  • In scenarios in which the parameters are vectors of values and the a priori value is a vector of zeros, a similar approach involving the computation of the distance of a particular vector parameter from the vector of zeros may be used.
  • For some types of models, the parameters may comprise probability distributions rather than scalars.
  • In such cases, the relative contributions of different features represented in a parameter vector may be obtained by estimating Kullback-Leibler (KL) divergence from the a priori values, and such divergence estimates may be used to identify features whose parameters should be pruned.
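  • The three kinds of deviation measure mentioned above can be sketched in Python as follows. The choice of Euclidean distance for vector parameters and of univariate Gaussian distributions for the KL case are assumptions made for illustration; the description only refers to a "distance" and to KL divergence in general.

```python
import math


def scalar_contribution(weight, prior=0.0):
    """Distance of a scalar weight from its a priori value."""
    return abs(weight - prior)


def vector_contribution(weights, prior=None):
    """Euclidean distance of a vector parameter from its a priori vector (zeros by default)."""
    if prior is None:
        prior = [0.0] * len(weights)
    return math.sqrt(sum((w - p) ** 2 for w, p in zip(weights, prior)))


def gaussian_kl_contribution(mean, std, prior_mean=0.0, prior_std=1.0):
    """Closed-form KL(N(mean, std^2) || N(prior_mean, prior_std^2))."""
    return (math.log(prior_std / std)
            + (std ** 2 + (mean - prior_mean) ** 2) / (2.0 * prior_std ** 2)
            - 0.5)
```

  • In each case a smaller value indicates a parameter closer to its a priori value, and hence a feature that is a more likely pruning victim.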
  • Entries for the pruning victims identified may be removed from the parameter vector 5025, thus reducing the memory consumed.
  • additional learning iterations may be performed even after pruning some parameters.
  • the parameter vector size may grow and shrink repeatedly as more observation records are considered, more parameters are added, and more parameters are pruned.
  • The terms "pruning a parameter" and "pruning a feature" may be used synonymously herein to refer to the removal of a particular entry comprising a (feature identifier, parameter) pair from a parameter vector.
  • a parameter for a particular feature that was pruned in one learning iteration may even be re-added to the parameter vector later, e.g., in response to a determination by the model generator (based on additional observation records) that the feature is more useful for predictions than at the time when it was pruned.
  • the value of the re-added parameter may differ from the value that was removed earlier in some cases.
  • the linear model may be executed using the current parameter vector.
  • the parameter vector 5025 may be "frozen" (e.g., an immutable representation of the parameter vector as of a particular point in time may be stored in an MLS artifact repository) prior to model execution 5040 for predictions 5072 on a production or test data set 5050.
  • additional learning iterations 5020 may be performed using new observation records. In scenarios in which a parameter vector is frozen for production use or testing, additional learning iterations may continue on a non-frozen or modifiable version of the parameter vector.
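  • One simple way to picture a frozen parameter vector alongside a modifiable one is sketched below in Python. The `freeze` helper is hypothetical, and storage in an MLS artifact repository is not modeled; the sketch only shows an immutable point-in-time copy.

```python
from types import MappingProxyType


def freeze(params):
    """Return a read-only snapshot; later training updates do not affect it."""
    return MappingProxyType(dict(params))  # copy first, then wrap as read-only
```

  • After `frozen = freeze(live)`, learning iterations may keep adding entries to `live` while prediction runs read from `frozen`; an attempt to assign into `frozen` raises a `TypeError`.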
  • operations on either side of the boundary indicated by the dashed line in FIG. 50 may be interspersed with one another - e.g., one or more learning iterations during which the parameter vector is modified based on new observation data may be followed by a production run of the model, and the production run may be followed by more learning iterations, and so on.
  • FIG. 51 illustrates an example scenario in which the memory capacity of a machine learning server that is used for training a model may become a constraint on parameter vector size, according to at least some embodiments.
  • Supported feature processing transformation functions may include, for example, quantile bin functions 5154 for numerical variables, Cartesian product functions 5150 for various types of variables, n-gram functions 5152 for text, calendar functions, domain-specific transformation functions 5156 such as image processing functions, audio processing functions, video processing functions, bio-informatics processing functions, natural language processing functions other than n-grams, and so on.
  • One or more FPTs may be applied to a given input variable, and additional FPTs may be applied to the results.
  • the number 5133 of possible feature processing transformations and combinations may be very large, which could lead to a parameter vector 5144 that is unbounded in size.
  • the various features identified may be mapped to a vector of real numbers, where the dimension of the vector may be arbitrarily large at least in principle.
  • a significant portion or all of the learning iterations of a particular model may be intended to be performed on a single MLS server such as server 5160 (e.g., using one or more threads of execution at such a server).
  • the parameter vector for the model may be required to fit in the main memory 5170 of the MLS server 5160. If the in-memory parameter vector representation 5180 grows too large, the process or thread used for learning may exit prematurely with an out-of-memory error, and at least some of the learning iterations may have to be re-implemented.
  • the MLS server memory requirement may grow in a non-linear fashion with the number of input variables and/or observation records examined. It is noted that the requirement graph 5175 is not intended to illustrate an exact relationship between the number of observations and the possible parameter vector size for any given machine learning problem; instead, it is intended to convey general trends that may be observed in such relationships.
  • In some approaches, the training of a model may simply be terminated when the number of features whose parameters are stored in the parameter vector reaches a selected maximum. This means that in such approaches, features that may otherwise have been identified later as significant contributors to prediction quality may never be considered for representation in the parameter vector.
  • In other approaches, different features may be combined disjunctively using hash functions (e.g., to save space, only N bits of the K bits of a hash value that would otherwise represent a particular feature may be used, with the N bits being selected using a modulo function), which may also result in a reduction in the quality of the predictions.
  • In yet other approaches, one or more regularization techniques may be used, in which the weights or parameters assigned to different features may be reduced by some factor in various learning iterations, and as a result, some features may gradually be eliminated from the parameter vector (with their weights approaching zero).
  • However, regularization may result in relatively poor quality of model prediction.
  • Regularization may also require a selection of one or more hyper-parameters (such as the reduction factors to use), which may not be straightforward. It is noted that even in embodiments in which the parameter pruning techniques described below are implemented, regularization may still be used for various reasons (such as to prevent over-fitting, or to at least contribute to parameter vector size reduction).
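  • The hash-based alternative mentioned above can be illustrated with a short Python sketch. The use of MD5 (so that K is 128 bits) and the helper names are assumptions for this example only; the point is that keeping N bits via a modulo operation bounds the number of slots but lets distinct features collide and share one parameter.

```python
import hashlib


def hashed_slot(feature_id, n_bits):
    """Map a feature identifier to one of 2**n_bits slots via a modulo of its hash."""
    digest = hashlib.md5(feature_id.encode("utf-8")).digest()
    full_hash = int.from_bytes(digest, "big")  # K = 128 bits for MD5
    return full_hash % (2 ** n_bits)           # keeps only the N low-order bits


def hashed_features(features, n_bits):
    """Accumulate feature values by slot; colliding features become indistinguishable."""
    slots = {}
    for fid, value in features.items():
        slot = hashed_slot(fid, n_bits)
        slots[slot] = slots.get(slot, 0.0) + value
    return slots
```

  • With `n_bits=1` there are only two slots, so any three features must share at least one slot, which is the source of the loss in prediction quality noted above.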
  • a technique that imposes limits on the size of the parameter vector used for a linear model, without sacrificing the quality of the predictions made and without restricting the set of features based on how early during the training phase the features are identified may be utilized in some embodiments.
  • Parameters corresponding to a subset of the features identified thus far may be pruned from the parameter vector (effectively replacing the removed parameter values with a default or a priori value).
  • The features whose parameters are removed may be referred to herein as "pruning victim features" or more simply as "pruning victims".
  • An efficient estimation technique to identify a selected fraction or quantile of the features that contribute the least to the predictions of the model may be used to identify the pruning victims in some implementations as described below.
  • such a technique may not require explicitly sorting the parameters or copying the parameters.
  • parameters for additional features may be added, e.g., in subsequent learning iterations.
  • a parameter for a given feature that was selected as a pruning victim earlier may be re-introduced into the parameter vector if later observations indicate that the given feature may be more useful for prediction than it was expected to be when it was pruned.
  • FIG. 52 illustrates such a technique in which a subset of features for which respective parameter values are stored in a parameter vector during training may be selected as pruning victims, according to at least some embodiments.
  • Four learning iterations 5210A, 5210B, 5210K and 5210L are shown.
  • In each learning iteration, a respective observation record set (ORS) 5202 (e.g., ORS 5202A in learning iteration 5210A, ORS 5202B in learning iteration 5210B, and so on) may be examined by the model generator.
  • earlier-generated parameter values may be updated or adjusted in at least some embodiments, e.g., using a stochastic gradient technique.
  • After learning iteration 5210A, the parameter vector comprises parameters 5222A corresponding to feature identifiers 5212A. After the next learning iteration 5210B, the parameter vector has grown and now comprises parameters 5222B for feature identifiers 5212B (and some or all of the parameters set in learning iteration 5210A may have been adjusted or changed).
  • the model generator may determine that a threshold parameter vector size PVS has been exceeded, and may perform a pruning analysis. It is noted that at least in some embodiments, operations to detect whether the triggering condition for pruning has been met may not be performed in or after every learning iteration, as such frequent pruning may be unnecessary. Instead, such checks may be performed periodically, e.g., based on the number of learning iterations that have been performed since such a check was last completed, or based on the time that has elapsed since such a check was last performed, or based on the number of observation records that have been examined since a check was last performed.
  • the PVS may be based at least in part on (e.g., set to some fraction of) the memory capacity of an MLS server, or the triggering condition may be based on some other server resource capacity constraint such as CPU utilization limits.
  • a client on whose behalf the linear model is being trained may indicate one or more goals for training (e.g., that a server with no more than X gigabytes of memory is to be used for training) and/or for post-training execution, and such goals may influence the value of PVS.
  • PVS may be expressed in terms of the number of parameters included in the parameter vector, or simply in terms of the amount of memory consumed by the parameter vector.
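  • A minimal Python sketch of a periodic trigger check is given below, assuming PVS is expressed as a parameter count and that checks are spaced by the number of observation records examined. The class name, the `bytes_per_entry` estimate and the way a memory share is converted into a count are all hypothetical.

```python
class PruningTrigger:
    """Decides when to compare the parameter vector size against a threshold (PVS)."""

    def __init__(self, max_parameters, check_every_n_records):
        self.max_parameters = max_parameters              # PVS as a parameter count
        self.check_every_n_records = check_every_n_records
        self.records_since_check = 0

    def should_prune(self, params, records_examined):
        """Return True only when a periodic check finds the threshold exceeded."""
        self.records_since_check += records_examined
        if self.records_since_check < self.check_every_n_records:
            return False                                  # too soon to check again
        self.records_since_check = 0
        return len(params) > self.max_parameters


def pvs_from_memory(server_memory_bytes, fraction, bytes_per_entry):
    """Derive a parameter-count threshold from a share of server memory."""
    return int(server_memory_bytes * fraction) // bytes_per_entry
```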
  • the model generator may identify some selected number (or some selected fraction) of the features whose parameters are to be removed.
  • the 10% least significant features may be identified, e.g., based on the absolute values of weights assigned to the features represented in the parameter vector.
  • the relative contribution of the features to a prediction (which is computed at least in part using the weighted sums of the feature values) may be assumed to be proportional to the absolute value of their weights. The task of identifying the 10% least important features may thus be equivalent to identifying the 10% of the weights that have the smallest absolute value.
  • An exact identification of such a fraction of the features may require sorting the absolute values of the weights of the entire parameter vector, which may pose resource consumption problems of its own for large parameter vectors - e.g., a substantial amount of memory, CPU cycles and/or persistent storage may be required for such sort operations. Accordingly, an optimization may be used in some implementations to find an approximate boundary weight for the selected fraction (i.e., the weight Wk such that approximately 10% of the features have smaller absolute weights and the remaining approximately 90% have higher absolute weights), without sorting the weights or copying the weights.
  • an optimization technique is described below in conjunction with the discussion of FIG. 55.
  • weights whose absolute values are below the boundary may be easily identified, and the entries for such weights may be removed from the parameter vector.
  • Although weights are discussed herein as a simple example of the kinds of parameters that may be stored, similar techniques may be used to determine pruning candidates when more complex parameters (e.g., parameter structures that include more than just a single real number) are used. That is, the pruning technique described is not restricted to embodiments in which a single numerical quantity (such as a weight with a real number value) is used as a parameter. More complex parameters may be transformed, for example, into numerical values that approximate the relative contributions of the corresponding features to the predictions made by the model. As mentioned earlier, different measures of deviations of specific parameter values from a priori values may be used in various embodiments to estimate the relative contributions of the parameters, depending on the types of parameters being used for the model.
  • the pruned parameter vector (comprising adjusted parameters 5222K* for feature identifiers 5212K*) may no longer violate the PVS constraint.
  • a sufficiently large fraction of the parameter vector may be pruned that additional parameters may again be added in one or more subsequent learning iterations, such as learning iteration 5210L shown in FIG. 52.
  • the parameter vector size may grow again after being reduced via pruning. Additional pruning may be required if the parameter vector size again exceeds PVS eventually, and more parameters may be added after the additional pruning is completed.
  • a parameter corresponding to any feature may be added to the parameter vector in a given learning iteration, including for example parameters corresponding to features that were selected as pruning victims earlier.
  • the technique illustrated in FIG. 52 may converge on a parameter vector that provides highly accurate predictions while limiting memory use during training.
  • the reduction in the parameter vector size may also reduce the time it takes to load and execute the model during prediction runs - thus, the benefits of the technique may be obtained both during the training phase and in post-training-phase prediction runs.
  • FIG. 53 illustrates a system in which observation records to be used for learning iterations of a linear model's training phase may be streamed to a machine learning service, according to at least some embodiments.
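  • A data receiver endpoint 5308 (e.g., a network address or a uniform resource identifier) may be established at the MLS for receiving observation records from one or more streaming data sources (SDSs) 5302, such as SDS 5302A, SDS 5302B and SDS 5302C.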
  • Such data sources may, for example, include web server logs of a geographically distributed application, sensor-based data collectors, and the like.
  • Observation records (ORs) from such data sources may arrive in arbitrary order - e.g., OR1 from SDS 5302A may be received first, followed by OR2 from SDS 5302C, OR3 and OR4 from SDS 5302B, and so on.
  • the records may be used for learning iterations in the order in which they arrive in the depicted embodiment.
  • OR1, OR2 and OR3 may be examined during a first set of learning iterations 5333A, resulting in the generation of a particular parameter vector.
  • the learning iteration set 5333A may be followed by a pruning iteration 5334 in which some selected parameters are removed from the parameter vector based on their relative contributions to the predictions of the model being trained. Pruning iteration 5334 may be followed by another learning iteration set 5333B, in which OR4, OR5 and OR6 are examined and parameters for one or more new features (and/or features whose parameters were previously pruned) are added to the parameter vector.
  • pruning iterations 5334 may be scheduled at regular intervals, e.g., once every X seconds, regardless of the rate at which observation records are received or examined. Such schedule-based pruning may help the MLS to respond to wide fluctuations in observation record arrival rates - e.g., to prevent out-of-memory errors resulting from a sudden burst of observation records that arrive at a time at which the parameter vector size is already close to its maximum threshold.
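  • The interleaving of learning iterations and schedule-based pruning iterations over a stream can be sketched in Python as follows. The `learn` and `prune` callables and the injectable `clock` are hypothetical. As a simplification, this sketch only checks the schedule after each record, so it would not prune while the stream is idle; a separate timer would be needed for that.

```python
import time


def train_on_stream(records, learn, prune, prune_interval_seconds, clock=time.monotonic):
    """Apply learn() to records in arrival order and call prune() on a time schedule."""
    last_prune = clock()
    prune_count = 0
    for record in records:
        learn(record)                      # learning iteration on the next record
        if clock() - last_prune >= prune_interval_seconds:
            prune()                        # pruning iteration
            prune_count += 1
            last_prune = clock()
    return prune_count
```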
  • FIG. 54 is a flow diagram illustrating aspects of operations that may be performed at a machine learning service at which, in response to a detection of a triggering condition, parameters corresponding to one or more features may be pruned from a parameter vector to reduce memory consumption during training, according to at least some embodiments.
  • An indication of a data source of the unprocessed or raw observation records of a training data set that is to be used to develop a linear predictive model may be received at a machine learning service.
  • the data source may be indicated by a client via an MLS programmatic interface such as an API, a web-based console, a standalone GUI or a command line tool.
  • the linear predictive model may, for example, be expected to make predictions based at least in part on weighted sums of feature values derived from the training data via one or more feature processing transformations (FPTs) of the types described earlier.
  • a job object for generating/training the model may be created in response to the invocation of the API by the client and placed in a job queue such as queue 142 of FIG. 1.
  • the job may be scheduled, e.g., asynchronously, on a selected training server (or a set of training servers) of the MLS server pool(s) 185.
  • the process of training the model may be initiated (e.g., when the queued job is scheduled).
  • An empty parameter vector may be initialized (element 5404) and one or more settings to be used during the training phase of the model may be determined - e.g., the threshold condition that is to be used to trigger pruning may be identified, the fraction of parameters that is to be pruned each time such a threshold condition is detected may be identified, and so on.
  • the threshold may be based on a variety of factors in different implementations, such as the maximum number of parameters that can be included in the parameter vector, the memory capacity of the MLS server(s) used for training the model, and/or goals indicated by the client.
  • Client-provided goals from which the threshold may be derived may include, for example, limits on various types of resources that can be consumed during training and/or during post-training runs of the model, including memory, CPU, network bandwidth, disk space and the like.
  • a client may specify a budget goal for the training and/or for prediction runs, and the budget may be translated into corresponding resource limits at a component of the MLS.
  • a model generator or trainer may then begin implementing one or more learning iterations in the depicted embodiment.
  • a set of one or more observation records may be identified for the next learning iteration (element 5407).
  • some preliminary data type transformations and/or normalization operations may have to be performed (element 5410).
  • some model generators may require that categorical input variables be converted into numerical or Boolean variables, and/or that numerical variable values be mapped to real numbers in the range -1 to 1.
  • One or more new features for which parameters such as weights are to be added to the parameter vector may be identified (element 5413). In some cases, a new entry for a feature that was selected as a pruning victim earlier may be re-inserted into the parameter vector.
  • the parameter value for such a re-added entry may differ from the parameter value of the previously pruned entry in some cases, while the parameter values of the original and re-introduced entries may be the same in other cases.
  • a key-value structure such as a hash map or hash table may be used to store (feature identifier, parameter) pairs of the parameter vector in some implementations, e.g., with feature identifiers as the keys.
  • one or more previously-generated parameter values may also be updated at this stage, e.g., using a stochastic gradient descent technique.
  • one or more features may be identified as pruning victims (element 5419).
  • The features that contribute the least to the model's predictions, e.g., by virtue of having the smallest absolute weights, may be selected as pruning victims.
  • the manner in which the relative contributions of different features are determined or estimated, and the manner in which the features expected to provide the smallest contributions are identified, may differ in various embodiments.
  • In embodiments in which each feature is assigned a respective real number as a weight, an efficient estimation technique that does not require sorting or copying of the weights and can estimate a quantile boundary value among the weights in a single in-memory pass over the parameter vector may be used.
  • After the quantile boundary (e.g., the weight representing the estimated 10th percentile or the estimated 20th percentile among the range of absolute values of the weights represented in the parameter vector) has been identified, entries for features whose weights have lower absolute values may be removed from the parameter vector.
  • the memory consumed by the parameter vector may be reduced by the removal of the entries corresponding to the pruning victims (element 5422).
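  • A minimal Python sketch of this pruning step follows. It uses the single-pass update rule given in the description of FIG. 55 (start phi-tau at zero, add tau*eta when abs(Wj) exceeds phi-tau, otherwise subtract (1-tau)*eta) and then deletes the entries below the estimated boundary. The function names and the default values of `tau` and `eta` are assumptions; how good the estimate is depends on eta and on how many weights are examined.

```python
def estimate_quantile_boundary(weights, tau, eta):
    """Single pass over the weights, with no sorting and no copy of the weights."""
    phi_tau = 0.0
    for w in weights:
        if abs(w) > phi_tau:
            phi_tau += tau * eta
        else:
            phi_tau -= (1.0 - tau) * eta
    return phi_tau


def prune(params, tau=0.2, eta=0.01):
    """Remove entries whose absolute weight is below the estimated boundary."""
    boundary = estimate_quantile_boundary(params.values(), tau, eta)
    # Collect identifiers first so the dictionary is not modified while iterating.
    victims = [fid for fid, w in params.items() if abs(w) < boundary]
    for fid in victims:
        del params[fid]
    return boundary, victims
```

  • Passing `params.values()` hands the estimator a view of the stored weights rather than a copy, in keeping with the in-memory character of the technique.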
  • Once the learning iterations are complete, the trained model may be used for generating predictions on production data, test data, and/or on other post-training-phase data sets (element 5428). Learning iterations may be deemed to be complete if, for example, all the observation records expected to be available have been examined, or if the accuracy of the predictions that can be made by the model on the basis of the learning iterations performed thus far meets an acceptance criterion.
  • If additional learning iterations are to be performed, operations corresponding to elements 5407 onwards may be repeated - e.g., a new set of one or more observation records may be identified, the raw data may be transformed as needed, parameters for new features may be added to the parameter vector, and so on. In some cases, at least some additional learning iterations may be performed on observation records that have already been examined.
  • FIG. 55 illustrates a single-pass technique that may be used to determine quantile boundary estimates of the absolute values of weights assigned to features, according to at least some embodiments.
  • A set of weights W1, W2, ..., Wm corresponding to respective features F1, F2, ..., Fm may be examined in memory, e.g., without copying the weights and without explicitly sorting the weights.
  • The quantile for which a boundary value is to be obtained is referred to as "tau".
  • For example, to estimate the boundary between the lowest 20% of the absolute values of the weights and the remaining 80%, tau may be set to 0.2.
  • The boundary itself is referred to as "phi-tau".
  • Initially, tau and another parameter "eta" (representing a learning rate to be used to determine phi-tau) may be determined and phi-tau may be set to zero.
  • The next weight Wj may then be examined and its absolute value abs(Wj) may be obtained (element 5505). If abs(Wj) is greater than phi-tau, as determined in element 5508, phi-tau may be increased by adding (tau*eta), the product of tau and eta.
  • Otherwise, phi-tau may be reduced by subtracting (1-tau)*eta (element 5511). If more weights remain to be examined (as detected in element 5517), the operations corresponding to elements 5505 onwards may be repeated. Otherwise, after all the weights have been examined, the estimation of the quantile boundary phi-tau may be complete (element 5520). The value of phi-tau at the end of the procedure illustrated in FIG. 55 may then be used to select the pruning victims - e.g., features with weights whose absolute values are less than phi-tau may be chosen as victims, while features with weights whose absolute values are no less than phi-tau may be retained. In at least some implementations, the learning rate (eta) may be modified or adjusted during the quantile boundary estimation procedure; that is, eta need not remain constant.
Concurrent binning
  • The term "feature identifier" may refer to a unique identifier for a property derived from observation records of a data set to be used to train a model.
  • The term "feature set" may refer to a set of feature identifiers for which (a) feature values are observable while training the model and (b) feature parameters are known or inferred from the training data.
  • The term "feature" may refer to a value (e.g., either a single numerical, categorical, or binary value, or an array of such values) of a property of an observation record indexed by a feature identifier.
  • The term "binned feature", for example, may refer to a particular binary indicator value (e.g., a "0" or a "1") of an array of binary indicator values obtained from a quantile binning transformation applied to one or more input variables of a set of observation records.
  • The term "feature vector" may refer to a set of pairs or tuples of (feature identifiers, feature values), which may, for example, be stored in a key-value structure (such as a hash map) or a compressed vector.
  • The term "feature parameter" or "parameter" may refer to a value of a parameter corresponding to a property indexed by the feature identifier.
  • A real number representing a weight is one example of a parameter that may be used in some embodiments, although for some types of machine learning techniques more complex parameters (e.g., parameters that comprise multiple numerical values) may be used.
  • The term "parameter vector" may refer to a set of pairs or tuples (feature identifier, feature parameter), which may also be stored in a key-value structure such as a hash map or a compressed vector.
  • Although key-value structures may be used for parameter vectors and/or feature vectors in some embodiments, other types of representations of parameter vectors and/or feature vectors may be employed in various embodiments.
  • quantile binning transformations may be used for at least some input variables.
  • the values of a raw or unprocessed input variable may each be mapped to one of a selected number of quantile bins, such that each of the bins is at least approximately equal in population to the others.
  • a set of binary indicator variables (variables that can either be set to "0" or "1") may then be generated, with each such binary indicator variable representing a respective "binned feature" derived from the raw input variable.
  • one of the indicator variables (the one corresponding to the particular bin to which the value of the raw variable is mapped) is set to "1", and the remaining indicator variables are set to "0".
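  • The quantile binning transformation described above can be sketched in Python using the standard library. The helper names are hypothetical, and the cut points are computed here from the full list of observed values, which is an assumption made to keep the example small.

```python
import bisect
import statistics


def quantile_boundaries(values, bin_count):
    """Cut points that split the observed values into bin_count roughly equal groups."""
    return statistics.quantiles(values, n=bin_count)  # bin_count - 1 cut points


def binned_indicators(value, boundaries):
    """One indicator per bin: 1 for the bin containing value, 0 for all others."""
    index = bisect.bisect_right(boundaries, value)
    return [1 if i == index else 0 for i in range(len(boundaries) + 1)]
```

  • For values 1 through 100 and a bin count of 4, a raw value of 60 falls in the third bin, so its indicator list is `[0, 0, 1, 0]`.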
  • Different bin counts (i.e., the number of bins to which a given input variable's raw values should be mapped) may be considered for a given input variable.
  • With a bin count of 10, approximately 10 percent of the observation records would be mapped to each of the 10 bins, while with a bin count of 1000, only roughly 0.1% of the observation records would be mapped to each bin.
  • To determine which bin count is the better choice, two versions of the model may have to be fully trained separately and then evaluated.
  • a first version M1 of the model may be trained with features obtained from the 10-bin transformation (as well as other features, if any are identified by the model generator), and a second version M2 may be trained using features obtained from the 1000-bin transformation (as well as the other features).
  • M1's predictions on test data may be compared to M2's predictions on the same test data to determine which approach is better.
  • Such an approach, in which different bin counts are used for training respective versions of a model may be less than optimal for a number of reasons. First, training multiple models with respective groups of binned features may be expensive even for a single input variable.
  • any single bin count may not necessarily produce predictions that are as accurate as could be produced using multiple bin counts.
  • a machine learning service may implement a concurrent binning technique, in which several different feature transformations with respective bin counts may be applied to a given input variable during a single training phase or training session of a model.
  • initial weights or more complex parameters
  • initial weights may be assigned to all the binned features derived from multiple bin counts.
  • a large number of binned features may be generated, with corresponding parameters or weights stored in a parameter vector.
  • At least some of the parameters corresponding to binned features may later be removed, e.g., based on the examination of additional observation records, a re-examination of some observation records, and/or the results of training-phase predictions during successive learning iterations.
  • the initial weights or parameters may be adjusted using selected optimization techniques such as L1 or L2 regularization in some embodiments, and features whose absolute weight values fall below a threshold value may be eliminated from the parameter vector.
  • the efficient pruning technique described above e.g., in conjunction with the descriptions of FIG. 51 - FIG. 55
  • parameter vectors that allow a model to make accurate post-training-phase predictions with respect to non-linear relationships of the kinds described above may be obtained very efficiently in some embodiments, e.g., without incurring the costs of repeatedly training a model from scratch.
  • FIG. 56 illustrates examples of using quantile binning transformations to capture nonlinear relationships between raw input variables and prediction target variables of a machine learning model, according to at least some embodiments.
  • training data variables 5690 included in observation records obtained from a data source to be used to generate a model at a machine learning service may include a number of numeric input variables (NIVs), such as NIV1 and NIV2.
  • Distribution graphs DG1 and DG2 respectively illustrate the statistical distribution of the values of NIV1 and NIV2 of a set of observation records.
  • the values of NIV1 lie in the range NIV1-min to NIV1-max, with the highest density of observations in the sub-range between n2 and n3.
  • the values of NIV2 lie in the range NIV2-min to NIV2-max, with a peak density between p1 and p2.
  • the values of NIV1 have been mapped to 4 bins labeled NIV1-Bin1 through NIV1-Bin4.
  • the names of the bins correspond to feature identifiers of the corresponding binned features in FIG. 56. That is, a quantile binning transformation with a bin count of 4 has been used to generate four binned features 5610A derived from the single variable NIV1, with one indicator variable corresponding to each of the bins.
  • NIV1 in observation record OR1 falls in bin NIV1-Bin3; accordingly, for OR1, the indicator variable for NIV1-Bin3 has been set to 1 and the remaining NIV1-related indicator variables NIV1-Bin1, NIV1-Bin2, and NIV1-Bin4 have been set to zero.
  • the value of NIV1 falls within NIV1-Bin2, and the corresponding indicator variable has been set to 1 with the remaining set to zero.
  • the values of NIV2 have been mapped to three bins NIV2-Bin1 through NIV2-Bin3 via a quantile binning transformation with a bin count of 3. In both OR1 and OR2, the value of NIV2 falls within NIV2-Bin2.
  • indicator variable NIV2-Bin2 has been set to 1
  • the remaining NIV2-related indicator variables have been set to 0.
  • the number of binned features or binary indicator variables for a given variable corresponds to the bin count in the depicted embodiment.
  • the example transformations illustrated in FIG. 56 may be referred to as single-variable non-concurrent binning transformations herein.
  • the transformations may be designated as single-variable in that the values of only one variable are used to derive a given binned feature, and non-concurrent because only a single bin count is used for binning each of the variables.
  • a parameter vector 5625 comprising parameters for the combination of binned features (such as NIV1-Bin1 and NIV1-Bin2) and non-binned features (such as NF1) may be generated for the training data.
  • the parameters may comprise weights, such as respective real numbers for each feature.
  • the parameter vector may grow and shrink in some embodiments, e.g., as the kinds of pruning techniques described above are used iteratively.
  • the bin boundaries may also shift as more observation records are examined or previously-examined observation records are reanalyzed.
  • the model's training phase may be deemed complete (or at least sufficiently complete to be used for a prediction on some non-training data set), and the current version of the parameter vector 5625 may be used during an execution 5640 of the model to generate predictions 5672 for a test or production data set 5650.
  • a single bin count (four) is used for binning NIV1 values, and a single bin count (three) is used for binning NIV2.
  • the binned features generated may not necessarily lead to the highest-quality predictions. This may be the case, for example, because the particular bin count selected for a given raw input variable at the start of the training/learning process may not be able to represent the non-linear relationship between the raw input variable values and the target variables as well as the relationship may have been represented using a different bin count.
  • the bin count may have been chosen somewhat arbitrarily, without any quantifiable justification.
  • the machine learning service may concurrently implement quantile binning using several different bin counts for at least one raw input variable of the training set.
  • FIG. 57 illustrates examples of concurrent binning plans that may be generated during a training phase of a model at a machine learning service, according to at least some embodiments.
  • the set of training data variables 5790 includes numerical input variables NIV1, NIV2, and NIV3 that have been selected as candidates for concurrent quantile binning.
  • a respective concurrent binning plan (CBP) may be generated and implemented during the training phase of the model.
  • In accordance with CBP1, three quantile binning transformations QBT1-1, QBT1-2 and QBT1-3 may be applied within the training phase to the values of NIV1, with respective bin counts of 10, 100 and 1000.
  • a total of 1110 binned features 5730A may be produced as a result of implementing CBP1: 10 features (labeled NIV1-1-1 through NIV1-1-10) from QBT1-1, 100 features (NIV1-2-1 through NIV1-2-100) from QBT1-2, and 1000 features (NIV1-3-1 through NIV1-3-1000) from QBT1-3.
  • Initial weights (or other types of parameters to be used to represent the relative contributions of the respective features to the model's predictions) may be assigned to each of the binned features 5730A.
  • In accordance with concurrent binning plan CBP2, four quantile binning transformations may be applied to NIV2 concurrently within the same training phase, with bin counts of 20, 40, 80 and 160 respectively, resulting in 300 binned features 5730B.
  • three quantile binning transformations may be applied to NIV3, with bin counts of 5, 25 and 625 respectively, resulting in 655 binned features 5730C.
  • Respective initial weights/parameters may be assigned to all the binned features.
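The feature counts quoted for the three concurrent binning plans follow directly from summing the bin counts of each plan. The sketch below enumerates the binned feature identifiers using the variable-transformation-bin naming pattern of FIG. 57; the helper name and the dictionary layout are assumptions made for illustration.

```python
def binned_feature_names(variable, bin_counts):
    """List one feature identifier per bin, for every bin count in the plan.
    Identifiers follow the <variable>-<transformation>-<bin> pattern."""
    names = []
    for transformation, count in enumerate(bin_counts, start=1):
        for bin_number in range(1, count + 1):
            names.append(f"{variable}-{transformation}-{bin_number}")
    return names

# Bin counts of CBP1, CBP2 and CBP3 as described in the text.
plans = {
    "NIV1": [10, 100, 1000],
    "NIV2": [20, 40, 80, 160],
    "NIV3": [5, 25, 625],
}

features = {var: binned_feature_names(var, counts) for var, counts in plans.items()}
totals = {var: len(names) for var, names in features.items()}

# Every binned feature starts with an initial weight (0.0 is an arbitrary choice here).
parameter_vector = {name: 0.0 for names in features.values() for name in names}
```

The totals come to 1110, 300 and 655 features respectively, so the initial parameter vector for these three variables alone holds 2065 entries.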
  • a model generator or another component of the machine learning service may select the different bin counts (e.g., 10, 100, 1000 in the case of NIV1, or 20, 40, 80, 160 in the case of NIV2) to be used for concurrent binning of a given variable based on any of a variety of factors in different embodiments.
  • a small sample of the observation records available may be obtained, and the distribution of the values of a numerical input variable (such as NIV1, NIV2 or NIV3) in the sample may be determined. The distribution may then be used to select the different bin counts.
  • the range and granularity of the numeric variables' values may influence the selection of bin counts as well: for example, if a particular numeric variable takes only integer values between 1 and 1000, the maximum number of bins for that variable may be limited to 1000.
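One simple way to apply the granularity limit is to cap each candidate bin count at the number of distinct values seen in a sample. The sketch below is only an illustration of that idea; the function name and the candidate lists are hypothetical, and a real selection policy may weigh other factors.

```python
def cap_bin_counts(candidate_counts, sample_values):
    """Limit each candidate bin count to the number of distinct sample values
    and drop duplicates that result from the cap."""
    distinct = len(set(sample_values))
    return sorted({min(count, distinct) for count in candidate_counts})

# A variable that takes only integer values between 1 and 1000.
integer_sample = range(1, 1001)
capped = cap_bin_counts([10, 100, 1000, 10000], integer_sample)

# A variable with very few distinct values.
tiny_capped = cap_bin_counts([2, 5, 10], [1, 2, 3, 3, 2, 1])
```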
  • a knowledge base of the machine learning service e.g. KB 122 shown in FIG. 1
  • Although the quantile binning transformations of a given set of CBPs may be implemented during a single training phase or training session of the model in at least some embodiments, the computations involved in the transformations need not be performed simultaneously or in parallel at the hardware level.
  • values for the indicator variables of a given quantile binning transformation such as QBT1 may typically be produced using at least one thread of execution of a model generator.
  • the number of candidate variables for binning transformations may be quite large, and as a result the number of binned features produced as a result of implementing the concurrent binning plans may also become very large.
  • the memory required at an MLS server at which the model is being generated or trained increases.
  • one or more weight adjustment optimizations 5710 may be performed in the depicted embodiment.
  • Such optimizations may include, for example, a regularization technique in which the weights of at least some of the binned features (and/or some non-binned features) are reduced over successive learning iterations, as the model generator is able to learn more about the relative contributions of the various features to prediction accuracy.
  • As a result of regularization, the weights associated with some features may become small enough that at least the parameters corresponding to such features may be removed or pruned from the parameter vector in at least one embodiment.
  • regularization may also help to reduce over-fitting in at least some embodiments; that is, reduction of parameter vector size may not be the only (or even the primary) reason for using regularization.
  • a quantile boundary for the different weights assigned to the features may be estimated (e.g., using a technique similar to that shown in FIG. 55), and a selected set of weights that fall in the lowest X% of the range of absolute values of weights may be removed from the model's parameter vector.
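Pruning the lowest X% of weights by absolute value can be sketched as below. Note the swapped-in technique: this example ranks the weights with an exact sort, whereas the text refers to an estimated quantile boundary (FIG. 55). The function name and the sample weights are hypothetical.

```python
def prune_lowest_percent(parameter_vector, percent):
    """Remove the entries whose absolute weights rank in the lowest
    `percent` of the parameter vector (exact sort, not an estimate)."""
    ranked = sorted(parameter_vector, key=lambda name: abs(parameter_vector[name]))
    drop_count = len(ranked) * percent // 100
    dropped = set(ranked[:drop_count])
    return {name: w for name, w in parameter_vector.items() if name not in dropped}

# Hypothetical weights for a mix of binned features and one non-binned feature.
weights = {
    "NIV1-1-3": 0.9,
    "NIV1-2-5": -0.7,
    "NIV1-1-4": 0.01,
    "NIV1-3-17": -0.02,
    "NF1": 0.4,
}
pruned = prune_lowest_percent(weights, 40)
```

An exact sort costs O(n log n) per pruning pass, which is one reason an estimated boundary may be preferable for very large parameter vectors.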
  • Both regularization and quantile-boundary-based pruning may be used in some embodiments to eliminate parameters from the parameter vector during training. In other embodiments, optimizations other than regularization and quantile-boundary-based pruning may be used.
  • the initial weights assigned to the different binned features obtained in accordance with CBP1 - CBP3 may be adjusted in accordance with the selected optimization strategy or strategies in the embodiment depicted in FIG. 57. If the adjusted weight for a given binned feature falls below a rejection threshold, the entry for that feature may be removed from the parameter vector, and may not be used for post-training-phase predictions (unless it is reintroduced later as more learning iterations are completed). In the illustrated example, corresponding to each of the input variables for which concurrent binning transformations were applied, only a subset are used for post-training-phase predictions as their adjusted weights are above the rejection threshold.
  • For example, from among the 1110 NIV1-related binned features, only NIV1-1-3 and NIV1-2-5 are used. From among the 300 NIV2-related binned features, NIV2-2-1 through NIV2-2-40 are used, and from among the 655 NIV3-related binned features, NIV3-3-1 through NIV3-3-10 and NIV3-3-50 through NIV3-3-53 are used for post-training predictions. The parameters for the remaining binned features may be removed from the parameter vector. Although only binned features produced as a result of the implementation of concurrent binning plans CBP1-CBP3 are shown in FIG. 57, parameters for non-binned features may also be added to and removed from the parameter vector during the training phase.
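A weight adjustment followed by threshold-based rejection can be sketched as follows. This is an assumption-laden illustration: it uses a single soft-thresholding step as a stand-in for L1 regularization, and the penalty, threshold, and weights are invented values rather than anything specified in the text.

```python
def l1_shrink(weight, penalty):
    """Soft-thresholding: move the weight toward zero by `penalty`,
    clamping at zero when the weight is within the penalty band."""
    if weight > penalty:
        return weight - penalty
    if weight < -penalty:
        return weight + penalty
    return 0.0

def adjust_and_prune(parameter_vector, penalty, rejection_threshold):
    """Shrink every weight, then keep only entries whose absolute
    adjusted weight is at or above the rejection threshold."""
    adjusted = {name: l1_shrink(w, penalty) for name, w in parameter_vector.items()}
    return {name: w for name, w in adjusted.items() if abs(w) >= rejection_threshold}

# Hypothetical weights for four NIV1-related binned features.
initial = {
    "NIV1-1-3": 0.75,
    "NIV1-2-5": -0.5,
    "NIV1-3-17": 0.125,
    "NIV1-1-4": -0.0625,
}
retained = adjust_and_prune(initial, penalty=0.125, rejection_threshold=0.25)
```

In this example only NIV1-1-3 and NIV1-2-5 survive, mirroring the outcome described for the NIV1-related features; a pruned feature could still be reintroduced in a later learning iteration.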
  • each binning transformation itself is applied to a single variable.
  • the values of more than one input variable may be used together to map a given observation record to a single bin.
  • Such bins may be referred to herein as multi-variable bins, and the corresponding feature transformations may be referred to herein as multi-variable quantile binning transformations.
  • different combinations of bin counts may be assigned to each of the input variables to produce multi-variable binned features concurrently during a model's training phase.
  • FIG. 58 illustrates examples of concurrent multi-variable quantile binning transformations that may be implemented at a machine learning service, according to at least some embodiments.
  • three numerical input variables NIV1, NIV2 and NIV3 are identified as candidates to be grouped together for concurrent multi-variable binning in the depicted embodiment.
  • Respective decision trees 5810A and 5810B may be generated for binning decisions for the combination of the three variables, with respective bin-count combinations.
  • Decision tree 5810A represents the bin-count combination (c1, c2, c3) for the variables (NIV1, NIV2, NIV3) respectively. Given an observation record, the decision tree may be navigated based on the values of the three variables, with each level comprising decision nodes at which a particular one of the variables is checked to decide which node should be traversed next. Leaf nodes of the tree may correspond to the bins derived from the combination of all the grouped variables. For example, level L1 of tree 5810A may comprise c1 decision nodes, each representing one quantile subset of the values of NIV1.
  • c2 decision nodes for values of NIV2 may be generated at level L2, each representing a combination of NIV1-based binning and NIV2-based binning.
  • c3 leaf nodes may be generated, each representing a multi-variable bin and a corresponding binned feature.
  • a total of (c1*c2*c3) bins may be generated with corresponding binary indicator variables.
  • the leaf nodes of tree 5810A are labeled Bin123-1-1 through Bin123-1-m, where m is the product of c1, c2 and c3.
  • Bin123-k-q would represent the qth leaf node for the kth tree used for binning variables NIV1, NIV2 and NIV3.
  • Any given observation record may be mapped to a particular one of the leaf nodes, based on the values of NIV1, NIV2 and NIV3 in that observation record.
  • the binary indicator variable for that leaf node may be set to 1 for the observation record, while other indicator variables may all be set to zero.
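Navigating the decision tree level by level amounts to combining the per-variable bin numbers into a single leaf index. The sketch below assumes precomputed quantile boundaries for each variable; the boundary values, the function names, and the zero-based leaf numbering are all hypothetical.

```python
from bisect import bisect_right

def leaf_index(record, boundaries_per_variable):
    """Walk one level per variable: at each level, pick the child that
    matches the variable's bin, yielding an index in 0 .. c1*c2*c3 - 1."""
    index = 0
    for value, boundaries in zip(record, boundaries_per_variable):
        bin_count = len(boundaries) + 1
        index = index * bin_count + bisect_right(boundaries, value)
    return index

def leaf_count(boundaries_per_variable):
    """Number of multi-variable bins: the product of the bin counts."""
    total = 1
    for boundaries in boundaries_per_variable:
        total *= len(boundaries) + 1
    return total

# Hypothetical boundaries giving bin counts (c1, c2, c3) = (3, 2, 4).
boundaries_123 = [[10, 20], [5], [100, 200, 300]]

record = (15, 7, 250)  # values of (NIV1, NIV2, NIV3)
leaf = leaf_index(record, boundaries_123)
indicators = [1 if i == leaf else 0 for i in range(leaf_count(boundaries_123))]
```

A second tree with a different bin-count combination would simply use a different list of boundaries, producing its own set of indicator variables for the same record.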
  • multi-variable binning may also be performed concurrently with different combinations of bin counts for a given variable set. For example, using a different combination of bin counts (c4, c5, c6), a second decision tree 5810B may be generated concurrently for the (NIV1, NIV2, NIV3) combination. Once again, the number of bins/features at the leaf nodes is equal to the product of the bin counts: thus, in FIG. 58, the leaf nodes of tree 5810B are labeled Bin123-2-1 through Bin123-2-n, where n is (c4*c5*c6).
  • any desired number of decision trees for respective multi-variable concurrent binning transformations may be used in various embodiments.
  • the use of multiple variables for grouped quantile binning as shown in FIG. 58 may allow a wider variety of non-linear relationships to be captured than may be possible using single-variable binning.
  • Similar kinds of approaches to limiting the parameter vector size may be used with multi-variable concurrent quantile binning as were discussed above with reference to single-variable binning in various embodiments.
  • regularization and/or techniques involving quantile-boundary estimation for the weights assigned to the binned features may be employed in at least some embodiments.
  • multi-variable concurrent binning transformations as well as single-variable concurrent binning transformations may be used within a given training phase of a model.
  • Single-variable concurrent binning of the type illustrated in FIG. 57 may be considered one variant of the more general multi-variable binning technique, with a simple decision tree comprising only leaf nodes (plus a root node representing the start of the binning decision procedure).
  • some number of groups of variables may be selected for concurrent binning. Some of the groups may comprise just one variable, while other groups may comprise multiple variables.
  • FIG. 59 illustrates examples of recipes that may be used for representing concurrent binning operations at a machine learning service, according to at least some embodiments.
  • the machine learning service may support a recipe language in which a wide variety of feature transformation operations may be indicated in user-friendly syntax, and such recipes may be re-used for different data sets as needed.
  • Recipes corresponding to concurrent quantile binning transformations such as the single-variable concurrent binning illustrated in FIG. 57, as well as the multi-variable concurrent binning illustrated in FIG. 58, may be generated and stored within the MLS repository in the embodiment depicted in FIG. 59.
  • the outputs section of recipe 5902A corresponds to the concurrent binning transformations of FIG. 57, with the name of the input variable and the bin count indicated for each transformation.
  • concurrent single-variable quantile binning transformations with bin counts of 10, 100, and 1000 are to be performed for NIV1, with bin counts of 20, 40, 80 and 160 for NIV2, and with bin counts of 5, 25 and 625 for NIV3.
  • the outputs section of recipe 5902B indicates concurrent multi-variable quantile binning transformations (with the "MV" in the token "MV quantile bin" standing for "multiple variable") to be performed on specified groups of variables.
  • the first such transformation is to be applied to NIV1 and NIV2 together, with NIV1 values mapped to 10 bins and NIV2 values also mapped to 10 bins (as indicated by the "10X10"), thereby creating 100 bins for the combination.
  • a second multi-variable binning transformation is to be performed concurrently for NIV1 and NIV2, with bin counts of 100 for NIV1 and 100 for NIV2, resulting in 10000 bins overall.
  • a third multi-variable binning transformation is to be performed on NIV1 and NIV3 together, with respective bin counts of 100 for NIV1 and 20 for NIV3.
  • Single-variable quantile binning transformations may also be indicated using the MV quantile bin token in some embodiments, specifying a group that has just one variable.
  • the "quantile bin" token shown in recipe 5902A may be used for both single-variable and multi-variable binning transformations, and the parameters associated with the token may be used to determine whether single-variable or multi-variable binning is to be performed.
  • Recipes similar to 5902A or 5902B may be produced by a model generator in some embodiments, and stored in an MLS artifact repository for possible re-use on similar types of machine learning problems.
  • a client of the machine learning service may explicitly request concurrent quantile binning, and may provide recipes that specify the attributes or properties of such transformations (e.g., the groups of one or more variables to be binned concurrently, the number of concurrent binning transformations for each group, the bin counts, etc.).
  • the process of generating or training a model may be initiated at the MLS in response to a programmatic request from a client, e.g., via an API or a web-based console.
  • FIG. 60 illustrates an example of a system in which clients may utilize programmatic interfaces of a machine learning service to indicate their preferences regarding the use of concurrent quantile binning, according to at least some embodiments.
  • a client 164 may submit a model creation or training request 6010 via a programmatic interface 6062.
  • the client request may indicate a data source 6020 whose observation records are to be used to train a model to predict values of one or more target variables 6022 indicated in the request.
  • the request may include a "concurrent binning" parameter 6024, which may be set to "true" if the use of concurrent quantile binning is acceptable to the client.
  • Clients that do not want concurrent quantile binning to be used may set such a parameter to "false" in such embodiments.
  • the default setting for concurrent binning may be "true", so that the MLS may implement concurrent quantile binning for selected input variables that are identified as suitable candidates even if the client does not indicate a preference.
  • instead of or in addition to setting a value for the concurrent binning parameter, clients may indicate or include a recipe that includes concurrent binning transformations in their model creation request 6010.
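The request fields described above can be pictured as a small data structure. The sketch below is purely illustrative: the class, field names, and decision rule are hypothetical and do not represent the service's actual programmatic interface.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ModelCreationRequest:
    """Hypothetical shape of a model creation or training request."""
    data_source: str
    target_variables: List[str]
    concurrent_binning: bool = True   # default "true" when the client states no preference
    recipe: Optional[str] = None      # optional recipe text supplied by the client

def may_use_concurrent_binning(request):
    """Assumed rule: a supplied recipe or an enabled parameter permits the technique."""
    return request.recipe is not None or request.concurrent_binning

# Client that states no preference.
default_request = ModelCreationRequest("records-source", ["target"])

# Client that opts out explicitly.
opt_out_request = ModelCreationRequest("records-source", ["target"], concurrent_binning=False)
```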
  • the client request 6010 may be received by a request/response handler 6042 of the machine learning service, and a corresponding internal request may be transmitted to a model generator 6080.
  • the model generator may also be referred to herein as a model trainer, a feature processing manager, or a feature transformation manager.
  • Model generator 6080 may identify one or more candidate variables of the observation records for which concurrent quantile binning is to be performed. In some embodiments, the model generator 6080 may consult the MLS best practices knowledge base 122 to determine the attributes to be used for concurrent binning: e.g., if/how multiple variables should be grouped for multi-variable quantile binning, the bin counts that should be used, and so on.
  • the model generator 6080 may be able to identify earlier-generated recipes (e.g., in the knowledge base or in the MLS artifact repository 120) which include concurrent quantile binning transformations that were used successfully for similar models to the one whose creation is requested by the client. Such preexisting recipes may be used to select the concurrent binning transformations to be applied in response to request 6010.
  • a k-dimensional tree (k-d tree) representation of a set of observation records may be generated, e.g., with the k dimensions representing a selected set of variables.
  • the attributes of the concurrent binning transformations to be applied to one or more of the selected set of variables may be based at least in part on an examination of such a k-d tree in such embodiments.
  • one or more training jobs 6068 that include the use of concurrent quantile binning may be generated and scheduled.
  • a training job 6068 may include preprocessing tasks 6070 that convert raw input variables into numeric values that can then be used for binning.
  • Such pre-processing conversions may, for example, include mapping of one or more selected categorical variables to real numbers, and/or domain-specific transformations (e.g., transformations that map raw audio data, graphics data, or video data into real numbers suitable for binning).
  • an iterative learning procedure may be used to train the model, with alternating phases of expanding the model's parameter vector (e.g., by adding parameters for more binned features as well as un-binned features as more learning iterations are completed) and contracting the parameter vector (e.g., using the pruning technique described earlier).
  • parameter vector expansions 6072 may result in a rapid growth in the amount of memory needed, and an aggressive approach to pruning may therefore be required during parameter vector contractions 6072.
  • Attributes of the optimization technique(s) (such as regularization) used for pruning may be adjusted accordingly, e.g., so that the weights for features that are identified as less significant to model predictions are reduced more quickly.
  • the fraction of parameters that are eliminated or pruned during any particular iteration may be increased to implement more aggressive parameter vector size reductions, the triggering conditions for pruning may be modified so that pruning is performed more frequently, and so on. It is noted that although parameters may be removed from the parameter vector in many scenarios, at least in some embodiments it may sometimes be the case that no parameters are eliminated from the parameter vector during the training phase. Thus, the use of concurrent quantile binning transformations of the kind described herein does not require the pruning of parameters.
  • a representation of the model may be stored in the artifact repository 120 and an identifier 6082 of the trained model may be provided to the client via the programmatic interface 6062.
  • an indication (such as a recipe) of the concurrent quantile binning transformations performed may also be provided to the client 164.
  • the client may eventually submit a model execution request 6054, and post-training-phase production runs 6058 of the model may be scheduled by a model execution manager 6032.
  • FIG. 61 is a flow diagram illustrating aspects of operations that may be performed at a machine learning service at which concurrent quantile binning transformations are implemented, according to at least some embodiments.
  • an indication of a data source from which unprocessed observation records are to be obtained to generate a model may be received at a machine learning service of a provider network, e.g., via a client request submitted via a programmatic interface.
  • the machine learning service may determine that a linear model whose predictions are to be based on real-valued weights (and/or linear combinations of more complex parameters) assigned to features derived from raw values of the observation records' variables is to be generated.
  • a component of the machine learning service such as a model generator may identify one or more unprocessed variables as candidates for concurrent quantile binning transformations (element 6104).
  • the candidates may be identified based on any of a number of different factors in different embodiments, such as an analysis of the distributions of the variables' raw values in a sample of observation records, a default strategy for performing concurrent binning, and so on.
  • one or more groups of candidates may be identified for multi-variable concurrent binning transformations.
  • raw values of one or more variables of the observation records may be mapped to real numbers in a pre-processing step. For example, variables comprising audio, video, or graphics content may be mapped to real numbers using domain-specific mapping algorithms, or some types of categorical variables or text tokens may be mapped to real numbers.
  • a concurrent binning plan may be generated in the depicted embodiment (element 6107).
  • the attributes or properties of such plans may include, for example, the number of distinct quantile binning transformations to be implemented during a single training phase and the bin counts selected for each such transformation.
  • the sequence in which the variable values are to be checked e.g., which variable is to be examined at successive levels of the decision trees to be used for binning, similar to the trees illustrated in FIG. 58
  • the model generator may utilize a knowledge base of best practices to help generate the concurrent binning plans in some embodiments, e.g., by looking up recipes that were used successfully in the past for the same problem domain (or similar problem domains) as the model being generated.
  • Initial weights for the features obtained at least in part as a result of implementing the concurrent binning plans may be stored in a parameter vector in the depicted embodiment.
  • the weights may subsequently be adjusted, e.g., using L1 or L2 regularization or other optimization techniques (element 6113).
  • At least some of the parameter vector entries may be removed based on the adjusted weights in some embodiments (element 6116). For example, entries whose weights fall below a rejection threshold may be removed.
  • an efficient quantile boundary estimation technique similar to that discussed in the context of FIG. 52 and FIG.
  • the trained model may be used to generate predictions on production data and/or test data (element 6119). That is, the parameters or weights assigned to the retained features (e.g., some number of binned features and/or some number of non-binned features that have not been pruned) may be used to obtain the predictions.
  • Concurrent quantile binning may be used for a wide variety of supervised learning problems, including problems that can be addressed using various types of generalized linear models in different embodiments. Concurrent quantile binning transformations similar to those described above may also be used for unsupervised learning, e.g., in addition to or instead of being used for supervised learning in various embodiments. In one embodiment, for example, at least some of the variables of an unlabeled data set may be binned concurrently as part of a clustering technique.
  • model execution results may not always be straightforward, especially if the results are presented simply in text format, e.g., as one or more tables of numbers.
  • quality-related metrics such as accuracy, false positive rate, false negative rate and the like
  • interpretation-related settings such as cutoff values or boundaries between classes in the case of classification models
  • the MLS may provide support for an interactive graphical interface.
  • an interactive graphical interface which may for example be implemented via a collection of web sites or web pages (e.g., pages of a web-based MLS console), or via standalone graphical user interface (GUI) tools, may enable users of the MLS to browse or explore visualizations of results of various model executions (such as various post-training phase evaluation runs, or post-evaluation production runs).
  • the interface may allow users to change one or more interpretation-related settings dynamically, learn about various quality metrics and their inter-relationships, and prioritize among a variety of goals in various embodiments.
  • the interface may comprise a number of control elements (e.g., sliders, knobs, and the like) that can be used by MLS clients to change the values of one or more prediction-related settings, and to observe the consequences of such changes in real time.
  • continuous-variation control elements such as sliders that emulate smooth changes to underlying variables or settings may be used, while in other implementations, discrete-variation control elements such as knobs that allow one of a small set of values to be selected may be used.
  • the interface may allow clients to "reverse-engineer" the impact of certain types of prediction-related choices: for example, a client may use a slider control to indicate a desired change to a prediction quality result metric (e.g., the false positive rate for a particular evaluation run of a binary classification model) and view, in real time, the cutoff value that could be used to obtain the desired value of the result metric.
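Working backwards from a desired false positive rate to a cutoff can be sketched as a search over candidate cutoffs. The example assumes a binary classifier whose scores at or above the cutoff are predicted positive; the function names and the evaluation data are hypothetical.

```python
def false_positive_rate(scores, labels, cutoff):
    """Fraction of true negatives (label 0) scored at or above the cutoff."""
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not negatives:
        return 0.0
    return sum(1 for s in negatives if s >= cutoff) / len(negatives)

def cutoff_for_target_fpr(scores, labels, target_fpr):
    """Smallest observed score usable as a cutoff without exceeding the
    target false positive rate, or None if no observed score qualifies."""
    for cutoff in sorted(set(scores)):
        if false_positive_rate(scores, labels, cutoff) <= target_fpr:
            return cutoff
    return None

# Hypothetical saved results of one evaluation run.
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 1, 1]

cutoff_quarter = cutoff_for_target_fpr(scores, labels, 0.25)
cutoff_zero = cutoff_for_target_fpr(scores, labels, 0.0)
```

Because the search reads only the saved scores and labels, the evaluation results themselves are never modified while the user explores.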
  • Clients may also be presented with visual evidence of the relationships between different prediction quality metrics and thresholds - e.g., as a client changes the sensitivity level for a given evaluation run, the impact of that change on other metrics such as precision or specificity may be shown.
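The coupling between quality metrics can be seen by recomputing them at different cutoffs. The sketch below uses the standard definitions of sensitivity, specificity, and precision; the function name and the evaluation data are hypothetical.

```python
def metrics_at_cutoff(scores, labels, cutoff):
    """Compute sensitivity, specificity and precision when scores at or
    above the cutoff are predicted positive."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= cutoff
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif label == 0:
            tn += 1
        else:
            fn += 1
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
    }

# Hypothetical saved results of one evaluation run.
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 1, 1]

at_half = metrics_at_cutoff(scores, labels, 0.5)
at_lower = metrics_at_cutoff(scores, labels, 0.35)
```

Lowering the cutoff from 0.5 to 0.35 raises sensitivity in this data set while precision changes as well, which is the kind of relationship the interface is meant to make visible.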
  • Using such interfaces that enable "what-if" explorations of various changes, it may become easier for a user of the MLS to select settings such as classification cutoffs, the ranges of variable values to which a model's predictions should be restricted in subsequent runs of the model, and the like, to meet that user's particular business objectives (e.g., to keep false positives low, or to keep accuracy high).
  • a user may vary a number of different settings or metrics and observe the resulting trends, without affecting any of the saved results of the evaluation run.
  • the user may submit a request via the interactive interface in some embodiments to save a respective target value of one or more prediction-related settings that are to be used for subsequent runs of the model.
  • the dynamic display of the effects of various possible settings changes may be made possible in various embodiments by efficient communications between the back-end components of the MLS (e.g., various MLS servers where the model execution results are obtained and stored, and where the impacts of the changes are rapidly quantified) and the front-end or client-side devices (e.g., web browsers or GUIs being executed at laptops, desktops, smart phones and the like) at which the execution results are displayed and the interactions of the clients with various control elements of the interface are first captured.
  • an indication of the change may be transmitted rapidly to a back-end server of the MLS in some embodiments.
  • the back-end server may compute the results of the change on the data set to be displayed quickly, and transmit the data necessary to update the display back to the front-end device.
  • when a continuous-variation control such as a slider is used by a client to transition from one value to another, multiple such interactions between the front-end device and the back-end server may occur within a short time in some implementations (e.g., updates may be computed and displayed several times a second) to simulate continuous changes to the display.
  • the logic required for calculating at least some of the impacts of client-indicated changes may be incorporated into the interactive interface itself, or at other subcomponents of the client-side device used for the graphical displays.
  • FIG. 62 illustrates an example system environment in which a machine learning service implements an interactive graphical interface enabling clients to explore tradeoffs between various prediction quality metric goals, and to modify settings that can be used for interpreting model execution results, according to at least some embodiments.
  • one or more training data sets 6202 to be used for a model may be identified, e.g., in a training request or a model generation request submitted by a client of the MLS.
  • Model generator 6252 may use the training data sets 6202 to train a model 6204 to predict values of one or more output variables for an observation record, based on the values of various input variables (including, for example, results of applying feature transformations of the kinds described earlier to raw input data).
  • one or more evaluation runs may be performed in the depicted embodiment using observation records (which were not used to train the model) for which the values of the output variable(s) are known, e.g., to determine how good the model's predictions are on observations that it has not examined during training.
  • Evaluation data set 6212 may comprise such observation records in system 6200.
  • the trained model 6204 may be provided the evaluation data set 6212 as input by model executor 6254A (e.g., a process running at one of the MLS servers of server pools 185 shown in FIG. 1). Respective jobs (similar to the jobs illustrated in FIG. 4) may be scheduled for training the model and for evaluating the model in at least some embodiments.
  • At least some of the results of the evaluation may be packaged for display to the client or user on whose behalf the evaluation was conducted in the depicted embodiment.
  • a set of evaluation run result data 6222 may be formatted and transmitted for an interactive graphical interface 6260 (e.g., a web browser, or a custom GUI tool that may have been installed on a client computing device).
  • the result data set 6222 may include, for example, some combination of the following: statistical distributions 6232 of one or more output variables of the evaluation run, one or more currently selected or MLS-proposed values of prediction interpretation thresholds (PITs) 6234 (e.g., cutoffs for binary classification), and/or values of one or more quality metrics 6236 (e.g., accuracy, false positive rate, etc.) pertaining to the evaluation run.
  • instructions or guidelines on how the result data is to be displayed may also be transmitted from a back-end MLS server to the device at which the graphical view of the data is to be generated.
  • the interactive graphical interface 6260 may include various controls allowing clients to view the results of the evaluation during a given interaction session, experiment with various prediction settings such as classification cutoffs and the like, and observe the tradeoffs associated with making changes to such settings. Examples of components of the interactive graphical display, as well as various controls that may be used in different embodiments are shown in FIG. 63 - FIG. 69.
  • the client to whom the evaluation result data is displayed may use one or more of the controls to indicate desired or target values for one or more settings.
  • the selection of target values may involve several client interaction iterations 6241 during a given session, in which for example, a client may make one change, observe the impact of that change, undo that change, then make another change and view its impact, and so on.
  • the client may select a particular setting such as a target value for a prediction interpretation threshold (PIT) 6242.
  • the target value selected may differ from the PIT value 6234 that may have been initially proposed by the MLS in at least some scenarios, although the client may in some cases decide not to change the proposed PIT value.
  • the client-selected PIT value 6242 may be stored in a repository of the MLS, e.g., artifact repository 120 of FIG. 1.
  • the saved PIT value 6242 may be used for generating results of one or more subsequent runs of trained model 6204, e.g., runs that may be performed using a model executor 6254A on post-evaluation or production data set 6214.
  • the same model executor 6254A (e.g., the same back-end MLS server) may be used for both the evaluation run and the post-evaluation runs.
  • FIG. 63 illustrates an example view of results of an evaluation run of a binary classification model that may be provided via an interactive graphical interface, according to at least some embodiments.
  • the results may be displayed in a web page 6300 that forms part of a browser-based console for interactions with the machine learning service.
  • a similar view with interactive controls may be provided using a standalone GUI (e.g., a thin client program or a thick client program executing at a customer's computing device such as a laptop, desktop, tablet, or smart phone) which does not require the use of a web browser.
  • Message area 6302 of web page 6300 indicates that the data being displayed corresponds to a particular evaluation run of a model ("M-1231") in which a particular data set "EDS1" was used as input to the model.
  • M-1231 is a binary classification model in the depicted example - i.e., a model whose goal is to classify observation records of the evaluation data set EDS1 into one of two classes, such as classes simply labeled "0" and "1".
  • the message area also includes explanatory text pertaining to graph G1 and the use of the slider control S1.
  • Graph G1 illustrates the distribution of an output variable labeled "Score": that is, the X-axis represents values of Score while the Y-axis indicates the number of observation records of the evaluation data set EDS1.
  • Each of the observation records is placed in one of the two classes "0" and "1" based on the Score values and a class boundary called a "cutoff". For example, if the Score values are real numbers in the range from 0 to 1, and the cutoff value is set to 0.5, an observation record of EDS1 with a Score of 0.49 would be placed in the "0" class, while an observation record with a Score of 0.51 would be placed in the "1" class in the depicted scenario.
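  • The cutoff rule in this example can be written as a one-line function. The sketch below is illustrative only; the function name is hypothetical, and placing a Score exactly equal to the cutoff in the "1" class is an assumption, since the text does not specify how ties are handled.

```python
def classify(score: float, cutoff: float = 0.5) -> str:
    """Place a record in class "1" if its Score reaches the cutoff, else in class "0".

    Treating score == cutoff as class "1" is an assumption of this sketch.
    """
    return "1" if score >= cutoff else "0"


# The two records from the example: Scores of 0.49 and 0.51 with a cutoff of 0.5.
low_class = classify(0.49)
high_class = classify(0.51)
```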
  • the cutoff value for a binary classification represents one example of a prediction interpretation threshold (PIT); other prediction interpretation thresholds may be used in various types of machine learning problems.
  • the boundaries of the sub-range of an output variable that represent predictions within an acceptable mean-squared error range (e.g., mean-squared-error values between X and Y) may serve as prediction interpretation thresholds.
  • the boundary values for one or more output variables that are used to decide which of N classes a particular observation record is to be placed in (or whether the observation record should be considered unclassified) may represent the prediction interpretation thresholds.
  • Each of the observation records in EDS1 may include a label "0" or "1" in the illustrated example, indicating the "ground truth" regarding the observation record. These labels are used to divide the observation records for plotting graph G1 - e.g., records whose label is "0" are indicated using the curve "Records labeled '0'", while the remaining records are indicated using the curve "Records labeled '1'". Within each of the two groups, given a value of 0.5 for the cutoff (as indicated in element 6350 of page 6300), some observation records are placed in the correct class, while others are placed in the incorrect class.
  • If the ground truth value is "0" for a given observation record, and the Score is less than the cutoff, a correct classification result called a "true negative" results - that is, the correct value of the label is "0", and the class selected using the cutoff matches the correct value. If the ground truth value is "1" and the Score is higher than the cutoff, a correct classification called a "true positive" results. If the ground truth value is "0" and the Score is higher than the cutoff, an incorrect classification called a "false positive" results. Finally, if the ground truth value is "1" and the Score is lower than the cutoff, the observation record is placed in the "0" class, and an incorrect classification called a "false negative" results.
  • The four types of decisions that are possible for a given observation record in a binary classification problem (true positive, true negative, false positive and false negative) may be referred to as respective "prediction interpretation decisions" herein. Other types of prediction interpretation decisions may be made when other types of machine learning models are used.
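  • The four prediction interpretation decisions can be derived mechanically from a record's ground-truth label, its Score, and the cutoff. The following sketch shows one way to do so; the function names and sample records are hypothetical, and a Score equal to the cutoff is assumed to fall in the "1" class.

```python
from collections import Counter
from typing import Iterable, Tuple


def interpretation_decision(label: int, score: float, cutoff: float) -> str:
    """Name the decision for one record: true/false positive/negative."""
    predicted = 1 if score >= cutoff else 0  # tie handling is an assumption
    if predicted == 1:
        return "true positive" if label == 1 else "false positive"
    return "true negative" if label == 0 else "false negative"


def count_decisions(records: Iterable[Tuple[int, float]], cutoff: float) -> Counter:
    """Tally the four decision types over (label, score) pairs."""
    return Counter(interpretation_decision(label, score, cutoff) for label, score in records)


# Hypothetical (ground-truth label, Score) pairs.
sample_records = [(0, 0.2), (0, 0.7), (1, 0.9), (1, 0.4)]
counts_at_half = count_decisions(sample_records, cutoff=0.5)
```

  • Re-running `count_decisions` with a different cutoff is the computation that a slider movement would trigger in the kind of interface described here.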
  • the percentages and/or the actual counts of the observation records in the evaluation data set corresponding to the four types of prediction interpretation decisions may be shown in web page 6300.
  • in the depicted example, 4502 or 45% of the observation records of EDS1 correspond to true negatives, 698 or 7% are false negatives, 1103 or 11% are false positives, and the remaining 3698 records of EDS1, or 37%, are true positives.
  • web page 6300 may also indicate at least some metrics in a tabular form in the depicted embodiment.
  • region 6351 of the web page may indicate the total number of observation records of EDS1, the cutoff value, the number/percentage of records placed in the "1" class (the sum of the false positives and the true positives) and in the "0" class (the sum of the true negatives and the false negatives), the number/percentage of records for which the classification decision was made correctly (the sum of the true negatives and true positives) and the number/percentage of records for which an incorrect decision was made (the sum of the false positives and the false negatives).
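  • The tabular metrics of region 6351 are simple sums of the four decision counts. The sketch below applies those sums to the counts quoted for FIG. 63 (4502 true negatives, 698 false negatives, 1103 false positives, 3698 true positives); the function name and dictionary keys are hypothetical.

```python
def summarize_evaluation(tn: int, fn: int, fp: int, tp: int) -> dict:
    """Compute the region-6351-style summary from the four decision counts."""
    total = tn + fn + fp + tp

    def pct(n: int) -> int:
        # Percentage of all records, rounded to the nearest whole number.
        return round(100 * n / total)

    return {
        "total": total,
        "placed_in_1": fp + tp,   # false positives + true positives
        "placed_in_0": tn + fn,   # true negatives + false negatives
        "correct": tn + tp,
        "incorrect": fp + fn,
        "pct_correct": pct(tn + tp),
        "pct_incorrect": pct(fp + fn),
        "pct_by_decision": {"tn": pct(tn), "fn": pct(fn), "fp": pct(fp), "tp": pct(tp)},
    }


summary = summarize_evaluation(tn=4502, fn=698, fp=1103, tp=3698)
```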
  • a number of the graphic and/or text elements may be dynamically re-drawn or updated in response to user interaction.
  • a user granted the appropriate permissions may use a mouse (or, in the case of touch-screen interfaces, a stylus or a finger) to manipulate the slider control S1.
  • S1 may be moved to the left (as indicated by arrow 6310) to decrease the cutoff value, or to the right (as indicated by the arrow 6311) to increase the cutoff value.
  • as the cutoff value is changed, the number of observation records that fall into some or all of the four decision groups may change (as illustrated in FIG. 64a and FIG. 64b).
  • the values of the metrics shown in region 6351 may also be dynamically updated as the cutoff value is changed. Such dynamic updates may provide a user an easy-to-understand view of the impact of changing the cutoff value on the metrics that are of interest to the user.
  • users may be able to change the set of metrics whose values are displayed and updated dynamically, e.g., either the metrics whose values are shown by default or "advanced" metrics that are displayed as a result of clicking on link 6354.
  • other visual cues such as color coding, lines of varying thickness, varying fonts etc. may be used to distinguish among the various parts of Graph G1, Bar B1, region 6351 etc.
  • the machine learning service may save a cutoff value (or other prediction interpretation threshold values) currently associated with a given model in a repository.
  • the initial proposed value of the cutoff may be selected by the MLS itself, and this value (e.g., 0.5 in the example scenario shown in FIG. 63) may be stored as the default.
  • An authorized user may use an interface such as web page 6300 to explore the impact of changing the cutoff, and then decide that a new value of the cutoff should be used for one or more subsequent runs (e.g., either additional evaluation runs, or post-evaluation production runs) of the model.
  • the MLS may be instructed to save a new value of the cutoff for future runs using the "Save new cutoff" button of button control set 6352 of web page 6300.
  • users may be able to change the class labels (such as "0" and "1") to more meaningful strings, e.g., using the "Edit class labels” button control.
  • the cutoff may be reset to its default value using the "Reset cutoff" button control.
  • a user who is dissatisfied with the evaluation results being displayed may submit a request to re-evaluate the model or re-train the model via web page 6300, e.g., using button controls "Re-evaluate model” or "Re-train model” shown in button control set 6352.
  • Some of the requests may require further interaction with the client for the MLS back-end to determine additional parameters (e.g., a new evaluation data set may be specified for a re-evaluation).
  • a different web page may be displayed in response to a client's click on one of the buttons 6352 in the depicted embodiment to enable the indication of the additional parameters.
  • Other types of controls than those shown in FIG.
  • continuous-variation control elements may be implemented to enable clients to change settings such as cutoff values smoothly, while in other embodiments, discrete-variation control elements may be used that allow users to choose from among a few discrete pre-defined values.
  • FIG. 64a and 64b collectively illustrate an impact of a change to a prediction interpretation threshold value, indicated by a client via a particular control of an interactive graphical interface, on a set of model quality metrics, according to at least some embodiments.
  • FIG. 64a illustrates the results of an evaluation run of a binary classification model with the cutoff set to a value C1. With this cutoff value, as indicated in graph G2 and bar B2, true negative decisions are made for 4600 observation records of an example evaluation data set (46% of the total), while true positive decisions are made for 3400 observation records. 700 decisions are false negatives, and 1300 are false positives.
  • a client may assign different priorities or different importance levels to various quality metrics pertaining to a model. For example, if the negative business consequences of false positive classifications are much higher than the negative business consequences of false negatives, the client may decide that the interpretation threshold(s) for the model should be changed in a direction such that, in general, fewer false positive decisions would be likely to occur.
  • the e-business operator may decide that if a tradeoff is to be made between false negatives and false positives, they would prefer more false negatives than false positives.
  • the opposite tradeoff may be preferable in scenarios in which the real-world consequences of false negatives are much higher - e.g., in tumor detection applications in which treatment for a possible tumor may be denied to a patient whose observation is incorrectly classified as a negative.
  • the client has determined that the rate of false positives is too high, and has therefore decided to increase the cutoff value from C1 to C2 using slider S1, as indicated by arrow 6434.
  • the impact of the increase is illustrated in FIG. 64b.
  • the visual properties e.g., shadings, colors etc.
  • the number of false positives decreases as intended, falling from 1300 (in FIG. 64a) to 500 (in FIG. 64b). While the number of true negatives remains unchanged at 4600, the number of false negatives increases substantially, from 700 to 1800.
  • the number of true positives decreases somewhat as well, from 3400 to 3100.
  • the dynamic visualization of the effects of changing the cutoff may help the MLS client make more informed decisions in various embodiments than may have been possible if only text representations of the various metrics were provided.
  • providing only text representations may make it harder to decide on a particular target for a cutoff or other similar prediction interpretation threshold, because it may be much harder in the text-only scenario to understand the rates of change of the various metrics around specific values of the threshold.
  • small changes to the cutoff value may have much larger impacts on the false positive rates or false negative rates in some sub-ranges of the Score values than others, and such higher-order effects may be hard to appreciate without dynamically updated graphs such as those shown in FIG. 64a and 64b.
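  • The rate at which false positives and false negatives change around a given cutoff can be made visible by sweeping the cutoff over a range of values. The following is a minimal sketch of such a sweep; the function name and sample data are hypothetical, and records whose Score equals the cutoff are assumed to be placed in the "1" class.

```python
from typing import Iterable, List, Sequence, Tuple


def sweep_cutoffs(
    records: Sequence[Tuple[int, float]], cutoffs: Iterable[float]
) -> List[Tuple[float, int, int]]:
    """Return (cutoff, false positives, false negatives) for each candidate cutoff."""
    rows = []
    for cutoff in cutoffs:
        # A false positive is a record labeled 0 whose Score reaches the cutoff.
        fp = sum(1 for label, score in records if label == 0 and score >= cutoff)
        # A false negative is a record labeled 1 whose Score falls below the cutoff.
        fn = sum(1 for label, score in records if label == 1 and score < cutoff)
        rows.append((cutoff, fp, fn))
    return rows


# Hypothetical (ground-truth label, Score) pairs.
sweep_records = [(0, 0.1), (0, 0.45), (0, 0.55), (1, 0.5), (1, 0.6), (1, 0.9)]
sweep_rows = sweep_cutoffs(sweep_records, [0.4, 0.5, 0.6])
```

  • Plotting the rows of such a sweep, rather than listing them as text, is what lets a client see the sub-ranges of Score in which a small cutoff change has a large effect.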
  • FIG. 65 illustrates examples of advanced metrics pertaining to an evaluation run of a machine learning model for which respective controls may be included in an interactive graphical interface, according to at least some embodiments.
  • Much of the content displayed in FIG. 65 is identical to the content of web page 6300 of FIG. 63.
  • the main difference between FIG. 63 and FIG. 65 is that as a result of the user clicking on link 6354 of web page 6300, additional metrics (beyond those shown in region 6351) are now being displayed.
  • respective horizontal slider controls 6554 are shown for prediction quality metrics sensitivity (slider 6554A), specificity (slider 6554B), precision (slider 6554C) and F1 score (slider 6554D).
  • clients may be able to decide which metrics they wish to view and/or modify, either as part of the region 6351 displaying a default or core group of metrics, or in an advanced metrics region.
  • the metrics available for display and/or manipulation may vary depending on the type of model in various embodiments, and may include, among others: an accuracy metric, a recall metric, a sensitivity metric, a true positive rate, a specificity metric, a true negative rate, a precision metric, a false positive rate, a false negative rate, an F1 score, a coverage metric, an absolute percentage error metric, a squared error metric, or an AUC (area under a curve) metric.
  • clients may be able to use the interface to move metrics between the core metrics group and the advanced metrics group, and/or to define additional metrics to be included in one or both groups.
  • the combination of the sliders 6554A-6554D and slider S1 may be used by a client to visually explore the relationships between different metrics. For example, changing the cutoff using slider S1 may result in dynamic updates to the positions of sliders 6554A-6554D (as well as updates to the bar B1 and to region 6351), visually indicating how the cutoff value influences sensitivity, specificity, precision and the F1 score. Changing the position of any one of the sliders 6554A-6554D may result in corresponding real-time changes to S1, bar B1, and the remaining sliders 6554.
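  • The four advanced metrics tied to sliders 6554A-6554D all follow from the four decision counts, which is why moving one control can update the others. The sketch below uses the standard definitions of these metrics and applies them to the counts quoted for FIG. 64a; the function name is hypothetical, and returning 0.0 for an empty denominator is an assumption of this sketch.

```python
def _ratio(numerator: float, denominator: float) -> float:
    """Divide, returning 0.0 when the denominator is zero (a sketch-level convention)."""
    return numerator / denominator if denominator else 0.0


def quality_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Standard binary-classification metrics derived from the decision counts."""
    sensitivity = _ratio(tp, tp + fn)   # also called recall or true positive rate
    specificity = _ratio(tn, tn + fp)   # also called true negative rate
    precision = _ratio(tp, tp + fp)
    f1 = _ratio(2 * precision * sensitivity, precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Counts quoted for FIG. 64a: 3400 TP, 4600 TN, 1300 FP, 700 FN.
metrics_64a = quality_metrics(tp=3400, tn=4600, fp=1300, fn=700)
```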
  • clients may be able to change the layout of the various regions displayed in the interactive interface, e.g., by choosing the particular types of controls (sliders, knobs, etc.) to be used for different metrics, which metrics are to be directly modifiable using graphical controls and which metrics are to be shown in text format.
  • FIG. 66 illustrates examples of elements of an interactive graphical interface that may be used to modify classification labels and to view details of observation records selected based on output variable values, according to at least some embodiments.
  • the MLS or the client on whose behalf the model is trained and evaluated
  • the client may decide that more user-friendly names should be used for the classes.
  • the "Edit class labels" button may be clicked, and a smaller pop-up window 6605 may be displayed.
  • the user may enter new names for the labels, such as "Won't buy" (replacing the label "0") and "Will-buy" (replacing the label "1"), indicating that the model is classifying shoppers based on predictions about the likelihood that the shoppers will make a purchase (the "1" class) or will not make a purchase (the "0" class).
  • a number of other controls may be provided to users of the interactive graphical interface of the MLS in various embodiments.
  • clients may wish to examine the details of observation records for which a particular Score was computed by the model.
  • a user may mouse click at various points within graph G1 (e.g., at point 6604, corresponding to a Score of approximately 0.23), and the interface may respond by displaying a list 6603 of observation records with Score values close to that indicated by the clicked-at point.
  • Other types of interfaces, such as a fingertip or a stylus, may be used in other implementations.
  • a list 6603 of three observation records OR231142, OR4498 and OR3124 with corresponding links may be shown. If and when the client clicks on one of the identifiers of the observation records of the list, the values of various variables of that observation record may be displayed in another window or panel, such as OR content panel 6642 in the depicted example.
  • the values of input variables IV1, IV2, ..., IVn of observation record OR4498 may be shown as a result of a click on the corresponding link of list 6603 in the example illustrated in FIG. 66.
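  • The lookup behind list 6603 amounts to finding the observation records whose Score lies near the clicked value. A minimal sketch follows; the function name, the tolerance and limit parameters, and the Score values assigned to the record identifiers are all hypothetical, since the text gives only the identifiers and the approximate clicked Score of 0.23.

```python
from typing import Dict, List


def records_near_score(
    scores_by_id: Dict[str, float], clicked_score: float,
    tolerance: float = 0.01, limit: int = 3,
) -> List[str]:
    """Return up to `limit` record IDs whose Score is within `tolerance` of the click,
    ordered from the closest Score to the farthest."""
    nearby = [
        (abs(score - clicked_score), record_id)
        for record_id, score in scores_by_id.items()
        if abs(score - clicked_score) <= tolerance
    ]
    nearby.sort()
    return [record_id for _, record_id in nearby[:limit]]


# Hypothetical Scores for the record identifiers named in the example.
example_scores_by_id = {
    "OR231142": 0.229,
    "OR4498": 0.231,
    "OR3124": 0.235,
    "OR0001": 0.80,  # hypothetical record far from the clicked point
}
listed_ids = records_near_score(example_scores_by_id, clicked_score=0.23)
```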
  • FIG. 67 illustrates an example view of results of an evaluation run of a multi-way classification model that may be provided via an interactive graphical interface, according to at least some embodiments.
  • web page 6700 includes a message area 6702 indicating that the data being displayed corresponds to a particular evaluation run of a model ("M-1615") in which a particular data set "EDS3" was used as input to the model.
  • An enhanced confusion matrix 6770 for a 4-way classification is shown for the evaluation run. For four classes, "Class 1" through “Class 4", the actual or true populations (and corresponding actual percentages) are shown in the columns labeled 6772. These four classes may collectively be referred to herein as "non-default classes”.
  • the model "M-1615" categorizes observation records into five classes (the four non-default classes "Class 1" through "Class 4" as well as a default class labeled "None") based on at least two factors in the depicted embodiment: (a) predicted probabilities that any given observation record belongs to any of the four non-default classes and (b) a minimum predicted probability threshold (MPPT) for placing a record into a non-default class instead of the default class.
  • the default or proposed MPPT value may be set by the MLS to (1/(the number of non-default classes)) (e.g., for four non-default classes, the model would propose 1/4 or 25% as the MPPT).
  • the MPPT may thus be considered an example of a prediction interpretation threshold (PIT) for multi-way classification models.
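  • One plausible reading of the MPPT rule is sketched below: the record goes to the non-default class with the highest predicted probability, provided that probability reaches the MPPT, and to the default class otherwise. The function name, the choice of the highest-probability class, and the use of "at least" rather than "greater than" for the comparison are assumptions; only the default of 1/(number of non-default classes) comes from the text.

```python
from typing import Dict, Optional


def classify_multiway(
    probabilities: Dict[str, float],
    mppt: Optional[float] = None,
    default_class: str = "None",
) -> str:
    """Pick the most probable non-default class if it reaches the MPPT, else the default."""
    if mppt is None:
        # Proposed default: 1 / (number of non-default classes).
        mppt = 1 / len(probabilities)
    best_class = max(probabilities, key=probabilities.get)
    return best_class if probabilities[best_class] >= mppt else default_class


# Hypothetical predicted probabilities for one observation record.
example_probabilities = {"Class 1": 0.1, "Class 2": 0.6, "Class 3": 0.2, "Class 4": 0.1}
with_default_mppt = classify_multiway(example_probabilities)
with_strict_mppt = classify_multiway(example_probabilities, mppt=0.7)
```

  • Note that when the per-class probabilities sum to 1, the largest of them is never below 1/(number of classes), so under this reading the default class is only reached once the MPPT is raised above the proposed default.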
  • other metrics may be indicated using similar techniques as those illustrated in FIG. 63 - e.g., a set of core metrics pertaining to multi-way classification or a link to view advanced metrics may be provided in various embodiments.
  • users may be able to specify respective MPPTs for different classes and may be able to view the effects of those changes dynamically.
  • the matrix elements may be color coded - e.g., as a percentage gets closer to 100%, the corresponding element's color or background may be set closer to dark green, and as a percentage gets closer to 0%, the corresponding element's color or background may be set closer to bright red.
  • the MLS may provide an interactive graphical display to enable users to define or select exactly how prediction errors for regression models are to be defined, and/or to explore the distribution of the prediction errors for selected error tolerance thresholds.
  • FIG. 68 illustrates an example view of results of an evaluation run of a regression model that may be provided via an interactive graphical interface, according to at least some embodiments.
  • web page 6800 includes a message area 6802 indicating that the data being displayed corresponds to a particular evaluation run of a model ("M-0087") in which a particular data set "EDS7" was used as input to the model.
  • the client is provided several different options to select the error definition of most interest, and a slider S1 in region 6812 is provided to indicate the error tolerance threshold to be used for displaying error distributions in graph 6800.
  • the absolute value of the difference between the predicted value of the output variable and the true value has currently been selected as the error definition (as indicated by the selected radio button control in region 6804).
  • the slider S1 is currently positioned to indicate that errors with values no greater than 60 (out of a maximum possible error of 600 in view of the current error definition of region 6804) are tolerable.
  • the distribution of the acceptable predictions (i.e., predictions within the tolerance limit currently indicated by slider S1) and the out-of-tolerance predictions for different ranges of the true values is shown.
  • the boundaries between the acceptable predictions 6868 and the out-of-tolerance predictions 6867 may change.
  • If the client wishes to use a different definition of error, several choices are available. For example, by selecting the radio button in region 6806 instead of the button in region 6804, the client could define error as the (non-absolute) arithmetic difference between the true value and the predicted value, indicating that the direction of the predicted error is important to the client.
  • both the direction of the error and its value relative to the true value may be included in the error definition.
  • Some users may wish to indicate their own definitions of error, which may be done by selecting the radio button in region 6810 and clicking on the provided link.
  • the maximum error in the error tolerance slider scale of region 6812 may also be changed accordingly in at least some embodiments.
  • MLS clients may be able to select the most appropriate definitions of error for their particular regression problem, and also to determine (based on their error tolerance levels) the ranges of output values for which the largest and smallest amounts of error were predicted.
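  • The selectable error definitions and the tolerance slider can be sketched as interchangeable error functions plus a partition step. This is an illustration only: the function names and sample values are hypothetical, the sign convention (true value minus predicted value) is one reading of the text, and comparing the magnitude of a signed error against the tolerance is an assumption.

```python
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[float, float]  # (true value, predicted value)


def absolute_error(true_value: float, predicted: float) -> float:
    """Absolute value of the difference between predicted and true values."""
    return abs(predicted - true_value)


def signed_error(true_value: float, predicted: float) -> float:
    """Non-absolute difference, so the direction of the error is preserved."""
    return true_value - predicted


def relative_signed_error(true_value: float, predicted: float) -> float:
    """Signed error expressed relative to the true value (true value must be non-zero)."""
    return (true_value - predicted) / true_value


def split_by_tolerance(
    pairs: Sequence[Pair], error_fn: Callable[[float, float], float], tolerance: float
) -> Tuple[List[Pair], List[Pair]]:
    """Partition predictions into (acceptable, out-of-tolerance) lists."""
    acceptable, out_of_tolerance = [], []
    for true_value, predicted in pairs:
        within = abs(error_fn(true_value, predicted)) <= tolerance
        (acceptable if within else out_of_tolerance).append((true_value, predicted))
    return acceptable, out_of_tolerance


# Hypothetical (true, predicted) pairs with the tolerance of 60 from the example.
example_pairs = [(100.0, 130.0), (200.0, 290.0), (300.0, 260.0)]
ok_pairs, bad_pairs = split_by_tolerance(example_pairs, absolute_error, tolerance=60)
```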
  • Other types of interactive visualizations for regression models may also or instead be displayed in some embodiments.
  • FIG. 69 is a flow diagram illustrating aspects of operations that may be performed at a machine learning service that implements interactive graphical interfaces enabling clients to modify prediction interpretation settings based on exploring evaluation results, according to at least some embodiments.
  • a particular model M1 may be trained at a machine learning service, e.g., in response to a request received via a programmatic interface from a client.
  • the model may compute values of one or more output variables such as OV1 for each observation record of a given set of observation records.
  • an evaluation run ER1 may be conducted to obtain a respective OV1 value for each record of a given evaluation data set.
  • a data set DS1 representing at least a selected subset of results of the evaluation run ER1 may be generated for display via an interactive graphical display (element 6907).
  • the interactive display for which DS1 is obtained may include various control elements such as continuous-variation slider elements and/or discrete-variation elements that can be used to vary one or more prediction-related settings, such as classification cutoffs and/or various other types of prediction interpretation thresholds.
  • Any of a number of different data elements corresponding to ER1 may be included in data set DS1 for display, such as statistical distributions of OV1 or other output or input variables, one or more prediction quality metrics such as (in the case of a binary classification model evaluation) the number and/or percentage of true positives, false positives, true negatives and false negatives, as well as at least one proposed or default value of a prediction interpretation threshold.
  • the data set DS1 may be transmitted to a device (e.g., a client-owned computing device with a web browser or a standalone GUI tool installed) on which the graphical interface is to be displayed (element 6910) in the depicted embodiment.
  • a target value for a particular prediction interpretation threshold such as a cutoff value for binary classification (element 6913) may be determined.
  • the manipulations of the controls (which may be performed using a mouse, stylus, or a fingertip, for example) may be detected at the computing device where the graphics are being displayed, and may be communicated back to one or more other components (such as back-end servers) of the MLS in some embodiments, e.g., using invocations of one or more APIs similar to those described earlier.
  • indications of the manipulation of the controls need not be transmitted to back-end MLS servers; instead, some or all of the computations required to update the display may be performed on the device at which the graphical interface is displayed.
  • a change to one or more other elements of DS1, resulting from the manipulation of the control, may be computed (element 6916), and the corresponding changes to the display may be initiated in real time as the user moves the control element.
  • the changes to the position of a graphical control element such as a slider may be tracked as they are performed, and corresponding updated values of various metrics may be transmitted to the display device as quickly as possible, to give the user the impression of an instantaneous or near-instantaneous response to the manipulation of the graphical control element.
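  • The recomputation triggered by moving a cutoff control might be sketched as follows. This is a minimal Python illustration under stated assumptions: the function name `metrics_at_cutoff`, the rule that a score at or above the cutoff is a positive prediction, and the sample evaluation results are all hypothetical and are not taken from the MLS.

```python
from typing import Dict, List, Tuple


def metrics_at_cutoff(scored: List[Tuple[float, int]], cutoff: float) -> Dict[str, int]:
    """Count confusion-matrix cells for (score, true_label) pairs at a cutoff."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for score, label in scored:
        # Assumption: scores at or above the cutoff are predicted positive.
        predicted = 1 if score >= cutoff else 0
        if predicted == 1 and label == 1:
            counts["tp"] += 1
        elif predicted == 1 and label == 0:
            counts["fp"] += 1
        elif predicted == 0 and label == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts


# Hypothetical evaluation results: (model score, true label).
evaluation = [(0.95, 1), (0.80, 1), (0.65, 0), (0.40, 1), (0.30, 0), (0.10, 0)]

# Each slider position corresponds to one call; the display would be refreshed
# with the returned counts.
for cutoff in (0.25, 0.50, 0.75):
    print(cutoff, metrics_at_cutoff(evaluation, cutoff))
```

  • Because the counts depend only on the already-computed scores and the cutoff, such a computation could run either at a back-end server or on the display device.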
  • the target value may be stored in an MLS repository in the depicted embodiment (element 6919).
  • different PIT1 values may be saved for different combinations of models, users, evaluation data sets, and/or use cases - e.g., a repository record containing a selected PIT value may be indexed using some combination of a tuple (model ID, evaluation data set ID, user/client ID, use case ID).
  • Results of one or more post-evaluation model executions may be generated using the saved PIT1 value and provided to the interested clients (element 6922).
  • the saved PIT1 value may be used for other evaluations as well as or instead of being used for post-evaluation runs.
  • the initial request to train the model (or requests to retrain/re-evaluate the model) may also be received via elements of the interactive graphical interface.
  • the graphical interface may also display alerts or informational messages pertaining to model evaluations and/or other activities performed on behalf of a client, such as a list of anomalies or unusual results detected during a given evaluation run.
  • the MLS may, for example, check how much the statistical distribution of an input variable of an evaluation data set differs from the statistical distribution of the same variable in the training data set in one embodiment, and display an alert if the distributions are found to be substantially different.
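  • One possible way to quantify how much two distributions of a numeric input variable differ is the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical cumulative distributions). The sketch below is an assumption about how such a check could be written; the text does not name a specific statistic, and the alert threshold of 0.3 is hypothetical.

```python
from bisect import bisect_right
from typing import Sequence


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest absolute difference between the empirical CDFs of a and b."""
    sa, sb = sorted(a), sorted(b)
    # The maximum gap between two step functions occurs at an observed value.
    return max(
        abs(bisect_right(sa, x) / len(sa) - bisect_right(sb, x) / len(sb))
        for x in sa + sb
    )


def should_alert(training_values: Sequence[float],
                 evaluation_values: Sequence[float],
                 threshold: float = 0.3) -> bool:
    """Hypothetical rule: alert when the distributions differ substantially."""
    return ks_statistic(training_values, evaluation_values) > threshold


print(should_alert([1.0, 2.0, 3.0, 4.0], [1.1, 2.1, 2.9, 4.2]))
print(should_alert([1.0, 2.0, 3.0, 4.0], [11.0, 12.0, 13.0, 14.0]))
```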
  • results of several different evaluation runs may be displayed in a single view of the interface (e.g., by emulating a 3-dimensional display in which results for different evaluation runs are shown at different "depths", or by computing the average results from the different evaluation runs).
  • an MLS client may submit one or more requests via a command-line tool or an API invocation to receive an indication of the distribution of prediction results of an evaluation run of various types of models, including classification and/or regression models.
  • the client may interact with the interface (e.g., submit a new command, or invoke a different API) to indicate changes to prediction interpretation threshold values, and the corresponding changes to various metrics may be displayed accordingly (e.g., in text format).
  • the client may use the API or command line to indicate that a particular interpretation threshold value is to be saved for use in subsequent runs of the model.
  • approximations of at least some of the graphical displays illustrated in FIG. 63 - 68 may be provided using text symbols - e.g., a relatively crude version of a graph may be displayed using combinations of ASCII characters.
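  • A crude text rendering of a distribution might look like the following Python sketch. The function name `ascii_histogram`, the bin count, and the bar width are hypothetical choices made only for illustration.

```python
from typing import List, Sequence


def ascii_histogram(values: Sequence[float], bins: int = 5, width: int = 20) -> List[str]:
    """Render a histogram of values as lines of '#' characters."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero when all values are equal
    counts = [0] * bins
    for v in values:
        # Clamp so that the maximum value falls into the last bin.
        idx = min(int((v - lo) / span * bins), bins - 1)
        counts[idx] += 1
    peak = max(counts)
    lines = []
    for i, count in enumerate(counts):
        left_edge = lo + span * i / bins
        bar = "#" * round(count / peak * width)
        lines.append(f"{left_edge:6.2f} | {bar} {count}")
    return lines


# Hypothetical prediction scores from an evaluation run.
scores = [0.05, 0.10, 0.12, 0.30, 0.45, 0.50, 0.55, 0.80, 0.90, 0.95]
print("\n".join(ascii_histogram(scores)))
```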
  • Voice and/or gesture-based MLS interfaces may be used in some embodiments.
  • observation records may be split into several types of data sets for respective phases of model development and use. For example, some observations may be included in a training data set used to generate a model, and others may be included in one or more test or evaluation data sets to be used to determine the quality of the model's predictions.
  • “test data set” and “evaluation data set” may be used synonymously herein; similarly, the process of determining the quality or accuracy of a model's predictions may be referred to either as “evaluation” or “testing” of the model.
  • One of the primary goals of using test data sets subsequent to training a model is to determine how well the trained model is able to generalize beyond the training data: that is, how accurately the trained model can predict output variable values for "new" observations that were not included in the training data set.
  • If a test data set happens to include many observations that were also in the training data set, the accuracy of the predictions made using the test data set may appear to be high largely due to the duplication of the observation records between the training and test data sets, and not because of the model's superior generalization capability.
  • each of these data sets may potentially comprise millions of observation records, and it may sometimes be the case that at least some observation records may "leak" from a training data set to a corresponding test data set - e.g., due to errors in splitting the data between training and test data sets, or due to inadvertent use of similar or overlapping data files for training and testing phases.
  • the probability of such data leakage may be even greater when the training and evaluation phases of a model are separated in time (e.g., by hours, days or weeks) and/or performed on different sets of MLS servers, as may be the case given the sizes of the training data sets and the distributed and parallel architecture of the MLS.
  • the MLS may provide support for efficient detection of observation records that are (or at least are likely to be) duplicates across data sets. In the absence of such support, the customer may wait until the end of a test or evaluation run, examine the results of the run, and only then be able to make a subjective judgment (e.g., if the results seem unexpectedly accurate) as to whether the test data included training data observation records.
  • MLS customers may be informed relatively early during the processing of a given data set DS1 (such as a test data set for a model) whether DS1 has a high probability of containing records that were also in a second data set DS2 (such as the training data set for the model), and may thereby be able to avoid wasting resources.
  • FIG. 70 illustrates an example duplicate detector that may utilize space-efficient representations of machine learning data sets to determine whether one data set is likely to include duplicate observation records of another data set at a machine learning service, according to at least some embodiments.
  • a training data set 7002 to be used to train a particular machine learning model 7020 may be identified at the MLS in the depicted embodiment, e.g., as a result of a client's invocation of a programmatic interface of the MLS such as the "createModel" interface described earlier. Later, the client on whose behalf the model was trained may wish to have the quality of the model 7020 evaluated using a test data set 7004, or the MLS itself may identify the test data set 7004 to be used for the evaluation.
  • Each of the data sets 7002 and 7004 may include some number of observation records (ORs), such as ORs Tr-0, Tr-1, and Tr-2 of training data set 7002, and ORs Te-0 and Te-1 of the test data set 7004.
  • Individual ones of the ORs of either data set may comprise respective values for some number of input variables (IVs) such as IV1, IV2, and so on, as well as one or more output variables OV.
  • Not all of the ORs of either data set may necessarily contain values for all the IVs in at least some embodiments - e.g., the values of some input variables may be missing in some observation records.
  • a test data set 7004 may not necessarily have been identified at the time that the model 7020 is trained using training data set 7002.
  • At least one space-efficient alternate representation 7030 of the training data set, such as a Bloom filter, may be constructed for use in duplicate detection.
  • other types of alternate representations may be constructed, such as skip lists or quotient filters.
  • a corresponding definition 7035 of duplication may be used in some embodiments, such as a definition that indicates whether all the variables of the observation records are to be considered when designating an OR as a duplicate of another, or whether some subset of the variables are to be considered. Examples of different duplication definitions 7035 that may be applicable to a given data set are provided in FIG. 72 and discussed below in further detail.
  • the alternate representation may be generated and stored in parallel with the training of the model, so that, for example, only a single pass through the training data set 7002 may be needed for both (a) training the model and (b) creating and storing the alternate representation 7030.
  • the alternate representation may require much less (e.g., orders of magnitude less) storage or memory than is occupied by the training data set itself in some implementations.
  • a probabilistic duplicate detector 7036 of the MLS may use the alternate representation 7030 to make one of the following determinations regarding a given OR Te-k of the test data set 7004: either (a) Te-k is not a duplicate of any of the ORs of the training data set or (b) Te-k has a non-zero probability of being a duplicate of an OR of the training data set. That is, while it may not be possible for the probabilistic duplicate detector 7036 to provide 100% certainty regarding the existence of duplicates, the detector may be able to determine with 100% certainty that a given test data set OR is not a duplicate.
  • the probabilistic duplicate detector 7036 may be able to estimate or compute a confidence level or certainty level associated with a labeling of a given OR as a duplicate.
  • the duplicate detector 7036 may examine some number of ORs of the test data set 7004 and obtain one or more duplication metrics 7040 for the examined ORs.
  • the duplication metric may itself be probabilistic in nature in some embodiments. For example, it may represent the logical equivalent of the statement "X% of the test set observation records have respective probabilities greater than or equal to Y% of being duplicates".
  • the client may be provided with an indication of a confidence level as to whether one or more of the observation records are duplicates.
  • the metric 7040 may indicate with 100% certainty that the examined test data is duplicate-free.
  • the duplicate detector 7036 may also take into account an expected rate of false-positive duplicate detection associated with the particular alternate representation being used. For example, if a Bloom filter being used as the alternate representation 7030 has an 8% expected rate of false positives, and the fraction of duplicates detected is also 8% (or less), the duplication metric may simply indicate that the number of possible duplicates identified is within an acceptable range.
  • various parameters used in the generation of the alternate representation may be selected based on factors such as the size of the training data set, the desired false positive rate of the alternate representation's duplicate predictions, and so on.
  • duplication responses 7045 may be implemented by the MLS. Any of a number of different responsive actions may be undertaken in different embodiments - e.g., clients may be sent warning messages indicating the possibility of duplicates, likely duplicates may be removed or deleted from the test data set 7004, a machine learning job that involves the use of the test data may be suspended, canceled or abandoned, and so on.
  • the responsive action taken by the MLS may be dependent on the duplication metric 7040.
  • a warning message indicating the (small) fraction of potential duplicates may be transmitted to the client, while if a large fraction of the test data set is found to be potentially duplicate, the evaluation of the model 7020 may be suspended or stopped until the client has addressed the problem.
  • the duplication analysis may be performed in parallel with the evaluation of the model 7020 using the test data set 7004, so that only a single pass through the test data set may be needed.
  • the client may indicate (e.g., via the MLS's programmatic interfaces) one or more parameters (or other forms of guidance) to be used by the MLS to determine whether a threshold criterion requiring a responsive action has been met.
  • a client could indicate that if the probability that a randomly selected observation record of the test data set is a duplicate exceeds P1, a particular responsive action should be taken.
  • the MLS may then translate such high-level guidance into the specific numerical threshold values to be used for the test data set (e.g., that a responsive action is to be taken only if at least X out of the Y test data set records available have been identified as duplicates).
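  • The translation from a client-supplied probability threshold into a record count might be sketched as follows. This assumes, hypothetically, that the action is triggered when the observed fraction of flagged records strictly exceeds the threshold; exact fractions are used so that rounding does not shift the boundary.

```python
import math
from fractions import Fraction


def min_duplicates_to_trigger(p1: Fraction, num_records: int) -> int:
    """Smallest X such that X / num_records strictly exceeds p1."""
    return math.floor(p1 * num_records) + 1


# Hypothetical guidance: act if more than 5% of the test records are duplicates.
threshold = Fraction(5, 100)
total_test_records = 1000
print(min_duplicates_to_trigger(threshold, total_test_records))
```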
  • the clients would not necessarily have to be aware of low-level details such as the total number of the test data set records or the actual number of duplicates that are to trigger the responses.
  • clients may programmatically specify the responses that are to be implemented for one or more duplication metric thresholds, and/or low-level details of the thresholds themselves.
  • the duplicate detector 7036 may not wait to process the entire test data set 7004 before initiating a generation of a response 7045 - e.g., if more than 80 of the first 100 observation records that are examined from a test data set with a million ORs have non-zero probabilities of being duplicates, a response may be generated without waiting to examine the remaining ORs.
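  • An early-exit check of this kind might be sketched as below. The function name, the predicate `may_be_duplicate`, and the default window and trigger values (taken from the 80-of-100 example) are illustrative assumptions; this sketch only inspects the first window of records.

```python
from typing import Callable, Iterable


def early_duplicate_response(records: Iterable[str],
                             may_be_duplicate: Callable[[str], bool],
                             window: int = 100,
                             trigger: int = 80) -> bool:
    """Return True as soon as more than `trigger` of the first `window`
    records have been flagged as possible duplicates."""
    flagged = 0
    for examined, record in enumerate(records, start=1):
        if may_be_duplicate(record):
            flagged += 1
        if flagged > trigger:
            return True  # respond without reading the remaining records
        if examined >= window:
            break
    return False


# Hypothetical stand-in for a probabilistic membership test.
training_ids = {str(i) for i in range(90)}
test_records = (str(i) for i in range(1_000_000))
print(early_duplicate_response(test_records, training_ids.__contains__))
```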
  • the techniques illustrated in FIG. 72 may be used for identifying possible duplicates within a given data set (e.g., within the training data set itself, within the test data set itself, or within a pre-split data set that is to be divided into training and test data sets), or across any desired pairing of data sets.
  • the use of the techniques may not be limited just to checking whether test data sets may contain duplicates of training data observation records. It is noted that in one embodiment, at least for some data sets, an alternate representation used for duplicate detection need not necessarily utilize less storage (or less memory) than the original representation of the data set.
  • FIG. 71a and 71b collectively illustrate an example of a use of a Bloom filter for probabilistic detection of duplicate observation records at a machine learning service, according to at least some embodiments.
  • a Bloom filter 7104 comprising 16 bits (Bit0 through Bit15) is shown being constructed from a training data set comprising ORs 7110A and 7110B in the depicted scenario.
  • a given OR 7110 may be provided as input to each of a set of hash functions H0, H1 and H2 in the depicted embodiment.
  • the output of each hash function may then be mapped, e.g., using a modulo function, to one of the 16 bits of the filter 7104, and that bit may be set to 1.
  • For OR 7110A, bit2 of the Bloom filter is set to 1 using hash function H0, bit6 is set to 1 using hash function H1, and bit9 is set to 1 using hash function H2.
  • For OR 7110B, bit4, bit9 (which was already set to 1) and bit13 are set to 1.
  • As in the case of bit9, to which both OR 7110A and 7110B are mapped, the presence of a 1 at a given location within the Bloom filter may result from hash values generated for different ORs (or even from hash values generated for the same OR using different hash functions).
  • the presence of 1s at any given set of bit locations of the filter may not uniquely or necessarily imply the existence of a corresponding OR in the data set used to construct the filter.
  • the size of the Bloom filter 7104 may be much smaller than the data set used to build the filter - for example, a filter of 512 bits may be used as an alternate representation of several megabytes of data.
  • the same hash functions may be applied to the test data set ORs 7150 (e.g., 7150A and 7150B) to detect possible duplicates with respect to the training data set. If a particular test data set OR 7150 maps to a set of bits that contains at least one zero, the duplicate detector may determine with certainty that the OR is not a duplicate. Thus, OR 7150A is mapped to bit3, bit6 and bit10 (using hash functions H0, H1 and H2 respectively), two of which (bit3 and bit10) happen to contain zeroes in the Bloom filter 7104 after the filter has been fully populated using the entire training data set.
  • OR 7150A is therefore indicated as not being a duplicate.
  • OR 7150B is mapped to bit4, bit9 and bit13, all of which happen to contain 1s in the fully-populated Bloom filter.
  • OR 7150B may therefore be indicated as a probable duplicate, with some underlying false positive rate of FP1.
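  • The construction and lookup steps described for the Bloom filter can be sketched in Python as follows. This is a minimal illustration, not the MLS implementation: the class name and methods are hypothetical, and the three hash functions are derived here by salting SHA-256 with an index, which is an assumption standing in for the hash functions of the figure.

```python
import hashlib
from typing import List


class BloomFilter:
    """A small Bloom filter over string-encoded observation records."""

    def __init__(self, num_bits: int = 16, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [0] * num_bits

    def _positions(self, record: str) -> List[int]:
        # Assumption: hash function i is SHA-256 salted with the index i,
        # mapped to a bit position with a modulo operation.
        positions = []
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{record}".encode("utf-8")).digest()
            positions.append(int.from_bytes(digest[:8], "big") % self.num_bits)
        return positions

    def add(self, record: str) -> None:
        for pos in self._positions(record):
            self.bits[pos] = 1

    def might_contain(self, record: str) -> bool:
        # Any zero bit proves the record was never added; all ones only
        # means the record is a possible duplicate.
        return all(self.bits[pos] for pos in self._positions(record))


training_filter = BloomFilter(num_bits=16, num_hashes=3)
for training_record in ("10,red,1", "20,blue,0"):  # hypothetical records
    training_filter.add(training_record)

for test_record in ("20,blue,0", "30,green,1"):
    status = "possible duplicate" if training_filter.might_contain(test_record) else "not a duplicate"
    print(test_record, "->", status)
```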
  • the false positive rate FP1 may be a function of the size of the Bloom filter (the number of bits used, 16 in this case), the number and/or type of hash functions used, and/or the number of observation records used to build the filter.
  • the filter size and the number and type of hash functions used may be selected via tunable parameters 7144 of the Bloom filter generation process.
  • Different parameter values may be selected, for example, based on the estimated or expected number of observation records of the training data set, the estimated or expected sizes of the observation records, and so on. Other similar parameters may govern the false positive rates expected from other types of alternate representations of data sets such as quotient filters or skip lists.
  • the size of the illustrated Bloom filter 7104 (16 bits) is not intended to represent a preferred or required size; any desired number of bits may be used, and any desired number of hash functions of any preferred type may be employed in different embodiments. For example, some implementations may use a MurmurHash function, while others may use a Jenkins hash function, a Fowler-Noll-Vo hash function, a CityHash function, or any desired combination of such hash functions.
  • parameters such as the size of the filter and/or the number and types of hash functions used may be selected at the MLS based on factors such as the estimated or actual size of the training data set, the desired false positive rate, the computation requirements of the different hash functions, the randomizing capabilities of different hash functions, and so on.
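  • One common way to relate these parameters uses the standard Bloom filter approximations: with m bits, k hash functions and n inserted records, the false positive rate is roughly (1 - e^(-kn/m))^k, and for a target rate p a reasonable choice is m = -n ln p / (ln 2)^2 and k = (m/n) ln 2. The Python sketch below applies these textbook formulas; the function names are hypothetical, and the text does not state that the MLS uses these particular formulas.

```python
import math
from typing import Tuple


def expected_false_positive_rate(num_bits: int, num_hashes: int, num_records: int) -> float:
    """Standard approximation (1 - e^(-k*n/m))^k, assuming well-mixed hashes."""
    return (1.0 - math.exp(-num_hashes * num_records / num_bits)) ** num_hashes


def suggest_parameters(num_records: int, target_fp_rate: float) -> Tuple[int, int]:
    """Return (num_bits, num_hashes) for an estimated record count and target rate."""
    num_bits = math.ceil(-num_records * math.log(target_fp_rate) / (math.log(2) ** 2))
    num_hashes = max(1, round(num_bits / num_records * math.log(2)))
    return num_bits, num_hashes


# The 16-bit, 3-hash filter populated with 2 records, as in the depicted scenario.
print(expected_false_positive_rate(num_bits=16, num_hashes=3, num_records=2))

# Hypothetical sizing for one million training records and a 1% target rate.
print(suggest_parameters(num_records=1_000_000, target_fp_rate=0.01))
```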
  • the MLS may estimate the number of observation records in the training data set by examining the first few records, and dividing the file size of the training data set file by the average size of the first few records. This approach may enable the MLS to generate the Bloom filter 7104 in a single pass through the training data set, e.g., while the model is also being trained, instead of requiring one pass to determine the exact number of ORs and then another pass to construct the filter.
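  • The record-count estimate might be sketched as follows, assuming (hypothetically) one observation record per line of a text file. The function name and the sample size are illustrative.

```python
import os
import tempfile
from itertools import islice


def estimate_record_count(path: str, sample_size: int = 100) -> int:
    """Estimate the number of records as file size / average sampled record size."""
    with open(path, "rb") as f:
        sample = list(islice(f, sample_size))  # read only the first few records
    if not sample:
        return 0
    average_size = sum(len(line) for line in sample) / len(sample)
    return round(os.path.getsize(path) / average_size)


# Hypothetical training file with 1000 fixed-width records.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "training.csv")
    with open(path, "w", newline="\n") as f:
        for i in range(1000):
            f.write(f"{i:04d},0.5,label\n")
    estimate = estimate_record_count(path, sample_size=10)

print(estimate)
```

  • When record sizes vary, the estimate is only approximate, which is acceptable if the filter parameters are chosen with some margin.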
  • a cryptographic-strength hash function may be used to generate signatures of each of the training data set ORs, and the signatures generated using the same hash function on the test data may be used to detect duplicates with a very high rate of accuracy.
  • using cryptographic hash functions may be computationally expensive compared to weaker hash functions that may be used to generate Bloom filters, and the space efficiency achieved using the cryptographic hashes may not be as great as is achievable using Bloom filters.
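  • A signature-based alternative might be sketched as below, with SHA-256 chosen as an example of a cryptographic-strength hash function (an assumption; no specific function is named in the text). One 32-byte signature is stored per training record, which illustrates why this representation is less space-efficient than a Bloom filter.

```python
import hashlib
from typing import Iterable, Set


def signature(record: str) -> bytes:
    """SHA-256 digest of a string-encoded observation record."""
    return hashlib.sha256(record.encode("utf-8")).digest()


def build_signature_set(records: Iterable[str]) -> Set[bytes]:
    return {signature(r) for r in records}


# Hypothetical records.
training_signatures = build_signature_set(["10,red,1", "20,blue,0"])

for test_record in ("20,blue,0", "30,green,1"):
    print(test_record, signature(test_record) in training_signatures)
```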
  • the MLS may be able to trade off the accuracy of duplicate detection with the resource usage or cost associated with the duplicate detection technique selected - e.g., as the accuracy rises, the resource needs of the technique may also typically rise. It is noted that at least in some embodiments and/or for some data set sizes, a deterministic duplicate detection technique rather than a probabilistic technique may be selected - e.g., a test data OR being tested for possible duplication may be compared to the original ORs of the training data set instead of using a space-efficient representation.
  • the MLS may determine a definition of duplication that is to be applied - i.e., exactly what properties of an OR O1 should be considered when declaring O1 a probable or actual duplicate of a different OR O2.
  • FIG. 72 illustrates examples of alternative duplicate definitions that may be used at a duplicate detector of a machine learning service, according to at least some embodiments. In the depicted embodiment, three example duplicate definitions DD1, DD2 and DD3 are shown.
  • According to DD1, all the input variables and output variables that are included in any OR of the training data set 7210 are to be considered when deciding whether a given OR is a duplicate of another.
  • According to DD2, all the input variables, but none of the output variables, are to be considered.
  • According to DD3, only a strict subset of the input variables (e.g., IV1 and IV3 in the illustrated scenario) needs to match for an OR to be considered a duplicate.
  • the client may wish to exclude IV-k from the set of variables to be used to determine duplication.
  • clients may not wish to include the output variables when considering duplicates, since the predictions of the models are based entirely on the input variables.
  • different alternate representations of the training set may be created based on the duplication definition selected.
  • for a training data set 7210 in which observation records include input variables IV1, IV2, IV3 and IV4, and output variable OV, all five variables may be used (e.g., as combined input to a set of hash functions) if definition DD1 is used. If DD2 is used, IV1, IV2, IV3 and IV4 may be used to generate the alternate representation, and OV may be excluded. If DD3 is used, only IV1 and IV3 may be used for the alternate representation.
  • the MLS may decide to use multiple duplication definitions concurrently, e.g., respective alternate representations of the training data set 7210 may be created in accordance with each definition used, and duplication metrics corresponding to each of the definitions may be obtained.
  • Duplication analysis results 7260A, 7260B and/or 7260C may be generated based on the definition and alternate representation used.
  • OR 7251 of test data set 7220 happens to match OR 7201 in all five variables. All three results 7260A, 7260B and 7260C may therefore identify OR 7251 as a probable duplicate with some non-zero probability.
  • OR 7252 matches OR 7201 in all the input variables, but not in the output variable. As a result, OR 7252 may be classified as a probable duplicate if DD2 or DD3 is used, but not if DD1 is used.
  • OR 7253, which has the same values of IV1 and IV3 as OR 7202 of the training set but differs in all other variables, may be classified as a possible duplicate only if DD3 is used, and may be declared a non-duplicate if either of the other definitions is used.
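  • The effect of the three duplication definitions can be illustrated with the Python sketch below. For clarity it compares exact keys rather than using a probabilistic representation, and all variable values are hypothetical; only the record labels and the variables selected by each definition follow the scenario described above.

```python
from typing import Dict, Tuple

# Variables considered under each duplication definition.
DEFINITIONS: Dict[str, Tuple[str, ...]] = {
    "DD1": ("IV1", "IV2", "IV3", "IV4", "OV"),
    "DD2": ("IV1", "IV2", "IV3", "IV4"),
    "DD3": ("IV1", "IV3"),
}

# Hypothetical values for the training and test records.
training = {
    "7201": {"IV1": 10, "IV2": "a", "IV3": 3.5, "IV4": "x", "OV": 1},
    "7202": {"IV1": 20, "IV2": "b", "IV3": 7.0, "IV4": "y", "OV": 0},
}
test = {
    "7251": {"IV1": 10, "IV2": "a", "IV3": 3.5, "IV4": "x", "OV": 1},
    "7252": {"IV1": 10, "IV2": "a", "IV3": 3.5, "IV4": "x", "OV": 0},
    "7253": {"IV1": 20, "IV2": "c", "IV3": 7.0, "IV4": "z", "OV": 1},
}


def duplication_key(record: dict, definition: str) -> tuple:
    """Project a record onto the variables named by the definition."""
    return tuple(record[name] for name in DEFINITIONS[definition])


def is_duplicate(record: dict, definition: str) -> bool:
    training_keys = {duplication_key(r, definition) for r in training.values()}
    return duplication_key(record, definition) in training_keys


results = {
    label: {dd: is_duplicate(record, dd) for dd in DEFINITIONS}
    for label, record in test.items()
}
for label, outcome in results.items():
    print(label, outcome)
```

  • In a probabilistic setting, the same projected key would be the input to the hash functions of the alternate representation.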
  • FIG. 73 illustrates an example of a parallelized approach towards duplicate detection for large data sets at a machine learning service, according to at least some embodiments.
  • training data set 7302 may be divided into four partitions P0, P1, P2 and P3, and a respective Bloom filter creation (BFC) job may be generated and scheduled corresponding to each partition.
  • BFC jobs J0 through J3 may be scheduled for the partitions P0 through P3, respectively.
  • the jobs J0 through J3 may also be used for other tasks, such as training the model, and need not necessarily be limited to creating Bloom filters or other alternate representations in various embodiments.
  • the creation of Bloom filters or other alternate representations may be considered one example of a feature processing transformation, and a recipe language similar to that described earlier may be used to request the generation of the representations.
  • Each of the BFC jobs may produce a partition-level Bloom filter such as BF0, BF1, BF2 or BF3 in the depicted example scenario.
  • the partition-level filters may then be logically combined or aggregated, e.g., using simple Boolean "or" operations, to produce a complete Bloom filter BF-all.
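  • The aggregation step can be sketched in Python by representing each filter as an integer bitmask, so that combining partition-level filters is a bitwise OR. The filter size, the salted SHA-256 hashing, and the partition contents are hypothetical; the partition-level builds are run sequentially here, whereas separate jobs could run them in parallel.

```python
import hashlib
from functools import reduce
from operator import or_
from typing import Iterable

NUM_BITS = 64     # hypothetical filter size
NUM_HASHES = 3    # hypothetical number of hash functions


def bit_mask(record: str) -> int:
    """Bitmask with the bits selected by each (salted SHA-256) hash set to 1."""
    mask = 0
    for i in range(NUM_HASHES):
        digest = hashlib.sha256(f"{i}:{record}".encode("utf-8")).digest()
        mask |= 1 << (int.from_bytes(digest[:8], "big") % NUM_BITS)
    return mask


def build_filter(records: Iterable[str]) -> int:
    return reduce(or_, (bit_mask(r) for r in records), 0)


def might_contain(bloom: int, record: str) -> bool:
    mask = bit_mask(record)
    return (bloom & mask) == mask


# Four partitions of a hypothetical training data set, one filter per partition.
partitions = [["r0", "r1"], ["r2"], ["r3", "r4"], ["r5"]]
partition_filters = [build_filter(p) for p in partitions]

# Combine the partition-level filters with Boolean "or".
bf_all = reduce(or_, partition_filters, 0)
print(might_contain(bf_all, "r3"))
```

  • Because setting bits is order-independent, the combined filter is identical to one built in a single pass over the whole data set, provided every partition uses the same size and hash functions.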
  • BF-all may then be used for parallelized duplicate detection in the depicted embodiment - e.g., by scheduling three duplicate checking jobs J4, J5 and J6 for respective partitions P0-test, P1-test and P2-test of a test data set 7310.
  • different MLS servers such as S0 through S7 may be used for at least some of the jobs J0 - J6.
  • the degree of parallelism e.g., the number of different jobs that are scheduled, and/or the number of different servers that are used
  • Similar parallelization approaches may be used with other types of duplicate detection algorithms, e.g., for techniques that do not necessarily employ Bloom filters.
  • FIG. 74 illustrates an example of probabilistic duplicate detection within a given machine learning data set, according to at least some embodiments.
  • a space-efficient representation 7430 of the data set may gradually be populated.
  • the under-construction alternate representation 7430 may contain entries corresponding to the K processed records 7422.
  • the probabilistic duplicate detector 7035 may use the alternate representation 7430 to determine whether the record represents a duplicate of an already-processed observation record of the same data set 7410.
  • the newly encountered OR may be classified as a possible duplicate, or as a confirmed non-duplicate, using the kinds of techniques described earlier.
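  • Intra-data-set detection of this kind might be sketched as a single pass in which each record is first checked against the partially populated representation and then added to it. The filter size and salted SHA-256 hashing in this Python sketch are hypothetical.

```python
import hashlib
from typing import Iterable, List

NUM_BITS = 1024   # hypothetical filter size
NUM_HASHES = 3    # hypothetical number of hash functions


def bit_mask(record: str) -> int:
    """Bitmask with the bits selected by each (salted SHA-256) hash set to 1."""
    mask = 0
    for i in range(NUM_HASHES):
        digest = hashlib.sha256(f"{i}:{record}".encode("utf-8")).digest()
        mask |= 1 << (int.from_bytes(digest[:8], "big") % NUM_BITS)
    return mask


def find_possible_duplicates(records: Iterable[str]) -> List[int]:
    """Return indexes of records that may duplicate an earlier record."""
    bloom = 0
    flagged = []
    for index, record in enumerate(records):
        mask = bit_mask(record)
        if (bloom & mask) == mask:
            flagged.append(index)   # possible duplicate of a processed record
        bloom |= mask               # the record is now part of the representation
    return flagged


print(find_possible_duplicates(["a,1", "b,2", "a,1", "c,3"]))
```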
  • the duplicate detector may keep track of the ORs that are classified as having non-zero probabilities of being duplicates, and may include the list in intra-data-set duplicate detection results 7444 provided to the client on whose behalf the data set 7410 is being processed.
  • the duplicate detector may take other actions, such as simply notifying the client regarding the number of probable duplicates, or the duplicate detector may initiate the removal of the probable duplicates from the data set 7410.
  • FIG. 75 is a flow diagram illustrating aspects of operations that may be performed at a machine learning service that implements duplicate detection of observation records, according to at least some embodiments.
  • the MLS may determine that a first or target set of observation records (e.g., a test data set) is to be checked for duplicates with respect to a second or source set of observation records (e.g., a corresponding training data set) in accordance with some selected definition of duplication.
  • a default duplication definition may require the MLS to consider the values of all the input and output variables of observation records of the source set when identifying possible duplicates.
  • Other duplication definitions may be used in some embodiments, in which one or more output variables and/or one or more input variables are to be excluded when determining duplicates.
  • clients of the MLS may indicate whether they want duplicate detection to be performed on specified data sets, or the particular definition of duplication to be used, e.g., using programmatic interfaces implemented by the MLS.
  • the MLS may also determine respective responsive actions to be taken if various levels of duplication are identified (element 7504) in the depicted embodiment. Examples of such actions may include transmitting warning or alert messages to the client that simply indicate the number or fraction of potential duplicate records (i.e., those observation records of the target data set for which the probability of being duplicates is non-zero), providing a list of the suspected duplicates, or providing estimates of the certainty levels or confidence levels associated with the designations of the records as duplicates. In one implementation, respective confidence levels associated with individual observation records suspected of being duplicates may be provided. In some embodiments, the responsive actions may include removing the probable duplicates from the target data set and/or providing statistical estimates of the impact of removing the duplicates on prediction errors of the associated model.
  • the MLS may, in response to the identification of potential or likely duplicates within a data set, suspend, abandon or cancel a machine learning job which involves the use of the data set or is otherwise associated with the data set.
  • Different responses may be selected for respective duplication levels in some embodiments - e.g., a warning may be generated if the fraction of duplicates is estimated to be between 5% and 10%, while duplicates may simply be discarded if they are collectively less than 2% of the target data set.
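  • A mapping from duplication levels to responses might be expressed as a small policy table, as in the Python sketch below. The two ranges come from the example above; the action names, the treatment of range boundaries, and the choice to return no action outside the listed ranges are assumptions.

```python
from typing import List, Optional, Tuple

# (lower bound inclusive, upper bound exclusive, action name)
Policy = List[Tuple[float, float, str]]

EXAMPLE_POLICY: Policy = [
    (0.00, 0.02, "discard_duplicates"),
    (0.05, 0.10, "warn_client"),
]


def select_response(duplicate_fraction: float,
                    policy: Policy = EXAMPLE_POLICY) -> Optional[str]:
    """Return the action whose range contains the fraction, if any."""
    for low, high, action in policy:
        if low <= duplicate_fraction < high:
            return action
    return None  # no response configured for this level


for fraction in (0.01, 0.03, 0.07):
    print(fraction, select_response(fraction))
```

  • A client-specified policy could be supplied as the `policy` argument in place of the example table.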
  • MLS clients may specify the types of actions they want taken for different extents of possible duplication in some embodiments.
  • one or more MLS components may generate, e.g., in parallel with other operations that involve a traversal of the source set such as the training of a model, an alternate representation of the source set that can be used for probabilistic duplicate detection (element 7507).
  • a Bloom filter, a quotient filter, a skip list, a list of cryptographic signatures of the source records, or some other space-efficient structure may be used in various embodiments as the alternate representation.
  • the MLS may first reformat at least some of the source data set's observation records - e.g., before feeding an observation record to a hash function used for generating a Bloom filter, the set of variable separators may be checked for consistency, trailing and leading blanks may be removed from text variables, numerical variables may be formatted in a uniform manner, and so on.
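  • A normalization step of this kind might be sketched as follows, so that records differing only in whitespace or numeric formatting produce the same hash input. The comma separator and the rule that any field parseable as a number is rewritten in a uniform floating-point form are assumptions.

```python
def normalize_record(raw: str, separator: str = ",") -> str:
    """Strip blanks from each field and format numeric fields uniformly."""
    fields = []
    for field in raw.split(separator):
        field = field.strip()  # remove leading and trailing blanks
        try:
            field = repr(float(field))  # e.g. "2.50" and "2.5" both become "2.5"
        except ValueError:
            pass  # not numeric: keep the trimmed text as-is
        fields.append(field)
    return separator.join(fields)


print(normalize_record(" 1 , abc ,2.50"))
print(normalize_record("1.0,abc,2.5"))
```

  • Note that Python's `float` also accepts text such as "nan" or "inf", so a production version would likely need an explicit schema to decide which fields are numeric.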
  • the alternate representation may optionally be stored in an MLS artifact repository (such as repository 120 shown in FIG. 1) in some embodiments (element 7510), e.g., as an add-on artifact associated with the model that was trained during the same pass through the source data set.
  • the alternate representation may be retained for a selected duration in the repository.
  • the MLS may keep track of when the alternate representation was last used for duplicate detection, and it may be discarded if it has not been used for some threshold time interval.
  • a duplicate detector of the MLS may determine whether the target data set is entirely duplicate-free, or whether at least some of the records of the target data set have non-zero probabilities of being duplicates (element 7513).
  • a duplication metric may be generated, indicating for example the number or fraction of suspected duplicates and the associated non-zero probabilities.
  • the duplication metric may take into account the baseline false positive duplicate prediction rate associated with the alternate representation. For example, for a Bloom filter, the false positive rate may depend on the size (number of bits) of the Bloom filter, the number and/or types of hash functions used, and/or the number of observation records used to populate the filter.
  • the duplication metric may be based at least in part on the difference between Num Probable Duplicates Found (the number of observation records identified as possible duplicates) and Num Expected False Positives (the number of observation records that are expected to be classified falsely as duplicates), for example.
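  • A metric of this form might be computed as in the Python sketch below, where the expected number of false positives is taken to be the number of examined records multiplied by the baseline false positive rate of the alternate representation. The function name, the sample counts, and the 8% rate are illustrative assumptions.

```python
def excess_probable_duplicates(num_examined: int,
                               num_probable_duplicates_found: int,
                               false_positive_rate: float) -> float:
    """Probable duplicates found minus the number expected from false positives alone."""
    num_expected_false_positives = num_examined * false_positive_rate
    return num_probable_duplicates_found - num_expected_false_positives


# 80 flagged out of 1000 with an 8% baseline rate: nothing beyond expectation.
print(excess_probable_duplicates(1000, 80, 0.08))

# 300 flagged out of 1000 with the same baseline: about 220 unexplained.
print(excess_probable_duplicates(1000, 300, 0.08))
```

  • A value near or below zero suggests that the flagged records are explained by false positives, while a large positive value suggests genuine duplication.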
  • either the generation of the alternate representation, the checking of the test data set for potential duplicates, or both these tasks may be performed in a parallelized or distributed fashion using a plurality of MLS jobs as illustrated in FIG. 73. If the duplication metric exceeds a threshold, a corresponding responsive action (e.g., one or more of the actions identified in operations corresponding to element 7504) may be performed in the depicted embodiment (element 7516).
  • operations other than those illustrated in the flow diagrams of FIG. 9a, 9b, 10a, 10b, 17, 25, 32, 39, 48, 54, 55, 61, 69 and 75 may be used to implement at least some of the techniques of a machine learning service described above. Some of the operations shown may not be implemented in some embodiments, may be implemented in a different order, or in parallel rather than sequentially. For example, with respect to FIG. 9b, a check as to whether the client's resource quota has been exhausted may be performed subsequent to determining the workload strategy in some embodiments, instead of being performed before the strategy is determined.
  • Best practices developed over years of experience with different data cleansing approaches, transformation types, parameter settings for transformations as well as models may be incorporated into the programmatic interfaces (such as easy-to-learn and easy-to-use APIs) of the MLS, e.g., in the form of default settings that users need not even specify.
  • Users of the MLS may submit requests for various machine learning tasks or operations, some of which may depend on the completion of other tasks, without having to manually manage the scheduling or monitor the progress of the tasks (some of which may take hours or days, depending on the nature of the task or the size of the data set involved).
  • Users may be provided interactive graphical displays of model evaluations and other executions in some embodiments, enabling the users to make informed decisions regarding interpretation-related settings such as classification cutoffs.
  • the detection of potential duplicates between a test or evaluation data set and the corresponding training data may be performed by default in some embodiments, enabling clients of the MLS to avoid wasting resources on evaluations based on data that is not likely to provide insights into a model's generalization capabilities.
  • a logically centralized repository of machine learning objects corresponding to numerous types of entities may enable multiple users or collaborators to share and re-use feature-processing recipes on a variety of data sets.
  • Expert users or model developers may add to the core functionality of the MLS by registering third-party or custom libraries and functions.
  • the MLS may support isolated execution of certain types of operations for which enhanced security is required.
  • the MLS may be used for, and may incorporate techniques optimized for, a variety of problem domains covering both supervised and unsupervised learning, such as fraud detection, financial asset price predictions, insurance analysis, weather prediction, geophysical analysis, image/video processing, audio processing, natural language processing, medicine and bioinformatics, and so on.
  • Specific optimization techniques such as pruning of depth-first decision trees, limiting the size of linear models by efficiently pruning feature weights, or performing concurrent quantile binning, may be implemented by default in some cases without the MLS clients even being aware of the use of the techniques.
  • for optimizations such as tradeoffs between training-time resource usage and prediction-time resource usage, clients may interact with the machine learning service to decide upon a mutually acceptable feature processing proposal.
  • a server that implements one or more of the components of a machine learning service (including control-plane components such as API request handlers, input record handlers, recipe validators and recipe run-time managers, feature processing managers, plan generators, job schedulers, artifact repositories, and the like, as well as data plane components such as MLS servers used for model generation/training, implementing decision tree optimizations, model pruning and/or category-based sampling, generating and/or displaying evaluation results graphically, and so on) may include a general-purpose computer system that includes or is configured to access one or more computer-accessible media.
  • FIG. 76 illustrates such a general-purpose computing device 9000.
  • computing device 9000 includes one or more processors 9010 coupled to a system memory 9020 (which may comprise both non-volatile and volatile memory modules) via an input/output (I/O) interface 9030.
  • Computing device 9000 further includes a network interface 9040 coupled to I/O interface 9030.
  • computing device 9000 may be a uniprocessor system including one processor 9010, or a multiprocessor system including several processors 9010 (e.g., two, four, eight, or another suitable number).
  • Processors 9010 may be any suitable processors capable of executing instructions.
  • processors 9010 may be general-purpose or embedded processors implementing any of a variety of instruction set architectures (ISAs), such as the x86, PowerPC, SPARC, or MIPS ISAs, or any other suitable ISA.
  • processors 9010 may commonly, but not necessarily, implement the same ISA.
  • graphics processing units (GPUs) may be used instead of, or in addition to, conventional processors.
  • System memory 9020 may be configured to store instructions and data accessible by processor(s) 9010.
  • the system memory 9020 may comprise both volatile and non-volatile portions; in other embodiments, only volatile memory may be used.
  • the volatile portion of system memory 9020 may be implemented using any suitable memory technology, such as static random access memory (SRAM), synchronous dynamic RAM or any other type of memory.
  • flash-based memory devices including NAND-flash devices, may be used.
  • the non-volatile portion of the system memory may include a power source, such as a supercapacitor or other power storage device (e.g., a battery).
  • memristor based resistive random access memory (ReRAM), three-dimensional NAND technologies, Ferroelectric RAM, magnetoresistive RAM (MRAM), or any of various types of phase change memory (PCM) may be used at least for the non-volatile portion of system memory.
  • program instructions and data implementing one or more desired functions are shown stored within system memory 9020 as code 9025 and data 9026.
  • I/O interface 9030 may be configured to coordinate I/O traffic between processor 9010, system memory 9020, and any peripheral devices in the device, including network interface 9040 or other peripheral interfaces such as various types of persistent and/or volatile storage devices.
  • I/O interface 9030 may perform any necessary protocol, timing or other data transformations to convert data signals from one component (e.g., system memory 9020) into a format suitable for use by another component (e.g., processor 9010).
  • I/O interface 9030 may include support for devices attached through various types of peripheral buses, such as a variant of the Peripheral Component Interconnect (PCI) bus standard or the Universal Serial Bus (USB) standard, for example.
  • I/O interface 9030 may be split into two or more separate components, such as a north bridge and a south bridge, for example. Also, in some embodiments some or all of the functionality of I/O interface 9030, such as an interface to system memory 9020, may be incorporated directly into processor 9010.
  • Network interface 9040 may be configured to allow data to be exchanged between computing device 9000 and other devices 9060 attached to a network or networks 9050, such as other computer systems or devices as illustrated in FIG. 1 through FIG. 75, for example.
  • network interface 9040 may support communication via any suitable wired or wireless general data networks, such as types of Ethernet network, for example.
  • network interface 9040 may support communication via telecommunications/telephony networks such as analog voice networks or digital fiber communications networks, via storage area networks such as Fibre Channel SANs, or via any other suitable type of network and/or protocol.
  • system memory 9020 may be one embodiment of a computer-accessible medium configured to store program instructions and data as described above for FIG. 1 through FIG. 75 for implementing embodiments of the corresponding methods and apparatus.
  • program instructions and/or data may be received, sent or stored upon different types of computer-accessible media.
  • a computer-accessible medium may include non-transitory storage media or memory media such as magnetic or optical media, e.g., disk or DVD/CD coupled to computing device 9000 via I/O interface 9030.
  • a non-transitory computer-accessible storage medium may also include any volatile or non-volatile media such as RAM (e.g.
  • a computer-accessible medium may include transmission media or signals such as electrical, electromagnetic, or digital signals, conveyed via a communication medium such as a network and/or a wireless link, such as may be implemented via network interface 9040.
  • Portions or all of multiple computing devices such as that illustrated in FIG. 76 may be used to implement the described functionality in various embodiments; for example, software components running on a variety of different devices and servers may collaborate to provide the functionality.
  • portions of the described functionality may be implemented using storage devices, network devices, or special-purpose computer systems, in addition to or instead of being implemented using general-purpose computer systems.
  • the term "computing device", as used herein, refers to at least all these types of devices, and is not limited to these types of devices.
  • a system comprising:
  • one or more computing devices configured to:
  • the entity type comprises one or more of: (a) a data source to be used for a machine learning model, (b) a set of statistics to be computed from a particular data source, (c) a set of feature processing transformation operations to be performed on a specified data set, (d) a machine learning model employing a selected algorithm, (e) an alias associated with a machine learning model, or (f) a result of a particular machine learning model; insert a job object corresponding to the first request in a job queue of the machine learning service;
  • a method comprising:
  • the entity type comprises one or more of: (a) a data source to be used for generating a machine learning model, (b) a set of feature processing transformation operations to be performed on a specified data set, (c) a machine learning model employing a selected algorithm, or (d) an alias associated with a machine learning model;
  • the particular operation comprises assignment of an alias usable by a designated group of users of the machine learning service to execute a particular machine learning model, wherein the alias comprises a pointer to the particular machine learning model, wherein at least some users of the designated group of users are not permitted to modify the pointer.
  • identifying a workload distribution strategy for the first request comprises one or more of: (a) determining a number of passes of processing a data set of the particular operation (b) determining a parallelization level for processing a data set of the particular operation, (c) determining a convergence criterion to be used to terminate the particular operation, (d) determining a target durability level for intermediate data produced during the particular operation, or (e) determining a resource capacity limit for implementing the particular operation.
  • in response to determining that performing the particular operation includes an execution of a module developed by an entity external to the provider network, identifying a particular security container from which to select at least one resource to be used for the particular operation.
  • a non-transitory computer-accessible storage medium storing program instructions that when executed on one or more processors:
  • the entity type comprises one or more of: (a) a data source to be used for generating a machine learning model, (b) a set of statistics to be computed from a particular data source, (c) a machine learning model employing a selected algorithm, or (d) an alias associated with a machine learning model;
  • non-transitory computer-accessible storage medium as recited in clause 16, wherein the particular operation comprises assignment of an alias usable by a designated group of users of the machine learning service to execute a particular machine learning model, wherein the alias comprises a pointer to the particular machine learning model, wherein at least some users of the designated group of users are not permitted to modify the pointer.
  • in response to a determination that the third request is a duplicate of an earlier-submitted request, provide an indication of success of the third request to the client, without inserting an additional job object corresponding to the third request in the job queue.
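The duplicate-request handling recited above can be illustrated with a short sketch. The source does not say how a duplicate is recognized; this example assumes a client-supplied request identifier, and the `JobQueue` class, its `submit` method, and the response fields are all hypothetical names.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class JobQueue:
    """Toy job queue that treats a repeated request ID as a duplicate (assumption)."""
    _jobs: deque = field(default_factory=deque)
    _job_by_request: dict = field(default_factory=dict)

    def submit(self, request_id: str, operation: str) -> dict:
        # Duplicate of an earlier-submitted request: report success,
        # but do not insert an additional job object.
        if request_id in self._job_by_request:
            return {"status": "success",
                    "job_id": self._job_by_request[request_id],
                    "duplicate": True}
        job_id = f"job-{len(self._job_by_request) + 1}"
        self._jobs.append((job_id, operation))
        self._job_by_request[request_id] = job_id
        return {"status": "success", "job_id": job_id, "duplicate": False}

    def __len__(self) -> int:
        return len(self._jobs)
```

Returning the original job identifier on a resubmission lets a client retry safely after a timeout without scheduling the work twice.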
  • a system comprising:
  • one or more computing devices configured to:
  • a text representation of a recipe comprising one or more of: (a) a group definitions section indicating one or more groups of variables, wherein individual ones of the one or more groups comprise a plurality of variables on which at least one common transformation operation is to be applied, (b) an assignment section defining one or more intermediate variables, (c) a dependency section indicating respective references to one or more machine learning artifacts stored in a repository, or (d) an output section indicating one or more transformation operations to be applied to at least one entity indicated in the group definitions section, the assignment section, or the dependency section;
  • a method comprising:
  • a first representation of a recipe comprising one or more of: (a) a group definitions section indicating one or more groups of variables, wherein individual ones of the one or more groups comprise a plurality of data set variables on which at least one common transformation operation is to be applied and (b) an output section indicating one or more transformation operations to be applied to at least one entity indicated in one or more of: (i) the group definitions section or (ii) an input data set;
  • determining that the recipe is to be applied to a particular data set; verifying that the particular data set meets a run-time acceptance criterion; and applying, using one or more selected provider network resources, a particular transformation operation of the one or more transformation operations to the particular data set.
  • a data type of at least one variable of an input data record of the particular data set comprises one or more of: (a) text, (b) a numeric data type, (c) Boolean, (d) a binary data type, (e) a categorical data type, (f) an image processing data type, (g) an audio processing data type, (h) a bioinformatics data type, or (i) a structured data type.
  • the first representation comprises an assignment section defining an intermediate variable in terms of one or more of: (a) an input data set variable or (b) an entity defined in the group definitions section, wherein the intermediate variable is referenced in the output section.
  • the plurality of parameter value options comprise one or more of: (a) respective lengths of n-grams to be derived from a language processing data set, (b) respective quantile bin boundaries for a particular variable, (c) image processing parameter values, (d) a number of clusters into which a data set is to be classified, (e) values for a cluster boundary threshold, or (f) dimensionality values for a vector representation of a text document.
  • a non-transitory computer-accessible storage medium storing program instructions that when executed on one or more processors:
  • a first representation of a recipe comprising one or more of: (a) a group definitions section indicating one or more groups of variables, wherein individual ones of the one or more groups comprise a plurality of data set variables on which at least one common transformation operation is to be applied, or (b) an output section indicating one or more transformation operations to be applied to at least one entity indicated in one or more of (i) the group definitions section or (ii) an input data set of the recipe;
  • the first representation comprises an assignment section defining an intermediate variable in terms of one or more of: (a) an input data set variable or (b) an entity defined in the group definitions section, wherein the intermediate variable is referenced in the output section.
  • non-transitory computer-accessible storage medium as recited in any of clauses 23 - 25, wherein the particular artifact comprises one or more of: (a) a machine learning model, (b) a different recipe, (c) an alias or (d) a set of statistics.
  • a system comprising:
  • one or more computing devices configured to:
  • a filtering plan to perform a sequence of chunk-level filtering operations on the plurality of contiguous chunks, wherein an operation type of individual ones of the sequence of filtering operations comprises one or more of: (a) sampling, (b) shuffling, (c) splitting, or (d) partitioning for parallel computation, and wherein the filtering plan includes a first chunk-level filtering operation followed by a second chunk-level filtering operation; execute, to implement the first chunk-level filtering operation, at least a set of reads directed to one or more persistent storage devices at which at least a subset of the plurality of contiguous chunks are stored, wherein, subsequent to the set of reads, the first memory portion comprises at least the particular contiguous chunk;
  • a method comprising:
  • initiating, to implement the first chunk-level filtering operation, a set of data transfers directed to one or more persistent storage devices at which at least a subset of the plurality of chunks is stored, wherein, subsequent to the set of data transfers, the first memory portion comprises at least the particular chunk; implementing the second chunk-level filtering operation on an in-memory result set of the first chunk-level filtering operation;
  • the one or more data sources comprise one or more storage objects including a particular storage object
  • said mapping the particular data set into the plurality of chunks comprises determining, based at least in part on a chunk size parameter, a candidate offset within the particular storage object as a candidate ending boundary of the particular chunk, further comprising performing, by the one or more computing devices:
  • identifying, in a sequential read of the particular storage object in order of increasing offsets, the first delimiter with an offset higher than the candidate offset as the ending boundary of the particular chunk.
  • the one or more data sources comprise one or more of: (a) a single-host file system, (b) a distributed file system, (c) a storage object accessible via a web service interface from a network-accessible storage service, (d) a storage volume presenting a block-level device interface, or (e) a database.
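The chunk-boundary step recited above (a candidate offset derived from a chunk size parameter, then the first delimiter at a higher offset) can be sketched as below. This is a minimal in-memory illustration under stated assumptions: the function name is hypothetical, the storage object is a `bytes` value, the delimiter defaults to a newline, and the delimiter is kept with the chunk it terminates.

```python
def chunk_boundaries(data: bytes, chunk_size: int,
                     delimiter: bytes = b"\n") -> list[tuple[int, int]]:
    """Return (start, end) byte ranges that never split a record.

    Each chunk nominally ends at start + chunk_size (the candidate offset);
    the actual end is just past the first delimiter found at an offset
    higher than the candidate.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    boundaries = []
    start = 0
    while start < len(data):
        candidate = start + chunk_size
        if candidate >= len(data):
            end = len(data)
        else:
            # Scan forward, in order of increasing offsets, past the candidate.
            pos = data.find(delimiter, candidate + 1)
            end = len(data) if pos == -1 else pos + len(delimiter)
        boundaries.append((start, end))
        start = end
    return boundaries
```

Because every boundary falls immediately after a delimiter, each observation record lands wholly inside one chunk, so chunk-level operations such as shuffling or splitting do not need to repair partial records.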
  • a non-transitory computer-accessible storage medium storing program instructions that when executed on one or more processors:
  • a plan to perform one or more chunk-level operations including a first chunk-level operation on a plurality of chunks of the particular data set, wherein an operation type of the first chunk-level operation comprises one or more of: (a) sampling, (b) shuffling, (c) splitting, or (d) partitioning for parallel computation; initiate, to implement the first chunk-level operation, a set of data transfers directed to one or more persistent storage devices at which at least a subset of the plurality of chunks is stored, wherein, subsequent to the set of data transfers, a first memory portion of a particular server of the machine learning service comprises at least a particular chunk of the plurality of chunks; and
  • a system comprising:
  • one or more computing devices configured to:
  • consistency metadata to be used for one or more training-and-evaluation iterations of a machine learning model, wherein the consistency metadata comprises at least a particular initialization parameter value for a pseudorandom number source;
  • a first training set from the plurality of chunks, wherein the first training set includes at least a portion of the first chunk, wherein observation records of the first training set are used to train the machine learning model during a first training-and-evaluation iteration of the one or more training-and-evaluation iterations, and wherein the first set of pseudo-random numbers is obtained using the consistency metadata;
  • a first test set from the plurality of chunks, wherein the first test set includes at least a portion of the second chunk, wherein observation records of the first test set are used to evaluate the machine learning model during the first training-and-evaluation iteration, and wherein the second set of pseudo-random numbers is obtained using the consistency metadata.
  • a method comprising:
  • one or more computing devices configured to:
  • consistency metadata to be used for one or more training-and-evaluation iterations of a machine learning model, wherein the consistency metadata comprises at least a particular parameter value for a pseudorandom number source;
  • determining a number of chunks into which the address space is to be sub-divided based at least in part on one or more of: (a) a size of available memory at a particular server or (b) a client request.
  • a non-transitory computer-accessible storage medium storing program instructions that when executed on one or more processors:
  • consistency metadata to be used for one or more training-and-evaluation iterations of a machine learning model, wherein the consistency metadata comprises at least a particular parameter value for a pseudo-random number source;
  • a first training set from a plurality of chunks of a particular data set, wherein individual ones of the plurality of chunks comprise one or more observation records, wherein the first training set includes at least a portion of a first chunk of the plurality of chunks, and wherein observation records of the first training set are used to train the machine learning model during a first training-and-evaluation iteration of the one or more training-and-evaluation iterations;
  • a first test set from the plurality of chunks, wherein the first test set includes at least a portion of a second chunk of the plurality of chunks, and wherein observation records of the first test set are used to evaluate the machine learning model during the first training-and-evaluation iteration.
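The role of the consistency metadata in the claims above can be illustrated with a seeded chunk-level split. This is a sketch under assumptions: the function name and the `train_fraction` parameter are hypothetical, and a single seeded generator stands in for the pseudo-random number source whose initialization parameter the metadata carries.

```python
import random


def split_chunks(chunk_ids, seed: int, train_fraction: float = 0.75):
    """Shuffle chunk IDs with a seeded generator, then split into train and test.

    Reusing the same seed (the "consistency metadata" in this sketch)
    reproduces the same training and test sets on every call.
    """
    rng = random.Random(seed)  # private generator; global random state is untouched
    shuffled = list(chunk_ids)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

Storing the seed alongside the model lets a later training-and-evaluation iteration, possibly on a different server, select exactly the same chunks, so the test set never overlaps the data the model was trained on.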
  • a system comprising:
  • one or more computing devices configured to:
  • a respective value of a predictive utility metric (PUM), wherein a particular PUM value associated with a particular node of the one or more nodes is a measure of an expected contribution of the particular node to a prediction generated using the machine learning model;
  • the PUM comprises one or more of: (a) an indication of a Gini impurity, (b) an information gain metric, or (c) an entropy metric.
  • the one or more run-time optimization goals include one or more of: (a) a prediction time goal, (b) a processor utilization goal, or (c) a budget goal.
  • the machine learning model comprises one or more of: (a) a Random Forest model, (b) a classification and regression tree (CART) model, or (c) an adaptive boosting model.
  • a method comprising:
  • generating comprises removing at least the particular node from the particular decision tree, wherein the particular node is selected for removal based at least in part on the particular PUM value;
  • the PUM comprises one or more of: (a) an indication of a Gini impurity, (b) an information gain metric, or (c) an entropy metric.
  • the machine learning model comprises one or more of: (a) a Random Forest model, (b) a classification and regression tree (CART) model, or (c) an adaptive boosting model.
  • the machine learning model is configured to utilize a plurality of decision trees including the particular decision tree, wherein the particular prediction is obtained at a particular thread of execution of a plurality of threads of execution of a machine learning service, further comprising: obtaining a second prediction using a modified version of a second decision tree of the plurality of decision trees at a different thread of execution of the plurality of threads of execution.
  • a non-transitory computer-accessible storage medium storing program instructions that when executed on one or more processors:
  • a respective value of a predictive utility metric (PUM)
  • non-transitory computer-accessible storage medium as recited in clause 16, wherein the particular node is selected for removal based at least in part on one or more run-time optimization goals for an execution of the machine learning model, including one or more of: (a) a memory-footprint goal (b) a prediction time goal, (c) a processor utilization goal, or (d) a budget goal.
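The PUM-driven pruning in the claims above can be sketched with a toy tree. Several choices here are assumptions made only for illustration: the `Node` class and function names are hypothetical, the node count stands in for a memory-footprint goal, and nodes are removed leaf-first in order of increasing PUM value.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity-based equality, so list.remove drops the exact node
class Node:
    name: str
    pum: float  # predictive utility metric, e.g. derived from Gini impurity
    children: list["Node"] = field(default_factory=list)


def count_nodes(node: Node) -> int:
    return 1 + sum(count_nodes(child) for child in node.children)


def prune_to_budget(root: Node, max_nodes: int) -> Node:
    """Remove the lowest-PUM leaf repeatedly until the tree fits the budget."""
    while count_nodes(root) > max_nodes:
        best = None  # (parent, leaf) pair with the smallest PUM seen so far
        stack = [root]
        while stack:
            parent = stack.pop()
            for child in parent.children:
                if child.children:
                    stack.append(child)
                elif best is None or child.pum < best[1].pum:
                    best = (parent, child)
        if best is None:  # only the root is left; nothing more to prune
            break
        best[0].children.remove(best[1])
    return root
```

A production pruner would more likely consume a precomputed PUM distribution than rescan the tree on every removal; the rescan is kept here only for brevity.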
  • a system comprising:
  • one or more computing devices configured to:
  • the set of candidate feature processing transformations includes a particular feature processing transformation; determine (a) a quality estimate indicative of an effect, on the particular prediction quality metric, of implementing the particular candidate feature processing transformation, and (b) a cost estimate indicative of an effect, on a particular run-time performance metric associated with the particular prediction run-time goal, of implementing the particular candidate feature processing transformation; generate, based at least in part on the quality estimate and at least in part on the cost estimate, a feature processing proposal to be provided to the client for approval, wherein the feature processing proposal includes a recommendation to implement the particular feature processing transformation; and
  • the one or more computing devices implement a plurality of evaluation runs of the machine learning model, including a first evaluation run in which a first set of values of the particular processed variable are provided as input to the machine learning model, and a second evaluation run in which a different set of values of the particular processed variable are provided as input to the machine learning model.
  • the one or more computing devices implement respective evaluation runs of a first variant of the machine learning model and a second variant of the machine learning model, wherein the first variant is trained using a first training set that includes the particular processed variable, and the second variant is trained using a second training set that excludes the particular processed variable.
  • the particular prediction quality metric comprises one or more of: (a) an AUC (area under curve) metric, (b) an accuracy metric, (c) a recall metric, (d) a sensitivity metric, (e) a true positive rate, (f) a specificity metric, (g) a true negative rate, (h) a precision metric, (i) a false positive rate, (j) a false negative rate, (k) an F1 score, (l) a coverage metric, (m) an absolute percentage error metric, or (n) a squared error metric.
  • the particular feature processing transformation comprises a use of one or more of: (a) a quantile bin function, (b) a Cartesian product function, (c) a bi-gram function, (d) an n-gram function, (e) an orthogonal sparse bigram function, (f) a calendar function, (g) an image processing function, (h) an audio processing function, (i) a bio-informatics processing function, or (j) a natural language processing function.
  • a method comprising:
  • determining (a) a quality estimate indicative of an effect, on a particular prediction quality metric, of implementing the particular feature processing transformation, and (b) a cost estimate indicative of an effect, on a performance metric associated with a particular prediction goal, of implementing the particular feature processing transformation; and implementing, based at least in part on the quality estimate and at least in part on the cost estimate, a feature processing plan that includes the particular feature processing transformation.
  • generating one or more feature processing proposals including a particular feature processing proposal recommending the particular feature processing transformation, based at least in part on an analysis of respective quality estimates and respective cost estimates corresponding to a plurality of candidate feature processing transformations;
  • a model creation request comprising respective indications of one or more of: (a) the one or more target variables, (b) one or more prediction quality metrics including the particular prediction quality metric, (c) one or more prediction goals including the particular prediction goal, or (d) one or more constraints including a particular constraint identifying a mandatory feature processing transformation.
  • the particular prediction quality metric comprises one or more of: (a) an AUC (area under curve) metric, (b) an accuracy metric, (c) a recall metric, (d) a sensitivity metric, (e) a true positive rate, (f) a specificity metric, (g) a true negative rate, (h) a precision metric, (i) a false positive rate, (j) a false negative rate, (k) an F1 score, (l) a coverage metric, (m) an absolute percentage error metric, or (n) a squared error metric.
  • the particular feature processing transformation comprises a use of one or more of: (a) a quantile bin function, (b) a Cartesian product function, (c) a bi-gram function, (d) an n-gram function, (e) an orthogonal sparse bigram function, (f) a calendar function, (g) an image processing function, (h) an audio processing function, (i) a bio-informatics processing function, or (j) a natural language processing function.
  • the particular prediction goal comprises one or more of: (a) a model execution time goal, (b) a memory usage goal, (c) a processor usage goal, (d) a storage usage goal, (e) a network usage goal, or (f) a budget.
  • determining the quality estimate comprises implementing a plurality of evaluation runs of the machine learning model, including a first evaluation run in which a first set of values of the particular processed variable are provided as input to the machine learning model, and a second evaluation run in which a different set of values of the particular processed variable are provided as input to the machine learning model.
  • determining the cost estimate comprises implementing respective evaluation runs of a first variant of the machine learning model and a second variant of the machine learning model, wherein the first variant is trained using a first set of input variables that includes the particular processed variable, and the second variant is trained using a second set of input variables that excludes the particular processed variable.
  • a non-transitory computer-accessible storage medium storing program instructions that when executed on one or more processors:
  • identify, at a machine learning service, a set of candidate input variables usable to train a machine learning model to predict one or more target variables, wherein the set of candidate input variables includes at least a particular processed variable resulting from a particular feature processing transformation applicable to one or more input variables of a training data set;
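The quality-versus-cost tradeoff in the claims above can be sketched as a simple selection procedure. This is an illustration under assumptions, not the claimed method: the `Candidate` class, the `propose` function, the greedy gain-per-cost ranking, and the example units (AUC gain, milliseconds of added prediction latency) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    quality_gain: float  # estimated effect on the prediction quality metric
    cost: float          # estimated effect on the run-time performance metric


def propose(candidates, cost_budget: float) -> list[str]:
    """Greedy proposal: best quality gain per unit cost first, within the budget.

    Transformations with no estimated quality benefit are excluded outright.
    """
    useful = [c for c in candidates if c.quality_gain > 0]
    ranked = sorted(
        useful,
        key=lambda c: c.quality_gain / c.cost if c.cost > 0 else float("inf"),
        reverse=True,
    )
    chosen, remaining = [], cost_budget
    for c in ranked:
        if c.cost <= remaining:
            chosen.append(c.name)
            remaining -= c.cost
    return chosen
```

The resulting list would be presented to the client as a proposal for approval; a transformation omitted from it corresponds to a proposal that excludes that transformation because of its cost estimate.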

Abstract

La présente invention concerne, au niveau d'un service d'apprentissage machine, l'identification d'un ensemble de variables candidates pouvant être utilisées pour faire l'apprentissage d'un modèle, ledit ensemble comprenant au moins une variable traitée produite par une transformation de traitement de caractéristiques. Une estimation des coûts indicative d'un effet de mise en œuvre de la transformation de traitement de caractéristiques sur une métrique de performance associée à un objectif de prédiction du modèle est déterminée. Sur la base, au moins en partie, de l'estimation des coûts, une proposition de traitement de caractéristiques qui exclut la transformation de traitement de caractéristiques est mise en œuvre.
EP15739124.4A 2014-06-30 2015-06-30 Feature processing tradeoff management Withdrawn EP3161731A1 (en)

Applications Claiming Priority (11)

Application Number Priority Date Filing Date Title
US14/319,880 US9886670B2 (en) 2014-06-30 2014-06-30 Feature processing recipes for machine learning
US14/319,902 US10102480B2 (en) 2014-06-30 2014-06-30 Machine learning service
US14/460,312 US11100420B2 (en) 2014-06-30 2014-08-14 Input processing for machine learning
US14/460,314 US10540606B2 (en) 2014-06-30 2014-08-14 Consistent filtering of machine learning data
US14/463,434 US10339465B2 (en) 2014-06-30 2014-08-19 Optimized decision tree based models
US14/484,201 US10318882B2 (en) 2014-09-11 2014-09-11 Optimized training of linear machine learning models
US14/489,449 US9672474B2 (en) 2014-06-30 2014-09-17 Concurrent binning of machine learning data
US14/489,448 US10169715B2 (en) 2014-06-30 2014-09-17 Feature processing tradeoff management
US14/538,723 US10452992B2 (en) 2014-06-30 2014-11-11 Interactive interfaces for machine learning model evaluations
US14/569,458 US10963810B2 (en) 2014-06-30 2014-12-12 Efficient duplicate detection for machine learning data sets
PCT/US2015/038589 WO2016004062A1 (en) 2014-06-30 2015-06-30 Feature processing tradeoff management

Publications (1)

Publication Number Publication Date
EP3161731A1 true EP3161731A1 (fr) 2017-05-03

Family

ID=53674329

Family Applications (5)

Application Number Title Priority Date Filing Date
EP15739125.1A Pending EP3161732A1 (en) 2014-06-30 2015-06-30 Feature processing recipes for machine learning
EP15739128.5A Withdrawn EP3161733A1 (en) 2014-06-30 2015-06-30 Interactive interfaces for machine learning model evaluations
EP15739127.7A Active EP3161635B1 (en) 2014-06-30 2015-06-30 Machine learning service
EP15739124.4A Withdrawn EP3161731A1 (en) 2014-06-30 2015-06-30 Feature processing tradeoff management
EP23205030.2A Pending EP4328816A1 (en) 2014-06-30 2015-06-30 Machine learning service

Family Applications Before (3)

Application Number Title Priority Date Filing Date
EP15739125.1A Pending EP3161732A1 (en) 2014-06-30 2015-06-30 Feature processing recipes for machine learning
EP15739128.5A Withdrawn EP3161733A1 (en) 2014-06-30 2015-06-30 Interactive interfaces for machine learning model evaluations
EP15739127.7A Active EP3161635B1 (en) 2014-06-30 2015-06-30 Machine learning service

Family Applications After (1)

Application Number Title Priority Date Filing Date
EP23205030.2A Pending EP4328816A1 (en) 2014-06-30 2015-06-30 Machine learning service

Country Status (5)

Country Link
EP (5) EP3161732A1 (fr)
JP (4) JP6419859B2 (fr)
CN (2) CN106575246B (fr)
CA (6) CA2953817C (fr)
WO (4) WO2016004073A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10943186B2 (en) 2017-11-22 2021-03-09 Advanced New Technologies Co., Ltd. Machine learning model training method and device, and electronic device

Families Citing this family (141)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9286047B1 (en) 2013-02-13 2016-03-15 Cisco Technology, Inc. Deployment and upgrade of network devices in a network environment
US10374904B2 (en) 2015-05-15 2019-08-06 Cisco Technology, Inc. Diagnostic network visualization
US9800497B2 (en) 2015-05-27 2017-10-24 Cisco Technology, Inc. Operations, administration and management (OAM) in overlay data center environments
US9967158B2 (en) 2015-06-05 2018-05-08 Cisco Technology, Inc. Interactive hierarchical network chord diagram for application dependency mapping
US10089099B2 (en) 2015-06-05 2018-10-02 Cisco Technology, Inc. Automatic software upgrade
US10033766B2 (en) 2015-06-05 2018-07-24 Cisco Technology, Inc. Policy-driven compliance
US10142353B2 (en) 2015-06-05 2018-11-27 Cisco Technology, Inc. System for monitoring and managing datacenters
US10536357B2 (en) 2015-06-05 2020-01-14 Cisco Technology, Inc. Late data detection in data center
CA3010769A1 (en) * 2016-01-08 2017-07-13 Sunil Mehta Method and virtual data agent system for providing data insights with artificial intelligence
US10171357B2 (en) 2016-05-27 2019-01-01 Cisco Technology, Inc. Techniques for managing software defined networking controller in-band communications in a data center network
US10931629B2 (en) 2016-05-27 2021-02-23 Cisco Technology, Inc. Techniques for managing software defined networking controller in-band communications in a data center network
US10289438B2 (en) 2016-06-16 2019-05-14 Cisco Technology, Inc. Techniques for coordination of application components deployed on distributed virtual machines
US10897474B2 (en) 2016-06-23 2021-01-19 Cisco Technology, Inc. Adapting classifier parameters for improved network traffic classification using distinct private training data sets
EP3475770A4 (en) * 2016-06-23 2019-12-04 Telefonaktiebolaget LM Ericsson (publ) Method and provisioning control system of a communication network for controlling provisioning of services from a provisioning network to communication devices
US10708183B2 (en) 2016-07-21 2020-07-07 Cisco Technology, Inc. System and method of providing segment routing as a service
US10579751B2 (en) * 2016-10-14 2020-03-03 International Business Machines Corporation System and method for conducting computing experiments
US10339320B2 (en) 2016-11-18 2019-07-02 International Business Machines Corporation Applying machine learning techniques to discover security impacts of application programming interfaces
US10972388B2 (en) 2016-11-22 2021-04-06 Cisco Technology, Inc. Federated microburst detection
CN108133222B (zh) * 2016-12-01 2021-11-02 富士通株式会社 为数据库确定卷积神经网络cnn模型的装置和方法
US11238528B2 (en) * 2016-12-22 2022-02-01 American Express Travel Related Services Company, Inc. Systems and methods for custom ranking objectives for machine learning models applicable to fraud and credit risk assessments
US10685293B1 (en) * 2017-01-20 2020-06-16 Cybraics, Inc. Methods and systems for analyzing cybersecurity threats
CN108334951B (zh) 2017-01-20 2023-04-25 微软技术许可有限责任公司 针对决策树的节点的数据的预统计
US10708152B2 (en) 2017-03-23 2020-07-07 Cisco Technology, Inc. Predicting application and network performance
US10523512B2 (en) 2017-03-24 2019-12-31 Cisco Technology, Inc. Network agent for generating platform specific network policies
US10594560B2 (en) 2017-03-27 2020-03-17 Cisco Technology, Inc. Intent driven network policy platform
US10764141B2 (en) 2017-03-27 2020-09-01 Cisco Technology, Inc. Network agent for reporting to a network policy system
US10250446B2 (en) 2017-03-27 2019-04-02 Cisco Technology, Inc. Distributed policy store
US10873794B2 (en) 2017-03-28 2020-12-22 Cisco Technology, Inc. Flowlet resolution for application performance monitoring and management
CN106952475B (zh) * 2017-04-26 2019-04-23 清华大学 用户均衡原则下的双路径路网个性化诱导率分配方法
CN107203469B (zh) * 2017-04-28 2020-04-03 北京大学 基于机器学习的编译器测试加速方法
US11620571B2 (en) 2017-05-05 2023-04-04 Servicenow, Inc. Machine learning with distributed training
US11443226B2 (en) 2017-05-17 2022-09-13 International Business Machines Corporation Training a machine learning model in a distributed privacy-preserving environment
US11599783B1 (en) 2017-05-31 2023-03-07 Databricks, Inc. Function creation for database execution of deep learning model
US11144845B2 (en) 2017-06-02 2021-10-12 Stitch Fix, Inc. Using artificial intelligence to design a product
CN107273979B (zh) * 2017-06-08 2020-12-01 第四范式(北京)技术有限公司 基于服务级别来执行机器学习预测的方法及系统
US11238544B2 (en) * 2017-07-07 2022-02-01 Msm Holdings Pte System and method for evaluating the true reach of social media influencers
JP6607885B2 (ja) * 2017-07-10 2019-11-20 株式会社三菱総合研究所 情報処理装置及び情報処理方法
JP6889835B2 (ja) * 2017-07-14 2021-06-18 コニカミノルタ株式会社 ファクシミリ通信装置およびプログラム
US10680887B2 (en) 2017-07-21 2020-06-09 Cisco Technology, Inc. Remote device status audit and recovery
KR101828503B1 (ko) 2017-08-23 2018-03-29 주식회사 에이젠글로벌 앙상블 모델 생성 장치 및 방법
CN107506938A (zh) * 2017-09-05 2017-12-22 江苏电力信息技术有限公司 一种基于机器学习的物料质量评估方法
JP7000766B2 (ja) * 2017-09-19 2022-01-19 富士通株式会社 学習データ選択プログラム、学習データ選択方法、および、学習データ選択装置
GB2567147A (en) * 2017-09-28 2019-04-10 Int Consolidated Airlines Group Machine learning query handling system
US10558920B2 (en) 2017-10-02 2020-02-11 Servicenow, Inc. Machine learning classification with confidence thresholds
US10360214B2 (en) * 2017-10-19 2019-07-23 Pure Storage, Inc. Ensuring reproducibility in an artificial intelligence infrastructure
US11494692B1 (en) 2018-03-26 2022-11-08 Pure Storage, Inc. Hyperscale artificial intelligence and machine learning infrastructure
US11861423B1 (en) 2017-10-19 2024-01-02 Pure Storage, Inc. Accelerating artificial intelligence (‘AI’) workflows
US11455168B1 (en) 2017-10-19 2022-09-27 Pure Storage, Inc. Batch building for deep learning training workloads
US10671434B1 (en) 2017-10-19 2020-06-02 Pure Storage, Inc. Storage based artificial intelligence infrastructure
US10554501B2 (en) 2017-10-23 2020-02-04 Cisco Technology, Inc. Network migration assistant
US10523541B2 (en) 2017-10-25 2019-12-31 Cisco Technology, Inc. Federated network and application data analytics platform
US10594542B2 (en) 2017-10-27 2020-03-17 Cisco Technology, Inc. System and method for network root cause analysis
US11182394B2 (en) * 2017-10-30 2021-11-23 Bank Of America Corporation Performing database file management using statistics maintenance and column similarity
JP6828830B2 (ja) * 2017-11-02 2021-02-10 日本電気株式会社 評価システム、評価方法および評価用プログラム
US11170309B1 (en) 2017-11-22 2021-11-09 Amazon Technologies, Inc. System for routing machine learning model inferences
US20190156244A1 (en) * 2017-11-22 2019-05-23 Amazon Technologies, Inc. Network-accessible machine learning model training and hosting system
US11126927B2 (en) * 2017-11-24 2021-09-21 Amazon Technologies, Inc. Auto-scaling hosted machine learning models for production inference
US11004012B2 (en) 2017-11-29 2021-05-11 International Business Machines Corporation Assessment of machine learning performance with limited test data
JP2019101902A (ja) * 2017-12-06 2019-06-24 株式会社グルーヴノーツ データ処理装置、データ処理方法及びデータ処理プログラム
US11061905B2 (en) * 2017-12-08 2021-07-13 International Business Machines Corporation Job management in data processing system
US20190197549A1 (en) * 2017-12-21 2019-06-27 Paypal, Inc. Robust features generation architecture for fraud modeling
JPWO2019130433A1 (ja) * 2017-12-26 2020-12-17 株式会社ウフル 情報処理結果提供システム、情報処理結果提供方法及びプログラム
JPWO2019130434A1 (ja) * 2017-12-26 2020-12-17 株式会社ウフル 機械学習処理結果提供システム、機械学習処理結果提供方法及びプログラム
US11233821B2 (en) 2018-01-04 2022-01-25 Cisco Technology, Inc. Network intrusion counter-intelligence
US11765046B1 (en) 2018-01-11 2023-09-19 Cisco Technology, Inc. Endpoint cluster assignment and query generation
US10798015B2 (en) 2018-01-25 2020-10-06 Cisco Technology, Inc. Discovery of middleboxes using traffic flow stitching
US10873593B2 (en) 2018-01-25 2020-12-22 Cisco Technology, Inc. Mechanism for identifying differences between network snapshots
US10574575B2 (en) 2018-01-25 2020-02-25 Cisco Technology, Inc. Network flow stitching using middle box flow stitching
US10999149B2 (en) 2018-01-25 2021-05-04 Cisco Technology, Inc. Automatic configuration discovery based on traffic flow data
US10826803B2 (en) 2018-01-25 2020-11-03 Cisco Technology, Inc. Mechanism for facilitating efficient policy updates
US10917438B2 (en) 2018-01-25 2021-02-09 Cisco Technology, Inc. Secure publishing for policy updates
US11128700B2 (en) 2018-01-26 2021-09-21 Cisco Technology, Inc. Load balancing configuration based on traffic flow telemetry
US11461737B2 (en) 2018-04-20 2022-10-04 Microsoft Technology Licensing, Llc Unified parameter and feature access in machine learning models
WO2019216938A1 (en) * 2018-05-07 2019-11-14 Google Llc Application development platform and software development kits that provide comprehensive machine learning services
US11263540B2 (en) * 2018-05-07 2022-03-01 Apple Inc. Model selection interface
US11481580B2 (en) * 2018-05-31 2022-10-25 Fujitsu Limited Accessible machine learning
CN108921840A (zh) * 2018-07-02 2018-11-30 北京百度网讯科技有限公司 显示屏外围电路检测方法、装置、电子设备及存储介质
CN109085174A (zh) * 2018-07-02 2018-12-25 北京百度网讯科技有限公司 显示屏外围电路检测方法、装置、电子设备及存储介质
CN108961238A (zh) * 2018-07-02 2018-12-07 北京百度网讯科技有限公司 显示屏质量检测方法、装置、电子设备及存储介质
CN108846841A (zh) 2018-07-02 2018-11-20 北京百度网讯科技有限公司 显示屏质量检测方法、装置、电子设备及存储介质
KR102092617B1 (ko) * 2018-07-05 2020-05-22 인하대학교 산학협력단 단방향 데이터 변환을 이용한 프라이버시 보장형 기계 학습 방법
US10739979B2 (en) * 2018-07-16 2020-08-11 Microsoft Technology Licensing, Llc Histogram slider for quick navigation of a time-based list
JP7095479B2 (ja) * 2018-08-10 2022-07-05 株式会社リコー 学習装置および学習方法
WO2020068141A1 (en) * 2018-09-26 2020-04-02 Google Llc Predicted variables in programming
JP7028746B2 (ja) * 2018-10-05 2022-03-02 株式会社日立製作所 質問生成装置および質問生成方法
CN109446078B (zh) * 2018-10-18 2022-02-18 网易(杭州)网络有限公司 代码测试方法及装置、存储介质、电子设备
WO2020088681A1 (en) * 2018-11-01 2020-05-07 华为技术有限公司 Management method for model files and terminal device
CN110046634B (zh) * 2018-12-04 2021-04-27 创新先进技术有限公司 聚类结果的解释方法和装置
CN109600255A (zh) * 2018-12-04 2019-04-09 中山大学 一种去中心化的参数服务器优化算法
KR102190100B1 (ko) * 2018-12-27 2020-12-11 (주)아크릴 인공 신경망 학습 방법
US11068694B2 (en) 2019-01-23 2021-07-20 Molecular Devices, Llc Image analysis system and method of using the image analysis system
CN109816043B (zh) * 2019-02-02 2021-01-01 拉扎斯网络科技(上海)有限公司 用户识别模型的确定方法、装置、电子设备及存储介质
JP7059220B2 (ja) * 2019-02-15 2022-04-25 株式会社日立製作所 機械学習プログラム検証装置および機械学習プログラム検証方法
US11487739B2 (en) * 2019-02-19 2022-11-01 Nasdaq, Inc. System and methods for data model detection and surveillance
CN111694675B (zh) 2019-03-15 2022-03-08 上海商汤智能科技有限公司 任务调度方法及装置、存储介质
JP7022714B2 (ja) * 2019-03-26 2022-02-18 Kddi株式会社 クライアント装置、情報処理方法、及びプログラム
JP7178314B2 (ja) * 2019-03-29 2022-11-25 株式会社日立製作所 モデルの採否判断を支援するシステム及び方法
US20220222554A1 (en) 2019-04-23 2022-07-14 Nec Corporation Operation result predicting method, electronic device, and computer program product
US11106689B2 (en) * 2019-05-02 2021-08-31 Tata Consultancy Services Limited System and method for self-service data analytics
US11886960B2 (en) 2019-05-07 2024-01-30 International Business Machines Corporation Elastic training of machine learning models via re-partitioning based on feedback from the training algorithm
US11573803B2 (en) * 2019-05-07 2023-02-07 International Business Machines Corporation Parallel training of machine learning models
JP7238610B2 (ja) * 2019-06-04 2023-03-14 富士フイルムビジネスイノベーション株式会社 情報処理装置及びプログラム
US11694124B2 (en) * 2019-06-14 2023-07-04 Accenture Global Solutions Limited Artificial intelligence (AI) based predictions and recommendations for equipment
US20220358149A1 (en) * 2019-07-12 2022-11-10 Telefonaktiebolaget Lm Ericsson (Publ) Life cycle management
US11487973B2 (en) 2019-07-19 2022-11-01 UiPath, Inc. Retraining a computer vision model for robotic process automation
EP3770760A1 (en) * 2019-07-23 2021-01-27 Siemens Aktiengesellschaft Prediction of resource consumption for functions
US11392796B2 (en) 2019-08-20 2022-07-19 Micron Technology, Inc. Feature dictionary for bandwidth enhancement
US11449796B2 (en) 2019-09-20 2022-09-20 Amazon Technologies, Inc. Machine learning inference calls for database query processing
CN112685518B (zh) * 2019-10-18 2023-10-20 菜鸟智能物流控股有限公司 一种服务提供对象的分配方法、一种订单分配方法和装置
JP7012696B2 (ja) * 2019-10-21 2022-01-28 株式会社三菱総合研究所 情報処理装置及び情報処理方法
CN110806923B (zh) * 2019-10-29 2023-02-24 百度在线网络技术(北京)有限公司 一种区块链任务的并行处理方法、装置、电子设备和介质
CN110889255B (zh) * 2019-10-31 2022-09-13 国网湖北省电力有限公司 一种基于级联深度森林的电力系统暂态稳定评估方法
WO2021100546A1 (en) * 2019-11-20 2021-05-27 富士フイルム株式会社 Artificial intelligence processing system, upload management device, method, and program
US11551652B1 (en) * 2019-11-27 2023-01-10 Amazon Technologies, Inc. Hands-on artificial intelligence education service
WO2021111431A1 (en) * 2019-12-05 2021-06-10 Orbotech Ltd. Improving accuracy of classification models
CN113014413A (zh) * 2019-12-20 2021-06-22 中兴通讯股份有限公司 应用于通信系统的阈值优化方法、装置和计算机可读介质
US11507836B1 (en) 2019-12-20 2022-11-22 Apple Inc. Federated learning using local ground truth estimation
FR3105863B1 (fr) * 2019-12-31 2022-01-21 Bull Sas Procédé ET système de conception d’un modèle de prédiction
US11948050B2 (en) * 2020-02-19 2024-04-02 EMC IP Holding Company LLC Caching of machine learning model training parameters
CN111339298B (zh) * 2020-02-25 2024-04-09 北京小米松果电子有限公司 一种分类预测方法、装置及存储介质
US11907172B2 (en) 2020-03-17 2024-02-20 Nec Corporation Information processing system, information processing method, and recording medium
CN111582381B (zh) * 2020-05-09 2024-03-26 北京市商汤科技开发有限公司 确定性能参数的方法及装置、电子设备和存储介质
TWI756685B (zh) * 2020-05-15 2022-03-01 昕力資訊股份有限公司 排程和執行工作的電腦程式產品和裝置
CN111767222A (zh) * 2020-06-28 2020-10-13 杭州数梦工场科技有限公司 数据模型的验证方法、装置、电子设备、存储介质
WO2022044233A1 (en) * 2020-08-27 2022-03-03 日本電信電話株式会社 Estimation device, estimation method, and program
US20230261857A1 (en) * 2020-10-29 2023-08-17 Hewlett-Packard Development Company, L.P. Generating statements
JP7168630B2 (ja) * 2020-11-11 2022-11-09 株式会社日立製作所 計算機システム及びジョブの実行制御方法
CN112445462A (zh) * 2020-11-16 2021-03-05 北京思特奇信息技术股份有限公司 基于面向对象设计的人工智能建模平台和方法
EP4016295A1 (en) * 2020-12-15 2022-06-22 Aptiv Technologies Limited Managing a machine learning environment
US20220230249A1 (en) * 2021-01-19 2022-07-21 Better Holdco, Inc. Condition tree optimization
CN112835910B (zh) * 2021-03-05 2023-10-17 天九共享网络科技集团有限公司 一种企业信息与政策信息的处理方法和装置
WO2022210017A1 (en) * 2021-03-31 2022-10-06 日本電気株式会社 AI analysis system, usage charge calculation method, and recording medium
US11675688B2 (en) * 2021-05-20 2023-06-13 Nextmv.Io Inc. Runners for optimization solvers and simulators
CN113239060B (zh) * 2021-05-31 2023-09-29 康键信息技术(深圳)有限公司 数据资源分配处理方法、装置、设备及存储介质
AU2022297419A1 (en) * 2021-06-22 2023-10-12 C3.Ai, Inc. Methods, processes, and systems to deploy artificial intelligence (ai)-based customer relationship management (crm) system using model-driven software architecture
CN117597579A (zh) * 2021-07-08 2024-02-23 杰富意钢铁株式会社 检查方法、分类方法、管理方法、钢材的制造方法、学习模型的生成方法、学习模型、检查装置以及钢材的制造设备
US20230128532A1 (en) * 2021-10-24 2023-04-27 International Business Machines Corporation Distributed computing for dynamic generation of optimal and interpretable prescriptive policies with interdependent constraints
KR102433830B1 (ko) * 2021-11-10 2022-08-18 한국인터넷진흥원 인공지능 기반 보안위협 이상행위 탐지 시스템 및 방법
CN114205164B (zh) * 2021-12-16 2023-07-18 北京百度网讯科技有限公司 流量分类方法及装置、训练方法及装置、设备和介质
WO2023154558A1 (en) * 2022-02-14 2023-08-17 The Trustees Of Princeton University Data multiplexing for neural networks
WO2024073531A1 (en) * 2022-09-29 2024-04-04 Amazon Technologies, Inc. Multi-domain configurable data compressor/de-compressor

Family Cites Families (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6681383B1 (en) * 2000-04-04 2004-01-20 Sosy, Inc. Automatic software production system
US7054847B2 (en) * 2001-09-05 2006-05-30 Pavilion Technologies, Inc. System and method for on-line training of a support vector machine
US9792359B2 (en) * 2005-04-29 2017-10-17 Entit Software Llc Providing training information for training a categorizer
JP2009093334A (ja) * 2007-10-05 2009-04-30 Seiko Epson Corp 識別方法及びプログラム
JP5133775B2 (ja) * 2008-05-19 2013-01-30 株式会社野村総合研究所 ジョブ管理装置
JP4591566B2 (ja) * 2008-07-14 2010-12-01 ソニー株式会社 情報処理装置、情報処理方法、およびプログラム
US8522085B2 (en) * 2010-01-27 2013-08-27 Tt Government Solutions, Inc. Learning program behavior for anomaly detection
US8489745B2 (en) * 2010-02-26 2013-07-16 International Business Machines Corporation Optimizing power consumption by dynamic workload adjustment
US8438122B1 (en) * 2010-05-14 2013-05-07 Google Inc. Predictive analytic modeling platform
US9020871B2 (en) * 2010-06-18 2015-04-28 Microsoft Technology Licensing, Llc Automated classification pipeline tuning under mobile device resource constraints
US8566746B2 (en) * 2010-08-30 2013-10-22 Xerox Corporation Parameterization of a categorizer for adjusting image categorization and retrieval
CN104484322A (zh) * 2010-09-24 2015-04-01 新加坡国立大学 用于自动化文本校正的方法和系统
US8595154B2 (en) * 2011-01-26 2013-11-26 Google Inc. Dynamic predictive modeling platform
US8533222B2 (en) * 2011-01-26 2013-09-10 Google Inc. Updateable predictive analytical modeling
WO2012151198A1 (en) * 2011-05-04 2012-11-08 Google Inc. Predictive analytical modeling accuracy assessment
US8229864B1 (en) * 2011-05-06 2012-07-24 Google Inc. Predictive model application programming interface
US8370280B1 (en) * 2011-07-14 2013-02-05 Google Inc. Combining predictive models in predictive analytical modeling
US9361273B2 (en) * 2011-07-21 2016-06-07 Sap Se Context-aware parameter estimation for forecast models
EP2629247B1 (en) * 2012-02-15 2014-01-08 Alcatel Lucent Method for mapping media components employing machine learning
US8775576B2 (en) * 2012-04-17 2014-07-08 Nimbix, Inc. Reconfigurable cloud computing
US20140046879A1 (en) * 2012-08-13 2014-02-13 Predixion Software, Inc. Machine learning semantic model
JP5881048B2 (ja) * 2012-09-18 2016-03-09 株式会社日立製作所 情報処理システム、及び、情報処理方法
CN103218263B (zh) * 2013-03-12 2016-03-23 北京航空航天大学 MapReduce参数的动态确定方法及装置
CN103336869B (zh) * 2013-07-05 2016-07-06 广西大学 一种基于高斯过程联立mimo模型的多目标优化方法
CN103593323A (zh) * 2013-11-07 2014-02-19 浪潮电子信息产业股份有限公司 一种MapReduce任务资源配置参数的机器学习方法
US9886670B2 (en) * 2014-06-30 2018-02-06 Amazon Technologies, Inc. Feature processing recipes for machine learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
None *
See also references of WO2016004062A1 *

Also Published As

Publication number Publication date
EP4328816A1 (fr) 2024-02-28
CA2953826C (fr) 2021-07-13
WO2016004073A1 (fr) 2016-01-07
CN106575246A (zh) 2017-04-19
JP2017527008A (ja) 2017-09-14
WO2016004075A1 (fr) 2016-01-07
WO2016004062A1 (fr) 2016-01-07
CA2953817C (fr) 2023-07-04
CA2953969A1 (fr) 2016-01-07
JP6445055B2 (ja) 2018-12-26
JP6419859B2 (ja) 2018-11-07
JP6419860B2 (ja) 2018-11-07
EP3161733A1 (fr) 2017-05-03
JP2017524183A (ja) 2017-08-24
CA2953817A1 (fr) 2016-01-07
JP6371870B2 (ja) 2018-08-08
CN113157448B (zh) 2024-04-12
CN106575246A8 (zh) 2017-07-07
JP2017530435A (ja) 2017-10-12
CA2953959C (fr) 2021-02-02
EP3161732A1 (fr) 2017-05-03
JP2017529583A (ja) 2017-10-05
CA2953959A1 (fr) 2016-01-07
CA2953969C (fr) 2023-08-01
EP3161635B1 (fr) 2023-11-01
EP3161635A1 (fr) 2017-05-03
CN113157448A (zh) 2021-07-23
CN106575246B (zh) 2021-01-01
CA3200347A1 (fr) 2016-01-07
CA2953826A1 (fr) 2016-01-07
CA3198484A1 (fr) 2016-01-07
WO2016004063A1 (fr) 2016-01-07

Similar Documents

Publication Publication Date Title
US20210374610A1 (en) Efficient duplicate detection for machine learning data sets
CA2953959C (en) Feature processing recipes for machine learning
US20220335338A1 (en) Feature processing tradeoff management
US20200050968A1 (en) Interactive interfaces for machine learning model evaluations
US9672474B2 (en) Concurrent binning of machine learning data
CN106663037B (zh) 用于管理特征处理的系统和方法
US10318882B2 (en) Optimized training of linear machine learning models
US10339465B2 (en) Optimized decision tree based models
US11182691B1 (en) Category-based sampling of machine learning data

Legal Events

Date Code Title Description
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE

PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

17P Request for examination filed

Effective date: 20170130

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

AX Request for extension of the european patent

Extension state: BA ME

DAV Request for validation of the european patent (deleted)
DAX Request for extension of the european patent (deleted)
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: EXAMINATION IS IN PROGRESS

17Q First examination report despatched

Effective date: 20190315

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: EXAMINATION IS IN PROGRESS

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION HAS BEEN WITHDRAWN

18W Application withdrawn

Effective date: 20211125