WO2019215713A1 - Multiple-part machine learning solutions generated by data scientists - Google Patents


Info

Publication number
WO2019215713A1
Authority
WO
WIPO (PCT)
Prior art keywords
machine learning
data
solution
business
processor
Prior art date
Application number
PCT/IL2019/050358
Other languages
French (fr)
Inventor
Amir RASKIN
Keren SHAKED
Original Assignee
Shoodoo Analytics Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shoodoo Analytics Ltd. filed Critical Shoodoo Analytics Ltd.
Publication of WO2019215713A1 publication Critical patent/WO2019215713A1/en


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • G06N20/20Ensemble learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/02Knowledge representation; Symbolic representation
    • G06N5/022Knowledge engineering; Knowledge acquisition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/02Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201Market modelling; Market analysis; Collecting market data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/02Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201Market modelling; Market analysis; Collecting market data
    • G06Q30/0203Market surveys; Market polls

Definitions

  • the present invention relates generally to artificial intelligence and more particularly to machine learning.
  • R, aka "GNU S" or "the R Project for Statistical Computing", is an example of an available language and environment which enables data scientists to perform statistical computing and graphics, including various statistical and graphical techniques such as but not limited to linear and nonlinear modelling, statistical tests, time series analysis, classification, and clustering.
  • CRAN (the Comprehensive R Archive Network) is a network of FTP and web servers which stores code and documentation for R.
  • Certain embodiments of the present invention seek to provide a data processing system, method and computer program product, based on multiple-part machine learning solutions generated by data scientists.
  • Certain embodiments of the present invention seek to provide a data processing system comprising: a system-scientist data interface controlled by a first processor to accept and store in a digital Machine Learning Solution repository, Machine Learning Solutions from scientists, each Machine Learning Solution including multiple blocks of code typically comprising a sequence of blocks of code, each block of code typically performing predefined operations defined by a challenge sent to the scientists and typically having a predefined format (which may be defined in a work-order sent to the data scientists) allowing any first block of code generated within a challenge according to the format by a first data scientist to use output of any second block of code generated within the challenge according to the format by a second data scientist if the second block of code precedes the first block of code in the sequence; and/or a second processor configured to mix and match Machine Learning Solution blocks stored in the Machine Learning Solution repository thereby to generate Machine Learning Solution Block Combinations, thereby, typically, to address plural respective Business Solutions and each Business Solution’s respective Analytical Solution, and/or a system-business user interface controlled by a third processor to run the Machine
  • a challenge defines a sequence of blocks including at least block1 followed by block2, and each data scientist responding to the challenge generates each block in the sequence, and the second processor configured to mix and match is operative to generate at least one pipeline comprising a block1 generated by a first data scientist followed by a block2 generated by a second data scientist, other than the first data scientist.
  • a challenge may define a sequence: block1, block2, block3, block4 and the second processor configured to mix and match may generate inter alia at least one pipeline including a block1, block2, block3 and block4 respectively generated by 4 different data scientists all of whom responded to the challenge.
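The mix-and-match described above amounts to taking a Cartesian product over the per-block submissions. A minimal sketch, with all scientist names and the `submissions` structure being illustrative assumptions rather than the patent's own data model:

```python
from itertools import product

def generate_pipelines(submissions):
    """Mix and match: build every pipeline that takes each block in the
    challenge-defined sequence from any data scientist who submitted it.

    `submissions` maps each block name (in pipeline order) to the list of
    scientists whose version of that block is stored in the repository.
    """
    block_names = list(submissions)
    choices = [[(name, author) for author in submissions[name]]
               for name in block_names]
    return [list(pipeline) for pipeline in product(*choices)]

# A challenge defining the sequence block1..block4, answered by 4 scientists:
submissions = {
    "block1": ["alice", "bob", "carol", "dan"],
    "block2": ["alice", "bob", "carol", "dan"],
    "block3": ["alice", "bob", "carol", "dan"],
    "block4": ["alice", "bob", "carol", "dan"],
}
pipelines = generate_pipelines(submissions)
# 4 authors per block, 4 blocks -> 4**4 = 256 candidate pipelines, including
# ones whose four blocks come from four different data scientists.
```

The predefined block format is what makes this product valid: any block1 output can feed any block2, regardless of author.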
  • circuitry typically comprising at least one processor in communication with at least one memory, with instructions stored in such memory executed by the processor to provide functionalities which are described herein in detail. Any functionality described herein may be firmware-implemented or processor- implemented as appropriate.
  • any reference herein to, or recitation of, an operation being performed is, e.g. if the operation is performed at least partly in software, intended to include both an embodiment where the operation is performed in its entirety by a server A, and also to include any type of "outsourcing" or "cloud" embodiments in which the operation, or portions thereof, is or are performed by a remote processor P (or several such), which may be deployed off-shore or "on a cloud", and an output of the operation is then communicated to, e.g. over a suitable computer network, and used by, server A.
  • the remote processor P may not, itself, perform all of the operation; instead, the remote processor P may itself receive output/s of portion/s of the operation from yet another processor/s, which may be deployed off-shore relative to P, or "on a cloud", and so forth.
  • Embodiment 1 A data processing system comprising: a system-scientist data interface controlled by a first processor to accept and store in a digital repository, multiple-part Machine Learning Solutions from scientists each including multiple blocks; a second processor compiling plural Machine Learning Solution pipelines to address plural respective Business Solutions and each Business Solution’s respective Analytical Solution, wherein the compiling comprises mixing-and-matching Machine Learning Solution blocks stored in the Machine Learning Solution repository thereby to generate Machine Learning Solution Block Combinations, and a system-business user interface controlled by a third processor to generate a Machine Learning Solution pipeline output, for at least one business user presenting an individual Business Solution including business data (aka business data set - Business Data Set) using the Machine Learning Solutions pipeline compiled by the processor for the individual Business Solution.
  • a system-scientist data interface controlled by a first processor to accept and store in a digital repository, multiple-part Machine Learning Solutions from scientists each including multiple blocks
  • a second processor compiling plural Machine Learning Solution pipelines to address plural respective Business Solutions and each Business Solution’s respective Analytical Solution, where
  • Embodiment 2 A system according to any of the preceding embodiments and also comprising a fourth processor operative to communicate to at least one Data Engineer, a challenge inviting the at least one Data Engineer to provide at least one Machine Learning Solution.
  • Embodiment 3 A system according to any of the preceding embodiments wherein the multiple blocks include a Business Solution-Feature Engineering and Extraction block.
  • Embodiment 4 A system according to any of the preceding embodiments wherein the multiple blocks include a Machine Learning Solution -Features Selection block.
  • Embodiment 5 A system according to any of the preceding embodiments wherein the multiple blocks include a Machine Learning Solution - Training/Learning/Full Evaluation block.
  • Embodiment 6 A system according to any of the preceding embodiments wherein the multiple blocks include a Machine Learning Solution - Final Prediction block.
  • Embodiment 7 A system according to any of the preceding embodiments wherein for at least one Business Solution, the Machine Learning Solution Block combinations automatically compete with one another including identifying a best block combination and wherein the Machine Learning Solution pipeline compiled by the processor for the individual Business Solution comprises the best block combination.
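The competition among block combinations described in Embodiment 7 can be sketched minimally as follows; the scoring function and combination names are illustrative stand-ins, since the patent leaves the benchmark metric itself open:

```python
def best_combination(combinations, score):
    """Let candidate block combinations compete: score each one and keep
    the best. In practice `score` would run the assembled pipeline on a
    validation set and return e.g. prediction accuracy.
    """
    scored = [(score(c), c) for c in combinations]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score

# Toy competition: pretend each combination's quality is already known.
quality = {"A+B": 0.71, "A+C": 0.84, "D+B": 0.66}
winner, s = best_combination(quality, lambda c: quality[c])
# The combination with the highest score is compiled into the pipeline
# served to the business user.
```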
  • Embodiment 8 A system according to any of the preceding embodiments wherein the Business Solution-Feature Engineering and Extraction block includes code operative for pre-processing data before machine learning analysis.
  • Embodiment 9 A system according to any of the preceding embodiments wherein all pre-processing of data before machine learning analysis occurs in the Business Solution-Feature Engineering and Extraction block and not in other blocks.
  • Embodiment 10 A system according to any of the preceding embodiments wherein the pre-processing includes removing data columns.
  • Embodiment 11 A system according to any of the preceding embodiments wherein the pre-processing includes extracting additional data from existing data and using the additional data as input to the machine learning analysis.
  • Embodiment 12 A system according to any of the preceding embodiments wherein the Machine Learning Solution - Features Selection block includes code operative for selecting features which are stronger predictors of at least one target variable to be predicted in a given challenge and not selecting features which are less strong predictors of the at least one target variable.
  • Embodiment 13 A system according to any of the preceding embodiments wherein all code operative for selecting features which are stronger predictors of at least one target variable to be predicted in a given challenge and not selecting features which are less strong predictors of the at least one target variable is included in the Machine Learning Solution - Features Selection block and not in other blocks.
  • Embodiment 14 A system according to any of the preceding embodiments wherein the Machine Learning Solution - Training/Learning/Full Evaluation block includes machine learning code operative to train and evaluate a machine learning training set.
  • Embodiment 15 A system according to any of the preceding embodiments wherein all code operative to train and evaluate a machine learning training set is included in the Machine Learning Solution - Training/Learning/Full Evaluation block and not in other blocks.
  • Embodiment 16 A system according to any of the preceding embodiments wherein the Machine Learning Solution - Final Prediction block is operative to run machine learning code on a test set including providing a final prediction for each data record in the test set.
  • Embodiment 17 A system according to any of the preceding embodiments wherein the Machine Learning Solution - Final Prediction block is operative to run machine learning code included in a Machine Learning Solution - Training/Learning/Full Evaluation block on a test set including providing a final prediction for each data record in the test set.
  • Embodiment 18 A system according to any of the preceding embodiments wherein all code operative to run machine learning code on a test set including providing a final prediction for each data record in the test set is performed by the Machine Learning Solution - Final Prediction block and not by other blocks.
  • Embodiment 19 A data processing method comprising: Providing a system-scientist data interface controlled by a first processor to accept and store in a digital repository, multiple-part Machine Learning Solutions from scientists each including multiple blocks; Providing a second processor compiling plural Machine Learning Solution pipelines to address plural respective Business Solutions and each Business Solution’s respective Analytical Solution, wherein the compiling comprises mixing-and-matching Machine Learning Solution blocks stored in the Machine Learning Solution repository thereby to generate Machine Learning Solution Block Combinations, and Providing a system-business user interface controlled by a third processor to generate a Machine Learning Solution pipeline output, for at least one business user presenting an individual Business Solution including business data (aka business data set - Business Data Set) using the Machine Learning Solutions pipeline compiled by the processor for the individual Business Solution.
  • Providing a system-scientist data interface controlled by a first processor to accept and store in a digital repository, multiple-part Machine Learning Solutions from scientists each including multiple blocks Providing a second processor compiling plural Machine Learning Solution pipelines to address plural respective Business Solutions and each Business Solution’
  • Embodiment 20 A computer program product, comprising a non-transitory tangible computer readable medium having computer readable program code embodied therein, the computer readable program code adapted to be executed to implement a data processing method comprising: Providing a system-scientist data interface controlled by a first processor to accept and store in a digital repository, multiple-part Machine Learning Solutions from scientists each including multiple blocks; Providing a second processor compiling plural Machine Learning Solution pipelines to address plural respective Business Solutions and each Business Solution’s respective Analytical Solution, wherein the compiling comprises mixing-and-matching Machine Learning Solution blocks stored in the Machine Learning Solution repository thereby to generate Machine Learning Solution Block Combinations, and Providing a system-business user interface controlled by a third processor to generate a Machine Learning Solution pipeline output, for at least one business user presenting an individual Business Solution including business data (aka business data set - Business Data Set) using the Machine Learning Solutions pipeline compiled by the processor for the individual Business Solution.
  • a data processing method comprising: Providing a system-scientist data interface controlled
  • a computer program comprising computer program code means for performing any of the methods shown and described herein when the program is run on at least one computer; and a computer program product, comprising a typically non- transitory computer-usable or -readable medium e.g. non-transitory computer -usable or -readable storage medium, typically tangible, having a computer readable program code embodied therein, the computer readable program code adapted to be executed to implement any or all of the methods shown and described herein.
  • the operations in accordance with the teachings herein may be performed by at least one computer specially constructed for the desired purposes or general purpose computer specially configured for the desired purpose by at least one computer program stored in a typically non-transitory computer readable storage medium.
  • the term "non-transitory” is used herein to exclude transitory, propagating signals or waves, but to otherwise include any volatile or non-volatile computer memory technology suitable to the application.
  • processor/s, display and input means may be used to process, display e.g. on a computer screen or other computer output device, store, and accept information such as information used by or generated by any of the methods and apparatus shown and described herein; the above processor/s, display and input means including computer programs, in accordance with some or all of the embodiments of the present invention.
  • Any or all functionalities of the invention shown and described herein, such as but not limited to operations within flowcharts, may be performed by any one or more of: at least one conventional personal computer processor, workstation or other programmable device or computer or electronic computing device or processor, either general-purpose or specifically constructed, used for processing; a computer display screen and/or printer and/or speaker for displaying; machine-readable memory such as optical disks, CDROMs, DVDs, BluRays, magnetic-optical discs or other discs; RAMs, ROMs, EPROMs, EEPROMs, magnetic or optical or other cards, for storing, and keyboard or mouse for accepting.
  • Modules shown and described herein may include any one or combination or plurality of: a server, a data processor, a memory/computer storage, a communication interface, a computer program stored in memory/computer storage.
  • processor is intended to include any type of computation or manipulation or transformation of data represented as physical, e.g. electronic, phenomena which may occur or reside e.g. within registers and /or memories of at least one computer or processor.
  • processor is intended to include a plurality of processing units which may be distributed or remote
  • server is intended to include plural typically interconnected modules running on plural respective servers, and so forth.
  • the above devices may communicate via any conventional wired or wireless digital communication means, e.g. via a wired or cellular telephone network or a computer network such as the Internet.
  • the apparatus of the present invention may include, according to certain embodiments of the invention, machine readable memory containing or otherwise storing a program of instructions which, when executed by the machine, implements some or all of the apparatus, methods, features and functionalities of the invention shown and described herein.
  • the apparatus of the present invention may include, according to certain embodiments of the invention, a program as above which may be written in any conventional programming language, and optionally a machine for executing the program such as but not limited to a general purpose computer which may optionally be configured or activated in accordance with the teachings of the present invention. Any of the teachings incorporated herein may, wherever suitable, operate on signals representative of physical objects or substances.
  • the term "computer" should be broadly construed to cover any kind of electronic device with data processing capabilities, including, by way of non-limiting example, personal computers, servers, embedded cores, computing systems, communication devices, processors (e.g. digital signal processor (DSP), microcontrollers, field programmable gate array (FPGA), application specific integrated circuit (ASIC), etc.) and other electronic computing devices.
  • DSP digital signal processor
  • FPGA field programmable gate array
  • ASIC application specific integrated circuit
  • Any reference to a computer, controller or processor is intended to include one or more hardware devices e.g. chips, which may be co-located or remote from one another.
  • Any controller or processor may for example comprise at least one CPU, DSP, FPGA or ASIC, suitably configured in accordance with the logic and functionalities described herein.
  • any indication herein that an element or feature may exist is intended to include (a) embodiments in which the element or feature exists; (b) embodiments in which the element or feature does not exist; and (c) embodiments in which the element or feature exists selectably, e.g. a user may configure or select whether the element or feature does or does not exist.
  • Any suitable input device such as but not limited to a sensor, may be used to generate or otherwise provide information received by the apparatus and methods shown and described herein.
  • Any suitable output device or display may be used to display or output information generated by the apparatus and methods shown and described herein.
  • Any suitable processor/s may be employed to compute or generate information as described herein and/or to perform functionalities described herein and/or to implement any engine, interface or other system described herein.
  • Any suitable computerized data storage e.g. computer memory may be used to store information received by or generated by the systems shown and described herein.
  • Functionalities shown and described herein may be divided between a server computer and a plurality of client computers. These or any other computerized components shown and described herein may communicate between themselves via a suitable computer network.
  • arrows between modules may be implemented as APIs and any suitable technology may be used for interconnecting functional components or modules illustrated herein in a suitable sequence or order e.g. via a suitable API/Interface.
  • Any table herein may include some or all of the fields and/or records and/or cells, rows or columns shown herein.
  • Fig. 1 is a top level flow which may for example be performed by the system of Fig. 10.
  • Fig. 2 is a zoom-in on operation 400 of Fig. 1: matchmaking.
  • Fig. 3 is a zoom-in on phase i aka operation 410 of Fig. 2.
  • Fig. 4 is a zoom-in on phase ii aka operation 420 of Fig. 2.
  • Fig. 5 is a pictorial illustration showing operations 100 - 400 of Fig. 1, inter alia.
  • Fig. 6 is a table useful in understanding certain embodiments.
  • Fig. 7 is a table of an example data structure for a Business Solution repository.
  • Fig. 8 is a table useful in understanding certain embodiments.
  • Fig. 9 is a table of an example data structure for an Analytical Solution (AS) repository.
  • Fig. 10 is a diagram useful in understanding certain embodiments.
  • Fig. 11 is a table of an example data structure for a Machine Learning Solution (MLS) repository.
  • MLS Machine Learning Solution
  • Fig. 12 is an example of operation of an MLS pipeline population method in accordance with certain embodiments.
  • Fig. 13 is a table useful in understanding certain embodiments.
  • Fig. 14 is a table useful in understanding certain embodiments.
  • Fig. 15 is a table showing example formalization for Business Questions.
  • Methods and systems included in the scope of the present invention may include some (e.g. any suitable subset) or all of the functional blocks shown in the specifically illustrated implementations by way of example, in any suitable order e.g. as shown.
  • Computational, functional or logical components described and illustrated herein can be implemented in various forms, for example, as hardware circuits such as but not limited to custom VLSI circuits or gate arrays or programmable hardware devices such as but not limited to FPGAs, or as software program code stored on at least one tangible or intangible computer readable medium and executable by at least one processor, or any suitable combination thereof.
  • a specific functional component may be formed by one particular sequence of software code, or by a plurality of such, which collectively act or behave or act as described herein with reference to the functional component in question.
  • the component may be distributed over several code sequences such as but not limited to objects, procedures, functions, routines and programs and may originate from several computer files which typically operate synergistically.
  • Each functionality or method herein may be implemented in software, firmware, hardware or any combination thereof. Functionality or operations stipulated as being software-implemented may alternatively be wholly or partly implemented by an equivalent hardware or firmware module and vice-versa.
  • Firmware implementing functionality described herein if provided, may be held in any suitable memory device and a suitable processing unit (aka processor) may be configured for executing firmware code.
  • processor a suitable processing unit
  • certain embodiments described herein may be implemented partly or exclusively in hardware in which case some or all of the variables, parameters, and computations described herein may be in hardware.
  • modules or functionality described herein may comprise a suitably configured hardware component or circuitry.
  • modules or functionality described herein may be performed by a general purpose computer or more generally by a suitable microprocessor, configured in accordance with: methods shown and described herein, or any suitable subset, in any suitable order, of the operations included in such methods, or in accordance with methods known in the art.
  • Any logical functionality described herein may be implemented as a real time application if and as appropriate and which may employ any suitable architectural option such as but not limited to FPGA, ASIC or DSP or any suitable combination thereof.
  • Any hardware component mentioned herein may in fact include either one or more hardware devices e.g. chips, which may be co-located or remote from one another.
  • Any method described herein is intended to include within the scope of the embodiments of the present invention also any software or computer program performing some or all of the method’s operations, including a mobile application, platform or operating system e.g. as stored in a medium, as well as combining the computer program with a hardware device to perform some or all of the operations of the method.
  • Suitable computer data storage or information retention apparatus may include apparatus which is primary, secondary, tertiary or off-line; which is of any type or level or amount or category of volatility, differentiation, mutability, accessibility, addressability, capacity, performance and energy use; and which is based on any suitable technologies such as semiconductor, magnetic, optical, paper and others.
  • ADDS Additional Data Set: new features and/or files generated by a module, e.g. by executing Additions Rules that may be predefined for each of a Data Set's features in the business solution and analytics solution.
  • ADS Analytical Data Set: defines data, typically required data, associated with the AS.
  • AS Analytical Solution: typically comprises a (e.g. statistical) representation of plural business solutions which are all, typically, statistically similar.
  • a suitable ontology may be used to define business questions such as the following example ontology, which includes some or all of the key words below. According to certain embodiments, each entity and/or verb and/or event in the business question must be selected from the following; the ontology may be stored as Business Question metadata:
  • CSV term of the art e.g. as defined in https://support.bigcommerce.com/articles/Public/What-is- a-CSV-file-and-how-do-I-save-my-spreadsheet-as-one
  • F file e.g. Data File
  • UDS User Data Set
  • a data science platform is shown and described herein.
  • an analytical Business Question requests a business answer which, based on past data, predicts the future or arranges the business entities in a new way.
  • a Business Question is meant for decision support of managerial nature (e.g. "what is the best decision I should take now?"), or of operational nature (e.g. "define the next action to be implemented in a process or for an asset").
  • Certain embodiments include at least one Data Science process setup in advance e.g. as described herein and/or at least one automatic run typically initiated by the business user.
  • the run matches a good, or preferably the best, produced analytical answer with:
  • BDS Business Data Set
  • UDS User Data Set
  • the system of Fig. 10 might store at least one additional, typically non-user-specific (e.g. suitable commercially available) Additional Data Set (ADDS) that may be merged with the User Data Set to enrich it, e.g. as per operation 520.
  • ADDS commercially Available Additional Data Set
  • 520 is the point where these enriching Data Sets are merged. From this point on, they are part of the complete Data Sets (they are no longer considered special). Enriching may comprise adding additional Data Sets in operation 520 which, rather than including end-user data, include commercial and/or open and/or government data.
  • Flows which may be performed by the system typically comprise all or any subset of the following:
  • FIG. 1 Top level flow of Fig. 1 which may include some or any subset of the following, suitably ordered e.g. as shown: Operation 100: Pre-storing Business Solutions
  • Operation 300 Pre-storing Machine Learning Solutions
  • Operation 400 Matchmaking the best Machine Learning Solutions (MLSs) (where "best" typically refers to a best predicting solution or a solution having a highest benchmark score). Operation 400 is operative for running all MLSs and selecting between all the available MLSs that are candidates (e.g. included in the pipeline).
  • MLSs Machine Learning Solutions
  • each analytical solution comprises a template, and a challenge relates to a specific instance of that template.
  • Each instance may include a specific data set and there may be differences between two individual challenges "under" a single analytical solution, beyond differences in the data set e.g. differences in the specific operational parameters.
  • Each analytical solution typically includes a statistical function and a data question and a structure for the data set without the specific data itself.
  • Each challenge distributed to data scientists typically includes a work order and typically includes a data set, which typically includes example deliverable/s.
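The template/instance relationship described in the bullets above can be sketched as a pair of data structures; all field names and example values here are illustrative assumptions, not taken from the patent:

```python
from dataclasses import dataclass, field

@dataclass
class AnalyticalSolution:
    """Template: a statistical function, a data question, and the expected
    structure (schema) of the data set -- without the specific data itself."""
    statistical_function: str   # e.g. "classification" (illustrative)
    data_question: str          # e.g. "which customers will churn?"
    data_set_schema: list       # expected column names

@dataclass
class Challenge:
    """A specific instance of an AnalyticalSolution, as distributed to data
    scientists: adds the concrete data set, a work order defining the block
    format, and instance-specific operational parameters."""
    template: AnalyticalSolution
    data_set: list              # concrete records matching the schema
    work_order: str             # defines the predefined block format
    operational_parameters: dict = field(default_factory=dict)

churn_as = AnalyticalSolution("classification",
                              "which customers will churn?",
                              ["customer_id", "tenure", "churned"])
challenge = Challenge(churn_as,
                      data_set=[{"customer_id": 1, "tenure": 8, "churned": 0}],
                      work_order="FEE -> FS -> TLF -> FP")
```

Two challenges "under" the same AnalyticalSolution would share the template but differ in `data_set` and possibly `operational_parameters`.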
  • each data scientist is called upon to write all or a specified subset of the following four blocks of code (four programs):
  • the Machine Learning Solution - FEE (Features Engineering & Extraction) block makes the data as productive as possible for machine learning analysis use, e.g. by removing empty features (e.g. data columns such as date, scalars, codes etc.) and/or by extracting from the data as many derived (additional) features as possible. This extracts as much information as possible from a provided Pn Data Set.
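A minimal sketch of such an FEE block, under the assumption that records arrive as dictionaries; the column names and derivation rules (year/month/weekday from a raw date column) are illustrative:

```python
from datetime import date

def feature_engineering_and_extraction(records):
    """FEE sketch: drop columns that are empty in every record, and derive
    extra features (year, month, weekday) from a raw `date` column so that
    downstream blocks receive as much usable information as possible."""
    columns = records[0].keys()
    empty = {c for c in columns if all(r[c] in (None, "") for r in records)}
    out = []
    for r in records:
        # keep non-empty columns; replace the raw date with derived features
        row = {c: v for c, v in r.items() if c not in empty and c != "date"}
        if isinstance(r.get("date"), date):
            row["year"] = r["date"].year
            row["month"] = r["date"].month
            row["weekday"] = r["date"].weekday()
        out.append(row)
    return out

records = [
    {"sales": 10, "note": "", "date": date(2019, 4, 1)},
    {"sales": 12, "note": "", "date": date(2019, 4, 2)},
]
enriched = feature_engineering_and_extraction(records)
# The empty `note` column is dropped; year/month/weekday are added.
```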
  • the MLS-FS (Machine Learning Solutions - Features Selection) block selects the best performing (e.g. having greatest statistical influence on the target of the analytical goal, e.g. best prediction) features from the features previously engineered and extracted in the Machine Learning Solutions - Feature Engineering & Extraction (FEE) block.
  • “Target” is the term for whatever is being predicted, in a particular challenge. Any suitable method may be employed to determine which feature/s perform best such as, for example, analysis of variance to determine which feature/s explain a larger proportion of the variance of the target variable/s than other feature/s.
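The selection criterion can be sketched as follows, using squared Pearson correlation with the target as a simple stand-in for the analysis-of-variance criterion mentioned above; feature names and data are illustrative:

```python
from math import sqrt

def r_squared(xs, ys):
    """Squared Pearson correlation: the share of the target's variance
    that a single feature explains on its own."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return (cov / sqrt(vx * vy)) ** 2

def select_features(features, target, keep=2):
    """FS sketch: rank engineered features by how much of the target's
    variance each explains, and keep only the strongest predictors."""
    ranked = sorted(features,
                    key=lambda name: r_squared(features[name], target),
                    reverse=True)
    return ranked[:keep]

features = {
    "tenure":  [1, 2, 3, 4, 5, 6],    # strong predictor of the target
    "noise":   [5, 1, 4, 1, 5, 2],    # weak predictor
    "doubled": [2, 4, 6, 8, 10, 12],  # also strong (collinear with tenure)
}
target = [2, 4, 6, 8, 10, 12]
selected = select_features(features, target, keep=2)
# The two strong predictors are kept; "noise" is dropped.
```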
  • the MLS-TLF (Machine Learning Solutions - Training/Learning/Full Evaluation) block uses the features selected by the previous block to train and evaluate the train data set to form the Prediction Set (PS). It is appreciated that in machine learning, three data sets are employed, typically termed the training, test, and validation sets; the "train data set" referred to herein is the same as the "training set". The "prediction set" may be the same as the test set or validation set, according to certain embodiments.
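A minimal stand-in for a TLF block, substituting a one-feature least-squares model for arbitrary machine learning code; the training set fits the model and a held-out set evaluates it (all data illustrative):

```python
def fit_line(xs, ys):
    """Least-squares fit y = a*x + b on the training set."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

def mean_absolute_error(model, xs, ys):
    """Evaluate a fitted model on a held-out data set."""
    a, b = model
    return sum(abs((a * x + b) - y) for x, y in zip(xs, ys)) / len(xs)

# Training set: the block learns its parameters here.
train_x, train_y = [1, 2, 3, 4], [2.0, 4.1, 5.9, 8.0]
model = fit_line(train_x, train_y)

# Held-out (validation) set: the same block evaluates the trained model,
# e.g. so competing block combinations can later be compared.
val_x, val_y = [5, 6], [10.0, 12.1]
val_error = mean_absolute_error(model, val_x, val_y)
```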
• the MLS-FP (Machine Learning Solutions - Final Prediction) block
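The four blocks above (FEE, FS, TLF, FP) form a pipeline. The following is a minimal illustrative sketch only; all function names, feature names and the trivial mean-predictor "model" are hypothetical and not taken from the application:

```python
# Illustrative sketch of the four Machine Learning Solution blocks.
# All names and the toy logic are hypothetical.

def fee(rows):
    """Feature Engineering & Extraction: drop empty columns, derive features."""
    out = []
    for r in rows:
        derived = dict(r)
        derived["x_sq"] = r["x"] ** 2      # example derived feature
        derived.pop("empty_col", None)     # example removed empty feature
        out.append(derived)
    return out

def fs(rows, target="y"):
    """Feature Selection: keep the better-performing features (toy rule)."""
    keep = [k for k in rows[0] if k not in (target, "noise")]
    return [{k: r[k] for k in keep + [target] if k in r} for r in rows]

def tlf(train_rows, target="y"):
    """Training/Learning/Full evaluation: fit a trivial mean-predictor 'model'."""
    mean_y = sum(r[target] for r in train_rows) / len(train_rows)
    return {"predict": lambda r: mean_y}

def fp(model, test_rows):
    """Final Prediction: one prediction per test record."""
    return [model["predict"](r) for r in test_rows]

train = [{"x": 1, "noise": 9, "empty_col": None, "y": 2},
         {"x": 3, "noise": 7, "empty_col": None, "y": 4}]
test = [{"x": 2, "noise": 8, "empty_col": None, "y": None}]

model = tlf(fs(fee(train)))
predictions = fp(model, fee(test))  # one prediction per test record
```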
  • matchmaking may include some or any subset of the following, suitably ordered e.g. as shown in Fig. 2:
  • Operation 410 aka Phase I - Data pre-processing
• Phase I aka operation 410 may include some or any subset of the following, suitably ordered e.g. as shown in Fig. 3:
• Operation 510 aka module 1 - data validation and preparation
• Operation 510 aka module 2 - data features additions
• Operation 520 aka module 1a - rerun operation 510 (module 1 - data validation and preparation) on the complete data sets
  • Operation 530 aka module 3 - data holistic validation
  • Operation 540 aka module 4 - data reduction
  • Operation 550 aka module 5 - data set obfuscation
• Operation 560 aka module 6 - data set splitting
• Operation 570 aka module 7a of phase I - benchmark calculation for baseline model
• Zoom-in on phase II aka operation 420 may include some or any subset of the following, suitably ordered e.g. as shown in Fig. 4:
  • Operation 670 aka Module 7b of phase ii - Populate Machine Learning Solution Pipeline
  • Operation 680 aka Module 8 - Feature engineering & extraction aka Machine Learning Solution - Feature Engineering and Extraction
  • Operation 690 aka Module 9 - Feature selection aka Machine Learning Solution - Features Selection
• Operation 700 aka Module 10 - Training/learning/full evaluation aka Machine Learning Solution - Training/Learning/Full Evaluation.
  • Operation 710 aka Module 11 - Final Prediction aka Machine Learning Solution - Final Prediction.
  • Operation 720 aka Module 12 - Benchmarking and Machine Learning Solution Selection
• FIG. 5 An overview of operations 100 and 200; 300 and 400, according to an embodiment of the invention, is provided in the pictorial diagram of Fig. 5, which includes three processes which converge at a “main” server, e.g. process 1 - pre-storing analytical and business solutions;
  • process 2 pre-storing machine learning solutions, and process 3 - automatic run: matchmaking the best analytical answer.
  • a Business Solution answers a Business Question.
  • business users“pick a solution” to a Business Question that they have, from a platform (e.g. an Amazon- or Oracle-based software as a service platform which may have, say site-to-site VPN connection with PAAS cloud services), e.g. in accordance with some or all of the architecture shown and described herein.
  • Each business user typically uploads data which is defined by the platform as the data elements required for the Business Question.
  • the platform alternatively, or in addition, interfaces with data scientists who develop models, e.g. based on anonymized or obfuscated data e.g. data provided by business users and obfuscated by the platform’s server.
  • prediction models generated by data scientists are accumulated e.g.
  • Each prediction model may subsequently be matched to and/or employed by more than one or many business users, wherein each business user typically is provided with an analytical answer based on the model which is a“best model” for her or his question and data.
  • challenges are each identified by a unique identifier e.g. serial number, and each data scientist or service provider fills out an order form for at least one challenge, with the scientist’s contact information inter alia.
  • the delivery and testing of Deliverables generated by data scientists responsive to a challenge are typically as per the work order sent out for that challenge.
  • Deliverables may include drawings, software, algorithm/s, certifications, documentation, codes, samples inter alia.
  • the system of Fig. 10 may store plural (e.g. 10 or 20 or a few dozen or a few hundred) Business Questions which represent frequently asked Business Questions that computerized organizations deal with frequently e.g. daily.
  • a Business Question typically includes at least the following components aka Business Question (minimal) form:
• Verb component {verb} e.g. predict
• event component {business event} e.g. sale
  • a typical Business Question may be represented formally in memory as: verb a business event per targeted business entity, where each component is stored for each Business Question.
• each Business Question may be deemed to have a “target” - to {verb} e.g. predict the {targeted business entity} e.g. product.
  • a set of typical Business Questions, all or any subset of which may be provided, may appear as follows where“customer” here refers to the system end-user’s customer :
• Business Question (1) “How many products are defective per (number) produced?”, Business Question (2) “What will be the most effective method to grow my network of prospective customers?” and Business Question (3) “What is the likelihood that each customer repeats purchases in (a certain period)?”, etc., may be formalized as shown in the table of Fig. 15.
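The minimal Business Question form described above ({verb} a {business event} per {targeted business entity}) could be stored as a simple record; the following is a hypothetical sketch, with all names invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class BusinessQuestion:
    """Minimal Business Question form: {verb} a {business event}
    per {targeted business entity}."""
    verb: str      # e.g. "predict"
    event: str     # e.g. "sale"
    entity: str    # e.g. "product"

    def as_text(self) -> str:
        # the full question in its minimal form
        return f"{self.verb} the {self.event} per {self.entity}"

    def target(self) -> str:
        # each Business Question's "target": to {verb} the {targeted business entity}
        return f"{self.verb} the {self.entity}"

bq = BusinessQuestion(verb="predict", event="sale", entity="product")
```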
  • Each Business Question typically includes, in addition to an actual question e.g. as above, also an associated Business Data Set (BDS).
  • BDS Business Data Set
  • a Data Set defines any required and/or optional data elements associated with the Business Question.
• Each Business Data Set consists of one or more Data Files (DF), which are tables of features - keys and additional fields. Data Files are interconnected via keys to each other (e.g. Customer ID may appear in each of the customer-related files, Date & Time may appear in all time-series related files).
  • DF Data Files
• a third part - the Business Solution Execution Parameters - is typically provided and may for example be stored as shown below in the table of Fig. 7. These allow the business user to accommodate the Business Solution to the specific business and operational needs.
  • the Business Solution -Execution Parameters may for example include inter alia, either or both of the parameters in Fig. 6.
• a Business Question may comprise text which may be in natural language and/or may include at least the following components aka Business Question (minimal) form: {verb} e.g. predict & {business event} e.g. sale & per {targeted business entity} e.g. product
  • the SINGLE/MANY column in the table of Fig. 7 indicates whether each data item typically has but a single instance per Business Solution, or whether plural instances of that data item may occur per Business Solution.
  • Fig. 7 is a table illustrating an example data structure for a Business Solution data repository; all or any subset of the data items illustrated may be provided, typically for each of a plurality of business solutions.
  • Fig. 8 shows an example Business Solution with all its data items. The table of Fig. 8 relates to a Business Solution for the“how many products are defective per number produced?” Business Question listed above by way of example.
• 2nd Layer - Pre-storing the Analytical Solution An Analytical Solution (AS) typically comprises a statistical representation of plural Business Solutions whose statistical essence is similar. The Analytical Solution is, like the Business Solution, prepared in advance in the repository of Fig. 10, and each Business Solution has a specific Analytical Solution which typically statistically or formally describes that Business Solution (BS).
  • an Analytical Solution answers an Analytical Question that is an“abstracted” version of many statistically similar Business Questions.
  • a Business Question might be“predict the SALES of a SALES PERSON per MONTH and PRODUCT TYPE”.
• the Business Question's “abstracted” matching Analytical Question might be “Predict the {TIME SERIAL ACTION} of a {BUSINESS ENTITY 1} per {BUSINESS ENTITY 2} and {BUSINESS ENTITY 3}”.
• Other business questions with the same matching Analytical Question might be to predict some other action of some other business entity 1 (perhaps not a sales person) per business entities 2 and 3 (perhaps not a month and/or not a product type).
• a Business Question typically has a target - to {VERB} e.g. predict the {TARGETED BUSINESS ENTITY}.
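The "abstraction" of a concrete Business Question into its matching Analytical Question can be sketched as a phrase-to-placeholder substitution. The mapping table below is hypothetical and only illustrates the example question above:

```python
# Hypothetical sketch: "abstracting" a concrete Business Question into the
# Analytical Question shared by statistically similar Business Questions.
ABSTRACTION = {
    "SALES": "{TIME SERIAL ACTION}",
    "SALES PERSON": "{BUSINESS ENTITY 1}",
    "MONTH": "{BUSINESS ENTITY 2}",
    "PRODUCT TYPE": "{BUSINESS ENTITY 3}",
}

def abstract_question(question: str) -> str:
    # replace longer phrases first so "SALES PERSON" is not clobbered by "SALES"
    for concrete, placeholder in sorted(ABSTRACTION.items(),
                                        key=lambda kv: -len(kv[0])):
        question = question.replace(concrete, placeholder)
    return question

bq = "predict the SALES of a SALES PERSON per MONTH and PRODUCT TYPE"
aq = abstract_question(bq)
```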
  • each Analytical Solution typically comprises an Analytical Question and an associated Analytical Data Set (ADS).
  • ADS Analytical Data Set
  • An Analytical Data Set defines any required data associated with the Business Question.
• Each Analytical Data Set consists of one or more Data Files (ADF), which are tables of features - keys and additional fields. Data Files are connected via keys (e.g. ID, Date & Time) to each other in a way where (in relational database terms) a master-detail relationship is formed.
  • ADF Data Files
• each MO (month) Header Data Business Data File has an MOHD (Month Header Data) Order ID field which relates to one or more records in the MO Details Batch Data via the same MOHD Order ID field which is found there as well.
  • the Analytical Data Set includes the targeted feature (e.g. the feature the solution may predict e.g. end-user’s customer will leave/stay) and the statistical method used to evaluate the prediction power of the solution.
  • Fig. 9 is a table illustrating an example data structure for an as data repository; all or any subset of the data items illustrated may be provided, typically for each of a plurality of Analytical Business Solutions.
• Phase I comprises preparing for a challenge (e.g. issuing a Machine Learning Solution creation challenge to a population of External Data Scientists).
  • a repository e.g. as shown in fig. 10, may pre-store Analytical Solutions and matching Business Solutions, based on which External Data Engineers (EDS) may be challenged to create Machine Learning Solutions (MLS). These competing Machine Learning Solutions all address a specific Analytical Solution (including data structure, and parameters).
  • EDS External Data Engineers
  • MLS Machine Learning Solutions
• the Machine Learning Solutions may be created and tested to answer a specific user-selected Business Solution with the user's data and that Business Solution's matched Analytical Solution. These users, aka External Data Scientists, may be entitled to pre-load Machine Learning Solutions to be used later in the matchmaking process.
  • the Business Solution and Analytical Solution design described herein creates advantageous flexibility in using the Machine Learning Solutions - since if each Business Solution is correlated to an Analytical Solution, and each Analytical Solution may be correlated to several Business Solutions, each Machine Learning Solution may address several Business Solutions during Machine Learning Solutions to Analytical Solution matchmaking.
• Internal Data Scientists select an Analytical Solution for that Business Solution, and set up the parameters for that Business Solution (as described in fig.
  • WO challenge Work Order
• the Analytical Solution typically comprises a natural language document which covers all the definitions of the Analytical Solution, e.g. as described herein, translating the specific Business Solution data and parameters as chosen by the user.
• Each Machine Learning Solution comprises a Machine Learning model, e.g. code written in a designated statistical coding language such as Python or R, which is built by External Data Engineers, tested and validated by the system of Fig. 10 in modules 10 to 12 in the next use case (including e.g. matchmaking the best analytical answer), and is uploaded to the Machine Learning Solutions repository.
  • Each Machine Learning Solution (programmed and supplied by External Data Engineers) may be required to have any subset of or all of the following four blocks :
  • PS Prediction Set
  • Final Prediction - FP - Runs the pre-trained and evaluated Machine Learning code written so far (as described in the previous blocks) on the test set (out of time predictions) to provide a Final Prediction for each data record in the test-set (creating the Final Prediction Set - FPS).
  • This phase typically comprises running the challenge including running each Machine Learning Solution provided by an External Data Engineer responsive to the challenge.
  • Each Machine Learning Solution provided by an External Data Engineer may be run using a similar process as the (below described) third use case (“matchmaking the best analytical answer").
  • the only difference between this challenge run and the matchmaking may be that in the challenge run, the Machine Learning Solution is selected only from the Machine Learning Solutions provided in the specific challenge, whereas in matchmaking, the Machine Learning Solution is selected from a Machine Learning Solution repository, e.g. in the embodiment of fig. 10 , since the Machine Learning Solutions may have been pre-developed during past challenges.
• Each Machine Learning Solution provided by the External Data Scientists may use the input obfuscated data included in the FODS (in CSV or similar open data format or database) to predict the analytical target as defined in the Analytical Question in the Analytical Solution, as well as the Business Question in the Business Solution.
• Each Machine Learning Solution typically generates additional statistically supportive features, e.g. as described in the Features Engineering & Extraction (FEE) block description, and a prediction set, then stores all the results in a specific folder as one or more data files (CSV or any other format) to be used.
  • FEE Features engineering & extraction
• the Machine Learning Solution's conformity to any requirements of the Machine Learning Solution blocks as described above - features engineering & extraction (Machine Learning Solution - FEE), features selection (Machine Learning Solution - FS), training/learning/full evaluation (Machine Learning Solution - TLF) and Final Prediction (Machine Learning Solution - FP) - is tested.
• In Module 8, Feature engineering & extraction, the blocks supplied by the External Data Scientists are run.
• To test the Machine Learning Solution's flexibility, e.g. its ability to be selected by the future matchmaking process as many times as possible (e.g. the Machine Learning Solution's estimated compatibility with different Business Solutions and User Data Sets), the tests may run the Machine Learning Solutions in randomly changing situations of data sets and parameters. These stress tests may include (among additional tests): omitting and adding features, changing the period parameters etc., in order to stress-test the blocks written by the External Data Scientists.
  • technical code QA quality assurance
• quality assurance may include all or any subset of: stress tests in changing Operating Systems, Compiler versions, technically faulty parameters etc.
• a challenge typically defines a format for an expected deliverable, and each data scientist participating in the challenge then uploads her or his deliverable in the expected format. Typically, a work order, provided to each external data scientist, defines formats and may specify these, e.g. in natural language.
• Each analytic solution (such as, say, each or any of the following 5 example analytic solutions: 1. binary classification of time series variable, 2. predict time series based continuous variable per object, 3. multivariate classification of time series variable, 4. multivariate clustering of time series variable and 5. predict regression of variable over time) is typically defined by its own work order.
  • the system and software products generated there within then operate in accordance with the definitions in the work order.
  • a work order typically includes one or both of the following:
  • a human“rule coder” aka Internal Data Engineer, (or a machine) may convert the business language into formal or statistical language.
  • An example is translating“predict at which price a customer will buy a specific product” into “predict timeseries continuous variable for a specific X”.
  • Data generalization describing any required data sets in a logical structure of data files and the key features that connect them into an integrated data set scheme.
  • a human coder or internal data scientist may be operative for translating“customer profile” and “Customer Payments” data files into“Entity main file” and“entity timeseries actions”, as well as translating each of the features (e.g. a customer’s city) into generic features (ex. X31) while capturing their statistical/mathematical natures (e.g. an ordinal category).
  • a first example work order for a“predict time series based continuous variable per item” challenge aka SD 102-66 may include all or any portion of the following italicized natural language text (example:“about two thirds are provided as a training set, and the remaining third are used as a test set“ may be amended to specify proportions other than 1/3 and 2/3) or all or any subset of the following bullets, or all or any subset of the following sentences or paragraphs:
• the Data comprise billions of rows across various tables, which together describe a huge number of actions. Of these, about two-thirds are provided as a training set, and the remaining third is used as a test set (with their labels held out); the train/test split is by time. You are encouraged to submit your model and receive a score for a small part of the test set on a daily basis.
  • Models will be evaluated and ranked according to the median of root mean square error (RMSE) metric scores for each item, applied to the test set to select a winner.
  • RMSE root mean square error
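The evaluation rule above (rank by the median of per-item RMSE scores on the test set) can be sketched as follows; the data and function names are hypothetical, for illustration only:

```python
import math
from statistics import median

def rmse(pairs):
    """Root mean square error over (actual, predicted) pairs."""
    return math.sqrt(sum((a - p) ** 2 for a, p in pairs) / len(pairs))

def median_item_rmse(per_item):
    """Median of per-item RMSE scores, as used to rank models in this work order."""
    return median(rmse(pairs) for pairs in per_item.values())

# toy data: per-item lists of (actual, predicted)
scores = {
    "item_1": [(10.0, 11.0), (10.0, 9.0)],   # RMSE = 1.0
    "item_2": [(5.0, 8.0), (5.0, 2.0)],      # RMSE = 3.0
    "item_3": [(7.0, 9.0), (7.0, 5.0)],      # RMSE = 2.0
}
result = median_item_rmse(scores)  # 2.0
```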
• The Data are given in the Raw Data directory, via four types of time series (ts) tables, which provide information about an item that is defined by an item id. Each table is provided as a CSV file. There are two types of features: XA_i and XB_j, which have a different meaning.
• the 1st ts type is separated into many ts1_n1,..,ts1_n2 tables, where each table represents data for a specific item id, including also the target feature.
• This feature contains a numeric continuous value for an item at a specific time.
• the 2nd ts type is separated into many ts2_m1,..,ts2_m2 tables, where each table represents data for a specific item id, including also the activity id feature.
• the 3rd ts type is separated into many ts3_k1,..,ts3_k2 tables, where each table represents data for a specific activity id.
• the 4th ts type is a single table, ts4, and is not unique to the combination of item id and time.
• Threshold Benchmark Surpassing a minimal Evaluation Metric is essential in order to become a Winner in the Final Win stage (“Threshold Benchmark”).
• the Threshold Benchmark is equal to 0.262104.
  • AWS Amazon Web Services
• Hardware: appropriate CPU, RAM, disk and network resources required for developing a model on the given Data.
  • Your Deliverable will be developed and delivered using only free open source software packages of Python and R.
  • the Deliverables include only script files and not new data files that you created. Your Deliverable must be reasonably intelligible to any expert data scientist. It will adhere to standard industry good practice coding style (including but not limited to: variable naming, organization, comments and general readability).
• Daily Assessment optional
  • Preliminary acceptance Preliminary feedback of your score status
  • final win/no win notice
  • Preliminary Ranking In the event that your Deliverable is an Accepted Deliverable, We will inform you of your Deliverable performance against the test set, in the form of your ranking in decreasing order of Evaluation Metric among additional Accepted Deliverables provided to us by other service providers, if any (“Preliminary Ranking”) and your Evaluation Metric score.
• the Winner will be derived from the Ranking, as the service provider ranked first, e.g., the one whose Accepted Deliverable achieves the highest Evaluation Metric score among all Accepted Deliverables, as provided by other service providers, if any. Also, a Winner's Evaluation Metric score should be higher than the Threshold Benchmark. Evaluation will be conducted against completely new data, which is also different from the data used during the Daily Assessment.
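The winner-derivation rule above (top of the ranking, and above the Threshold Benchmark) can be sketched as follows; provider names and scores are invented, and "higher is better" is assumed as the work order states:

```python
def pick_winner(scores, threshold):
    """Rank Accepted Deliverables by Evaluation Metric (higher is better,
    per the work order) and return the top provider, provided its score
    exceeds the Threshold Benchmark; otherwise there is no winner."""
    if not scores:
        return None
    provider, best = max(scores.items(), key=lambda kv: kv[1])
    return provider if best > threshold else None

# hypothetical ranking of Accepted Deliverables
ranking = {"provider_a": 0.31, "provider_b": 0.27, "provider_c": 0.19}
winner = pick_winner(ranking, threshold=0.262104)
```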
  • Preliminary Delivery Within 10 days of the Kickoff date, you will place a preliminary version of your Deliverable in the Preliminary directory in your Deliverables directory.
  • Preliminary Feedback Within 14 days of the Kickoff date, We will review your Preliminary Deliverables, if any, and provide you with the Preliminary Ranking.
• model.predict <- function(model, test.data) {
• train.data$groups <- fread(paste0(data.dir, '/train-groups.csv'))
• train.data$ts2 <- fread(paste0(data.dir, '/train-ts2.csv'))
• train.data$ts3 <- fread(paste0(data.dir, '/train-ts3.csv'))
• train.data$ts4 <- fread(paste0(data.dir, '/train-ts4.csv'))
• train.data$things <- fread(paste0(data.dir, '/things.csv'))
• test.data$items <- fread(paste0(data.dir, '/test-items.csv'))
• test.data$groups <- fread(paste0(data.dir, '/test-groups.csv'))
• test.data$ts1 <- fread(paste0(data.dir, '/test-ts1.csv'))
• test.data$ts2 <- fread(paste0(data.dir, '/test-ts2.csv'))
• test.data$ts3 <- fread(paste0(data.dir, '/test-ts3.csv'))
• test.data$ts4 <- fread(paste0(data.dir, '/test-ts4.csv'))
• model <- model.train(train.data)
• predictions <- model.predict(model, test.data)
• a second example work order for a binary classification challenge may include all or any portion of the following italicized natural language text or all or any subset of the following sentences or paragraphs; the text is merely exemplary and is not intended to be limiting, so for example, for this and the previous work order any sentence including the word “will” may be modified or omitted:
  • this binary classification aka“binary classification of time series variable”
• the service provider (“You”) will work on a unique dataset that combines relational and multivariate time series features (“Data”).
  • the structure of the Data creates plenty of room for modeling creativity and skill, while the relatively straightforward modeling question will allow you to begin working immediately.
• the Data comprise about 12,000,000 rows across various tables, which together describe approximately 50,000 items. Of these, about two-thirds are provided as a training set, and the remaining third is used as a test set (with their labels held out); the train/test split is random.
  • Models will be evaluated and ranked according to the area under curve (AUC) metric, applied to the test set to select a winner.
  • AUC area under curve
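The AUC metric used to rank models in this work order can be sketched via its pairwise-ranking definition; the toy labels and scores below are invented for illustration:

```python
def auc(labels, scores):
    """Area under the ROC curve via its pairwise-ranking definition:
    the fraction of (positive, negative) pairs in which the positive
    example is scored higher (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
score = auc(labels, scores)  # 0.75
```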
  • the Data are given via the main items table, a grouping table, 4 time series (ts) tables, and a things table. Each table is given as a CSV file. Values of categorical variables are coded as “cXXX” where XXX is a category number. Note that the assignment of category numbers was random.
  • Each row in the items table describes one item, which is uniquely defined by an item id.
  • the target column provided with the training data is the binary class of each item. In the test set this column is missing, and your task is to predict those values.
  • the grouping table contains zero or more rows per item. Each row associates an item id with a group id. This induces a set of overlapping groups of items.
  • the time series tables contain zero or more entries for each item id, together with a time column that indicates the location of the entry on a certain time axis for that type of series (not necessarily physical time, and not necessarily unique per row per item).
  • ts3 is special in that it is actually a combined table containing multiple similar time series per item.
  • Each time series is identified by a series.id, and each (item id, series id) has its own time counter, where missing (item id, series id) entries mean that the value for that entry is zero.
  • the columns series. fine, and series.coarse are (hierarchical) groupings of related series ids.
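The ts3 convention described above (one combined table holding many series per item, with a missing (item id, series id) entry meaning the value is zero) can be sketched with a default-zero lookup; the rows below are hypothetical:

```python
from collections import defaultdict

# Hypothetical sketch of the ts3 convention: a combined table of many
# series per item; a missing (item.id, series.id, time) entry means zero.
ts3_rows = [
    # (item_id, series_id, time, value)
    (1, "s1", 0, 4.0),
    (1, "s1", 2, 7.0),
    (1, "s2", 1, 3.0),
]

ts3 = defaultdict(float)  # absent keys read back as 0.0
for item_id, series_id, t, value in ts3_rows:
    ts3[(item_id, series_id, t)] = value

v_present = ts3[(1, "s1", 2)]   # 7.0
v_missing = ts3[(1, "s1", 1)]   # 0.0 -- missing entry means zero
```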
• the supporting things table describes a secondary object whose ids (thing.id column) are referred to in both the items and ts4 tables. Some rows are all missing values; these are only included for completeness.
  • Evaluation Metric Will be calculated according to the test set area under curve (AUC) metric (“Evaluation Metric”).
  • the Data must be read and manipulated only within the Development Environment. You may send your already existing personal code to your development environment provided by us, via a designated process We will supply. Deliverables
  • Your Deliverable will be a single R script that may be run as-is within a fresh R session on the Development Environment as originally supplied (with the exception that any CRAN dependencies are installed as necessary, and with the working directory set to the same directory as the delivered script).
• model.train function(train.data): takes a list containing the train set tables as data.table objects resulting from a “plain vanilla” call to fread() on the CSV files provided, and returns a trained model object (in any reasonable way you'd like).
• model.predict function(model, test.data): with arguments similar to the above, that produces a numeric vector with predictions in [0, 1] in which each entry corresponds to its respective row in test.data$items.
  • Preliminary Ranking In the event that your Deliverable is an Accepted Deliverable, We will inform you of your test set performance, in the form of your ranking in decreasing order of Evaluation Metric among additional Accepted Deliverables provided to us by other service providers, if any (“Preliminary Ranking”).
• the Winner will be derived from the Ranking, as the service provider ranked first, e.g., the one whose Accepted Deliverable achieves the highest Evaluation Metric among all Accepted Deliverables, as provided by other service providers, if any.
• str_name = file_tmp[:-4].split('_')
• df_test = pd.read_csv('Test/TS_1/ts_1_%d.csv' % item_id)
• test_prop = 0.2
• Data scientists may for example communicate with us using the messaging functions on our online bidding platform.
  • login and usage instructions for the data scientists’ server may include some or all of the following:
  • Python IDEs PyCharm, atom
  • R IDEs The built-in GUI, R Studio, R
• Utilities Libre Office, Notepad++ for other text file editing needs.
  • the DS Workspaces are isolated and secured e.g. Files and/or text from/to the server cannot be copied and suitable administrative functions are disabled.
• the 3rd use case may be activated by the product business user (e.g. as shown in fig. a) by selecting a Business Solution from a Business Solution repository in memory, e.g. in the embodiment of Fig. 10, via the web interface. Then, the product business user may select the right Business Solution EP for the task and load his data based on the Business Data Set's instructions (with all the mandatory attributes and any optional features).
  • Phase I - Data pre-processing e.g. operation 410 in Fig. 2 and Module 1 - Data Validation and Preparation are now described in detail.
  • the Data validation and preparation module processes all the Data Sets (DSs), typically including user data (Business Data Sets) and/or the additional Data Sets (ADS) based on the definitions found in the Business Data Sets.
  • DSs Data Sets
  • ADS additional Data Sets
  • Every aspect of the file’s structure is defined, e.g. which fields, type etc. in order to:
  • the Data validation and preparation module uses validation rules to check for correctness, meaningfulness, and consistency of the data that are input to the system.
  • the validation rules may include all or any subset of:
  • Operation 510 aka Module 2 - Data Features Additions
  • the features data additions module may elaborate the User Data Set (UDS) and Analytical Data Set (ADS) with additional features (presented as additional features to the User Data Set or Analytical Data Set or additional files with the additional features) that are statistical manipulations of their already-existing data.
  • UDS User Data Set
  • ADS Analytical Data Set
• the goal is to make it easier for any data science algorithms processing these Data Sets later to reach a strong statistical result.
  • the module may generate new features and/or new files and features (ADDS) by executing Additions Rules that may be predefined for each of the Data Sets features in the Business Solution and Analytical Solution.
  • the features data additions module might create a new feature based on the monthly average sales of each product.
  • ADDS additional features in User Data Set or Analytical Data Set and/or additional files
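The "monthly average sales of each product" example above can be sketched as a simple Additions Rule; the rows, column names and derived feature name below are hypothetical:

```python
from collections import defaultdict

# Hypothetical sketch of an Additions Rule: derive a "monthly average sales"
# feature per product from already-existing sales rows.
sales = [
    {"product": "A", "month": "2019-01", "sales": 10},
    {"product": "A", "month": "2019-02", "sales": 30},
    {"product": "B", "month": "2019-01", "sales": 5},
]

totals = defaultdict(list)
for row in sales:
    totals[row["product"]].append(row["sales"])

monthly_avg = {product: sum(v) / len(v) for product, v in totals.items()}

# attach the derived feature to each row (the ADDS of this module)
for row in sales:
    row["avg_monthly_sales"] = monthly_avg[row["product"]]
```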
• the purpose of the Data holistic validation module is to check all the Data Sets as a whole. For example, the Data holistic validation module might check whether there is information for each of the dates found in the user's User Data Set.
• the holistic validation may run the same validation rules as Module 1 (Data Validation and Preparation), excluding the ones that are marked as “not required for module 3”. As a result, if no errors are produced, a Final Data Set (FDS) is generated.
  • FDS Final Data Set
  • the Data Reduction module may attempt to reduce the number of files in the Final Data Set for the ease of the External Data Engineer that may develop the data science solution.
  • the Data Reduction module may process Reduction rules that may make sure each file merged will not result in loss of statistical meaning. For example, if two of the files include data at the customer and day level - they may be merged safely (statistically speaking).
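The Reduction rule example above (two files at the same customer-and-day granularity can be merged safely) can be sketched as follows; the table contents and function name are hypothetical:

```python
# Hypothetical sketch of a Reduction rule: two files both keyed at the
# (customer, day) level can be merged without loss of statistical meaning.
purchases = {("c1", "2019-01-01"): {"amount": 50},
             ("c2", "2019-01-01"): {"amount": 20}}
visits = {("c1", "2019-01-01"): {"visits": 3},
          ("c2", "2019-01-01"): {"visits": 1}}

def merge_same_granularity(a, b):
    # the guard enforces the rule: merging is only safe at equal granularity
    assert set(a) == set(b), "keys must match: same granularity required"
    return {k: {**a[k], **b[k]} for k in a}

merged = merge_same_granularity(purchases, visits)
```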
  • Operation 550 aka Module 5 - Data set obfuscation is now described in detail.
  • a default anonymization level (e.g.“medium”, if not otherwise defined by the system’s end user), may be defined.
• the obfuscation processes the Final Data Set (FDS) and converts the FDS into a Final Obfuscated Data Set (FODS), operating at 2 x 2 levels: the file and the feature level, and the name and content level:
  • the feature is a key (aka unique identifier) (example customer ID or product SKU)
• the feature is replaced in a synchronized way, e.g. as described above, in all the files, to make sure the key relationships between the files are maintained (e.g. customer ID)
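The synchronized key replacement described above can be sketched as a shared mapping applied across all files, so that the same customer ID always receives the same opaque token; file names, key values and the token scheme below are hypothetical:

```python
import itertools

# Hypothetical sketch of synchronized key obfuscation: the same key value
# maps to the same opaque token in every file, preserving key relationships.
def make_obfuscator(prefix="K"):
    counter = itertools.count(1)
    mapping = {}
    def obfuscate(key):
        if key not in mapping:
            mapping[key] = f"{prefix}{next(counter)}"
        return mapping[key]
    return obfuscate

obf = make_obfuscator()
orders = [{"customer_id": obf(c)} for c in ["cust-42", "cust-7"]]
payments = [{"customer_id": obf(c)} for c in ["cust-7", "cust-42"]]
```

Because both files go through the same obfuscator, joins on customer_id still work after obfuscation, while the original identifiers never appear in the FODS.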
  • Operation 560 aka Module 6 - Data set splitting is now described in detail.
• a process configured for organizing and splitting the data into subsets (by time period, prediction rounds (e.g. train vs. test), and access authorization (e.g. EDS-available vs. EDS-unavailable)), according to predefined rules, with a view to allowing for further data science modeling and prediction.
• the data shall be split based on all or any subset of the following three parameters: prediction period, train vs. test set, and availability vs. unavailability to External Data Scientists:
  • First, the data shall be split by prediction period, e.g. into 5 prediction periods (P1, P2, P3, P4, P5).
  • the length of the prediction period is uniform over time. Prediction period length is typically a parameter taken as-is from the “statistical solution”. Separating the data per time period is useful in avoiding leakage, thereby protecting the model from biases due to statistical issues (e.g. data leakage, https://en.wikipedia.org/wiki/Information_leakage); in addition, it allows training and testing the model on different data subsets per time period, and therefore increases the data’s robustness for re-run and eventual reuse.
  • the prediction period may or may not be the same as the "time period”.
  • Time periods may be in days, weeks, months, quarters, years.
  • the data is then typically randomly split into train and test sets, per time and expected prediction period.
  • Each time period P1, P2, P3 shall be randomly divided between test and train sets, in a ¼ and ¾ proportion.
  • P4 is always a test set because targets for this set are not available yet. This is a real-time prediction period.
  • P1 and P2, with their train and test sets, may be made fully available (e.g. with target values) to the External Data Engineer.
  • P3 may be split into train and test, but the values of the target may be withheld from the External Data Engineer.
  • the train set and test set may both be delivered to the External Data Engineer with no target
  • P4 is only a test set, so P4 typically may not be split between train and test, because the true values of the target are not available at the time of prediction. This set is not made available to External Data Scientists at all. This method of splitting is advantageous because, before delivering any out-of-time prediction to the system end-user, the objective is to train and test twice on the original data (during round 1).
  • the process performed by module 6 aka operation 560 may be viewed as a model building pipeline, based on two prediction rounds.
  • Round 1 ranking: At the end of round 1, the n best solutions may “win” a relatively small ranking, compared to rankings “won” by the best solutions in round 2.
  • Round 2 (second training and prediction cycle, based on rolled historical data to predict on next period where target is unknown):
  • the input and output for this phase may be defined by two processes.
  • Process 1: Split based on time period and prediction round: input data includes several anonymized data tables that refer to entities during periods P1, P2. The output data is organized in 2 folders: one folder per round, R1, R2, according to the description above.
  • the output data is organized in subfolders, R1 and R2, as follows:
  • Each data record from the train data (P1 & P2) is associated with a target, while entities from the test data (P3) are not, and the External Data Engineer is expected to estimate the values of the test set.
  • the values of the test set may be stored in a 4th file called (P3, True Targets).
  • P3, True Targets: each of these is of the same format as the original Data Sets (e.g. the splitting does not change the structure; it only splits down the records, vertically and not horizontally, selecting part of the records each time and leaving all the fields)
  • Each data record may be part of either the train data or the test data.
  • the split is random, in a ¼ and ¾ proportion.
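The splitting rules described above might be sketched as follows (a hedged illustration; the ¾ train fraction and field names are assumptions): P1–P3 are randomly divided between train and test, while P4, whose targets are unknown, is kept whole for real-time prediction and withheld from External Data Scientists.

```python
import random

# Sketch of the split-by-period rules: P1-P3 are randomly divided between
# train and test; P4 (the real-time prediction period, targets unknown) is
# never split and goes into its own "predict" bucket.
def split_periods(records, train_fraction=0.75, seed=0):
    rng = random.Random(seed)  # seeded for reproducibility of the sketch
    splits = {}
    for rec in records:
        period = rec["period"]
        if period == "P4":
            splits.setdefault("P4", {"predict": []})["predict"].append(rec)
            continue
        bucket = "train" if rng.random() < train_fraction else "test"
        splits.setdefault(period, {"train": [], "test": []})[bucket].append(rec)
    return splits

data = [{"period": p, "entity": i}
        for p in ("P1", "P2", "P3", "P4") for i in range(20)]
splits = split_periods(data)
```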
  • Operation 570 aka Module 7a - Benchmark calculation for baseline model is now described in detail.
  • Module 7a typically comprises computing the benchmark for this baseline model (a single figure or percentage) when used on the test set.
  • This benchmark may be later compared with benchmarks computed for new Machine Learning models produced by the automatic platform.
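Such a comparison might look like the following sketch (accuracy is assumed as the benchmark metric purely for illustration; the patent leaves the metric open):

```python
# Minimal sketch of benchmark comparison: the baseline model's single-figure
# benchmark on the test set is compared with a candidate model's benchmark.
def accuracy(predictions, targets):
    hits = sum(p == t for p, t in zip(predictions, targets))
    return hits / len(targets)

def beats_baseline(candidate_score, baseline_score):
    return candidate_score > baseline_score

targets = [1, 0, 1, 1, 0]
baseline_preds = [1, 1, 1, 1, 1]    # e.g. "always predict the majority class"
candidate_preds = [1, 0, 1, 0, 0]   # a candidate Machine Learning model
baseline = accuracy(baseline_preds, targets)    # 3/5 = 0.6
candidate = accuracy(candidate_preds, targets)  # 4/5 = 0.8
```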
  • in phase 1 of the matchmaking run, the data typically has already undergone pre-processing, typically including all or any subset of validation (e.g. operations 510 and/or 530), enrichment (additions e.g. as per operation 520), obfuscation (e.g. as per operation 550), reduction (e.g. as per operation 540), splitting (e.g. as per operation 560) and benchmark and baseline model calculation (e.g. as per operation 570) - ready for the automated platform to choose the best Machine Learning Solution.
  • the pre-processed data, e.g. as generated by operation 410, presents information about entities, already split between train and test sets and between the in-focus time periods. As already described, the data consists of several Data Sets (each may include several Data Files) that may contribute to solving the business problem, and whose structure is typically predefined in the data structures in the Business Solution and Analytical Solution - all combined into the FODS Pn.
  • for each Business Solution, Internal Data Engineers (human experts, as opposed to External human experts who accept challenges as described herein) select an Analytical Solution for that Business Solution, and set up its relevant parameters, to properly address the Business Question in the Business Solution.
  • Each Analytical Solution has been previously associated, e.g. matched as relevant to, several Machine Learning Solutions (or at least one) in the course of matchmaking operation 400 of Fig. 1.
  • the pipeline may be populated with as many potential Machine Learning Solutions as possible that have the potential to properly address a given Business Solution and the Business Solution’s associated Analytical Solution.
  • the parts or building-blocks of a Machine Learning Solution typically include all or any subset of: Machine Learning Solution - Feature Engineering and Extraction, Machine Learning Solution - Features Selection, Machine Learning Solution - Training/Learning/Full Evaluation, and Machine Learning Solution - Final Prediction.
  • Solution blocks in the repository may then be suitably mixed and matched to generate plural Machine Learning Solution pipelines to address plural respective Business Solutions and each Business Solution’s respective Analytical Solution.
  • the pipeline includes all combinations of the first three parts. In each cycle of the matchmaking, each combination is evaluated, and a “winning” combination is selected. Then, only with the “winning” combination, the full evaluation is executed. Following run strategies, a prediction for each entity in the test set, after the model has learnt from the data in the train set, is produced. This “populate pipeline” may generate several Machine Learning Solutions-Block Combinations (MLS-BC) that typically automatically compete with one another.
  • MLS-BC: Machine Learning Solutions-Block Combinations
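The mixing-and-matching of blocks into competing MLS-BCs can be sketched as a Cartesian product over the block repositories (the block names below are illustrative, not from the patent):

```python
from itertools import product

# Sketch of "populate pipeline": all combinations of the first three block
# types (one block per type, each possibly from a different data scientist)
# are enumerated as competing Machine Learning Solution Block Combinations.
feature_blocks = ["FE_scientist1", "FE_scientist2"]    # block 1 candidates
selection_blocks = ["FS_scientist1", "FS_scientist3"]  # block 2 candidates
training_blocks = ["TR_scientist2"]                    # block 3 candidates

mls_bc = list(product(feature_blocks, selection_blocks, training_blocks))
# 2 x 2 x 1 = 4 competing combinations, mixing blocks across scientists
```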
  • Each successful completion (e.g. a run completed without errors) of a Machine Learning Solution pipeline may typically run a version of the code provided for the Machine Learning Solution by the scientist who authored it (the code subsequently being stored in the Machine Learning Solution repository), and the data generated by the code of that part (e.g. a given block from among the 4 blocks generated by that scientist) may be stored for the use of the next Machine Learning Solution part. If a script fails at any stage (say after the 1st, 2nd, 3rd or 4th block of code authored by this scientist has been run), a failure notification shall be logged. Typically, the user is notified about successful completion after the “Benchmarking and Machine Learning Solution Selection” (module 12 - Operation 720).
  • Table 11 is an example data structure for storing Machine Learning Solutions in a repository.
  • Each instance of a Machine Learning Solution stored in the Machine Learning Solution repository includes: 4 bodies of code, for the Machine Learning Solution parts respectively, plus a statistical benchmark result aka benchmark.
  • the benchmark is calculated using the baseline model as per Operation 570 aka Module 7a of phase 1 - Benchmark calculation for baseline model
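A repository entry of the shape described above might be sketched as follows (field names are illustrative assumptions, not taken from the patent): four bodies of code, one per Machine Learning Solution part, plus a statistical benchmark.

```python
from dataclasses import dataclass

# Assumed shape of a Machine Learning Solution repository entry, mirroring
# the description above. Field names are illustrative, not from the patent.
@dataclass
class MLSolutionEntry:
    feature_engineering: str   # block 1 source code (path or body)
    feature_selection: str     # block 2 source code
    training_evaluation: str   # block 3 source code
    final_prediction: str      # block 4 source code
    benchmark: float           # statistical benchmark vs. the baseline model

entry = MLSolutionEntry("fe.py", "fs.py", "train.py", "predict.py", 0.82)
```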
  • Operation A1: The Machine Learning Solution pipeline runs operation 680, aka the Features Engineering and Extraction Module (Module 8), for each of the models (aka Machine Learning Solutions) MLS1, MLS2, ... MLSn, and stores one output per Machine Learning Solution model: a data set with one row per entity from the train set, with that set’s newly extracted original features.
  • Operation A2: the Machine Learning Solution pipeline outputs one joint table that gathers all features extracted by all the selected Machine Learning Solution models, per entity, for each entity in the initial data set.
  • Operation A3: the Machine Learning Solution pipeline runs operation 690, aka the Features Selection Module, running each building block that populates the Block 2 (say, or more generally, the Block M) folder, on the joint table created in Operation A2.
  • the Machine Learning Solution pipeline then stores the output separately in a dedicated folder for each model in (say) block 2: B21, B22, ... B2n, where n is the number of selected Machine Learning Solutions. Operation A4: the Machine Learning Solution pipeline runs the next module for each model and stores results per model in a dedicated folder. There is no additional stacking in this strategy, because the stacking process happened during Operation A1, which runs the Feature Engineering and Extraction operation 680.
  • Fig. 12 is an example of the operation of Strategy A.
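Strategy A's stacking might be sketched as follows (a hedged illustration; the extractor and selector functions below are invented stand-ins for the scientists' blocks): per-model feature extraction is stacked into one joint table, and each feature-selection block then runs on that same joint table.

```python
# Sketch of Strategy A: run each model's feature-extraction block (A1),
# join all extracted features per entity into one table (A2), then run each
# feature-selection block on that joint table (A3).
def strategy_a(entities, extractors, selectors):
    # Operations A1/A2: per-model extraction, stacked into one joint table
    joint = {e: {} for e in entities}
    for name, extract in extractors.items():
        for e in entities:
            joint[e][name] = extract(e)
    # Operation A3: each selector runs on the same joint table
    return {sel_name: select(joint) for sel_name, select in selectors.items()}

extractors = {"mls1_len": lambda e: len(e),        # stand-in for MLS1 block 1
              "mls2_upper": lambda e: e.isupper()}  # stand-in for MLS2 block 1
selectors = {"mls1_only": lambda joint: {
    e: {k: v for k, v in feats.items() if k.startswith("mls1")}
    for e, feats in joint.items()}}
out = strategy_a(["ab", "CD"], extractors, selectors)
```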
  • Strategy B:
  • Operation B1: The Machine Learning Solution pipeline runs each model separately on the input data, per module, e.g. for all the modules of Figs. 3 and 4 herein. Typically, each building block is run on each input data set and outputs a data set for the next building block of the same model.
  • Output: Selected features from FODS P1, P2, P3, P4 and P5 and from the added features
  • Input: selected features from FODS P1, P2, P3, P4 and P5 and from the added features
  • Input: PS of the selected MLS-BC.
  • This selection is done by comparing the results obtained by the competing models on the test set. Output: PS of the selected Machine Learning Solution-BC.
  • Input: Prediction Set of the selected Machine Learning Solution-BC.
  • the product may be able, based on a successful selection of a “winning” (e.g. best) Machine Learning Solution, to allow the end user to query the resulting data following several strategies reflecting the optimal use of the analytical answer in the user’s operational or business process.
  • the organization may receive a guide e.g. in natural language of“how should I use the result in my operations”.
  • the strategies may be customized to the organization’s needs based on its selected Strategy Parameters (SP) e.g. as shown in the table of Fig. 13.
  • SP: Strategy Parameters
  • a simulation report may be generated.
  • operations 510 - 720 all run in parallel e.g. are cascaded to run on different requests for predictions.
  • Fig. 10 is merely exemplary; any or all of the modules therein need not be provided.
  • system end-users upload data files over secured web service to a private storage area (e.g. HOT data).
  • Data during the transfer process and in the storage area are typically both encrypted.
  • Each customer’s data may be kept in a separate, secured location.
  • Backend processes or services, e.g. data verification/preparation, are provided as internal system processes, and only these processes are entitled to access customers’ HOT data for verification/preparation tasks.
  • Metadata is kept in a proprietary database structure and is available to the system’s internal services only (backend services).
  • code solutions provided e.g. by human data scientists, e.g. in response to a challenge, need not necessarily comprise the specific 4 blocks shown and described herein. Instead, fewer or more than 4 blocks may be provided by each data scientist, and the operative content (the functionalities performed by each block) may be different, as long as a set or sequence of blocks is predefined, and the functionalities to be performed by each block are predefined, typically in sequence, such that each block operates on input generated by at least one previous block in the sequence.
  • Each module or component or processor may be centralized in a single physical location or physical device or distributed over several physical locations or physical devices.
  • electromagnetic signals in accordance with the description herein.
  • These may carry computer-readable instructions for performing any or all of the operations of any of the methods shown and described herein, in any suitable order including simultaneous performance of suitable groups of operations as appropriate; machine-readable instructions for performing any or all of the operations of any of the methods shown and described herein, in any suitable order; program storage devices readable by machine, tangibly embodying a program of instructions executable by the machine to perform any or all of the operations of any of the methods shown and described herein, in any suitable order i.e.
  • a computer program product comprising a computer useable medium having computer readable program code, such as executable code, having embodied therein, and/or including computer readable program code for performing, any or all of the operations of any of the methods shown and described herein, in any suitable order; any technical effects brought about by any or all of the operations of any of the methods shown and described herein, when performed in any suitable order; any suitable apparatus or device or combination of such, programmed to perform, alone or in combination, any or all of the operations of any of the methods shown and described herein, in any suitable order; electronic devices each including at least one processor and/or cooperating input device and/or output device and operative to perform e.g.
  • Any computer-readable or machine-readable media described herein is intended to include non-transitory computer- or machine-readable media.
  • Any computations or other forms of analysis described herein may be performed by a suitable computerized method. Any operation or functionality described herein may be wholly or partially computer-implemented e.g. by one or more processors.
  • the invention shown and described herein may include (a) using a computerized method to identify a solution to any of the problems or for any of the objectives described herein, the solution optionally including at least one of a decision, an action, a product, a service or any other information described herein that impacts, in a positive manner, a problem or objective described herein; and (b) outputting the solution.
  • the system may if desired be implemented as a web-based system employing software, computers, routers and telecommunications equipment as appropriate.
  • a server may store certain applications, for download to clients, which are executed at the client side, the server side serving only as a storehouse.
  • Some or all functionalities, e.g. software functionalities shown and described herein, may be deployed in a cloud environment.
  • Clients e.g. mobile communication devices such as smartphones may be operatively associated with but external to the cloud.
  • any“if -then” logic described herein is intended to include embodiments in which a processor is programmed to repeatedly determine whether condition x, which is sometimes true and sometimes false, is currently true or false and to perform y each time x is determined to be true, thereby to yield a processor which performs y at least once, typically on an“if and only if’ basis e.g. triggered only by determinations that x is true and never by determinations that x is false.
  • a system embodiment is intended to include a corresponding process embodiment and vice versa.
  • each system embodiment is intended to include a server-centered “view” or client-centered “view”, or “view” from any other node of the system, of the entire functionality of the system, computer-readable medium, apparatus, including only those functionalities performed at that server or client or node.
  • Features may also be combined with features known in the art and particularly although not limited to those described in the Background section or in publications mentioned therein.
  • features of the invention, including operations which are described for brevity in the context of a single embodiment or in a certain order, may be provided separately or in any suitable sub-combination, including with features known in the art (particularly although not limited to those described in the Background section or in publications mentioned therein) or in a different order. “e.g.” is used herein in the sense of a specific example which is not intended to be limiting.
  • Each method may comprise some or all of the operations illustrated or described, suitably ordered e.g. as illustrated or described herein.
  • Devices, apparatus or systems shown coupled in any of the drawings may in fact be integrated into a single platform in certain embodiments or may be coupled via any appropriate wired or wireless coupling such as but not limited to optical fiber, Ethernet, Wireless LAN, HomePNA, power line communication, cell phone, Smart Phone (e.g. iPhone), Tablet, Laptop, PDA, Blackberry GPRS, Satellite including GPS, or other mobile delivery.
  • functionalities described or illustrated as systems and sub-units thereof can also be provided as methods and operations therewithin
  • functionalities described or illustrated as methods and operations therewithin can also be provided as systems and sub-units thereof.
  • the scale used to illustrate various elements in the drawings is merely exemplary and/or appropriate for clarity of presentation and is not intended to be limiting.

Abstract

A data processing system comprising a system-scientist data interface controlled by a first processor to accept and store in a digital repository, multiple-part Machine Learning Solutions from scientists each including multiple blocks; a second processor compiling plural Machine Learning Solution pipelines to address plural respective Business Solutions and each Business Solution's respective Analytical Solution, wherein the compiling comprises mixing-and-matching Machine Learning Solution blocks stored in the Machine Learning Solution repository thereby to generate Machine Learning Solution Block Combinations, and a system-business user interface controlled by a third processor to generate a Machine Learning Solution pipeline output, for at least one business user presenting an individual Business Solution including business data (aka Business Data Set) using the Machine Learning Solutions pipeline compiled by the processor for the individual Business Solution.

Description

MULTIPLE-PART MACHINE LEARNING SOLUTIONS GENERATED BY DATA SCIENTISTS
FIELD OF THIS DISCLOSURE
The present invention relates generally to artificial intelligence and more particularly to machine learning.
BACKGROUND FOR THIS DISCLOSURE
R, aka “GNU S” or “the R Project for Statistical Computing”, is an example of an available language and environment which enables data scientists to perform statistical computing and graphics, including various statistical and graphical techniques such as but not limited to linear and nonlinear modelling, statistical tests, time series analysis, classification, and clustering. CRAN is a network of FTP and web servers which stores code and documentation for R.
The disclosures of all publications and patent documents mentioned in the specification, and of the publications and patent documents cited therein directly or indirectly, are hereby incorporated by reference. Materiality of such publications and patent documents to patentability is not conceded.
SUMMARY OF CERTAIN EMBODIMENTS
Certain embodiments of the present invention seek to provide data processing system, method and computer-program product, based on multiple-part machine learning solutions generated by data scientists.
Certain embodiments of the present invention seek to provide a data processing system comprising: a system-scientist data interface controlled by a first processor to accept and store in a digital Machine Learning Solution repository, Machine Learning Solutions from scientists, each Machine Learning Solution including multiple blocks of code typically comprising a sequence of blocks of code, each block of code typically performing predefined operations defined by a challenge sent to the scientists and typically having a predefined format (which may be defined in a work-order sent to the data scientists) allowing any first block of code generated within a challenge according to the format by a first data scientist to use output of any second block of code generated within the challenge according to the format by a second data scientist if the second block of code precedes the first block of code in the sequence; and/or a second processor configured to mix and match Machine Learning Solution blocks stored in the Machine Learning Solution repository thereby to generate Machine Learning Solution Block Combinations, thereby, typically, to address plural respective Business Solutions and each Business Solution’s respective Analytical Solution, and/or a system-business user interface controlled by a third processor to run the Machine Learning Solutions pipeline compiled by the processor for said individual Business Solution, typically on business data provided by at least one business user who typically provides the system with an individual Business Solution including said business data (aka business data set - Business Data Set), thereby to generate a Machine Learning Solution pipeline output, and typically to send the output to the at least one business user.
Typically, a challenge defines a sequence of blocks including at least block1 followed by block2, and each data scientist responding to the challenge generates each block in the sequence, and the second processor configured to mix and match is operative to generate at least one pipeline comprising a block1 generated by a first data scientist followed by a block2 generated by a second data scientist other than the first data scientist. For example, a challenge may define a sequence: block1, block2, block3, block4, and the second processor configured to mix and match may generate, inter alia, at least one pipeline including a block1, block2, block3 and block4 respectively generated by 4 different data scientists, all of whom responded to the challenge.
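The predefined block format might be sketched as follows (an assumed convention for illustration, not the patented format): each block is a callable consuming the previous block's output, so blocks authored by different data scientists compose into one pipeline.

```python
# Sketch of the predefined block format (assumed convention): every block
# takes the previous block's output, so block2 from scientist B can consume
# block1's output from scientist A.
def block1_scientist_a(data):
    return [x * 2 for x in data]       # e.g. a feature-engineering block

def block2_scientist_b(data):
    return [x for x in data if x > 2]  # e.g. a feature-selection block

def run_pipeline(blocks, data):
    for block in blocks:
        data = block(data)             # each block feeds the next in sequence
    return data

result = run_pipeline([block1_scientist_a, block2_scientist_b], [1, 2, 3])
# [1, 2, 3] -> [2, 4, 6] -> [4, 6]
```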
Certain embodiments of the present invention seek to provide circuitry typically comprising at least one processor in communication with at least one memory, with instructions stored in such memory executed by the processor to provide functionalities which are described herein in detail. Any functionality described herein may be firmware-implemented or processor- implemented as appropriate.
It is appreciated that any reference herein to, or recitation of, an operation being performed is, e.g. if the operation is performed at least partly in software, intended to include both an embodiment where the operation is performed in its entirety by a server A, and also to include any type of“outsourcing” or“cloud" embodiments in which the operation, or portions thereof, is or are performed by a remote processor P (or several such), which may be deployed off-shore or“on a cloud", and an output of the operation is then communicated to, e.g. over a suitable computer network, and used by, server A. Analogously, the remote processor P may not, itself, perform all of the operation and instead, the remote processor P itself may receive output/s of portion/s of the operation from yet another processor/s P, may be deployed off-shore relative to P, or“on a cloud”, and so forth.
The present invention typically includes at least the following embodiments:
Embodiment 1: A data processing system comprising: a system-scientist data interface controlled by a first processor to accept and store in a digital repository, multiple-part Machine Learning Solutions from scientists each including multiple blocks; a second processor compiling plural Machine Learning Solution pipelines to address plural respective Business Solutions and each Business Solution’s respective Analytical Solution, wherein the compiling comprises mixing-and-matching Machine Learning Solution blocks stored in the Machine Learning Solution repository thereby to generate Machine Learning Solution Block Combinations, and a system-business user interface controlled by a third processor to generate a Machine Learning Solution pipeline output, for at least one business user presenting an individual Business Solution including business data (aka business data set - Business Data Set) using the Machine Learning Solutions pipeline compiled by the processor for the individual Business Solution.
Embodiment 2. A system according to any of the preceding embodiments and also comprising a fourth processor operative to communicate to at least one Data Scientist, a challenge inviting the at least one Data Scientist to provide at least one Machine Learning Solution.
Embodiment 3. A system according to any of the preceding embodiments wherein the multiple blocks include a Business Solution-Feature Engineering and Extraction block.
Embodiment 4. A system according to any of the preceding embodiments wherein the multiple blocks include a Machine Learning Solution - Features Selection block.
Embodiment 5. A system according to any of the preceding embodiments wherein the multiple blocks include a Machine Learning Solution - Training/Learning/Full Evaluation block.
Embodiment 6. A system according to any of the preceding embodiments wherein the multiple blocks include a Machine Learning Solution - Final Prediction block.
Embodiment 7. A system according to any of the preceding embodiments wherein for at least one Business Solution, the Machine Learning Solution Block combinations automatically compete with one another including identifying a best block combination and wherein the Machine Learning Solution pipeline compiled by the processor for the individual Business Solution comprises the best block combination.
Embodiment 8. A system according to any of the preceding embodiments wherein the Business Solution-Feature Engineering and Extraction block includes code operative for pre-processing data before machine learning analysis.
Embodiment 9. A system according to any of the preceding embodiments wherein all pre-processing of data before machine learning analysis occurs in the Business Solution-Feature Engineering and Extraction block and not in other blocks.
Embodiment 10. A system according to any of the preceding embodiments wherein the pre-processing includes removing data columns.
Embodiment 11. A system according to any of the preceding embodiments wherein the pre-processing includes extracting additional data from existing data and using the additional data as input to the machine learning analysis.
Embodiment 12. A system according to any of the preceding embodiments wherein the Machine Learning Solution - Features Selection block includes code operative for selecting features which are stronger predictors of at least one target variable to be predicted in a given challenge and not selecting features which are less strong predictors of the at least one target variable.
Embodiment 13. A system according to any of the preceding embodiments wherein all code operative for selecting features which are stronger predictors of at least one target variable to be predicted in a given challenge and not selecting features which are less strong predictors of the at least one target variable is included in the Machine Learning Solution - Features Selection block and not in other blocks.
Embodiment 14. A system according to any of the preceding embodiments wherein the Machine Learning Solution - Training/Learning/Full Evaluation block includes machine learning code operative to train and evaluate a machine learning training set.
Embodiment 15. A system according to any of the preceding embodiments wherein all code operative to train and evaluate a machine learning training set is included in the Machine Learning Solution - Training/Learning/Full Evaluation block and not in other blocks.
Embodiment 16. A system according to any of the preceding embodiments wherein the Machine Learning Solution - Final Prediction block is operative to run machine learning code on a test set, including providing a final prediction for each data record in the test set.
Embodiment 17. A system according to any of the preceding embodiments wherein the Machine Learning Solution - Final Prediction block is operative to run machine learning code included in a Machine Learning Solution - Training/Learning/Full Evaluation block on a test set, including providing a final prediction for each data record in the test set.
Embodiment 18. A system according to any of the preceding embodiments wherein all code operative to run machine learning code on a test set, including providing a final prediction for each data record in the test set, is performed by the Machine Learning Solution - Final Prediction block and not in other blocks.
Embodiment 19. A data processing method comprising: Providing a system-scientist data interface controlled by a first processor to accept and store in a digital repository, multiple-part Machine Learning Solutions from scientists each including multiple blocks; Providing a second processor compiling plural Machine Learning Solution pipelines to address plural respective Business Solutions and each Business Solution’s respective Analytical Solution, wherein the compiling comprises mixing-and-matching Machine Learning Solution blocks stored in the Machine Learning Solution repository thereby to generate Machine Learning Solution Block Combinations, and Providing a system-business user interface controlled by a third processor to generate a Machine Learning Solution pipeline output, for at least one business user presenting an individual Business Solution including business data (aka business data set - Business Data Set) using the Machine Learning Solutions pipeline compiled by the processor for the individual Business Solution.
Embodiment 20. A computer program product, comprising a non-transitory tangible computer readable medium having computer readable program code embodied therein, the computer readable program code adapted to be executed to implement a data processing method comprising: Providing a system-scientist data interface controlled by a first processor to accept and store in a digital repository, multiple-part Machine Learning Solutions from scientists each including multiple blocks; Providing a second processor compiling plural Machine Learning Solution pipelines to address plural respective Business Solutions and each Business Solution’s respective Analytical Solution, wherein the compiling comprises mixing-and-matching Machine Learning Solution blocks stored in the Machine Learning Solution repository thereby to generate Machine Learning Solution Block Combinations, and Providing a system-business user interface controlled by a third processor to generate a Machine Learning Solution pipeline output, for at least one business user presenting an individual Business Solution including business data (aka business data set - Business Data Set) using the Machine Learning Solutions pipeline compiled by the processor for the individual Business Solution.
Also provided, excluding signals, is a computer program comprising computer program code means for performing any of the methods shown and described herein when the program is run on at least one computer; and a computer program product, comprising a typically non- transitory computer-usable or -readable medium e.g. non-transitory computer -usable or -readable storage medium, typically tangible, having a computer readable program code embodied therein, the computer readable program code adapted to be executed to implement any or all of the methods shown and described herein. The operations in accordance with the teachings herein may be performed by at least one computer specially constructed for the desired purposes or general purpose computer specially configured for the desired purpose by at least one computer program stored in a typically non-transitory computer readable storage medium. The term "non-transitory" is used herein to exclude transitory, propagating signals or waves, but to otherwise include any volatile or non-volatile computer memory technology suitable to the application.
Any suitable processor/s, display and input means may be used to process, display e.g. on a computer screen or other computer output device, store, and accept information such as information used by or generated by any of the methods and apparatus shown and described herein; the above processor/s, display and input means including computer programs, in accordance with some or all of the embodiments of the present invention. Any or all functionalities of the invention shown and described herein, such as but not limited to operations within flowcharts, may be performed by any one or more of: at least one conventional personal computer processor, workstation or other programmable device or computer or electronic computing device or processor, either general-purpose or specifically constructed, used for processing; a computer display screen and/or printer and/or speaker for displaying; machine-readable memory such as optical disks, CDROMs, DVDs, BluRays, magnetic-optical discs or other discs; RAMs, ROMs, EPROMs, EEPROMs, magnetic or optical or other cards, for storing, and keyboard or mouse for accepting. Modules shown and described herein may include any one or combination or plurality of: a server, a data processor, a memory/computer storage, a communication interface, a computer program stored in memory/computer storage.
The term "process" as used above is intended to include any type of computation or manipulation or transformation of data represented as physical, e.g. electronic, phenomena which may occur or reside e.g. within registers and /or memories of at least one computer or processor. Use of nouns in singular form is not intended to be limiting; thus the term processor is intended to include a plurality of processing units which may be distributed or remote, the term server is intended to include plural typically interconnected modules running on plural respective servers, and so forth.
The above devices may communicate via any conventional wired or wireless digital communication means, e.g. via a wired or cellular telephone network or a computer network such as the Internet.
The apparatus of the present invention may include, according to certain embodiments of the invention, machine readable memory containing or otherwise storing a program of instructions which, when executed by the machine, implements some or all of the apparatus, methods, features and functionalities of the invention shown and described herein. Alternatively or in addition, the apparatus of the present invention may include, according to certain embodiments of the invention, a program as above which may be written in any conventional programming language, and optionally a machine for executing the program such as but not limited to a general purpose computer which may optionally be configured or activated in accordance with the teachings of the present invention. Any of the teachings incorporated herein may wherever suitable operate on signals representative of physical objects or substances.
The embodiments referred to above, and other embodiments, are described in detail in the next section.
Any trademark occurring in the text or drawings is the property of its owner and occurs herein merely to explain or illustrate one example of how an embodiment of the invention may be implemented.
Unless stated otherwise, terms such as "processing", "computing", "estimating", "selecting", "ranking", "grading", "calculating", "determining", "generating", "reassessing", "classifying", "producing", "stereo-matching", "registering", "detecting", "associating", "superimposing", "obtaining", "providing", "accessing", "setting" or the like, refer to the action and/or processes of at least one computer/s or computing system/s, or processor/s or similar electronic computing device/s or circuitry, that manipulate and/or transform data which may be represented as physical, such as electronic, quantities e.g. within the computing system's registers and/or memories, and/or may be provided on-the-fly, into other data which may be similarly represented as physical quantities within the computing system's memories, registers or other such information storage, transmission or display devices or may be provided to external factors e.g. via a suitable data network. The term "computer" should be broadly construed to cover any kind of electronic device with data processing capabilities, including, by way of non-limiting example, personal computers, servers, embedded cores, computing systems, communication devices, processors (e.g. digital signal processor (DSP), microcontrollers, field programmable gate array (FPGA), application specific integrated circuit (ASIC), etc.) and other electronic computing devices. Any reference to a computer, controller or processor is intended to include one or more hardware devices e.g. chips, which may be co-located or remote from one another. Any controller or processor may for example comprise at least one CPU, DSP, FPGA or ASIC, suitably configured in accordance with the logic and functionalities described herein.
The present invention may be described, merely for clarity, in terms of terminology specific to, or references to, particular programming languages, operating systems, browsers, system versions, individual products, protocols and the like. It will be appreciated that this terminology or such reference/s is intended to convey general principles of operation clearly and briefly, by way of example, and is not intended to limit the scope of the invention solely to a particular programming language, operating system, browser, system version, or individual product or protocol. Nonetheless, the disclosure of the standard or other professional literature defining the programming language, operating system, browser, system version, or individual product or protocol in question, is incorporated by reference herein in its entirety.
Elements separately listed herein need not be distinct components and alternatively may be the same structure. A statement that an element or feature may exist is intended to include (a) embodiments in which the element or feature exists; (b) embodiments in which the element or feature does not exist; and (c) embodiments in which the element or feature exists selectably, e.g. a user may configure or select whether the element or feature does or does not exist.
Any suitable input device, such as but not limited to a sensor, may be used to generate or otherwise provide information received by the apparatus and methods shown and described herein. Any suitable output device or display may be used to display or output information generated by the apparatus and methods shown and described herein. Any suitable processor/s may be employed to compute or generate information as described herein and/or to perform functionalities described herein and/or to implement any engine, interface or other system described herein. Any suitable computerized data storage e.g. computer memory may be used to store information received by or generated by the systems shown and described herein. Functionalities shown and described herein may be divided between a server computer and a plurality of client computers. These or any other computerized components shown and described herein may communicate between themselves via a suitable computer network.
BRIEF DESCRIPTION OF THE DRAWINGS
Certain embodiments of the present invention are illustrated in the following drawings; in the block diagrams, arrows between modules may be implemented as APIs and any suitable technology may be used for interconnecting functional components or modules illustrated herein in a suitable sequence or order e.g. via a suitable API/Interface. Any table herein may include some or all of the fields and/or records and/or cells, rows or columns shown herein.
Fig. 1 is a top level flow which may for example be performed by the system of Fig. 10.
Fig. 2 is a zoom-in on operation 400 of Fig. 1: matchmaking.
Fig. 3 is a zoom-in on phase i aka operation 410 of Fig. 2.
Fig. 4 is a zoom-in on phase ii aka operation 420 of Fig. 2.
Fig. 5 is a pictorial illustration showing operations 100 - 400 of Fig. 1, inter alia.
Fig. 6 is a table useful in understanding certain embodiments.
Fig. 7 is a table of an example data structure for a Business Solution repository.
Fig. 8 is a table useful in understanding certain embodiments.
Fig. 9 is a table of an example data structure for an Analytical Solution (AS) repository.
Fig. 10 is a diagram useful in understanding certain embodiments.
Fig. 11 is a table of an example data structure for a Machine Learning Solution (MLS) repository.
Fig. 12 is an example of operation of an MLS pipeline population method in accordance with certain embodiments.
Fig. 13 is a table useful in understanding certain embodiments.
Fig. 14 is a table useful in understanding certain embodiments.
Fig. 15 is a table showing example formalization for Business Questions.
Methods and systems included in the scope of the present invention may include some (e.g. any suitable subset) or all of the functional blocks shown in the specifically illustrated implementations by way of example, in any suitable order e.g. as shown.
Computational, functional or logical components described and illustrated herein can be implemented in various forms, for example, as hardware circuits such as but not limited to custom VLSI circuits or gate arrays or programmable hardware devices such as but not limited to FPGAs, or as software program code stored on at least one tangible or intangible computer readable medium and executable by at least one processor, or any suitable combination thereof. A specific functional component may be formed by one particular sequence of software code, or by a plurality of such, which collectively act or behave as described herein with reference to the functional component in question. For example, the component may be distributed over several code sequences such as but not limited to objects, procedures, functions, routines and programs and may originate from several computer files which typically operate synergistically.
Each functionality or method herein may be implemented in software, firmware, hardware or any combination thereof. Functionality or operations stipulated as being software-implemented may alternatively be wholly or partly implemented by an equivalent hardware or firmware module and vice-versa. Firmware implementing functionality described herein, if provided, may be held in any suitable memory device and a suitable processing unit (aka processor) may be configured for executing firmware code. Alternatively, certain embodiments described herein may be implemented partly or exclusively in hardware, in which case some or all of the variables, parameters, and computations described herein may be in hardware.
Any module or functionality described herein may comprise a suitably configured hardware component or circuitry. Alternatively or in addition, modules or functionality described herein may be performed by a general purpose computer or more generally by a suitable microprocessor, configured in accordance with: methods shown and described herein, or any suitable subset, in any suitable order, of the operations included in such methods, or in accordance with methods known in the art.
Any logical functionality described herein may be implemented as a real time application if and as appropriate and which may employ any suitable architectural option such as but not limited to FPGA, ASIC or DSP or any suitable combination thereof.
Any hardware component mentioned herein may in fact include either one or more hardware devices e.g. chips, which may be co-located or remote from one another.
Any method described herein is intended to include within the scope of the embodiments of the present invention also any software or computer program performing some or all of the method’s operations, including a mobile application, platform or operating system e.g. as stored in a medium, as well as combining the computer program with a hardware device to perform some or all of the operations of the method.
Data can be stored on one or more tangible or intangible computer readable media stored at one or more different locations, different network nodes or different storage devices at a single node or location. It is appreciated that any computer data storage technology, including any type of storage or memory and any type of computer components and recording media that retain digital data used for computing for an interval of time, and any type of information retention technology, may be used to store the various data provided and employed herein. Suitable computer data storage or information retention apparatus may include apparatus which is primary, secondary, tertiary or off-line; which is of any type or level or amount or category of volatility, differentiation, mutability, accessibility, addressability, capacity, performance and energy use; and which is based on any suitable technologies such as semiconductor, magnetic, optical, paper and others.
DETAILED DESCRIPTION OF CERTAIN EMBODIMENTS
Acronyms used herein include:
ADDS (Additional Data Set): new features and/or files generated by a module e.g. by executing Addition Rules that may be predefined for each of a Data Set's features in the business solution and analytics solution,
ADS (Analytical Data Set): defines data, typically required data, associated with the AS,
AS (Analytical Solution): typically comprises a (e.g. statistical) representation of plural business solutions which are all, typically, statistically similar
BC (Block Combinations): combinations of blocks that typically compete with each other automatically,
BDS (Business Data Set)
Business Question (BQ). A suitable ontology may be used to define business questions such as the following example ontology, which includes some or all of the key words below. According to certain embodiments, each entity and/or verb and/or event in the business question must be selected from the following; the ontology may be stored as Business Question metadata:
Defect Prediction, Source Code Metric, Change Metrics
Marketing Mix, Long-Term Effects, Brand Performance, Empirical
Generalizations
Data Mining, Rule Generators, (Promotion) Event Forecasting
Choice Models, Defensive Strategy, Customer Satisfaction, Perceived Quality
Customers, Brands, Sales Promotions, Market Prices
Human Centered Computing, Human Computer Interaction, Web-Based
Interaction
Churn Rate, Customer Loyalty, Retention, Target Market Prediction
Inventory Forecasting, ATM, Optimization, Product Usage
Sales Personnel, Performance Tests, Sales Management
Pricing, Bundle Composition, Products Selection
Customer Satisfaction, Improvement Techniques, Customer Retention
BS (Business Solution)
Business solution (BS) Execution Parameters (BS-EP)
CSV: term of the art e.g. as defined in https://support.bigcommerce.com/articles/Public/What-is- a-CSV-file-and-how-do-I-save-my-spreadsheet-as-one
Data File (DF),
Data Set (DS).
EP: Execution Parameters
External Data Scientists (EDS)
F: file e.g. Data File
FODS Final Obfuscated Data Set
Internal Data Scientists (IDS)
Machine Learning Solutions (MLS)
MO: month
MOH: month header data
QA: Quality Assurance
User Data Set (UDS)
A data science platform is shown and described herein. Typically, an analytical Business Question (BQ) requests a business answer which, based on past data, predicts the future or arranges the business entities in a new way. A Business Question is meant for decision support of managerial nature (e.g. "what is the best decision I should take now?"), or of operational nature (e.g. "define the next action to be implemented in a process or for an asset").
Certain embodiments include at least one Data Science process setup in advance e.g. as described herein and/or at least one automatic run typically initiated by the business user. Typically, the run matches a good or preferably best produced analytical answer with:
the specific Business Question selected, and/or
one or more operational and/or business parameters the business user selects, and/or
Business Data Set (BDS) requirements and the User Data Set (UDS) - typically comprising raw data extracted from the business user’s databases.
In addition to the User Data Set, the system of Fig. 10 may store at least one additional, typically non-user-specific (e.g. suitable commercially available) Additional Data Set (ADDS) that may be merged with the User Data Set to enrich it, e.g. as per operation 520. Operation 520 is the point where these enriching Data Sets are merged; from this point on, they are part of the complete Data Sets and are no longer considered special. Enriching may comprise adding, in operation 520, additional Data Sets which, rather than including end-user data, include commercial and/or open and/or government data.
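For illustration only, the enrichment of operation 520 may be sketched as a key-based merge. All field names below (customer_id, zip_code, median_income etc.) are hypothetical examples, not part of the specification; the sketch assumes tabular data handled with pandas:

```python
import pandas as pd

# Hypothetical User Data Set (UDS): raw data extracted from the
# business user's databases, keyed by a shared field (here, a zip code).
uds = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "zip_code": ["10001", "94105", "60601"],
    "monthly_spend": [120.0, 85.5, 230.0],
})

# Hypothetical Additional Data Set (ADDS): non-user-specific commercial,
# open, or government data keyed by the same field.
adds = pd.DataFrame({
    "zip_code": ["10001", "94105", "60601"],
    "median_income": [85000, 112000, 67000],
})

# Operation 520: merge the enriching Data Set into the User Data Set.
# From this point on, the added columns are ordinary features of the
# complete Data Set.
complete_ds = uds.merge(adds, on="zip_code", how="left")
```

The left merge preserves every User Data Set record even when no enriching record matches, which mirrors the notion that enrichment is additive.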
Flows which may be performed by the system typically comprise all or any subset of the following:
Top level flow of Fig. 1, which may include some or any subset of the following, suitably ordered e.g. as shown:
Operation 100: Pre-storing Business Solutions
Operation 200: Pre-storing Analytical Solutions
Operation 300: Pre-storing Machine Learning Solutions
Phase I - Preparing for a challenge
Phase II - Running a challenge
Operation 400: Matchmaking the best Machine Learning Solutions (MLSs) (where "best" typically refers to a best predicting solution or a solution having a highest benchmark score). Operation 400 is operative for running all MLSs and selecting between all the available MLSs that are candidates (e.g. included in the pipeline).
Typically, each analytical solution comprises a template and a challenge relates to a specific instance of that template. Each instance may include a specific data set and there may be differences between two individual challenges "under" a single analytical solution, beyond differences in the data set e.g. differences in the specific operational parameters. Each analytical solution typically includes a statistical function and a data question and a structure for the data set without the specific data itself. Each challenge distributed to data scientists typically includes a work order and typically includes a data set, which typically includes example deliverable/s. Typically, each data scientist is called upon to write all or a specified subset of the following four blocks of code (four programs):
a. the Machine Learning Solution - FEE (Features Engineering & Extraction) block: makes the data as productive as possible for machine learning analysis use, e.g. by removing empty features (e.g. data columns such as date, scalars, codes etc.) and/or by extracting from the data as many derived (additional) features as possible. This extracts as much information as possible from a provided Pn Data Set.
b. the MLS-FS (Machine Learning Solutions - Features Selection) block: selects the best performing features (e.g. those having the greatest statistical influence on the target of the analytical goal, e.g. best prediction) from the features previously engineered and extracted in the Machine Learning Solutions - Feature Engineering & Extraction (FEE) block. "Target" is the term for whatever is being predicted in a particular challenge. Any suitable method may be employed to determine which feature/s perform best such as, for example, analysis of variance to determine which feature/s explain a larger proportion of the variance of the target variable/s than other feature/s.
c. the MLS-TLF (Machine Learning Solutions - Training/Learning/Full evaluation) block: uses the features selected by the previous block to train on and evaluate the train data set to form the Prediction Set (PS). It is appreciated that in machine learning, three data sets are employed, typically termed the training, test, and validation sets; the "train data set" referred to herein is the same as the "training set". The "prediction set" may be the same as the test set or validation set, according to certain embodiments.
d. the MLS-FP (Machine Learning Solutions - Final Prediction) block: runs the pre-trained and evaluated Machine Learning code of block c on the test set (out-of-time predictions) to provide a Final Prediction for each data record in the test set, thereby to generate a Final Prediction Set (FPS).
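The four blocks a-d above may be viewed as stages of a single pipeline. The following is a schematic sketch only: the function names, the dictionary-based record format, and the trivial constant-mean "learner" are all illustrative assumptions standing in for whatever code a data scientist would actually deliver:

```python
# Schematic sketch of the four Machine Learning Solution blocks.
# All names, interfaces and the toy model are illustrative assumptions.

def mls_fee(data_set):
    """a. FEE - Features Engineering & Extraction: drop unproductive
    (empty) columns and derive additional features."""
    cleaned = {k: v for k, v in data_set.items() if v is not None}
    cleaned["spend_per_visit"] = cleaned["spend"] / max(cleaned["visits"], 1)
    return cleaned

def mls_fs(features, target_name):
    """b. FS - Features Selection: keep the features with the greatest
    statistical influence on the target (here: a trivial stand-in that
    merely excludes the target column itself)."""
    return {k: v for k, v in features.items() if k != target_name}

def mls_tlf(train_records, train_targets):
    """c. TLF - Training/Learning/Full evaluation: train on the training
    set; a constant-mean model stands in for a real learner."""
    mean = sum(train_targets) / len(train_targets)
    return lambda record: mean

def mls_fp(model, test_records):
    """d. FP - Final Prediction: run the pre-trained model on the test
    set to produce the Final Prediction Set (FPS)."""
    return [model(r) for r in test_records]
```

In a real challenge each block would be a separately deliverable program, which is what allows operation 400 to mix and match blocks contributed by different data scientists.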
Operation 400: matchmaking may include some or any subset of the following, suitably ordered e.g. as shown in Fig. 2:
Operation 410 aka Phase I - Data pre-processing
Operation 420 aka Phase II -Machine Learning Solutions Matchmaking
Operation 430 aka Phase III - Production of Analytical Results and of Statistical Story
Phase I aka operation 410 may include some or any subset of the following, suitably ordered e.g. as shown in Fig. 3:
Operation 510 aka module 1 - data validation and preparation
Operation 510 aka module 2 - data features additions
Operation 520 aka module 1a - rerun operation 510 (module 1 - data validation and preparation) on the complete data sets
Operation 530 aka module 3 - data holistic validation
Operation 540 aka module 4 - data reduction
Operation 550 aka module 5 - data set obfuscation
Operation 560 aka module 6 - data set splitting
Operation 570 aka module 7a of phase I - benchmark calculation for baseline model
Zoom-in on phase II aka operation 420 may include some or any subset of the following, suitably ordered e.g. as shown in Fig. 4:
Operation 670 aka Module 7b of phase ii - Populate Machine Learning Solution Pipeline
Operation 680 aka Module 8 - Feature engineering & extraction aka Machine Learning Solution - Feature Engineering and Extraction
Operation 690 aka Module 9 - Feature selection aka Machine Learning Solution - Features Selection
Operation 700 aka Module 10 - Training/learning/full evaluation aka Machine Learning Solution - Training/Learning/Full Evaluation.
Operation 710 aka Module 11 - Final Prediction aka Machine Learning Solution - Final Prediction.
Operation 720 aka Module 12 - Benchmarking and Machine Learning Solution Selection
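The benchmarking and selection of Module 12 can be sketched as a simple comparison of candidate scores against the baseline benchmark computed in module 7a. The candidate structure, scoring convention (higher is better) and function names below are illustrative assumptions only:

```python
def select_best_mls(candidates, benchmark_score, evaluate):
    """Run every candidate Machine Learning Solution in the pipeline
    (operation 400) and keep the one whose evaluation score beats both
    the other candidates and the baseline benchmark of module 7a.
    Returns (None, benchmark_score) if no candidate outperforms the
    baseline model."""
    best_name, best_score = None, benchmark_score
    for name, mls in candidates.items():
        score = evaluate(mls)  # e.g. accuracy on a held-out set
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score
```

A usage sketch: with candidates whose (hypothetical) validation accuracies are 0.71 and 0.84 and a baseline benchmark of 0.75, only the second candidate qualifies and is selected.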
An overview of operations 100, 200, 300 and 400, according to an embodiment of the invention, is provided in the pictorial diagram of Fig. 5, which includes three processes which converge at a "main" server, e.g. process 1 - pre-storing analytical and business solutions;
process 2 - pre-storing machine learning solutions, and process 3 - automatic run: matchmaking the best analytical answer.
A 1st Use Case - Pre-storing the Business and Analytical Solutions and a 1st Layer operative for Pre-storing the Business Solution are now described in detail.
A Business Solution answers a Business Question.
According to certain embodiments, business users "pick a solution" to a Business Question that they have, from a platform (e.g. an Amazon- or Oracle-based software as a service platform which may have, say, a site-to-site VPN connection with PAAS cloud services), e.g. in accordance with some or all of the architecture shown and described herein. Each business user typically uploads data which is defined by the platform as the data elements required for the Business Question. The platform alternatively, or in addition, interfaces with data scientists who develop models, e.g. based on anonymized or obfuscated data e.g. data provided by business users and obfuscated by the platform's server. According to certain embodiments, prediction models generated by data scientists are accumulated e.g. using a "challenge" process as described herein in which the platform's server auto-transforms Business Questions into standardized analytical tasks e.g. by a business question obfuscation process performed in Operation 300, phase I of Fig. 1. Each prediction model may subsequently be matched to and/or employed by more than one or many business users, wherein each business user typically is provided with an analytical answer based on the model which is a "best model" for her or his question and data.
Typically, challenges are each identified by a unique identifier e.g. serial number, and each data scientist or service provider fills out an order form for at least one challenge, with the scientist’s contact information inter alia. The delivery and testing of Deliverables generated by data scientists responsive to a challenge are typically as per the work order sent out for that challenge. Deliverables may include drawings, software, algorithm/s, certifications, documentation, codes, samples inter alia.
The system of Fig. 10 (say) may store plural (e.g. 10 or 20 or a few dozen or a few hundred) Business Questions which represent frequently asked Business Questions that computerized organizations deal with frequently e.g. daily. A Business Question typically includes at least the following components aka Business Question (minimal) form:
Verb component: {verb} e.g. predict
event component: {business event} e.g. sale
entity component: per {targeted business entity} e.g. product
There is typically one verb and one event, but there may be plural entities.
Thus, a typical Business Question may be represented formally in memory as: verb a business event per targeted business entity, where each component is stored for each Business Question.
As a result, each Business Question may be deemed to have a "target" - to {verb} e.g. predict the {targeted business entity} e.g. product.
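The verb/event/entity decomposition above can be represented in memory as a simple record type. This is a minimal sketch; the class and field names are illustrative, not a stipulated storage format:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class BusinessQuestion:
    """Minimal Business Question form: one verb, one business event,
    and one or more targeted business entities (entities may be plural)."""
    verb: str            # e.g. "predict"
    event: str           # e.g. "sale"
    entities: List[str]  # e.g. ["product"]

    def target(self) -> str:
        # The Business Question's "target": to {verb} the {entity}.
        return f"{self.verb} the {self.entities[0]}"

# Example instance for the components listed above.
bq = BusinessQuestion(verb="predict", event="sale", entities=["product"])
```

Storing each component separately, rather than as free text, is what later lets statistically similar Business Questions be grouped under one Analytical Question.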
A set of typical Business Questions, all or any subset of which may be provided, may appear as follows where“customer” here refers to the system end-user’s customer :
What is the likelihood that each customer repeats purchases in (period)?
How likely is each customer to leave my business for a competitor by (period)?
How many products are defective per (number) produced?
How many days will it take to see (percent) increase in revenue after implementation of (marketing programs) per customer segment?
What is the most effective method to grow my network of prospective customers?
Which strategy would better improve customer satisfaction?
What is the optimal length for a price promotion for (Product A)?
Which online platform creates the most positive, effective customer experience?
What is the probability that each customer switches from (product A) to (B) within (period)? [same company, diff products]
When will an ATM reach the point of refill?
Which of my sales persons gives most satisfaction to the customers?
What percentage of bundle sales are composed of (Product A), (Product B), and (Product C)?
What is the probability that a given customer is unsatisfied with the service/product he received?
What is the likelihood that a new customer adopts our product/service?
What is the probability that revenue streams will increase/decrease by (percent) within (period)?
How many customers will be digitally activated by (period)?
What is the probability that each customer refers the product/service to a friend within (a certain period)?
How much investment is needed to support a (percent) growth in revenue?
Business Question (1) "How many products are defective per (number) produced?", Business Question (2) "What is the most effective method to grow my network of prospective customers?" and Business Question (3) "What is the likelihood that each customer repeats purchases in (a certain period)?", etc., may be formalized as shown in the table of Fig. 15.
Each Business Question typically includes, in addition to an actual question e.g. as above, also an associated Business Data Set (BDS). A Data Set defines any required and/or optional data elements associated with the Business Question. Each Business Data Set consists of one or more Data Files (DF), which are tables of features - keys and additional fields. Data Files are interconnected via keys to each other (e.g. Customer ID may appear in each of the customer related files, Date & Time may appear in all time series related files).
For a Business Solution to initiate the Data Science process, whereby the analytical results may be received, a third part - the Business Solution Execution Parameters (BS-EP) - is typically provided and may for example be stored as shown below in the table of Fig. 7. These allow the business user to accommodate the Business Solution to the specific business and operational needs. The Business Solution Execution Parameters may for example include, inter alia, either or both of the parameters in Fig. 6.
One data item in a Business Solution typically comprises the Business Question. It is appreciated that if Business Questions are expressed in natural language they are parsed or understood typically in phase I (i) or operation 300 of Fig. 1. A Business Question may comprise text which may be in natural language and/or may include at least the following components aka Business Question (minimal) form: {verb} e.g. predict & {business event} e.g. sale & per {targeted business entity} e.g. product
The SINGLE/MANY column in the table of Fig. 7 indicates whether each data item typically has but a single instance per Business Solution, or whether plural instances of that data item may occur per Business Solution.
Fig. 7 is a table illustrating an example data structure for a Business Solution data repository; all or any subset of the data items illustrated may be provided, typically for each of a plurality of business solutions. Fig. 8 shows an example Business Solution with all its data items. The table of Fig. 8 relates to a Business Solution for the "how many products are defective per (number) produced?" Business Question listed above by way of example.
2nd Layer - Pre-storing the Analytical Solution: An Analytical Solution (AS) typically comprises a statistical representation of plural Business Solutions whose statistical essence is similar. The Analytical Solution is, like the Business Solution, prepared in advance in a repository of Fig. 10, and each Business Solution has a specific Analytical Solution which typically statistically or formally describes that Business Solution (BS). As a result, an Analytical Solution answers an Analytical Question that is an "abstracted" version of many statistically similar Business Questions. For example, a Business Question might be "predict the SALES of a SALES PERSON per MONTH and PRODUCT TYPE", whereas the Business Question's "abstracted" matching Analytical Question might be "Predict the {TIME SERIAL ACTION} of a {BUSINESS ENTITY 1} per {BUSINESS ENTITY 2} and {BUSINESS ENTITY 3}". Other Business Questions with the same matching Analytical Question might be to predict some other action of some other business entity 1 (perhaps not a sales person) per business entities 2 and 3 (perhaps not a month and/or not a product type).
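The abstraction from a concrete Business Question to its Analytical Question can be illustrated as placeholder substitution over a template. The placeholder names follow the SALES/SALES PERSON example above; the string-template mechanism itself is purely an illustrative sketch:

```python
# Illustrative only: the Analytical Question is a template over
# placeholders; many statistically similar Business Questions map to it.
analytical_question = ("Predict the {TIME_SERIAL_ACTION} of a "
                       "{BUSINESS_ENTITY_1} per {BUSINESS_ENTITY_2} "
                       "and {BUSINESS_ENTITY_3}")

# Bindings for one concrete Business Question.
business_bindings = {
    "TIME_SERIAL_ACTION": "SALES",
    "BUSINESS_ENTITY_1": "SALES PERSON",
    "BUSINESS_ENTITY_2": "MONTH",
    "BUSINESS_ENTITY_3": "PRODUCT TYPE",
}

# Instantiating the template recovers the concrete Business Question;
# different bindings recover different Business Questions sharing the
# same Analytical Question.
business_question = analytical_question.format(**business_bindings)
```

This is why one Machine Learning Solution written against the Analytical Question can serve every Business Question whose bindings instantiate the same template.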
As in the above example, a Business Question typically has a target - to {VERB} e.g. predict the {TARGETED BUSINESS ENTITY} e.g. product.
As in the Business Solution, each Analytical Solution typically comprises an Analytical Question and an associated Analytical Data Set (ADS).
An Analytical Data Set defines any required data associated with the Business Question. Each Analytical Data Set consists of one or more Data Files (ADF), which are tables of features - keys and additional fields. Data Files are connected to each other via keys (e.g. ID, Date & Time) in a way whereby (in relational database terms) a master-detail relationship is formed. For example, each MO (Month) Header Data Business Data File has an MOHD (Month Header Data) Order ID field which relates to one or more records in the MO Details Batch Data via the same MOHD Order ID field, which is found there as well.
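By way of a non-limiting illustration, the master-detail key relationship described above might be sketched with a hypothetical pair of tables (all names, e.g. order_id and batch_qty, are illustrative assumptions, not the actual ADF schema):

```python
import pandas as pd

# Hypothetical master (header) and detail tables joined on a shared key
# field, mirroring the MOHD Order ID relationship described above.
header = pd.DataFrame({
    "order_id": [1, 2],
    "month": ["2019-01", "2019-02"],
})
details = pd.DataFrame({
    "order_id": [1, 1, 2],
    "batch_qty": [100, 150, 80],
})

# A master-detail (one-to-many) join: each header record matches one or
# more detail records via the shared key.
joined = header.merge(details, on="order_id", how="left")
```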
In addition, the Analytical Data Set includes the targeted feature (e.g. the feature the solution is to predict, such as whether an end-user's customer will leave/stay) and the statistical method used to evaluate the prediction power of the solution.
Typically, the files are networked into a single data set composed of relations between files (e.g. tables); relational database schemas may be employed. Fig. 9 is a table illustrating an example data structure for an Analytical Solution (AS) data repository; all or any subset of the data items illustrated may be provided, typically for each of a plurality of Analytical Solutions.
Operation 300 aka 2nd use case - Pre-storing the Machine Learning Solutions, is now described in detail. Typically, this operation includes phases i and ii. Phase i comprises preparing for a challenge (e.g. issuing a Machine Learning Solution creation challenge to a population of External Data Scientists).
A repository e.g. as shown in fig. 10, may pre-store Analytical Solutions and matching Business Solutions, based on which External Data Scientists (EDS) may be challenged to create Machine Learning Solutions (MLS). These competing Machine Learning Solutions all address a specific Analytical Solution (including data structure, and parameters).
The Machine Learning Solutions may be created and tested to answer a specific user-selected Business Solution with the user's data and that Business Solution's matched Analytical Solution. These users, aka External Data Scientists, may be entitled to pre-load Machine Learning Solutions to be used later in the matchmaking process. The Business Solution and Analytical Solution design described herein creates advantageous flexibility in using the Machine Learning Solutions - since each Business Solution is correlated to an Analytical Solution, and each Analytical Solution may be correlated to several Business Solutions, each Machine Learning Solution may address several Business Solutions during Machine Learning Solution to Analytical Solution matchmaking. At the creation of the Business Solution, Internal Data Scientists select an Analytical Solution for that Business Solution, and set up the parameters for that Business Solution (as described in fig. 7, the Business Solution data items table), thereby to properly address the Business Question in the Business Solution. The External Data Scientists may receive a challenge Work Order (WO), which typically comprises a natural language document covering all the definitions of the Analytical Solution, e.g. as described herein, translating the specific Business Solution data and parameters as chosen by the user.
The different External Data Scientists may each create a machine-code program to form respective potentially competing Machine Learning Solutions. Each Machine Learning Solution comprises a Machine Learning model, e.g. code written in a designated statistical coding language such as Python or R, which is built by External Data Scientists, is tested and validated by the system of Fig. 10 in modules 10 to 12 in the next use case (including e.g. matchmaking the best analytical answer), and is uploaded to the Machine Learning Solutions repository.
Each Machine Learning Solution (programmed and supplied by External Data Scientists) may be required to have any subset of or all of the following four blocks:
Features Engineering & Extraction - FEE - Extract as much information as possible from a provided Pn Data Set. The essence of this block is for the External Data Scientists to initially make the data as productive as possible for machine learning analysis use (for example, remove empty features, e.g. data columns such as date, scalars, codes etc.) by machine learning algorithms, as well as to extract from the data as many derived (additional) features as possible.
Features Selection (Machine Learning Solution - FS) - Select the best performing (e.g. statistically influencing the target of the analytical goal, e.g. prediction) features from the features previously engineered and extracted during Machine Learning Solution - Features Engineering & Extraction.
Training/learning/full evaluation (TLF) - The selected features are used to train and evaluate on the train data set to form the Prediction Set (PS).
Final Prediction - FP - Runs the pre-trained and evaluated Machine Learning code written so far (as described in the previous blocks) on the test set (out of time predictions) to provide a Final Prediction for each data record in the test-set (creating the Final Prediction Set - FPS).
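The four blocks above might be skeletonized, by way of a non-limiting sketch, as follows (the function names, toy model and toy data are illustrative assumptions, not the actual code an External Data Scientist would supply):

```python
import numpy as np
import pandas as pd

def features_engineering_extraction(raw):
    # FEE: drop all-empty columns and derive an additional feature.
    out = raw.dropna(axis=1, how="all").copy()
    out["x2"] = out["x"] ** 2  # example derived feature
    return out

def features_selection(feats, target, k=1):
    # FS: keep the k features most (absolutely) correlated with the target.
    numeric = feats.select_dtypes(include=[np.number])
    corr = numeric.corrwith(target).abs().sort_values(ascending=False)
    return list(corr.index[:k])

def training_learning_full_evaluation(feats, target, selected):
    # TLF: fit a one-feature least-squares line and report train RMSE.
    x = feats[selected[0]].to_numpy(float)
    y = target.to_numpy(float)
    slope, intercept = np.polyfit(x, y, 1)
    pred = slope * x + intercept
    train_rmse = float(np.sqrt(np.mean((pred - y) ** 2)))
    return (slope, intercept, selected[0]), train_rmse

def final_prediction(model, test_feats):
    # FP: apply the trained model to each record of the test set.
    slope, intercept, col = model
    return slope * test_feats[col].to_numpy(float) + intercept

train = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "empty": [None] * 4})
target = pd.Series([2.0, 4.0, 6.0, 8.0])
feats = features_engineering_extraction(train)
selected = features_selection(feats, target)
model, train_rmse = training_learning_full_evaluation(feats, target, selected)
test = pd.DataFrame({"x": [5.0], "x2": [25.0]})
preds = final_prediction(model, test)
```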
Operation 300, Phase ii of Fig. 1 is now described in detail. This phase typically comprises running the challenge including running each Machine Learning Solution provided by an External Data Scientist responsive to the challenge.
Each Machine Learning Solution provided by an External Data Scientist may be run using a similar process as the (below described) third use case (“matchmaking the best analytical answer"). The only difference between this challenge run and the matchmaking may be that in the challenge run, the Machine Learning Solution is selected only from the Machine Learning Solutions provided in the specific challenge, whereas in matchmaking, the Machine Learning Solution is selected from a Machine Learning Solution repository, e.g. in the embodiment of fig. 10 , since the Machine Learning Solutions may have been pre-developed during past challenges.
Each Machine Learning Solution provided by the External Data Scientists may use the input obfuscated data included in the FODS (in CSV or a similar open data format or database) to predict the analytical target as defined in the Analytical Question in the Analytical Solution, as well as the Business Question in the Business Solution. Each Machine Learning Solution typically generates additional statistically supportive features, e.g. as described in the Features Engineering & Extraction (FEE) block description, and a prediction set, then stores all the results in a specific folder as one or more data files (CSV or any other format) to be used. Each Machine Learning Solution provided by the External Data Scientist passes designated validation tests to ensure all or any subset of:
a. the Machine Learning Solution’s conformity to any requirements of the Machine Learning Solution blocks as described above - features engineering & extraction (Machine Learning Solution -FEE), features selection (Machine Learning Solution -FS), Training/learning/full evaluation (Machine Learning Solution -TLF) and Final Prediction (Machine Learning Solution -FP).
b. in Module 8 - Feature engineering & extraction, the blocks supplied by the External Data Scientists are run.
c. the Machine Learning Solution's flexibility, e.g. ability to be selected by the future matchmaking process as many times as possible (e.g. the Machine Learning Solution's estimated compatibility with different Business Solutions and User Data Sets); the tests may run the Machine Learning Solutions in randomly changing situations of data sets and parameters. These stress tests may include (among additional tests): omitting and adding features, changing the period parameters etc., in order to stress test the blocks written by the External Data Scientists.
d. technical code QA (quality assurance). May include all or any subset of: stress tests with changing Operating Systems, Compiler versions, technically faulty parameters etc.
e. code does not infringe copyright, mainly ensuring all code provided by the External Data Scientists is valid open source and/or their own creation. Inter alia, a challenge (such as, say, a binary classification challenge or a "predict the numeric continuous value of the test set items" challenge) typically defines a format for an expected deliverable, and each data scientist participating in the challenge then uploads her or his deliverable in the expected format. Typically, a work order, provided to each external data scientist, defines formats and may specify these, e.g. in natural language.
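The flexibility stress test of item c might, by way of a non-limiting sketch, be harnessed as follows (the harness, the toy "solution" block and column names are illustrative assumptions):

```python
import random
import pandas as pd

def solution(df):
    # Toy stand-in for an External Data Scientist's block: predicts the
    # mean of whatever numeric columns remain in the data set.
    return df.select_dtypes(include="number").mean().mean()

def stress_test(solution, df, runs=10, seed=0):
    # Re-run the candidate solution on randomly perturbed variants of the
    # data set (omitting or adding a feature), counting survivals.
    rng = random.Random(seed)
    passed = 0
    for _ in range(runs):
        variant = df.copy()
        col = rng.choice(list(variant.columns))
        if rng.random() < 0.5:
            variant = variant.drop(columns=[col])  # omit a feature
        else:
            variant["extra"] = 1.0                 # add a feature
        try:
            solution(variant)
            passed += 1
        except Exception:
            pass
    return passed, runs

passed, runs = stress_test(solution, pd.DataFrame({"a": [1.0, 2.0],
                                                   "b": [3.0, 4.0]}))
```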
Each analytic solution (such as, say, each or any of the following 5 example analytic solutions: 1. binary classification of time series variable, 2. predict time series based continuous variable per object, 3. multivariate classification of time series variable, 4. multivariate clustering of time series variable and 5. predict regression of variable over time) is typically defined by its own work order. The system, and software products generated therein, then operate in accordance with the definitions in the work order. A work order typically includes one or both of the following:
a. Problem formulation, describing the statistical/computational task and a result required to solve a business question as included in the system library. A human "rule coder", aka Internal Data Scientist, (or a machine) may convert the business language into formal or statistical language. An example is translating "predict at which price a customer will buy a specific product" into "predict timeseries continuous variable for a specific X".
b. Data generalization, describing any required data sets in a logical structure of data files and the key features that connect them into an integrated data set scheme. For example, a human coder or internal data scientist (or a machine) may be operative for translating "customer profile" and "Customer Payments" data files into "Entity main file" and "entity timeseries actions", as well as translating each of the features (e.g. a customer's city) into generic features (e.g. X31) while capturing their statistical/mathematical natures (e.g. an ordinal category).
A first example work order for a "predict time series based continuous variable per item" challenge, aka SD 102-66, may include all or any portion of the following italicized natural language text (example: "about two thirds are provided as a training set, and the remaining third are used as a test set" may be amended to specify proportions other than 1/3 and 2/3), or all or any subset of the following bullets, or all or any subset of the following sentences or paragraphs:
In this predict time series based continuous variable per item challenge, the service provider ("You") will work on a unique dataset that combines relational and multivariate time series features ("Data"). The structure of the Data creates plenty of room for modeling creativity and skill, while the relatively straightforward modeling question will allow you to begin working immediately.
The Data comprise billions of rows across various tables, which together describe a huge number of actions. Of these, about two-thirds are provided as a training set, and the remaining third is used as a test set (with their labels held out); the train/test split is by time. You are encouraged to submit your model and receive a score for a small part of the test set on a daily basis.
You are tasked with building a model to predict the numeric continuous value of the test set items.
We will provide you with a dedicated development environment where you may access the Data and work on your model. You are to develop your model strictly in Python or R, in the form of three separate modules:
1) features engineering & extraction
2) Features selection
3) Training, learning (optimization) and full evaluation
Models will be evaluated and ranked according to the median of the root mean square error (RMSE) metric scores for each item, applied to the test set to select a winner.
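The evaluation just described (per-item RMSE on the test set, summarized by the median across items) might be computed, for example, as follows (toy data; item names and values are illustrative only):

```python
import numpy as np

def median_rmse(per_item_truth, per_item_pred):
    # Compute RMSE for each item, then take the median across items
    # as the summary score.
    scores = []
    for item_id, truth in per_item_truth.items():
        pred = per_item_pred[item_id]
        err = np.array(truth, float) - np.array(pred, float)
        scores.append(float(np.sqrt(np.mean(err ** 2))))
    return float(np.median(scores))

truth = {"item1": [1.0, 2.0], "item2": [3.0, 5.0]}
pred = {"item1": [1.0, 2.0], "item2": [3.0, 3.0]}
score = median_rmse(truth, pred)
```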
At the Kickoff date, We will supply you with the following: a. Data
• The Data are given in the Raw Data directory, via four types of time series (ts) tables, which provide information about an item that is defined by an item id. Each table is provided as a CSV file. There are two types of features: XA_i and XB_j, which have different meanings.
• The 1st ts type is separated into many ts1_n1,..,ts1_n2 tables, where each table represents data for a specific item id, including also the target feature. This feature contains a numeric continuous value for an item at a specific time.
• The 2nd ts type is separated into many ts2_m1,..,ts2_m2 tables, where each table represents data for a specific item id, including also the activity id feature.
• The 3rd ts type is separated into many ts3_k1,..,ts3_k2 tables, where each table represents data for a specific activity id. • The 4th ts type is a single table, ts4, and is not unique to the combination of item id and time.
• The time features found in the tables are not necessarily physical time, and are not necessarily unique per row, item or activity. b. Example deliverable
• We will provide you with an example deliverable ("Example", Example.py). c. Evaluation Metric
• Will be calculated according to the test set root mean square error (RMSE) metric for each item, where the summary score is the median of all scores ("Evaluation Metric").
• Surpassing a minimal Evaluation Metric is essential in order to become a Winner in the Final Win stage ("Threshold Benchmark"). The Threshold Benchmark is equal to 0.262104. d. Development Environment
We will supply you with the most suitable cloud based work environment, e.g. in Amazon Web Services (AWS), that will include:
o Hardware: appropriate CPU, RAM, disk and network resources required for developing a model on the given Data.
o Operating system: Microsoft Windows.
o Development software for Python and R:
Python 3.6, Python 2.7, PyCharm, atom, and all relevant libraries.
R, RStudio.
o Development supporting software: LibreOffice, etc.
o File directories accommodating the Data and Example we will supply, as well as any additional data and source code created by you. You will receive an AWS personal Workspace, which will allow you to log on and use the Development Environment. The username and password must be kept confidential and must not be shared with anyone without previous written approval from us.
You will develop your Deliverable strictly within this Development Environment. Specifically, the Data must be read and manipulated only within the Development Environment. You may send your already existing personal code to your development environment supplied by us via a personal message you send us, e.g. via an agreed-upon bid management tool such as, say, DeltaBid. Once the code is approved, we will place your personal code in the Private_Code directory. Your Deliverable will be in the form of three separate modules aka blocks, each of which may be run as-is within a fresh Python or R session on the Development Environment as originally supplied (with the exception that any Python libraries are installed as necessary, and with the working directory set to the same directory as the delivered script). These three modules are:
1) Features engineering & extraction— Extract as much information as possible from the Data. The output features of this module should be gathered into one single output table.
2) Features selection— Selection of the most informative features from the previous module’s output. The selected features of this module should be gathered into one single output table.
3) Training, learning (optimization), and full evaluation— Using one single table, a model must undergo training, including the optimization operations. Finally, the module should perform the evaluation with the Evaluation Metric.
• Your Deliverable must apply module (3) above.
Your Deliverable will be developed and delivered using only free open source software packages of Python and R. In addition, please note that the Deliverables include only script files and not new data files that you created. Your Deliverable must be reasonably intelligible to any expert data scientist. It will adhere to standard industry good-practice coding style (including but not limited to: variable naming, organization, comments and general readability). We will provide you with the following feedback: Daily Assessment (optional), Preliminary Acceptance, preliminary feedback of your score status and final win/no win notice:
• Daily Assessment (optional): On a daily basis, We will evaluate the most updated Deliverable you may place in the designated shared folder Daily_Assessment, and feed back to you the score based on the Evaluation Metric per item in a result file in the same folder. Evaluation will be conducted against a part of the test data which was not given to you.
• Preliminary Acceptance: We will provide you with a notice whether your preliminary Deliverable meets the requirements given above (“Accepted Deliverable”).
• Preliminary Ranking: In the event that your Deliverable is an Accepted Deliverable, We will inform you of your Deliverable performance against the test set, in the form of your ranking in decreasing order of Evaluation Metric among additional Accepted Deliverables provided to us by other service providers, if any (“Preliminary Ranking”) and your Evaluation Metric score.
• Final Win/No Win: At the Final Feedback milestone, the Winner will be derived from the Ranking, as the service provider ranked first, e.g., the one whose Accepted Deliverable achieves the highest Evaluation Metric score among all Accepted Deliverables, as provided by other service providers, if any. Also, a Winner's Evaluation Metric score should be higher than the Threshold Benchmark. Evaluation will be conducted against completely new data, which is also different from the data used during the Daily Assessment.
Milestones
• Kickoff: The date on which We provide you with access to the Development Environment.
• Preliminary Delivery: Within 10 days of the Kickoff date, you will place a preliminary version of your Deliverable in the Preliminary directory in your Deliverables directory. • Preliminary Feedback: Within 14 days of the Kickoff date, We will review your Preliminary Deliverables, if any, and provide you with the Preliminary Ranking.
• Final Delivery: Within 21 days of the Kickoff date, you will place the final version of your Deliverable in the Final directory in your Deliverables directory.
• Final feedback: Within 28 days of the Kickoff date, We will review your final Deliverable and provide you with the final Feedback.
One possible“example deliverable” (referred to above) might include all or any portion of the following italicized text:
library(data.table)
library(randomForest)

# For this basic example, we'll only use a few columns from the items table, and our model will be
# a simple logistic regression. The features used are arbitrary and selected for demonstration only.

data.dir = 'S:/SD 102-45 Data from the system shown and described in this patent application'
set.seed(123456789)

model.train = function(train.data) {
  model = glm(target ~ ix2 + ix6 + ix7 + ix9 + ix10, train.data$items, family = binomial)
  return (model)
}

model.predict = function(model, test.data) {
  predict(model, test.data$items, type = 'response')
}

# Load the data
# NOTE: we will ignore the warning this gives here for simplicity. You may provide fread
# with column types. Otherwise use suppressWarnings if you wish.
cat(date(), 'Loading data\n')
train.data = list()
train.data$items = fread(paste0(data.dir, '/train-items.csv'))
train.data$groups = fread(paste0(data.dir, '/train-groups.csv'))
train.data$ts1 = fread(paste0(data.dir, '/train-ts1.csv'))
train.data$ts2 = fread(paste0(data.dir, '/train-ts2.csv'))
train.data$ts3 = fread(paste0(data.dir, '/train-ts3.csv'))
train.data$ts4 = fread(paste0(data.dir, '/train-ts4.csv'))
train.data$things = fread(paste0(data.dir, '/things.csv'))
test.data = list()
test.data$items = fread(paste0(data.dir, '/test-items.csv'))
test.data$groups = fread(paste0(data.dir, '/test-groups.csv'))
test.data$ts1 = fread(paste0(data.dir, '/test-ts1.csv'))
test.data$ts2 = fread(paste0(data.dir, '/test-ts2.csv'))
test.data$ts3 = fread(paste0(data.dir, '/test-ts3.csv'))
test.data$ts4 = fread(paste0(data.dir, '/test-ts4.csv'))
test.data$things = fread(paste0(data.dir, '/things.csv'))

# Fit the model
model = model.train(train.data)

# Generate predictions
predictions = model.predict(model, test.data)

if (F) {
  # And we be like: (well, more or less)
  library(Metrics)
  test.targets = fread('test-targets.csv')
  model.auc = auc(test.targets$target, predictions)
  cat(date(), 'AUC on test set is', model.auc, '\n')
}
A second example work order for a binary classification challenge, aka SD 102-45, may include all or any portion of the following italicized natural language text or all or any subset of the following sentences or paragraphs; the text is merely exemplary and is not intended to be limiting, so for example, for this and the previous work order, any sentence including the word "will" may be modified or omitted: In this binary classification (aka "binary classification of time series variable") challenge, the service provider ("You") will work on a unique dataset that combines relational and multivariate time series features ("Data"). The structure of the Data creates plenty of room for modeling creativity and skill, while the relatively straightforward modeling question will allow you to begin working immediately.
The Data comprise about 12,000,000 rows of various tables, which together describe approximately 50,000 items. Of these, about two thirds are provided as a training set, and the remaining third is used as a test set (with their labels held out); the train/test split is random.
You are tasked with building a model to predict the class of the test set items.
We will provide you with a dedicated development environment where you may access the Data and work on your model. You are to develop your model strictly in R, and deliver it in the form of a single script.
Models will be evaluated and ranked according to the area under curve (AUC) metric, applied to the test set to select a winner.
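The area under curve (AUC) metric referred to above may, for example, be computed as the fraction of positive/negative pairs that the model ranks correctly, with ties counted as half (a minimal sketch with toy labels and scores; not the challenge's actual scoring code):

```python
def auc(labels, scores):
    # Probability that a randomly chosen positive item is scored above a
    # randomly chosen negative one (ties count half).
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.2]
value = auc(labels, scores)  # 3 of the 4 positive/negative pairs are ordered correctly
```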
The Data are given via the main items table, a grouping table, 4 time series (ts) tables, and a things table. Each table is given as a CSV file. Values of categorical variables are coded as “cXXX” where XXX is a category number. Note that the assignment of category numbers was random.
• Each row in the items table describes one item, which is uniquely defined by an item id. The target column provided with the training data is the binary class of each item. In the test set this column is missing, and your task is to predict those values.
• The grouping table contains zero or more rows per item. Each row associates an item id with a group id. This induces a set of overlapping groups of items.
• The time series tables contain zero or more entries for each item id, together with a time column that indicates the location of the entry on a certain time axis for that type of series (not necessarily physical time, and not necessarily unique per row per item).
• Note that the third time series table, ts3, is special in that it is actually a combined table containing multiple similar time series per item. Each time series is identified by a series.id, and each (item id, series id) has its own time counter, where missing (item id, series id) entries mean that the value for that entry is zero. In addition, the columns series. fine, and series.coarse are (hierarchical) groupings of related series ids.
• The supporting things table describes a secondary object whose ids (thing.id column) are referred to in both the items and ts4 tables. Some rows are all missing values; these are only included for completeness.
• We will provide you with an example deliverable ("Example", example.r).
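The ts3 convention noted above, whereby missing (item id, series id) entries mean the value is zero, might, by way of illustration, be made explicit by pivoting and zero-filling (the column names and toy values here are assumptions, not the challenge's actual schema):

```python
import pandas as pd

# Toy combined time series table: multiple series per item, each with its
# own time counter; absent (item.id, series.id, time) combinations mean 0.
ts3 = pd.DataFrame({
    "item.id": [1, 1, 2],
    "series.id": ["s1", "s2", "s1"],
    "time": [0, 0, 1],
    "value": [5.0, 7.0, 3.0],
})

# Pivot to one row per (item.id, series.id); missing combinations become 0.
wide = ts3.pivot_table(index=["item.id", "series.id"], columns="time",
                       values="value", fill_value=0.0)
```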
• Evaluation Metric: Will be calculated according to the test set area under curve (AUC) metric ("Evaluation Metric").
• We will supply you with a Development Environment that will include:
o Hardware: appropriate CPU, RAM, disk and network resources required for developing a model on the given Data.
o Operating system: Microsoft Windows.
o Development software: R, RStudio and access to all CRAN.
o Development supporting software: Microsoft Office etc.
o File directories accommodating the Data and Example supplied by us, as well as any additional data and source code created by you.
• You will receive a personal account, protected by a two-factor authentication process (using SMS sent to your mobile each login), which will allow you to log on and use the Development Environment. The username and password must be kept confidential and must not be shared with anyone without previous written approval from us.
• You will develop your Deliverable strictly within this Development Environment.
Specifically, the Data must be read and manipulated only within the Development Environment. You may send your already existing personal code to your development environment provided by us, via a designated process We will supply. Deliverables
• Your Deliverable will be a single R script that may be run as-is within a fresh R session on the Development Environment as originally supplied (with the exception that any CRAN dependencies are installed as necessary, and with the working directory set to the same directory as the delivered script).
• When run, your Deliverable should define two top-level functions:
1. model.train = function(train.data): takes a list containing the train set tables as data.table objects resulting from a "plain vanilla" call to fread() on the train set CSV files provided, and returns a trained model object (in any reasonable way you'd like).
2. model.predict = function(model, test.data): with arguments similar to the above, that produces a numeric vector with predictions in [0, 1] in which each entry corresponds to its respective row in test.data$items.
• Your Deliverable must apply functions (1) and (2) above, and produce a (top-level environment) numeric vector named predictions that contains your model's predictions. We will measure the AUC of the resultant predictions with respect to the true test set labels. See the supplied Example.
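The two-function contract above is specified for R; by way of illustration only, an analogous sketch in Python (the function names mirror the work order, while the toy "model" and data are assumptions):

```python
def model_train(train_data):
    # "Train" a trivial stand-in model: the positive rate of the target
    # column, in place of any reasonable trained model object.
    targets = train_data["items"]["target"]
    return sum(targets) / float(len(targets))

def model_predict(model, test_data):
    # Produce one prediction in [0, 1] per row of the test items table.
    return [model] * len(test_data["items"]["target"])

train_data = {"items": {"target": [1, 0, 1, 1]}}
test_data = {"items": {"target": [None, None]}}
model = model_train(train_data)
predictions = model_predict(model, test_data)
```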
• Your Deliverable will be developed and delivered using only the free open source software package(s) included in the R Project for Statistical Computing (www.r-project.org).
• Your Deliverable must be reasonably intelligible to any expert data scientist. It will adhere to standard industry good practice coding style (including but not limited to: variable naming, organization, comments and general readability). Deliverable feedbacks
• We will provide you with the following feedback: Preliminary acceptance of the deliverables, Preliminary feedback of your ranking and final win/no win notice.
• Preliminary Acceptance: We will provide you with a notice whether your preliminary Deliverable meets the requirements given above (“Accepted Deliverable "). If your deliverable is not accepted, We may provide you a chance to rectify any issues that lead to said rejection.
• Preliminary Ranking: In the event that your Deliverable is an Accepted Deliverable, We will inform you of your test set performance, in the form of your ranking in decreasing order of Evaluation Metric among additional Accepted Deliverables provided to us by other service providers, if any (“Preliminary Ranking”).
• Final Win/No Win: At the Final Feedback milestone, the Winner will be derived from the Ranking, as the service provider ranked first, e.g., the one whose Accepted Deliverable achieves the highest Evaluation Metric among all Accepted Deliverables, as provided by other service providers, if any.
Milestones
• Kickoff: The date on which we provide you with access to the Development Environment.
• Preliminary Delivery: Within 14 days of the Kickoff date, you will place a preliminary version of your Deliverable in the preliminary directory in your Development Environment.
• Preliminary Feedback: Within 21 days of the Kickoff date, We will review your preliminary Deliverable and provide you with preliminary feedback of its Preliminary Acceptance and Preliminary Ranking.
• Final Delivery: Within 28 days of the Kickoff date, you will place the final version of your Deliverable in the final directory in your Development Environment.
• Final feedback: Within 35 days of the Kickoff date, We will review your final Deliverable and provide you with the final Feedback.
Work orders often make reference to an“example deliverable”, here (in italics) is one instance of an“example deliverable”:
# Note: This script is adapted to Python 2.7
import pandas as pd
import numpy as np
from os import listdir
from os import path, walk
import csv
import pyflux as pf
from sklearn.metrics import mean_squared_error as rmse
import sys
from multiprocessing import Pool

def run_data_engineering():
    """
    Extraction as much information as possible from the Data.
    The output features of this module should be gathered into one single output table.
    """
    return ''

def run_features_selection():
    """
    Selection of the most informative features from the previous module's output.
    The selected features of this module should be gathered into one single output table.
    """
    return ''

def run_arimax(parameters, data_train, data_test, n, test_prop):
    model = pf.ARIMAX(data=data_train, ar=int(parameters[0]), ma=int(parameters[1]),
                      integ=int(parameters[2]), formula='target~1+XA_25',
                      family=pf.Normal())
    model.fit('MLE')
    if len(test_prop):
        return model.predict(h=int(np.ceil(n * test_prop)), oos_data=data_test)
    else:
        return model.predict(h=n, oos_data=data_test)

def run_optimize_arimax(item_id, data_all, test_prop):
    data_train = data_all.iloc[:int(len(data_all) * (1 - test_prop))]
    data_train.set_index('time_1', inplace=True)
    data_test = data_all.iloc[int(len(data_all) * (1 - test_prop)):]
    data_test.set_index('time_1', inplace=True)
    y_test = data_test['target'].to_frame()
    n = len(data_all)
    parameters = [1, 1, 1]
    optimize_arr = xrange(1, 15)
    integ_arr = xrange(1, 7)
    for iter_idx in xrange(2):
        results = np.zeros([len(optimize_arr), 1])
        for i in optimize_arr:
            parameters[0] = i
            predicted = run_arimax(parameters, data_train, data_test, n, test_prop)
            results[i - 1] = rmse(y_test, predicted)
        parameters[0] = optimize_arr[np.argmin(results)]
        results = np.zeros([len(optimize_arr), 1])
        for i in optimize_arr:
            parameters[1] = i
            predicted = run_arimax(parameters, data_train, data_test, n, test_prop)
            results[i - 1] = rmse(y_test, predicted)
        parameters[1] = optimize_arr[np.argmin(results)]
        results = np.zeros([len(integ_arr), 1])
        for i in integ_arr:
            parameters[2] = i
            predicted = run_arimax(parameters, data_train, data_test, n, test_prop)
            results[i - 1] = rmse(y_test, predicted)
        parameters[2] = integ_arr[np.argmin(results)]
    with open('results/paramteres_arimax_%d.csv' % item_id, 'wb') as myfile:
        wr = csv.writer(myfile)
        wr.writerow(parameters)

def run_test_arimax(item_id, data_train, data_test):
    data_test.set_index('time_1', inplace=True)
    data_train.set_index('time_1', inplace=True)
    y_test = data_test['target'].to_frame()
    n = len(data_test)
    parameters = []
    with open('results/paramteres_arimax_%d.csv' % item_id, 'r') as f:
        rr = csv.reader(f)
        parameters = list(rr)[0]
    predicted = run_arimax(parameters, data_train, data_test, n, [])
    return rmse(y_test, predicted)

def run_train(idx, test_prop):
    files = listdir("Raw_Data/TS_1")
    file_tmp = files[idx]
    str_name = file_tmp[:-4].split('_')
    item_id = int(str_name[-1])
    df_tmp = pd.read_csv('Raw_Data/TS_1/ts_1_%d.csv' % item_id)
    print('Optimize arimax item %d' % item_id)
    run_optimize_arimax(item_id, df_tmp, test_prop)

def run_predict():
    # Note: You cannot run this function because the test files are not available to you
    with open('predictions2.csv', 'wb') as myfile:
        wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
        wr.writerow(['Item', 'ARIMAX', 'Number of samples'])
        for file_tmp in listdir("Test/TS_1"):
            str_name = file_tmp[:-4].split('_')
            item_id = int(str_name[-1])
            df_test = pd.read_csv('Test/TS_1/ts_1_%d.csv' % item_id)
            df_train = pd.read_csv('Raw_Data/TS_1/ts_1_%d.csv' % item_id)
            mea_list = [item_id]
            print('Test arimax item %d' % item_id)
            mea_list.append(run_test_arimax(item_id, df_train, df_test))
            mea_list.append(len(df_test))
            wr.writerow(mea_list)

def run_process(idx):
    test_prop = 0.2
    run_data_engineering()
    run_features_selection()
    run_train(idx, test_prop)

if __name__ == '__main__':
    path, dirs, files = walk("Raw_Data/TS_1").next()
    n = len(files)
    pool = Pool(processes=3)
    pool.map(run_process, xrange(n))
    pool.close()
    # run_predict()
Data Scientists may be provided with a virtual workstation e.g. as described in the following (italicized) Reference Guide (which may or may not be specific to an individual challenge) in natural language; all or any subset of the following may be provided:
We have prepared a designated AWS based station for you equipped with everything required to successfully perform your data science.
Data scientists may for example communicate with us using the messaging functions on our online bidding platform.
Login and usage instructions for the data scientists' server may include some or all of the following:
Logging In: We will send you a Registration code, and Username in a private message. Launch the client and enter the Registration code provided by us. You will receive your Password by SMS on your mobile. Log in using the Username and Password provided by us.
In your user on the system Server you may find the folders specified in the table of fig. 14.
We have prepared for you the following open source software: Python IDEs: PyCharm, Atom; R IDEs: The built-in GUI, R Studio, R; Utilities: LibreOffice, Notepad++ for other text file editing needs.
The DS Workspaces are isolated and secured e.g. Files and/or text from/to the server cannot be copied and suitable administrative functions are disabled.
Your solution may only rely on free-for-commercial-use open-source software and your own code.

3rd use case - Matchmaking the best analytical answer e.g. as shown in Fig. 2: The 3rd use case may be activated by the product business user (e.g. as shown in fig. a) by selecting a Business Solution from a Business Solution repository in memory, e.g. in the embodiment of fig. 10, via the web interface. Then, the product business user may select the right Business Solution EP for the task and load his data based on the Business Data Set's instructions (with all the mandatory attributes and any optional features).
- The automatic run that selects the best Machine Learning Solution commences.
Phase I - Data pre-processing e.g. operation 410 in Fig. 2 and Module 1 - Data Validation and Preparation are now described in detail.
Input: BDS, UDS, ADS, Validation Rules
Process: The Data validation and preparation module processes all the Data Sets (DSs), typically including user data (Business Data Sets) and/or the additional Data Sets (ADS) based on the definitions found in the Business Data Sets. In the Business Data Set, every aspect of the file’s structure is defined, e.g. which fields, type etc. in order to:
Validate the Data Set's compatibility with the Business Data Set - check that all fields exist and the Data Set's content is clean (e.g. includes any required data as defined in the Business Data Set (figure 7)), correct (e.g. its structure should conform to the definitions in the Business Data Set) and statistically meaningful (e.g. explains the other parts of the data, e.g. in a non-monotonous manner). The Data validation and preparation module uses validation rules to check for correctness, meaningfulness, and consistency of the data that is input to the system. The validation rules may include all or any subset of:
o Technical validation of one or both of:
Each field's data type and format vs. the definitions in the BDS
Consistency checks: key matching according to conventional Referential Integrity defined in the Business Data Set
o Statistical validation:
Presentation of distributions and correlations
Content Validation based on exploratory data analysis: mandatory & optional fields, sampling, homogeneity train/test, leakage, missing values, anomalies and more

Output: Validated UDS, error and warning log
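As an illustration of the technical and content validation rules above, a minimal Python sketch follows. The schema format, field names and the specific checks are assumptions for this example only, not the system's actual validation-rule implementation.

```python
import pandas as pd

def validate_against_bds(df, bds_schema):
    """Validate a Data Set against (hypothetical) Business Data Set definitions.

    bds_schema maps field name -> {'dtype': str, 'required': bool}.
    Returns a list of error strings; an empty list means the Data Set validated.
    """
    errors = []
    for field, spec in bds_schema.items():
        if field not in df.columns:
            if spec.get('required', False):
                errors.append('missing mandatory field: %s' % field)
            continue
        # Technical validation: data type vs. the definition in the BDS
        if str(df[field].dtype) != spec['dtype']:
            errors.append('field %s has dtype %s, expected %s'
                          % (field, df[field].dtype, spec['dtype']))
        # Content validation: mandatory fields must have no missing values
        if spec.get('required', False) and df[field].isna().any():
            errors.append('missing values in mandatory field: %s' % field)
    return errors

bds_schema = {
    'customer_id': {'dtype': 'int64', 'required': True},
    'target': {'dtype': 'float64', 'required': True},
}
df = pd.DataFrame({'customer_id': [1, 2, 3], 'target': [0.5, 1.2, 3.4]})
print(validate_against_bds(df, bds_schema))  # -> []
```

A real Module 1 would additionally run the referential-integrity (key-matching) checks across Data Sets and the statistical checks (distributions, correlations) listed above.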
Operation 510 aka Module 2 - Data Features Additions
Input: BDS, UDS, ADS
Process: The features data additions module may elaborate the User Data Set (UDS) and Analytical Data Set (ADS) with additional features (presented as additional features to the User Data Set or Analytical Data Set, or as additional files with the additional features) that are statistical manipulations of their already-existing data. The goal is to make it easier for any data science algorithm that later processes these Data Sets to reach a strong statistical result.
The module may generate new features and/or new files and features (ADDS) by executing Additions Rules that may be predefined for each of the Data Sets features in the Business Solution and Analytical Solution. For example, the features data additions module might create a new feature based on the monthly average sales of each product.
Output: ADDS (additional features in User Data Set or Analytical Data Set and/or additional files)
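For instance, the monthly-average-sales Additions Rule mentioned above could be sketched in pandas as follows; the table layout and column names are purely illustrative assumptions.

```python
import pandas as pd

# Hypothetical User Data Set fragment: sales records per product and month
sales = pd.DataFrame({
    'product_id': [1, 1, 1, 2, 2, 2],
    'month': ['2018-01', '2018-02', '2018-01', '2018-01', '2018-02', '2018-02'],
    'sales': [10.0, 20.0, 30.0, 5.0, 15.0, 25.0],
})

# Additions Rule: new feature = average monthly sales of each product
monthly = sales.groupby(['product_id', 'month'], as_index=False)['sales'].sum()
avg_monthly = (monthly.groupby('product_id')['sales'].mean()
                      .rename('avg_monthly_sales').reset_index())

# Present the new feature as an additional column on the Data Set (ADDS)
enriched = sales.merge(avg_monthly, on='product_id')
print(enriched[['product_id', 'avg_monthly_sales']].drop_duplicates())
```

Product 1 has monthly totals 40 and 20, so its average monthly sales is 30.0; product 2 has totals 5 and 40, so 22.5.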
Operation 520 aka Module 1A - run operation 510 aka Module 1 (Data validation and Preparation) again on the complete Data Sets is now described in detail.
Input: ADDS, Validation Rules
Output: Validated UDS, error and warning log
Operation 530 aka Module 3 - Data holistic validation is now described in detail.
Input: UDS, ADS, ADDS
Process: The purpose of the Data holistic validation module is to check all the Data Sets as a whole. For example, the Data holistic validation module might check whether there is information for each of the dates found in the user's User Data Set.
The holistic validation may run the same validation rules as Module 1 (Data validation and Preparation), excluding the ones that are marked as "not required for module 3". As a result, if no errors are produced, a Final Data Set (FDS) is generated.
Output: Final Data Set, error and warning log

Operation 540 aka Module 4 - Data Reduction is now described in detail.
Input: FDS, Reduction Rules
Process: The Data Reduction module may attempt to reduce the number of files in the Final Data Set for the ease of the External Data Scientist that may develop the data science solution. The Data Reduction module may process Reduction rules that may make sure each file merged will not result in loss of statistical meaning. For example, if two of the files include data at the customer and day level - they may be merged safely (statistically speaking).
Output: FDS
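The Reduction Rule example above (two files both at the customer-and-day grain may be merged without loss of statistical meaning) can be sketched as follows; the tables and column names are hypothetical.

```python
import pandas as pd

# Two hypothetical files from the Final Data Set, both keyed by (customer, day)
visits = pd.DataFrame({'customer_id': [1, 1, 2], 'day': ['d1', 'd2', 'd1'],
                       'visits': [3, 1, 2]})
orders = pd.DataFrame({'customer_id': [1, 1, 2], 'day': ['d1', 'd2', 'd1'],
                       'orders': [1, 0, 2]})

# Same grain on both sides, so joining on the composite key keeps one row per
# original record: the file count drops from 2 to 1 with no aggregation and
# hence no loss of statistical meaning.
merged = visits.merge(orders, on=['customer_id', 'day'], how='outer')
print(len(merged))  # -> 3, exactly as in each input file
```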
Operation 550 aka Module 5 - Data set obfuscation is now described in detail.
This operation makes sensitive information safe for wider visibility and analysis by the External Data Scientist, typically while maintaining certain characteristics deemed "main" characteristics of the data, and data structure deemed central for ML analysis and prediction. This may address regulatory requirements as well. There are three potential anonymization levels (low, medium, high). A default anonymization level (e.g. "medium", if not otherwise defined by the system's end user) may be defined.
Input: FDS, Obfuscation functions (per feature type), Feature’s types in the Business Data Set
Process: The obfuscation module processes the Final Data Set (FDS) and converts it into a Final Obfuscated Data Set (FODS), operating at 2 x 2 levels: the file and the feature level, crossed with the name and content level:
• Names: All the file and features names are replaced by the matching names in the Analytical Data Set (whereas the Final Data Set had the original business names as in the Business Data Set)
• Content: All the content of the features is replaced whereas:
o For each feature type, an obfuscation function is performed. These are intended as clear explanations for the "regular programmer", who is the theoretical audience:
Number - linear transformation
Category - replace the categories with renumber categories
Date - change into number, and rescale the zero point
Geo location - turn into polygon, then turn into number, and then obfuscate as number
Text - Turn into categories and then obfuscate as categories
o If the feature is a key (aka unique identifier) (e.g. customer ID or product SKU), the feature is replaced in a synchronized way, e.g. as described above, across all the files, to make sure the key relationships between the files are maintained (e.g. customer ID)
Output: FODS
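The per-feature-type obfuscation functions listed above (number, category, date) might look like the following sketch; the exact transformations and the secret coefficients are assumptions, not the system's actual functions.

```python
import datetime
import random

def obfuscate_number(values, a=None, b=None):
    # Number: linear transformation x -> a*x + b with secret coefficients
    a = a if a is not None else random.uniform(0.5, 2.0)
    b = b if b is not None else random.uniform(-10.0, 10.0)
    return [a * v + b for v in values]

def obfuscate_category(values, mapping=None):
    # Category: replace the categories with renumbered categories; passing the
    # same mapping for every file keeps key relationships (e.g. customer ID)
    # synchronized across files, as required for key features
    mapping = {} if mapping is None else mapping
    out = []
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping)
        out.append(mapping[v])
    return out, mapping

def obfuscate_date(values, epoch=datetime.date(2018, 1, 1)):
    # Date: change into a number and rescale the zero point to a secret epoch
    return [(v - epoch).days for v in values]

cats, key_map = obfuscate_category(['SKU-A', 'SKU-B', 'SKU-A'])
print(cats)  # -> [0, 1, 0]; repeated keys map to the same obfuscated value
```

Geo-location and text features would, per the list above, first be reduced to polygons/numbers or categories and would then reuse the number/category functions.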
Operation 560 aka Module 6 - Data set splitting is now described in detail.
Input: FODS, BDS
Process: Cuts or partitions the Data Sets into several files across their records (e.g., say, of 1M records: 100, 100, 200, 200, 400 records in each file). The process is configured for organizing and splitting the data into subsets (by time period, prediction round (e.g. train vs. test), and access authorization (e.g. eds-available vs. eds-unavailable)), according to predefined rules, with a view to allowing for further data science modeling and prediction.
The data shall be split based on all or any subset of the following three parameters: prediction period, train vs. test set, and availability vs. unavailability to External Data Scientists:
• First the data shall be split by prediction period, e.g. in 5 prediction periods (P1, P2, P3, P4, P5).
The External Data Scientists only get P1 and P2 (see below)
The length of the prediction period is uniform over time. Prediction period length is typically a parameter taken as-is from the "statistical solution". Separating the data per time period is useful in avoiding leakage, thereby protecting the model from eventual biases due to statistical issues (e.g. data leakage, https://en.wikipedia.org/wiki/Information_leakage); in addition, it allows for training and testing the model on different data subsets, per time period, and therefore increases the data's robustness for re-run and eventual reuse.
The prediction period may or may not be the same as the "time period".
Time periods may be in days, weeks, months, quarters, years.
• The data is then typically randomly split into train and test sets, per time and expected prediction period. Each time period P1, P2, P3 shall be randomly divided between test and train sets, in a ¼ and ¾ proportion. P4 is always a test set because targets for this set are not available yet. This is a real-time prediction period.
• The data is then typically split between data made available or not to the External Data Scientist
P1 and P2, with their train and test sets, may be made fully available (e.g. with target values) to the External Data Scientist.
P3 may be split into train and test, but the values of the target may be kept from the External Data Scientist. The train set and test set may both be delivered to the External Data Scientist with no target.
P4 is only a test set, so P4 typically may not be split between train and test, because the true values of the target are not available at the time of prediction. This set is not made available to External Data Scientists at all. This method of splitting is advantageous because, before delivering any out-of-time prediction to the system end-user, the objective is to train and test twice on the original data (during round 1). The process performed by module 6 aka operation 560 may be viewed as a model-building pipeline, based on two prediction rounds.
To better incentivize External Data Scientists to produce robust models, they may train their models only during round 1 (the first training & prediction cycle), therefore their code should be written in such a way that when training on future data (during round 2) from the same source, their model may still perform well. This may increase their willingness to build and plan their code accordingly.
During Round 1 (first training and prediction cycle, based on 3 time periods P1, P2, P3):
• P1 and P2 - the first two periods may be given to the External Data Scientist for exploration and training, with the observed value of the target
• P3 - the third period may be given to the External Data Scientist but without the target.
Feedback on the performance of the External Data Scientist solution (e.g. target values) may be given only for part of this period.
• Round 1 ranking: At the end of round 1, the n best solutions may "win" a relatively small ranking, compared to rankings "won" by best solutions in round 2.
During Round 2 (second training and prediction cycle, based on rolled historical data to predict on next period where target is unknown):
• P2 and P3 - the second and third periods may be used for training, using the solutions created by the External Data Scientist during round 1 (without changing them)
• P4 - the models built using the Machine Learning Solutions created with P1 and P2, and the data from P3, may be used to give predictions for P4.
• P4 true results - at the end of round 2, the customer may provide the true results for the final scoring, and the benchmark may then be computed.
• Round 2 ranking: Again, the n best solutions may "win" a ranking, but this time a higher one. This way it is ensured that the solutions provided by the External Data Scientist are planned to be robust and general to the best possible extent.
The input and output for this phase may be defined by two processes.
• Process 1: Split based on time period and prediction round: input data includes several anonymized data tables that refer to entities, during periods P1, P2. The output data is organized in 2 folders: one folder per round, R1, R2, according to the description above.
• Process 2 - Split based on train and test: input data comes from 2 folders and is split by period. R1 includes periods P1, P2, P3 and R2 includes periods P2, P3, and P4 (rolled data).
The output data is organized in subfolders, R1 and R2, as follows:
• In R1:
• Train and Test sets P1 & P2 with true targets available to External Data Scientists.
• Test set without targets on which the External Data Scientists apply their trained model.
• Test set with targets on which the first benchmark is computed and the award awarded.
• Each data record from the train data (P1 & P2) is associated with a target, while entities from the test data (P3) are not, and the External Data Scientist is expected to estimate the values of the test set. The values of the test set may be stored in a 4th file called (P3, True Targets). Each of them is of the same format as the original Data Sets (e.g. the splitting does not change the structure; it only splits down the records, vertically and not horizontally, selecting part of the records every time, leaving all the fields)
• In R2:
• Train and test sets P2 & P3 with true targets available only to IDS (internal data scientists)
• Train test P4 available to Internal Data Scientists and then to customers after prediction.
• Each data record may be part of either the train data or the test data. The split is random, in a ¼ and ¾ proportion.
Output: FODS split to P1, P2, P3, P4 and P5
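A minimal sketch of the splitting scheme of Module 6: partition records by prediction period, then randomly split each period into train and test. The proportions, period labels and seed are assumptions for illustration; the document's ½-and-¾ wording is read here as a roughly one-quarter test share.

```python
import random

def split_records(records, n_periods=5, test_share=0.25, seed=7):
    # First split by prediction period (P1..P5), then randomly split each
    # period's records into train and test sets
    rng = random.Random(seed)
    periods = {p: [] for p in range(1, n_periods + 1)}
    for rec in records:
        periods[rec['period']].append(rec)
    splits = {}
    for p, recs in periods.items():
        test_idx = set(rng.sample(range(len(recs)), int(len(recs) * test_share)))
        splits[p] = {
            'train': [r for i, r in enumerate(recs) if i not in test_idx],
            'test': [r for i, r in enumerate(recs) if i in test_idx],
        }
    return splits

records = [{'period': (i % 5) + 1, 'value': i} for i in range(100)]
splits = split_records(records)
print(len(splits[1]['train']), len(splits[1]['test']))  # -> 15 5
```

Note that splitting "across records" keeps all fields of each record intact, matching the statement above that splitting selects records and does not change the structure.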
Operation 570 aka Module 7a - Benchmark calculation for baseline model is now described in detail.
Input: FODS split to P1, P2, P3, P4 and P5
Process: At the formulation process of 101-300 an Internal Data Scientist team already produced a baseline model that best describes the intuitive heuristic currently used by customers to address the business question. Module 5 typically comprises computing the benchmark for this baseline model (single figure or percentage) when used on the test set.
This benchmark may be later compared with benchmarks computed for new Machine Learning models produced by the automatic platform.
Output: BM-P1,2,3,4,5

Operation 420 aka Phase II - MLS Matchmaking - selecting a suitable e.g. best Machine Learning Solution is now described in detail.
During operation 410 aka phase I of the matchmaking run, the data typically already undergoes pre-processing, typically including all or any subset of: validation (e.g. operations 510 and/or 530), enrichment (additions, e.g. as per operation 520), obfuscation e.g. as per operation 550, reduction e.g. as per operation 540, splitting e.g. as per operation 560 and benchmark and baseline model calculation e.g. as per operation 570 - ready for the automated platform to choose the best Machine Learning Solution. The pre-processed data, e.g. as generated by operation 410, presents information about entities, already split between train and test sets, and between the in-focus time periods. As already described, the data consists of several Data Sets (each may include several Data Files) that may contribute to solving the business problem, and whose structure is, typically, predefined in the data structures in the Business Solution and AS - all combined into the FODS Pn.
Several potentially competing Machine Learning Solutions (codes written in designated statistical coding language as Python or R) have been previously built by External Data Scientists, tested and validated e. g. by the system of Fig. 10, and uploaded to the Machine Learning Solutions repository.
These competing Machine Learning Solutions all address a specific Analytical Solution (including data structure, and parameters) and a Business Solution. Since each Business Solution is correlated to a specific Analytical Solution, each Machine Learning Solution may address several Business Solutions.
At the creation of each Business Solution, Internal Data Scientists (human experts, as opposed to External human experts who accept challenges as described herein) select an Analytical Solution for that Business Solution, and set up its relevant parameters, to properly address the Business Question in the Business Solution. Each Analytical Solution has been previously associated with, e.g. matched as relevant to, several Machine Learning Solutions (or at least one) in the course of matchmaking operation 400 of Fig. 1. To be able to provide a statistically strong answer, adjusted to any new data from a user (based, of course, on the definitions in the Business Data Set and Analytical Data Set), the pipeline may be populated with as many potential Machine Learning Solutions as possible that have the potential to properly address a given Business Solution and the Business Solution's associated Analytical Solution. The parts or building-blocks of a Machine Learning Solution typically include all or any subset of: Machine Learning Solution - Feature Engineering and Extraction, Machine Learning Solution - Features Selection, Machine Learning Solution - Training/Learning/Full Evaluation and Machine Learning Solution - Final Prediction.
Operation 670 aka Module 7b - Populate Machine Learning Solution Pipeline
Input: Business Solution, Analytical Solution, FODS split to P1, P2, P3, P4 and P5
Process:
Populating the Machine Learning Solution pipeline with "building blocks" that pertain to each preselected Machine Learning Solution, and re-arranging these blocks into dedicated libraries according to their function (Features Engineering and Extraction, Features Selection, and Training Learning and Full Evaluation libraries). The separation of the blocks as described herein supports the functionality of producing possible combinations of blocks for new runs, based on a suitable strategy, such as but not limited to strategies A, B described below. The process is typically operative to synthetically generate many more solutions than the number of solutions provided by the External Data Scientists (which is one per External Data Scientist): the combinations are permutations of the External Data Scientists' Machine Learning Solution blocks, resulting in many more "Synthetic" Machine Learning Solutions. The four-part Machine Learning Solutions provided by such scientists, each solution comprising 4 blocks, allow a specific Machine Learning Solution to be compiled automatically from relevant Machine Learning Solution blocks existing in the Machine Learning Solution repository, e.g. as a result of previous competitions aka challenges, in each of which at least one Data Scientist provided at least one Machine Learning Solution including at least one of, and typically all of, the following blocks: Machine Learning Solution - Feature Engineering and Extraction, Machine Learning Solution - Features Selection, Machine Learning Solution - Training/Learning/Full Evaluation and Machine Learning Solution - Final Prediction. Once a repository of these blocks is available to the system, Machine Learning Solution blocks in the repository may then be suitably mixed and matched to generate plural Machine Learning Solution pipelines, to address plural respective Business Solutions and each Business Solution's respective Analytical Solution. The pipeline includes all combinations of the first three parts. In each cycle of the matchmaking, each combination is evaluated. A "winning" combination is selected. Then, only with the "winning" combination, the full evaluation is executed. Following the run strategies, a prediction for each entity in the test set, after the model has learnt from the data in the train set, is produced. This "populate pipeline" process may generate several Machine Learning Solution Block Combinations (MLS-BC) that typically automatically compete with one another.
Each successful completion (e.g. a run completed without errors) of a Machine Learning Solution pipeline may typically run a version of the code provided for the Machine Learning Solutions by the scientist who authored the Machine Learning Solution (the code subsequently being stored in the Machine Learning Solution repository), and the data generated by the code of that part (e.g. a given block from among the 4 blocks generated by that scientist) may be stored for the use of the next Machine Learning Solution part. If a script fails at any stage (say after the 1st or 2nd or 3rd or 4th block of code authored by this scientist has been run), a failure notification shall be logged. Typically the user is notified about successful completion after the "Benchmarking and Machine Learning Solution Selection" (module 12 - Operation 720).
Table 11 is an example data structure for storing Machine Learning Solutions in a repository.
Each instance of a Machine Learning Solution stored in the Machine Learning Solution repository includes: 4 bodies of code, for the Machine Learning Solution parts respectively, plus a statistical benchmark result aka benchmark. The benchmark is calculated using the baseline model as per Operation 570 aka Module 7a of phase 1 - Benchmark calculation for baseline model
2 example strategies to populate an MLS pipeline are now described:
Strategy A:
Operation A1: The Machine Learning Solution pipeline runs operation 680 aka the Features Engineering and Extraction Module (Module 8), for each of the models (aka Machine Learning Solutions) MLS1, MLS2, ... MLSn, and stores one output per Machine Learning Solution model: a data set with one row per entity from the train set and that set's newly extracted original features.

Operation A2: The Machine Learning Solution pipeline outputs one joint table that gathers all features extracted by all the selected Machine Learning Solution models, per entity, for each entity in the initial data set.
Operation A3: The Machine Learning Solution pipeline runs operation 690 aka the Features Selection Module, while running each building block that populates the Block 2 (say, or more generally, the Block M) folder, on the joint table created in operation A2. The Machine Learning Solution pipeline then stores the output separately in a dedicated folder for each model in (say) Block 2: B21, B22, ... B2n, where n is the number of selected Machine Learning Solutions.

Operation A4: The Machine Learning Solution pipeline runs the next module for each model and stores results per model in a dedicated folder. There is no additional stacking in this strategy, because the stacking process happened during operation A1, which runs Feature Engineering and Extraction operation 680.
Fig. 12 is an example of the operation of strategy A.

Strategy B:
Operation Bl. The Machine Learning Solution pipeline runs each model separately on the input data, per module e.g. for all the modules of figs. 3, 4 herein. Typically each building block is run on each input data set and outputs a data set for the next building block of the same model.
Operation B2. After operations 410, 420 of Fig. 2 have been completed, Machine Learning stacking is performed, in which combinations of various machine codes are generated in accordance with the specific combination stipulated by each model. This is the way the pipeline is populated, e.g. in Operation 670 aka Module 7b of phase II - Populate Machine Learning Solution Pipeline.
Operation B3. Finally, the Machine Learning Solution pipeline runs the next module for each model and stores results per model in a dedicated folder.

Output: MLS-BC 1..n
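The mix-and-match population of the pipeline described in both strategies can be sketched with itertools: each scientist contributes a 4-block solution, the first three block types are gathered into dedicated libraries, and all combinations become candidate MLS Block Combinations. Block contents here are placeholder strings for illustration only.

```python
from itertools import product

# Two hypothetical 4-block Machine Learning Solutions from two scientists
solutions = {
    'MLS1': {'FEE': 'fee1', 'FS': 'fs1', 'TLF': 'tlf1', 'FP': 'fp1'},
    'MLS2': {'FEE': 'fee2', 'FS': 'fs2', 'TLF': 'tlf2', 'FP': 'fp2'},
}

# Dedicated libraries, one per block function
fee_lib = [s['FEE'] for s in solutions.values()]
fs_lib = [s['FS'] for s in solutions.values()]
tlf_lib = [s['TLF'] for s in solutions.values()]

# All combinations of the first three parts -> "Synthetic" MLS-BCs
mls_bc = list(product(fee_lib, fs_lib, tlf_lib))
print(len(mls_bc))  # -> 8 candidate combinations from only 2 original solutions
```

This illustrates why the synthetic population grows quickly: n scientists' solutions yield n³ combinations of the first three block types.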
Module 8 - Feature engineering & extraction is now described in detail.
Input: BS, AS, FODS P1, P2, P3, P4 and P5
Process: Run the MLS-BC 1..n MLS-FEE 1..n
Output: additional features added to FODS P1, P2, P3, P4 and P5
Module 9 - Feature selection is now described in detail.
Input: additional features added to FODS P1, P2, P3, P4 and P5
Process: Run the MLS-BC 1..n MLS-FS 1..n
Output: Selected features from FODS P1, P2, P3, P4 and P5 and from the added features
Module 10 - Training/learning/full evaluation is now described in detail.
Input: selected features from FODS P1, P2, P3, P4 and P5 and from the added features
Process: Run the MLS-BC 1..n MLS-TLF 1..n
Output: MLS-BC 1..n PS
Module 11 - Final prediction is now described in detail.
Input: FODS P1, P2, P3, P4 and P5
Process: Run the MLS-BC 1..n MLS-TLF 1..n
Output: MLS-BC 1..n PS
Module 12 - Benchmarking and MLS Selection is now described in detail.
Input: PS of the selected MLS-BC.
Process: Selecting the optimal Machine Learning Solution-BC from all Machine Learning Solution-BCs, e.g. in the repository of the system of Fig. 10, which offers a suitable solution to address the Business Question and User Data Set.
This selection is done by comparing the results obtained by the competing models on the test set.

Output: PS of the selected Machine Learning Solution-BC.
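The comparison on the test set can be sketched as follows; the use of RMSE as the error metric is an assumption (though the ARIMAX example code earlier also scores with RMSE), and the candidate predictions are fabricated toy values.

```python
import math

def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def select_best(candidates, y_true, baseline_benchmark):
    # candidates maps an MLS-BC name to its predictions on the test set
    scored = {name: rmse(y_true, pred) for name, pred in candidates.items()}
    best = min(scored, key=scored.get)
    # Only accept a winner that beats the baseline model's benchmark
    if scored[best] < baseline_benchmark:
        return best, scored[best]
    return None, None

y_true = [1.0, 2.0, 3.0]
candidates = {'MLS-BC1': [1.1, 2.1, 2.9], 'MLS-BC2': [0.0, 0.0, 0.0]}
best, score = select_best(candidates, y_true, baseline_benchmark=1.0)
print(best)  # -> MLS-BC1
```

Comparing against the baseline benchmark reflects Module 7a above: a competing model is only worth selecting if it beats the intuitive heuristic currently used by the customer.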
Phase III - Production of Analytical Results and Statistical Story
Input: Prediction Set of the selected Machine Learning Solution -BC.
The product may be able, based on a successful selection of a "winning" (e.g. best) Machine Learning Solution, to query the resulting data following several strategies reflecting the user's optimal use of the analytical answer in his operational or business process. The organization may receive a guide, e.g. in natural language, of "how should I use the result in my operations". The strategies may be customized to the organization's needs based on its selected Strategy Parameters (SP), e.g. as shown in the table of Fig. 13.
As a result, a simulation report may be generated.
Output: Optional strategies report.
It is appreciated that only one example embodiment of each of the above modules is described. The input may be other than described above; the process may be other than described above and different or additional outputs may be provided, in each respective module.
According to one embodiment, operations 510 - 720 all run in parallel e.g. are cascaded to run on different requests for predictions.
It is appreciated that the architecture of Fig. 10 is merely exemplary; any or all of the modules therein need not be provided.
Typically, system end-users upload data files over a secured web service to a private storage area (e.g. HOT data). Data during the transfer process and in the storage area is typically encrypted. Each customer's data may be kept in a separate secured location. Typically, backend processes or services, e.g. data verification/preparation, are provided as internal system processes, and only these processes are entitled to access customers' HOT data for verification/preparation tasks.
Typically, there is a different storage area for obfuscated data. System end-users' data separation is typically maintained for obfuscated data as for raw data, e.g. as above. Typically, obfuscated data is available for Data Scientists' models.
Typically, Metadata is kept on a proprietary database structure and is available for the system’s internal services only (backend services).
It is appreciated that the code solutions provided e.g. by human data scientists e.g. in response to a challenge, need not necessarily comprise the specific 4 blocks shown and described herein. Instead, less than or more than 4 blocks may be provided by each data scientist, and the operative content (the functionalities performed by each block) may be different, as long as a set or sequence of blocks is predefined, and the functionalities to be performed by each block are predefined, typically in sequence such that each block operates on input generated by at least one previous block in the sequence.
It is appreciated that terminology such as "mandatory", "required", "need" and "must" refer to implementation choices made within the context of a particular implementation or application described herewithin for clarity and are not intended to be limiting, since in an alternative implementation, the same elements might be defined as not mandatory and not required or might even be eliminated altogether.
Components described herein as software may, alternatively, be implemented wholly or partly in hardware and/or firmware, if desired, using conventional techniques, and vice-versa. Each module or component or processor may be centralized in a single physical location or physical device or distributed over several physical locations or physical devices.
Included in the scope of the present disclosure, inter alia, are electromagnetic signals in accordance with the description herein. These may carry computer-readable instructions for performing any or all of the operations of any of the methods shown and described herein, in any suitable order including simultaneous performance of suitable groups of operations as appropriate; machine-readable instructions for performing any or all of the operations of any of the methods shown and described herein, in any suitable order; program storage devices readable by machine, tangibly embodying a program of instructions executable by the machine to perform any or all of the operations of any of the methods shown and described herein, in any suitable order i.e. not necessarily as shown, including performing various operations in parallel or concurrently rather than sequentially as shown; a computer program product comprising a computer useable medium having computer readable program code, such as executable code, having embodied therein, and/or including computer readable program code for performing, any or all of the operations of any of the methods shown and described herein, in any suitable order; any technical effects brought about by any or all of the operations of any of the methods shown and described herein, when performed in any suitable order; any suitable apparatus or device or combination of such, programmed to perform, alone or in combination, any or all of the operations of any of the methods shown and described herein, in any suitable order; electronic devices each including at least one processor and/or cooperating input device and/or output device and operative to perform e.g. 
in software any operations shown and described herein; information storage devices or physical records, such as disks or hard drives, causing at least one computer or other device to be configured so as to carry out any or all of the operations of any of the methods shown and described herein, in any suitable order; at least one program pre-stored e.g. in memory or on an information network such as the Internet, before or after being downloaded, which embodies any or all of the operations of any of the methods shown and described herein, in any suitable order, and the method of uploading or downloading such, and a system including server/s and/or client/s for using such; at least one processor configured to perform any combination of the described operations or to execute any combination of the described modules; and hardware which performs any or all of the operations of any of the methods shown and described herein, in any suitable order, either alone or in conjunction with software. Any computer-readable or machine-readable media described herein is intended to include non-transitory computer- or machine-readable media.
Any computations or other forms of analysis described herein may be performed by a suitable computerized method. Any operation or functionality described herein may be wholly or partially computer-implemented e.g. by one or more processors. The invention shown and described herein may include (a) using a computerized method to identify a solution to any of the problems or for any of the objectives described herein, the solution optionally including at least one of a decision, an action, a product, a service or any other information described herein that impacts, in a positive manner, a problem or objective described herein; and (b) outputting the solution.
The system may if desired be implemented as a web-based system employing software, computers, routers and telecommunications equipment as appropriate.
Any suitable deployment may be employed to provide functionalities e.g. software functionalities shown and described herein. For example, a server may store certain applications, for download to clients, which are executed at the client side, the server side serving only as a storehouse. Some or all functionalities e.g. software functionalities shown and described herein may be deployed in a cloud environment. Clients e.g. mobile communication devices such as smartphones may be operatively associated with but external to the cloud.
The scope of the present invention is not limited to structures and functions specifically described herein and is also intended to include devices which have the capacity to yield a structure, or perform a function, described herein, such that even though users of the device may not use the capacity, they are if they so desire able to modify the device to obtain the structure or function. Any "if-then" logic described herein is intended to include embodiments in which a processor is programmed to repeatedly determine whether condition x, which is sometimes true and sometimes false, is currently true or false and to perform y each time x is determined to be true, thereby to yield a processor which performs y at least once, typically on an "if and only if" basis e.g. triggered only by determinations that x is true and never by determinations that x is false.
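Purely for illustration, the repeated-determination "if-then" behaviour described above may be read as a polling loop in which the action is triggered only by determinations that the condition is true. The function and parameter names below are hypothetical and not part of the disclosure:

```python
import time


def run_if_then(check_condition, perform_action, poll_interval=1.0, max_iterations=None):
    """Repeatedly determine whether condition x is currently true or false,
    and perform y each time x is determined to be true.

    y is triggered only by determinations that x is true, never by
    determinations that x is false ("if and only if" behaviour).
    """
    iteration = 0
    while max_iterations is None or iteration < max_iterations:
        if check_condition():    # condition x: sometimes true, sometimes false
            perform_action()     # y: performed each time x is found true
        iteration += 1
        time.sleep(poll_interval)
```

Under this reading, a processor running such a loop performs y at least once whenever x is ever determined to be true.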
Features of the present invention, including operations, which are described in the context of separate embodiments may also be provided in combination in a single embodiment. For example, a system embodiment is intended to include a corresponding process embodiment and vice versa. Also, each system embodiment is intended to include a server-centered "view" or client-centered "view", or "view" from any other node of the system, of the entire functionality of the system, computer-readable medium, or apparatus, including only those functionalities performed at that server or client or node. Features may also be combined with features known in the art and particularly although not limited to those described in the Background section or in publications mentioned therein.
Conversely, features of the invention, including operations, which are described for brevity in the context of a single embodiment or in a certain order may be provided separately or in any suitable sub-combination, including with features known in the art (particularly although not limited to those described in the Background section or in publications mentioned therein) or in a different order. "e.g." is used herein in the sense of a specific example which is not intended to be limiting. Each method may comprise some or all of the operations illustrated or described, suitably ordered e.g. as illustrated or described herein.
Devices, apparatus or systems shown coupled in any of the drawings may in fact be integrated into a single platform in certain embodiments or may be coupled via any appropriate wired or wireless coupling such as but not limited to optical fiber, Ethernet, Wireless LAN, HomePNA, power line communication, cell phone, Smart Phone (e.g. iPhone), Tablet, Laptop, PDA, Blackberry GPRS, Satellite including GPS, or other mobile delivery. It is appreciated that in the description and drawings shown and described herein, functionalities described or illustrated as systems and sub-units thereof can also be provided as methods and operations therewithin, and functionalities described or illustrated as methods and operations therewithin can also be provided as systems and sub-units thereof. The scale used to illustrate various elements in the drawings is merely exemplary and/or appropriate for clarity of presentation and is not intended to be limiting.

Claims

1. A data processing system comprising:
a system-scientist data interface controlled by a first processor to accept and store in a digital repository, multiple-part Machine Learning Solutions from scientists, each including multiple blocks;
a second processor compiling plural Machine Learning Solution pipelines to address plural respective Business Solutions and each Business Solution's respective Analytical Solution, wherein said compiling comprises mixing-and-matching Machine Learning Solution blocks stored in the Machine Learning Solution repository thereby to generate Machine Learning Solution Block Combinations, and
a system-business user interface controlled by a third processor to generate a Machine Learning Solution pipeline output, for at least one business user presenting an individual Business Solution including business data (aka business data set - Business Data Set) using the Machine Learning Solution pipeline compiled by the processor for said individual Business Solution.
2. A system according to claim 1 and also comprising a fourth processor operative to communicate to at least one Data Scientist, a challenge inviting the at least one Data Scientist to provide at least one Machine Learning Solution.
3. A system according to claim 1 wherein said multiple blocks include a Business Solution- Feature Engineering and Extraction block.
4. A system according to claim 1 wherein said multiple blocks include a Machine Learning Solution -Features Selection block.
5. A system according to claim 1 wherein said multiple blocks include a Machine Learning Solution - Training/Learning/Full Evaluation block.
6. A system according to claim 1 wherein said multiple blocks include a Machine Learning Solution -Final Prediction block.
7. A system according to claim 1 wherein for at least one Business Solution, said Machine Learning Solution Block combinations automatically compete with one another including identifying a best block combination and wherein the Machine Learning Solution pipeline compiled by the processor for said individual Business Solution comprises said best block combination.
8. A system according to claim 3 wherein the Business Solution-Feature Engineering and Extraction block includes code operative for pre-processing data before machine learning analysis.
9. A system according to claim 3 wherein all pre-processing of data before machine learning analysis occurs in the Business Solution-Feature Engineering and Extraction block and not in other blocks.
10. A system according to claim 9 wherein said pre-processing includes removing data columns.
11. A system according to claim 9 wherein said pre-processing includes extracting additional data from existing data and using said additional data as input to the machine learning analysis.
12. A system according to claim 4 wherein the Machine Learning Solution -Features Selection block includes code operative for selecting features which are stronger predictors of at least one target variable to be predicted in a given challenge and not selecting features which are less strong predictors of the at least one target variable.
13. A system according to claim 4 wherein all code operative for selecting features which are stronger predictors of at least one target variable to be predicted in a given challenge and not selecting features which are less strong predictors of the at least one target variable is included in the Machine Learning Solution -Features Selection block and not in other blocks.
14. A system according to claim 5 wherein the Machine Learning Solution - Training/Learning/Full Evaluation block includes machine learning code operative to train and evaluate a machine learning training set.
15. A system according to claim 5 wherein all code operative to train and evaluate a machine learning training set is included in the Machine Learning Solution - Training/Learning/Full Evaluation block and not in other blocks.
16. A system according to claim 6 wherein the Machine Learning Solution -Final Prediction block is operative to run machine learning code on a test set including providing a final prediction for each data record in the test set.
17. A system according to claim 6 wherein the Machine Learning Solution -Final Prediction block is operative to run machine learning code included in a Machine Learning Solution - Training/Learning/Full Evaluation block on a test set including providing a final prediction for each data record in the test set.
18. A system according to claim 6 wherein all code operative to run machine learning code on a test set including providing a final prediction for each data record in the test set is performed by the Machine Learning Solution -Final Prediction block and not by other blocks.
19. A data processing method comprising:
Providing a system-scientist data interface controlled by a first processor to accept and store in a digital repository, multiple-part Machine Learning Solutions from scientists, each including multiple blocks;
Providing a second processor compiling plural Machine Learning Solution pipelines to address plural respective Business Solutions and each Business Solution’s respective Analytical Solution, wherein said compiling comprises mixing-and-matching Machine Learning Solution blocks stored in the Machine Learning Solution repository thereby to generate Machine Learning Solution Block Combinations, and
Providing a system-business user interface controlled by a third processor to generate a Machine Learning Solution pipeline output, for at least one business user presenting an individual Business Solution including business data (aka business data set - Business Data Set) using the Machine Learning Solution pipeline compiled by the processor for said individual Business Solution.
20. A computer program product, comprising a non-transitory tangible computer readable medium having computer readable program code embodied therein, said computer readable program code adapted to be executed to implement a data processing method comprising:
Providing a system-scientist data interface controlled by a first processor to accept and store in a digital repository, multiple-part Machine Learning Solutions from scientists, each including multiple blocks;
Providing a second processor compiling plural Machine Learning Solution pipelines to address plural respective Business Solutions and each Business Solution’s respective Analytical Solution, wherein said compiling comprises mixing-and-matching Machine Learning Solution blocks stored in the Machine Learning Solution repository thereby to generate Machine Learning Solution Block Combinations, and
Providing a system-business user interface controlled by a third processor to generate a Machine Learning Solution pipeline output, for at least one business user presenting an individual Business Solution including business data (aka business data set - Business Data Set) using the Machine Learning Solution pipeline compiled by the processor for said individual Business Solution.
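For illustration only, the mixing-and-matching of repository blocks into Block Combinations (claims 1 and 19) and the automatic competition among combinations to identify a best one (claim 7) might be sketched as follows. The block kinds mirror claims 3-6; the repository layout, function names and scoring interface are assumptions for this sketch, not part of the claims:

```python
from itertools import product

# The four block kinds recited in claims 3-6.
BLOCK_KINDS = ["feature_engineering", "feature_selection",
               "training_learning_full_evaluation", "final_prediction"]


def compile_block_combinations(repository):
    """Mix-and-match one stored block of each kind into every possible
    Machine Learning Solution Block Combination.

    `repository` is assumed to map each block kind to a list of candidate
    blocks contributed by data scientists.
    """
    choices = [repository[kind] for kind in BLOCK_KINDS]
    return [dict(zip(BLOCK_KINDS, combo)) for combo in product(*choices)]


def best_combination(combinations, evaluate):
    """Let the Block Combinations compete: score each one with the supplied
    evaluation function and return the highest-scoring combination."""
    return max(combinations, key=evaluate)
```

In this sketch the winning combination would then serve as the compiled Machine Learning Solution pipeline for the individual Business Solution; `evaluate` stands in for whatever accuracy metric the competition of claim 7 applies to each combination.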
PCT/IL2019/050358 2018-05-07 2019-03-28 Multiple-part machine learning solutions generated by data scientists WO2019215713A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201862667632P 2018-05-07 2018-05-07
US62/667,632 2018-05-07

Publications (1)

Publication Number Publication Date
WO2019215713A1 true WO2019215713A1 (en) 2019-11-14

Family

ID=68467989

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IL2019/050358 WO2019215713A1 (en) 2018-05-07 2019-03-28 Multiple-part machine learning solutions generated by data scientists

Country Status (1)

Country Link
WO (1) WO2019215713A1 (en)


Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140372346A1 (en) * 2013-06-17 2014-12-18 Purepredictive, Inc. Data intelligence using machine learning
US20170308800A1 (en) * 2016-04-26 2017-10-26 Smokescreen Intelligence, LLC Interchangeable Artificial Intelligence Perception Systems and Methods


Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113254885A (en) * 2020-02-13 2021-08-13 Alipay (Hangzhou) Information Technology Co., Ltd. Machine learning model protection method and device
US11625632B2 (en) 2020-04-17 2023-04-11 International Business Machines Corporation Automated generation of a machine learning pipeline
EP3968244A1 (en) * 2020-09-02 2022-03-16 Fujitsu Limited Automatically curating existing machine learning projects into a corpus adaptable for use in new machine learning projects
EP3968245A1 (en) * 2020-09-02 2022-03-16 Fujitsu Limited Automatically generating a pipeline of a new machine learning project from pipelines of existing machine learning projects stored in a corpus
US11403304B2 (en) 2020-09-02 2022-08-02 Fujitsu Limited Automatically curating existing machine learning projects into a corpus adaptable for use in new machine learning projects
US11551151B2 (en) 2020-09-02 2023-01-10 Fujitsu Limited Automatically generating a pipeline of a new machine learning project from pipelines of existing machine learning projects stored in a corpus
US20220129783A1 (en) * 2020-10-23 2022-04-28 EMC IP Holding Company LLC Acceptance Status Classification of Product-Related Data Structures Using Models With Multiple Training Periods
WO2022174792A1 (en) * 2021-02-18 2022-08-25 International Business Machines Corporation Automated time series forecasting pipeline ranking
GB2618952A (en) * 2021-02-18 2023-11-22 Ibm Automated time series forecasting pipeline ranking

Similar Documents

Publication Publication Date Title
WO2019215713A1 (en) Multiple-part machine learning solutions generated by data scientists
Fabijan et al. The evolution of continuous experimentation in software product development: from data to a data-driven organization at scale
Marion et al. New product development practices and early‐stage firms: Two in‐depth case studies
US11625647B2 (en) Methods and systems for facilitating analysis of a model
Gajic et al. Method of evaluating the impact of ERP implementation critical success factors–a case study in oil and gas industries
US20150310358A1 (en) Modeling consumer activity
US20210026761A1 (en) Testing of complex data processing systems
Sigala Evaluating the performance of destination marketing systems (DMS): stakeholder perspective
US20110093309A1 (en) System and method for predictive categorization of risk
Saarikallio et al. Quality culture boosts agile transformation—Action research in a business‐to‐business software business
US20230394591A1 (en) Systems and Methods for Benefit Plan Quality Assurance and Certification
Štůsek et al. Strategic importance of the quality of information technology for improved competitiveness of agricultural companies and its evaluation
Wieczorek et al. Systems and Software Quality
Zillner Business models and ecosystem for big data
CN112801446A (en) Cloud service station and cloud service method
Grimheden et al. Concretizing CRISP-DM for Data-Driven Financial Decision Support Tools
Kumar Software Engineering for Big Data Systems
Gillain et al. Planning optimal agile releases via requirements optimization
Wood Applied Guide for Event Study Research in Supply Chain Management
Hamid Upgrading distributed agile development
US11875138B2 (en) System and method for matching integration process management system users using deep learning and matrix factorization
Brandes Examining how UNUM Group can accelerate the adoption of DEVOPS capabilities through the use of Value Stream Mapping Methods: A case study
Jony Preprocessing solutions for telecommunication specific big data use cases
Ali Dynamic Analysis of the Implementation of the Blockchain Technology in the Supply Chain
Quiroga et al. Artifact Development for the Prediction of Stress Levels on Higher Education Students using Machine Learning

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19799736

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19799736

Country of ref document: EP

Kind code of ref document: A1