US20220092419A1 - Systems and methods to use neural networks for model transformations - Google Patents

Systems and methods to use neural networks for model transformations

Info

Publication number
US20220092419A1
US20220092419A1 (application US17/464,796)
Authority
US
United States
Prior art keywords
model
data
neural network
dataset
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
US17/464,796
Inventor
Vincent Pham
Anh Truong
Fardin Abdi Taghi Abad
Jeremy Goodsitt
Austin Walters
Mark Watson
Reza Farivar
Kenneth Taylor
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Capital One Services LLC
Original Assignee
Capital One Services LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Capital One Services LLC
Priority to US17/464,796
Assigned to CAPITAL ONE SERVICES, LLC (assignment of assignors interest; see document for details). Assignors: TAYLOR, KENNETH; ABAD, FARDIN ABDI TAGHI; FARIVAR, REZA; GOODSITT, JEREMY; PHAM, VINCENT; TRUONG, ANH; WALTERS, AUSTIN; WATSON, MARK
Publication of US20220092419A1
Legal status: Pending

Classifications

    • G06F 9/54 Interprogram communication
    • G06F 9/541 Interprogram communication via adapters, e.g. between incompatible applications
    • G06F 9/547 Remote procedure calls [RPC]; Web services
    • G06F 8/71 Version control; Configuration management
    • G06F 11/3608 Software analysis for verifying properties of programs using formal methods, e.g. model checking, abstract interpretation
    • G06F 11/3628 Software debugging of optimised code
    • G06F 11/3636 Software debugging by tracing the execution of the program
    • G06F 11/3684 Test management for test design, e.g. generating new test cases
    • G06F 11/3688 Test management for test execution, e.g. scheduling of test suites
    • G06F 16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G06F 16/2237 Indexing structures: vectors, bitmaps or matrices
    • G06F 16/2264 Multidimensional index structures
    • G06F 16/2423 Interactive query statement specification based on a database schema
    • G06F 16/24568 Data stream processing; Continuous queries
    • G06F 16/248 Presentation of query results
    • G06F 16/254 Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • G06F 16/258 Data format conversion from or to a database
    • G06F 16/283 Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
    • G06F 16/285 Clustering or classification (relational databases)
    • G06F 16/288 Entity relationship models
    • G06F 16/335 Filtering based on additional data, e.g. user or group profiles
    • G06F 16/35 Clustering; Classification (unstructured textual data)
    • G06F 16/90332 Natural language query formulation or dialogue systems
    • G06F 16/90335 Query processing
    • G06F 16/9038 Presentation of query results
    • G06F 16/906 Clustering; Classification
    • G06F 16/93 Document management systems
    • G06F 17/15 Correlation function computation including computation of convolution operations
    • G06F 17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G06F 17/18 Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • G06F 18/2115 Selection of the most significant subset of features by evaluating different subsets according to an optimisation criterion, e.g. class separability, forward selection or backward elimination
    • G06F 18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/2148 Generating training patterns characterised by the process organisation or structure, e.g. boosting cascade
    • G06F 18/217 Validation; Performance evaluation; Active pattern learning techniques
    • G06F 18/2193 Validation based on specific statistical tests
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G06F 18/23 Clustering techniques
    • G06F 18/24 Classification techniques
    • G06F 18/2411 Classification based on the proximity to a decision surface, e.g. support vector machines
    • G06F 18/2415 Classification based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F 18/285 Selection of pattern recognition techniques, e.g. of classifiers in a multi-classifier system
    • G06F 18/40 Software arrangements specially adapted for pattern recognition, e.g. user interfaces or toolboxes therefor
    • G06F 21/552 Detecting local intrusion or implementing counter-measures involving long-term monitoring or reporting
    • G06F 21/60 Protecting data
    • G06F 21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G06F 21/6254 Protecting personal data by anonymising data, e.g. decorrelating personal data from the owner's identification
    • G06F 30/20 Design optimisation, verification or simulation
    • G06F 40/117 Tagging; Marking up; Designating a block; Setting of attributes
    • G06F 40/166 Editing, e.g. inserting or deleting
    • G06F 40/20 Natural language analysis
    • G06K 9/6227
    • G06N 20/00 Machine learning
    • G06N 20/10 Machine learning using kernel methods, e.g. support vector machines [SVM]
    • G06N 20/20 Ensemble learning
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/045 Combinations of networks
    • G06N 3/047 Probabilistic or stochastic networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G06N 3/088 Non-supervised learning, e.g. competitive learning
    • G06N 5/00 Computing arrangements using knowledge-based models
    • G06N 5/01 Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
    • G06N 5/02 Knowledge representation; Symbolic representation
    • G06N 5/022 Knowledge engineering; Knowledge acquisition
    • G06N 5/04 Inference or reasoning models
    • G06N 7/005
    • G06N 7/01 Probabilistic graphical models, e.g. probabilistic networks
    • G06Q 10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G06T 11/001 Texturing; Colouring; Generation of texture or colour
    • G06T 7/194 Segmentation; Edge detection involving foreground-background segmentation
    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T 7/248 Analysis of motion using feature-based methods involving reference images or patches
    • G06T 7/254 Analysis of motion involving subtraction of images
    • G06V 10/768 Image or video recognition using context analysis, e.g. recognition aided by known co-occurring patterns
    • G06V 10/993 Evaluation of the quality of the acquired pattern
    • G06V 30/194 References adjustable by an adaptive method, e.g. learning
    • G06V 30/1985 Syntactic analysis, e.g. using a grammatical approach
    • H04L 63/1416 Event detection, e.g. attack signature detection
    • H04L 63/1491 Countermeasures against malicious traffic using deception, e.g. honeypots, honeynets, decoys or entrapment
    • H04L 67/306 User profiles
    • H04L 67/34 Network arrangements or protocols involving the movement of software or configuration parameters
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10024Color image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs
    • H04N21/23412Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs for generating or manipulating the scene composition of objects, e.g. MPEG-4 objects
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/81Monomedia components thereof
    • H04N21/8146Monomedia components thereof involving graphical data, e.g. 3D object, 2D graphics
    • H04N21/8153Monomedia components thereof involving graphical data, e.g. 3D object, 2D graphics comprising still images, e.g. texture, background image

Definitions

  • the disclosed embodiments concern a platform for management of artificial intelligence systems.
  • the disclosed embodiments concern using the disclosed platform to create neural network models of data based on previously existing models, including legacy models.
  • a legacy model is a model that runs in a programming environment that has been or is being replaced by a different programming environment. These data models can be used, for example, to generate synthetic data for testing or training artificial intelligence systems.
  • the disclosed embodiments also concern improvements in transforming any model, including a legacy model, into a neural network model.
  • Neural network models provide advantages over conventional modeling approaches. Neural network models can model data more efficiently, adaptably, and accurately than conventional models. Neural network models may be recurrent neural network models, deep learning models, long short-term memory (LSTM) models, convolutional neural network (CNN) models, generative adversarial networks (GANs), and the like. Neural network models meet different modeling needs than conventional models and can analyze a wide variety of data as data analysis objectives evolve.
  • a neural network model may be designed according to an underlying physical or relational understanding of the data structure. Further, to assist user understanding of a model, it may be desirable to base the neural network model on an original model, e.g., a conventional model, that captures physical or statistical relationships between data elements. Further, a neural network model designed based on the features of the original model may behave more like the original model when encountering new datasets, and thereby yield similar insights that are easier for the user to understand. Therefore, it is often desirable to transform original models into neural network models instead of, for instance, replacing original models with completely new neural network models. For example, a neural network model may train more efficiently and more accurately predict an outcome if certain nodes are trained to replicate a linear regression model that had already identified certain statistically significant relationships between input variables. Further, a previously trained predictive neural network model may be an efficient seed model for an updated, more accurate predictive neural network model because the previously trained model may have learned to make predictions with some accuracy.
  • an organization may need to transform models of a given type (e.g., a random forest model, a gradient boosting machine, a regression model, a linear regression model, a symbolic model, a neural network model, or other model).
  • in the case of a legacy model, for example, an organization may be in the process of phasing out SAS and implementing PYTHON across a platform, so the organization may need to transform legacy SAS models to PYTHON models.
  • the disclosed embodiments provide unconventional systems and methods for transforming models that are more efficient, less costly, more flexible, and more accurate than conventional approaches.
  • the disclosed embodiments provide unconventional systems and methods to transform any input model into a neural network model.
  • the transformed neural network model is designed based on the features of the input model and is designed to overfit the input model. By overfitting the input model, the transformed neural network model accurately reproduces the modeling results of the input model on training datasets and is likely to behave like the input model when analyzing other, non-training datasets.
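For illustration, the overfitting approach described above can be sketched as follows. This is a minimal sketch, not the disclosed implementation: the "input model" is assumed to be a simple linear regression, the "network" is a single linear layer, and the key point is that the network is trained against the input model's own predictions (rather than ground-truth labels) so that it reproduces the input model's behavior on the training dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# A stand-in for the input model being transformed (assumption: a
# linear regression with known coefficients, for illustration only).
def input_model(X):
    return X @ np.array([2.0, -1.0]) + 0.5

X = rng.normal(size=(200, 2))
y_target = input_model(X)  # the input model's output is the training target

# A one-layer "network" trained by gradient descent to overfit the
# input model, i.e., to replicate its outputs on the training data.
W = np.zeros(2)
b = 0.0
lr = 0.1
for _ in range(500):
    pred = X @ W + b
    err = pred - y_target
    W -= lr * (X.T @ err) / len(X)  # gradient of mean squared error w.r.t. W
    b -= lr * err.mean()            # gradient w.r.t. the bias

# After training, the surrogate reproduces the input model closely.
mse = float(np.mean((X @ W + b - y_target) ** 2))
```

Because the target values come from the input model itself, driving the training error toward zero is exactly the "overfitting the input model" behavior described above.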
  • the disclosed embodiments provide systems and methods to transform legacy models into new models of the same type using machine learning.
  • the disclosed embodiments may use neural network models to create a new model that is the same type as the legacy model but will run in a different environment than the legacy model.
  • the unconventional systems and methods of disclosed embodiments save time, reduce costs, and reduce errors.
  • the disclosed embodiments include a system for transforming a model into a neural network model.
  • the system may include one or more memory units for storing instructions, and one or more processors configured to execute the instructions to perform operations.
  • the operations may include receiving input data comprising an input model, an input dataset, and an input command specifying model selection criteria.
  • the operations may include applying the input model to the input dataset to generate model output.
  • the operations may include storing model output and at least one of input model features or a map of the input model and generating a plurality of candidate neural network models.
  • the parameters of the candidate neural network models may be based on the input model features.
  • the operations may include tuning the plurality of candidate neural network models to the input model.
  • the operations may include receiving model output from the plurality of candidate neural network models and selecting a neural network model from the plurality of the candidate neural network models based on the candidate model output and the model selection criteria. In some aspects, the operations may include returning the selected neural network model.
  • a method for transforming a model into a neural network model may include receiving input data comprising an input model, an input dataset, and an input command specifying model selection criteria.
  • the method may include applying the input model to the input dataset to generate model output.
  • the method may include storing model output and at least one of input model features or a map of the input model and generating a plurality of candidate neural network models.
  • the parameters of the candidate neural network models may be based on the input model features.
  • the method may include tuning the plurality of candidate neural network models to the input model.
  • the method may include receiving model output from the plurality of candidate neural network models and selecting a neural network model from the plurality of the candidate neural network models based on the candidate model output and the model selection criteria. In some aspects, the method may include returning the selected neural network model.
  • non-transitory computer readable storage media may store program instructions, which are executed by at least one processor device and perform any of the methods described herein.
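The candidate-generation, tuning, and selection operations summarized above can be sketched as below. All names and the candidate design are illustrative assumptions (here, random-feature networks of different hidden widths stand in for the plurality of candidate neural network models): each candidate is tuned to reproduce the input model's output, and one is selected by a model selection criterion, here lowest held-out error.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for the input model being transformed (illustrative).
def input_model(X):
    return np.sin(X[:, 0]) + 0.3 * X[:, 1]

X_train = rng.uniform(-2, 2, size=(300, 2))
X_val = rng.uniform(-2, 2, size=(100, 2))
y_train, y_val = input_model(X_train), input_model(X_val)

def fit_candidate(width):
    """Tune one candidate: random tanh hidden layer, least-squares output layer."""
    W_h = rng.normal(size=(2, width))
    H = np.tanh(X_train @ W_h)                     # hidden activations
    w_out, *_ = np.linalg.lstsq(H, y_train, rcond=None)
    val_err = float(np.mean((np.tanh(X_val @ W_h) @ w_out - y_val) ** 2))
    return val_err, (W_h, w_out)

# Generate a plurality of candidates and tune each to the input model.
candidates = {width: fit_candidate(width) for width in (2, 16, 64)}

# Select a model based on candidate output and the selection criterion.
best_width = min(candidates, key=lambda w: candidates[w][0])
best_err = candidates[best_width][0]
```

A real system would compare richer architectures and criteria; the sketch only shows the receive/tune/select flow.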
  • FIG. 1 depicts an exemplary cloud-computing environment for generating data models, consistent with disclosed embodiments.
  • FIG. 2 depicts an exemplary process for generating data models, consistent with disclosed embodiments.
  • FIG. 3 depicts an exemplary process for generating synthetic data using existing data models, consistent with disclosed embodiments.
  • FIG. 4 depicts an exemplary implementation of the cloud-computing environment of FIG. 1 , consistent with disclosed embodiments.
  • FIG. 5A depicts an exemplary process for generating synthetic data using class-specific models, consistent with disclosed embodiments.
  • FIG. 5B depicts an exemplary process for generating synthetic data using class and subclass-specific models, consistent with disclosed embodiments.
  • FIG. 6 depicts an exemplary process for training a classifier for generation of synthetic data, consistent with disclosed embodiments.
  • FIG. 7 depicts an exemplary process for training a classifier for generation of synthetic data, consistent with disclosed embodiments.
  • FIG. 8 depicts an exemplary process for training a generative adversarial network using a normalized reference dataset, consistent with disclosed embodiments.
  • FIG. 9 depicts an exemplary process for training a generative adversarial network using a loss function configured to ensure a predetermined degree of similarity, consistent with disclosed embodiments.
  • FIG. 10 depicts an exemplary process for supplementing or transforming datasets using code-space operations, consistent with disclosed embodiments.
  • FIGS. 11A and 11B depict an exemplary illustration of points in code-space, consistent with disclosed embodiments.
  • FIG. 12A depicts an exemplary illustration of supplementing datasets using code-space operations, consistent with disclosed embodiments.
  • FIG. 12B depicts an exemplary illustration of transforming datasets using code-space operations, consistent with disclosed embodiments.
  • FIG. 13 depicts an exemplary cloud computing system for generating a synthetic data stream that tracks a reference data stream, consistent with disclosed embodiments.
  • FIG. 14 depicts a process for generating synthetic JSON log data using the cloud computing system of FIG. 13 , consistent with disclosed embodiments.
  • FIG. 15 depicts a system for secure generation and insecure use of models of sensitive data, consistent with disclosed embodiments.
  • FIG. 16 depicts a process for transforming any model into a neural network model, consistent with disclosed embodiments.
  • FIG. 17 depicts a process for transforming a legacy model, consistent with disclosed embodiments.
  • the disclosed embodiments can be used to create models of datasets, which may include sensitive datasets (e.g., customer financial information, patient healthcare information, and the like). Using these models, the disclosed embodiments can produce fully synthetic datasets with similar structure and statistics as the original sensitive or non-sensitive datasets. The disclosed embodiments also provide tools for desensitizing datasets and tokenizing sensitive values.
  • the disclosed systems can include a secure environment for training a model of sensitive data, and a non-secure environment for generating synthetic data with similar structure and statistics as the original sensitive data.
  • the disclosed systems can be used to tokenize the sensitive portions of a dataset (e.g., mailing addresses, social security numbers, email addresses, account numbers, demographic information, and the like).
  • the disclosed systems can be used to replace parts of sensitive portions of the dataset (e.g., preserve the first or last 3 digits of an account number, social security number, or the like; change a name to a first and last initial).
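For illustration, the partial-replacement rules mentioned above (preserve a few trailing digits of an account-like number; reduce a name to initials) might look like the following sketch. The helper names and exact rules are hypothetical, not the disclosed implementation.

```python
import re

def mask_account(number: str, keep: int = 3) -> str:
    """Replace all but the last `keep` digits with asterisks."""
    digits = re.sub(r"\D", "", number)      # strip separators such as dashes
    return "*" * (len(digits) - keep) + digits[-keep:]

def initials(name: str) -> str:
    """Reduce a full name to first and last initials."""
    parts = name.split()
    return f"{parts[0][0]}. {parts[-1][0]}."

masked = mask_account("1234-5678-9012")  # -> "*********012"
short_name = initials("Jane Example")    # -> "J. E."
```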
  • the dataset can include one or more JSON (JavaScript Object Notation) or delimited files (e.g., comma-separated value, or CSV, files).
  • the disclosed systems can automatically detect sensitive portions of structured and unstructured datasets and automatically replace them with similar but synthetic values.
  • FIG. 1 depicts a cloud-computing environment 100 for generating data models.
  • Environment 100 can be configured to support generation and storage of synthetic data, generation and storage of data models, optimized choice of parameters for machine learning, and imposition of rules on synthetic data and data models.
  • Environment 100 can be configured to expose an interface for communication with other systems.
  • Environment 100 can include computing resources 101 , dataset generator 103 , database 105 , model optimizer 107 , model storage 109 , model curator 111 , and interface 113 . These components of environment 100 can be configured to communicate with each other, or with external components of environment 100 , using network 115 .
  • the particular arrangement of components depicted in FIG. 1 is not intended to be limiting.
  • System 100 can include additional components, or fewer components. Multiple components of system 100 can be implemented using the same physical computing device or different physical computing devices.
  • Computing resources 101 can include one or more computing devices configurable to train data models.
  • the computing devices can be special-purpose computing devices, such as graphical processing units (GPUs) or application-specific integrated circuits.
  • the cloud computing instances can be general-purpose computing devices.
  • the computing devices can be configured to host an environment for training data models. For example, the computing devices can host virtual machines, pods, or containers.
  • the computing devices can be configured to run applications for generating data models. For example, the computing devices can be configured to run SAGEMAKER, or similar machine learning training applications.
  • Computing resources 101 can be configured to receive models for training from model optimizer 107 , model storage 109 , or another component of system 100 .
  • Computing resources 101 can be configured to provide training results, including trained models and model information, such as the type and/or purpose of the model and any measures of classification error.
  • Dataset generator 103 can include one or more computing devices configured to generate data.
  • Dataset generator 103 can be configured to provide data to computing resources 101 , database 105 , to another component of system 100 (e.g., interface 113 ), or another system (e.g., an APACHE KAFKA cluster or other publication service).
  • Dataset generator 103 can be configured to receive data from database 105 or another component of system 100 .
  • Dataset generator 103 can be configured to receive data models from model storage 109 or another component of system 100 .
  • Dataset generator 103 can be configured to generate synthetic data.
  • dataset generator 103 can be configured to generate synthetic data by identifying and replacing sensitive information in data received from database 105 or interface 113 .
  • dataset generator 103 can be configured to generate synthetic data using a data model without reliance on input data.
  • the data model can be configured to generate data matching statistical and content characteristics of a training dataset.
  • the data model can be configured to map from a random or pseudorandom vector to elements in the training data space.
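A minimal sketch of such a mapping follows (illustrative only; a real system would learn a generator network, e.g., the generative network of a GAN): a linear transform maps pseudorandom noise vectors into the training-data space so that generated samples match the training data's mean and covariance.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in training data (illustrative): correlated 2-D samples.
train = rng.multivariate_normal([3.0, -1.0], [[2.0, 0.8], [0.8, 1.0]], size=5000)

# "Learned" mapping parameters: mean and a Cholesky factor of the
# sample covariance play the role of generator weights here.
mean = train.mean(axis=0)
L = np.linalg.cholesky(np.cov(train, rowvar=False))

def generate(n):
    z = rng.normal(size=(n, 2))  # pseudorandom vector input
    return z @ L.T + mean        # map noise into the training-data space

synthetic = generate(5000)
```

The generated samples share the training data's first- and second-order statistics without copying any training record.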
  • Database 105 can include one or more databases configured to store data for use by system 100 .
  • the databases can include cloud-based databases (e.g., AMAZON WEB SERVICES S3 buckets) or on-premises databases.
  • Model optimizer 107 can include one or more computing systems configured to manage training of data models for system 100 .
  • Model optimizer 107 can be configured to generate models for export to computing resources 101 .
  • Model optimizer 107 can be configured to generate models based on instructions received from a user or another system. These instructions can be received through interface 113 .
  • model optimizer 107 can be configured to receive a graphical depiction of a machine learning model and parse that graphical depiction into instructions for creating and training a corresponding neural network on computing resources 101 .
  • Model optimizer 107 can be configured to select model training parameters. This selection can be based on model performance feedback received from computing resources 101 .
  • Model optimizer 107 can be configured to provide trained models and descriptive information concerning the trained models to model storage 109 .
  • Model storage 109 can include one or more databases configured to store data models and descriptive information for the data models. Model storage 109 can be configured to provide information regarding available data models to a user or another system. This information can be provided using interface 113 .
  • the databases can include cloud-based databases (e.g., AMAZON WEB SERVICES S3 buckets) or on-premises databases.
  • the information can include model information, such as the type and/or purpose of the model and any measures of classification error.
  • Model curator 111 can be configured to impose governance criteria on the use of data models. For example, model curator 111 can be configured to delete or control access to models that fail to meet accuracy criteria. As a further example, model curator 111 can be configured to limit the use of a model to a particular purpose, or by a particular entity or individual. In some aspects, model curator 111 can be configured to ensure that a data model satisfies governance criteria before system 100 can process data using the data model.
  • Interface 113 can be configured to manage interactions between system 100 and other systems using network 115 .
  • interface 113 can be configured to publish data received from other components of system 100 (e.g., dataset generator 103 , computing resources 101 , database 105 , or the like). This data can be published in a publication and subscription framework (e.g., using APACHE KAFKA), through a network socket, in response to queries from other systems, or using other known methods. The data can be synthetic data, as described herein.
  • interface 113 can be configured to provide information received from model storage 109 regarding available datasets.
  • interface 113 can be configured to provide data or instructions received from other systems to components of system 100 .
  • interface 113 can be configured to receive instructions for generating data models (e.g., type of data model, data model parameters, training data indicators, training parameters, or the like) from another system and provide this information to model optimizer 107 .
  • interface 113 can be configured to receive data including sensitive portions from another system (e.g. in a file, a message in a publication and subscription framework, a network socket, or the like) and provide that data to dataset generator 103 or database 105 .
  • Network 115 can include any combination of electronic communications networks enabling communication between components of system 100 .
  • network 115 may include the Internet and/or any type of wide area network, an intranet, a metropolitan area network, a local area network (LAN), a wireless network, a cellular communications network, a Bluetooth network, a radio network, a device bus, or any other type of electronic communications network known to one of skill in the art.
  • FIG. 2 depicts a process 200 for generating data models.
  • Process 200 can be used to generate a data model for a machine learning application, consistent with disclosed embodiments.
  • the data model can be generated using synthetic data in some aspects.
  • This synthetic data can be generated using a synthetic dataset model, which can in turn be generated using actual data.
  • the synthetic data may be similar to the actual data in terms of values, value distributions (e.g., univariate and multivariate statistics of the synthetic data may be similar to that of the actual data), structure and ordering, or the like.
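One way to make "similar values and value distributions" concrete is sketched below. This is an illustrative metric, not a definition from the disclosure: it compares per-column means and standard deviations (univariate statistics) and the correlation matrices (a multivariate statistic) of the actual and synthetic datasets.

```python
import numpy as np

def similarity_gap(actual, synthetic):
    """Largest discrepancy across simple univariate and multivariate statistics."""
    mean_gap = np.abs(actual.mean(0) - synthetic.mean(0)).max()
    std_gap = np.abs(actual.std(0) - synthetic.std(0)).max()
    corr_gap = np.abs(np.corrcoef(actual, rowvar=False)
                      - np.corrcoef(synthetic, rowvar=False)).max()
    return float(max(mean_gap, std_gap, corr_gap))

rng = np.random.default_rng(7)
actual = rng.normal([1.0, 5.0], [1.0, 2.0], size=(4000, 2))
# A synthetic dataset drawn from the same distribution scores low;
# one drawn from a different distribution scores high.
similar = rng.normal([1.0, 5.0], [1.0, 2.0], size=(4000, 2))
dissimilar = rng.normal([0.0, 0.0], [1.0, 1.0], size=(4000, 2))
```

A small gap indicates the synthetic data mirrors the actual data's statistics; a large gap indicates it does not.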
  • the data model for the machine learning application can be generated without directly using the actual data.
  • the actual data may include sensitive information, and generating the data model may require distribution and/or review of training data, the use of the synthetic data can protect the privacy and security of the entities and/or individuals whose activities are recorded by the actual data.
  • Process 200 can then proceed to step 201 .
  • interface 113 can provide a data model generation request to model optimizer 107 .
  • the data model generation request can include data and/or instructions describing the type of data model to be generated.
  • the data model generation request can specify a general type of data model (e.g., neural network, recurrent neural network, generative adversarial network, kernel density estimator, random data generator, or the like) and parameters specific to the particular type of model (e.g., the number of features and number of layers in a generative adversarial network or recurrent neural network).
  • a recurrent neural network can include long short term memory modules (LSTM units), or the like.
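For illustration, a data model generation request of the kind described above might be structured as follows; the field names and values are hypothetical, not a schema published in the disclosure.

```python
# Illustrative request: a general model type plus type-specific
# parameters (e.g., number of features and layers for a GAN).
request = {
    "model_type": "generative_adversarial_network",
    "parameters": {
        "num_features": 32,  # width of the data representation
        "num_layers": 4,     # depth of generator/discriminator
    },
    "training": {"epochs": 20, "batch_size": 128},
}
```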
  • Process 200 can then proceed to step 203 .
  • step 203 one or more components of system 100 can interoperate to generate a data model.
  • a data model can be trained using computing resources 101 using data provided by dataset generator 103 .
  • this data can be generated using dataset generator 103 from data stored in database 105 .
  • the data used to train dataset generator 103 can be actual or synthetic data retrieved from database 105 .
  • model optimizer 107 can be configured to select model parameters (e.g., number of layers for a neural network, kernel function for a kernel density estimator, or the like), update training parameters, and evaluate model characteristics (e.g., the similarity of the synthetic data generated by the model to the actual data).
  • model optimizer 107 can be configured to provision computing resources 101 with an initialized data model for training.
  • the initialized data model can be, or can be based upon, a model retrieved from model storage 109 .
  • model optimizer 107 can evaluate the performance of the trained synthetic data model.
  • model optimizer 107 can be configured to store the trained synthetic data model in model storage 109 .
  • model optimizer 107 can be configured to determine one or more values for similarity and/or predictive accuracy metrics, as described herein. In some embodiments, based on values for similarity metrics, model optimizer 107 can be configured to assign a category to the synthetic data model.
  • the synthetic data model generates data maintaining a moderate level of correlation or similarity with the original data, matches well with the original schema, and does not generate too many row or value duplicates.
  • the synthetic data model may generate data maintaining a high level of correlation or similarity with the original data, and therefore could potentially allow the original data to be discerned from the synthetic data (e.g., a data leak).
  • a synthetic data model generating data that fails to match the schema of the original data, or that provides many duplicated rows and values, may also be placed in this category.
  • the synthetic data model may likely generate data maintaining a high level of correlation or similarity with the original data, likely allowing a data leak.
  • a synthetic data model generating data that badly fails to match the schema of the original data, or that provides far too many duplicated rows and values, may also be placed in this category.
  • system 100 can be configured to provide instructions for improving the quality of the synthetic data model. If a user requires synthetic data reflecting less correlation or similarity with the original data, the user can change the models' parameters to make them perform worse (e.g., by decreasing the number of layers in GAN models, or reducing the number of training iterations). If the users want the synthetic data to have better quality, they can change the models' parameters to make them perform better (e.g., by increasing the number of layers in GAN models, or increasing the number of training iterations).
  • Process 200 can then proceed to step 207 . In step 207 , model curator 111 can evaluate the trained synthetic data model for compliance with governance criteria.
  • FIG. 3 depicts a process 300 for generating a data model using an existing synthetic data model, consistent with disclosed embodiments.
  • Process 300 can include the steps of retrieving a synthetic dataset model from model storage 109 , retrieving data from database 105 , providing synthetic data to computing resources 101 , providing an initialized data model to computing resources 101 , and providing a trained data model to model optimizer 107 . In this manner, process 300 can allow system 100 to generate a model using synthetic data.
  • Process 300 can then proceed to step 301 .
  • dataset generator 103 can retrieve a training dataset from database 105 .
  • the training dataset can include actual training data, in some aspects.
  • the training dataset can include synthetic training data, in some aspects.
  • dataset generator 103 can be configured to generate synthetic data from sample values.
  • dataset generator 103 can be configured to use the generative network of a generative adversarial network to generate data samples from random-valued vectors. In such embodiments, process 300 may forgo step 301 .
  • Process 300 can then proceed to step 303 .
  • dataset generator 103 can be configured to receive a synthetic data model from model storage 109 .
  • model storage 109 can be configured to provide the synthetic data model to dataset generator 103 in response to a request from dataset generator 103 .
  • model storage 109 can be configured to provide the synthetic data model to dataset generator 103 in response to a request from model optimizer 107 , or another component of system 100 .
  • the synthetic data model can be a neural network, recurrent neural network (which may include LSTM units), generative adversarial network, kernel density estimator, random value generator, or the like.
  • dataset generator 103 can generate synthetic data.
  • Dataset generator 103 can be configured, in some embodiments, to identify sensitive data items (e.g., account numbers, social security numbers, names, addresses, API keys, network or IP addresses, or the like) in the data received from database 105 .
  • dataset generator 103 can be configured to identify sensitive data items using a recurrent neural network.
  • Dataset generator 103 can be configured to use the data model retrieved from model storage 109 to generate a synthetic dataset by replacing the sensitive data items with synthetic data items.
  • Dataset generator 103 can be configured to provide the synthetic dataset to computing resources 101 .
  • dataset generator 103 can be configured to provide the synthetic dataset to computing resources 101 in response to a request from computing resources 101 , model optimizer 107 , or another component of system 100 .
  • dataset generator 103 can be configured to provide the synthetic dataset to database 105 for storage.
  • computing resources 101 can be configured to subsequently retrieve the synthetic dataset from database 105 directly, or indirectly through model optimizer 107 or dataset generator 103 .
  • Process 300 can then proceed to step 307 .
  • computing resources 101 can be configured to receive a data model from model optimizer 107 , consistent with disclosed embodiments.
  • the data model can be at least partially initialized by model optimizer 107 .
  • at least some of the initial weights and offsets of a neural network model received by computing resources 101 in step 307 can be set by model optimizer 107 .
  • computing resources 101 can be configured to receive at least some training parameters from model optimizer 107 (e.g., batch size, number of training batches, number of epochs, chunk size, time window, input noise dimension, or the like).
  • Process 300 can then proceed to step 309 .
  • computing resources 101 can generate a trained data model using the data model received from model optimizer 107 and the synthetic dataset received from dataset generator 103 .
  • computing resources 101 can be configured to train the data model received from model optimizer 107 until some training criterion is satisfied.
  • the training criterion can be, for example, a performance criterion (e.g., a Mean Absolute Error, Root Mean Squared Error, percent good classification, and the like), a convergence criterion (e.g., a minimum required improvement of a performance criterion over iterations or over time, a minimum required change in model parameters over iterations or over time), elapsed time or number of iterations, or the like.
  • the performance criterion can be a threshold value for a similarity metric or prediction accuracy metric as described herein. Satisfaction of the training criterion can be determined by one or more of computing resources 101 and model optimizer 107 .
  • computing resources 101 can be configured to update model optimizer 107 regarding the training status of the data model.
  • computing resources 101 can be configured to provide the current parameters of the data model and/or current performance criteria of the data model.
  • model optimizer 107 can be configured to stop the training of the data model by computing resources 101 .
  • model optimizer 107 can be configured to retrieve the data model from computing resources 101 .
  • computing resources 101 can be configured to stop training the data model and provide the trained data model to model optimizer 107 .
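The training-criterion logic described above can be sketched as follows. This is an illustrative Python sketch, not part of the disclosed embodiments: the function names, the thresholds, and the choice of RMSE as the performance criterion are assumptions chosen for illustration.

```python
import math

def train_until_criterion(train_step, max_iters=1000,
                          rmse_threshold=0.05, min_improvement=1e-4):
    """Run training iterations until a training criterion is satisfied.

    train_step() performs one training iteration and returns the current
    RMSE (a performance criterion). Training stops when the RMSE falls
    below a threshold (performance criterion), when per-iteration
    improvement falls below a minimum (convergence criterion), or when
    the iteration limit is reached (elapsed-iterations criterion).
    """
    previous_rmse = math.inf
    rmse = math.inf
    for iteration in range(1, max_iters + 1):
        rmse = train_step()
        if rmse <= rmse_threshold:
            return iteration, rmse, "performance criterion met"
        if previous_rmse - rmse < min_improvement:
            return iteration, rmse, "convergence criterion met"
        previous_rmse = rmse
    return max_iters, rmse, "iteration limit reached"
```

In a distributed setting such as system 100, the check could instead be performed by model optimizer 107 using status updates from computing resources 101; the stopping conditions are the same.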
  • FIG. 4 depicts a specific implementation (system 400 ) of system 100 of FIG. 1 .
  • the functionality of system 100 can be divided between a distributor 401 , a dataset generation instance 403 , a development environment 405 , a model optimization instance 409 , and a production environment 411 .
  • system 100 can be implemented in a stable and scalable fashion using a distributed computing environment, such as a public cloud-computing environment, a private cloud computing environment, a hybrid cloud computing environment, a computing cluster or grid, or the like.
  • dataset generator 103 and model optimizer 107 can be hosted by separate virtual computing instances of the cloud computing system.
  • Distributor 401 can be configured to provide, consistent with disclosed embodiments, an interface between the components of system 400 , and between the components of system 400 and other systems.
  • distributor 401 can be configured to implement interface 113 and a load balancer.
  • Distributor 401 can be configured to route messages between computing resources 101 (e.g., implemented on one or more of development environment 405 and production environment 411 ), dataset generator 103 (e.g., implemented on dataset generator instance 403 ), and model optimizer 107 (e.g., implemented on model optimization instance 409 ).
  • the messages can include data and instructions.
  • the messages can include model generation requests and trained models provided in response to model generation requests.
  • the messages can include synthetic data sets or synthetic data streams.
  • distributor 401 can be implemented using one or more EC2 clusters or the like.
  • Data generation instance 403 can be configured to generate synthetic data, consistent with disclosed embodiments. In some embodiments, data generation instance 403 can be configured to receive actual or synthetic data from data source 417 . In various embodiments, data generation instance 403 can be configured to receive synthetic data models for generating the synthetic data. In some aspects, the synthetic data models can be received from another component of system 400 , such as data source 417 .
  • Development environment 405 can be configured to implement at least a portion of the functionality of computing resources 101 , consistent with disclosed embodiments.
  • development environment 405 can be configured to train data models for subsequent use by other components of system 400 .
  • development instances (e.g., development instance 407 ) of development environment 405 can train one or more individual data models.
  • development environment 405 can be configured to spin up additional development instances to train additional data models, as needed.
  • a development instance can implement an application framework such as TENSORBOARD, JUPYTER and the like; as well as machine learning applications like TENSORFLOW, CUDNN, KERAS, and the like. Consistent with disclosed embodiments, these application frameworks and applications can enable the specification and training of data models.
  • development environment 405 can be implemented using one or more EC2 clusters or the like.
  • Model optimization instance 409 can be configured to manage training and provision of data models by system 400 .
  • model optimization instance 409 can be configured to provide the functionality of model optimizer 107 .
  • model optimization instance 409 can be configured to provide training parameters and at least partially initialized data models to development environment 405 . This selection can be based on model performance feedback received from development environment 405 .
  • model optimization instance 409 can be configured to determine whether a data model satisfies performance criteria.
  • model optimization instance 409 can be configured to provide trained models and descriptive information concerning the trained models to another component of system 400 .
  • model optimization instance 409 can be implemented using one or more EC2 clusters or the like.
  • Production environment 411 can be configured to implement at least a portion of the functionality of computing resources 101 , consistent with disclosed embodiments.
  • production environment 411 can be configured to use previously trained data models to process data received by system 400 .
  • a production instance (e.g., production instance 413 ) of production environment 411 can be configured to process data using a previously trained data model.
  • the production instance can implement an application framework such as TENSORBOARD, JUPYTER and the like; as well as machine learning applications like TENSORFLOW, CUDNN, KERAS, and the like. Consistent with disclosed embodiments, these application frameworks and applications can enable processing of data using data models.
  • production environment 411 can be implemented using one or more EC2 clusters or the like.
  • a component of system 400 can determine the data model and data source for a production instance according to the purpose of the data processing. For example, system 400 can configure a production instance to produce synthetic data for consumption by other systems. In this example, the production instance can then provide synthetic data for testing another application. As a further example, system 400 can configure a production instance to generate outputs using actual data. For example, system 400 can configure a production instance with a data model for detecting fraudulent transactions. The production instance can then receive a stream of financial transaction data and identify potentially fraudulent transactions. In some aspects, this data model may have been trained by system 400 using synthetic data created to resemble the stream of financial transaction data. System 400 can be configured to provide an indication of the potentially fraudulent transactions to another system configured to take appropriate action (e.g., reversing the transaction, contacting one or more of the parties to the transaction, or the like).
  • Production environment 411 can be configured to host a file system 415 for interfacing between one or more production instances and data source 417 .
  • data source 417 can be configured to store data in file system 415
  • the one or more production instances can be configured to retrieve the stored data from file system 415 for processing.
  • file system 415 can be configured to scale as needed.
  • file system 415 can be configured to support parallel access by data source 417 and the one or more production instances.
  • file system 415 can be an instance of AMAZON ELASTIC FILE SYSTEM (EFS) or the like.
  • Data source 417 can be configured to provide data to other components of system 400 .
  • data source 417 can include sources of actual data, such as streams of transaction data, human resources data, web log data, web security data, web protocols data, or system logs data.
  • System 400 can also be configured to implement model storage 109 using a database (not shown) accessible to at least one other component of system 400 (e.g., distributor 401 , dataset generation instance 403 , development environment 405 , model optimization instance 409 , or production environment 411 ).
  • the database can be an s3 bucket, relational database, or the like.
  • FIG. 5A depicts process 500 for generating synthetic data using class-specific models, consistent with disclosed embodiments.
  • System 100 may be configured to use such synthetic data in training a data model for use in another application (e.g., a fraud detection application).
  • Process 500 can include the steps of retrieving actual data, determining classes of sensitive portions of the data, generating synthetic data using a data model for the appropriate class, and replacing the sensitive data portions with the synthetic data portions.
  • the data model can be a generative adversarial network trained to generate synthetic data satisfying a similarity criterion, as described herein.
  • process 500 can generate better synthetic data that more accurately models the underlying actual data than randomly generated training data that lacks the latent structures present in the actual data. Because the synthetic data more accurately models the underlying actual data, a data model trained using this improved synthetic data may perform better processing the actual data.
  • Process 500 can then proceed to step 501 .
  • dataset generator 103 can be configured to retrieve actual data.
  • the actual data may have been gathered during the course of ordinary business operations, marketing operations, research operations, or the like.
  • Dataset generator 103 can be configured to retrieve the actual data from database 105 or from another system.
  • the actual data may have been purchased in whole or in part by an entity associated with system 100 . As would be understood from this description, the source and composition of the actual data is not intended to be limiting.
  • Process 500 can then proceed to step 503 .
  • dataset generator 103 can be configured to determine classes of the sensitive portions of the actual data.
  • classes could include account numbers and merchant names.
  • classes could include employee identification numbers, employee names, employee addresses, contact information, marital or beneficiary information, title and salary information, and employment actions.
  • dataset generator 103 can be configured with a classifier for distinguishing different classes of sensitive information.
  • dataset generator 103 can be configured with a recurrent neural network for distinguishing different classes of sensitive information.
  • Dataset generator 103 can be configured to apply the classifier to the actual data to determine that a sensitive portion of the training dataset belongs to the data class. For example, when the data stream includes the text string “Lorem ipsum 012-34-5678 dolor sit amet,” the classifier may be configured to indicate that positions 13-23 of the text string include a potential social security number. Though described with reference to character string substitutions, the disclosed systems and methods are not so limited.
  • the actual data can include unstructured data (e.g., character strings, tokens, and the like) and structured data (e.g., key-value pairs, relational database files, spreadsheets, and the like).
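As an illustration of the classifier interface described above, the following sketch reports the 1-based positions of a potential social security number in a text string. The disclosed embodiments use a recurrent neural network for this classification; the regular expression and function name here are illustrative stand-ins showing only the input/output behavior.

```python
import re

# Illustrative pattern for a US social security number in "AAA-GG-SSSS" form.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def find_ssn_span(text):
    """Return the 1-based (start, end) positions of a potential social
    security number in text, or None if no candidate is found."""
    match = SSN_PATTERN.search(text)
    if match is None:
        return None
    # match.start() is 0-based; match.end() is 0-based exclusive, which
    # equals the 1-based inclusive end position.
    return match.start() + 1, match.end()
```

For the example string above, this reports positions 13-23, matching the classifier output described in the text.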
  • Process 500 can then proceed to step 505 .
  • dataset generator 103 can be configured to generate a synthetic portion using a class-specific model.
  • dataset generator 103 can generate a synthetic social security number using a synthetic data model trained to generate social security numbers.
  • this class-specific synthetic data model can be trained to generate synthetic portions similar to those appearing in the actual data.
  • social security numbers include an area number indicating geographic information and a group number indicating date-dependent information
  • the range of social security numbers present in an actual dataset can depend on the geographic origin and purpose of that dataset.
  • a dataset of social security numbers for elementary school children in a particular school district may exhibit different characteristics than a dataset of social security numbers for employees of a national corporation.
  • the social security-specific synthetic data model could generate the synthetic portion “013-74-3285.”
  • Process 500 can then proceed to step 507 .
  • dataset generator 103 can be configured to replace the sensitive portion of the actual data with the synthetic portion.
  • dataset generator 103 could be configured to replace the characters at positions 13-23 of the text string with the values “013-74-3285,” creating the synthetic text string “Lorem ipsum 013-74-3285 dolor sit amet.”
  • This text string can now be distributed without disclosing the sensitive information originally present. But this text string can still be used to train models that make valid inferences regarding the actual data, because synthetic social security numbers generated by the synthetic data model share the statistical characteristic of the actual data.
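The substitution in step 507 can be sketched as simple span replacement, where the positions are the 1-based positions reported by the classifier (the helper name is illustrative):

```python
def replace_span(text, start, end, synthetic):
    """Replace the 1-based inclusive character span [start, end] of text
    with a synthetic value of the same class."""
    return text[:start - 1] + synthetic + text[end:]
```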
  • FIG. 5B depicts a process 510 for generating synthetic data using class and subclass-specific models, consistent with disclosed embodiments.
  • Process 510 can include the steps of retrieving actual data, determining classes of sensitive portions of the data, selecting types for synthetic data used to replace the sensitive portions of the actual data, generating synthetic data using a data model for the appropriate type and class, and replacing the sensitive data portions with the synthetic data portions.
  • the data model can be a generative adversarial network trained to generate synthetic data satisfying a similarity criterion, as described herein. This improvement addresses a problem with synthetic data generation, that a synthetic data model may fail to generate examples of proportionately rare data subclasses.
  • a model of the synthetic data may generate only examples of the most common data subclasses.
  • the synthetic data model effectively focuses on generating the best examples of the most common data subclasses, rather than acceptable examples of all the data subclasses.
  • Process 510 addresses this problem by expressly selecting subclasses of the synthetic data class according to a distribution model based on the actual data.
  • Process 510 can then proceed through step 511 and step 513 , which resemble step 501 and step 503 in process 500 .
  • dataset generator 103 can be configured to receive actual data.
  • dataset generator 103 can be configured to determine classes of sensitive portions of the actual data.
  • dataset generator 103 can be configured to determine that a sensitive portion of the data may contain a financial service account number.
  • Dataset generator 103 can be configured to identify this sensitive portion of the data as a financial service account number using a classifier, which may in some embodiments be a recurrent neural network (which may include LSTM units).
  • Process 510 can then proceed to step 515 .
  • dataset generator 103 can be configured to select a subclass for generating the synthetic data. In some aspects, this selection is not governed by the subclass of the identified sensitive portion. For example, in some embodiments the classifier that identifies the class need not be sufficiently discerning to identify the subclass, relaxing the requirements on the classifier. Instead, this selection is based on a distribution model. For example, dataset generator 103 can be configured with a statistical distribution of subclasses (e.g., a univariate distribution of subclasses) for that class and can select one of the subclasses for generating the synthetic data according to the statistical distribution.
  • dataset generator 103 can be configured to select the trust account subclass 1 time in 20, and use a synthetic data model for financial service account numbers for trust accounts to generate the synthetic data.
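The distribution-based subclass selection described above can be sketched as weighted random sampling. The subclass names and weights below are invented for illustration; only the 1-in-20 trust-account rate follows the example above.

```python
import random

# Hypothetical univariate distribution of financial service account
# subclasses (weights are illustrative, not from the disclosure).
SUBCLASS_WEIGHTS = {
    "checking_account": 0.60,
    "savings_account": 0.35,
    "trust_account": 0.05,  # selected roughly 1 time in 20
}

def select_subclass(rng=random):
    """Select a subclass for synthetic data generation according to the
    statistical distribution of subclasses in the actual data."""
    subclasses = list(SUBCLASS_WEIGHTS)
    weights = list(SUBCLASS_WEIGHTS.values())
    return rng.choices(subclasses, weights=weights, k=1)[0]
```

The selected subclass then determines which class- and subclass-specific synthetic data model is used in step 517.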
  • dataset generator 103 can be configured with a recurrent neural network that estimates the next subclass based on the current and previous subclasses.
  • healthcare records can include cancer diagnosis stage as sensitive data. Most cancer diagnosis stage values may be “no cancer” and the value of “stage 1” may be rare, but when present in a patient record this value may be followed by “stage 2,” etc.
  • the recurrent neural network can be trained on the actual healthcare records to use prior and current cancer diagnosis stage values when selecting the subclass. For example, when generating a synthetic healthcare record, the recurrent neural network can be configured to use the previously selected cancer diagnosis stage subclass in selecting the present cancer diagnosis stage subclass. In this manner, the synthetic healthcare record can exhibit an appropriate progression of patient health that matches the progression in the actual data.
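A minimal stand-in for this sequential selection is a first-order transition model. The disclosed embodiments use a recurrent neural network conditioned on prior subclasses; the transition probabilities below are invented for illustration, encoding only that "stage 1" tends to be followed by "stage 2" rather than by "no cancer".

```python
import random

# Hypothetical transition probabilities between cancer diagnosis stage
# subclasses (illustrative values, not learned from real records).
TRANSITIONS = {
    "no cancer": {"no cancer": 0.98, "stage 1": 0.02},
    "stage 1":   {"stage 1": 0.30, "stage 2": 0.70},
    "stage 2":   {"stage 2": 1.00},
}

def next_stage(current, rng=random):
    """Select the next subclass conditioned on the current one."""
    choices = TRANSITIONS[current]
    return rng.choices(list(choices), weights=list(choices.values()), k=1)[0]
```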
  • Process 510 can then proceed to step 517 .
  • in step 517 , which resembles step 505 , dataset generator 103 can be configured to generate synthetic data using a class- and subclass-specific model.
  • dataset generator 103 can be configured to use a synthetic data model for trust account financial service account numbers to generate the synthetic financial service account number.
  • Process 510 can then proceed to step 519 .
  • in step 519 , which resembles step 507 , dataset generator 103 can be configured to replace the sensitive portion of the actual data with the generated synthetic data.
  • dataset generator 103 can be configured to replace the financial service account number in the actual data with the synthetic trust account financial service account number.
  • FIG. 6 depicts a process 600 for training a classifier for generation of synthetic data.
  • a classifier could be used by dataset generator 103 to classify sensitive data portions of actual data, as described above with regards to FIGS. 5A and 5B .
  • Process 600 can include the steps of receiving data sequences, receiving context sequences, generating training sequences, generating label sequences, and training a classifier using the training sequences and the label sequences. By using known data sequences and context sequences unlikely to contain sensitive data, process 600 can be used to automatically generate a corpus of labeled training data.
  • Process 600 can be performed by a component of system 100 , such as dataset generator 103 or model optimizer 107 .
  • Process 600 can then proceed to step 601 .
  • system 100 can receive training data sequences.
  • the training data sequences can be received from a dataset.
  • the dataset providing the training data sequences can be a component of system 100 (e.g., database 105 ) or a component of another system.
  • the data sequences can include multiple classes of sensitive data.
  • the data sequences can include account numbers, social security numbers, and full names.
  • Process 600 can then proceed to step 603 .
  • system 100 can receive context sequences.
  • the context sequences can be received from a dataset.
  • the dataset providing the context sequences can be a component of system 100 (e.g., database 105 ) or a component of another system.
  • the context sequences can be drawn from a corpus of pre-existing data, such as an open-source text dataset (e.g., Yelp Open Dataset or the like).
  • the context sequences can be snippets of this pre-existing data, such as a sentence or paragraph of the pre-existing data.
  • Process 600 can then proceed to step 605 .
  • system 100 can generate training sequences.
  • system 100 can be configured to generate a training sequence by inserting a data sequence into a context sequence.
  • the data sequence can be inserted into the context sequence without replacement of elements of the context sequence or with replacement of elements of the context sequence.
  • the data sequence can be inserted into the context sequence between elements (e.g., at a whitespace character, tab, semicolon, html closing tag, or other semantic breakpoint) or without regard to the semantics of the context sequence.
  • the training sequence can be “Lorem ipsum dolor sit amet, 013-74-3285 consectetur adipiscing elit, sed do eiusmod,” “Lorem ipsum dolor sit amet, 013-74-3285 adipiscing elit, sed do eiusmod,” or “Lorem ipsum dolor sit amet, conse013-74-3285ctetur adipiscing elit, sed do eiusmod.”
  • a training sequence can include multiple data sequences.
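The training-sequence generation in step 605 can be sketched as insertion at a whitespace breakpoint without replacement of context elements. The function name and return convention are illustrative assumptions.

```python
import random

def make_training_sequence(context, data_sequence, rng=random):
    """Insert data_sequence into context at a randomly chosen whitespace
    breakpoint, keeping all context elements. Returns the training
    sequence and the 0-based insertion position (used for labeling)."""
    # Candidate semantic breakpoints: positions of whitespace characters.
    breakpoints = [i for i, ch in enumerate(context) if ch == " "]
    position = rng.choice(breakpoints) + 1  # insert just after the space
    training = context[:position] + data_sequence + " " + context[position:]
    return training, position
```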
  • process 600 can proceed to step 607 .
  • system 100 can generate a label sequence.
  • the label sequence can indicate a position of the inserted data sequence in the training sequence.
  • the label sequence can indicate the class of the data sequence.
  • the label sequence can be “0000000000000000111111111110000000000000000000,” where the value “0” indicates that a character is not part of a sensitive data portion and the value “1” indicates that a character is part of the social security number.
  • a different class or subclass of data sequence could include a different value specific to that class or subclass. Because system 100 creates the training sequences, system 100 can automatically create accurate labels for the training sequences.
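The label-sequence generation in step 607 can be sketched as follows, using 0-based indices and the single-class "0"/"1" labeling of the example above (the helper name is illustrative; other classes would use other class-specific values):

```python
def make_label_sequence(training_sequence, start, length):
    """Build a character-level label sequence for training_sequence:
    '1' for characters of the inserted sensitive data portion (beginning
    at 0-based index start, spanning length characters), '0' elsewhere."""
    labels = ["0"] * len(training_sequence)
    for i in range(start, start + length):
        labels[i] = "1"
    return "".join(labels)
```

Because the insertion position and length are known from step 605, the label is exact by construction, which is what makes the generated corpus automatically and accurately labeled.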
  • Process 600 can then proceed to step 609 .
  • system 100 can be configured to use the training sequences and the label sequences to train a classifier.
  • the label sequences can provide a “ground truth” for training a classifier using supervised learning.
  • the classifier can be a recurrent neural network (which may include LSTM units).
  • the recurrent neural network can be configured to predict whether a character of a training sequence is part of a sensitive data portion. This prediction can be checked against the label sequence to generate an update to the weights and offsets of the recurrent neural network. This update can then be propagated through the recurrent neural network, according to methods described in “Training Recurrent Neural Networks,” 2013, by Ilya Sutskever, which is incorporated herein by reference in its entirety.
  • FIG. 7 depicts a process 700 for training a classifier for generation of synthetic data, consistent with disclosed embodiments.
  • a data sequence 701 can include preceding samples 703 , current sample 705 , and subsequent samples 707 .
  • data sequence 701 can be a subset of a training sequence, as described above with regard to FIG. 6 .
  • Data sequence 701 may be applied to recurrent neural network 709 .
  • neural network 709 can be configured to estimate whether current sample 705 is part of a sensitive data portion of data sequence 701 based on the values of preceding samples 703 , current sample 705 , and subsequent samples 707 .
  • preceding samples 703 can include between 1 and 100 samples, for example between 25 and 75 samples.
  • subsequent samples 707 can include between 1 and 100 samples, for example between 25 and 75 samples.
  • the preceding samples 703 and the subsequent samples 707 can be paired and provided to recurrent neural network 709 together. For example, in a first iteration, the first sample of preceding samples 703 and the last sample of subsequent samples 707 can be provided to recurrent neural network 709 . In the next iteration, the second sample of preceding samples 703 and the second-to-last sample of subsequent samples 707 can be provided to recurrent neural network 709 .
  • System 100 can continue to provide samples to recurrent neural network 709 until all of preceding samples 703 and subsequent samples 707 have been input to recurrent neural network 709 .
  • System 100 can then provide current sample 705 to recurrent neural network 709 .
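The pairing order described above can be sketched directly (illustrative helper): the first of the preceding samples is paired with the last of the subsequent samples, then the second with the second-to-last, and so on, before the current sample is input.

```python
def pairing_order(preceding, subsequent):
    """Return the order in which (preceding, subsequent) sample pairs are
    presented to the recurrent network: first preceding sample with last
    subsequent sample, then moving inward from both ends."""
    return list(zip(preceding, reversed(subsequent)))
```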
  • the output of recurrent neural network 709 after the input of current sample 705 can be estimated label 711 .
  • Estimated label 711 can be the inferred class or subclass of current sample 705 , given data sequence 701 as input.
  • estimated label 711 can be compared to actual label 713 to calculate a loss function. Actual label 713 can correspond to data sequence 701 .
  • actual label 713 can be an element of the label sequence corresponding to the training sequence.
  • actual label 713 can occupy the same position in the label sequence as occupied by current sample 705 in the training sequence.
  • system 100 can be configured to update recurrent neural network 709 using loss function 715 based on a result of the comparison.
  • FIG. 8 depicts a process 800 for training a generative adversarial network using a normalized reference dataset.
  • the generative adversarial network can be used by system 100 (e.g., by dataset generator 103 ) to generate synthetic data (e.g., as described above with regards to FIGS. 2, 3, 5A and 5B ).
  • the generative adversarial network can include a generator network and a discriminator network.
  • the generator network can be configured to learn a mapping from a sample space (e.g., a random number or vector) to a data space (e.g. the values of the sensitive data).
  • the discriminator can be configured to determine, when presented with either an actual data sample or a sample of synthetic data generated by the generator network, whether the sample was generated by the generator network or was a sample of actual data. As training progresses, the generator can improve at generating the synthetic data and the discriminator can improve at determining whether a sample is actual or synthetic data. In this manner, a generator can be automatically trained to generate synthetic data similar to the actual data.
  • a generative adversarial network can be limited by the actual data. For example, an unmodified generative adversarial network may be unsuitable for use with categorical data or data including missing values, not-a-numbers, or the like. For example, the generative adversarial network may not know how to interpret such data. Disclosed embodiments address this technical problem by at least one of normalizing categorical data or replacing missing values with supra-normal values.
  • Process 800 can then proceed to step 801 .
  • system 100 (e.g., dataset generator 103 ) can receive a reference dataset, consistent with disclosed embodiments.
  • the reference dataset can include categorical data.
  • the reference dataset can include spreadsheets or relational databases with categorical-valued data columns.
  • the reference dataset can include missing values, not-a-number values, or the like.
  • Process 800 can then proceed to step 803 .
  • system 100 (e.g., dataset generator 103 ) can generate a normalized training dataset by normalizing the reference dataset.
  • system 100 can be configured to normalize categorical data contained in the reference dataset.
  • system 100 can be configured to normalize the categorical data by converting this data to numerical values.
  • the numerical values can lie within a predetermined range.
  • the predetermined range can be zero to one.
  • for example, when a categorical data column of the reference dataset contains days of the week, system 100 can be configured to map these days to values between zero and one.
  • system 100 can be configured to normalize numerical data in the reference dataset as well, mapping the values of the numerical data to a predetermined range.
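The categorical normalization in step 803 can be sketched as follows. The evenly spaced mapping is one illustrative choice; any fixed mapping of categories into the predetermined range of zero to one would serve.

```python
def normalize_categorical(values):
    """Map each distinct category in values to an evenly spaced number
    in [0, 1]. Returns the normalized values and the category mapping
    (needed later to invert the normalization)."""
    categories = sorted(set(values))
    step = 1.0 / max(len(categories) - 1, 1)
    mapping = {cat: i * step for i, cat in enumerate(categories)}
    return [mapping[v] for v in values], mapping
```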
  • Process 800 can then proceed to step 805 .
  • system 100 (e.g., dataset generator 103 ) can generate the normalized training dataset by converting special values to values outside the predetermined range.
  • system 100 can be configured to assign missing values a first numerical value outside the predetermined range.
  • system 100 can be configured to assign not-a-number values to a second numerical value outside the predetermined range.
  • the first value and the second value can differ.
  • system 100 can be configured to map the categorical values and the numerical values to the range of zero to one.
  • system 100 can then map missing values to the numerical value 1.5.
  • system 100 can then map not-a-number values to the numerical value −0.5. In this manner system 100 can preserve information about the actual data while enabling training of the generative adversarial network.
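The special-value conversion in step 805 can be sketched as follows, using the example mappings above (missing values to 1.5, not-a-number values to −0.5, both outside the [0, 1] range of the normalized data):

```python
import math

MISSING_VALUE = 1.5        # first numerical value outside [0, 1]
NOT_A_NUMBER_VALUE = -0.5  # second, distinct value outside [0, 1]

def encode_special(value):
    """Pass normalized values in [0, 1] through unchanged; map missing
    values (represented here as None) and not-a-number values to distinct
    supra-normal values, preserving their presence for the model."""
    if value is None:
        return MISSING_VALUE
    if isinstance(value, float) and math.isnan(value):
        return NOT_A_NUMBER_VALUE
    return value
```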
  • Process 800 can then proceed to step 807 .
  • system 100 (e.g., dataset generator 103 ) can train the generative adversarial network using the normalized training dataset.
  • FIG. 9 depicts a process 900 for training a generative adversarial network using a loss function configured to ensure a predetermined degree of similarity, consistent with disclosed embodiments.
  • System 100 can be configured to use process 900 to generate synthetic data that is similar, but not too similar to the actual data, as the actual data can include sensitive personal information. For example, when the actual data includes social security numbers or account numbers, the synthetic data would preferably not simply recreate these numbers. Instead, system 100 would preferably create synthetic data that resembles the actual data, as described below, while reducing the likelihood of overlapping values. To address this technical problem, system 100 can be configured to determine a similarity metric value between the synthetic dataset and the normalized reference dataset, consistent with disclosed embodiments.
  • System 100 can be configured to use the similarity metric value to update a loss function for training the generative adversarial network. In this manner, system 100 can be configured to determine a synthetic dataset differing in value from the normalized reference dataset at least a predetermined amount according to the similarity metric.
  • dataset generator 103 can be configured to use such trained synthetic data models to generate synthetic data (e.g., as described above with regards to FIGS. 2 and 3 ).
  • development instances (e.g., development instance 407 ) and production instances (e.g., production instance 413 ) can be configured to generate data similar to a reference dataset according to the disclosed systems and methods.
  • Process 900 can then proceed to step 901 , which can resemble step 801 .
  • system 100 (e.g., model optimizer 107 , computational resources 101 , or the like) can receive a reference dataset.
  • system 100 can be configured to receive the reference dataset from a database (e.g., database 105 ).
  • the reference dataset can include categorical and/or numerical data.
  • the reference dataset can include spreadsheet or relational database data.
  • the reference dataset can include special values, such as missing values, not-a-number values, or the like.
  • Process 900 can then proceed to step 903 .
  • system 100 (e.g., dataset generator 103 , model optimizer 107 , computational resources 101 , or the like) can be configured to normalize the reference dataset.
  • system 100 can be configured to normalize the reference dataset as described above with regard to steps 803 and 805 of process 800 .
  • system 100 can be configured to normalize the categorical data and/or the numerical data in the reference dataset to a predetermined range.
  • system 100 can be configured to replace special values with numerical values outside the predetermined range.
  • Process 900 can then proceed to step 905 .
  • system 100 (e.g., model optimizer 107 , computational resources 101 , or the like) can generate a synthetic training dataset using the generative network.
  • system 100 can apply one or more random samples to the generative network to generate one or more synthetic data items.
  • system 100 can be configured to generate between 200 and 400,000 data items, or preferably between 20,000 and 40,000 data items.
  • Process 900 can then proceed to step 907 .
  • In step 907 , system 100 (e.g., model optimizer 107 , computational resources 101 , or the like) can determine a similarity metric value using the normalized reference dataset and the synthetic training dataset.
  • System 100 can be configured to generate the similarity metric value according to a similarity metric.
  • the similarity metric value can include at least one of a statistical correlation score (e.g., a score dependent on the covariances or univariate distributions of the synthetic data and the normalized reference dataset), a data similarity score (e.g., a score dependent on a number of matching or similar elements in the synthetic dataset and normalized reference dataset), or data quality score (e.g., a score dependent on at least one of a number of duplicate elements in each of the synthetic dataset and normalized reference dataset, a prevalence of the most common value in each of the synthetic dataset and normalized reference dataset, a maximum difference of rare values in each of the synthetic dataset and normalized reference dataset, the differences in schema between the synthetic dataset and normalized reference dataset, or the like).
  • System 100 can be configured to calculate these scores using the synthetic dataset and a reference dataset.
  • the similarity metric can depend on a covariance of the synthetic dataset and a covariance of the normalized reference dataset.
  • system 100 can be configured to generate a difference matrix using a covariance matrix of the normalized reference dataset and a covariance matrix of the synthetic dataset.
  • the difference matrix can be the difference between the covariance matrix of the normalized reference dataset and the covariance matrix of the synthetic dataset.
  • the similarity metric can depend on the difference matrix.
  • the similarity metric can depend on the summation of the squared values of the difference matrix. This summation can be normalized, for example by the square root of the product of the number of rows and number of columns of the covariance matrix for the normalized reference dataset.
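The covariance-based comparison above can be sketched as follows. This is a minimal illustration, not the claimed implementation; the function and variable names are my own.

```python
import numpy as np

def covariance_similarity(reference, synthetic):
    # Covariance matrices of the two datasets (columns are variables).
    cov_ref = np.cov(reference, rowvar=False)
    cov_syn = np.cov(synthetic, rowvar=False)
    diff = cov_ref - cov_syn  # difference matrix
    rows, cols = cov_ref.shape
    # Sum of squared entries of the difference matrix, normalized by the
    # square root of the product of the covariance matrix's dimensions.
    return float(np.sum(diff ** 2) / np.sqrt(rows * cols))
```

Identical datasets yield a value of zero; larger values indicate greater divergence between the covariance structures.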
  • the similarity metric can depend on a univariate value distribution of an element of the synthetic dataset and a univariate value distribution of an element of the normalized reference dataset.
  • system 100 can be configured to generate histograms having the same bins.
  • system 100 can be configured to determine a difference between the value of the bin for the synthetic data histogram and the value of the bin for the normalized reference dataset histogram.
  • the values of the bins can be normalized by the total number of datapoints in the histograms.
  • system 100 can be configured to determine a value (e.g., a maximum difference, an average difference, a Euclidean distance, or the like) of these differences.
  • the similarity metric can depend on a function of this value (e.g., a maximum, average, or the like) across the common elements.
  • the normalized reference dataset can include multiple columns of data.
  • the synthetic dataset can include corresponding columns of data.
  • the normalized reference dataset and the synthetic dataset can include the same number of rows.
  • System 100 can be configured to generate histograms for each column of data for each of the normalized reference dataset and the synthetic dataset.
  • for each bin, system 100 can determine the difference between the count of datapoints in the normalized reference dataset histogram and the synthetic dataset histogram. System 100 can determine the value for a column to be the maximum of the differences across its bins, and the value for the similarity metric to be the average of the values for the columns. As would be appreciated by one of skill in the art, this example is not intended to be limiting.
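The column-histogram example above can be sketched as follows. The function name and the default bin count are illustrative assumptions; the per-column maximum and cross-column average follow the example.

```python
import numpy as np

def histogram_similarity(reference, synthetic, bins=10):
    values = []
    for col in range(reference.shape[1]):
        # Shared bin edges spanning both datasets for this column.
        lo = min(reference[:, col].min(), synthetic[:, col].min())
        hi = max(reference[:, col].max(), synthetic[:, col].max())
        edges = np.linspace(lo, hi, bins + 1)
        ref_counts, _ = np.histogram(reference[:, col], bins=edges)
        syn_counts, _ = np.histogram(synthetic[:, col], bins=edges)
        # Normalize bin values by the total number of datapoints.
        ref_frac = ref_counts / len(reference)
        syn_frac = syn_counts / len(synthetic)
        # Maximum per-bin difference for this column.
        values.append(np.abs(ref_frac - syn_frac).max())
    # Average across columns.
    return float(np.mean(values))
```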
  • the similarity metric can depend on a number of elements of the synthetic dataset that match elements of the reference dataset.
  • the matching can be an exact match, with the value of an element in the synthetic dataset matching the value of an element in the normalized reference dataset.
  • the similarity metric can depend on the number of rows of the synthetic dataset that have the same values as rows of the normalized reference dataset.
  • the normalized reference dataset and synthetic dataset can have duplicate rows removed prior to performing this comparison.
  • System 100 can be configured to merge the non-duplicate normalized reference dataset and non-duplicate synthetic dataset by all columns. In this non-limiting example, the size of the resulting dataset will be the number of exactly matching rows. In some embodiments, system 100 can be configured to disregard columns that appear in one dataset but not the other when performing this comparison.
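As a sketch of the exact-match comparison above: after removing duplicate rows, merging the two datasets on all columns leaves exactly the rows common to both, which a set intersection over row tuples reproduces. Names here are illustrative.

```python
def exact_match_count(reference_rows, synthetic_rows):
    # Remove duplicate rows from each dataset.
    ref_unique = set(map(tuple, reference_rows))
    syn_unique = set(map(tuple, synthetic_rows))
    # Rows present in both datasets, i.e., the size of a merge on all columns.
    return len(ref_unique & syn_unique)
```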
  • the similarity metric can depend on a number of elements of the synthetic dataset that are similar to elements of the normalized reference dataset.
  • System 100 can be configured to calculate similarity between an element of the synthetic dataset and an element of the normalized reference dataset according to a distance measure.
  • the distance measure can depend on a Euclidean distance between the elements. For example, when the synthetic dataset and the normalized reference dataset include rows and columns, the distance measure can depend on a Euclidean distance between a row of the synthetic dataset and a row of the normalized reference dataset.
  • when comparing a synthetic dataset to an actual dataset including categorical data (e.g., a reference dataset that has not been normalized), the distance measure can depend on a Euclidean distance between numerical row elements and a Hamming distance between non-numerical row elements.
  • the Hamming distance can depend on a count of non-numerical elements differing between the row of the synthetic dataset and the row of the actual dataset.
  • the distance measure can be a weighted average of the Euclidean distance and the Hamming distance.
  • system 100 can be configured to disregard columns that appear in one dataset but not the other when performing this comparison.
  • system 100 can be configured to remove duplicate entries from the synthetic dataset and the normalized reference dataset before performing the comparison.
  • system 100 can be configured to calculate a distance measure between each row of the synthetic dataset (or a subset of the rows of the synthetic dataset) and each row of the normalized reference dataset (or a subset of the rows of the normalized reference dataset). System 100 can then determine the minimum distance value for each row of the synthetic dataset across all rows of the normalized reference dataset.
  • the similarity metric can depend on a function of the minimum distance values for all rows of the synthetic dataset (e.g., a maximum value, an average value, or the like).
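The mixed Euclidean/Hamming distance and the minimum-distance comparison above can be sketched as follows. The equal weighting and the use of the average over minima are assumptions (the disclosure leaves the weights and the final function open); all names are illustrative.

```python
import math

def row_distance(syn_row, ref_row, numeric_cols, weight=0.5):
    # Euclidean distance over the numerical elements of the rows.
    euclid = math.sqrt(sum((syn_row[i] - ref_row[i]) ** 2 for i in numeric_cols))
    # Hamming distance: count of differing non-numerical elements.
    hamming = sum(1 for i in range(len(syn_row))
                  if i not in numeric_cols and syn_row[i] != ref_row[i])
    # Weighted average of the two distances.
    return weight * euclid + (1 - weight) * hamming

def min_distance_similarity(synthetic, reference, numeric_cols):
    # For each synthetic row, the minimum distance to any reference row;
    # the metric here is the average of those minima.
    minima = [min(row_distance(s, r, numeric_cols) for r in reference)
              for s in synthetic]
    return sum(minima) / len(minima)
```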
  • the similarity metric can depend on a frequency of duplicate elements in the synthetic dataset and the normalized reference dataset.
  • system 100 can be configured to determine the number of duplicate elements in each of the synthetic dataset and the normalized reference dataset.
  • system 100 can be configured to determine the proportion of each dataset represented by at least some of the elements in each dataset. For example, system 100 can be configured to determine the proportion of the synthetic dataset having a particular value. In some aspects, this value may be the most frequent value in the synthetic dataset.
  • System 100 can be configured to similarly determine the proportion of the normalized reference dataset having a particular value (e.g., the most frequent value in the normalized reference dataset).
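The duplicate-frequency and most-common-value checks above can be sketched as follows; the function names are illustrative, not from the disclosure.

```python
from collections import Counter

def duplicate_count(values):
    # Number of duplicate elements in a dataset.
    return len(values) - len(set(values))

def most_common_proportion(values):
    # The most frequent value and the proportion of the dataset it represents.
    value, count = Counter(values).most_common(1)[0]
    return value, count / len(values)
```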
  • the similarity metric can depend on a relative prevalence of rare values in the synthetic and normalized reference dataset.
  • such rare values can be those present in a dataset with frequencies less than a predetermined threshold.
  • the predetermined threshold can be a value less than 20%, for example 10%.
  • System 100 can be configured to determine a prevalence of rare values in the synthetic and normalized reference dataset. For example, system 100 can be configured to determine counts of the rare values in a dataset and the total number of elements in the dataset. System 100 can then determine ratios of the counts of the rare values to the total number of elements in the datasets.
  • the similarity metric can depend on differences in the ratios between the synthetic dataset and the normalized reference dataset.
  • an exemplary dataset can be an access log for patient medical records that tracks the job title of the employee accessing a patient medical record.
  • the job title “Administrator” may be a rare value of job title and appear in 3% of the log entries.
  • System 100 can be configured to generate synthetic log data based on the actual dataset, but the job title “Administrator” may not appear in the synthetic log data.
  • the similarity metric can depend on the difference between the actual dataset prevalence (3%) and the synthetic log data prevalence (0%).
  • the job title “Administrator” may be overrepresented in the synthetic log data, appearing in 15% of the log entries (and therefore not a rare value in the synthetic log data when the predetermined threshold is 10%).
  • the similarity metric can depend on the difference between the actual dataset prevalence (3%) and the synthetic log data prevalence (15%).
  • the similarity metric can depend on a function of the differences in the ratios between the synthetic dataset and the normalized reference dataset.
  • the actual dataset may include 10 rare values with a prevalence under 10% of the dataset.
  • the difference between the prevalence of these 10 rare values in the actual dataset and the normalized reference dataset can range from −5% to 4%.
  • the similarity metric can depend on the greatest magnitude difference (e.g., the similarity metric could depend on the value −5% as the greatest magnitude difference).
  • the similarity metric can depend on the average of the magnitude differences, the Euclidean norm of the ratio differences, or the like.
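The rare-value comparison above, including the “Administrator” example, can be sketched as follows. The 10% threshold follows the example; the function name and the choice of greatest-magnitude difference as the returned summary are illustrative.

```python
from collections import Counter

def rare_value_differences(actual, synthetic, threshold=0.10):
    # Prevalence of each value in each dataset.
    actual_freq = {v: c / len(actual) for v, c in Counter(actual).items()}
    synth_freq = {v: c / len(synthetic) for v, c in Counter(synthetic).items()}
    # Rare values: present in the actual dataset below the threshold.
    rare = [v for v, f in actual_freq.items() if f < threshold]
    # Difference in prevalence for each rare value (0% if absent from synthetic).
    diffs = {v: actual_freq[v] - synth_freq.get(v, 0.0) for v in rare}
    # e.g., the greatest-magnitude difference across the rare values.
    worst = max(diffs.values(), key=abs) if diffs else 0.0
    return diffs, worst
```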
  • the similarity metric can depend on a difference in schemas between the synthetic dataset and the normalized reference dataset.
  • system 100 can be configured to determine a number of mismatched columns between the synthetic and normalized reference datasets, a number of mismatched column types between the synthetic and normalized reference datasets, a number of mismatched column categories between the synthetic and normalized reference datasets, and a number of mismatched numeric ranges between the synthetic and normalized reference datasets.
  • the value of the similarity metric can depend on the number of at least one of the mismatched columns, mismatched column types, mismatched column categories, or mismatched numeric ranges.
  • the similarity metric can depend on one or more of the above criteria.
  • the similarity metric can depend on one or more of (1) a covariance of the output data and a covariance of the normalized reference dataset, (2) a univariate value distribution of an element of the synthetic dataset, (3) a univariate value distribution of an element of the normalized reference dataset, (4) a number of elements of the synthetic dataset that match elements of the reference dataset, (5) a number of elements of the synthetic dataset that are similar to elements of the normalized reference dataset, (6) a distance measure between each row of the synthetic dataset (or a subset of the rows of the synthetic dataset) and each row of the normalized reference dataset (or a subset of the rows of the normalized reference dataset), (7) a frequency of duplicate elements in the synthetic dataset and the normalized reference dataset, (8) a relative prevalence of rare values in the synthetic and normalized reference dataset, and (9) differences in the ratios between the synthetic dataset and the normalized reference dataset.
  • System 100 can be configured to compare a synthetic dataset to a normalized reference dataset, to compare a synthetic dataset to an actual (unnormalized) dataset, or more generally to compare any two datasets according to a similarity metric, consistent with disclosed embodiments.
  • model optimizer 107 can be configured to perform such comparisons.
  • model storage 105 can be configured to store similarity metric information (e.g., similarity values, indications of comparison datasets, and the like) together with a synthetic dataset.
  • Process 900 can then proceed to step 909 .
  • In step 909 , system 100 (e.g., model optimizer 107 , computational resources 101 , or the like) can train the generative adversarial network using the similarity metric value.
  • system 100 can be configured to determine that the synthetic dataset satisfies a similarity criterion.
  • the similarity criterion can concern at least one of the similarity metrics described above.
  • the similarity criterion can concern at least one of a statistical correlation score between the synthetic dataset and the normalized reference dataset, a data similarity score between the synthetic dataset and the reference dataset, or a data quality score for the synthetic dataset.
  • synthetic data satisfying the similarity criterion can be too similar to the reference dataset.
  • System 100 can be configured to update a loss function for training the generative adversarial network to decrease the similarity between the reference dataset and synthetic datasets generated by the generative adversarial network when the similarity criterion is satisfied.
  • the loss function of the generative adversarial network can be configured to penalize generation of synthetic data that is too similar to the normalized reference dataset, up to a certain threshold.
  • a penalty term can be added to the loss function of the generative adversarial network. This term can penalize the calculated loss if the dissimilarity between the synthetic data and the actual data goes below a certain threshold.
  • this penalty term can thereby ensure that the value of the similarity metric exceeds some similarity threshold, or remains near the similarity threshold (e.g., the value of the similarity metric may exceed 90% of the value of the similarity threshold).
  • decreasing values of the similarity metric can indicate increasing similarity.
  • System 100 can then update the loss function such that the likelihood of generating synthetic data like the current synthetic data is reduced. In this manner, system 100 can train the generative adversarial network using a loss function that penalizes generation of data differing from the reference dataset by less than the predetermined amount.
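One way the penalty term described above might look, purely as an illustration: since decreasing values of the similarity metric indicate increasing similarity, a penalty proportional to the shortfall below the threshold discourages overly similar synthetic data. The names `base_loss` and `penalty_weight` and the linear form of the penalty are assumptions, not from the disclosure.

```python
def penalized_generator_loss(base_loss, similarity_value, similarity_threshold,
                             penalty_weight=1.0):
    # Penalize only when the metric falls below the threshold, i.e.,
    # when the synthetic data is too similar to the reference data.
    shortfall = max(0.0, similarity_threshold - similarity_value)
    return base_loss + penalty_weight * shortfall
```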
  • FIG. 10 depicts a process 1000 for supplementing or transforming datasets using code-space operations, consistent with disclosed embodiments.
  • Process 1000 can include the steps of generating encoder and decoder models that map between a code space and a sample space, identifying representative points in code space, generating a difference vector in code space, and generating extreme points or transforming a dataset using the difference vector.
  • process 1000 can support model validation and simulation of conditions differing from those present during generation of a training dataset.
  • process 1000 can support model validation by inferring datapoints that occur infrequently or outside typical operating conditions.
  • a training dataset can include operations and interactions typical of a first user population.
  • Process 1000 can support simulation of operations and interactions typical of a second user population that differs from the first user population.
  • a young user population may interact with a system.
  • Process 1000 can support generation of a synthetic training dataset representative of an older user population interacting with the system. This synthetic training dataset can be used to simulate performance of the system with an older user population, before developing that userbase.
  • In step 1001 , system 100 can generate an encoder model and a decoder model.
  • system 100 can be configured to generate an encoder model and decoder model using an adversarially learned inference model, as disclosed in “Adversarially Learned Inference” by Vincent Dumoulin, et al.
  • an encoder maps from a sample space to a code space and a decoder maps from a code space to a sample space.
  • the encoder and decoder are trained by selecting either a code and generating a sample using the decoder or by selecting a sample and generating a code using the encoder.
  • the resulting pairs of code and sample are provided to a discriminator model, which is trained to determine whether the pairs of code and sample came from the encoder or decoder.
  • the encoder and decoder can be updated based on whether the discriminator correctly determined the origin of the samples.
  • the encoder and decoder can be trained to fool the discriminator.
  • at convergence, the joint distributions of code and sample for the encoder and the decoder match.
  • other techniques of generating a mapping from a code space to a sample space may also be used. For example, a generative adversarial network can be used to learn a mapping from the code space to the sample space.
  • Process 1000 can then proceed to step 1003 .
  • system 100 can identify representative points in the code space.
  • System 100 can identify representative points in the code space by identifying points in the sample space, mapping the identified points into code space, and determining the representative points based on the mapped points, consistent with disclosed embodiments.
  • the identified points in the sample space can be elements of a dataset (e.g., an actual dataset or a synthetic dataset generated using an actual dataset).
  • System 100 can identify points in the sample space based on sample space characteristics. For example, when the sample space includes financial account information, system 100 can be configured to identify one or more first accounts belonging to users in their 20s and one or more second accounts belonging to users in their 40s.
  • identifying representative points in the code space can include a step of mapping the one or more first points in the sample space and the one or more second points in the sample space to corresponding points in the code space.
  • the one or more first points and one or more second points can be part of a dataset.
  • the one or more first points and one or more second points can be part of an actual dataset or a synthetic dataset generated using an actual dataset.
  • System 100 can be configured to select first and second representative points in the code space based on the mapped one or more first points and the mapped one or more second points. As shown in FIG. 11A , when the one or more first points include a single point, the mapping of this single point to the code space (e.g., point 1101 ) can be a first representative point in code space 1100 . Likewise, when the one or more second points include a single point, the mapping of this single point to the code space (e.g., point 1103 ) can be a second representative point in code space 1100 .
  • system 100 can be configured to determine a first representative point in code space 1110 .
  • system 100 can be configured to determine the first representative point based on the locations of the mapped one or more first points in the code space.
  • the first representative point can be a centroid or a medoid of the mapped one or more first points.
  • system 100 can be configured to determine the second representative point based on the locations of the mapped one or more second points in the code space.
  • the second representative point can be a centroid or a medoid of the mapped one or more second points.
  • system 100 can be configured to identify point 1113 as the first representative point based on the locations of mapped points 1111 a and 1111 b .
  • system 100 can be configured to identify point 1117 as the second representative point based on the locations of mapped points 1115 a and 1115 b.
  • the code space can include a subset of R^n.
  • System 100 can be configured to map a dataset to the code space using the encoder. System 100 can then identify the coordinates of the points with respect to a basis vector in R^n (e.g., one of the vectors of the identity matrix). System 100 can be configured to identify a first point with a minimum coordinate value with respect to the basis vector and a second point with a maximum coordinate value with respect to the basis vector. System 100 can be configured to identify these points as the first and second representative points. For example, taking the identity matrix as the basis, system 100 can be configured to select as the first point the point with the lowest value of the first element of the vector. To continue this example, system 100 can be configured to select as the second point the point with the highest value of the first element of the vector. In some embodiments, system 100 can be configured to repeat process 1000 for each vector in the basis.
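The basis-coordinate selection above can be sketched as follows, assuming the dataset has already been mapped into code space by the encoder; the function name is illustrative.

```python
import numpy as np

def representative_points_along_basis(codes, basis_vector):
    # Coordinate of each encoded point with respect to the basis vector.
    coords = codes @ basis_vector
    first = codes[np.argmin(coords)]   # minimum-coordinate point
    second = codes[np.argmax(coords)]  # maximum-coordinate point
    return first, second
```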
  • Process 1000 can then proceed to step 1005 .
  • system 100 can determine a difference vector connecting the first representative point and the second representative point. For example, as shown in FIG. 11A , system 100 can be configured to determine a vector 1105 from first representative point 1101 to second representative point 1103 . Likewise, as shown in FIG. 11B , system 100 can be configured to determine a vector 1119 from first representative point 1113 to second representative point 1117 .
  • Process 1000 can then proceed to step 1007 .
  • In step 1007 , system 100 can generate extreme codes.
  • system 100 can be configured to generate extreme codes by sampling the code space (e.g., code space 1200 ) along an extension (e.g., extension 1201 ) of the vector connecting the first representative point and the second representative point (e.g., vector 1105 ). In this manner, system 100 can generate a code extreme with respect to the first representative point and the second representative point (e.g. extreme point 1203 ).
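Sampling along an extension of the vector between the representative points can be sketched as follows; the extension factor of 1.5 is an illustrative assumption (any factor above 1.0 yields a point beyond the second representative point).

```python
import numpy as np

def extreme_code(first_point, second_point, extension=1.5):
    # Vector from the first representative point to the second.
    difference = second_point - first_point
    # A point along the extension of that vector, past the second point.
    return first_point + extension * difference
```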
  • Process 1000 can then proceed to step 1009 .
  • In step 1009 , system 100 can generate extreme samples. Consistent with disclosed embodiments, system 100 can be configured to generate extreme samples by converting the extreme code into the sample space using the decoder trained in step 1001 . For example, system 100 can be configured to convert extreme point 1203 into a corresponding datapoint in the sample space.
  • Process 1000 can then proceed to step 1011 .
  • system 100 can translate a dataset using the difference vector determined in step 1005 (e.g., difference vector 1105 ).
  • system 100 can be configured to convert the dataset from sample space to code space using the encoder trained in step 1001 .
  • System 100 can be configured to then translate the elements of the dataset in code space using the difference vector.
  • system 100 can be configured to translate the elements of the dataset using the vector and a scaling factor.
  • the scaling factor can be less than one.
  • the scaling factor can be greater than or equal to one.
  • the elements of the dataset can be translated in code space 1210 by the product of the difference vector and the scaling factor (e.g., original point 1211 can be translated by translation 1212 to translated point 1213 ).
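The code-space translation above can be sketched as follows: each encoded element is shifted by the product of the difference vector and the scaling factor, after which the decoder would map the translated codes back to the sample space. The function name and default scaling factor are illustrative.

```python
import numpy as np

def translate_codes(codes, difference_vector, scaling_factor=0.5):
    # Shift every encoded element by scaling_factor * difference_vector.
    return codes + scaling_factor * difference_vector
```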
  • Process 1000 can then proceed to step 1013 .
  • In step 1013 , system 100 can generate a translated dataset.
  • system 100 can be configured to generate the translated dataset by converting the translated points into the sample space using the decoder trained in step 1001 .
  • system 100 can be configured to convert translated point 1213 into a corresponding datapoint in the sample space.
  • FIG. 13 depicts an exemplary cloud computing system 1300 for generating a synthetic data stream that tracks a reference data stream.
  • the flow rate of the synthetic data can resemble the flow rate of the reference data stream, as system 1300 can generate synthetic data in response to receiving reference data stream data.
  • System 1300 can include a streaming data source 1301 , model optimizer 1303 , computing resource 1304 , model storage 1305 , dataset generator 1307 , and synthetic data source 1309 .
  • System 1300 can be configured to generate a new synthetic data model using actual data received from streaming data source 1301 .
  • Streaming data source 1301 , model optimizer 1303 , computing resources 1304 , and model storage 1305 can interact to generate the new synthetic data model, consistent with disclosed embodiments.
  • system 1300 can be configured to generate the new synthetic data model while also generating synthetic data using a current synthetic data model.
  • Streaming data source 1301 can be configured to retrieve new data elements from a database, a file, a datasource, a topic in a data streaming platform (e.g., IBM STREAMS), a topic in a distributed messaging system (e.g., APACHE KAFKA), or the like.
  • streaming data source 1301 can be configured to retrieve new elements in response to a request from model optimizer 1303 .
  • streaming data source 1301 can be configured to retrieve new data elements in real-time.
  • streaming data source 1301 can be configured to retrieve log data, as that log data is created.
  • streaming data source 1301 can be configured to retrieve batches of new data.
  • streaming data source 1301 can be configured to periodically retrieve all log data created within a certain period (e.g., a five-minute interval).
  • the data can be application logs.
  • the application logs can include event information, such as debugging information, transaction information, user information, user action information, audit information, service information, operation tracking information, process monitoring information, or the like.
  • the data can be JSON data (e.g., JSON application logs).
  • Model optimizer 1303 can be configured to provision computing resources 1304 with a data model, consistent with disclosed embodiments.
  • computing resources 1304 can resemble computing resources 101 , described above with regard to FIG. 1 .
  • computing resources 1304 can provide similar functionality and can be similarly implemented.
  • the data model can be a synthetic data model.
  • the data model can be a current data model configured to generate data similar to recently received data in the reference data stream.
  • the data model can be received from model storage 1305 .
  • model optimizer 1303 can be configured to provide instructions to computing resources 1304 to retrieve a current data model of the reference data stream from model storage 1305 .
  • the synthetic data model can include a recurrent neural network, a kernel density estimator, or a generative adversarial network.
  • Computing resources 1304 can be configured to train the new synthetic data model using reference data stream data.
  • system 1300 (e.g., computing resources 1304 or model optimizer 1303 ) can be configured to incorporate reference data stream data into the training data as it is received from streaming data source 1301 .
  • the training data can therefore reflect the current characteristics of the reference data stream (e.g., the current values, current schema, current statistical properties, and the like).
  • system 1300 (e.g., computing resources 1304 or model optimizer 1303 ) can be configured to store reference data stream data; computing resources 1304 may have received the stored reference data stream data prior to beginning training of the new synthetic data model.
  • computing resources 1304 can be configured to gather data from streaming data source 1301 during a first time-interval (e.g., the prior repeat) and use this gathered data to train a new synthetic model in a subsequent time-interval (e.g., the current repeat).
  • computing resources 1304 can be configured to use the stored reference data stream data for training the new synthetic data model.
  • the training data can include both newly-received and stored data.
  • when the synthetic data model is a generative adversarial network, computing resources 1304 can be configured to train the new synthetic data model, in some embodiments, as described above with regard to FIGS. 8 and 9 .
  • computing resources 1304 can be configured to train the new synthetic data model according to known methods.
  • Model optimizer 1303 can be configured to evaluate performance criteria of a newly created synthetic data model.
  • the performance criteria can include a similarity metric (e.g., a statistical correlation score, data similarity score, or data quality score, as described herein).
  • model optimizer 1303 can be configured to compare the covariances or univariate distributions of a synthetic dataset generated by the new synthetic data model and a reference data stream dataset.
  • model optimizer 1303 can be configured to evaluate the number of matching or similar elements in the synthetic dataset and reference data stream dataset.
  • model optimizer 1303 can be configured to evaluate a number of duplicate elements in each of the synthetic dataset and reference data stream dataset, a prevalence of the most common value in the synthetic dataset and reference data stream dataset, a maximum difference of rare values in each of the synthetic dataset and reference data stream dataset, differences in schema between the synthetic dataset and reference data stream dataset, and the like.
  • the performance criteria can include prediction metrics.
  • the prediction metrics can enable a user to determine whether data models perform similarly for both synthetic and actual data.
  • the prediction metrics can include a prediction accuracy check, a prediction accuracy cross check, a regression check, a regression cross check, and a principal component analysis check.
  • a prediction accuracy check can determine the accuracy of predictions made by a model (e.g., recurrent neural network, kernel density estimator, or the like) given a dataset.
  • the prediction accuracy check can receive an indication of the model, a set of data, and a set of corresponding labels.
  • the prediction accuracy check can return an accuracy of the model in predicting the labels given the data.
  • a prediction accuracy cross check can calculate the accuracy of a predictive model that is trained on synthetic data and tested on the original data used to generate the synthetic data.
  • a regression check can regress a numerical column in a dataset against other columns in the dataset, determining the predictability of the numerical column given the other columns.
  • a regression error cross check can determine a regression formula for a numerical column of the synthetic data and then evaluate the predictive ability of the regression formula for the numerical column of the actual data.
  • a principal component analysis check can determine a number of principal component analysis columns sufficient to capture a predetermined amount of the variance in the dataset. Similar numbers of principal component analysis columns can indicate that the synthetic data preserves the latent feature structure of the original data.
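The principal component analysis check above can be sketched as follows: count how many components are needed to capture a given fraction of the variance, then compare the counts for the synthetic and original datasets. The function name and the 95% default are illustrative assumptions.

```python
import numpy as np

def pca_columns_needed(dataset, variance_fraction=0.95):
    # Center the data and take its singular values.
    centered = dataset - dataset.mean(axis=0)
    _, singular_values, _ = np.linalg.svd(centered, full_matrices=False)
    # Fraction of variance explained by each principal component.
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    cumulative = np.cumsum(explained)
    # Number of components needed to reach the target fraction.
    return int(np.searchsorted(cumulative, variance_fraction) + 1)
```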
  • Model optimizer 1303 can be configured to store the newly created synthetic data model and metadata for the new synthetic data model in model storage 1305 based on the evaluated performance criteria, consistent with disclosed embodiments.
  • model optimizer 1303 can be configured to store the metadata and new data model in model storage when a value of a similarity metric or a prediction metric satisfies a predetermined threshold.
  • the metadata can include at least one value of a similarity metric or prediction metric.
  • the metadata can include an indication of the origin of the new synthetic data model, the data used to generate the new synthetic data model, when the new synthetic data model was generated, and the like.
  • System 1300 can be configured to generate synthetic data using a current data model. In some embodiments, this generation can occur while system 1300 is training a new synthetic data model.
  • Model optimizer 1303 , model storage 1305 , dataset generator 1307 , and synthetic data source 1309 can interact to generate the synthetic data, consistent with disclosed embodiments.
  • Model optimizer 1303 can be configured to receive a request for a synthetic data stream from an interface (e.g., interface 113 or the like).
  • model optimizer 1303 can resemble model optimizer 107, described above with regard to FIG. 1 .
  • model optimizer 1303 can provide similar functionality and can be similarly implemented.
  • requests received from the interface can indicate a reference data stream.
  • such a request can identify streaming data source 1301 and/or specify a topic or subject (e.g., a Kafka topic or the like).
  • model optimizer 1303 (or another component of system 1300 ) can be configured to direct generation of a synthetic data stream that tracks the reference data stream, consistent with disclosed embodiments.
  • Dataset generator 1307 can be configured to retrieve a current data model of the reference data stream from model storage 1305 .
  • dataset generator 1307 can resemble dataset generator 103 , described above with regard to FIG. 1 .
  • dataset generator 1307 can provide similar functionality and can be similarly implemented.
  • model storage 1305 can resemble model storage 109 , described above with regard to FIG. 1 .
  • model storage 1305 can provide similar functionality and can be similarly implemented.
  • the current data model can resemble data received from streaming data source 1301 according to a similarity metric (e.g., a statistical correlation score, data similarity score, or data quality score, as described herein).
  • the current data model can resemble data received during a time interval extending to the present (e.g. the present hour, the present day, the present week, or the like). In various embodiments, the current data model can resemble data received during a prior time interval (e.g. the previous hour, yesterday, last week, or the like). In some embodiments, the current data model can be the most recently trained data model of the reference data stream.
  • Dataset generator 1307 can be configured to generate a synthetic data stream using the current data model of the reference data stream.
  • dataset generator 1307 can be configured to generate the synthetic data stream by replacing sensitive portions of the reference data stream with synthetic data, as described in FIGS. 5A and 5B .
  • dataset generator 1307 can be configured to generate the synthetic data stream without reference to the reference data stream data.
  • dataset generator 1307 can be configured to initialize the recurrent neural network with a value string (e.g., a random sequence of characters), predict a new value based on the value string, and then add the new value to the end of the value string.
  • Dataset generator 1307 can then predict the next value using the updated value string that includes the new value.
  • dataset generator 1307 can be configured to probabilistically choose a new value. For example, when the existing value string is "examin", dataset generator 1307 can be configured to select the next value as "e" with a first probability and select the next value as "a" with a second probability.
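The character-by-character generation loop described above can be sketched as follows; `next_char_probs` is a hypothetical stand-in for the trained recurrent neural network, which would return a probability for each candidate next character given the current value string.

```python
import random

def generate_string(next_char_probs, seed_string, length, rng=None):
    """Grow `seed_string` by `length` characters, sampling each next
    character from the probabilities the model assigns. `next_char_probs`
    maps the current value string to {candidate character: probability}."""
    rng = rng or random.Random()
    value = seed_string
    for _ in range(length):
        probs = next_char_probs(value)
        chars, weights = zip(*probs.items())
        # Probabilistically choose the new value and append it.
        value += rng.choices(chars, weights=weights)[0]
    return value
```

For the "examin" example, the stand-in model might return `{"e": 0.7, "a": 0.3}`, so the generator appends "e" or "a" with those probabilities.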
  • dataset generator 1307 can be configured to generate the synthetic data by selecting samples from a code space, as described herein.
  • dataset generator 1307 can be configured to generate an amount of synthetic data equal to the amount of actual data retrieved from streaming data source 1301 .
  • the rate of synthetic data generation can match the rate of actual data generation.
  • when streaming data source 1301 retrieves a batch of 10 samples of actual data, dataset generator 1307 can be configured to generate a batch of 10 samples of synthetic data.
  • when streaming data source 1301 retrieves a batch of actual data every 10 minutes, dataset generator 1307 can be configured to generate a batch of synthetic data every 10 minutes. In this manner, system 1300 can be configured to generate synthetic data similar in both content and temporal characteristics to the reference data stream data.
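The rate-matching behavior can be sketched as a loop in which each synthetic batch mirrors the size of the corresponding reference batch; `generate_sample` is a hypothetical zero-argument callable standing in for a draw from the current data model.

```python
def mirror_batches(reference_batches, generate_sample):
    """Yield one synthetic batch per reference batch, sized to match,
    so synthetic generation tracks actual data generation."""
    for batch in reference_batches:
        yield [generate_sample() for _ in range(len(batch))]
```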
  • dataset generator 1307 can be configured to provide synthetic data generated using the current data model to synthetic data source 1309 .
  • synthetic data source 1309 can be configured to provide the synthetic data received from dataset generator 1307 to a database, a file, a datasource, a topic in a data streaming platform (e.g., IBM STREAMS), a topic in a distributed messaging system (e.g., APACHE KAFKA), or the like.
  • system 1300 can be configured to track the reference data stream by repeatedly switching data models of the reference data stream.
  • dataset generator 1307 can be configured to switch between synthetic data models at a predetermined time, or upon expiration of a time interval.
  • dataset generator 1307 can be configured to switch from an old model to a current model every hour, day, week, or the like.
  • system 1300 can detect when a data schema of the reference data stream changes and switch to a current data model configured to provide synthetic data with the current schema.
  • switching between synthetic data models can include dataset generator 1307 retrieving a current model from model storage 1305 and computing resources 1304 providing a new synthetic data model for storage in model storage 1305 .
  • computing resources 1304 can update the current synthetic data model with the new synthetic data model and then dataset generator 1307 can retrieve the updated current synthetic data model.
  • dataset generator 1307 can retrieve the current synthetic data model and then computing resources 1304 can update the current synthetic data model with the new synthetic data model.
  • model optimizer 1303 can provision computing resources 1304 with a synthetic data model for training using a new set of training data.
  • computing resources 1304 can be configured to continue updating the new synthetic data model. In this manner, a repeat of the switching process can include generation of a new synthetic data model and the replacement of a current synthetic data model by this new synthetic data model.
  • FIG. 14 depicts a process 1400 for generating synthetic JSON log data using the cloud computing system of FIG. 13 .
  • Process 1400 can include the steps of retrieving reference JSON log data, training a recurrent neural network to generate synthetic data resembling the reference JSON log data, generating the synthetic JSON log data using the recurrent neural network, and validating the synthetic JSON log data. In this manner, system 1300 can use process 1400 to generate synthetic JSON log data that resembles actual JSON log data.
  • in step 1401 , the JSON log data can be retrieved from a database, a file, a datasource, a topic in a distributed messaging system such as Apache Kafka, or the like.
  • the JSON log data can be retrieved in response to a request from model optimizer 1303 .
  • the JSON log data can be retrieved in real-time, or periodically (e.g., approximately every five minutes).
  • Process 1400 can then proceed to step 1403 .
  • in step 1403 , substantially as described above with regard to FIG. 13 , computing resources 1304 can be configured to train a recurrent neural network using the received data.
  • the training of the recurrent neural network can proceed as described in “Training Recurrent Neural Networks,” 2013, by Ilya Sutskever, which is incorporated herein by reference in its entirety.
  • in step 1405 , dataset generator 1307 can be configured to generate synthetic JSON log data using the trained neural network.
  • dataset generator 1307 can be configured to generate the synthetic JSON log data at the same rate as actual JSON log data is received by streaming data source 1301 .
  • dataset generator 1307 can be configured to generate batches of JSON log data at regular time intervals, the number of elements in a batch dependent on the number of elements received by streaming data source 1301 .
  • dataset generator 1307 can be configured to generate an element of synthetic JSON log data upon receipt of an element of actual JSON log data from streaming data source 1301 .
  • Process 1400 can then proceed to step 1407 .
  • dataset generator 1307 (or another component of system 1300 ) can be configured to validate the synthetic data stream.
  • dataset generator 1307 can be configured to use a JSON validator (e.g., JSON SCHEMA VALIDATOR, JSONLINT, or the like) and a schema for the reference data stream to validate the synthetic data stream.
  • the schema describes key-value pairs present in the reference data stream.
  • system 1300 can be configured to derive the schema from the reference data stream.
  • validating the synthetic data stream can include validating that keys present in the synthetic data stream are present in the schema.
  • validating the synthetic data stream can include validating that key-value formats present in the synthetic data stream match corresponding key-value formats in the reference data stream.
  • system 1300 may not validate the synthetic data stream when objects in the data stream include a numeric-valued “first_name” or “last_name”.
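A simplified schema check along these lines is sketched below. A production system would use a full JSON Schema validator (e.g., JSON SCHEMA VALIDATOR or JSONLINT); here the key-presence and value-format rules are illustrated with an assumed mapping from expected keys to expected Python types.

```python
def validate_synthetic_record(record, schema):
    """Validate one synthetic JSON object against a schema derived from
    the reference stream: every key in the record must appear in the
    schema, and every value must match the expected type."""
    for key, value in record.items():
        if key not in schema:                    # key absent from schema: reject
            return False
        if not isinstance(value, schema[key]):   # key-value format mismatch: reject
            return False
    return True
```

Per the example above, a record with a numeric-valued "first_name" fails validation.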
  • FIG. 15 depicts a system 1500 for secure generation and insecure use of models of sensitive data.
  • System 1500 can include a remote system 1501 and a local system 1503 that communicate using network 1505 .
  • Remote system 1501 can be substantially similar to system 100 and be implemented, in some embodiments, as described in FIG. 4 .
  • remote system 1501 can include an interface, model optimizer, and computing resources that resemble interface 113 , model optimizer 107 , and computing resources 101 , respectively, described above with regards to FIG. 1 .
  • the interface, model optimizer, and computing resources can provide similar functionality to interface 113 , model optimizer 107 , and computing resources 101 , respectively, and can be similarly implemented.
  • remote system 1501 can be implemented using a cloud computing infrastructure.
  • Local system 1503 can comprise a computing device, such as a smartphone, tablet, laptop, desktop, workstation, server, or the like.
  • Network 1505 can include any combination of electronic communications networks enabling communication between components of system 1500 (similar to network 115 ).
  • remote system 1501 can be more secure than local system 1503 .
  • remote system 1501 can be better protected from physical theft or computer intrusion than local system 1503 .
  • remote system 1501 can be implemented using AWS or a private cloud of an institution and managed at an institutional level, while the local system can be in the possession of, and managed by, an individual user.
  • remote system 1501 can be configured to comply with policies or regulations governing the storage, transmission, and disclosure of customer financial information, patient healthcare records, or similar sensitive information.
  • local system 1503 may not be configured to comply with such regulations.
  • System 1500 can be configured to perform a process of generating synthetic data. According to this process, system 1500 can train the synthetic data model on sensitive data using remote system 1501 , in compliance with regulations governing the storage, transmission, and disclosure of sensitive information. System 1500 can then transmit the synthetic data model to local system 1503 , which can be configured to use the model to generate synthetic data locally. In this manner, local system 1503 can be configured to use synthetic data resembling the sensitive information while complying with policies or regulations governing the storage, transmission, and disclosure of such information.
  • the model optimizer can receive a data model generation request from the interface.
  • the model optimizer can provision computing resources with a synthetic data model.
  • the computing resources can train the synthetic data model using a sensitive dataset (e.g., consumer financial information, patient healthcare information, or the like).
  • the model optimizer can be configured to evaluate performance criteria of the data model (e.g., the similarity metric and prediction metrics described herein, or the like).
  • the model optimizer can be configured to store the trained data model and metadata of the data model (e.g., values of the similarity metric and prediction metrics, the origin of the new synthetic data model, the data used to generate the new synthetic data model, when the new synthetic data model was generated, and the like). For example, the model optimizer can determine that the synthetic data model satisfied predetermined acceptability criteria based on one or more similarity and/or prediction metric values.
  • Local system 1503 can then retrieve the synthetic data model from remote system 1501 .
  • local system 1503 can be configured to retrieve the synthetic data model in response to a synthetic data generation request received by local system 1503 .
  • a user can interact with local system 1503 to request generation of synthetic data.
  • the synthetic data generation request can specify metadata criteria for selecting the synthetic data model.
  • Local system 1503 can interact with remote system 1501 to select the synthetic data model based on the metadata criteria. Local system 1503 can then generate the synthetic data using the data model in response to the data generation request.
  • FIG. 16 depicts a process 1600 for transforming any model into a neural network model, consistent with disclosed embodiments.
  • Process 1600 is performed by components of system 100 , including computing resources 101 , dataset generator 103 , database 105 , model optimizer 107 , model storage 109 , model curator 111 , and interface 113 , consistent with disclosed embodiments.
  • process 1600 is performed by components of system 400 , an exemplary implementation of system 100 .
  • steps of process 1600 may be performed by model optimizer 107 as implemented on model optimization instance 409 .
  • input data is received.
  • input is received at, for example, model optimizer 107 from one or more components of system 100 .
  • input is received from a system outside system 100 via interface 113 .
  • the input data of step 1602 includes an input command, an input model having an input model type, and an input dataset.
  • the input model may be any type of data model (e.g., a random forest model, a gradient boosting machine, a regression model, a logistic regression, an object based logical model, a physical data model, a neural network model, or any other data model).
  • the input command includes a command to transform the input model into a neural network model.
  • the input command specifies one or more neural network types (e.g., neural network, recurrent neural network, generative adversarial network, or the like).
  • the input command specifies one or more model parameters for the one or more types of neural networks (e.g., the number of features and number of layers in a recurrent neural network).
  • the input command specifies model selection criteria.
  • the model selection criteria may comprise a desired performance metric of the transformed model.
  • the performance metric may be an accuracy score of the neural network model, a model run time, a root-mean-square error estimate (RMSE), a logloss, an Akaike's Information Criterion (AICC), a Bayesian Information Criterion (BIC), an area under a receiver operating characteristic (ROC) curve, Precision, Recall, an F-Score, or the like.
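Two of the performance metrics listed above, RMSE and prediction accuracy, can be computed directly; the following is a minimal sketch with assumed function names.

```python
import math

def rmse(predictions, targets):
    """Root-mean-square error between model predictions and targets."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predictions, targets))
                     / len(targets))

def accuracy(predictions, targets):
    """Fraction of predictions that exactly match their labels."""
    return sum(p == t for p, t in zip(predictions, targets)) / len(targets)
```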
  • the input dataset is a dataset previously used to train the input model.
  • the input command specifies the location of a database, and receiving the input dataset includes retrieving the input dataset from the database.
  • the input model is applied to the input dataset to generate model output.
  • model optimizer 107 applies the input model, consistent with the embodiments. That is, computing resources are provisioned to run the model over the input dataset.
  • step 1604 includes spinning up a virtual machine or ephemeral container instance to run the input model (e.g., development instance 407 ).
  • step 1604 includes generating a map of input model features. The map may include references that relate output data to input data (e.g., a foreign key to the input data).
  • the map may identify a correspondence between input data values in rows numbering row A to row Z and a set of 26 model output data values that the model produced based on input data values in rows A to Z.
  • Model features may include input values from the input data set or transformed input data values used to run the model prior to step 1604 (e.g., via normalizing, averaging, or other transformations).
  • generating model output includes transforming the input values into transformed values, and running the input model over the transformed values.
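The map relating output values back to input rows (e.g., a foreign key to the input data) can be sketched as follows; the function name and row labels are illustrative assumptions.

```python
def apply_model_with_map(model, dataset):
    """Apply `model` to each input row and record a map relating every
    output value back to the row that produced it, so stored model output
    can later be joined to the input data."""
    outputs, feature_map = {}, {}
    for i, (label, values) in enumerate(dataset.items()):
        outputs[i] = model(values)  # output index i holds this row's result
        feature_map[i] = label      # foreign key back to input row `label`
    return outputs, feature_map
```

For input rows labeled A to Z this produces 26 output values, each keyed back to its source row, matching the correspondence described above.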
  • model output is stored.
  • storing model output includes one of (a) storing model output along with a map of input model features or (b) storing model output along with model features.
  • Model output may include a modeling result, a log, or other model output.
  • Model output may be stored in a database or cloud computing bucket. For example, model output may be stored in database 105 .
  • one or more candidate neural network models are generated.
  • model optimizer 107 generates candidate neural network models, consistent with the embodiments.
  • Generating a candidate network model may include spinning up an ephemeral container instance (e.g., development instance 407 ).
  • the number and type of candidate neural network models are based on the command received at step 1602 .
  • the number and type of candidate neural network models are pre-determined.
  • generating candidate neural network models includes retrieving a candidate neural network model from a model storage. For example, candidate neural network models may be retrieved from model storage 109 . Retrieving candidate neural network models may be based on a model index.
  • the one or more candidate neural network models of step 1608 have model parameters which are based on the features of the input model such that the one or more candidate neural network models are overfitted to the input model. For example, if the input model is a linear regression model having a set of regression coefficients, each coefficient producing a result, then a candidate neural network may be designed to reproduce the result of each regression coefficient and the overall regression result of the input model.
  • the model parameters of the one or more candidate neural network models may include, for example, the number of hidden layers, the number of nodes, the dropout rate, or other model parameters.
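The linear regression example above can be made concrete: a one-layer network with identity activation whose weights are seeded from the regression coefficients reproduces the regression output exactly, i.e., the candidate is overfitted to the input model by construction. The function name is an assumption.

```python
import numpy as np

def regression_to_network(coefficients, intercept):
    """Build a minimal one-layer 'network' (identity activation) whose
    weights are the input model's regression coefficients, so the candidate
    reproduces each coefficient's result and the overall regression result
    before any further tuning."""
    weights = np.asarray(coefficients, dtype=float)

    def forward(x):
        # y = w . x + b: the regression itself, expressed as one layer.
        return float(np.asarray(x, dtype=float) @ weights + intercept)

    return forward
```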
  • the candidate neural network models are tuned to the input model, consistent with disclosed embodiments.
  • model optimizer 107 tunes the candidate neural network models, consistent with the embodiments.
  • Tuning includes, for example, adjusting a number of hidden layers, a number of inputs, or a type of layer for any of the one or more candidate neural network models.
  • Tuning includes training a candidate model to reproduce features of the input model, consistent with disclosed embodiments.
  • tuning a candidate model includes iteratively performing operations to: select model training parameters; receive candidate model results; update the model training parameters based on the received candidate model results; and receive updated candidate model results based on the updated model training parameters.
  • Model training parameters may include, for example, batch size, number of training batches, number of epochs, chunk size, time window, input noise dimension, or the like, consistent with disclosed embodiments. Training may terminate when a training condition is satisfied. For example, training may terminate at a pre-determined run time, after a pre-determined number of epochs, or based on an accuracy score of a candidate model. In some embodiments, different candidate models are trained using separate ephemeral container instances (e.g., development instance 407 ).
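The iterative tuning loop can be sketched as follows; `train_step` and `update_params` are hypothetical callables standing in for the provisioned training resources and the parameter-update policy.

```python
def tune(train_step, initial_params, update_params, max_epochs=50, target=0.95):
    """Iteratively tune a candidate model: run with the current training
    parameters, receive the candidate model results, and update the
    parameters. Training terminates when an accuracy target is reached
    or the epoch budget is exhausted."""
    params = initial_params
    for epoch in range(max_epochs):
        score = train_step(params)              # receive candidate model results
        if score >= target:                     # accuracy-based termination
            return params, score, epoch + 1
        params = update_params(params, score)   # update training parameters
    return params, score, max_epochs            # run-time/epoch termination
```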
  • model output is received from the candidate neural network models.
  • the candidate model output may include an overall accuracy score of a candidate model and a plurality of accuracy scores corresponding to the features of the input model.
  • Model output at step 1612 may include a log file, a run time, a number of epochs, an error, an estimate of drift, or other model-run information.
  • Model output may specify a model parameter, consistent with disclosed embodiments.
  • model optimizer 107 receives model output from, for example, resources provisioned to run the candidate neural network models, including an ephemeral container instance, consistent with the embodiments.
  • a candidate neural network model is selected based on one or more selection criteria.
  • model optimizer 107 selects the candidate neural network model, consistent with the embodiments.
  • the one or more selection criteria are the selection criteria received in the input command (step 1602 ). In some embodiments, the one or more selection criteria is predetermined.
  • the model selection criteria may comprise a desired performance metric of the transformed model (e.g., an accuracy score of the neural network model, a model run time, a root-mean-square error estimate (RMSE), a logloss, an Akaike's Information Criterion (AICC), a Bayesian Information Criterion (BIC), an area under a receiver operating characteristic (ROC) curve, Precision, Recall, an F-Score, or the like).
  • a desired performance metric of the transformed model e.g., an accuracy score of the neural network model, a model run time, a root-mean-square error estimate (RMSE), a logloss, an Akaike's Information Criterion (AICC), a Bayesian Information Criterion (BIC), an area under a receiver operating characteristic (ROC) curve, Precision, Recall, an F-Score, or the like.
  • selecting a candidate neural network model at step 1614 includes terminating processes relating to candidate neural network models that are not selected.
  • the selected neural network model is returned.
  • returning the selected neural network model includes transmitting, from model optimizer 107 , the selected neural network model to an outside system via interface 113 .
  • returning the selected neural network model includes storing the selected neural network model.
  • the step 1616 may include storing the selected neural network in a database or a bucket (e.g., database 105 ).
  • FIG. 17 depicts a process 1700 for transforming a legacy model, consistent with disclosed embodiments.
  • Process 1700 is performed by components of system 100 , including computing resources 101 , dataset generator 103 , database 105 , model optimizer 107 , model storage 109 , model curator 111 , and interface 113 , consistent with disclosed embodiments.
  • process 1700 is performed by components of system 400 , an exemplary implementation of system 100 .
  • steps of process 1700 may be performed by model optimizer 107 as implemented on model optimization instance 409 .
  • inputs are received.
  • inputs are received by optimizer 107 from interface 113 , consistent with the embodiments.
  • the input may include a model of any model type, e.g., the legacy model, and a command to transform the legacy model into a new model of the same model type as the legacy model, the new model having a different programming environment than the legacy model programming environment.
  • the command may be to transform a linear regression model built in SAS to a linear regression model built in PYTHON.
  • the legacy model and the new model share the same programming environment but use different code packages.
  • the legacy model may be a Gradient Boosting Machine built with SCIKIT LEARN (using PYTHON libraries) and the new model may be a Gradient Boosting Machine built with XGBOOST (also using PYTHON libraries).
  • Step 1702 may include receiving metadata of the legacy model.
  • Metadata may include arguments used by the model.
  • the legacy model is a random forest and metadata includes a number of trees, a depth of trees, a min sample split.
  • receiving input at step 1702 includes sub-steps to receive a model of any type, receive a dataset, and apply the model to the received dataset (e.g., steps 1602 - 1606 ).
  • model data is received from an API call via, for example, interface 113 .
  • in step 1704 , a determination is made whether model metadata is present (i.e., whether model metadata was received at step 1702 ), and the system proceeds to either step 1706 or step 1708 based on the determination.
  • model optimizer 107 may perform the determination of step 1704 .
  • Step 1705 may be performed if model metadata was received at step 1702 .
  • a feature map is received.
  • the feature map may include instructions to transform features of one programming environment into another programming environment (or from one package into another package).
  • model optimizer 107 may identify features of the legacy model programming environment.
  • legacy model features may map directly onto the new model programing environment.
  • arguments using different syntax in both models may specify the number of trees, depth of trees, and/or min sample split, and the feature map may include instructions to transform these arguments from SCIKIT-LEARN to XGBOOST.
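A feature map of this kind can be represented as a dictionary of per-argument transforms, as sketched below. The SCIKIT-LEARN argument names are real; the XGBOOST counterparts shown are illustrative assumptions, not an authoritative or complete mapping.

```python
# Hypothetical feature map: each entry transforms one SCIKIT-LEARN random
# forest argument into an assumed XGBOOST counterpart.
FEATURE_MAP = {
    "n_estimators": lambda v: ("n_estimators", v),
    "max_depth": lambda v: ("max_depth", v),
    # Approximate analogue only; the two arguments are not equivalent.
    "min_samples_split": lambda v: ("min_child_weight", v),
}

def translate_params(legacy_params, feature_map):
    """Apply the feature map to the legacy model's arguments, keeping
    only the arguments the map knows how to transform."""
    new_params = {}
    for name, value in legacy_params.items():
        if name in feature_map:
            new_name, new_value = feature_map[name](value)
            new_params[new_name] = new_value
    return new_params
```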
  • model optimizer 107 may generate the candidate new models.
  • model optimizer 107 may spin up one or more ephemeral container instances (e.g., development instance 407 ) to generate a candidate new model.
  • Step 1708 may be performed if metadata was not received at step 1702 .
  • a global grid search over new model parameters may be performed to generate one or more candidate new models that approximate the model output of the legacy model.
  • the global grid search of step 1708 may be a broad (unrefined) parameter search.
  • the legacy model is a random forest model
  • the global grid search may comprise obtaining a model performance metric for each of a series of candidate random forest models that vary by the number of trees ranging from 0 to 10 with a step size of 2; the depth of trees ranging from 0 to 20 with a step size of 5; and/or a min sample split ranging from 0 to 1 with a step size of 0.1.
  • model optimizer 107 performs the global grid search. In some embodiments, model optimizer 107 spins up one or more ephemeral container instances or identifies one or more running container instances to perform the global grid search (e.g., development instance 407 ).
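The broad grid described above can be enumerated with a product over parameter values; `evaluate` is a hypothetical callable that fits a candidate random forest at the given parameters and scores it against the legacy model output.

```python
import itertools

def global_grid_search(evaluate, grid):
    """Broad (unrefined) search: score a candidate model at every point
    of the parameter grid and return all (params, score) pairs."""
    results = []
    keys = sorted(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        results.append((params, evaluate(params)))
    return results

# The grid described above for a random forest legacy model.
grid = {
    "num_trees": range(0, 11, 2),                      # 0..10, step 2
    "tree_depth": range(0, 21, 5),                     # 0..20, step 5
    "min_sample_split": [i / 10 for i in range(11)],   # 0..1, step 0.1
}
```

The highest-scoring region of this coarse grid would then seed the closed-loop (refined) search of step 1710.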
  • a closed-loop (refined) grid search may be performed over the parameters to fit one or more candidate new models (of either step 1706 or step 1708 ) to the legacy model.
  • the closed-loop grid search may include a grid search over parameters near the parameters of the model received as input at step 1702 , i.e., the parameters of the input model serve as a seed for the closed-loop grid search.
  • the closed-loop grid search may include a grid search over parameters near the parameters of the subset of candidate new models identified at step 1708 .
  • the range and step size of the closed-loop grid search may be smaller than the range and step size of parameters used in the global grid search.
  • model optimizer 107 performs the closed-loop grid search. In some embodiments, model optimizer 107 spins up one or more ephemeral container instances or identifies one or more running container instances to perform the closed-loop grid search (e.g., development instance 407 ).
  • candidate new models may be applied to the input dataset and the results are compared to the legacy model output.
  • the comparison may include at least one of an accuracy score, a model run time, a root-mean-square error estimate (RMSE), a logloss, an AICC, a BIC, an area under an ROC curve, Precision, Recall, an F-Score, or the like.
  • model optimizer 107 may perform step 1712 .
  • a candidate new model may be selected based on the comparison at step 1712 and one or more selection criteria.
  • model optimizer 107 may select the new candidate model.
  • an updated feature map may be created, consistent with disclosed embodiments.
  • the updated feature map may be based on one or more of the legacy model, the feature map received at step 1705 , and the selected new model.
  • the updated feature map may be newly generated based on the legacy model and the selected new model.
  • the feature map may include instructions to transform features of one programming environment into another programming environment (or from one package into another package).
  • the selected new model may be returned.
  • returning the selected new model may include transmitting, by model optimizer 107 , the selected new model to an outside system via interface 113 .
  • returning the selected new model includes storing the selected new model.
  • step 1718 may include storing the selected new model in a database or a bucket (e.g., database 105 ).
  • step 1718 may include at least one of returning the updated feature map with the selected new model or storing the updated feature map in memory (e.g., model storage 109 ).
  • the disclosed systems and methods can enable generation of synthetic data similar to an actual dataset (e.g., using dataset generator 103 ).
  • the synthetic data can be generated using a data model trained on the actual dataset (e.g., as described above with regards to FIG. 9 ).
  • Such data models can include generative adversarial networks.
  • the following code depicts the creation of a synthetic dataset based on sensitive patient healthcare records using a generative adversarial network.
  • the following step defines a Generative Adversarial Network data model.
  • model_options = {'GANhDim': 498, 'GANZDim': 20, 'num_epochs': 3}
  • the following step defines the delimiters present in the actual data
  • the dataset is the publicly available University of Wisconsin Cancer dataset, a standard dataset used to benchmark machine learning prediction tasks. Given characteristics of a tumor, the task is to predict whether the tumor is malignant.
  • the GAN model is trained to generate data statistically similar to the actual data.
  • the GAN model can now be used to generate synthetic data.
  • the synthetic data can be saved to a file for later use in training other machine learning models for this prediction task without relying on the original data.
  • the disclosed systems and methods can enable identification and removal of sensitive data portions in a dataset.
  • sensitive portions of a dataset are automatically detected and replaced with synthetic data.
  • the dataset includes human resources records.
  • the sensitive portions of the dataset are replaced with random values (though they could also be replaced with synthetic data that is statistically similar to the original data as described in FIGS. 5A and 5B ).
  • this example depicts tokenizing four columns of the dataset.
  • Active Status columns are tokenized such that all the characters in the values can be replaced by random characters of the same type while preserving format. For the Employee Number column, the first three characters of the values can be preserved, but the remainder of each employee number can be tokenized. Finally, the values of the Last Day of Work column can be replaced with fully random values. All of these replacements can be consistent across the columns.
  • the system can use the scrub map to tokenize another file in a consistent way (e.g., replace the same values with the same replacements across both files) by passing the returned scrub_map dictionary to a new application of the scrub function.
  • the disclosed systems and methods can be used to consistently tokenize sensitive portions of a file.
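  • The consistent tokenization described above can be sketched in PYTHON. The following is an illustrative reconstruction, not the disclosed implementation: the `scrub` function and `scrub_map` dictionary follow the description, but the character-class replacement rules and the `preserve_prefix` parameter are assumptions.

```python
import random
import string

def scrub(value, scrub_map, preserve_prefix=0):
    """Tokenize a value by replacing each character with a random character
    of the same type (digit -> digit, letter -> letter), preserving format.
    Values already seen are replaced consistently via scrub_map."""
    if value in scrub_map:
        return scrub_map[value]
    out = value[:preserve_prefix]  # e.g., keep the first characters intact
    for ch in value[preserve_prefix:]:
        if ch.isdigit():
            out += random.choice(string.digits)
        elif ch.isalpha():
            out += random.choice(string.ascii_letters)
        else:
            out += ch  # keep delimiters so the format is preserved
    scrub_map[value] = out
    return out

random.seed(0)  # deterministic for illustration
scrub_map = {}
token_a = scrub("EMP-12345", scrub_map, preserve_prefix=4)
token_b = scrub("EMP-12345", scrub_map, preserve_prefix=4)  # same replacement
```

  • Passing the populated scrub_map to a second application of scrub over another file replaces the same values with the same tokens across both files, as described above.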

Abstract

Systems and methods for transforming legacy models and transforming a model into a neural network model are disclosed. In an embodiment, a method may include receiving input data comprising an input model, an input dataset, and an input command specifying model selection criteria. The method may include applying the input model to the input dataset to generate model output and storing the model output and at least one of input model features or a map of the input model. The method may include generating a plurality of candidate neural network models with parameters based on the input model features. The method may include tuning the candidate neural network models to the input model. The method may include receiving model output from the candidate neural network models and selecting a neural network model from the candidate neural network models based on the candidate model output and the model selection criteria. In some aspects, the method may include returning the selected neural network model.

Description

    CROSS REFERENCE TO RELATED APPLICATIONS
  • This application claims the benefit of U.S. Provisional Application No. 62/694,968, filed Jul. 6, 2018, and incorporated herein by reference in its entirety.
  • This application also relates to U.S. patent application Ser. No. 16/151,385 filed on Oct. 4, 2018, and titled Data Model Generation Using Generative Adversarial Networks, the disclosure of which is also incorporated herein by reference in its entirety.
  • TECHNICAL FIELD
  • The disclosed embodiments concern a platform for management of artificial intelligence systems. In particular, the disclosed embodiments concern using the disclosed platform to create neural network models of data based on previously existing models, including legacy models. A legacy model is a model that runs in a programming environment that has been or is being replaced by a different programming environment. These data models can be used, for example, to generate synthetic data for testing or training artificial intelligence systems. The disclosed embodiments also concern improvements in transforming any model, including a legacy model, into a neural network model.
  • BACKGROUND
  • Neural network models provide advantages over conventional modeling approaches. Neural network models can model data more efficiently, adaptably, and accurately than conventional models. Neural network models may be recurrent neural network models, deep learning models, long short-term memory (LSTM) models, convolutional neural network (CNN) models, generative adversarial networks (GANs), and the like. Neural network models meet different modeling needs than conventional models and can analyze a wide variety of data as data analysis objectives evolve.
  • A neural network model may be designed according to an underlying physical or relational understanding of the data structure. Further, to assist user understanding of a model, it may be desirable to base the neural network model on an original model, e.g., a conventional model, that captures physical or statistical relationships between data elements. Further, a neural network model designed based on the features of the original model may behave more like the original model when encountering new datasets, and thereby yield similar insights that are easier for the user to understand. Therefore, it is often desirable to transform original models into neural network models instead of, for instance, replacing original models with completely new neural network models. For example, a neural network model may train more efficiently and more accurately predict an outcome if certain nodes are trained to replicate a linear regression model that had already identified certain statistically significant relationships between input variables. Further, a previously trained predictive neural network model may be an efficient seed model for an updated, more accurate predictive neural network model because the previously trained model may have learned to make predictions with some accuracy.
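  • As a minimal sketch of this idea (illustrative numbers, plain-PYTHON gradient descent rather than any particular framework), a single linear neuron can be seeded with an already-fitted regression's coefficients so that it replicates the regression exactly before fine-tuning on new data:

```python
# A previously fitted linear regression (hypothetical coefficients).
w_reg, b_reg = 2.0, 1.0
def regression(x):
    return w_reg * x + b_reg

# Seed a single linear neuron with the regression's weights so it starts
# out replicating the regression's predictions exactly.
w, b = w_reg, b_reg

xs = [0.0, 1.0, 2.5, -3.0]
seeded_error = max(abs((w * x + b) - regression(x)) for x in xs)

# Fine-tune on (hypothetical) new data with plain gradient descent; the
# seeded start means only a small adjustment is needed.
new_data = [(x, 2.0 * x + 1.5) for x in xs]
lr = 0.01
for _ in range(2000):
    for x, y in new_data:
        err = (w * x + b) - y
        w -= lr * err * x
        b -= lr * err
```

  • Because training starts from the regression's solution rather than a random initialization, the neuron converges quickly and its learned relationship remains interpretable relative to the original model.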
  • Conventional approaches to transforming a model into a neural network model involve time-consuming efforts to oversee the development of the neural network model. Many organizations lack the resources or capacity to engage in such approaches.
  • In addition, there is a need to transform models of a given type (e.g., a random forest model, a gradient boosting machine, a regression model, a linear regression model, a symbolic model, a neural network model, or other model), into a new model of the same type. This may happen, for instance, with a legacy model. For example, an organization may be in the process of phasing out SAS and implementing PYTHON across a platform, so the organization may need to transform legacy SAS models to PYTHON models.
  • Conventionally, to transform a legacy model into a model running in a new environment, developers manually perform many steps of the transformation, a time-consuming, costly, and error-prone process. For example, to transform a random forest model built with SCIKIT into a gradient boosting model built for XGBOOST, the developer must configure the system, install dependencies, import libraries, modify code to accept the new output types of XGBOOST, and remove outdated dependencies. Any of these steps could lead to errors and extensive time spent debugging the new model code. Many organizations lack the resources or capacity to engage in conventional approaches to transforming a legacy model.
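  • One way to frame such a transformation is as model distillation: apply the legacy model to a dataset, then fit the new-environment model to the legacy model's outputs instead of porting code by hand. The sketch below uses stand-ins (a hard-coded linear rule for the legacy model and a closed-form least-squares fit for the new model); the disclosed embodiments operate on real model types such as SCIKIT and XGBOOST models.

```python
def legacy_model(x):
    """Stand-in for a legacy model whose code is costly to port by hand."""
    return 3.0 * x - 2.0  # hypothetical learned relationship

# Step 1: apply the legacy model to a dataset, collecting input/output pairs.
xs = [float(i) for i in range(-5, 6)]
ys = [legacy_model(x) for x in xs]

# Step 2: fit the replacement model to those pairs. A closed-form
# least-squares fit stands in for the new framework's training routine.
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
w_new = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
b_new = mean_y - w_new * mean_x

def new_model(x):
    """Replacement model reproducing the legacy model's behavior."""
    return w_new * x + b_new
```

  • The new model is trained only against the legacy model's observed behavior, so no manual translation of the legacy code is required.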
  • Therefore, in light of the shortcomings and deficiencies of conventional methods of modeling and of transforming models, systems and methods that reduce development time, reduce cost, improve modeling accuracy, and increase flexibility are desirable. There is a need for systems and methods that transform a model of any type, e.g., a legacy model, into a new neural network model. There is also a need for systems and methods that transform a legacy model running in one type of environment into a model (e.g., a conventional model or a neural network model) that can run in a different environment. Further, for organizations or users that lack the resources or capacity to transform models independently, there is a need for model transformation to be provided as a service.
  • SUMMARY
  • The disclosed embodiments provide unconventional systems and methods for transforming models that are more efficient, less costly, more flexible, and more accurate than conventional approaches. The disclosed embodiments provide unconventional systems and methods to transform any input model into a neural network model. The transformed neural network model is designed based on the features of the input model and is designed to overfit the input model. By overfitting the input model, the transformed neural network model accurately reproduces the modeling results of the input model on training datasets and is likely to behave like the input model when analyzing other, non-training datasets.
  • Further, the disclosed embodiments provide systems and methods to transform legacy models into new models of the same type using machine learning. By using neural network models to create a new model that is the same type as the legacy model but will run in a different environment than the legacy model, the unconventional systems and methods of disclosed embodiments save time, reduce costs, and reduce errors.
  • The disclosed embodiments include a system for transforming a model into a neural network model. The system may include one or more memory units for storing instructions, and one or more processors configured to execute the instructions to perform operations. The operations may include receiving input data comprising an input model, an input dataset, and an input command specifying model selection criteria. The operations may include applying the input model to the input dataset to generate model output. The operations may include storing model output and at least one of input model features or a map of the input model and generating a plurality of candidate neural network models. The parameters of the candidate neural network models may be based on the input model features. The operations may include tuning the plurality of candidate neural network models to the input model. The operations may include receiving model output from the plurality of candidate neural network models and selecting a neural network model from the plurality of the candidate neural network models based on the candidate model output and the model selection criteria. In some aspects, the operations may include returning the selected neural network model.
  • Consistent with disclosed embodiments, a method for transforming a model into a neural network model is disclosed. The method may include receiving input data comprising an input model, an input dataset, and an input command specifying model selection criteria. The method may include applying the input model to the input dataset to generate model output. The method may include storing model output and at least one of input model features or a map of the input model and generating a plurality of candidate neural network models. The parameters of the candidate neural network models may be based on the input model features. The method may include tuning the plurality of candidate neural network models to the input model. The method may include receiving model output from the plurality of candidate neural network models and selecting a neural network model from the plurality of the candidate neural network models based on the candidate model output and the model selection criteria. In some aspects, the method may include returning the selected neural network model.
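  • The receive-generate-tune-select flow above can be illustrated with a toy example in which each candidate is a one-parameter model a*x**degree, tuning is a least-squares fit to the input model's output, and the selection criterion is minimum squared error. The model family and criterion here are illustrative assumptions, not the disclosed candidate neural network models.

```python
def input_model(x):
    """Stand-in for the received input model."""
    return x * x

dataset = [0.5 * i for i in range(-6, 7) if i != 0]
target = {x: input_model(x) for x in dataset}  # stored model output

candidates = []
for degree in (1, 2, 3):  # generate candidates with different parameters
    # Tune: least-squares fit of a * x**degree to the input model's output.
    num = sum(target[x] * x ** degree for x in dataset)
    den = sum((x ** degree) ** 2 for x in dataset)
    a = num / den
    # Candidate model output compared against the input model's output.
    error = sum((a * x ** degree - target[x]) ** 2 for x in dataset)
    candidates.append((error, degree, a))

# Select the candidate satisfying the selection criterion (lowest error).
best_error, best_degree, best_a = min(candidates)
```

  • The selected candidate is the one whose output best reproduces the input model's output on the input dataset, mirroring the selection step of the method.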
  • Consistent with other disclosed embodiments, non-transitory computer readable storage media may store program instructions, which are executed by at least one processor device and perform any of the methods described herein.
  • The foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the claims.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • The drawings are not necessarily to scale or exhaustive. Instead, emphasis is generally placed upon illustrating the principles of the embodiments described herein. The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate several embodiments consistent with the disclosure and, together with the description, serve to explain the principles of the disclosure. In the drawings:
  • FIG. 1 depicts an exemplary cloud-computing environment for generating data models, consistent with disclosed embodiments.
  • FIG. 2 depicts an exemplary process for generating data models, consistent with disclosed embodiments.
  • FIG. 3 depicts an exemplary process for generating synthetic data using existing data models, consistent with disclosed embodiments.
  • FIG. 4 depicts an exemplary implementation of the cloud-computing environment of FIG. 1, consistent with disclosed embodiments.
  • FIG. 5A depicts an exemplary process for generating synthetic data using class-specific models, consistent with disclosed embodiments.
  • FIG. 5B depicts an exemplary process for generating synthetic data using class and subclass-specific models, consistent with disclosed embodiments.
  • FIG. 6 depicts an exemplary process for training a classifier for generation of synthetic data, consistent with disclosed embodiments.
  • FIG. 7 depicts an exemplary process for training a classifier for generation of synthetic data, consistent with disclosed embodiments.
  • FIG. 8 depicts an exemplary process for training a generative adversarial network using a normalized reference dataset, consistent with disclosed embodiments.
  • FIG. 9 depicts an exemplary process for training a generative adversarial network using a loss function configured to ensure a predetermined degree of similarity, consistent with disclosed embodiments.
  • FIG. 10 depicts an exemplary process for supplementing or transforming datasets using code-space operations, consistent with disclosed embodiments.
  • FIGS. 11A and 11B depict an exemplary illustration of points in code-space, consistent with disclosed embodiments.
  • FIG. 12A depicts an exemplary illustration of supplementing datasets using code-space operations, consistent with disclosed embodiments.
  • FIG. 12B depicts an exemplary illustration of transforming datasets using code-space operations, consistent with disclosed embodiments.
  • FIG. 13 depicts an exemplary cloud computing system for generating a synthetic data stream that tracks a reference data stream, consistent with disclosed embodiments.
  • FIG. 14 depicts a process for generating synthetic JSON log data using the cloud computing system of FIG. 13, consistent with disclosed embodiments.
  • FIG. 15 depicts a system for secure generation and insecure use of models of sensitive data, consistent with disclosed embodiments.
  • FIG. 16 depicts a process for transforming any model into a neural network model, consistent with disclosed embodiments.
  • FIG. 17 depicts a process for transforming a legacy model, consistent with disclosed embodiments.
  • DETAILED DESCRIPTION
  • Reference will now be made in detail to exemplary embodiments, discussed with regards to the accompanying drawings. In some instances, the same reference numbers will be used throughout the drawings and the following description to refer to the same or like parts. Unless otherwise defined, technical and/or scientific terms have the meaning commonly understood by one of ordinary skill in the art. The disclosed embodiments are described in sufficient detail to enable those skilled in the art to practice the disclosed embodiments. It is to be understood that other embodiments may be utilized and that changes may be made without departing from the scope of the disclosed embodiments. Thus, the materials, methods, and examples are illustrative only and are not intended to be necessarily limiting.
  • The disclosed embodiments can be used to create models of datasets, which may include sensitive datasets (e.g., customer financial information, patient healthcare information, and the like). Using these models, the disclosed embodiments can produce fully synthetic datasets with similar structure and statistics as the original sensitive or non-sensitive datasets. The disclosed embodiments also provide tools for desensitizing datasets and tokenizing sensitive values. In some embodiments, the disclosed systems can include a secure environment for training a model of sensitive data, and a non-secure environment for generating synthetic data with similar structure and statistics as the original sensitive data. In various embodiments, the disclosed systems can be used to tokenize the sensitive portions of a dataset (e.g., mailing addresses, social security numbers, email addresses, account numbers, demographic information, and the like). In some embodiments, the disclosed systems can be used to replace parts of sensitive portions of the dataset (e.g., preserve the first or last 3 digits of an account number, social security number, or the like; change a name to a first and last initial). In some aspects, the dataset can include one or more JSON (JavaScript Object Notation) or delimited files (e.g., comma-separated value, or CSV, files). In various embodiments, the disclosed systems can automatically detect sensitive portions of structured and unstructured datasets and automatically replace them with similar but synthetic values.
  • FIG. 1 depicts a cloud-computing environment 100 for generating data models. Environment 100 can be configured to support generation and storage of synthetic data, generation and storage of data models, optimized choice of parameters for machine learning, and imposition of rules on synthetic data and data models. Environment 100 can be configured to expose an interface for communication with other systems. Environment 100 can include computing resources 101, dataset generator 103, database 105, model optimizer 107, model storage 109, model curator 111, and interface 113. These components of environment 100 can be configured to communicate with each other, or with external components of environment 100, using network 115. The particular arrangement of components depicted in FIG. 1 is not intended to be limiting. System 100 can include additional components, or fewer components. Multiple components of system 100 can be implemented using the same physical computing device or different physical computing devices.
  • Computing resources 101 can include one or more computing devices configurable to train data models. The computing devices can be special-purpose computing devices, such as graphical processing units (GPUs) or application-specific integrated circuits, or general-purpose cloud computing instances. The computing devices can be configured to host an environment for training data models. For example, the computing devices can host virtual machines, pods, or containers. The computing devices can be configured to run applications for generating data models. For example, the computing devices can be configured to run SAGEMAKER or similar machine learning training applications. Computing resources 101 can be configured to receive models for training from model optimizer 107, model storage 109, or another component of system 100. Computing resources 101 can be configured to provide training results, including trained models and model information, such as the type and/or purpose of the model and any measures of classification error.
  • Dataset generator 103 can include one or more computing devices configured to generate data. Dataset generator 103 can be configured to provide data to computing resources 101, database 105, another component of system 100 (e.g., interface 113), or another system (e.g., an APACHE KAFKA cluster or other publication service). Dataset generator 103 can be configured to receive data from database 105 or another component of system 100. Dataset generator 103 can be configured to receive data models from model storage 109 or another component of system 100. Dataset generator 103 can be configured to generate synthetic data. For example, dataset generator 103 can be configured to generate synthetic data by identifying and replacing sensitive information in data received from database 105 or interface 113. As an additional example, dataset generator 103 can be configured to generate synthetic data using a data model without reliance on input data. For example, the data model can be configured to generate data matching statistical and content characteristics of a training dataset. In some aspects, the data model can be configured to map from a random or pseudorandom vector to elements in the training data space.
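  • The mapping from a random vector to the training data space can be illustrated, without a full generative adversarial network, by a generator that scales standard-normal noise to the training data's statistics. This is a deliberate simplification of the generative network described; the dataset and seed are illustrative assumptions.

```python
import random
import statistics

# Actual training data (hypothetical single-column dataset).
training = [10.0, 12.0, 11.0, 13.0, 9.0, 12.5, 10.5, 11.5]
mu = statistics.mean(training)
sigma = statistics.stdev(training)

def generate(n, seed=42):
    """Map pseudorandom vectors to the training data space by scaling
    standard-normal noise to the training set's mean and spread."""
    rng = random.Random(seed)
    return [mu + sigma * rng.gauss(0.0, 1.0) for _ in range(n)]

synthetic = generate(1000)
```

  • Because the generator depends only on statistics of the training data, the synthetic samples can be shared without exposing the original records.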
  • Database 105 can include one or more databases configured to store data for use by system 100. The databases can include cloud-based databases (e.g., AMAZON WEB SERVICES S3 buckets) or on-premises databases.
  • Model optimizer 107 can include one or more computing systems configured to manage training of data models for system 100. Model optimizer 107 can be configured to generate models for export to computing resources 101. Model optimizer 107 can be configured to generate models based on instructions received from a user or another system. These instructions can be received through interface 113. For example, model optimizer 107 can be configured to receive a graphical depiction of a machine learning model and parse that graphical depiction into instructions for creating and training a corresponding neural network on computing resources 101. Model optimizer 107 can be configured to select model training parameters. This selection can be based on model performance feedback received from computing resources 101. Model optimizer 107 can be configured to provide trained models and descriptive information concerning the trained models to model storage 109.
  • Model storage 109 can include one or more databases configured to store data models and descriptive information for the data models. Model storage 109 can be configured to provide information regarding available data models to a user or another system. This information can be provided using interface 113. The databases can include cloud-based databases (e.g., AMAZON WEB SERVICES S3 buckets) or on-premises databases. The information can include model information, such as the type and/or purpose of the model and any measures of classification error.
  • Model curator 111 can be configured to impose governance criteria on the use of data models. For example, model curator 111 can be configured to delete or control access to models that fail to meet accuracy criteria. As a further example, model curator 111 can be configured to limit the use of a model to a particular purpose, or by a particular entity or individual. In some aspects, model curator 111 can be configured to ensure that a data model satisfies governance criteria before system 100 can process data using the data model.
  • Interface 113 can be configured to manage interactions between system 100 and other systems using network 115. In some aspects, interface 113 can be configured to publish data received from other components of system 100 (e.g., dataset generator 103, computing resources 101, database 105, or the like). This data can be published in a publication and subscription framework (e.g., using APACHE KAFKA), through a network socket, in response to queries from other systems, or using other known methods. The data can be synthetic data, as described herein. As an additional example, interface 113 can be configured to provide information received from model storage 109 regarding available datasets. In various aspects, interface 113 can be configured to provide data or instructions received from other systems to components of system 100. For example, interface 113 can be configured to receive instructions for generating data models (e.g., type of data model, data model parameters, training data indicators, training parameters, or the like) from another system and provide this information to model optimizer 107. As an additional example, interface 113 can be configured to receive data including sensitive portions from another system (e.g. in a file, a message in a publication and subscription framework, a network socket, or the like) and provide that data to dataset generator 103 or database 105.
  • Network 115 can include any combination of electronics communications networks enabling communication between components of system 100. For example, network 115 may include the Internet and/or any type of wide area network, an intranet, a metropolitan area network, a local area network (LAN), a wireless network, a cellular communications network, a Bluetooth network, a radio network, a device bus, or any other type of electronics communications network known to one of skill in the art.
  • FIG. 2 depicts a process 200 for generating data models. Process 200 can be used to generate a data model for a machine learning application, consistent with disclosed embodiments. The data model can be generated using synthetic data in some aspects. This synthetic data can be generated using a synthetic dataset model, which can in turn be generated using actual data. The synthetic data may be similar to the actual data in terms of values, value distributions (e.g., univariate and multivariate statistics of the synthetic data may be similar to that of the actual data), structure and ordering, or the like. In this manner, the data model for the machine learning application can be generated without directly using the actual data. As the actual data may include sensitive information, and generating the data model may require distribution and/or review of training data, the use of the synthetic data can protect the privacy and security of the entities and/or individuals whose activities are recorded by the actual data.
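  • A similarity check on univariate statistics, of the kind referenced above, might look like the following sketch. The actual similarity metrics of the disclosed embodiments are richer; the relative-difference measure here is an illustrative assumption.

```python
import statistics

def univariate_similarity(actual, synthetic):
    """Return the largest relative difference between simple univariate
    statistics (mean, standard deviation) of two columns; 0.0 means the
    statistics match. A stand-in for richer similarity metrics."""
    diffs = []
    for stat in (statistics.mean, statistics.stdev):
        a, s = stat(actual), stat(synthetic)
        denom = abs(a) if a != 0 else 1.0
        diffs.append(abs(a - s) / denom)
    return max(diffs)

actual = [1.0, 2.0, 3.0, 4.0, 5.0]
matched = univariate_similarity(actual, [5.0, 4.0, 3.0, 2.0, 1.0])
shifted = univariate_similarity(actual, [11.0, 12.0, 13.0, 14.0, 15.0])
```

  • A low score indicates the synthetic column reproduces the actual column's statistics; a high score flags a mismatch for review.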
  • Process 200 can then proceed to step 201. In step 201, interface 113 can provide a data model generation request to model optimizer 107. The data model generation request can include data and/or instructions describing the type of data model to be generated. For example, the data model generation request can specify a general type of data model (e.g., neural network, recurrent neural network, generative adversarial network, kernel density estimator, random data generator, or the like) and parameters specific to the particular type of model (e.g., the number of features and number of layers in a generative adversarial network or recurrent neural network). In some embodiments, a recurrent neural network can include long short term memory modules (LSTM units), or the like.
  • Process 200 can then proceed to step 203. In step 203, one or more components of system 100 can interoperate to generate a data model. For example, as described in greater detail with regard to FIG. 3, a data model can be trained using computing resources 101 using data provided by dataset generator 103. In some aspects, this data can be generated using dataset generator 103 from data stored in database 105. In various aspects, the data used to train dataset generator 103 can be actual or synthetic data retrieved from database 105. This training can be supervised by model optimizer 107, which can be configured to select model parameters (e.g., number of layers for a neural network, kernel function for a kernel density estimator, or the like), update training parameters, and evaluate model characteristics (e.g., the similarity of the synthetic data generated by the model to the actual data). In some embodiments, model optimizer 107 can be configured to provision computing resources 101 with an initialized data model for training. The initialized data model can be, or can be based upon, a model retrieved from model storage 109.
  • Process 200 can then proceed to step 205. In step 205, model optimizer 107 can evaluate the performance of the trained synthetic data model. When the performance of the trained synthetic data model satisfies performance criteria, model optimizer 107 can be configured to store the trained synthetic data model in model storage 109. For example, model optimizer 107 can be configured to determine one or more values for similarity and/or predictive accuracy metrics, as described herein. In some embodiments, based on values for similarity metrics, model optimizer 107 can be configured to assign a category to the synthetic data model.
  • According to a first category, the synthetic data model generates data maintaining a moderate level of correlation or similarity with the original data, matches well with the original schema, and does not generate too many row or value duplicates. According to a second category, the synthetic data model may generate data maintaining a high level of correlation or similarity with the original data, and therefore could potentially allow the original data to be discerned from the synthetic data (e.g., a data leak). A synthetic data model generating data failing to match the schema of the original data or providing many duplicated rows and values may also be placed in this category. According to a third category, the synthetic data model may likely generate data maintaining a very high level of correlation or similarity with the original data, likely allowing a data leak. A synthetic data model generating data badly failing to match the schema of the original data or providing far too many duplicated rows and values may also be placed in this category.
  • In some embodiments, system 100 can be configured to provide instructions for improving the quality of the synthetic data model. If a user requires synthetic data reflecting less correlation or similarity with the original data, the user can change the model's parameters to make it perform worse (e.g., by decreasing the number of layers in GAN models or reducing the number of training iterations). If the user wants the synthetic data to have better quality, they can change the model's parameters to make it perform better (e.g., by increasing the number of layers in GAN models or increasing the number of training iterations).
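  • The three-category assignment above can be sketched as a simple rule. The thresholds below are illustrative assumptions only, not values from the disclosure.

```python
def categorize(similarity, schema_error_rate, duplicate_rate):
    """Assign a synthetic data model to one of the three categories
    described above, given a similarity score in [0, 1], a schema
    mismatch rate, and a duplicate-row rate."""
    if similarity > 0.95 or schema_error_rate > 0.5 or duplicate_rate > 0.5:
        return 3  # likely data leak, or badly failing schema / duplicates
    if similarity > 0.80 or schema_error_rate > 0.1 or duplicate_rate > 0.1:
        return 2  # potential data leak, or schema / duplicate problems
    return 1  # moderate similarity, good schema match, few duplicates
```

  • Models in the second and third categories would be candidates for the parameter adjustments described above before release.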
  • Process 200 can then proceed to step 207. In step 207, model curator 111 can evaluate the trained synthetic data model for compliance with governance criteria.
  • FIG. 3 depicts a process 300 for generating a data model using an existing synthetic data model, consistent with disclosed embodiments. Process 300 can include the steps of retrieving a synthetic dataset model from model storage 109, retrieving data from database 105, providing synthetic data to computing resources 101, providing an initialized data model to computing resources 101, and providing a trained data model to model optimizer 107. In this manner, process 300 can allow system 100 to generate a model using synthetic data.
  • Process 300 can then proceed to step 301. In step 301, dataset generator 103 can retrieve a training dataset from database 105. The training dataset can include actual training data, in some aspects. The training dataset can include synthetic training data, in some aspects. In some embodiments, dataset generator 103 can be configured to generate synthetic data from sample values. For example, dataset generator 103 can be configured to use the generative network of a generative adversarial network to generate data samples from random-valued vectors. In such embodiments, process 300 may forgo step 301.
  • Process 300 can then proceed to step 303. In step 303, dataset generator 103 can be configured to receive a synthetic data model from model storage 109. In some embodiments, model storage 109 can be configured to provide the synthetic data model to dataset generator 103 in response to a request from dataset generator 103. In various embodiments, model storage 109 can be configured to provide the synthetic data model to dataset generator 103 in response to a request from model optimizer 107, or another component of system 100. As a non-limiting example, the synthetic data model can be a neural network, recurrent neural network (which may include LSTM units), generative adversarial network, kernel density estimator, random value generator, or the like.
  • Process 300 can then proceed to step 305. In step 305, in some embodiments, dataset generator 103 can generate synthetic data. Dataset generator 103 can be configured, in some embodiments, to identify sensitive data items (e.g., account numbers, social security numbers, names, addresses, API keys, network or IP addresses, or the like) in the data received from model storage 109. In some embodiments, dataset generator 103 can be configured to identify sensitive data items using a recurrent neural network. Dataset generator 103 can be configured to use the data model retrieved from model storage 109 to generate a synthetic dataset by replacing the sensitive data items with synthetic data items.
  • Dataset generator 103 can be configured to provide the synthetic dataset to computing resources 101. In some embodiments, dataset generator 103 can be configured to provide the synthetic dataset to computing resources 101 in response to a request from computing resources 101, model optimizer 107, or another component of system 100. In various embodiments, dataset generator 103 can be configured to provide the synthetic dataset to database 105 for storage. In such embodiments, computing resources 101 can be configured to subsequently retrieve the synthetic dataset from database 105 directly, or indirectly through model optimizer 107 or dataset generator 103.
  • Process 300 can then proceed to step 307. In step 307, computing resources 101 can be configured to receive a data model from model optimizer 107, consistent with disclosed embodiments. In some embodiments, the data model can be at least partially initialized by model optimizer 107. For example, at least some of the initial weights and offsets of a neural network model received by computing resources 101 in step 307 can be set by model optimizer 107. In various embodiments, computing resources 101 can be configured to receive at least some training parameters from model optimizer 107 (e.g., batch size, number of training batches, number of epochs, chunk size, time window, input noise dimension, or the like).
  • Process 300 can then proceed to step 309. In step 309, computing resources 101 can generate a trained data model using the data model received from model optimizer 107 and the synthetic dataset received from dataset generator 103. For example, computing resources 101 can be configured to train the data model received from model optimizer 107 until some training criterion is satisfied. The training criterion can be, for example, a performance criterion (e.g., a Mean Absolute Error, Root Mean Squared Error, percent good classification, and the like), a convergence criterion (e.g., a minimum required improvement of a performance criterion over iterations or over time, a minimum required change in model parameters over iterations or over time), elapsed time or number of iterations, or the like. In some embodiments, the performance criterion can be a threshold value for a similarity metric or prediction accuracy metric as described herein. Satisfaction of the training criterion can be determined by one or more of computing resources 101 and model optimizer 107. In some embodiments, computing resources 101 can be configured to update model optimizer 107 regarding the training status of the data model. For example, computing resources 101 can be configured to provide the current parameters of the data model and/or current performance criteria of the data model. In some embodiments, model optimizer 107 can be configured to stop the training of the data model by computing resources 101. In various embodiments, model optimizer 107 can be configured to retrieve the data model from computing resources 101. In some embodiments, computing resources 101 can be configured to stop training the data model and provide the trained data model to model optimizer 107.
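The convergence-criterion check described for step 309 can be sketched in a few lines. This is an illustrative sketch only: the helper name `satisfies_criterion`, the patience window, and the simulated loss values are assumptions for the example, not part of the disclosed embodiments.

```python
def satisfies_criterion(history, min_improvement=0.002, patience=3):
    """Convergence criterion: stop when each of the last `patience`
    iterations improved the loss by less than `min_improvement`."""
    if len(history) < patience + 1:
        return False
    recent = history[-(patience + 1):]
    improvements = [recent[i] - recent[i + 1] for i in range(patience)]
    return all(imp < min_improvement for imp in improvements)

# Simulated per-epoch losses: rapid early improvement, then a plateau.
losses = [1.0, 0.5, 0.3, 0.299, 0.2985, 0.2984]
history = []
stopped_at = None
for epoch, loss in enumerate(losses):
    history.append(loss)
    if satisfies_criterion(history):
        stopped_at = epoch  # training would stop here
        break

print(stopped_at)  # 5
```

In practice the criterion could equally be a performance threshold (e.g., a maximum Root Mean Squared Error) or an elapsed-time limit, as the passage above notes; the patience-based form shown here is only one of the listed options.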
  • FIG. 4 depicts a specific implementation (system 400) of system 100 of FIG. 1. As shown in FIG. 4, the functionality of system 100 can be divided between a distributor 401, a dataset generation instance 403, a development environment 405, a model optimization instance 409, and a production environment 411. In this manner, system 100 can be implemented in a stable and scalable fashion using a distributed computing environment, such as a public cloud-computing environment, a private cloud computing environment, a hybrid cloud computing environment, a computing cluster or grid, or the like. As present computing requirements increase for a component of system 400 (e.g., as production environment 411 is called upon to instantiate additional production instances to address requests for additional synthetic data streams), additional physical or virtual machines can be recruited to that component. In some embodiments, dataset generator 103 and model optimizer 107 can be hosted by separate virtual computing instances of the cloud computing system.
  • Distributor 401 can be configured to provide, consistent with disclosed embodiments, an interface between the components of system 400, and between the components of system 400 and other systems. In some embodiments, distributor 401 can be configured to implement interface 113 and a load balancer. Distributor 401 can be configured to route messages between computing resources 101 (e.g., implemented on one or more of development environment 405 and production environment 411), dataset generator 103 (e.g., implemented on dataset generator instance 403), and model optimizer 107 (e.g., implemented on model optimization instance 409). The messages can include data and instructions. For example, the messages can include model generation requests and trained models provided in response to model generation requests. As an additional example, the messages can include synthetic data sets or synthetic data streams. Consistent with disclosed embodiments, distributor 401 can be implemented using one or more EC2 clusters or the like.
  • Dataset generation instance 403 can be configured to generate synthetic data, consistent with disclosed embodiments. In some embodiments, dataset generation instance 403 can be configured to receive actual or synthetic data from data source 417. In various embodiments, dataset generation instance 403 can be configured to receive synthetic data models for generating the synthetic data. In some aspects, the synthetic data models can be received from another component of system 400, such as data source 417.
  • Development environment 405 can be configured to implement at least a portion of the functionality of computing resources 101, consistent with disclosed embodiments. For example, development environment 405 can be configured to train data models for subsequent use by other components of system 400. In some aspects, development instances (e.g., development instance 407) hosted by development environment 405 can train one or more individual data models. In some aspects, development environment 405 can be configured to spin up additional development instances to train additional data models, as needed. In some aspects, a development instance can implement an application framework such as TENSORBOARD, JUPYTER and the like; as well as machine learning applications like TENSORFLOW, CUDNN, KERAS, and the like. Consistent with disclosed embodiments, these application frameworks and applications can enable the specification and training of data models. In various aspects, development environment 405 can be implemented using one or more EC2 clusters or the like.
  • Model optimization instance 409 can be configured to manage training and provision of data models by system 400. In some aspects, model optimization instance 409 can be configured to provide the functionality of model optimizer 107. For example, model optimization instance 409 can be configured to provide training parameters and at least partially initialized data models to development environment 405. This selection can be based on model performance feedback received from development environment 405. As an additional example, model optimization instance 409 can be configured to determine whether a data model satisfies performance criteria. In some aspects, model optimization instance 409 can be configured to provide trained models and descriptive information concerning the trained models to another component of system 400. In various aspects, model optimization instance 409 can be implemented using one or more EC2 clusters or the like.
  • Production environment 411 can be configured to implement at least a portion of the functionality of computing resources 101, consistent with disclosed embodiments. For example, production environment 411 can be configured to use previously trained data models to process data received by system 400. In some aspects, a production instance (e.g., production instance 413) hosted by production environment 411 can be configured to process data using a previously trained data model. In some aspects, the production instance can implement an application framework such as TENSORBOARD, JUPYTER and the like; as well as machine learning applications like TENSORFLOW, CUDNN, KERAS, and the like. Consistent with disclosed embodiments, these application frameworks and applications can enable processing of data using data models. In various aspects, production environment 411 can be implemented using one or more EC2 clusters or the like.
  • A component of system 400 (e.g., model optimization instance 409) can determine the data model and data source for a production instance according to the purpose of the data processing. For example, system 400 can configure a production instance to produce synthetic data for consumption by other systems. In this example, the production instance can then provide synthetic data for testing another application. As a further example, system 400 can configure a production instance to generate outputs using actual data. For example, system 400 can configure a production instance with a data model for detecting fraudulent transactions. The production instance can then receive a stream of financial transaction data and identify potentially fraudulent transactions. In some aspects, this data model may have been trained by system 400 using synthetic data created to resemble the stream of financial transaction data. System 400 can be configured to provide an indication of the potentially fraudulent transactions to another system configured to take appropriate action (e.g., reversing the transaction, contacting one or more of the parties to the transaction, or the like).
  • Production environment 411 can be configured to host a file system 415 for interfacing between one or more production instances and data source 417. For example, data source 417 can be configured to store data in file system 415, while the one or more production instances can be configured to retrieve the stored data from file system 415 for processing. In some embodiments, file system 415 can be configured to scale as needed. In various embodiments, file system 415 can be configured to support parallel access by data source 417 and the one or more production instances. For example, file system 415 can be an instance of AMAZON ELASTIC FILE SYSTEM (EFS) or the like.
  • Data source 417 can be configured to provide data to other components of system 400. In some embodiments, data source 417 can include sources of actual data, such as streams of transaction data, human resources data, web log data, web security data, web protocols data, or system logs data. System 400 can also be configured to implement model storage 109 using a database (not shown) accessible to at least one other component of system 400 (e.g., distributor 401, dataset generation instance 403, development environment 405, model optimization instance 409, or production environment 411). In some aspects, the database can be an S3 bucket, relational database, or the like.
  • FIG. 5A depicts process 500 for generating synthetic data using class-specific models, consistent with disclosed embodiments. System 100, or a similar system, may be configured to use such synthetic data in training a data model for use in another application (e.g., a fraud detection application). Process 500 can include the steps of retrieving actual data, determining classes of sensitive portions of the data, generating synthetic data using a data model for the appropriate class, and replacing the sensitive data portions with the synthetic data portions. In some embodiments, the data model can be a generative adversarial network trained to generate synthetic data satisfying a similarity criterion, as described herein. By using class-specific models, process 500 can generate synthetic data that models the underlying actual data more accurately than randomly generated training data lacking the latent structures present in the actual data. Because the synthetic data more accurately models the underlying actual data, a data model trained using this improved synthetic data may perform better when processing the actual data.
  • Process 500 can then proceed to step 501. In step 501, dataset generator 103 can be configured to retrieve actual data. As a non-limiting example, the actual data may have been gathered during the course of ordinary business operations, marketing operations, research operations, or the like. Dataset generator 103 can be configured to retrieve the actual data from database 105 or from another system. The actual data may have been purchased in whole or in part by an entity associated with system 100. As would be understood from this description, the source and composition of the actual data is not intended to be limiting.
  • Process 500 can then proceed to step 503. In step 503, dataset generator 103 can be configured to determine classes of the sensitive portions of the actual data. As a non-limiting example, when the actual data is account transaction data, classes could include account numbers and merchant names. As an additional non-limiting example, when the actual data is personnel records, classes could include employee identification numbers, employee names, employee addresses, contact information, marital or beneficiary information, title and salary information, and employment actions. Consistent with disclosed embodiments, dataset generator 103 can be configured with a classifier for distinguishing different classes of sensitive information. In some embodiments, dataset generator 103 can be configured with a recurrent neural network for distinguishing different classes of sensitive information. Dataset generator 103 can be configured to apply the classifier to the actual data to determine that a sensitive portion of the training dataset belongs to the data class. For example, when the data stream includes the text string “Lorem ipsum 012-34-5678 dolor sit amet,” the classifier may be configured to indicate that positions 13-23 of the text string include a potential social security number. Though described with reference to character string substitutions, the disclosed systems and methods are not so limited. As a non-limiting example, the actual data can include unstructured data (e.g., character strings, tokens, and the like) and structured data (e.g., key-value pairs, relational database files, spreadsheets, and the like).
  • Process 500 can then proceed to step 505. In step 505, dataset generator 103 can be configured to generate a synthetic portion using a class-specific model. To continue the previous example, dataset generator 103 can generate a synthetic social security number using a synthetic data model trained to generate social security numbers. In some embodiments, this class-specific synthetic data model can be trained to generate synthetic portions similar to those appearing in the actual data. For example, as social security numbers include an area number indicating geographic information and a group number indicating date-dependent information, the range of social security numbers present in an actual dataset can depend on the geographic origin and purpose of that dataset. A dataset of social security numbers for elementary school children in a particular school district may exhibit different characteristics than a dataset of social security numbers for employees of a national corporation. To continue the previous example, the social security-specific synthetic data model could generate the synthetic portion "013-74-3285."
  • Process 500 can then proceed to step 507. In step 507, dataset generator 103 can be configured to replace the sensitive portion of the actual data with the synthetic portion. To continue the previous example, dataset generator 103 could be configured to replace the characters at positions 13-23 of the text string with the values "013-74-3285," creating the synthetic text string "Lorem ipsum 013-74-3285 dolor sit amet." This text string can now be distributed without disclosing the sensitive information originally present. But this text string can still be used to train models that make valid inferences regarding the actual data, because synthetic social security numbers generated by the synthetic data model share the statistical characteristics of the actual data.
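The classify-and-replace flow of steps 503 through 507 can be sketched as follows. This is a hedged illustration only: a regular expression stands in for the recurrent-network classifier, and the fixed string returned by the hypothetical helper `generate_synthetic_ssn` stands in for a trained class-specific synthetic data model.

```python
import re

# Stand-in for the classifier of step 503: locate social security
# number patterns in the actual data.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def generate_synthetic_ssn():
    # Stand-in for the class-specific synthetic data model of step 505.
    return "013-74-3285"

def replace_sensitive(text):
    # Step 507: replace each sensitive portion with a synthetic portion.
    return SSN_PATTERN.sub(lambda match: generate_synthetic_ssn(), text)

actual = "Lorem ipsum 012-34-5678 dolor sit amet"
synthetic = replace_sensitive(actual)
print(synthetic)  # Lorem ipsum 013-74-3285 dolor sit amet
```

The span matched by the pattern corresponds to positions 13-23 in the example string above; in the disclosed embodiments the classifier would be a trained model rather than a pattern, and the replacement would be freshly generated for each match.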
  • FIG. 5B depicts a process 510 for generating synthetic data using class and subclass-specific models, consistent with disclosed embodiments. Process 510 can include the steps of retrieving actual data, determining classes of sensitive portions of the data, selecting types for synthetic data used to replace the sensitive portions of the actual data, generating synthetic data using a data model for the appropriate type and class, and replacing the sensitive data portions with the synthetic data portions. In some embodiments, the data model can be a generative adversarial network trained to generate synthetic data satisfying a similarity criterion, as described herein. This improvement addresses a problem with synthetic data generation: a synthetic data model may fail to generate examples of proportionately rare data subclasses. For example, when data can be classified into two distinct subclasses, with a second subclass far less prevalent in the data than a first subclass, a synthetic data model may generate only examples of the more prevalent first subclass. The synthetic data model effectively focuses on generating the best examples of the most common data subclasses, rather than acceptable examples of all the data subclasses. Process 510 addresses this problem by expressly selecting subclasses of the synthetic data class according to a distribution model based on the actual data.
  • Process 510 can then proceed through step 511 and step 513, which resemble step 501 and step 503 in process 500. In step 511, dataset generator 103 can be configured to receive actual data. In step 513, dataset generator 103 can be configured to determine classes of sensitive portions of the actual data. In a non-limiting example, dataset generator 103 can be configured to determine that a sensitive portion of the data may contain a financial service account number. Dataset generator 103 can be configured to identify this sensitive portion of the data as a financial service account number using a classifier, which may in some embodiments be a recurrent neural network (which may include LSTM units).
  • Process 510 can then proceed to step 515. In step 515, dataset generator 103 can be configured to select a subclass for generating the synthetic data. In some aspects, this selection is not governed by the subclass of the identified sensitive portion. For example, in some embodiments the classifier that identifies the class need not be sufficiently discerning to identify the subclass, relaxing the requirements on the classifier. Instead, this selection is based on a distribution model. For example, dataset generator 103 can be configured with a statistical distribution of subclasses (e.g., a univariate distribution of subclasses) for that class and can select one of the subclasses for generating the synthetic data according to the statistical distribution. To continue the previous example, individual accounts and trust accounts may both be financial service account numbers, but the values of these account numbers may differ between individual accounts and trust accounts. Furthermore, there may be 19 individual accounts for every 1 trust account. In this example, dataset generator 103 can be configured to select the trust account subclass 1 time in 20, and use a synthetic data model for financial service account numbers for trust accounts to generate the synthetic data. As a further example, dataset generator 103 can be configured with a recurrent neural network that estimates the next subclass based on the current and previous subclasses. For example, healthcare records can include cancer diagnosis stage as sensitive data. Most cancer diagnosis stage values may be "no cancer" and the value of "stage 1" may be rare, but when present in a patient record this value may be followed by "stage 2," etc. The recurrent neural network can be trained on the actual healthcare records to use prior and current cancer diagnosis stage values when selecting the subclass.
For example, when generating a synthetic healthcare record, the recurrent neural network can be configured to use the previously selected cancer diagnosis stage subclass in selecting the present cancer diagnosis stage subclass. In this manner, the synthetic healthcare record can exhibit an appropriate progression of patient health that matches the progression in the actual data.
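The distribution-driven subclass selection of step 515 can be sketched using the 19-to-1 individual/trust account ratio from the example above. The weight table, seed, and helper name are illustrative assumptions; the disclosed embodiments could equally use a recurrent network conditioned on prior subclasses, as described for the healthcare example.

```python
import random

# Subclass weights matching the example: 19 individual accounts for
# every 1 trust account, so trust is selected about 1 time in 20.
SUBCLASS_WEIGHTS = {"individual": 19, "trust": 1}

def select_subclass(rng):
    names = list(SUBCLASS_WEIGHTS)
    weights = [SUBCLASS_WEIGHTS[n] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed for reproducibility
draws = [select_subclass(rng) for _ in range(10000)]
trust_fraction = draws.count("trust") / len(draws)
print(trust_fraction)  # close to 1/20 = 0.05
```

Each selected subclass would then be routed to the matching class-and-subclass-specific synthetic data model in step 517.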
  • Process 510 can then proceed to step 517. In step 517, which resembles step 505, dataset generator 103 can be configured to generate synthetic data using a class and subclass specific model. To continue the previous financial service account number example, dataset generator 103 can be configured to use a synthetic data model for trust account financial service account numbers to generate the synthetic financial service account number.
  • Process 510 can then proceed to step 519. In step 519, which resembles step 507, dataset generator 103 can be configured to replace the sensitive portion of the actual data with the generated synthetic data. For example, dataset generator 103 can be configured to replace the financial service account number in the actual data with the synthetic trust account financial service account number.
  • FIG. 6 depicts a process 600 for training a classifier for generation of synthetic data. In some embodiments, such a classifier could be used by dataset generator 103 to classify sensitive data portions of actual data, as described above with regards to FIGS. 5A and 5B. Process 600 can include the steps of receiving data sequences, receiving context sequences, generating training sequences, generating label sequences, and training a classifier using the training sequences and the label sequences. By using known data sequences and context sequences unlikely to contain sensitive data, process 600 can be used to automatically generate a corpus of labeled training data. Process 600 can be performed by a component of system 100, such as dataset generator 103 or model optimizer 107.
  • Process 600 can then proceed to step 601. In step 601, system 100 can receive training data sequences. The training data sequences can be received from a dataset. The dataset providing the training data sequences can be a component of system 100 (e.g., database 105) or a component of another system. The data sequences can include multiple classes of sensitive data. As a non-limiting example, the data sequences can include account numbers, social security numbers, and full names.
  • Process 600 can then proceed to step 603. In step 603, system 100 can receive context sequences. The context sequences can be received from a dataset. The dataset providing the context sequences can be a component of system 100 (e.g., database 105) or a component of another system. In various embodiments, the context sequences can be drawn from a corpus of pre-existing data, such as an open-source text dataset (e.g., Yelp Open Dataset or the like). In some aspects, the context sequences can be snippets of this pre-existing data, such as a sentence or paragraph of the pre-existing data.
  • Process 600 can then proceed to step 605. In step 605, system 100 can generate training sequences. In some embodiments, system 100 can be configured to generate a training sequence by inserting a data sequence into a context sequence. The data sequence can be inserted into the context sequence without replacement of elements of the context sequence or with replacement of elements of the context sequence. The data sequence can be inserted into the context sequence between elements (e.g., at a whitespace character, tab, semicolon, html closing tag, or other semantic breakpoint) or without regard to the semantics of the context sequence. For example, when the context sequence is “Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod” and the data sequence is “013-74-3285,” the training sequence can be “Lorem ipsum dolor sit amet, 013-74-3285 consectetur adipiscing elit, sed do eiusmod,” “Lorem ipsum dolor sit amet, 013-74-3285 adipiscing elit, sed do eiusmod,” or “Lorem ipsum dolor sit amet, conse013-74-3285ctetur adipiscing elit, sed do eiusmod.” In some embodiments, a training sequence can include multiple data sequences.
  • Process 600 can then proceed to step 607. In step 607, system 100 can generate a label sequence. In some aspects, the label sequence can indicate a position of the inserted data sequence in the training sequence. In various aspects, the label sequence can indicate the class of the data sequence. As a non-limiting example, when the training sequence is "dolor sit amet, 013-74-3285 consectetur adipiscing," the label sequence can be "00000000000000001111111111100000000000000000000000," where the value "0" indicates that a character is not part of a sensitive data portion and the value "1" indicates that a character is part of the social security number. A different class or subclass of data sequence could include a different value specific to that class or subclass. Because system 100 creates the training sequences, system 100 can automatically create accurate labels for the training sequences.
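Steps 605 and 607 can be sketched together: inserting a data sequence into a context sequence at a whitespace breakpoint without replacement, then emitting the matching label sequence. The helper `make_training_example` is a hypothetical name, and the insertion position is chosen to reproduce the social security number example above.

```python
def make_training_example(context, data, insert_at):
    # Step 605: insert the data sequence plus a trailing separator at
    # the chosen position. Step 607: build a label sequence marking
    # only the inserted data characters with the class value "1".
    training = context[:insert_at] + data + " " + context[insert_at:]
    labels = ("0" * insert_at
              + "1" * len(data)
              + "0" * (1 + len(context) - insert_at))
    return training, labels

context = "dolor sit amet, consectetur adipiscing"
training, labels = make_training_example(context, "013-74-3285", 16)
print(training)  # dolor sit amet, 013-74-3285 consectetur adipiscing
print(labels)    # 00000000000000001111111111100000000000000000000000
```

Because the generator controls where the data sequence is inserted, the labels are exact by construction, which is what allows process 600 to mass-produce accurately labeled training data.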
  • Process 600 can then proceed to step 609. In step 609, system 100 can be configured to use the training sequences and the label sequences to train a classifier. In some aspects, the label sequences can provide a “ground truth” for training a classifier using supervised learning. In some embodiments, the classifier can be a recurrent neural network (which may include LSTM units). The recurrent neural network can be configured to predict whether a character of a training sequence is part of a sensitive data portion. This prediction can be checked against the label sequence to generate an update to the weights and offsets of the recurrent neural network. This update can then be propagated through the recurrent neural network, according to methods described in “Training Recurrent Neural Networks,” 2013, by Ilya Sutskever, which is incorporated herein by reference in its entirety.
  • FIG. 7 depicts a process 700 for training a classifier for generation of synthetic data, consistent with disclosed embodiments. According to process 700, a data sequence 701 can include preceding samples 703, current sample 705, and subsequent samples 707. In some embodiments, data sequence 701 can be a subset of a training sequence, as described above with regard to FIG. 6. Data sequence 701 may be applied to recurrent neural network 709. In some embodiments, neural network 709 can be configured to estimate whether current sample 705 is part of a sensitive data portion of data sequence 701 based on the values of preceding samples 703, current sample 705, and subsequent samples 707. In some embodiments, preceding samples 703 can include between 1 and 100 samples, for example between 25 and 75 samples. In various embodiments, subsequent samples 707 can include between 1 and 100 samples, for example between 25 and 75 samples. In some embodiments, the preceding samples 703 and the subsequent samples 707 can be paired and provided to recurrent neural network 709 together. For example, in a first iteration, the first sample of preceding samples 703 and the last sample of subsequent samples 707 can be provided to recurrent neural network 709. In the next iteration, the second sample of preceding samples 703 and the second-to-last sample of subsequent samples 707 can be provided to recurrent neural network 709. System 100 can continue to provide samples to recurrent neural network 709 until all of preceding samples 703 and subsequent samples 707 have been input to recurrent neural network 709. System 100 can then provide current sample 705 to recurrent neural network 709. The output of recurrent neural network 709 after the input of current sample 705 can be estimated label 711. Estimated label 711 can be the inferred class or subclass of current sample 705, given data sequence 701 as input. 
In some embodiments, estimated label 711 can be compared to actual label 713 to calculate a loss function. Actual label 713 can correspond to data sequence 701. For example, when data sequence 701 is a subset of a training sequence, actual label 713 can be an element of the label sequence corresponding to the training sequence. In some embodiments, actual label 713 can occupy the same position in the label sequence as occupied by current sample 705 in the training sequence. Consistent with disclosed embodiments, system 100 can be configured to update recurrent neural network 709 using loss function 715 based on a result of the comparison.
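The input ordering described for recurrent neural network 709 can be sketched as a simple list construction: each preceding sample is paired with a subsequent sample (first with last, second with second-to-last, and so on), and the current sample is provided last on its own. The sample names here are placeholders.

```python
def pairing_order(preceding, current, subsequent):
    # Pair the i-th preceding sample with the i-th-from-last
    # subsequent sample, then append the current sample alone.
    pairs = [(p, s) for p, s in zip(preceding, reversed(subsequent))]
    return pairs + [(current,)]

order = pairing_order(["p1", "p2", "p3"], "c", ["s1", "s2", "s3"])
print(order)  # [('p1', 's3'), ('p2', 's2'), ('p3', 's1'), ('c',)]
```

Feeding the network in this order means the final output, produced after the current sample, reflects the full surrounding context, which is then compared against the actual label to drive the loss.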
  • FIG. 8 depicts a process 800 for training a generative adversarial network using a normalized reference dataset. In some embodiments, the generative adversarial network can be used by system 100 (e.g., by dataset generator 103) to generate synthetic data (e.g., as described above with regards to FIGS. 2, 3, 5A and 5B). The generative adversarial network can include a generator network and a discriminator network. The generator network can be configured to learn a mapping from a sample space (e.g., a random number or vector) to a data space (e.g., the values of the sensitive data). The discriminator can be configured to determine, when presented with either an actual data sample or a sample of synthetic data generated by the generator network, whether the sample was generated by the generator network or was a sample of actual data. As training progresses, the generator can improve at generating the synthetic data and the discriminator can improve at determining whether a sample is actual or synthetic data. In this manner, a generator can be automatically trained to generate synthetic data similar to the actual data. However, a generative adversarial network can be limited by the actual data. For example, an unmodified generative adversarial network may be unsuitable for use with categorical data or data including missing values, not-a-numbers, or the like. For example, the generative adversarial network may not know how to interpret such data. Disclosed embodiments address this technical problem by at least one of normalizing categorical data or replacing missing values with supra-normal values.
  • Process 800 can then proceed to step 801. In step 801, system 100 (e.g., dataset generator 103) can retrieve a reference dataset from a database (e.g., database 105). The reference dataset can include categorical data. For example, the reference dataset can include spreadsheets or relational databases with categorical-valued data columns. As a further example, the reference dataset can include missing values, not-a-number values, or the like.
  • Process 800 can then proceed to step 803. In step 803, system 100 (e.g., dataset generator 103) can generate a normalized training dataset by normalizing the reference dataset. For example, system 100 can be configured to normalize categorical data contained in the reference dataset. In some embodiments, system 100 can be configured to normalize the categorical data by converting this data to numerical values. The numerical values can lie within a predetermined range. In some embodiments, the predetermined range can be zero to one. For example, given a column of categorical data including the days of the week, system 100 can be configured to map these days to values between zero and one. In some embodiments, system 100 can be configured to normalize numerical data in the reference dataset as well, mapping the values of the numerical data to a predetermined range.
  • Process 800 can then proceed to step 805. In step 805, system 100 (e.g., dataset generator 103) can generate the normalized training dataset by converting special values to values outside the predetermined range. For example, system 100 can be configured to assign missing values a first numerical value outside the predetermined range. As an additional example, system 100 can be configured to assign not-a-number values to a second numerical value outside the predetermined range. In some embodiments, the first value and the second value can differ. For example, system 100 can be configured to map the categorical values and the numerical values to the range of zero to one. In some embodiments, system 100 can then map missing values to the numerical value 1.5. In various embodiments, system 100 can then map not-a-number values to the numerical value of −0.5. In this manner system 100 can preserve information about the actual data while enabling training of the generative adversarial network.
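The normalization of steps 803 and 805 can be sketched as follows. This is a minimal illustration, assuming days-of-the-week categories and the example mappings given above (missing values to 1.5, not-a-number values to −0.5); the exact mapping scheme and value choices would vary by embodiment.

```python
# Sketch of steps 803-805: normalize categorical data into [0, 1] and map
# special values to supra-normal values outside that range. The category
# list and even spacing are illustrative assumptions.
import math

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def normalize_value(value, categories):
    """Map a categorical value into [0, 1]; route special values
    outside the predetermined range."""
    if value is None:                                    # missing -> 1.5
        return 1.5
    if isinstance(value, float) and math.isnan(value):   # NaN -> -0.5
        return -0.5
    # Evenly space the known categories across [0, 1].
    return categories.index(value) / (len(categories) - 1)

normalized = [normalize_value(v, DAYS)
              for v in ["Mon", None, "Sun", float("nan")]]
```

Because the special values fall outside [0, 1], a downstream model can distinguish them from legitimate normalized categories while still receiving purely numerical input.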
  • Process 800 can then proceed to step 807. In step 807, system 100 (e.g., dataset generator 103) can train the generative adversarial network using the normalized training dataset, consistent with disclosed embodiments.
  • FIG. 9 depicts a process 900 for training a generative adversarial network using a loss function configured to ensure a predetermined degree of similarity, consistent with disclosed embodiments. System 100 can be configured to use process 900 to generate synthetic data that is similar, but not too similar to the actual data, as the actual data can include sensitive personal information. For example, when the actual data includes social security numbers or account numbers, the synthetic data would preferably not simply recreate these numbers. Instead, system 100 would preferably create synthetic data that resembles the actual data, as described below, while reducing the likelihood of overlapping values. To address this technical problem, system 100 can be configured to determine a similarity metric value between the synthetic dataset and the normalized reference dataset, consistent with disclosed embodiments. System 100 can be configured to use the similarity metric value to update a loss function for training the generative adversarial network. In this manner, system 100 can be configured to determine a synthetic dataset differing in value from the normalized reference dataset at least a predetermined amount according to the similarity metric.
  • While described below with regard to training a synthetic data model, dataset generator 103 can be configured to use such trained synthetic data models to generate synthetic data (e.g., as described above with regards to FIGS. 2 and 3). For example, development instances (e.g., development instance 407) and production instances (e.g., production instance 413) can be configured to generate data similar to a reference dataset according to the disclosed systems and methods.
  • Process 900 can then proceed to step 901, which can resemble step 801. In step 901, system 100 (e.g., model optimizer 107, computational resources 101, or the like) can receive a reference dataset. In some embodiments, system 100 can be configured to receive the reference dataset from a database (e.g., database 105). The reference dataset can include categorical and/or numerical data. For example, the reference dataset can include spreadsheet or relational database data. In some embodiments, the reference dataset can include special values, such as missing values, not-a-number values, or the like.
  • Process 900 can then proceed to step 903. In step 903, system 100 (e.g., dataset generator 103, model optimizer 107, computational resources 101, or the like) can be configured to normalize the reference dataset. In some instances, system 100 can be configured to normalize the reference dataset as described above with regard to steps 803 and 805 of process 800. For example, system 100 can be configured to normalize the categorical data and/or the numerical data in the reference dataset to a predetermined range. In some embodiments, system 100 can be configured to replace special values with numerical values outside the predetermined range.
  • Process 900 can then proceed to step 905. In step 905, system 100 (e.g., model optimizer 107, computational resources 101, or the like) can generate a synthetic training dataset using the generative network. For example, system 100 can apply one or more random samples to the generative network to generate one or more synthetic data items. In some instances, system 100 can be configured to generate between 200 and 400,000 data items, or preferably between 20,000 and 40,000 data items.
  • Process 900 can then proceed to step 907. In step 907, system 100 (e.g., model optimizer 107, computational resources 101, or the like) can determine a similarity metric value using the normalized reference dataset and the synthetic training dataset. System 100 can be configured to generate the similarity metric value according to a similarity metric. In some aspects, the similarity metric value can include at least one of a statistical correlation score (e.g., a score dependent on the covariances or univariate distributions of the synthetic data and the normalized reference dataset), a data similarity score (e.g., a score dependent on a number of matching or similar elements in the synthetic dataset and normalized reference dataset), or data quality score (e.g., a score dependent on at least one of a number of duplicate elements in each of the synthetic dataset and normalized reference dataset, a prevalence of the most common value in each of the synthetic dataset and normalized reference dataset, a maximum difference of rare values in each of the synthetic dataset and normalized reference dataset, the differences in schema between the synthetic dataset and normalized reference dataset, or the like). System 100 can be configured to calculate these scores using the synthetic dataset and a reference dataset.
  • In some aspects, the similarity metric can depend on a covariance of the synthetic dataset and a covariance of the normalized reference dataset. For example, in some embodiments, system 100 can be configured to generate a difference matrix using a covariance matrix of the normalized reference dataset and a covariance matrix of the synthetic dataset. As a further example, the difference matrix can be the difference between the covariance matrix of the normalized reference dataset and the covariance matrix of the synthetic dataset. The similarity metric can depend on the difference matrix. In some aspects, the similarity metric can depend on the summation of the squared values of the difference matrix. This summation can be normalized, for example by the square root of the product of the number of rows and number of columns of the covariance matrix for the normalized reference dataset.
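The covariance-based metric described above can be sketched as follows, under the stated construction: a difference matrix between the two covariance matrices, the sum of its squared entries, and normalization by the square root of the product of the reference covariance matrix's dimensions. The function name and test data are illustrative.

```python
# Illustrative covariance-based similarity metric: difference matrix,
# sum of squared values, normalized by sqrt(rows * cols) of the
# reference covariance matrix.
import numpy as np

def covariance_similarity(reference, synthetic):
    # rowvar=False treats each column as a variable (dataset convention).
    cov_ref = np.cov(reference, rowvar=False)
    cov_syn = np.cov(synthetic, rowvar=False)
    diff = cov_ref - cov_syn                    # difference matrix
    total = np.sum(diff ** 2)                   # sum of squared values
    rows, cols = cov_ref.shape
    return total / np.sqrt(rows * cols)         # normalization

rng = np.random.default_rng(0)
ref = rng.normal(size=(100, 3))
score_same = covariance_similarity(ref, ref)    # identical data -> 0.0
```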
  • In some embodiments, the similarity metric can depend on a univariate value distribution of an element of the synthetic dataset and a univariate value distribution of an element of the normalized reference dataset. For example, for corresponding elements of the synthetic dataset and the normalized reference dataset, system 100 can be configured to generate histograms having the same bins. For each bin, system 100 can be configured to determine a difference between the value of the bin for the synthetic data histogram and the value of the bin for the normalized reference dataset histogram. In some embodiments, the values of the bins can be normalized by the total number of datapoints in the histograms. For each of the corresponding elements, system 100 can be configured to determine a value (e.g., a maximum difference, an average difference, a Euclidean distance, or the like) of these differences. In some embodiments, the similarity metric can depend on a function of this value (e.g., a maximum, average, or the like) across the common elements. For example, the normalized reference dataset can include multiple columns of data. The synthetic dataset can include corresponding columns of data. The normalized reference dataset and the synthetic dataset can include the same number of rows. System 100 can be configured to generate histograms for each column of data for each of the normalized reference dataset and the synthetic dataset. For each bin, system 100 can determine the difference between the count of datapoints in the normalized reference dataset histogram and the synthetic dataset histogram. System 100 can determine the value for this column to be the maximum of the differences for each bin. System 100 can determine the value for the similarity metric to be the average of the values for the columns. As would be appreciated by one of skill in the art, this example is not intended to be limiting.
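The column-wise histogram comparison in the example above can be sketched as follows: shared bins per column, bin counts normalized by the number of datapoints, the maximum per-bin difference taken per column, and the average of those maxima across columns. The bin count is an illustrative assumption.

```python
# Sketch of the univariate-distribution comparison: same-bin histograms
# per column, normalized counts, max per-bin difference per column,
# averaged across columns.
import numpy as np

def histogram_similarity(reference, synthetic, bins=10):
    per_column = []
    for col in range(reference.shape[1]):
        lo = min(reference[:, col].min(), synthetic[:, col].min())
        hi = max(reference[:, col].max(), synthetic[:, col].max())
        edges = np.linspace(lo, hi, bins + 1)   # same bins for both datasets
        h_ref, _ = np.histogram(reference[:, col], bins=edges)
        h_syn, _ = np.histogram(synthetic[:, col], bins=edges)
        # Normalize by the total number of datapoints in each histogram.
        diff = np.abs(h_ref / len(reference) - h_syn / len(synthetic))
        per_column.append(diff.max())           # max difference for column
    return float(np.mean(per_column))           # average across columns

rng = np.random.default_rng(1)
data = rng.normal(size=(500, 2))
identical_score = histogram_similarity(data, data)
```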
  • In various embodiments, the similarity metric can depend on a number of elements of the synthetic dataset that match elements of the reference dataset. In some embodiments, the matching can be an exact match, with the value of an element in the synthetic dataset matching the value of an element in the normalized reference dataset. As a nonlimiting example, when the normalized reference dataset includes a spreadsheet having rows and columns, and the synthetic dataset includes a spreadsheet having rows and corresponding columns, the similarity metric can depend on the number of rows of the synthetic dataset that have the same values as rows of the normalized reference dataset. In some embodiments, the normalized reference dataset and synthetic dataset can have duplicate rows removed prior to performing this comparison. System 100 can be configured to merge the non-duplicate normalized reference dataset and non-duplicate synthetic dataset by all columns. In this non-limiting example, the size of the resulting dataset will be the number of exactly matching rows. In some embodiments, system 100 can be configured to disregard columns that appear in one dataset but not the other when performing this comparison.
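The exact-match count in the non-limiting example above can be sketched with a merge on all shared columns; the column names and data are illustrative assumptions.

```python
# Sketch of the exact-match count: drop duplicate rows from each dataset,
# then inner-merge on all shared columns; the merged size is the number
# of exactly matching rows.
import pandas as pd

reference = pd.DataFrame({"a": [1, 1, 2, 3], "b": ["x", "x", "y", "z"]})
synthetic = pd.DataFrame({"a": [1, 4, 2], "b": ["x", "w", "y"]})

# Disregard columns appearing in only one dataset.
shared = [c for c in reference.columns if c in synthetic.columns]
ref_unique = reference[shared].drop_duplicates()
syn_unique = synthetic[shared].drop_duplicates()
matches = ref_unique.merge(syn_unique, on=shared)  # inner join on all columns
match_count = len(matches)
```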
  • In various embodiments, the similarity metric can depend on a number of elements of the synthetic dataset that are similar to elements of the normalized reference dataset. System 100 can be configured to calculate similarity between an element of the synthetic dataset and an element of the normalized reference dataset according to distance measure. In some embodiments, the distance measure can depend on a Euclidean distance between the elements. For example, when the synthetic dataset and the normalized reference dataset include rows and columns, the distance measure can depend on a Euclidean distance between a row of the synthetic dataset and a row of the normalized reference dataset. In various embodiments, when comparing a synthetic dataset to an actual dataset including categorical data (e.g., a reference dataset that has not been normalized), the distance measure can depend on a Euclidean distance between numerical row elements and a Hamming distance between non-numerical row elements. The Hamming distance can depend on a count of non-numerical elements differing between the row of the synthetic dataset and the row of the actual dataset. In some embodiments, the distance measure can be a weighted average of the Euclidean distance and the Hamming distance. In some embodiments, system 100 can be configured to disregard columns that appear in one dataset but not the other when performing this comparison. In various embodiments, system 100 can be configured to remove duplicate entries from the synthetic dataset and the normalized reference dataset before performing the comparison.
  • In some embodiments, system 100 can be configured to calculate a distance measure between each row of the synthetic dataset (or a subset of the rows of the synthetic dataset) and each row of the normalized reference dataset (or a subset of the rows of the normalized reference dataset). System 100 can then determine the minimum distance value for each row of the synthetic dataset across all rows of the normalized reference dataset. In some embodiments, the similarity metric can depend on a function of the minimum distance values for all rows of the synthetic dataset (e.g., a maximum value, an average value, or the like).
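The row-distance computation of the preceding two paragraphs can be sketched as follows: a weighted average of a Euclidean distance over numeric row elements and a Hamming distance over non-numeric elements, with the minimum taken for each synthetic row across all reference rows. The equal 0.5/0.5 weighting and the toy rows are illustrative assumptions.

```python
# Sketch of the row-distance comparison: weighted Euclidean + Hamming
# distance per row pair, minimum per synthetic row.
import numpy as np

def row_distance(syn_row, ref_row, numeric_idx, weight=0.5):
    euclid = np.linalg.norm(
        np.array([syn_row[i] for i in numeric_idx], dtype=float)
        - np.array([ref_row[i] for i in numeric_idx], dtype=float))
    other = [i for i in range(len(syn_row)) if i not in numeric_idx]
    # Hamming distance: count of differing non-numerical elements.
    hamming = sum(syn_row[i] != ref_row[i] for i in other)
    return weight * euclid + (1 - weight) * hamming

def min_distances(synthetic, reference, numeric_idx):
    return [min(row_distance(s, r, numeric_idx) for r in reference)
            for s in synthetic]

reference = [(0.0, "a"), (1.0, "b")]
synthetic = [(0.0, "a"), (2.0, "b")]
dists = min_distances(synthetic, reference, numeric_idx=[0])
```

The similarity metric could then depend on, for example, the maximum or average of `dists`.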
  • In some embodiments, the similarity metric can depend on a frequency of duplicate elements in the synthetic dataset and the normalized reference dataset. In some aspects, system 100 can be configured to determine the number of duplicate elements in each of the synthetic dataset and the normalized reference dataset. In various aspects, system 100 can be configured to determine the proportion of each dataset represented by at least some of the elements in each dataset. For example, system 100 can be configured to determine the proportion of the synthetic dataset having a particular value. In some aspects, this value may be the most frequent value in the synthetic dataset. System 100 can be configured to similarly determine the proportion of the normalized reference dataset having a particular value (e.g., the most frequent value in the normalized reference dataset).
  • In some embodiments, the similarity metric can depend on a relative prevalence of rare values in the synthetic and normalized reference dataset. In some aspects, such rare values can be those present in a dataset with frequencies less than a predetermined threshold. In some embodiments, the predetermined threshold can be a value less than 20%, for example 10%. System 100 can be configured to determine a prevalence of rare values in the synthetic and normalized reference dataset. For example, system 100 can be configured to determine counts of the rare values in a dataset and the total number of elements in the dataset. System 100 can then determine ratios of the counts of the rare values to the total number of elements in the datasets.
  • In some embodiments, the similarity metric can depend on differences in the ratios between the synthetic dataset and the normalized reference dataset. As a non-limiting example, an exemplary dataset can be an access log for patient medical records that tracks the job title of the employee accessing a patient medical record. The job title "Administrator" may be a rare value of job title and appear in 3% of the log entries. System 100 can be configured to generate synthetic log data based on the actual dataset, but the job title "Administrator" may not appear in the synthetic log data. The similarity metric can depend on the difference between the actual dataset prevalence (3%) and the synthetic log data prevalence (0%). As an alternative example, the job title "Administrator" may be overrepresented in the synthetic log data, appearing in 15% of the log entries (and therefore not a rare value in the synthetic log data when the predetermined threshold is 10%). In this example, the similarity metric can depend on the difference between the actual dataset prevalence (3%) and the synthetic log data prevalence (15%).
  • In various embodiments, the similarity metric can depend on a function of the differences in the ratios between the synthetic dataset and the normalized reference dataset. For example, the actual dataset may include 10 rare values with a prevalence under 10% of the dataset. The differences between the prevalence of these 10 rare values in the actual dataset and their prevalence in the synthetic dataset can range from −5% to 4%. In some embodiments, the similarity metric can depend on the greatest magnitude difference (e.g., the similarity metric could depend on the value −5% as the greatest magnitude difference). In various embodiments, the similarity metric can depend on the average of the magnitude differences, the Euclidean norm of the ratio differences, or the like.
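The rare-value prevalence comparison of the preceding paragraphs can be sketched as follows, using the 10% threshold and the "Administrator" example from the text; the function names and the choice of taking the maximum-magnitude difference are illustrative assumptions.

```python
# Sketch of the rare-value prevalence comparison: values below a
# predetermined frequency threshold in the reference dataset are rare;
# the metric compares their prevalence ratios across the two datasets.
from collections import Counter

def prevalences(values):
    counts = Counter(values)
    return {v: c / len(values) for v, c in counts.items()}

def rare_value_differences(reference, synthetic, threshold=0.10):
    ref_prev = prevalences(reference)
    syn_prev = prevalences(synthetic)
    rare = [v for v, p in ref_prev.items() if p < threshold]
    return {v: ref_prev[v] - syn_prev.get(v, 0.0) for v in rare}

# "Administrator" appears in 3% of reference entries and 0% of synthetic.
reference = ["Nurse"] * 97 + ["Administrator"] * 3
synthetic = ["Nurse"] * 100
diffs = rare_value_differences(reference, synthetic)
max_magnitude = max(abs(d) for d in diffs.values())
```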
  • In various embodiments, the similarity metric can depend on a difference in schemas between the synthetic dataset and the normalized reference dataset. For example, when the synthetic dataset includes spreadsheet data, system 100 can be configured to determine a number of mismatched columns between the synthetic and normalized reference datasets, a number of mismatched column types between the synthetic and normalized reference datasets, a number of mismatched column categories between the synthetic and normalized reference datasets, and a number of mismatched numeric ranges between the synthetic and normalized reference datasets. The value of the similarity metric can depend on the number of at least one of the mismatched columns, mismatched column types, mismatched column categories, or mismatched numeric ranges.
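Two of the schema checks above (mismatched columns and mismatched column types) can be sketched as follows; the DataFrames and the decision to sum the two counts into one score are illustrative assumptions.

```python
# Sketch of the schema comparison: count columns present in only one
# dataset and shared columns whose types disagree.
import pandas as pd

reference = pd.DataFrame({"age": [25, 40], "city": ["NY", "LA"]})
synthetic = pd.DataFrame({"age": [31.5, 29.0], "state": ["CA", "TX"]})

ref_cols, syn_cols = set(reference.columns), set(synthetic.columns)
mismatched_columns = len(ref_cols ^ syn_cols)     # in one but not the other
shared = ref_cols & syn_cols
mismatched_types = sum(
    reference[c].dtype != synthetic[c].dtype for c in shared)
schema_score = mismatched_columns + mismatched_types
```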
  • In some embodiments, the similarity metric can depend on one or more of the above criteria. For example, the similarity metric can depend on one or more of (1) a covariance of the output data and a covariance of the normalized reference dataset, (2) a univariate value distribution of an element of the synthetic dataset, (3) a univariate value distribution of an element of the normalized reference dataset, (4) a number of elements of the synthetic dataset that match elements of the reference dataset, (5) a number of elements of the synthetic dataset that are similar to elements of the normalized reference dataset, (6) a distance measure between each row of the synthetic dataset (or a subset of the rows of the synthetic dataset) and each row of the normalized reference dataset (or a subset of the rows of the normalized reference dataset), (7) a frequency of duplicate elements in the synthetic dataset and the normalized reference dataset, (8) a relative prevalence of rare values in the synthetic and normalized reference dataset, and (9) differences in the ratios between the synthetic dataset and the normalized reference dataset.
  • System 100 can compare a synthetic dataset to a normalized reference dataset, compare a synthetic dataset to an actual (unnormalized) dataset, or compare two datasets to each other according to a similarity metric, consistent with disclosed embodiments. For example, in some embodiments, model optimizer 107 can be configured to perform such comparisons. In various embodiments, model storage 105 can be configured to store similarity metric information (e.g., similarity values, indications of comparison datasets, and the like) together with a synthetic dataset.
  • Process 900 can then proceed to step 909. In step 909, system 100 (e.g., model optimizer 107, computational resources 101, or the like) can train the generative adversarial network using the similarity metric value. In some embodiments, system 100 can be configured to determine that the synthetic dataset satisfies a similarity criterion. The similarity criterion can concern at least one of the similarity metrics described above. For example, the similarity criterion can concern at least one of a statistical correlation score between the synthetic dataset and the normalized reference dataset, a data similarity score between the synthetic dataset and the reference dataset, or a data quality score for the synthetic dataset.
  • In some embodiments, synthetic data satisfying the similarity criterion can be too similar to the reference dataset. System 100 can be configured to update a loss function for training the generative adversarial network to decrease the similarity between the reference dataset and synthetic datasets generated by the generative adversarial network when the similarity criterion is satisfied. In particular, the loss function of the generative adversarial network can be configured to penalize generation of synthetic data that is too similar to the normalized reference dataset, up to a certain threshold. To that end, a penalty term can be added to the loss function of the generative adversarial network. This term can penalize the calculated loss if the dissimilarity between the synthetic data and the actual data goes below a certain threshold. In some aspects, this penalty term can thereby ensure that the value of the similarity metric exceeds some similarity threshold, or remains near the similarity threshold (e.g., the value of the similarity metric may exceed 90% of the value of the similarity threshold). In this non-limiting example, decreasing values of the similarity metric can indicate increasing similarity. System 100 can then update the loss function such that the likelihood of generating synthetic data like the current synthetic data is reduced. In this manner, system 100 can train the generative adversarial network using a loss function that penalizes generation of data differing from the reference dataset by less than the predetermined amount.
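The penalty term described above can be sketched as follows. This is a minimal illustration, assuming a hinge-style penalty and an arbitrary penalty weight; here lower similarity metric values indicate greater similarity, as in the non-limiting example above, so the loss is penalized when the metric value falls below the similarity threshold.

```python
# Sketch of the penalized generator loss: add a penalty when the
# similarity metric value (lower = more similar) drops below a
# similarity threshold. The hinge form and weight are illustrative.
def penalized_loss(base_loss, similarity_value, similarity_threshold,
                   penalty_weight=10.0):
    """Penalize the loss when synthetic data is too similar to the
    reference data (metric value below the threshold)."""
    shortfall = max(0.0, similarity_threshold - similarity_value)
    return base_loss + penalty_weight * shortfall

# Too similar: metric value below threshold, so the loss is penalized.
too_similar = penalized_loss(1.0, similarity_value=0.2,
                             similarity_threshold=0.5)
# Sufficiently different: no penalty applied.
ok = penalized_loss(1.0, similarity_value=0.8, similarity_threshold=0.5)
```

Training against this penalized loss reduces the likelihood of generating synthetic data that differs from the reference dataset by less than the predetermined amount.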
  • FIG. 10 depicts a process 1000 for supplementing or transforming datasets using code-space operations, consistent with disclosed embodiments. Process 1000 can include the steps of generating encoder and decoder models that map between a code space and a sample space, identifying representative points in code space, generating a difference vector in code space, and generating extreme points or transforming a dataset using the difference vector. In this manner, process 1000 can support model validation and simulation of conditions differing from those present during generation of a training dataset. For example, while existing systems and methods may train models using datasets representative of typical operating conditions, process 1000 can support model validation by inferring datapoints that occur infrequently or outside typical operating conditions. As an additional example, a training dataset can include operations and interactions typical of a first user population. Process 1000 can support simulation of operations and interactions typical of a second user population that differs from the first user population. To continue this example, a young user population may interact with a system. Process 1000 can support generation of a synthetic training dataset representative of an older user population interacting with the system. This synthetic training dataset can be used to simulate performance of the system with an older user population, before developing that userbase.
  • After starting, process 1000 can proceed to step 1001. In step 1001, system 100 can generate an encoder model and a decoder model. Consistent with disclosed embodiments, system 100 can be configured to generate an encoder model and decoder model using an adversarially learned inference model, as disclosed in "Adversarially Learned Inference" by Vincent Dumoulin, et al. According to the adversarially learned inference model, an encoder maps from a sample space to a code space and a decoder maps from a code space to a sample space. The encoder and decoder are trained by either selecting a code and generating a sample using the decoder, or by selecting a sample and generating a code using the encoder. The resulting pairs of code and sample are provided to a discriminator model, which is trained to determine whether the pairs of code and sample came from the encoder or decoder. The encoder and decoder can be updated based on whether the discriminator correctly determined the origin of the samples. Thus, the encoder and decoder can be trained to fool the discriminator. When appropriately trained, the joint distributions of code and sample for the encoder and decoder match. As would be appreciated by one of skill in the art, other techniques of generating a mapping from a code space to a sample space may also be used. For example, a generative adversarial network can be used to learn a mapping from the code space to the sample space.
  • Process 1000 can then proceed to step 1003. In step 1003, system 100 can identify representative points in the code space. System 100 can identify representative points in the code space by identifying points in the sample space, mapping the identified points into code space, and determining the representative points based on the mapped points, consistent with disclosed embodiments. In some embodiments, the identified points in the sample space can be elements of a dataset (e.g., an actual dataset or a synthetic dataset generated using an actual dataset).
  • System 100 can identify points in the sample space based on sample space characteristics. For example, when the sample space includes financial account information, system 100 can be configured to identify one or more first accounts belonging to users in their 20s and one or more second accounts belonging to users in their 40s.
  • Consistent with disclosed embodiments, identifying representative points in the code space can include a step of mapping the one or more first points in the sample space and the one or more second points in the sample space to corresponding points in the code space. In some embodiments, the one or more first points and one or more second points can be part of a dataset. For example, the one or more first points and one or more second points can be part of an actual dataset or a synthetic dataset generated using an actual dataset.
  • System 100 can be configured to select first and second representative points in the code space based on the mapped one or more first points and the mapped one or more second points. As shown in FIG. 11A, when the one or more first points include a single point, the mapping of this single point to the code space (e.g., point 1101) can be a first representative point in code space 1100. Likewise, when the one or more second points include a single point, the mapping of this single point to the code space (e.g., point 1103) can be a second representative point in code space 1100.
  • As shown in FIG. 11B, when the one or more first points include multiple points, system 100 can be configured to determine a first representative point in code space 1110. In some embodiments, system 100 can be configured to determine the first representative point based on the locations of the mapped one or more first points in the code space. In some embodiments, the first representative point can be a centroid or a medoid of the mapped one or more first points. Likewise, system 100 can be configured to determine the second representative point based on the locations of the mapped one or more second points in the code space. In some embodiments, the second representative point can be a centroid or a medoid of the mapped one or more second points. For example, system 100 can be configured to identify point 1113 as the first representative point based on the locations of mapped points 1111 a and 1111 b. Likewise, system 100 can be configured to identify point 1117 as the second representative point based on the locations of mapped points 1115 a and 1115 b.
  • In some embodiments, the code space can include a subset of Rn. System 100 can be configured to map a dataset to the code space using the encoder. System 100 can then identify the coordinates of the points with respect to a basis vector in Rn (e.g., one of the vectors of the identity matrix). System 100 can be configured to identify a first point with a minimum coordinate value with respect to the basis vector and a second point with a maximum coordinate value with respect to the basis vector. System 100 can be configured to identify these points as the first and second representative points. For example, taking the identity matrix as the basis, system 100 can be configured to select as the first point the point with the lowest value of the first element of the vector. To continue this example, system 100 can be configured to select as the second point the point with the highest value of the first element of the vector. In some embodiments, system 100 can be configured to repeat process 1000 for each vector in the basis.
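The two ways of choosing representative points described above (the centroid variant of FIG. 11B and the basis-coordinate variant) can be sketched as follows; the toy 2-D code space, the grouping, and the use of the identity-matrix basis are illustrative assumptions.

```python
# Sketch of step 1003: representative points in code space, as centroids
# of mapped point groups, and as min/max-coordinate points along a basis
# vector of the identity matrix.
import numpy as np

# Mapped points for two groups (e.g., two user populations).
first_group = np.array([[0.0, 0.0], [2.0, 2.0]])
second_group = np.array([[4.0, 4.0], [6.0, 6.0]])

# Centroid variant: mean of each group's mapped points.
first_rep = first_group.mean(axis=0)
second_rep = second_group.mean(axis=0)

# Basis-coordinate variant: points with minimum and maximum coordinate
# value with respect to one basis vector.
codes = np.vstack([first_group, second_group])
basis_axis = 0                               # first identity basis vector
min_point = codes[codes[:, basis_axis].argmin()]
max_point = codes[codes[:, basis_axis].argmax()]
```

A medoid (the group member closest to the centroid) could be substituted for the centroid where a representative point must be an actual mapped datapoint.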
  • Process 1000 can then proceed to step 1005. In step 1005, system 100 can determine a difference vector connecting the first representative point and the second representative point. For example, as shown in FIG. 11A, system 100 can be configured to determine a vector 1105 from first representative point 1101 to second representative point 1103. Likewise, as shown in FIG. 11B, system 100 can be configured to determine a vector 1119 from first representative point 1113 to second representative point 1117.
  • Process 1000 can then proceed to step 1007. In step 1007, as depicted in FIG. 12A, system 100 can generate extreme codes. Consistent with disclosed embodiments, system 100 can be configured to generate extreme codes by sampling the code space (e.g., code space 1200) along an extension (e.g., extension 1201) of the vector connecting the first representative point and the second representative point (e.g., vector 1105). In this manner, system 100 can generate a code extreme with respect to the first representative point and the second representative point (e.g. extreme point 1203).
  • Process 1000 can then proceed to step 1009. In step 1009, as depicted in FIG. 12A, system 100 can generate extreme samples. Consistent with disclosed embodiments, system 100 can be configured to generate extreme samples by converting the extreme code into the sample space using the decoder trained in step 1001. For example, system 100 can be configured to convert extreme point 1203 into a corresponding datapoint in the sample space.
  • Process 1000 can then proceed to step 1011. In step 1011, as depicted in FIG. 12B, system 100 can translate a dataset using the difference vector determined in step 1005 (e.g., difference vector 1105). In some aspects, system 100 can be configured to convert the dataset from sample space to code space using the encoder trained in step 1001. System 100 can be configured to then translate the elements of the dataset in code space using the difference vector. In some aspects, system 100 can be configured to translate the elements of the dataset using the vector and a scaling factor. In some aspects, the scaling factor can be less than one. In various aspects, the scaling factor can be greater than or equal to one. For example, as shown in FIG. 12B, the elements of the dataset can be translated in code space 1210 by the product of the difference vector and the scaling factor (e.g., original point 1211 can be translated by translation 1212 to translated point 1213).
  • Process 1000 can then proceed to step 1013. In step 1013, as depicted in FIG. 12B, system 100 can generate a translated dataset. Consistent with disclosed embodiments, system 100 can be configured to generate the translated dataset by converting the translated points into the sample space using the decoder trained in step 1001. For example, system 100 can be configured to convert translated point 1213 into a corresponding datapoint in the sample space.
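The code-space operations of steps 1005 through 1013 can be sketched as follows. The decoder call back to sample space is omitted, and the extension and scaling factors are illustrative assumptions.

```python
# Sketch of steps 1005-1011: difference vector between representative
# points, an extreme code sampled along its extension, and translation
# of dataset elements in code space by the scaled difference vector.
import numpy as np

first_rep = np.array([1.0, 1.0])
second_rep = np.array([5.0, 5.0])

# Step 1005: difference vector from first to second representative point.
difference = second_rep - first_rep

# Step 1007: extreme code along the extension of the difference vector.
extension_factor = 1.5            # >1 extends past the second point
extreme_code = first_rep + extension_factor * difference

# Step 1011: translate encoded dataset elements by the scaled vector.
scaling_factor = 0.5
original_codes = np.array([[0.0, 0.0], [2.0, 2.0]])
translated_codes = original_codes + scaling_factor * difference
```

In step 1009 and step 1013, `extreme_code` and `translated_codes` would be passed through the trained decoder to obtain extreme samples and the translated dataset in sample space.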
  • FIG. 13 depicts an exemplary cloud computing system 1300 for generating a synthetic data stream that tracks a reference data stream. The flow rate of the synthetic data can resemble the flow rate of the reference data stream, as system 1300 can generate synthetic data in response to receiving reference data stream data. System 1300 can include a streaming data source 1301, model optimizer 1303, computing resource 1304, model storage 1305, dataset generator 1307, and synthetic data source 1309. System 1300 can be configured to generate a new synthetic data model using actual data received from streaming data source 1301. Streaming data source 1301, model optimizer 1303, computing resources 1304, and model storage 1305 can interact to generate the new synthetic data model, consistent with disclosed embodiments. In some embodiments, system 1300 can be configured to generate the new synthetic data model while also generating synthetic data using a current synthetic data model.
  • Streaming data source 1301 can be configured to retrieve new data elements from a database, a file, a datasource, a topic in a data streaming platform (e.g., IBM STREAMS), a topic in a distributed messaging system (e.g., APACHE KAFKA), or the like. In some aspects, streaming data source 1301 can be configured to retrieve new elements in response to a request from model optimizer 1303. In some aspects, streaming data source 1301 can be configured to retrieve new data elements in real-time. For example, streaming data source 1301 can be configured to retrieve log data, as that log data is created. In various aspects, streaming data source 1301 can be configured to retrieve batches of new data. For example, streaming data source 1301 can be configured to periodically retrieve all log data created within a certain period (e.g., a five-minute interval). In some embodiments, the data can be application logs. The application logs can include event information, such as debugging information, transaction information, user information, user action information, audit information, service information, operation tracking information, process monitoring information, or the like. In some embodiments, the data can be JSON data (e.g., JSON application logs).
  • System 1300 can be configured to generate a new synthetic data model, consistent with disclosed embodiments. Model optimizer 1303 can be configured to provision computing resources 1304 with a data model, consistent with disclosed embodiments. In some aspects, computing resources 1304 can resemble computing resources 101, described above with regard to FIG. 1. For example, computing resources 1304 can provide similar functionality and can be similarly implemented. The data model can be a synthetic data model. The data model can be a current data model configured to generate data similar to recently received data in the reference data stream. The data model can be received from model storage 1305. For example, model optimizer 1303 can be configured to provide instructions to computing resources 1304 to retrieve a current data model of the reference data stream from model storage 1305. In some embodiments, the synthetic data model can include a recurrent neural network, a kernel density estimator, or a generative adversarial network.
  • Computing resources 1304 can be configured to train the new synthetic data model using reference data stream data. In some embodiments, system 1300 (e.g., computing resources 1304 or model optimizer 1303) can be configured to incorporate reference data stream data into the training data as it is received from streaming data source 1301. The training data can therefore reflect the current characteristics of the reference data stream (e.g., the current values, current schema, current statistical properties, and the like). In some aspects, system 1300 (e.g., computing resources 1304 or model optimizer 1303) can be configured to store reference data stream data received from streaming data source 1301 for subsequent use as training data. In some embodiments, computing resources 1304 may have received the stored reference data stream data prior to beginning training of the new synthetic data model. As an additional example, computing resources 1304 (or another component of system 1300) can be configured to gather data from streaming data source 1301 during a first time-interval (e.g., the prior repeat) and use this gathered data to train a new synthetic model in a subsequent time-interval (e.g., the current repeat). In various embodiments, computing resources 1304 can be configured to use the stored reference data stream data for training the new synthetic data model. In various embodiments, the training data can include both newly-received and stored data. When the synthetic data model is a Generative Adversarial Network, computing resources 1304 can be configured to train the new synthetic data model, in some embodiments, as described above with regard to FIGS. 8 and 9. Alternatively, computing resources 1304 can be configured to train the new synthetic data model according to known methods.
  • Model optimizer 1303 can be configured to evaluate performance criteria of a newly created synthetic data model. In some embodiments, the performance criteria can include a similarity metric (e.g., a statistical correlation score, data similarity score, or data quality score, as described herein). For example, model optimizer 1303 can be configured to compare the covariances or univariate distributions of a synthetic dataset generated by the new synthetic data model and a reference data stream dataset. Likewise, model optimizer 1303 can be configured to evaluate the number of matching or similar elements in the synthetic dataset and reference data stream dataset. Furthermore, model optimizer 1303 can be configured to evaluate a number of duplicate elements in each of the synthetic dataset and reference data stream dataset, a prevalence of the most common value in the synthetic dataset and reference data stream dataset, a maximum difference of rare values in each of the synthetic dataset and reference data stream dataset, differences in schema between the synthetic dataset and reference data stream dataset, and the like.
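As an illustrative sketch (not the claimed implementation), a similarity metric comparing the covariances and univariate means of a synthetic dataset against a reference dataset could be computed as follows; the score names are hypothetical:

```python
import numpy as np

def similarity_metrics(synthetic, reference):
    """Compare two 2-D numeric datasets (rows x columns).

    Returns simple scores: the Frobenius norm of the difference between
    the covariance matrices, and the largest per-column difference in
    means. Smaller values indicate closer agreement.
    """
    synthetic = np.asarray(synthetic, dtype=float)
    reference = np.asarray(reference, dtype=float)
    cov_diff = np.linalg.norm(np.cov(synthetic.T) - np.cov(reference.T))
    mean_diff = np.abs(synthetic.mean(axis=0) - reference.mean(axis=0)).max()
    return {"covariance_difference": cov_diff, "max_mean_difference": mean_diff}

rng = np.random.default_rng(0)
reference = rng.normal(size=(500, 3))
synthetic = rng.normal(size=(500, 3))   # drawn from the same distribution
scores = similarity_metrics(synthetic, reference)
```

Because both toy datasets are drawn from the same distribution, both scores come out close to zero; a poorly fitted synthetic data model would produce larger values.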
  • In various embodiments, the performance criteria can include prediction metrics. The prediction metrics can enable a user to determine whether data models perform similarly for both synthetic and actual data. The prediction metrics can include a prediction accuracy check, a prediction accuracy cross check, a regression check, a regression cross check, and a principal component analysis check. In some aspects, a prediction accuracy check can determine the accuracy of predictions made by a model (e.g., recurrent neural network, kernel density estimator, or the like) given a dataset. For example, the prediction accuracy check can receive an indication of the model, a set of data, and a set of corresponding labels. The prediction accuracy check can return an accuracy of the model in predicting the labels given the data. Similar model performance for the synthetic and original data can indicate that the synthetic data preserves the latent feature structure of the original data. In various aspects, a prediction accuracy cross check can calculate the accuracy of a predictive model that is trained on synthetic data and tested on the original data used to generate the synthetic data. In some aspects, a regression check can regress a numerical column in a dataset against other columns in the dataset, determining the predictability of the numerical column given the other columns. In some aspects, a regression error cross check can determine a regression formula for a numerical column of the synthetic data and then evaluate the predictive ability of the regression formula for the numerical column of the actual data. In various aspects, a principal component analysis check can determine a number of principal component analysis columns sufficient to capture a predetermined amount of the variance in the dataset. Similar numbers of principal component analysis columns can indicate that the synthetic data preserves the latent feature structure of the original data.
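A minimal sketch of the regression check and regression error cross check described above, assuming toy synthetic and actual datasets that share the same latent linear relationship (the column layout and noise level are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy datasets with a shared latent relationship:
# column 2 is roughly 2*col0 - col1 plus noise.
def make_dataset(n):
    x = rng.normal(size=(n, 2))
    y = 2 * x[:, 0] - x[:, 1] + 0.1 * rng.normal(size=n)
    return np.column_stack([x, y])

synthetic = make_dataset(200)
actual = make_dataset(200)

# Regression check: regress the numerical column (index 2) of the
# synthetic data against the other columns.
coef, *_ = np.linalg.lstsq(synthetic[:, :2], synthetic[:, 2], rcond=None)

# Regression error cross check: evaluate the synthetic-data regression
# formula on the actual data.
pred = actual[:, :2] @ coef
rmse = np.sqrt(np.mean((pred - actual[:, 2]) ** 2))
```

A small cross-check RMSE here suggests the synthetic data preserves the predictive relationship between columns found in the actual data.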
  • Model optimizer 1303 can be configured to store the newly created synthetic data model and metadata for the new synthetic data model in model storage 1305 based on the evaluated performance criteria, consistent with disclosed embodiments. For example, model optimizer 1303 can be configured to store the metadata and new data model in model storage when a value of a similarity metric or a prediction metric satisfies a predetermined threshold. In some embodiments, the metadata can include at least one value of a similarity metric or prediction metric. In various embodiments, the metadata can include an indication of the origin of the new synthetic data model, the data used to generate the new synthetic data model, when the new synthetic data model was generated, and the like.
  • System 1300 can be configured to generate synthetic data using a current data model. In some embodiments, this generation can occur while system 1300 is training a new synthetic data model. Model optimizer 1303, model storage 1305, dataset generator 1307, and synthetic data source 1309 can interact to generate the synthetic data, consistent with disclosed embodiments.
  • Model optimizer 1303 can be configured to receive a request for a synthetic data stream from an interface (e.g., interface 113 or the like). In some aspects, model optimizer 1303 can resemble model optimizer 107, described above with regard to FIG. 1. For example, model optimizer 1303 can provide similar functionality and can be similarly implemented. In some aspects, requests received from the interface can indicate a reference data stream. For example, such a request can identify streaming data source 1301 and/or specify a topic or subject (e.g., a Kafka topic or the like). In response to the request, model optimizer 1303 (or another component of system 1300) can be configured to direct generation of a synthetic data stream that tracks the reference data stream, consistent with disclosed embodiments.
  • Dataset generator 1307 can be configured to retrieve a current data model of the reference data stream from model storage 1305. In some embodiments, dataset generator 1307 can resemble dataset generator 103, described above with regard to FIG. 1. For example, dataset generator 1307 can provide similar functionality and can be similarly implemented. Likewise, in some embodiments, model storage 1305 can resemble model storage 105, described above with regard to FIG. 1. For example, model storage 1305 can provide similar functionality and can be similarly implemented. In some embodiments, the current data model can resemble data received from streaming data source 1301 according to a similarity metric (e.g., a statistical correlation score, data similarity score, or data quality score, as described herein). In various embodiments, the current data model can resemble data received during a time interval extending to the present (e.g. the present hour, the present day, the present week, or the like). In various embodiments, the current data model can resemble data received during a prior time interval (e.g. the previous hour, yesterday, last week, or the like). In some embodiments, the current data model can be the most recently trained data model of the reference data stream.
  • Dataset generator 1307 can be configured to generate a synthetic data stream using the current data model of the reference data stream. In some embodiments, dataset generator 1307 can be configured to generate the synthetic data stream by replacing sensitive portions of the reference data stream with synthetic data, as described in FIGS. 5A and 5B. In various embodiments, dataset generator 1307 can be configured to generate the synthetic data stream without reference to the reference data stream data. For example, when the current data model is a recurrent neural network, dataset generator 1307 can be configured to initialize the recurrent neural network with a value string (e.g., a random sequence of characters), predict a new value based on the value string, and then add the new value to the end of the value string. Dataset generator 1307 can then predict the next value using the updated value string that includes the new value. In some embodiments, rather than selecting the most likely new value, dataset generator 1307 can be configured to probabilistically choose a new value. As a nonlimiting example, when the existing value string is “examin” dataset generator 1307 can be configured to select the next value as “e” with a first probability and select the next value as “a” with a second probability. As an additional example, when the current data model is a generative adversarial network or an adversarially learned inference network, dataset generator 1307 can be configured to generate the synthetic data by selecting samples from a code space, as described herein.
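The probabilistic next-value selection described above can be sketched as follows; the `next_char_probs` table is a hypothetical stand-in for a trained recurrent neural network's output distribution:

```python
import random

# Toy next-character distribution standing in for a trained recurrent
# neural network: given the current value string, return candidate next
# characters and their probabilities. (A real model would compute these
# from the full value string.)
def next_char_probs(value_string):
    if value_string.endswith("examin"):
        return {"e": 0.7, "a": 0.3}   # e.g., "examine" vs. "examination"
    return {"x": 1.0}

def generate(seed_string, length, rng):
    s = seed_string
    for _ in range(length):
        probs = next_char_probs(s)
        chars, weights = zip(*probs.items())
        # Probabilistically choose the next value rather than always
        # taking the most likely one, then append it to the string.
        s += rng.choices(chars, weights=weights)[0]
    return s

rng = random.Random(42)
result = generate("examin", 1, rng)
```

With the probabilities above, `result` is "examine" about 70% of the time and "examina" otherwise.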
  • In some embodiments, dataset generator 1307 can be configured to generate an amount of synthetic data equal to the amount of actual data retrieved from streaming data source 1301. In some aspects, the rate of synthetic data generation can match the rate of actual data generation. As a nonlimiting example, when streaming data source 1301 retrieves a batch of 10 samples of actual data, dataset generator 1307 can be configured to generate a batch of 10 samples of synthetic data. As a further nonlimiting example, when streaming data source 1301 retrieves a batch of actual data every 10 minutes, dataset generator 1307 can be configured to generate a batch of synthetic data every 10 minutes. In this manner, system 1300 can be configured to generate synthetic data similar in both content and temporal characteristics to the reference data stream data.
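A minimal sketch of batch-size matching, assuming a stand-in sample generator in place of the current data model:

```python
def track_reference_stream(reference_batches, generate_sample):
    """Yield one synthetic batch per reference batch, matching its size."""
    for batch in reference_batches:
        yield [generate_sample() for _ in batch]

# Stand-in generator; a real deployment would sample the current data model.
counter = iter(range(1000))
synthetic_batches = list(track_reference_stream(
    [["a"] * 10, ["b"] * 3], lambda: next(counter)))
```

Because one synthetic batch is emitted per reference batch, the temporal characteristics of the reference stream carry over to the synthetic stream.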
  • In various embodiments, dataset generator 1307 can be configured to provide synthetic data generated using the current data model to synthetic data source 1309. In some embodiments, synthetic data source 1309 can be configured to provide the synthetic data received from dataset generator 1307 to a database, a file, a datasource, a topic in a data streaming platform (e.g., IBM STREAMS), a topic in a distributed messaging system (e.g., APACHE KAFKA), or the like.
  • As discussed above, system 1300 can be configured to track the reference data stream by repeatedly switching data models of the reference data stream. In some embodiments, dataset generator 1307 can be configured to switch between synthetic data models at a predetermined time, or upon expiration of a time interval. For example, dataset generator 1307 can be configured to switch from an old model to a current model every hour, day, week, or the like. In various embodiments, system 1300 can detect when a data schema of the reference data stream changes and switch to a current data model configured to provide synthetic data with the current schema. Consistent with disclosed embodiments, switching between synthetic data models can include dataset generator 1307 retrieving a current model from model storage 1305 and computing resources 1304 providing a new synthetic data model for storage in model storage 1305. In some aspects, computing resources 1304 can update the current synthetic data model with the new synthetic data model and then dataset generator 1307 can retrieve the updated current synthetic data model. In various aspects, dataset generator 1307 can retrieve the current synthetic data model and then computing resources 1304 can update the current synthetic data model with the new synthetic data model. In some embodiments, model optimizer 1303 can provision computing resources 1304 with a synthetic data model for training using a new set of training data. In various embodiments, computing resources 1304 can be configured to continue updating the new synthetic data model. In this manner, a repeat of the switching process can include generation of a new synthetic data model and the replacement of a current synthetic data model by this new synthetic data model.
  • FIG. 14 depicts a process 1400 for generating synthetic JSON log data using the cloud computing system of FIG. 13. Process 1400 can include the steps of retrieving reference JSON log data, training a recurrent neural network to generate synthetic data resembling the reference JSON log data, generating the synthetic JSON log data using the recurrent neural network, and validating the synthetic JSON log data. In this manner, system 1300 can use process 1400 to generate synthetic JSON log data that resembles actual JSON log data.
  • After starting, process 1400 can proceed to step 1401. In step 1401, substantially as described above with regard to FIG. 13, streaming data source 1301 can be configured to retrieve the JSON log data from a database, a file, a datasource, a topic in a distributed messaging system such as Apache Kafka, or the like. The JSON log data can be retrieved in response to a request from model optimizer 1303. The JSON log data can be retrieved in real-time, or periodically (e.g., approximately every five minutes).
  • Process 1400 can then proceed to step 1403. In step 1403, substantially as described above with regard to FIG. 13, computing resources 1304 can be configured to train a recurrent neural network using the received data. The training of the recurrent neural network can proceed as described in “Training Recurrent Neural Networks,” 2013, by Ilya Sutskever, which is incorporated herein by reference in its entirety.
  • Process 1400 can then proceed to step 1405. In step 1405, substantially as described above with regards to FIG. 13, dataset generator 1307 can be configured to generate synthetic JSON log data using the trained neural network. In some embodiments, dataset generator 1307 can be configured to generate the synthetic JSON log data at the same rate as actual JSON log data is received by streaming data source 1301. For example, dataset generator 1307 can be configured to generate batches of JSON log data at regular time intervals, the number of elements in a batch dependent on the number of elements received by streaming data source 1301. As an additional example, dataset generator 1307 can be configured to generate an element of synthetic JSON log data upon receipt of an element of actual JSON log data from streaming data source 1301.
  • Process 1400 can then proceed to step 1407. In step 1407, dataset generator 1307 (or another component of system 1300) can be configured to validate the synthetic data stream. For example, dataset generator 1307 can be configured to use a JSON validator (e.g., JSON SCHEMA VALIDATOR, JSONLINT, or the like) and a schema for the reference data stream to validate the synthetic data stream. In some embodiments, the schema describes key-value pairs present in the reference data stream. In some aspects, system 1300 can be configured to derive the schema from the reference data stream. In some embodiments, validating the synthetic data stream can include validating that keys present in the synthetic data stream are present in the schema. For example, when the schema includes the keys “first_name”: {“type”: “string”} and “last_name”: {“type”: “string”}, system 1300 may not validate the synthetic data stream when objects in the data stream lack the “first_name” and “last_name” keys. Furthermore, in some embodiments, validating the synthetic data stream can include validating that key-value formats present in the synthetic data stream match corresponding key-value formats in the reference data stream. For example, when the schema includes the keys “first_name”: {“type”: “string”} and “last_name”: {“type”: “string”}, system 1300 may not validate the synthetic data stream when objects in the data stream include a numeric-valued “first_name” or “last_name”.
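A minimal key-and-type validator in the spirit of the validation described above (not a full JSON Schema implementation; the schema and records are illustrative):

```python
import json

# Schema from the example above: required keys and their expected types.
schema = {"first_name": {"type": "string"}, "last_name": {"type": "string"}}
type_map = {"string": str, "number": (int, float)}

def validate(record, schema):
    for key, spec in schema.items():
        if key not in record:
            return False                                  # missing required key
        if not isinstance(record[key], type_map[spec["type"]]):
            return False                                  # wrong value type
    return True

good = json.loads('{"first_name": "Ada", "last_name": "Lovelace"}')
bad = json.loads('{"first_name": 42, "last_name": "Lovelace"}')
```

Here `good` validates, while `bad` fails because its "first_name" is numeric rather than a string, mirroring the rejection cases described in the text.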
  • FIG. 15 depicts a system 1500 for secure generation and insecure use of models of sensitive data. System 1500 can include a remote system 1501 and a local system 1503 that communicate using network 1505. Remote system 1501 can be substantially similar to system 100 and be implemented, in some embodiments, as described in FIG. 4. For example, remote system 1501 can include an interface, model optimizer, and computing resources that resemble interface 113, model optimizer 107, and computing resources 101, respectively, described above with regards to FIG. 1. For example, the interface, model optimizer, and computing resources can provide similar functionality to interface 113, model optimizer 107, and computing resources 101, respectively, and can be similarly implemented. In some embodiments, remote system 1501 can be implemented using a cloud computing infrastructure. Local system 1503 can comprise a computing device, such as a smartphone, tablet, laptop, desktop, workstation, server, or the like. Network 1505 can include any combination of electronics communications networks enabling communication between components of system 1500 (similar to network 115).
  • In various embodiments, remote system 1501 can be more secure than local system 1503. For example, remote system 1501 can be better protected from physical theft or computer intrusion than local system 1503. As a non-limiting example, remote system 1501 can be implemented using AWS or a private cloud of an institution and managed at an institutional level, while the local system can be in the possession of, and managed by, an individual user. In some embodiments, remote system 1501 can be configured to comply with policies or regulations governing the storage, transmission, and disclosure of customer financial information, patient healthcare records, or similar sensitive information. In contrast, local system 1503 may not be configured to comply with such regulations.
  • System 1500 can be configured to perform a process of generating synthetic data. According to this process, system 1500 can train the synthetic data model on sensitive data using remote system 1501, in compliance with regulations governing the storage, transmission, and disclosure of sensitive information. System 1500 can then transmit the synthetic data model to local system 1503, which can be configured to use the model to generate synthetic data locally. In this manner, local system 1503 can be configured to use synthetic data resembling the sensitive information, in compliance with policies or regulations governing the storage, transmission, and disclosure of such information.
  • According to this process, the model optimizer can receive a data model generation request from the interface. In response to the request, the model optimizer can provision computing resources with a synthetic data model. The computing resources can train the synthetic data model using a sensitive dataset (e.g., consumer financial information, patient healthcare information, or the like). The model optimizer can be configured to evaluate performance criteria of the data model (e.g., the similarity metric and prediction metrics described herein, or the like). Based on the evaluation of the performance criteria of the synthetic data model, the model optimizer can be configured to store the trained data model and metadata of the data model (e.g., values of the similarity metric and prediction metrics, the origin of the new synthetic data model, the data used to generate the new synthetic data model, when the new synthetic data model was generated, and the like). For example, the model optimizer can determine that the synthetic data model satisfied predetermined acceptability criteria based on one or more similarity and/or prediction metric values.
  • Local system 1503 can then retrieve the synthetic data model from remote system 1501. In some embodiments, local system 1503 can be configured to retrieve the synthetic data model in response to a synthetic data generation request received by local system 1503. For example, a user can interact with local system 1503 to request generation of synthetic data. In some embodiments, the synthetic data generation request can specify metadata criteria for selecting the synthetic data model. Local system 1503 can interact with remote system 1501 to select the synthetic data model based on the metadata criteria. Local system 1503 can then generate the synthetic data using the data model in response to the data generation request.
  • FIG. 16 depicts a process 1600 for transforming any model into a neural network model, consistent with disclosed embodiments.
  • Process 1600 is performed by components of system 100, including computing resources 101, dataset generator 103, database 105, model optimizer 107, model storage 109, model curator 111, and interface 113, consistent with disclosed embodiments. In some embodiments, process 1600 is performed by components of system 400, an exemplary implementation of system 100. For example, steps of process 1600 may be performed by model optimizer 107 as implemented on model optimization instance 409.
  • As shown in FIG. 16, at step 1602, input data is received. In some embodiments, input is received at, for example, model optimizer 107 from one or more components of system 100. In some embodiments, input is received from a system outside system 100 via interface 113.
  • The input data of step 1602 includes an input command, an input model having an input model type, and an input dataset. The input model may be any type of data model (e.g., a random forest model, a gradient boosting machine, a regression model, a logistic regression, an object based logical model, a physical data model, a neural network model, or any other data model). The input command includes a command to transform the input model into a neural network model. In some embodiments, the input command specifies one or more neural network types (e.g., neural network, recurrent neural network, generative adversarial network, or the like). In some embodiments, the input command specifies one or more model parameters for the one or more types of neural networks (e.g., the number of features and number of layers in a recurrent neural network). In some embodiments, the input command specifies model selection criteria. The model selection criteria may comprise a desired performance metric of the transformed model. The performance metric may be an accuracy score of the neural network model, a model run time, a root-mean-square error estimate (RMSE), a logloss, an Akaike's Information Criterion (AICC), a Bayesian Information Criterion (BIC), an area under a receiver operating characteristic (ROC) curve, Precision, Recall, an F-Score, or the like. In some embodiments, the input dataset is a dataset previously used to train the input model. In some embodiments, the input command specifies the location of a database, and receiving the input data includes retrieving the input dataset from the database.
  • At step 1604, the input model is applied to the input dataset to generate model output. In some embodiments, model optimizer 107 applies the input model, consistent with the embodiments. That is, computing resources are provisioned to run the model over the input dataset. In some embodiments, step 1604 includes spinning up a virtual machine or ephemeral container instance to run the input model (e.g., development instance 407). In some embodiments, step 1604 includes generating a map of input model features. The map may include references that relate output data to input data (e.g., a foreign key to the input data). For example, if the input dataset comprises rows and columns of data, the map may identify a correspondence between input data values in rows numbering row A to row Z and a set of 26 model output data values that the model produced based on input data values in rows A to Z. Model features may include input values from the input data set or transformed input data values used to run the model prior to step 1604 (e.g., via normalizing, averaging, or other transformations). In some embodiments, generating model output includes transforming the input values into transformed values, and running the input model over the transformed values.
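One way to sketch such a map of input model features, using a hypothetical toy model and row identifiers as foreign keys back to the input data:

```python
# Input data keyed by row identifier (the foreign key).
input_rows = {"row_a": [1.0, 2.0], "row_b": [3.0, 4.0]}

def toy_model(features):
    # Stand-in for the input model; a real model would be provisioned
    # on computing resources and run over the input dataset.
    return sum(features)

# Map relating each model output back to the input row that produced it.
feature_map = {
    row_id: {"features": feats, "output": toy_model(feats)}
    for row_id, feats in input_rows.items()
}
```

Storing the map alongside the model output (as in step 1606) lets later steps trace any output value back to the input features that produced it.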
  • At step 1606, model output is stored. In some embodiments, storing model output includes one of (a) storing model output along with a map of input model features or (b) storing model output along with model features. Model output may include a modeling result, a log, or other model output. Model output may be stored in a database or cloud computing bucket. For example, model output may be stored in database 105.
  • At step 1608, one or more candidate neural network models are generated. In some embodiments, model optimizer 107 generates candidate neural network models, consistent with the embodiments. Generating a candidate network model may include spinning up an ephemeral container instance (e.g., development instance 407). In some embodiments, the number and type of candidate neural network models are based on the command received at step 1602. In some embodiments, the number and type of candidate neural network models are pre-determined. In some embodiments, generating candidate neural network models includes retrieving a candidate neural network model from a model storage. For example, candidate neural network models may be retrieved from model storage 109. Retrieving candidate neural network models may be based on a model index.
  • The one or more candidate neural network models of step 1608 have model parameters which are based on the features of the input model such that the one or more candidate neural network models are overfitted to the input model. For example, if the input model is a linear regression model having a set of regression coefficients, each coefficient producing a result, then a candidate neural network may be designed to reproduce the result of each regression coefficient and the overall regression result of the input model. The model parameters of the one or more candidate neural network models may include, for example, the number of hidden layers, the number of nodes, the dropout rate, or other model parameters.
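As an illustrative sketch, a small neural network (here with linear activations, to keep the example short) can be trained to reproduce the outputs of a linear-regression input model; the layer sizes, learning rate, and iteration count are arbitrary choices, not parameters from the disclosure:

```python
import numpy as np

rng = np.random.default_rng(0)

# Input model: a linear regression with known coefficients.
true_coef = np.array([2.0, -1.0, 0.5])
X = rng.normal(size=(256, 3))
y_input_model = X @ true_coef

# Candidate network: one hidden layer, trained by gradient descent to
# reproduce (overfit to) the input model's outputs rather than any
# held-out data.
W1 = rng.normal(scale=0.1, size=(3, 8))
W2 = rng.normal(scale=0.1, size=(8,))
lr = 0.05
for _ in range(1000):
    hidden = X @ W1
    err = hidden @ W2 - y_input_model
    grad_W2 = hidden.T @ err / len(X)
    grad_W1 = X.T @ np.outer(err, W2) / len(X)
    W2 -= lr * grad_W2
    W1 -= lr * grad_W1

mse = np.mean((X @ W1 @ W2 - y_input_model) ** 2)
```

After training, the network's predictions match the regression's outputs almost exactly, which is the sense in which the candidate model is "overfitted to the input model."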
  • At step 1610, the candidate neural network models are tuned to the input model, consistent with disclosed embodiments. In some embodiments, model optimizer 107 tunes the candidate neural network models, consistent with the embodiments. Tuning includes, for example, adjusting a number of hidden layers, a number of inputs, or a type of layer for any of the one or more candidate neural network models. Tuning includes training a candidate model to reproduce features of the input model, consistent with disclosed embodiments. In some embodiments, tuning a candidate model includes iteratively performing operations to: select model training parameters; receive candidate model results; update the model training parameters based on the received candidate model results; and receive updated candidate model results based on the updated model training parameters. Model training parameters may include, for example, batch size, number of training batches, number of epochs, chunk size, time window, input noise dimension, or the like, consistent with disclosed embodiments. Training may terminate when a training condition is satisfied. For example, training may terminate at a pre-determined run time, after a pre-determined number of epochs, or based on an accuracy score of a candidate model. In some embodiments, different candidate models are trained using separate ephemeral container instances (e.g., development instance 407).
  • At step 1612, model output is received from the candidate neural network models. The candidate model output may include an overall accuracy score of a candidate model and a plurality of accuracy scores corresponding to the features of the input model. Model output at step 1612 may include a log file, a run time, a number of epochs, an error, an estimate of drift, or other model-run information. Model output may specify a model parameter, consistent with disclosed embodiments. In some embodiments, model optimizer 107 receives model output from, for example, resources provisioned to run the candidate neural network models, including an ephemeral container instance, consistent with the embodiments.
  • At step 1614, a candidate neural network model is selected based on one or more selection criteria. In some embodiments, model optimizer 107 selects the candidate neural network model, consistent with the embodiments. In some embodiments, the one or more selection criteria are the selection criteria received in the input command (step 1602). In some embodiments, the one or more selection criteria is predetermined. The model selection criteria may comprise a desired performance metric of the transformed model (e.g., an accuracy score of the neural network model, a model run time, a root-mean-square error estimate (RMSE), a logloss, an Akaike's Information Criterion (AICC), a Bayesian Information Criterion (BIC), an area under a receiver operating characteristic (ROC) curve, Precision, Recall, an F-Score, or the like). In some embodiments, selecting a candidate neural network model at step 1614 includes terminating processes relating to candidate neural network models that are not selected. For example, step 1614 may include spinning down or terminating container instances (e.g., development instances 407) running candidate neural network models that are not selected.
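Selection by a performance metric can be sketched as follows, assuming hypothetical candidate results and using lowest RMSE as the selection criterion:

```python
# Hypothetical candidate model results keyed by model name.
candidate_results = {
    "rnn_small": {"rmse": 0.42, "run_time_s": 12.0},
    "rnn_large": {"rmse": 0.31, "run_time_s": 55.0},
    "gan": {"rmse": 0.38, "run_time_s": 33.0},
}

# Select the candidate with the lowest RMSE; unselected candidates'
# resources (e.g., container instances) would then be terminated.
selected = min(candidate_results, key=lambda name: candidate_results[name]["rmse"])
```

A different criterion from the list in the text (accuracy, run time, AICC, and so on) would simply swap the key function.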
  • At step 1616, the selected neural network model is returned. In some embodiments, returning the selected neural network model includes transmitting, from model optimizer 107, the selected neural network model to an outside system via interface 113. In some embodiments, returning the selected neural network model includes storing the selected neural network model. For example, the step 1616 may include storing the selected neural network in a database or a bucket (e.g., database 105).
  • FIG. 17 depicts a process 1700 for transforming a legacy model, consistent with disclosed embodiments.
  • Process 1700 is performed by components of system 100, including computing resources 101, dataset generator 103, database 105, model optimizer 107, model storage 109, model curator 111, and interface 113, consistent with disclosed embodiments. In some embodiments, process 1700 is performed by components of system 400, an exemplary implementation of system 100. For example, steps of process 1700 may be performed by model optimizer 107 as implemented on model optimization instance 409.
  • As shown in FIG. 17, at step 1702, inputs are received. In some embodiments, inputs are received by model optimizer 107 from interface 113, consistent with disclosed embodiments. The input may include a model of any model type, e.g., the legacy model, and a command to transform a legacy model into a new model of the same model type as the legacy model, the new model having a different programming environment than the legacy model programming environment. For example, the command may be to transform a linear regression model built in SAS to a linear regression model built in PYTHON. In some embodiments, the legacy model and the new model share the same programming environment but use different code packages. For example, the legacy model may be a Gradient Boosting Machine built with SCIKIT LEARN (using PYTHON libraries) and the new model may be a Gradient Boosting Machine built with XGBOOST (also using PYTHON libraries). Step 1702 may include receiving metadata of the legacy model. Metadata may include arguments used by the model. For example, in some embodiments, the legacy model is a random forest and metadata includes a number of trees, a depth of trees, and a min sample split.
  • In some embodiments, receiving input at step 1702 includes sub-steps to receive a model of any type, receive a dataset, and apply the model to the received dataset (e.g., steps 1602-1606). In some embodiments, model data is received from an API call via, for example, interface 113.
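A hypothetical shape of the step 1702 input is sketched below in PYTHON. The field names and values are illustrative assumptions only, not a format disclosed by the embodiments.

```python
# Hypothetical step 1702 input: a legacy model reference, a transformation
# command, and optional metadata. All field names and values are assumptions.
step_1702_input = {
    "legacy_model": "gbm_scikit_learn.pkl",        # model of any model type
    "command": {
        "target_environment": "XGBOOST",            # new-model environment/package
        "model_type": "gradient_boosting_machine",  # same type as the legacy model
    },
    "metadata": {                                   # arguments used by the model
        "n_estimators": 100,                        # number of trees
        "max_depth": 8,                             # depth of trees
        "min_samples_split": 2,                     # min sample split
    },
}
```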
  • At step 1704, a determination is made whether model metadata is present, i.e., whether model metadata was received at step 1702, and the system proceeds to either step 1705 or step 1708 based on the determination. In some embodiments, model optimizer 107 may perform the determination of step 1704.
  • Step 1705 may be performed if model metadata was received at step 1702. At step 1705, a feature map is received. The feature map may include instructions to transform features of one programming environment into another programming environment (or from one package into another package). For example, using metadata, model optimizer 107 may identify features of the legacy model programming environment. In some aspects, legacy model features may map directly onto the new model programming environment. As an illustrative example, if the legacy model is a random forest model in SCIKIT-LEARN and the new model is a random forest model in XGBOOST, arguments using different syntax in both models may specify the number of trees, depth of trees, and/or min sample split, and the feature map may include instructions to transform these arguments from SCIKIT-LEARN to XGBOOST.
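As an illustrative sketch (not part of the disclosed embodiments), such a feature map might be expressed in PYTHON as a dictionary renaming legacy-model arguments. The specific mapping entries, including the use of min_child_weight as a stand-in, are assumptions for illustration.

```python
# Hypothetical feature map translating random-forest arguments from
# SCIKIT-LEARN naming to XGBOOST naming. Entries are illustrative assumptions.
FEATURE_MAP = {
    "n_estimators": "num_parallel_tree",  # number of trees
    "max_depth": "max_depth",             # depth of trees
    # min_samples_split has no direct XGBoost equivalent; min_child_weight
    # is used here as an approximate stand-in (an assumption).
    "min_samples_split": "min_child_weight",
}

def translate_params(legacy_params, feature_map):
    """Rename legacy-model arguments using the feature map, dropping any
    argument the map does not cover."""
    return {feature_map[k]: v for k, v in legacy_params.items()
            if k in feature_map}

legacy = {"n_estimators": 100, "max_depth": 8, "min_samples_split": 2}
new_params = translate_params(legacy, FEATURE_MAP)
```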
  • At step 1706, one or more candidate new models are generated with parameters that are based on the feature map. Machine learning is used to optimize parameters not specified by the metadata. In some embodiments, model optimizer 107 may generate the candidate new models. In some embodiments, model optimizer 107 may spin up one or more ephemeral container instances (e.g., development instance 407) to generate a candidate new model.
  • Step 1708 may be performed if metadata was not received at step 1702. At step 1708, a global grid search over new model parameters may be performed to generate one or more candidate new models that approximate the model output of the legacy model. The global grid search of step 1708 may be a broad (unrefined) parameter search. For example, if the legacy model is a random forest model, the global grid search may comprise obtaining a model performance metric for each of a series of candidate random forest models that vary by the number of trees ranging from 0 to 10 with a step size of 2; the depth of trees ranging from 0 to 20 with a step size of 5; and/or a min sample split ranging from 0 to 1 with a step size of 0.1. A subset of candidate new models may be identified at step 1708 based on a model performance metric, consistent with disclosed embodiments. In some embodiments, model optimizer 107 performs the global grid search. In some embodiments, model optimizer 107 spins up one or more ephemeral container instances or identifies one or more running container instances to perform the global grid search (e.g., development instance 407).
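A minimal PYTHON sketch of such a global grid search follows, using the illustrative ranges from the example above. The scoring function is a placeholder assumption standing in for fitting each candidate and measuring how well it approximates the legacy model's output.

```python
import itertools

# Broad (unrefined) parameter grid, using the illustrative ranges from the
# text: trees 0-10 step 2, depth 0-20 step 5, min sample split 0-1 step 0.1.
grid = {
    "num_trees": range(0, 11, 2),
    "tree_depth": range(0, 21, 5),
    "min_sample_split": [round(0.1 * i, 1) for i in range(11)],
}

def global_grid_search(grid, score_fn):
    """Evaluate every parameter combination; return (best_params, best_score)."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*grid.values()):
        params = dict(zip(names, values))
        score = score_fn(params)  # placeholder for a model performance metric
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```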
  • At step 1710, a closed-loop (refined) grid search may be performed over the parameters to fit one or more candidate new models (of either step 1706 or step 1708) to the legacy model. In some embodiments, the closed-loop grid search may include a grid search over parameters near to the parameters of the model received as input at step 1702, i.e., the parameters of the input model serve as a seed for the closed-loop grid search. In some embodiments, the closed-loop grid search may include a grid search over parameters near the parameters of the subset of candidate new models identified at step 1708. For example, the range and step size of the closed-loop grid search may be smaller than the range and step size of parameters used in the global grid search. In some embodiments, model optimizer 107 performs the closed-loop grid search. In some embodiments, model optimizer 107 spins up one or more ephemeral container instances or identifies one or more running container instances to perform the closed-loop grid search (e.g., development instance 407).
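The refined grid of step 1710 might be built as sketched below: a narrow grid centered on seed parameters, with a smaller range and step size than the global search. The span and step values are illustrative assumptions.

```python
# Sketch of a refined (closed-loop) grid centered on integer seed parameters.
# The span/step defaults are illustrative assumptions only.
def refined_grid(seed_params, span=2, step=1):
    """Build a per-parameter list of values near each (integer) seed value,
    excluding negative values."""
    return {
        name: [v for v in range(value - span, value + span + 1, step) if v >= 0]
        for name, value in seed_params.items()
    }

narrow = refined_grid({"num_trees": 6, "tree_depth": 10})
# e.g., num_trees is searched only over 4..8 instead of the full global range
```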
  • At step 1712, candidate new models may be applied to the input dataset, and the results are compared to the legacy model output. In some embodiments, the comparison may include at least one of an accuracy score, a model run time, a root-mean-square error estimate (RMSE), a logloss, an AICC, a BIC, an area under an ROC curve, Precision, Recall, an F-Score, or the like. In some embodiments, model optimizer 107 may perform step 1712.
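One such comparison metric, RMSE between a candidate new model's predictions and the legacy model's output on the same input dataset, can be sketched as follows. The example values are illustrative only.

```python
import math

def rmse(candidate_output, legacy_output):
    """Root-mean-square error between two equal-length prediction lists."""
    n = len(legacy_output)
    return math.sqrt(
        sum((c - l) ** 2 for c, l in zip(candidate_output, legacy_output)) / n
    )

# A candidate whose predictions closely track the legacy output yields a
# small RMSE; identical outputs yield 0.
error = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```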
  • At step 1714, a candidate new model may be selected based on the comparison at step 1712 and one or more selection criteria. In some embodiments, model optimizer 107 may select the new candidate model.
  • At step 1716, an updated feature map may be created, consistent with disclosed embodiments. In some embodiments, the updated feature map may be based on one or more of the legacy model, the feature map received at step 1705, and the selected new model. In some embodiments, the updated feature map may be newly generated based on the legacy model and the selected new model. The feature map may include instructions to transform features of one programming environment into another programming environment (or from one package into another package).
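A minimal sketch of updating a feature map, under the assumption that the map is represented as a dictionary and that newly learned legacy-to-new argument mappings override existing entries, might look like this. The mapping entries are hypothetical.

```python
# Sketch: merge newly learned argument mappings (derived from the selected
# new model) into an existing feature map; learned entries take precedence.
def update_feature_map(feature_map, learned_mappings):
    """Return a new map with learned mappings merged in."""
    updated = dict(feature_map)
    updated.update(learned_mappings)
    return updated

updated_map = update_feature_map(
    {"n_estimators": "num_parallel_tree"},      # hypothetical existing entry
    {"min_samples_split": "min_child_weight"},  # hypothetical learned entry
)
```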
  • At step 1718, the selected new model may be returned. In some embodiments, returning the selected new model may include transmitting, by model optimizer 107, the selected new model to an outside system via interface 113. In some embodiments, returning the selected new model includes storing the selected new model. For example, step 1718 may include storing the selected new model in a database or a bucket (e.g., database 105). In some embodiments, step 1718 may include at least one of returning the updated feature map with the selected new model or storing the updated feature map in memory (e.g., model storage 109).
  • EXAMPLE: Generating Cancer Data
  • As described above, the disclosed systems and methods can enable generation of synthetic data similar to an actual dataset (e.g., using dataset generator 103). The synthetic data can be generated using a data model trained on the actual dataset (e.g., as described above with regards to FIG. 9). Such data models can include generative adversarial networks. The following code depicts the creation of a synthetic dataset based on sensitive patient healthcare records using a generative adversarial network.
  • The following step defines a Generative Adversarial Network data model.
  • model_options={'GANhDim': 498, 'GANZDim': 20, 'num_epochs': 3}
  • The following step defines the delimiters present in the actual data.
  • data_options={'delimiter': ','}
  • In this example, the dataset is the publicly available University of Wisconsin Cancer dataset, a standard dataset used to benchmark machine learning prediction tasks. Given characteristics of a tumor, the task is to predict whether the tumor is malignant.
  • data=Data(input_file_path='wisconsin_cancer_train.csv', options=data_options)
  • In these steps, the GAN model is trained to generate data statistically similar to the actual data.
  • ss=SimpleSilo('GAN', model_options)
  • ss.train(data)
  • The GAN model can now be used to generate synthetic data.
  • generated_data=ss.generate(num_output_samples=5000)
  • The synthetic data can be saved to a file for later use in training other machine learning models for this prediction task without relying on the original data.
  • simplesilo.save_as_csv(generated_data, output_file_path='wisconsin_cancer_GAN.csv')
  • ss.save_model_into_file('cancer_data_model')
  • Tokenizing Sensitive Data
  • As described above with regards to at least FIGS. 5A and 5B, the disclosed systems and methods can enable identification and removal of sensitive data portions in a dataset. In this example, sensitive portions of a dataset are automatically detected and replaced with synthetic data. In this example, the dataset includes human resources records. The sensitive portions of the dataset are replaced with random values (though they could also be replaced with synthetic data that is statistically similar to the original data, as described in FIGS. 5A and 5B). In particular, this example depicts tokenizing four columns of the dataset. The Business Unit and Active Status columns are tokenized such that all the characters in the values can be replaced by random characters of the same type while preserving format. For the Company column, the first three characters of the values can be preserved but the remainder of each value can be tokenized. Finally, the values of the Last Day of Work column can be replaced with fully random values. All of these replacements can be consistent across the columns.
  • input_data=Data('hr_data.csv')
  • keys_for_formatted_scrub={'Business Unit': None, 'Active Status': None, 'Company': (0, 3)}
  • keys_to_randomize=['Last Day of Work']
  • tokenized_data, scrub_map=input_data.tokenize(keys_for_formatted_scrub=keys_for_formatted_scrub, keys_to_randomize=keys_to_randomize)
  • tokenized_data.save_data_into_file('hr_data_tokenized.csv')
  • Alternatively, the system can use the scrub map to tokenize another file in a consistent way (e.g., replace the same values with the same replacements across both files) by passing the returned scrub_map dictionary to a new application of the tokenize function.
  • input_data_2=Data('hr_data_part2.csv')
  • keys_for_formatted_scrub={'Business Unit': None, 'Company': (0, 3)}
  • keys_to_randomize=['Last Day of Work']
  • To tokenize the second file, we pass the scrub_map dictionary to the tokenize function.
  • tokenized_data_2, scrub_map=input_data_2.tokenize(keys_for_formatted_scrub=keys_for_formatted_scrub, keys_to_randomize=keys_to_randomize, scrub_map=scrub_map)
  • tokenized_data_2.save_data_into_file('hr_data_tokenized_2.csv')
  • In this manner, the disclosed systems and methods can be used to consistently tokenize sensitive portions of a file.
  • Other embodiments will be apparent to those skilled in the art from consideration of the specification and practice of the embodiments disclosed herein. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosed embodiments being indicated by the following claims. Furthermore, although aspects of the disclosed embodiments are described as being associated with data stored in memory and other tangible computer-readable storage mediums, one skilled in the art will appreciate that these aspects can also be stored on and executed from many types of tangible computer-readable media, such as secondary storage devices, like hard disks, floppy disks, or CD-ROM, or other forms of RAM or ROM. Accordingly, the disclosed embodiments are not limited to the above-described examples, but instead are defined by the appended claims in light of their full scope of equivalents.
  • Moreover, while illustrative embodiments have been described herein, the scope includes any and all embodiments having equivalent elements, modifications, omissions, combinations (e.g., of aspects across various embodiments), adaptations or alterations based on the present disclosure. The elements in the claims are to be interpreted broadly based on the language employed in the claims and not limited to examples described in the present specification or during the prosecution of the application, which examples are to be construed as non-exclusive. Further, the steps of the disclosed methods can be modified in any manner, including by reordering steps or inserting or deleting steps. It is intended, therefore, that the specification and examples be considered as example only, with a true scope and spirit being indicated by the following claims and their full scope of equivalents.

Claims (21)

1-20. (canceled)
21. A system for transforming an input model into a neural network model, the system comprising:
one or more memory units for storing instructions; and
one or more processors configured to execute the instructions to perform operations comprising:
receiving input data comprising the input model;
obtaining dataset model output by applying the input model to an input dataset;
storing the generated dataset model output and at least one of:
input model features; or a map of the input model features;
generating a plurality of candidate neural network models based on the input model features;
tuning the plurality of candidate neural network models by adjusting at least one of a plurality of hidden layers, a plurality of inputs, or a type of layer during tuning such that at least one of the input model features is reproduced;
selecting a neural network model from the plurality of candidate neural network models based on one or more model selection criteria; and
returning the selected neural network model.
22. The system of claim 21, wherein the input model comprises a regression model.
23. The system of claim 21, wherein the input dataset comprises synthetic data being a representation of original data.
24. The system of claim 21, wherein obtaining the dataset model output comprises:
sending a command to a development instance to apply the input model to the input dataset; and
obtaining the dataset model output from the development instance.
25. The system of claim 24, wherein the development instance comprises a virtual machine or an ephemeral container instance.
26. The system of claim 21, wherein generating the plurality of candidate neural network models comprises sending a command to a development instance to generate the plurality of candidate neural network models based on the input model features or the map of the input model features.
27. The system of claim 26, wherein the development instance comprises a virtual machine or an ephemeral container instance.
28. The system of claim 26, wherein:
the command specifies one or more model parameters; and
generating the plurality of neural network models comprises generating at least one of the plurality of neural network models based on the one or more model parameters.
29. The system of claim 26, wherein:
the command specifies a model type; and
generating the plurality of candidate neural network models comprises generating at least one candidate neural network model of the specified model type.
30. The system of claim 29, wherein:
the command further specifies a number of candidate neural network models of the specified model type; and
generating the plurality of candidate neural network models comprises generating the specified number of candidate neural network models of the specified type.
31. The system of claim 21, wherein generating the plurality of candidate neural network models comprises retrieving at least one candidate neural network model from a model storage.
32. The system of claim 21, wherein generating the plurality of candidate neural network models comprises overfitting at least one of the plurality of candidate neural network models to the input model.
33. The system of claim 21, wherein tuning the plurality of candidate neural network models comprises:
training at least one of the plurality of candidate neural network models; and
terminating the training when one or more training conditions are satisfied.
34. The system of claim 33, wherein the one or more training conditions comprise at least one of a run time, a number of epochs, or an accuracy score.
35. The system of claim 33, wherein training the at least one of the plurality of candidate neural network models comprises:
sending a first command to a first development instance to train a first candidate neural network model; and
sending a second command to a second development instance different from the first development instance to train a second candidate neural network model.
36. The system of claim 21, wherein the one or more model selection criteria comprise a selection criterion associated with an accuracy score of a candidate neural network model with respect to the input model features.
37. The system of claim 21, wherein the one or more model selection criteria comprise a selection criterion associated with a model run time of a candidate neural network model.
38. The system of claim 21, wherein generating the plurality of candidate neural network models comprises:
sending a first command to a first development instance to generate a first candidate neural network model; and
sending a second command to a second development instance different from the first development instance to generate a second candidate neural network model.
39. A method for transforming an input model into a neural network model, the method comprising:
obtaining dataset model output by applying an input model to an input dataset, the input model having a plurality of input model features;
generating a plurality of candidate neural network models based on the input model features;
tuning the plurality of candidate neural network models by adjusting at least one of a plurality of hidden layers, a plurality of inputs, or a type of layer during tuning such that at least one of the input model features is reproduced;
selecting a neural network model from the plurality of candidate neural network models based on one or more model selection criteria; and
returning the selected neural network model.
40. A non-transitory computer-readable medium storing instructions that, when executed by at least one processor, are configured to cause the at least one processor to perform operations comprising:
obtaining dataset model output by applying an input model to an input dataset, the input model having a plurality of input model features;
generating a plurality of candidate neural network models based on the input model features;
tuning the plurality of candidate neural network models by adjusting at least one of a plurality of hidden layers, a plurality of inputs, or a type of layer during tuning such that at least one of the input model features is reproduced;
selecting a neural network model from the plurality of candidate neural network models based on one or more model selection criteria, a model type of the selected neural network model being different from a model type of the input model; and
returning the selected neural network model.
US17/464,796 2018-07-06 2021-09-02 Systems and methods to use neural networks for model transformations Pending US20220092419A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/464,796 US20220092419A1 (en) 2018-07-06 2021-09-02 Systems and methods to use neural networks for model transformations

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US201862694968P 2018-07-06 2018-07-06
US16/172,480 US11126475B2 (en) 2018-07-06 2018-10-26 Systems and methods to use neural networks to transform a model into a neural network model
US17/464,796 US20220092419A1 (en) 2018-07-06 2021-09-02 Systems and methods to use neural networks for model transformations

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US16/172,480 Continuation US11126475B2 (en) 2018-07-06 2018-10-26 Systems and methods to use neural networks to transform a model into a neural network model

Publications (1)

Publication Number Publication Date
US20220092419A1 true US20220092419A1 (en) 2022-03-24

Family

ID=67543579

Family Applications (57)

Application Number Title Priority Date Filing Date
US16/151,431 Pending US20200012890A1 (en) 2018-07-06 2018-10-04 Systems and methods for data stream simulation
US16/152,072 Active US10635939B2 (en) 2018-07-06 2018-10-04 System, method, and computer-accessible medium for evaluating multi-dimensional synthetic data using integrated variants analysis
US16/151,385 Active US10460235B1 (en) 2018-07-06 2018-10-04 Data model generation using generative adversarial networks
US16/151,407 Active 2039-01-29 US11615208B2 (en) 2018-07-06 2018-10-04 Systems and methods for synthetic data generation
US16/172,223 Active 2038-12-12 US11256555B2 (en) 2018-07-06 2018-10-26 Automatically scalable system for serverless hyperparameter tuning
US16/172,480 Active US11126475B2 (en) 2018-07-06 2018-10-26 Systems and methods to use neural networks to transform a model into a neural network model
US16/172,344 Active US10599957B2 (en) 2018-07-06 2018-10-26 Systems and methods for detecting data drift for data used in machine learning models
US16/172,508 Active US10592386B2 (en) 2018-07-06 2018-10-26 Fully automated machine learning system which generates and optimizes solutions given a dataset and a desired outcome
US16/172,430 Active 2038-12-06 US11210144B2 (en) 2018-07-06 2018-10-26 Systems and methods for hyperparameter tuning
US16/173,374 Active US10382799B1 (en) 2018-07-06 2018-10-29 Real-time synthetically generated video from still frames
US16/181,568 Active 2039-01-28 US11385942B2 (en) 2018-07-06 2018-11-06 Systems and methods for censoring text inline
US16/181,673 Active 2039-01-08 US10983841B2 (en) 2018-07-06 2018-11-06 Systems and methods for removing identifiable information
US16/251,867 Active US10459954B1 (en) 2018-07-06 2019-01-18 Dataset connector and crawler to identify data lineage and segment data
US16/263,141 Active US10521719B1 (en) 2018-07-06 2019-01-31 Systems and methods to identify neural network brittleness based on sample data and seed generation
US16/263,839 Active US10482607B1 (en) 2018-07-06 2019-01-31 Systems and methods for motion correction in synthetic images
US16/298,463 Active US11513869B2 (en) 2018-07-06 2019-03-11 Systems and methods for synthetic database query generation
US16/362,568 Active US10452455B1 (en) 2018-07-06 2019-03-22 Systems and methods to manage application program interface communications
US16/362,466 Active US10379995B1 (en) 2018-07-06 2019-03-22 Systems and methods to identify breaking application program interface changes
US16/362,537 Active US10860460B2 (en) 2018-07-06 2019-03-22 Automated honeypot creation within a network
US16/405,989 Active US10884894B2 (en) 2018-07-06 2019-05-07 Systems and methods for synthetic data generation for time-series data using data segments
US16/409,745 Active US11113124B2 (en) 2018-07-06 2019-05-10 Systems and methods for quickly searching datasets by indexing synthetic data generating models
US16/454,041 Active US10664381B2 (en) 2018-07-06 2019-06-26 Method and system for synthetic generation of time series data
US16/457,548 Active US10599550B2 (en) 2018-07-06 2019-06-28 Systems and methods to identify breaking application program interface changes
US16/457,670 Active US11032585B2 (en) 2018-07-06 2019-06-28 Real-time synthetically generated video from still frames
US16/503,428 Active US10671884B2 (en) 2018-07-06 2019-07-03 Systems and methods to improve data clustering using a meta-clustering model
US16/565,565 Active US11210145B2 (en) 2018-07-06 2019-09-10 Systems and methods to manage application program interface communications
US16/577,010 Active US11182223B2 (en) 2018-07-06 2019-09-20 Dataset connector and crawler to identify data lineage and segment data
US16/658,858 Active US10896072B2 (en) 2018-07-06 2019-10-21 Systems and methods for motion correction in synthetic images
US16/666,316 Active 2040-11-16 US11704169B2 (en) 2018-07-06 2019-10-28 Data model generation using generative adversarial networks
US16/715,924 Active 2040-12-04 US11836537B2 (en) 2018-07-06 2019-12-16 Systems and methods to identify neural network brittleness based on sample data and seed generation
US16/748,917 Active US10970137B2 (en) 2018-07-06 2020-01-22 Systems and methods to identify breaking application program interface changes
US16/825,040 Active 2039-02-17 US11385943B2 (en) 2018-07-06 2020-03-20 System, method, and computer-accessible medium for evaluating multi-dimensional synthetic data using integrated variants analysis
US16/852,795 Pending US20200250071A1 (en) 2018-07-06 2020-04-20 Method and system for synthetic generation of time series data
US16/889,363 Active 2039-11-07 US11604896B2 (en) 2018-07-06 2020-06-01 Systems and methods to improve data clustering using a meta-clustering model
US17/084,203 Active US11237884B2 (en) 2018-07-06 2020-10-29 Automated honeypot creation within a network
US17/102,526 Active 2040-03-02 US11822975B2 (en) 2018-07-06 2020-11-24 Systems and methods for synthetic data generation for time-series data using data segments
US17/139,203 Active 2039-06-22 US11687382B2 (en) 2018-07-06 2020-12-31 Systems and methods for motion correction in synthetic images
US17/189,193 Active US11372694B2 (en) 2018-07-06 2021-03-01 Systems and methods to identify breaking application program interface changes
US17/220,409 Active 2039-03-11 US11574077B2 (en) 2018-07-06 2021-04-01 Systems and methods for removing identifiable information
US17/307,361 Active US11687384B2 (en) 2018-07-06 2021-05-04 Real-time synthetically generated video from still frames
US17/395,899 Pending US20210365305A1 (en) 2018-07-06 2021-08-06 Systems and methods for quickly searching datasets by indexing synthetic data generating models
US17/464,796 Pending US20220092419A1 (en) 2018-07-06 2021-09-02 Systems and methods to use neural networks for model transformations
US17/505,840 Pending US20220083402A1 (en) 2018-07-06 2021-10-20 Dataset connector and crawler to identify data lineage and segment data
US17/526,073 Pending US20220075670A1 (en) 2018-07-06 2021-11-15 Systems and methods for replacing sensitive data
US17/553,023 Active US11580261B2 (en) 2018-07-06 2021-12-16 Automated honeypot creation within a network
US17/585,698 Pending US20220147405A1 (en) 2018-07-06 2022-01-27 Automatically scalable system for serverless hyperparameter tuning
US17/836,614 Pending US20220308942A1 (en) 2018-07-06 2022-06-09 Systems and methods for censoring text inline
US17/845,786 Active US11900178B2 (en) 2018-07-06 2022-06-21 System, method, and computer-accessible medium for evaluating multi-dimensional synthetic data using integrated variants analysis
US18/050,694 Pending US20230073695A1 (en) 2018-07-06 2022-10-28 Systems and methods for synthetic database query generation
US18/091,638 Pending US20230205610A1 (en) 2018-07-06 2022-12-30 Systems and methods for removing identifiable information
US18/155,529 Active 2039-07-05 US11861418B2 (en) 2018-07-06 2023-01-17 Systems and methods to improve data clustering using a meta-clustering model
US18/165,725 Pending US20230195541A1 (en) 2018-07-06 2023-02-07 Systems and methods for synthetic data generation
US18/312,481 Pending US20230273841A1 (en) 2018-07-06 2023-05-04 Real-time synthetically generated video from still frames
US18/316,868 Pending US20230281062A1 (en) 2018-07-06 2023-05-12 Systems and methods for motion correction in synthetic images
US18/321,370 Pending US20230297446A1 (en) 2018-07-06 2023-05-22 Data model generation using generative adversarial networks
US18/360,482 Pending US20230376362A1 (en) 2018-07-06 2023-07-27 Systems and methods for synthetic data generation for time-series data using data segments
US18/383,946 Pending US20240054029A1 (en) 2018-07-06 2023-10-26 Systems and methods to identify neural network brittleness based on sample data and seed generation

US17/139,203 Active 2039-06-22 US11687382B2 (en) 2018-07-06 2020-12-31 Systems and methods for motion correction in synthetic images
US17/189,193 Active US11372694B2 (en) 2018-07-06 2021-03-01 Systems and methods to identify breaking application program interface changes
US17/220,409 Active 2039-03-11 US11574077B2 (en) 2018-07-06 2021-04-01 Systems and methods for removing identifiable information
US17/307,361 Active US11687384B2 (en) 2018-07-06 2021-05-04 Real-time synthetically generated video from still frames
US17/395,899 Pending US20210365305A1 (en) 2018-07-06 2021-08-06 Systems and methods for quickly searching datasets by indexing synthetic data generating models

Family Applications After (15)

Application Number Title Priority Date Filing Date
US17/505,840 Pending US20220083402A1 (en) 2018-07-06 2021-10-20 Dataset connector and crawler to identify data lineage and segment data
US17/526,073 Pending US20220075670A1 (en) 2018-07-06 2021-11-15 Systems and methods for replacing sensitive data
US17/553,023 Active US11580261B2 (en) 2018-07-06 2021-12-16 Automated honeypot creation within a network
US17/585,698 Pending US20220147405A1 (en) 2018-07-06 2022-01-27 Automatically scalable system for serverless hyperparameter tuning
US17/836,614 Pending US20220308942A1 (en) 2018-07-06 2022-06-09 Systems and methods for censoring text inline
US17/845,786 Active US11900178B2 (en) 2018-07-06 2022-06-21 System, method, and computer-accessible medium for evaluating multi-dimensional synthetic data using integrated variants analysis
US18/050,694 Pending US20230073695A1 (en) 2018-07-06 2022-10-28 Systems and methods for synthetic database query generation
US18/091,638 Pending US20230205610A1 (en) 2018-07-06 2022-12-30 Systems and methods for removing identifiable information
US18/155,529 Active 2039-07-05 US11861418B2 (en) 2018-07-06 2023-01-17 Systems and methods to improve data clustering using a meta-clustering model
US18/165,725 Pending US20230195541A1 (en) 2018-07-06 2023-02-07 Systems and methods for synthetic data generation
US18/312,481 Pending US20230273841A1 (en) 2018-07-06 2023-05-04 Real-time synthetically generated video from still frames
US18/316,868 Pending US20230281062A1 (en) 2018-07-06 2023-05-12 Systems and methods for motion correction in synthetic images
US18/321,370 Pending US20230297446A1 (en) 2018-07-06 2023-05-22 Data model generation using generative adversarial networks
US18/360,482 Pending US20230376362A1 (en) 2018-07-06 2023-07-27 Systems and methods for synthetic data generation for time-series data using data segments
US18/383,946 Pending US20240054029A1 (en) 2018-07-06 2023-10-26 Systems and methods to identify neural network brittleness based on sample data and seed generation

Country Status (2)

Country Link
US (57) US20200012890A1 (en)
EP (1) EP3591587A1 (en)

Families Citing this family (320)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11210604B1 (en) * 2013-12-23 2021-12-28 Groupon, Inc. Processing dynamic data within an adaptive oracle-trained learning system using dynamic data set distribution optimization
WO2016061576A1 (en) 2014-10-17 2016-04-21 Zestfinance, Inc. Api for implementing scoring functions
US11443206B2 (en) 2015-03-23 2022-09-13 Tibco Software Inc. Adaptive filtering and modeling via adaptive experimental designs to identify emerging data patterns from large volume, high dimensional, high velocity streaming data
US10776380B2 (en) * 2016-10-21 2020-09-15 Microsoft Technology Licensing, Llc Efficient transformation program generation
US10678846B2 (en) * 2017-03-10 2020-06-09 Xerox Corporation Instance-level image retrieval with a region proposal network
US10699139B2 (en) * 2017-03-30 2020-06-30 Hrl Laboratories, Llc System for real-time object detection and recognition using both image and size features
US10839291B2 (en) * 2017-07-01 2020-11-17 Intel Corporation Hardened deep neural networks through training from adversarial misclassified data
JP6715420B2 (en) * 2017-07-31 2020-07-01 AISing Ltd. Data amount compression method, device, program and IC chip
WO2019028179A1 (en) 2017-08-02 2019-02-07 Zestfinance, Inc. Systems and methods for providing machine learning model disparate impact information
US10935940B2 (en) 2017-08-03 2021-03-02 Johnson Controls Technology Company Building management system with augmented deep learning using combined regression and artificial neural network modeling
GB2578258B (en) * 2017-08-18 2022-02-16 Landmark Graphics Corp Rate of penetration optimization for wellbores using machine learning
KR101977174B1 (en) 2017-09-13 2019-05-10 Jae Jun Lee Apparatus, method and computer program for analyzing image
US10860618B2 (en) 2017-09-25 2020-12-08 Splunk Inc. Low-latency streaming analytics
US11120337B2 (en) * 2017-10-20 2021-09-14 Huawei Technologies Co., Ltd. Self-training method and system for semi-supervised learning with generative adversarial networks
JP2021503134A (en) * 2017-11-15 2021-02-04 Google LLC Unsupervised learning of image depth and egomotion prediction neural networks
US11727286B2 (en) * 2018-12-13 2023-08-15 Diveplane Corporation Identifier contribution allocation in synthetic data generation in computer-based reasoning systems
US11669769B2 (en) 2018-12-13 2023-06-06 Diveplane Corporation Conditioned synthetic data generation in computer-based reasoning systems
US11640561B2 (en) * 2018-12-13 2023-05-02 Diveplane Corporation Dataset quality for synthetic data generation in computer-based reasoning systems
US11676069B2 (en) 2018-12-13 2023-06-13 Diveplane Corporation Synthetic data generation using anonymity preservation in computer-based reasoning systems
US10817402B2 (en) * 2018-01-03 2020-10-27 Nec Corporation Method and system for automated building of specialized operating systems and virtual machine images based on reinforcement learning
US10997180B2 (en) 2018-01-31 2021-05-04 Splunk Inc. Dynamic query processor for streaming and batch queries
EP3528435B1 (en) * 2018-02-16 2021-03-31 Juniper Networks, Inc. Automated configuration and data collection during modeling of network devices
US10783660B2 (en) * 2018-02-21 2020-09-22 International Business Machines Corporation Detecting object pose using autoencoders
KR102532230B1 (en) * 2018-03-30 2023-05-16 Samsung Electronics Co., Ltd. Electronic device and control method thereof
EP4195112A1 (en) 2018-05-04 2023-06-14 Zestfinance, Inc. Systems and methods for enriching modeling tools and infrastructure with semantics
US10762669B2 (en) * 2018-05-16 2020-09-01 Adobe Inc. Colorization of vector images
US20190294982A1 (en) * 2018-06-16 2019-09-26 Moshe Guttmann Personalized selection of inference models
US11205121B2 (en) * 2018-06-20 2021-12-21 Disney Enterprises, Inc. Efficient encoding and decoding sequences using variational autoencoders
US10672174B2 (en) 2018-06-28 2020-06-02 Adobe Inc. Determining image handle locations
US10621764B2 (en) 2018-07-05 2020-04-14 Adobe Inc. Colorizing vector graphic objects
US20200012890A1 (en) * 2018-07-06 2020-01-09 Capital One Services, Llc Systems and methods for data stream simulation
US11741398B2 (en) * 2018-08-03 2023-08-29 Samsung Electronics Co., Ltd. Multi-layered machine learning system to support ensemble learning
US11315039B1 (en) 2018-08-03 2022-04-26 Domino Data Lab, Inc. Systems and methods for model management
EP3618287B1 (en) * 2018-08-29 2023-09-27 Université de Genève Signal sampling with joint training of learnable priors for sampling operator and decoder
US11580329B2 (en) 2018-09-18 2023-02-14 Microsoft Technology Licensing, Llc Machine-learning training service for synthetic data
US10978051B2 (en) * 2018-09-28 2021-04-13 Capital One Services, Llc Adversarial learning framework for persona-based dialogue modeling
US20200118691A1 (en) * 2018-10-10 2020-04-16 Lukasz R. Kiljanek Generation of Simulated Patient Data for Training Predicted Medical Outcome Analysis Engine
US10747642B2 (en) 2018-10-20 2020-08-18 Oracle International Corporation Automatic behavior detection and characterization in software systems
US11556746B1 (en) 2018-10-26 2023-01-17 Amazon Technologies, Inc. Fast annotation of samples for machine learning model development
US11537506B1 (en) * 2018-10-26 2022-12-27 Amazon Technologies, Inc. System for visually diagnosing machine learning models
US11568235B2 (en) * 2018-11-19 2023-01-31 International Business Machines Corporation Data driven mixed precision learning for neural networks
US11366704B2 (en) * 2018-11-27 2022-06-21 Sap Se Configurable analytics for microservices performance analysis
US20200175390A1 (en) * 2018-11-29 2020-06-04 International Business Machines Corporation Word embedding model parameter advisor
US20200175383A1 (en) * 2018-12-03 2020-06-04 Clover Health Statistically-Representative Sample Data Generation
US10839208B2 (en) * 2018-12-10 2020-11-17 Accenture Global Solutions Limited System and method for detecting fraudulent documents
WO2020123999A1 (en) 2018-12-13 2020-06-18 Diveplane Corporation Synthetic data generation in computer-based reasoning systems
US20200220869A1 (en) * 2019-01-08 2020-07-09 Fidelity Information Services, Llc Systems and methods for contactless authentication using voice recognition
US11514330B2 (en) * 2019-01-14 2022-11-29 Cambia Health Solutions, Inc. Systems and methods for continual updating of response generation by an artificial intelligence chatbot
US10798386B2 (en) * 2019-01-25 2020-10-06 At&T Intellectual Property I, L.P. Video compression with generative models
CN109829849B (en) * 2019-01-29 2023-01-31 CloudMinds Robotics Co., Ltd. Training data generation method and device and terminal
US11663493B2 (en) * 2019-01-30 2023-05-30 Intuit Inc. Method and system of dynamic model selection for time series forecasting
US11854433B2 (en) * 2019-02-04 2023-12-26 Pearson Education, Inc. Systems and methods for item response modelling of digital assessments
US10776720B2 (en) * 2019-02-05 2020-09-15 Capital One Services, Llc Techniques for bimodal learning in a financial context
US11816541B2 (en) 2019-02-15 2023-11-14 Zestfinance, Inc. Systems and methods for decomposition of differentiable and non-differentiable models
US10832734B2 (en) * 2019-02-25 2020-11-10 International Business Machines Corporation Dynamic audiovisual segment padding for machine learning
US11710034B2 (en) * 2019-02-27 2023-07-25 Intel Corporation Misuse index for explainable artificial intelligence in computing environments
US10878298B2 (en) * 2019-03-06 2020-12-29 Adobe Inc. Tag-based font recognition by utilizing an implicit font classification attention neural network
US11809966B2 (en) * 2019-03-07 2023-11-07 International Business Machines Corporation Computer model machine learning based on correlations of training data with performance trends
WO2020191057A1 (en) * 2019-03-18 2020-09-24 Zestfinance, Inc. Systems and methods for model fairness
SG10201903611RA (en) * 2019-03-20 2020-10-29 Avanseus Holdings Pte Ltd Method and system for determining an error threshold value for machine failure prediction
US11676063B2 (en) * 2019-03-28 2023-06-13 International Business Machines Corporation Exposing payload data from non-integrated machine learning systems
CN109951743A (en) * 2019-03-29 2019-06-28 Shanghai Bilibili Technology Co., Ltd. Barrage information processing method, system and computer equipment
US11475246B2 (en) 2019-04-02 2022-10-18 Synthesis Ai, Inc. System and method for generating training data for computer vision systems based on image segmentation
US11562134B2 (en) * 2019-04-02 2023-01-24 Genpact Luxembourg S.à r.l. II Method and system for advanced document redaction
US11922140B2 (en) * 2019-04-05 2024-03-05 Oracle International Corporation Platform for integrating back-end data analysis tools using schema
WO2020209078A1 (en) * 2019-04-09 2020-10-15 Sony Corporation Information processing device, information processing method, and program
US11290492B2 (en) * 2019-04-26 2022-03-29 EMC IP Holding Company LLC Malicious data manipulation detection using markers and the data protection layer
GB201905966D0 (en) * 2019-04-29 2019-06-12 Palantir Technologies Inc Security system and method
US11580442B2 (en) * 2019-04-30 2023-02-14 Cylance Inc. Machine learning model score obfuscation using time-based score oscillations
US11636393B2 (en) * 2019-05-07 2023-04-25 Cerebri AI Inc. Predictive, machine-learning, time-series computer models suitable for sparse training sets
US11328313B2 (en) * 2019-05-08 2022-05-10 Retailmenot, Inc. Predictive bounding of combinatorial optimizations that are based on data sets acquired post-prediction through high-latency, heterogenous interfaces
US11531875B2 (en) * 2019-05-14 2022-12-20 Nasdaq, Inc. Systems and methods for generating datasets for model retraining
US20200371778A1 (en) * 2019-05-21 2020-11-26 X Development Llc Automated identification of code changes
US11205138B2 (en) * 2019-05-22 2021-12-21 International Business Machines Corporation Model quality and related models using provenance data
US11657118B2 (en) * 2019-05-23 2023-05-23 Google Llc Systems and methods for learning effective loss functions efficiently
US20200372402A1 (en) * 2019-05-24 2020-11-26 Bank Of America Corporation Population diversity based learning in adversarial and rapid changing environments
US20200379640A1 (en) * 2019-05-29 2020-12-03 Apple Inc. User-realistic path synthesis via multi-task generative adversarial networks for continuous path keyboard input
US11704494B2 (en) * 2019-05-31 2023-07-18 Ab Initio Technology Llc Discovering a semantic meaning of data fields from profile data of the data fields
US11321771B1 (en) * 2019-06-03 2022-05-03 Intuit Inc. System and method for detecting unseen overdraft transaction events
US11568310B2 (en) * 2019-06-04 2023-01-31 Lg Electronics Inc. Apparatus for generating temperature prediction model and method for providing simulation environment
US11379348B2 (en) * 2019-06-21 2022-07-05 ProKarma Inc. System and method for performing automated API tests
US11243746B2 (en) * 2019-07-01 2022-02-08 X Development Llc Learning and using programming styles
FR3098941B1 (en) * 2019-07-15 2022-02-04 Bull SAS Device and method for performance analysis of an n-tier application
US11238048B1 (en) 2019-07-16 2022-02-01 Splunk Inc. Guided creation interface for streaming data processing pipelines
US11321284B2 (en) * 2019-07-19 2022-05-03 Vmware, Inc. Adapting time series database schema
US11762853B2 (en) 2019-07-19 2023-09-19 Vmware, Inc. Querying a variably partitioned time series database
US11500829B2 (en) 2019-07-19 2022-11-15 Vmware, Inc. Adapting time series database schema
US11609885B2 (en) 2019-07-19 2023-03-21 Vmware, Inc. Time series database comprising a plurality of time series database schemas
US11373104B2 (en) * 2019-07-26 2022-06-28 Bae Systems Information And Electronic Systems Integration Inc. Connecting OBP objects with knowledge models through context data layer
US11537880B2 (en) 2019-08-12 2022-12-27 Bank Of America Corporation System and methods for generation of synthetic data cluster vectors and refinement of machine learning models
US11531883B2 (en) * 2019-08-12 2022-12-20 Bank Of America Corporation System and methods for iterative synthetic data generation and refinement of machine learning models
US20210049473A1 (en) * 2019-08-14 2021-02-18 The Board Of Trustees Of The Leland Stanford Junior University Systems and Methods for Robust Federated Training of Neural Networks
US10831566B1 (en) * 2019-08-16 2020-11-10 Bank Of America Corporation Electronic system for intelligent processing and routing of incoming API requests based on context matching
US11847244B1 (en) * 2019-08-20 2023-12-19 Shoreline Labs, Inc. Private information detector for data loss prevention
US11522881B2 (en) * 2019-08-28 2022-12-06 Nec Corporation Structural graph neural networks for suspicious event detection
US11128737B1 (en) * 2019-08-28 2021-09-21 Massachusetts Mutual Life Insurance Company Data model monitoring system
US10885343B1 (en) * 2019-08-30 2021-01-05 Amazon Technologies, Inc. Repairing missing frames in recorded video with machine learning
US20210073202A1 (en) * 2019-09-11 2021-03-11 Workday, Inc. Computation system with time based probabilities
US11509674B1 (en) 2019-09-18 2022-11-22 Rapid7, Inc. Generating machine learning data in salient regions of a feature space
US11853853B1 (en) 2019-09-18 2023-12-26 Rapid7, Inc. Providing human-interpretable explanation for model-detected anomalies
US11816542B2 (en) * 2019-09-18 2023-11-14 International Business Machines Corporation Finding root cause for low key performance indicators
US11907821B2 (en) * 2019-09-27 2024-02-20 Deepmind Technologies Limited Population-based training of machine learning models
US11108780B2 (en) * 2019-09-27 2021-08-31 Aktana, Inc. Systems and methods for access control
US10970136B1 (en) 2019-10-03 2021-04-06 Caret Holdings, Inc. Adaptive application version integration support
US11544604B2 (en) * 2019-10-09 2023-01-03 Adobe Inc. Adaptive model insights visualization engine for complex machine learning models
US11593511B2 (en) * 2019-10-10 2023-02-28 International Business Machines Corporation Dynamically identifying and redacting data from diagnostic operations via runtime monitoring of data sources
US11347613B2 (en) 2019-10-15 2022-05-31 UiPath, Inc. Inserting probabilistic models in deterministic workflows for robotic process automation and supervisor system
US11861674B1 (en) 2019-10-18 2024-01-02 Meta Platforms Technologies, Llc Method, one or more computer-readable non-transitory storage media, and a system for generating comprehensive information for products of interest by assistant systems
US20210142224A1 (en) * 2019-10-21 2021-05-13 SigOpt, Inc. Systems and methods for an accelerated and enhanced tuning of a model based on prior model tuning data
US11429386B2 (en) * 2019-10-30 2022-08-30 Robert Bosch Gmbh Method and apparatus for an advanced convolution on encrypted data
US20210133677A1 (en) * 2019-10-31 2021-05-06 Walmart Apollo, Llc Apparatus and methods for determining delivery routes and times based on generated machine learning models
US20210133632A1 (en) * 2019-11-04 2021-05-06 Domino Data Lab, Inc. Systems and methods for model monitoring
US11275972B2 (en) * 2019-11-08 2022-03-15 International Business Machines Corporation Image classification masking
US11132512B2 (en) * 2019-11-08 2021-09-28 International Business Machines Corporation Multi-perspective, multi-task neural network model for matching text to program code
EP3822826A1 (en) * 2019-11-15 2021-05-19 Siemens Energy Global GmbH & Co. KG Database interaction and interpretation tool
US11423250B2 (en) 2019-11-19 2022-08-23 Intuit Inc. Hierarchical deep neural network forecasting of cashflows with linear algebraic constraints
US11657302B2 (en) 2019-11-19 2023-05-23 Intuit Inc. Model selection in a forecasting pipeline to optimize tradeoff between forecast accuracy and computational cost
CN111159501B (en) * 2019-11-22 2023-09-22 Hangzhou Danke Business Information Technology Co., Ltd. Method for establishing passenger judgment model based on multilayer neural network and passenger judgment method
US11158090B2 (en) * 2019-11-22 2021-10-26 Adobe Inc. Enhanced video shot matching using generative adversarial networks
US11763138B2 (en) * 2019-11-27 2023-09-19 Intuit Inc. Method and system for generating synthetic data using a regression model while preserving statistical properties of underlying data
US11551652B1 (en) * 2019-11-27 2023-01-10 Amazon Technologies, Inc. Hands-on artificial intelligence education service
US20230010686A1 (en) * 2019-12-05 2023-01-12 The Regents Of The University Of California Generating synthetic patient health data
CN111026664B (en) * 2019-12-09 2020-12-22 Zunyi Vocational and Technical College Program detection method and detection system based on ANN and application
US11216922B2 (en) * 2019-12-17 2022-01-04 Capital One Services, Llc Systems and methods for recognition of user-provided images
US11551159B2 (en) * 2019-12-23 2023-01-10 Google Llc Schema-guided response generation
CN111181671B (en) * 2019-12-27 2022-01-11 Southeast University Deep learning-based downlink channel rapid reconstruction method
CN113055017A (en) * 2019-12-28 2021-06-29 Huawei Technologies Co., Ltd. Data compression method and computing device
US11086829B2 (en) * 2020-01-02 2021-08-10 International Business Machines Corporation Comparing schema definitions using sampling
US11687778B2 (en) 2020-01-06 2023-06-27 The Research Foundation For The State University Of New York Fakecatcher: detection of synthetic portrait videos using biological signals
WO2021142069A1 (en) * 2020-01-07 2021-07-15 Alegion, Inc. System and method for guided synthesis of training data
US11941520B2 (en) * 2020-01-09 2024-03-26 International Business Machines Corporation Hyperparameter determination for a differentially private federated learning process
GB2590967A (en) * 2020-01-10 2021-07-14 Blue Prism Ltd Method of remote access
US11657292B1 (en) * 2020-01-15 2023-05-23 Architecture Technology Corporation Systems and methods for machine learning dataset generation
JP2021114085A (en) * 2020-01-17 2021-08-05 Fujitsu Limited Information processing device, information processing method, and information processing program
US11671506B2 (en) * 2020-01-27 2023-06-06 Dell Products L.P. Microservice management system for recommending modifications to optimize operation of microservice-based systems
US11645543B2 (en) * 2020-01-30 2023-05-09 Visa International Service Association System, method, and computer program product for implementing a generative adversarial network to determine activations
JP7298494B2 (en) * 2020-01-31 2023-06-27 Yokogawa Electric Corporation Learning device, learning method, learning program, determination device, determination method, and determination program
US11169786B2 (en) 2020-02-04 2021-11-09 X Development Llc Generating and using joint representations of source code
US11508253B1 (en) 2020-02-12 2022-11-22 Architecture Technology Corporation Systems and methods for networked virtual reality training
US11636202B2 (en) 2020-02-21 2023-04-25 Cylance Inc. Projected vector overflow penalty as mitigation for machine learning model string stuffing
US20210273962A1 (en) * 2020-02-28 2021-09-02 Electronic Caregiver, Inc. Intelligent platform for real-time precision care plan support during remote care management
EP3872584A1 (en) * 2020-02-28 2021-09-01 Deepc GmbH Technique for determining an indication of a medical condition
US11206316B2 (en) * 2020-03-04 2021-12-21 Hewlett Packard Enterprise Development Lp Multiple model injection for a deployment cluster
US11128724B1 (en) * 2020-03-09 2021-09-21 Adobe Inc. Real-time interactive event analytics
US11475364B2 (en) 2020-03-10 2022-10-18 Oracle International Corporation Systems and methods for analyzing a list of items using machine learning models
US11734849B2 (en) * 2020-03-10 2023-08-22 Siemens Healthcare Gmbh Estimating patient biographic data parameters
US10860466B1 (en) * 2020-03-18 2020-12-08 Capital One Services, Llc Systems, methods and media for testing computer-executable code involving sensitive-information domains
US11221846B2 (en) * 2020-03-19 2022-01-11 International Business Machines Corporation Automated transformation of applications to a target computing environment
CN111429006A (en) * 2020-03-24 2020-07-17 Beijing Mininglamp Software System Co., Ltd. Financial risk index prediction model construction method and device and risk situation prediction method and device
US11604871B2 (en) * 2020-03-27 2023-03-14 Cylance Inc. Projected vector modification as mitigation for machine learning model string stuffing
CN111489802B (en) * 2020-03-31 2023-07-25 Chongqing KingMed Center for Clinical Laboratory Co., Ltd. Report coding model generation method, system, equipment and storage medium
CN111459820B (en) * 2020-03-31 2021-01-05 Beijing Jiuzhang Yunji Technology Co., Ltd. Model application method and device and data analysis processing system
US11675921B2 (en) * 2020-04-02 2023-06-13 Hazy Limited Device and method for secure private data aggregation
US11429582B2 (en) 2020-04-09 2022-08-30 Capital One Services, Llc Techniques for creating and utilizing multidimensional embedding spaces
DE202020102105U1 (en) * 2020-04-16 2020-04-29 Robert Bosch GmbH Device for the automated generation of a knowledge graph
CN111553408B (en) * 2020-04-26 2020-12-25 Zhiquan Technology (Guangdong) Co., Ltd. Automatic test method for video recognition software
US11580456B2 (en) 2020-04-27 2023-02-14 Bank Of America Corporation System to correct model drift in machine learning application
US20210334694A1 (en) * 2020-04-27 2021-10-28 International Business Machines Corporation Perturbed records generation
US20210342339A1 (en) * 2020-04-30 2021-11-04 Forcepoint, LLC Method for Defining and Computing Analytic Features
CN113592059A (en) * 2020-04-30 2021-11-02 EMC IP Holding Company LLC Method, apparatus and computer program product for processing data
CN113642731A (en) * 2020-05-06 2021-11-12 Alipay (Hangzhou) Information Technology Co., Ltd. Training method and device of data generation system based on differential privacy
WO2021228641A1 (en) * 2020-05-12 2021-11-18 Interdigital Ce Patent Holdings Systems and methods for training and/or deploying a deep neural network
US11349957B2 (en) 2020-05-14 2022-05-31 Bank Of America Corporation Automatic knowledge management for data lineage tracking
US11797340B2 (en) 2020-05-14 2023-10-24 Hewlett Packard Enterprise Development Lp Systems and methods of resource configuration optimization for machine learning workloads
US11954129B2 (en) * 2020-05-19 2024-04-09 Hewlett Packard Enterprise Development Lp Updating data models to manage data drift and outliers
CN111444721B (en) * 2020-05-27 2022-09-23 Nanjing University Chinese text key information extraction method based on pre-training language model
EP4133681A4 (en) * 2020-05-28 2023-10-04 Sumo Logic, Inc. Clustering of structured log data by key schema
US20210374128A1 (en) * 2020-06-01 2021-12-02 Replica Analytics Optimizing generation of synthetic data
US11474596B1 (en) 2020-06-04 2022-10-18 Architecture Technology Corporation Systems and methods for multi-user virtual training
US11551429B2 (en) 2020-06-05 2023-01-10 Uatc, Llc Photorealistic image simulation with geometry-aware composition
CN111639143B (en) * 2020-06-05 2020-12-22 Guangzhou Xuanwu Wireless Technology Co., Ltd. Data lineage display method and device for a data warehouse, and electronic equipment
KR102226292B1 (en) * 2020-06-18 2021-03-10 Lemon Healthcare Co., Ltd. Cloud-based API management method to simultaneously link multiple hospital servers and consortium servers
US11651378B2 (en) * 2020-06-18 2023-05-16 Fidelity Information Services, Llc Systems and methods to manage transaction disputes using predictions based on anomalous data
KR102417131B1 (en) * 2020-06-19 2022-07-05 Korea Platform Service Technology Co., Ltd. A machine learning system based deep-learning used query
US11151480B1 (en) 2020-06-22 2021-10-19 Sas Institute Inc. Hyperparameter tuning system results viewer
US11562252B2 (en) * 2020-06-22 2023-01-24 Capital One Services, Llc Systems and methods for expanding data classification using synthetic data generation in machine learning models
US11853348B2 (en) * 2020-06-24 2023-12-26 Adobe Inc. Multidimensional digital content search
US11580425B2 (en) * 2020-06-30 2023-02-14 Microsoft Technology Licensing, Llc Managing defects in a model training pipeline using synthetic data sets associated with defect types
WO2022006344A1 (en) * 2020-06-30 2022-01-06 Samya.Ai Inc, Method for dynamically recommending forecast adjustments that collectively optimize objective factor using automated ml systems
US20220012220A1 (en) * 2020-07-07 2022-01-13 International Business Machines Corporation Data enlargement for big data analytics and system identification
US11288797B2 (en) 2020-07-08 2022-03-29 International Business Machines Corporation Similarity based per item model selection for medical imaging
US20220019933A1 (en) * 2020-07-15 2022-01-20 Landmark Graphics Corporation Estimation of global thermal conditions via cosimulation of machine learning outputs and observed data
US11443143B2 (en) * 2020-07-16 2022-09-13 International Business Machines Corporation Unattended object detection using machine learning
US20220024032A1 (en) * 2020-07-21 2022-01-27 UiPath, Inc. Artificial intelligence / machine learning model drift detection and correction for robotic process automation
CN111782550B (en) * 2020-07-31 2022-04-12 Alipay (Hangzhou) Information Technology Co., Ltd. Method and device for training index prediction model based on user privacy protection
US11468999B2 (en) * 2020-07-31 2022-10-11 Accenture Global Solutions Limited Systems and methods for implementing density variation (DENSVAR) clustering algorithms
US11531903B2 (en) * 2020-08-02 2022-12-20 Actimize Ltd Real drift detector on partial labeled data in data streams
KR102491753B1 (en) * 2020-08-03 2023-01-26 Korea Platform Service Technology Co., Ltd. Method and system for framework's deep learning a data using by query
KR102491755B1 (en) * 2020-08-03 2023-01-26 Korea Platform Service Technology Co., Ltd. Deep learning inference system based on query, and method thereof
US11763084B2 (en) 2020-08-10 2023-09-19 International Business Machines Corporation Automatic formulation of data science problem statements
US11909482B2 (en) * 2020-08-18 2024-02-20 Qualcomm Incorporated Federated learning for client-specific neural network parameter generation for wireless communication
CN111813921B (en) * 2020-08-20 2020-12-22 Zhejiang Xuehai Education Technology Co., Ltd. Topic recommendation method, electronic device and computer-readable storage medium
CN111767326B (en) * 2020-09-03 2020-11-27 Marketing Service Center of State Grid Zhejiang Electric Power Co., Ltd. Generation method and device of relational table data based on generative adversarial network
US20220076157A1 (en) 2020-09-04 2022-03-10 Aperio Global, LLC Data analysis system using artificial intelligence
US11809577B2 (en) * 2020-09-07 2023-11-07 The Toronto-Dominion Bank Application of trained artificial intelligence processes to encrypted data within a distributed computing environment
CN112069820A (en) * 2020-09-10 2020-12-11 Hangzhou Zhongao Technology Co., Ltd. Model training method, model training device and entity extraction method
US20220086175A1 (en) * 2020-09-16 2022-03-17 Ribbon Communications Operating Company, Inc. Methods, apparatus and systems for building and/or implementing detection systems using artificial intelligence
US20220092470A1 (en) * 2020-09-24 2022-03-24 Sap Se Runtime estimation for machine learning data processing pipeline
CN112347609A (en) * 2020-09-30 2021-02-09 广州明珞装备股份有限公司 Processing method and device of welding spot information based on process simulation and storage medium
CN111968744B (en) * 2020-10-22 2021-02-19 深圳大学 Bayesian optimization-based parameter optimization method for stroke and chronic disease model
US11182275B1 (en) 2020-10-23 2021-11-23 Capital One Services, Llc Systems and method for testing computing environments
CN112272177B (en) * 2020-10-23 2021-08-24 广州锦行网络科技有限公司 Method for deploying honey net trapping nodes in batches
US20220138437A1 (en) * 2020-10-30 2022-05-05 Intuit Inc. Methods and systems for integrating machine translations into software development workflows
US20220138556A1 (en) * 2020-11-04 2022-05-05 Nvidia Corporation Data log parsing system and method
WO2022097102A1 (en) * 2020-11-06 2022-05-12 Buyaladdin.com, Inc. Vertex interpolation in one-shot learning for object classification
US11520801B2 (en) 2020-11-10 2022-12-06 Bank Of America Corporation System and method for automatically obtaining data lineage in real time
US11392487B2 (en) * 2020-11-16 2022-07-19 International Business Machines Corporation Synthetic deidentified test data
US20220156634A1 (en) * 2020-11-19 2022-05-19 Paypal, Inc. Training Data Augmentation for Machine Learning
US11544281B2 (en) * 2020-11-20 2023-01-03 Adobe Inc. Query-oriented approximate query processing based on machine learning techniques
US11720962B2 (en) 2020-11-24 2023-08-08 Zestfinance, Inc. Systems and methods for generating gradient-boosted models with improved fairness
US11314584B1 (en) * 2020-11-25 2022-04-26 International Business Machines Corporation Data quality-based confidence computations for KPIs derived from time-series data
FR3116926A1 (en) * 2020-11-30 2022-06-03 Thales Method for evaluating the performance of a prediction algorithm and associated devices
CN112364828B (en) * 2020-11-30 2022-01-04 天津金城银行股份有限公司 Face recognition method and financial system
CN114584570A (en) * 2020-12-01 2022-06-03 富泰华工业(深圳)有限公司 Digital mirroring method, server and storage medium
US20220172064A1 (en) * 2020-12-02 2022-06-02 Htc Corporation Machine learning method and machine learning device for eliminating spurious correlation
US11720709B1 (en) * 2020-12-04 2023-08-08 Wells Fargo Bank, N.A. Systems and methods for ad hoc synthetic persona creation
US20220180244A1 (en) * 2020-12-08 2022-06-09 Vmware, Inc. Inter-Feature Influence in Unlabeled Datasets
TWI756974B (en) 2020-12-09 2022-03-01 財團法人工業技術研究院 Machine learning system and resource allocation method thereof
KR102484316B1 (en) * 2020-12-09 2023-01-02 청주대학교 산학협력단 Method and apparatus for configuring learning data set in object recognition
CN112486719B (en) * 2020-12-14 2023-07-04 上海万物新生环保科技集团有限公司 Method and equipment for RPC interface call failure processing
US11687620B2 (en) 2020-12-17 2023-06-27 International Business Machines Corporation Artificial intelligence generated synthetic image data for use with machine language models
GB202020155D0 (en) * 2020-12-18 2021-02-03 Palantir Technologies Inc Enforcing data security constraints in a data pipeline
US20240080337A1 (en) * 2020-12-22 2024-03-07 Telefonaktiebolager Lm Ericsson (Publ) Device, method, and system for supporting botnet traffic detection
US20220198278A1 (en) * 2020-12-23 2022-06-23 International Business Machines Corporation System for continuous update of advection-diffusion models with adversarial networks
US11516240B2 (en) * 2020-12-29 2022-11-29 Capital One Services, Llc Detection of anomalies associated with fraudulent access to a service platform
CN112674734B (en) * 2020-12-29 2021-12-07 电子科技大学 Pulse signal noise detection method based on supervision Seq2Seq model
US11354597B1 (en) * 2020-12-30 2022-06-07 Hyland Uk Operations Limited Techniques for intuitive machine learning development and optimization
KR102568011B1 (en) * 2020-12-30 2023-08-22 (주)한국플랫폼서비스기술 Face detection and privacy protection system using deep learning inference based on query and method thereof
US11847390B2 (en) 2021-01-05 2023-12-19 Capital One Services, Llc Generation of synthetic data using agent-based simulations
CN112685314A (en) * 2021-01-05 2021-04-20 广州知图科技有限公司 JavaScript engine security test method and test system
WO2022150343A1 (en) * 2021-01-05 2022-07-14 Capital One Services, Llc Generation and evaluation of secure synthetic data
US11588911B2 (en) * 2021-01-14 2023-02-21 International Business Machines Corporation Automatic context aware composing and synchronizing of video and audio transcript
CN112765319B (en) * 2021-01-20 2021-09-03 中国电子信息产业集团有限公司第六研究所 Text processing method and device, electronic equipment and storage medium
US20220229569A1 (en) * 2021-01-20 2022-07-21 Samsung Electronics Co., Ltd. Systems, methods, and apparatus for storage query planning
US11568320B2 (en) * 2021-01-21 2023-01-31 Snowflake Inc. Handling system-characteristics drift in machine learning applications
CN112463546B (en) * 2021-01-25 2021-04-27 北京天健源达科技股份有限公司 Processing method of abnormal log table
EP4033411A1 (en) 2021-01-26 2022-07-27 MOSTLY AI Solutions MP GmbH Synthesizing mobility traces
US20220253856A1 (en) * 2021-02-11 2022-08-11 The Toronto-Dominion Bank System and method for machine learning based detection of fraud
WO2022177448A1 (en) * 2021-02-18 2022-08-25 Xero Limited Systems and methods for training models
US20220272124A1 (en) * 2021-02-19 2022-08-25 Intuit Inc. Using machine learning for detecting solicitation of personally identifiable information (pii)
US20220277327A1 (en) * 2021-02-26 2022-09-01 Capital One Services, Llc Computer-based systems for data distribution allocation utilizing machine learning models and methods of use thereof
US11368521B1 (en) * 2021-03-12 2022-06-21 Red Hat, Inc. Utilizing reinforcement learning for serverless function tuning
US11843623B2 (en) * 2021-03-16 2023-12-12 Mitsubishi Electric Research Laboratories, Inc. Apparatus and method for anomaly detection
EP4060677A1 (en) 2021-03-18 2022-09-21 Craft.Ai Devices and processes for data sample selection for therapy-directed tasks
US20220300869A1 (en) * 2021-03-22 2022-09-22 Sap Se Intelligent airfare pattern prediction
US11550991B2 (en) * 2021-03-29 2023-01-10 Capital One Services, Llc Methods and systems for generating alternative content using adversarial networks implemented in an application programming interface layer
CN113109869A (en) * 2021-03-30 2021-07-13 成都理工大学 Automatic picking method for first arrival of shale ultrasonic test waveform
US11567739B2 (en) 2021-04-15 2023-01-31 Red Hat, Inc. Simplifying creation and publishing of schemas while building applications
CN112989631A (en) * 2021-04-19 2021-06-18 河南科技大学 Method and system for identifying equivalent component of finite state automaton
US11647052B2 (en) 2021-04-22 2023-05-09 Netskope, Inc. Synthetic request injection to retrieve expired metadata for cloud policy enforcement
US11178188B1 (en) * 2021-04-22 2021-11-16 Netskope, Inc. Synthetic request injection to generate metadata for cloud policy enforcement
US11303647B1 (en) 2021-04-22 2022-04-12 Netskope, Inc. Synthetic request injection to disambiguate bypassed login events for cloud policy enforcement
US11190550B1 (en) 2021-04-22 2021-11-30 Netskope, Inc. Synthetic request injection to improve object security posture for cloud security enforcement
US11336698B1 (en) 2021-04-22 2022-05-17 Netskope, Inc. Synthetic request injection for cloud policy enforcement
US11271973B1 (en) 2021-04-23 2022-03-08 Netskope, Inc. Synthetic request injection to retrieve object metadata for cloud policy enforcement
US11271972B1 (en) 2021-04-23 2022-03-08 Netskope, Inc. Data flow logic for synthetic request injection for cloud security enforcement
US11663219B1 (en) * 2021-04-23 2023-05-30 Splunk Inc. Determining a set of parameter values for a processing pipeline
CN113837215B (en) * 2021-04-27 2024-01-12 西北工业大学 Point cloud semantic and instance segmentation method based on conditional random field
CN113392412B (en) * 2021-05-11 2022-05-24 杭州趣链科技有限公司 Data receiving method, data sending method and electronic equipment
US11915066B2 (en) * 2021-05-12 2024-02-27 Sap Se System to facilitate transition to microservices
US11470490B1 (en) 2021-05-17 2022-10-11 T-Mobile Usa, Inc. Determining performance of a wireless telecommunication network
US11620162B2 (en) * 2021-05-24 2023-04-04 Capital One Services, Llc Resource allocation optimization for multi-dimensional machine learning environments
US11500844B1 (en) * 2021-05-28 2022-11-15 International Business Machines Corporation Synthetic data creation for dynamic program analysis
US20220405386A1 (en) * 2021-06-18 2022-12-22 EMC IP Holding Company LLC Privacy preserving ensemble learning as a service
US11720400B2 (en) 2021-06-22 2023-08-08 Accenture Global Solutions Limited Prescriptive analytics-based performance-centric dynamic serverless sizing
US11675817B1 (en) 2021-06-22 2023-06-13 Wells Fargo Bank, N.A. Synthetic data generation
US20230004991A1 (en) * 2021-06-30 2023-01-05 EMC IP Holding Company LLC Methods and systems for identifying breakpoints in variable impact on model results
US20230008868A1 (en) * 2021-07-08 2023-01-12 Nippon Telegraph And Telephone Corporation User authentication device, user authentication method, and user authentication computer program
CN113407458B (en) * 2021-07-09 2023-07-14 广州博冠信息科技有限公司 Interface testing method and device, electronic equipment and computer readable medium
CN113282961A (en) * 2021-07-22 2021-08-20 武汉中原电子信息有限公司 Data desensitization method and system based on power grid data acquisition
US11816186B2 (en) 2021-07-26 2023-11-14 Raytheon Company Architecture for dynamic ML model drift evaluation and visualization on a GUI
US11797574B2 (en) 2021-07-30 2023-10-24 Bank Of America Corporation Hierarchic distributed ledger for data lineage
US20230047057A1 (en) * 2021-08-02 2023-02-16 Samsung Electronics Co., Ltd. Automatically using configuration management analytics in cellular networks
US20230060957A1 (en) * 2021-08-25 2023-03-02 Red Hat, Inc. Creation of Message Serializer for Event Streaming Platform
US11886590B2 (en) * 2021-09-13 2024-01-30 Paypal, Inc. Emulator detection using user agent and device model learning
CN113515464B (en) * 2021-09-14 2021-11-19 广州锦行网络科技有限公司 Honeypot testing method and device based on linux system
US11734156B2 (en) * 2021-09-23 2023-08-22 Microsoft Technology Licensing, Llc Crash localization using crash frame sequence labelling
US11681445B2 (en) 2021-09-30 2023-06-20 Pure Storage, Inc. Storage-aware optimization for serverless functions
US11294937B1 (en) * 2021-10-04 2022-04-05 Tamr, Inc. Method and computer program product for producing a record clustering with estimated clustering accuracy metrics with confidence intervals
US20230107337A1 (en) * 2021-10-04 2023-04-06 Falkonry Inc. Managing machine operations using encoded multi-scale time series data
US11889019B2 (en) 2021-10-12 2024-01-30 T-Mobile Usa, Inc. Categorizing calls using early call information systems and methods
US20230137718A1 (en) * 2021-10-29 2023-05-04 Microsoft Technology Licensing, Llc Representation learning with side information
US20230137378A1 (en) * 2021-11-02 2023-05-04 Microsoft Technology Licensing, Llc Generating private synthetic training data for training machine-learning models
TWI797808B (en) * 2021-11-02 2023-04-01 財團法人資訊工業策進會 Machine learning system and method
KR20230065037A (en) * 2021-11-04 2023-05-11 (주)한국플랫폼서비스기술 Database server applicated deep learning framework for classifying gender and age, and method thereof
WO2023080276A1 (en) * 2021-11-04 2023-05-11 (주)한국플랫폼서비스기술 Query-based database linkage distributed deep learning system, and method therefor
CN113806338B (en) * 2021-11-18 2022-02-18 深圳索信达数据技术有限公司 Data discrimination method and system based on data sample imaging
CN114116870B (en) * 2021-11-25 2023-05-30 江苏商贸职业学院 Cross-service theme data exchange method and system
US20230186307A1 (en) * 2021-12-14 2023-06-15 International Business Machines Corporation Method for enhancing transaction security
CN113934453B (en) * 2021-12-15 2022-03-22 深圳竹云科技有限公司 Risk detection method, risk detection device and storage medium
US20230195734A1 (en) * 2021-12-21 2023-06-22 The Toronto-Dominion Bank Machine learning enabled real time query handling system and method
US11768753B2 (en) * 2021-12-29 2023-09-26 Cerner Innovation, Inc. System and method for evaluating and deploying data models having improved performance measures
US11797408B2 (en) * 2021-12-30 2023-10-24 Juniper Networks, Inc. Dynamic prediction of system resource requirement of network software in a live network using data driven models
US20230214368A1 (en) * 2022-01-03 2023-07-06 Capital One Services, Llc Systems and methods for using machine learning to manage data
CN114048856B (en) * 2022-01-11 2022-05-03 中孚信息股份有限公司 Knowledge reasoning-based automatic safety event handling method and system
US11468369B1 (en) 2022-01-28 2022-10-11 Databricks Inc. Automated processing of multiple prediction generation including model tuning
WO2023146549A1 (en) * 2022-01-28 2023-08-03 Databricks Inc. Automated processing of multiple prediction generation including model tuning
US11943260B2 (en) 2022-02-02 2024-03-26 Netskope, Inc. Synthetic request injection to retrieve metadata for cloud policy enforcement
US11954012B2 (en) * 2022-02-04 2024-04-09 Microsoft Technology Licensing, Llc Client-side telemetry data filter model
US20230252040A1 (en) * 2022-02-04 2023-08-10 Bank Of America Corporation Rule-Based Data Transformation Using Edge Computing Architecture
US20230306281A1 (en) * 2022-02-09 2023-09-28 Applied Materials, Inc. Machine learning model generation and updating for manufacturing equipment
CN114612967B (en) * 2022-03-03 2023-06-20 北京百度网讯科技有限公司 Face clustering method, device, equipment and storage medium
US11847431B2 (en) * 2022-03-03 2023-12-19 International Business Machines Corporation Automatic container specification file generation for a codebase
EP4243341A1 (en) * 2022-03-10 2023-09-13 Vocalink Limited Method and device for monitoring of network events
US11892989B2 (en) 2022-03-28 2024-02-06 Bank Of America Corporation System and method for predictive structuring of electronic data
US20230334169A1 (en) * 2022-04-15 2023-10-19 Collibra Belgium Bv Systems and methods for generating synthetic data
US11928128B2 (en) 2022-05-12 2024-03-12 Truist Bank Construction of a meta-database from autonomously scanned disparate and heterogeneous sources
US11822564B1 (en) * 2022-05-12 2023-11-21 Truist Bank Graphical user interface enabling interactive visualizations using a meta-database constructed from autonomously scanned disparate and heterogeneous sources
US11743552B1 (en) * 2022-06-03 2023-08-29 International Business Machines Corporation Computer technology for enhancing images with a generative adversarial network
US20240013223A1 (en) * 2022-07-10 2024-01-11 Actimize Ltd. Computerized-method for synthetic fraud generation based on tabular data of financial transactions
US11822438B1 (en) 2022-07-11 2023-11-21 Bank Of America Corporation Multi-computer system for application recovery following application programming interface failure
CN115080450B (en) * 2022-08-22 2022-11-11 深圳慧拓无限科技有限公司 Automatic driving test data generation method and system, electronic device and storage medium
US11922289B1 (en) * 2022-08-29 2024-03-05 Subsalt Inc. Machine learning-based systems and methods for on-demand generation of anonymized and privacy-enabled synthetic datasets
WO2024054576A1 (en) 2022-09-08 2024-03-14 Booz Allen Hamilton Inc. System and method synthetic data generation
CN115546652B (en) * 2022-11-29 2023-04-07 城云科技(中国)有限公司 Multi-temporal target detection model, and construction method, device and application thereof
US11704540B1 (en) * 2022-12-13 2023-07-18 Citigroup Technology, Inc. Systems and methods for responding to predicted events in time-series data using synthetic profiles created by artificial intelligence models trained on non-homogenous time series-data
US11868860B1 (en) * 2022-12-13 2024-01-09 Citibank, N.A. Systems and methods for cohort-based predictions in clustered time-series data in order to detect significant rate-of-change events
US11736580B1 (en) * 2023-01-31 2023-08-22 Intuit, Inc. Fixing microservices in distributed transactions
US11893220B1 (en) * 2023-06-14 2024-02-06 International Business Machines Corporation Generating and modifying graphical user interface elements
CN117150830B (en) * 2023-10-31 2024-01-02 格陆博科技有限公司 Oval correction method for SinCos position encoder

Family Cites Families (356)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6594688B2 (en) * 1993-10-01 2003-07-15 Collaboration Properties, Inc. Dedicated echo canceler for a workstation
JP3420621B2 (en) * 1993-11-04 2003-06-30 富士通株式会社 Distributed route selection controller for communication networks
US6332147B1 (en) * 1995-11-03 2001-12-18 Xerox Corporation Computer controlled display system using a graphical replay device to control playback of temporal data representing collaborative activities
US5911139A (en) 1996-03-29 1999-06-08 Virage, Inc. Visual image database search engine which allows for different schema
US5832212A (en) * 1996-04-19 1998-11-03 International Business Machines Corporation Censoring browser method and apparatus for internet viewing
US5867160A (en) * 1996-10-31 1999-02-02 International Business Machines Corporation System and method for task prioritization in computerized graphic interface environments
DE19703965C1 (en) 1997-02-03 1999-05-12 Siemens Ag Process for transforming a fuzzy logic used to simulate a technical process into a neural network
US5974549A (en) 1997-03-27 1999-10-26 Soliton Ltd. Security monitor
US7117188B2 (en) * 1998-05-01 2006-10-03 Health Discovery Corporation Methods of identifying patterns in biological systems and uses thereof
US6269351B1 (en) 1999-03-31 2001-07-31 Dryken Technologies, Inc. Method and system for training an artificial neural network
US6137912A (en) 1998-08-19 2000-10-24 Physical Optics Corporation Method of multichannel data compression
US6922699B2 (en) 1999-01-26 2005-07-26 Xerox Corporation System and method for quantitatively representing data objects in vector space
US6452615B1 (en) * 1999-03-24 2002-09-17 Fuji Xerox Co., Ltd. System and apparatus for notetaking with digital video and ink
EP1185964A1 (en) * 1999-05-05 2002-03-13 Accenture Properties (2) B.V. System, method and article of manufacture for creating collaborative simulations with multiple roles for a single student
US20030023686A1 (en) * 1999-05-05 2003-01-30 Beams Brian R. Virtual consultant
US7630986B1 (en) 1999-10-27 2009-12-08 Pinpoint, Incorporated Secure data interchange
US7124164B1 (en) * 2001-04-17 2006-10-17 Chemtob Helen J Method and apparatus for providing group interaction via communications networks
US7047279B1 (en) * 2000-05-05 2006-05-16 Accenture, Llp Creating collaborative application sharing
US6986046B1 (en) * 2000-05-12 2006-01-10 Groove Networks, Incorporated Method and apparatus for managing secure collaborative transactions
US7890405B1 (en) * 2000-06-09 2011-02-15 Collaborate Solutions Inc. Method and system for enabling collaboration between advisors and clients
US9038108B2 (en) * 2000-06-28 2015-05-19 Verizon Patent And Licensing Inc. Method and system for providing end user community functionality for publication and delivery of digital media content
US20020103793A1 (en) 2000-08-02 2002-08-01 Daphne Koller Method and apparatus for learning probabilistic relational models having attribute and link uncertainty and for performing selectivity estimation using probabilistic relational models
US6920458B1 (en) * 2000-09-22 2005-07-19 Sas Institute Inc. Model repository
US20030003861A1 (en) 2000-12-07 2003-01-02 Hideki Kagemoto Data broadcast-program production system, data broadcast-program method, data broadcast- program production computer-program, and computer-readable recorded medium
US7353252B1 (en) * 2001-05-16 2008-04-01 Sigma Design System for electronic file collaboration among multiple users using peer-to-peer network topology
US7370269B1 (en) * 2001-08-31 2008-05-06 Oracle International Corporation System and method for real-time annotation of a co-browsed document
US20090055477A1 (en) 2001-11-13 2009-02-26 Flesher Kevin E System for enabling collaboration and protecting sensitive data
JP2003256443A (en) * 2002-03-05 2003-09-12 Fuji Xerox Co Ltd Data classification device
US7107285B2 (en) * 2002-03-16 2006-09-12 Questerra Corporation Method, system, and program for an improved enterprise spatial system
US7010752B2 (en) * 2002-05-03 2006-03-07 Enactex, Inc. Method for graphical collaboration with unstructured data
US20040059695A1 (en) * 2002-09-20 2004-03-25 Weimin Xiao Neural network and method of training
US7386535B1 (en) * 2002-10-02 2008-06-10 Q.Know Technologies, Inc. Computer assisted and/or implemented method for group collarboration on projects incorporating electronic information
US7676542B2 (en) * 2002-12-02 2010-03-09 Sap Ag Establishing a collaboration environment
AU2003303499A1 (en) 2002-12-26 2004-07-29 The Trustees Of Columbia University In The City Of New York Ordered data compression system and methods
US20040172637A1 (en) * 2003-02-28 2004-09-02 Sap Ag Code morphing manager
US7730407B2 (en) * 2003-02-28 2010-06-01 Fuji Xerox Co., Ltd. Systems and methods for bookmarking live and recorded multimedia documents
WO2004084044A2 (en) * 2003-03-18 2004-09-30 Networks Dynamics, Inc. Network operating system and method
US20040201602A1 (en) * 2003-04-14 2004-10-14 Invensys Systems, Inc. Tablet computer system for industrial process design, supervisory control, and data management
US20040236830A1 (en) * 2003-05-15 2004-11-25 Steve Nelson Annotation management system
US7734690B2 (en) * 2003-09-05 2010-06-08 Microsoft Corporation Method and apparatus for providing attributes of a collaboration system in an operating system folder-based file system
US7590941B2 (en) * 2003-10-09 2009-09-15 Hewlett-Packard Development Company, L.P. Communication and collaboration system using rich media environments
US20080288889A1 (en) 2004-02-20 2008-11-20 Herbert Dennis Hunt Data visualization application
US7818679B2 (en) * 2004-04-20 2010-10-19 Microsoft Corporation Method, system, and apparatus for enabling near real time collaboration on an electronic document through a plurality of computer systems
US20060031622A1 (en) 2004-06-07 2006-02-09 Jardine Robert L Software transparent expansion of the number of fabrics coupling multiple processsing nodes of a computer system
US7814426B2 (en) * 2004-06-30 2010-10-12 Sap Aktiengesellschaft Reusable component in a collaboration workspace
US20060026502A1 (en) * 2004-07-28 2006-02-02 Koushik Dutta Document collaboration system
US7702730B2 (en) * 2004-09-03 2010-04-20 Open Text Corporation Systems and methods for collaboration
GB0428191D0 (en) 2004-12-23 2005-01-26 Cambridge Display Tech Ltd Digital signal processing methods and apparatus
US20060206370A1 (en) * 2004-11-16 2006-09-14 Netspace Technology Llc. Smart work-force tool
US20060117247A1 (en) * 2004-11-30 2006-06-01 Fite William R Web based data collaboration tool
US20060168550A1 (en) * 2005-01-21 2006-07-27 International Business Machines Corporation System, method and apparatus for creating and managing activities in a collaborative computing environment
MX2007009333A (en) * 2005-02-11 2007-10-10 Volt Inf Sciences Inc Project work change in plan/scope administrative and business information synergy system and method.
JP4591148B2 (en) * 2005-03-25 2010-12-01 富士ゼロックス株式会社 FUNCTION CONVERSION DEVICE, FUNCTION CONVERSION METHOD, FUNCTION CONVERSION PROGRAM, DEVICE DATA GENERATION DEVICE, DEVICE DATA GENERATION METHOD, AND DEVICE DATA GENERATION PROGRAM
AU2006236283A1 (en) * 2005-04-18 2006-10-26 The Trustees Of Columbia University In The City Of New York Systems and methods for detecting and inhibiting attacks using honeypots
US8239498B2 (en) * 2005-10-28 2012-08-07 Bank Of America Corporation System and method for facilitating the implementation of changes to the configuration of resources in an enterprise
GB0523703D0 (en) * 2005-11-22 2005-12-28 Ibm Collaborative editing of a document
US20070169017A1 (en) 2006-01-13 2007-07-19 Coward Daniel R Method and apparatus for translating an application programming interface (API) call
US8464164B2 (en) * 2006-01-24 2013-06-11 Simulat, Inc. System and method to create a collaborative web-based multimedia contextual dialogue
US20070191979A1 (en) * 2006-02-10 2007-08-16 International Business Machines Corporation Method, program and apparatus for supporting inter-disciplinary workflow with dynamic artifacts
US7752233B2 (en) * 2006-03-29 2010-07-06 Massachusetts Institute Of Technology Techniques for clustering a set of objects
US7774288B2 (en) 2006-05-16 2010-08-10 Sony Corporation Clustering and classification of multimedia data
US20080005269A1 (en) * 2006-06-29 2008-01-03 Knighton Mark S Method and apparatus to share high quality images in a teleconference
US8392418B2 (en) 2009-06-25 2013-03-05 University Of Tennessee Research Foundation Method and apparatus for predicting object properties and events using similarity-based information retrieval and model
US8281378B2 (en) * 2006-10-20 2012-10-02 Citrix Systems, Inc. Methods and systems for completing, by a single-sign on component, an authentication process in a federated environment to a resource not supporting federation
US20080120126A1 (en) * 2006-11-21 2008-05-22 George Bone Intelligent parallel processing system and method
US20080168339A1 (en) 2006-12-21 2008-07-10 Aquatic Informatics (139811) System and method for automatic environmental data validation
EP2111593A2 (en) 2007-01-26 2009-10-28 Information Resources, Inc. Analytic platform
US20080270363A1 (en) 2007-01-26 2008-10-30 Herbert Dennis Hunt Cluster processing of a core information matrix
US20090305790A1 (en) * 2007-01-30 2009-12-10 Vitie Inc. Methods and Apparatuses of Game Appliance Execution and Rendering Service
US8434129B2 (en) * 2007-08-02 2013-04-30 Fugen Solutions, Inc. Method and apparatus for multi-domain identity interoperability and compliance verification
US8595559B2 (en) 2007-08-23 2013-11-26 International Business Machines Corporation Method and apparatus for model-based testing of a graphical user interface
JP2009111691A (en) 2007-10-30 2009-05-21 Hitachi Ltd Image-encoding device and encoding method, and image-decoding device and decoding method
US20090158302A1 (en) * 2007-12-13 2009-06-18 Fiberlink Communications Corporation Api translation for network access control (nac) agent
US8781989B2 (en) * 2008-01-14 2014-07-15 Aptima, Inc. Method and system to predict a data value
US20120233205A1 (en) * 2008-03-07 2012-09-13 Inware, Llc System and method for document management
WO2009139650A1 (en) * 2008-05-12 2009-11-19 Business Intelligence Solutions Safe B.V. A data obfuscation system, method, and computer implementation of data obfuscation for secret databases
WO2009140723A1 (en) * 2008-05-19 2009-11-26 Smart Internet Technology Crc Pty Ltd Systems and methods for collaborative interaction
EP2291783A4 (en) * 2008-06-17 2011-08-10 Jostens Inc System and method for yearbook creation
US8375014B1 (en) 2008-06-19 2013-02-12 BioFortis, Inc. Database query builder
US20100036884A1 (en) 2008-08-08 2010-02-11 Brown Robert G Correlation engine for generating anonymous correlations between publication-restricted data and personal attribute data
KR100915832B1 (en) 2008-08-08 2009-09-07 주식회사 하이닉스반도체 Control circuit of read operation for semiconductor memory apparatus
US8112414B2 (en) * 2008-08-28 2012-02-07 International Business Machines Corporation Apparatus and system for reducing locking in materialized query tables
US9733959B2 (en) * 2008-09-15 2017-08-15 Vmware, Inc. Policy-based hypervisor configuration management
KR101789608B1 (en) 2008-10-23 2017-10-25 아브 이니티오 테크놀로지 엘엘시 A method, and a computer-readable record medium storing a computer program for performing a data operation
US20100180213A1 (en) * 2008-11-19 2010-07-15 Scigen Technologies, S.A. Document creation system and methods
US8464167B2 (en) * 2008-12-01 2013-06-11 Palo Alto Research Center Incorporated System and method for synchronized authoring and access of chat and graphics
US20100235750A1 (en) * 2009-03-12 2010-09-16 Bryce Douglas Noland System, method and program product for a graphical interface
US8683554B2 (en) 2009-03-27 2014-03-25 Wavemarket, Inc. System and method for managing third party application program access to user information via a native application program interface (API)
JP5222205B2 (en) 2009-04-03 2013-06-26 Kddi株式会社 Image processing apparatus, method, and program
US8615713B2 (en) * 2009-06-26 2013-12-24 Xerox Corporation Managing document interactions in collaborative document environments of virtual worlds
JP5621773B2 (en) * 2009-07-06 2014-11-12 日本電気株式会社 Classification hierarchy re-creation system, classification hierarchy re-creation method, and classification hierarchy re-creation program
US8554801B2 (en) * 2009-07-10 2013-10-08 Robert Mack Method and apparatus for converting heterogeneous databases into standardized homogeneous databases
US8806331B2 (en) * 2009-07-20 2014-08-12 Interactive Memories, Inc. System and methods for creating and editing photo-based projects on a digital network
CA2684438C (en) 2009-09-22 2016-07-19 Ibm Canada Limited - Ibm Canada Limitee User customizable queries to populate model diagrams
WO2011042889A1 (en) 2009-10-09 2011-04-14 Mizrahi, Moshe A method, computer product program and system for analysis of data
US7996723B2 (en) * 2009-12-22 2011-08-09 Xerox Corporation Continuous, automated discovery of bugs in released software
US8589947B2 (en) * 2010-05-11 2013-11-19 The Trustees Of Columbia University In The City Of New York Methods, systems, and media for application fault containment
US8438122B1 (en) 2010-05-14 2013-05-07 Google Inc. Predictive analytic modeling platform
US8820088B2 (en) 2010-07-27 2014-09-02 United Technologies Corporation Variable area fan nozzle with acoustic system for a gas turbine engine
EP2434411A1 (en) * 2010-09-27 2012-03-28 Qlucore AB Computer-implemented method for analyzing multivariate data
US9031957B2 (en) * 2010-10-08 2015-05-12 Salesforce.Com, Inc. Structured data in a business networking feed
EP2628143A4 (en) * 2010-10-11 2015-04-22 Teachscape Inc Methods and systems for capturing, processing, managing and/or evaluating multimedia content of observed persons performing a task
US9436502B2 (en) * 2010-12-10 2016-09-06 Microsoft Technology Licensing, Llc Eventually consistent storage and transactions in cloud based environment
US8832836B2 (en) 2010-12-30 2014-09-09 Verisign, Inc. Systems and methods for malware detection and scanning
JP5608575B2 (en) * 2011-01-19 2014-10-15 株式会社日立ハイテクノロジーズ Image classification method and image classification apparatus
EP2684117A4 (en) 2011-03-10 2015-01-07 Textwise Llc Method and system for unified information representation and applications thereof
US20120278353A1 (en) 2011-04-28 2012-11-01 International Business Machines Searching with topic maps of a model for canonical model based integration
US20120283885A1 (en) 2011-05-04 2012-11-08 General Electric Company Automated system and method for implementing statistical comparison of power plant operations
US8533224B2 (en) 2011-05-04 2013-09-10 Google Inc. Assessing accuracy of trained predictive models
US20170109676A1 (en) 2011-05-08 2017-04-20 Panaya Ltd. Generation of Candidate Sequences Using Links Between Nonconsecutively Performed Steps of a Business Process
US9015601B2 (en) * 2011-06-21 2015-04-21 Box, Inc. Batch uploading of content to a web-based collaboration environment
WO2013009337A2 (en) * 2011-07-08 2013-01-17 Arnold Goldberg Desktop application for access and interaction with workspaces in a cloud-based content management system and synchronization mechanisms thereof
US20130015931A1 (en) 2011-07-11 2013-01-17 GM Global Technology Operations LLC Tunable stiffness actuator
US9197718B2 (en) * 2011-09-23 2015-11-24 Box, Inc. Central management and control of user-contributed content in a web-based collaboration environment and management console thereof
US11074495B2 (en) 2013-02-28 2021-07-27 Z Advanced Computing, Inc. (Zac) System and method for extremely efficient image and pattern recognition and artificial intelligence platform
US9916538B2 (en) 2012-09-15 2018-03-13 Z Advanced Computing, Inc. Method and system for feature detection
US20130097103A1 (en) * 2011-10-14 2013-04-18 International Business Machines Corporation Techniques for Generating Balanced and Class-Independent Training Data From Unlabeled Data Set
US8914859B2 (en) 2011-11-07 2014-12-16 International Business Machines Corporation Managing the progressive legible obfuscation and de-obfuscation of public and quasi-public broadcast messages
US9171146B2 (en) 2011-12-14 2015-10-27 Intel Corporation Method and system for monitoring calls to an application program interface (API) function
US8897542B2 (en) * 2011-12-15 2014-11-25 Sony Corporation Depth map generation based on soft classification
US9367814B1 (en) * 2011-12-27 2016-06-14 Google Inc. Methods and systems for classifying data using a hierarchical taxonomy
WO2013101723A1 (en) 2011-12-27 2013-07-04 Wellpoint, Inc. Method and system for data pattern matching, masking and removal of sensitive data
US9542536B2 (en) * 2012-01-13 2017-01-10 Microsoft Technology Licensing, Llc Sustained data protection
US11164394B2 (en) * 2012-02-24 2021-11-02 Matterport, Inc. Employing three-dimensional (3D) data predicted from two-dimensional (2D) images using neural networks for 3D modeling applications and other applications
WO2013142273A1 (en) * 2012-03-19 2013-09-26 Citrix Systems, Inc. Systems and methods for providing user interfaces for management applications
US8782744B1 (en) 2012-06-15 2014-07-15 Amazon Technologies, Inc. Managing API authorization
US9473532B2 (en) * 2012-07-19 2016-10-18 Box, Inc. Data loss prevention (DLP) methods by a cloud service including third party integration architectures
US9311283B2 (en) 2012-08-16 2016-04-12 Realnetworks, Inc. System for clipping webpages by traversing a dom, and highlighting a minimum number of words
US9165328B2 (en) * 2012-08-17 2015-10-20 International Business Machines Corporation System, method and computer program product for classification of social streams
EP2701020A1 (en) 2012-08-22 2014-02-26 Siemens Aktiengesellschaft Monitoring of the initial equipment of a technical system for fabricating a product
US9461876B2 (en) * 2012-08-29 2016-10-04 Loci System and method for fuzzy concept mapping, voting ontology crowd sourcing, and technology prediction
US9553758B2 (en) * 2012-09-18 2017-01-24 Box, Inc. Sandboxing individual applications to specific user folders in a cloud-based service
US20180173730A1 (en) 2012-09-28 2018-06-21 Clinigence, LLC Generating a Database with Mapped Data
CN104662547A (en) 2012-10-19 2015-05-27 McAfee, Inc. Mobile application management
JP6029683B2 (en) * 2012-11-20 2016-11-24 Hitachi, Ltd. Data analysis device and data analysis program
JP5971115B2 (en) * 2012-12-26 2016-08-17 Fujitsu Limited Information processing program, information processing method and apparatus
WO2014110167A2 (en) 2013-01-08 2014-07-17 Purepredictive, Inc. Integrated machine learning for a data management product
US9274935B1 (en) 2013-01-15 2016-03-01 Google Inc. Application testing system with application programming interface
US9311359B2 (en) * 2013-01-30 2016-04-12 International Business Machines Corporation Join operation partitioning
KR102270699B1 (en) 2013-03-11 2021-06-28 Magic Leap, Inc. System and method for augmented and virtual reality
US9104867B1 (en) 2013-03-13 2015-08-11 Fireeye, Inc. Malicious content analysis using simulated user interaction without user involvement
US10514977B2 (en) * 2013-03-15 2019-12-24 Richard B. Jones System and method for the dynamic analysis of event data
US20140278339A1 (en) 2013-03-15 2014-09-18 Konstantinos (Constantin) F. Aliferis Computer System and Method That Determines Sample Size and Power Required For Complex Predictive and Causal Data Analysis
US9191411B2 (en) 2013-03-15 2015-11-17 Zerofox, Inc. Protecting against suspect social entities
US20140310208A1 (en) * 2013-04-10 2014-10-16 Machine Perception Technologies Inc. Facilitating Operation of a Machine Learning Environment
US20140324760A1 (en) 2013-04-30 2014-10-30 Hewlett-Packard Development Company, L.P. Synthetic time series data generation
US9658899B2 (en) 2013-06-10 2017-05-23 Amazon Technologies, Inc. Distributed lock management in a cloud computing environment
US10102581B2 (en) * 2013-06-17 2018-10-16 Intercontinental Exchange Holdings, Inc. Multi-asset portfolio simulation (MAPS)
US9646262B2 (en) * 2013-06-17 2017-05-09 Purepredictive, Inc. Data intelligence using machine learning
US9716842B1 (en) 2013-06-19 2017-07-25 Amazon Technologies, Inc. Augmented reality presentation
US20150012255A1 (en) * 2013-07-03 2015-01-08 International Business Machines Corporation Clustering based continuous performance prediction and monitoring for semiconductor manufacturing processes using nonparametric bayesian models
US9600503B2 (en) * 2013-07-25 2017-03-21 Facebook, Inc. Systems and methods for pruning data by sampling
US10120838B2 (en) 2013-07-25 2018-11-06 Facebook, Inc. Systems and methods for weighted sampling
US9740662B2 (en) 2013-08-26 2017-08-22 Wright State University Fractional scaling digital filters and the generation of standardized noise and synthetic data series
CN112989840A (en) 2013-08-30 2021-06-18 Intel Corporation Extensible context-aware natural language interaction for virtual personal assistants
US9465857B1 (en) * 2013-09-26 2016-10-11 Groupon, Inc. Dynamic clustering for streaming data
US20150100537A1 (en) 2013-10-03 2015-04-09 Microsoft Corporation Emoji for Text Predictions
US20150134413A1 (en) * 2013-10-31 2015-05-14 International Business Machines Corporation Forecasting for retail customers
CN103559504B (en) * 2013-11-04 2016-08-31 Beijing Jingdong Shangke Information Technology Co., Ltd. Image target category identification method and device
US20150128103A1 (en) * 2013-11-07 2015-05-07 Runscope, Inc. System and method for automating application programming interface integration
US9830376B2 (en) * 2013-11-20 2017-11-28 International Business Machines Corporation Language tag management on international data storage
JP6149710B2 (en) * 2013-11-27 2017-06-21 Fuji Xerox Co., Ltd. Image processing apparatus and program
US9531609B2 (en) * 2014-03-23 2016-12-27 Ca, Inc. Virtual service automation
US9613190B2 (en) * 2014-04-23 2017-04-04 Intralinks, Inc. Systems and methods of secure data exchange
US20150309987A1 (en) 2014-04-29 2015-10-29 Google Inc. Classification of Offensive Words
US9521122B2 (en) * 2014-05-09 2016-12-13 International Business Machines Corporation Intelligent security analysis and enforcement for data transfer
US11069009B2 (en) 2014-05-16 2021-07-20 Accenture Global Services Limited System, method and apparatuses for identifying load volatility of a power customer and a tangible computer readable medium
US10496927B2 (en) * 2014-05-23 2019-12-03 DataRobot, Inc. Systems for time-series predictive data analytics, and related methods and apparatus
US10528872B2 (en) * 2014-05-30 2020-01-07 Apple Inc. Methods and system for managing predictive models
US9661013B2 (en) * 2014-05-30 2017-05-23 Ca, Inc. Manipulating API requests to indicate source computer application trustworthiness
US10445341B2 (en) * 2014-06-09 2019-10-15 The Mathworks, Inc. Methods and systems for analyzing datasets
US10318882B2 (en) * 2014-09-11 2019-06-11 Amazon Technologies, Inc. Optimized training of linear machine learning models
CN106605220A (en) * 2014-07-02 2017-04-26 道库门特公司Ip信托单位 Method and system for selective document redaction
US9785719B2 (en) * 2014-07-15 2017-10-10 Adobe Systems Incorporated Generating synthetic data
US11122058B2 (en) * 2014-07-23 2021-09-14 Seclytics, Inc. System and method for the automated detection and prediction of online threats
US9549188B2 (en) * 2014-07-30 2017-01-17 Intel Corporation Golden frame selection in video coding
US9729506B2 (en) 2014-08-22 2017-08-08 Shape Security, Inc. Application programming interface wall
US10452793B2 (en) * 2014-08-26 2019-10-22 International Business Machines Corporation Multi-dimension variable predictive modeling for analysis acceleration
BR112017003893A8 (en) * 2014-09-12 2017-12-26 Microsoft Corp DNN student apprentice network via output distribution
US9954893B1 (en) 2014-09-23 2018-04-24 Shape Security, Inc. Techniques for combating man-in-the-browser attacks
EP3197384A4 (en) 2014-09-23 2018-05-16 Surgical Safety Technologies Inc. Operating room black-box device, system, method and computer readable medium
US10296192B2 (en) 2014-09-26 2019-05-21 Oracle International Corporation Dynamic visual profiling and visualization of high volume datasets and real-time smart sampling and statistical profiling of extremely large datasets
US10210246B2 (en) 2014-09-26 2019-02-19 Oracle International Corporation Techniques for similarity analysis and data enrichment using knowledge sources
US9672248B2 (en) * 2014-10-08 2017-06-06 International Business Machines Corporation Embracing and exploiting data skew during a join or groupby
WO2016061283A1 (en) 2014-10-14 2016-04-21 Skytree, Inc. Configurable machine learning method selection and parameter optimization system and method
US20160110810A1 (en) * 2014-10-16 2016-04-21 Fmr Llc Research Systems and Methods for Integrating Query Data and Individual User Profile
US9560075B2 (en) 2014-10-22 2017-01-31 International Business Machines Corporation Cognitive honeypot
US9886247B2 (en) 2014-10-30 2018-02-06 International Business Machines Corporation Using an application programming interface (API) data structure in recommending an API composite
US20160132787A1 (en) 2014-11-11 2016-05-12 Massachusetts Institute Of Technology Distributed, multi-model, self-learning platform for machine learning
US9460288B2 (en) 2014-12-08 2016-10-04 Shape Security, Inc. Secure app update server and secure application programming interface (“API”) server
US9607217B2 (en) * 2014-12-22 2017-03-28 Yahoo! Inc. Generating preference indices for image content
US9678999B1 (en) 2014-12-31 2017-06-13 Teradata Us, Inc. Histogram generation on multiple dimensions
US9544301B2 (en) * 2015-01-28 2017-01-10 International Business Machines Corporation Providing data security with a token device
US10643138B2 (en) 2015-01-30 2020-05-05 Micro Focus Llc Performance testing based on variable length segmentation and clustering of time series data
US9608809B1 (en) 2015-02-05 2017-03-28 Ionic Security Inc. Systems and methods for encryption and provision of information security using platform services
US20160232575A1 (en) * 2015-02-06 2016-08-11 Facebook, Inc. Determining a number of cluster groups associated with content identifying users eligible to receive the content
US20160259841A1 (en) * 2015-03-04 2016-09-08 Tegu LLC Research Analysis System
EP3268870A4 (en) * 2015-03-11 2018-12-05 Ayasdi, Inc. Systems and methods for predicting outcomes using a prediction learning model
GB201604672D0 (en) * 2016-03-18 2016-05-04 Magic Pony Technology Ltd Generative methods of super resolution
US9853996B2 (en) 2015-04-13 2017-12-26 Secful, Inc. System and method for identifying and preventing malicious API attacks
US9462013B1 (en) 2015-04-29 2016-10-04 International Business Machines Corporation Managing security breaches in a networked computing environment
JP6511951B2 (en) * 2015-05-14 2019-05-15 Fuji Xerox Co., Ltd. Information processing apparatus and program
US10163061B2 (en) 2015-06-18 2018-12-25 International Business Machines Corporation Quality-directed adaptive analytic retraining
US10452522B1 (en) * 2015-06-19 2019-10-22 Amazon Technologies, Inc. Synthetic data generation from a service description language model
US9946705B2 (en) * 2015-06-29 2018-04-17 International Business Machines Corporation Query processing using a dimension table implemented as decompression dictionaries
US10764574B2 (en) 2015-07-01 2020-09-01 Panasonic Intellectual Property Management Co., Ltd. Encoding method, decoding method, encoding apparatus, decoding apparatus, and encoding and decoding apparatus
US11567962B2 (en) 2015-07-11 2023-01-31 Taascom Inc. Computer network controlled data orchestration system and method for data aggregation, normalization, for presentation, analysis and action/decision making
US10750161B2 (en) 2015-07-15 2020-08-18 Fyusion, Inc. Multi-view interactive digital media representation lock screen
US10628521B2 (en) * 2015-08-03 2020-04-21 International Business Machines Corporation Scoring automatically generated language patterns for questions using synthetic events
JP2017041022A (en) * 2015-08-18 2017-02-23 Canon Inc. Information processing apparatus, information processing method, and program
US10536449B2 (en) * 2015-09-15 2020-01-14 Mimecast Services Ltd. User login credential warning system
US9953176B2 (en) * 2015-10-02 2018-04-24 Dtex Systems Inc. Method and system for anonymizing activity records
US20180253894A1 (en) 2015-11-04 2018-09-06 Intel Corporation Hybrid foreground-background technique for 3d model reconstruction of dynamic scenes
US20170134405A1 (en) 2015-11-09 2017-05-11 Qualcomm Incorporated Dynamic Honeypot System
WO2017100356A1 (en) * 2015-12-07 2017-06-15 Data4Cure, Inc. A method and system for ontology-based dynamic learning and knowledge integration from measurement data and text
US9934397B2 (en) 2015-12-15 2018-04-03 International Business Machines Corporation Controlling privacy in a face recognition application
US10200401B1 (en) * 2015-12-17 2019-02-05 Architecture Technology Corporation Evaluating results of multiple virtual machines that use application randomization mechanism
US10097581B1 (en) * 2015-12-28 2018-10-09 Amazon Technologies, Inc. Honeypot computing services that include simulated computing resources
US10740335B1 (en) * 2016-01-15 2020-08-11 Accenture Global Solutions Limited Biometric data combination engine
US20170214701A1 (en) * 2016-01-24 2017-07-27 Syed Kamran Hasan Computer security based on artificial intelligence
US20180357543A1 (en) 2016-01-27 2018-12-13 Bonsai AI, Inc. Artificial intelligence system configured to measure performance of artificial intelligence over time
US10031745B2 (en) 2016-02-02 2018-07-24 International Business Machines Corporation System and method for automatic API candidate generation
US11769193B2 (en) 2016-02-11 2023-09-26 Ebay Inc. System and method for detecting visually similar items
US10515424B2 (en) 2016-02-12 2019-12-24 Microsoft Technology Licensing, Llc Machine learned query generation on inverted indices
WO2017145960A1 (en) * 2016-02-24 2017-08-31 日本電気株式会社 Learning device, learning method, and recording medium
US11113852B2 (en) * 2016-02-29 2021-09-07 Oracle International Corporation Systems and methods for trending patterns within time-series data
CN107292326A (en) 2016-03-31 2017-10-24 Alibaba Group Holding Limited Model training method and apparatus
US10824629B2 (en) * 2016-04-01 2020-11-03 Wavefront, Inc. Query implementation using synthetic time series
CA3023488C (en) 2016-04-14 2022-06-28 The Research Foundation For The State University Of New York System and method for generating a progressive representation associated with surjectively mapped virtual and physical reality image data
US10515101B2 (en) * 2016-04-19 2019-12-24 Strava, Inc. Determining clusters of similar activities
CN109716345B (en) * 2016-04-29 2023-09-15 普威达有限公司 Computer-implemented privacy engineering system and method
US10532268B2 (en) * 2016-05-02 2020-01-14 Bao Tran Smart device
US11295336B2 (en) * 2016-05-04 2022-04-05 Quantifind, Inc. Synthetic control generation and campaign impact assessment apparatuses, methods and systems
US10462181B2 (en) 2016-05-10 2019-10-29 Quadrant Information Security Method, system, and apparatus to identify and study advanced threat tactics, techniques and procedures
US10778647B2 (en) * 2016-06-17 2020-09-15 Cisco Technology, Inc. Data anonymization for distributed hierarchical networks
US10572641B1 (en) * 2016-06-21 2020-02-25 Wells Fargo Bank, N.A. Dynamic enrollment using biometric tokenization
US10733534B2 (en) * 2016-07-15 2020-08-04 Microsoft Technology Licensing, Llc Data evaluation as a service
TW201812646A (en) * 2016-07-18 2018-04-01 NantOmics, LLC Distributed machine learning system, method of distributed machine learning, and method of generating proxy data
US20190228268A1 (en) * 2016-09-14 2019-07-25 Konica Minolta Laboratory U.S.A., Inc. Method and system for cell image segmentation using multi-stage convolutional neural networks
US11080616B2 (en) * 2016-09-27 2021-08-03 Clarifai, Inc. Artificial intelligence model and data collection/development platform
CA2981555C (en) * 2016-10-07 2021-11-16 1Qb Information Technologies Inc. System and method for displaying data representative of a large dataset
US20180100894A1 (en) * 2016-10-07 2018-04-12 United States Of America As Represented By The Secretary Of The Navy Automatic Generation of Test Sequences
US10043095B2 (en) * 2016-10-10 2018-08-07 Gyrfalcon Technology, Inc. Data structure for CNN based digital integrated circuit for extracting features out of an input image
JP2018067115A (en) 2016-10-19 2018-04-26 Seiko Epson Corporation Program, tracking method and tracking device
US10609284B2 (en) 2016-10-22 2020-03-31 Microsoft Technology Licensing, Llc Controlling generation of hyperlapse from wide-angled, panoramic videos
US10681012B2 (en) * 2016-10-26 2020-06-09 Ping Identity Corporation Methods and systems for deep learning based API traffic security
JP6911866B2 (en) * 2016-10-26 2021-07-28 Sony Group Corporation Information processing device and information processing method
US10769721B2 (en) * 2016-10-31 2020-09-08 Accenture Global Solutions Limited Intelligent product requirement configurator
US10346223B1 (en) * 2016-11-23 2019-07-09 Google Llc Selective obfuscation of notifications
US10621210B2 (en) 2016-11-27 2020-04-14 Amazon Technologies, Inc. Recognizing unknown data objects
US20180150609A1 (en) * 2016-11-29 2018-05-31 Electronics And Telecommunications Research Institute Server and method for predicting future health trends through similar case cluster based prediction models
US9754190B1 (en) 2016-11-29 2017-09-05 Seematics Systems Ltd System and method for image classification based on Tsallis entropy
US20180158052A1 (en) * 2016-12-01 2018-06-07 The Toronto-Dominion Bank Asynchronous cryptogram-based authentication processes
US11068949B2 (en) 2016-12-09 2021-07-20 365 Retail Markets, Llc Distributed and automated transaction systems
US10713384B2 (en) 2016-12-09 2020-07-14 Massachusetts Institute Of Technology Methods and apparatus for transforming and statistically modeling relational databases to synthesize privacy-protected anonymized data
CN106650806B (en) * 2016-12-16 2019-07-26 Peking University Shenzhen Graduate School Collaborative deep network model method for pedestrian detection
US10430661B2 (en) 2016-12-20 2019-10-01 Adobe Inc. Generating a compact video feature representation in a digital medium environment
US10163003B2 (en) * 2016-12-28 2018-12-25 Adobe Systems Incorporated Recognizing combinations of body shape, pose, and clothing in three-dimensional input images
US10970605B2 (en) * 2017-01-03 2021-04-06 Samsung Electronics Co., Ltd. Electronic apparatus and method of operating the same
US11496286B2 (en) 2017-01-08 2022-11-08 Apple Inc. Differential privacy with cloud data
US10212428B2 (en) 2017-01-11 2019-02-19 Microsoft Technology Licensing, Llc Reprojecting holographic video to enhance streaming bandwidth/quality
US10448054B2 (en) 2017-01-11 2019-10-15 Groq, Inc. Multi-pass compression of uncompressed data
US10192016B2 (en) * 2017-01-17 2019-01-29 Xilinx, Inc. Neural network based physical synthesis for circuit designs
US10235622B2 (en) * 2017-01-24 2019-03-19 Sas Institute Inc. Pattern identifier system
US10360517B2 (en) 2017-02-22 2019-07-23 Sas Institute Inc. Distributed hyperparameter tuning system for machine learning
US10554607B2 (en) 2017-02-24 2020-02-04 Telefonaktiebolaget Lm Ericsson (Publ) Heterogeneous cloud controller
US20180247078A1 (en) * 2017-02-28 2018-08-30 Gould & Ratner LLP System for anonymization and filtering of data
WO2018158710A1 (en) * 2017-02-28 2018-09-07 Telefonaktiebolaget Lm Ericsson (Publ) Partition-based prefix preserving anonymization approach for network traces containing ip addresses
US10382466B2 (en) 2017-03-03 2019-08-13 Hitachi, Ltd. Cooperative cloud-edge vehicle anomaly detection
JP6931290B2 (en) * 2017-03-03 2021-09-01 Canon Inc. Image generation apparatus and control method thereof
US10733482B1 (en) * 2017-03-08 2020-08-04 Zoox, Inc. Object height estimation from monocular images
US10891545B2 (en) 2017-03-10 2021-01-12 International Business Machines Corporation Multi-dimensional time series event prediction via convolutional neural network(s)
US20180260474A1 (en) 2017-03-13 2018-09-13 Arizona Board Of Regents On Behalf Of The University Of Arizona Methods for extracting and assessing information from literature documents
US10692000B2 (en) * 2017-03-20 2020-06-23 Sap Se Training machine learning models
US11574164B2 (en) 2017-03-20 2023-02-07 International Business Machines Corporation Neural network cooperation
EP4159871A1 (en) * 2017-03-24 2023-04-05 Becton, Dickinson and Company Synthetic multiplets for multiplets determination
WO2018175098A1 (en) * 2017-03-24 2018-09-27 D5Ai Llc Learning coach for machine learning system
US11734584B2 (en) 2017-04-19 2023-08-22 International Business Machines Corporation Multi-modal construction of deep learning networks
KR102309031B1 (en) * 2017-04-27 2021-10-06 Samsung Electronics Co., Ltd. Apparatus and method for managing an intelligent agent service
US10541981B1 (en) * 2017-05-17 2020-01-21 Amazon Technologies, Inc. Secure message service for managing third-party dissemination of sensitive information
US20180336463A1 (en) * 2017-05-18 2018-11-22 General Electric Company Systems and methods for domain-specific obscured data transport
RU2652461C1 (en) * 2017-05-30 2018-04-26 ABBYY Development LLC Differential classification with multiple neural networks
IL252657A0 (en) * 2017-06-04 2017-08-31 De Identification Ltd System and method for image de-identification
CN107633317B (en) * 2017-06-15 2021-09-21 Beijing Baidu Netcom Science and Technology Co., Ltd. Method and apparatus for building a journey planning model and planning journeys
US10348658B2 (en) 2017-06-15 2019-07-09 Google Llc Suggested items for use with embedded applications in chat conversations
KR101970008B1 (en) * 2017-06-23 2019-04-18 Dinobiz Co., Ltd. Computer program stored in a computer-readable medium and user device with a translation algorithm using a deep learning neural network circuit
WO2019005555A1 (en) * 2017-06-30 2019-01-03 Diluvian LLC Methods and systems for protecting user-generated data in computer network traffic
US10565434B2 (en) * 2017-06-30 2020-02-18 Google Llc Compact language-free facial expression embedding and novel triplet training scheme
CN109214238B (en) 2017-06-30 2022-06-28 Apollo Intelligent Technology (Beijing) Co., Ltd. Multi-target tracking method, apparatus, device and storage medium
US10182303B1 (en) * 2017-07-12 2019-01-15 Google Llc Ambisonics sound field navigation using directional decomposition and path distance estimation
US10929945B2 (en) * 2017-07-28 2021-02-23 Google Llc Image capture devices featuring intelligent use of lightweight hardware-generated statistics
US10599460B2 (en) 2017-08-07 2020-03-24 Modelop, Inc. Analytic model execution engine with instrumentation for granular performance analysis for metrics and diagnostics for troubleshooting
US20190050600A1 (en) * 2017-08-11 2019-02-14 Ca, Inc. Masking display of sensitive information
US10929987B2 (en) 2017-08-16 2021-02-23 Nvidia Corporation Learning rigidity of dynamic scenes for three-dimensional scene flow estimation
US20190228037A1 (en) * 2017-08-19 2019-07-25 Wave Computing, Inc. Checkpointing data flow graph computation for machine learning
CA3069299C (en) * 2017-08-21 2023-03-14 Landmark Graphics Corporation Neural network models for real-time optimization of drilling parameters during drilling operations
GB2566257A (en) * 2017-08-29 2019-03-13 Sky Cp Ltd System and method for content discovery
US10225289B1 (en) * 2017-09-01 2019-03-05 Arthur Oliver Tucker, IV Anonymization overlay network for de-identification of event proximity data
EP3451190B1 (en) * 2017-09-04 2020-02-26 Sap Se Model-based analysis in a relational database
US10949614B2 (en) * 2017-09-13 2021-03-16 International Business Machines Corporation Dynamically changing words based on a distance between a first area and a second area
US20190080063A1 (en) * 2017-09-13 2019-03-14 Facebook, Inc. De-identification architecture
US10380236B1 (en) * 2017-09-22 2019-08-13 Amazon Technologies, Inc. Machine learning system for annotating unstructured text
EP3462412B1 (en) * 2017-09-28 2019-10-30 Siemens Healthcare GmbH Determining a two-dimensional mammography data set
US10510358B1 (en) * 2017-09-29 2019-12-17 Amazon Technologies, Inc. Resolution enhancement of speech signals for speech synthesis
US11720813B2 (en) * 2017-09-29 2023-08-08 Oracle International Corporation Machine learning platform for dynamic model selection
TWI662511B (en) * 2017-10-03 2019-06-11 Institute for Information Industry Hierarchical image classification method and system
CA3115898C (en) * 2017-10-11 2023-09-26 Aquifi, Inc. Systems and methods for object identification
US11120337B2 (en) * 2017-10-20 2021-09-14 Huawei Technologies Co., Ltd. Self-training method and system for semi-supervised learning with generative adversarial networks
US10489690B2 (en) * 2017-10-24 2019-11-26 International Business Machines Corporation Emotion classification based on expression variations associated with same or similar emotions
US10984905B2 (en) * 2017-11-03 2021-04-20 Siemens Healthcare Gmbh Artificial intelligence for physiological quantification in medical imaging
CN109784325A (en) * 2017-11-10 2019-05-21 Fujitsu Limited Open-set recognition method and apparatus, and computer-readable storage medium
US10990901B2 (en) 2017-11-13 2021-04-27 Accenture Global Solutions Limited Training, validating, and monitoring artificial intelligence and machine learning models
US11257002B2 (en) * 2017-11-22 2022-02-22 Amazon Technologies, Inc. Dynamic accuracy-based deployment and monitoring of machine learning models in provider networks
KR102486395B1 (en) * 2017-11-23 2023-01-10 Samsung Electronics Co., Ltd. Neural network device for speaker recognition, and operation method of the same
US10235533B1 (en) * 2017-12-01 2019-03-19 Palantir Technologies Inc. Multi-user access controls in electronic simultaneously editable document editor
US11393587B2 (en) * 2017-12-04 2022-07-19 International Business Machines Corporation Systems and user interfaces for enhancement of data utilized in machine-learning based medical image review
US10122969B1 (en) 2017-12-07 2018-11-06 Microsoft Technology Licensing, Llc Video capture systems and methods
US10380260B2 (en) * 2017-12-14 2019-08-13 Qualtrics, Llc Capturing rich response relationships with small-data neural networks
WO2019114982A1 (en) * 2017-12-15 2019-06-20 Nokia Technologies Oy Methods and apparatuses for inferencing using a neural network
US20190188605A1 (en) * 2017-12-20 2019-06-20 At&T Intellectual Property I, L.P. Machine Learning Model Understanding As-A-Service
WO2019126625A1 (en) 2017-12-22 2019-06-27 Butterfly Network, Inc. Methods and apparatuses for identifying gestures based on ultrasound data
US10742959B1 (en) 2017-12-29 2020-08-11 Perceive Corporation Use of machine-trained network for misalignment-insensitive depth perception
US20190212977A1 (en) * 2018-01-08 2019-07-11 Facebook, Inc. Candidate geographic coordinate ranking
US10706267B2 (en) 2018-01-12 2020-07-07 Qualcomm Incorporated Compact models for object recognition
US10679330B2 (en) * 2018-01-15 2020-06-09 Tata Consultancy Services Limited Systems and methods for automated inferencing of changes in spatio-temporal images
US10817749B2 (en) * 2018-01-18 2020-10-27 Accenture Global Solutions Limited Dynamically identifying object attributes via image analysis
US20190228310A1 (en) * 2018-01-19 2019-07-25 International Business Machines Corporation Generation of neural network containing middle layer background
US20190244138A1 (en) * 2018-02-08 2019-08-08 Apple Inc. Privatized machine learning using generative adversarial networks
US10867214B2 (en) * 2018-02-14 2020-12-15 Nvidia Corporation Generation of synthetic images for training a neural network model
US10303771B1 (en) * 2018-02-14 2019-05-28 Capital One Services, Llc Utilizing machine learning models to identify insights in a document
US10043255B1 (en) * 2018-02-20 2018-08-07 Capital One Services, Llc Utilizing a machine learning model to automatically visually validate a user interface for multiple platforms
US11347969B2 (en) * 2018-03-21 2022-05-31 Bank Of America Corporation Computer architecture for training a node in a correlithm object processing system
WO2019183153A1 (en) * 2018-03-21 2019-09-26 Kla-Tencor Corporation Training a machine learning model with synthetic images
CN110298219A (en) * 2018-03-23 2019-10-01 Guangzhou Automobile Group Co., Ltd. Autonomous lane keeping method, apparatus, computer device and storage medium
US10860629B1 (en) * 2018-04-02 2020-12-08 Amazon Technologies, Inc. Task-oriented dialog systems utilizing combined supervised and reinforcement learning
US10803197B1 (en) * 2018-04-13 2020-10-13 Amazon Technologies, Inc. Masking sensitive information in records of filtered accesses to unstructured data
CN108596895B (en) * 2018-04-26 2020-07-28 Shanghai Eaglevision Medical Technology Co., Ltd. Fundus image detection method, device and system based on machine learning
US10169315B1 (en) * 2018-04-27 2019-01-01 Asapp, Inc. Removing personal information from text using a neural network
US10915798B1 (en) * 2018-05-15 2021-02-09 Adobe Inc. Systems and methods for hierarchical webly supervised training for recognizing emotions in images
CN108665457B (en) * 2018-05-16 2023-12-19 Tencent Healthcare (Shenzhen) Co., Ltd. Image recognition method, device, storage medium and computer equipment
US20190354836A1 (en) * 2018-05-17 2019-11-21 International Business Machines Corporation Dynamic discovery of dependencies among time series data using neural networks
US10628708B2 (en) * 2018-05-18 2020-04-21 Adobe Inc. Utilizing a deep neural network-based model to identify visually similar digital images based on user-selected visual attributes
US20190354810A1 (en) * 2018-05-21 2019-11-21 Astound Ai, Inc. Active learning to reduce noise in labels
US20190362222A1 (en) * 2018-05-22 2019-11-28 Adobe Inc. Generating new machine learning models based on combinations of historical feature-extraction rules and historical machine-learning models
US10713569B2 (en) * 2018-05-31 2020-07-14 Toyota Research Institute, Inc. System and method for generating improved synthetic images
JP7035827B2 (en) * 2018-06-08 2022-03-15 Ricoh Company, Ltd. Learning identification device and learning identification method
US10699055B2 (en) * 2018-06-12 2020-06-30 International Business Machines Corporation Generative adversarial networks for generating physical design layout patterns
US20200012890A1 (en) * 2018-07-06 2020-01-09 Capital One Services, Llc Systems and methods for data stream simulation
US11880770B2 (en) * 2018-08-31 2024-01-23 Intel Corporation 3D object recognition using 3D convolutional neural network with depth based multi-scale filters
US11836577B2 (en) * 2018-11-27 2023-12-05 Amazon Technologies, Inc. Reinforcement learning model training through simulation
CN109784186B (en) * 2018-12-18 2020-12-15 深圳云天励飞技术有限公司 Pedestrian re-identification method and device, electronic equipment and computer-readable storage medium
US20210049499A1 (en) * 2019-08-14 2021-02-18 Capital One Services, Llc Systems and methods for diagnosing computer vision model performance issues
US11736764B2 (en) * 2019-09-13 2023-08-22 Intel Corporation Artificial intelligence inference on protected media content in a vision processing unit

Also Published As

Publication number Publication date
EP3591587A1 (en) 2020-01-08
US11182223B2 (en) 2021-11-23
US11113124B2 (en) 2021-09-07
US10521719B1 (en) 2019-12-31
US10860460B2 (en) 2020-12-08
US20230073695A1 (en) 2023-03-09
US10671884B2 (en) 2020-06-02
US20220107851A1 (en) 2022-04-07
US11822975B2 (en) 2023-11-21
US20230153177A1 (en) 2023-05-18
US11704169B2 (en) 2023-07-18
US11900178B2 (en) 2024-02-13
US20200012657A1 (en) 2020-01-09
US20210081261A1 (en) 2021-03-18
US20230195541A1 (en) 2023-06-22
US20200012937A1 (en) 2020-01-09
US20240054029A1 (en) 2024-02-15
US20200218638A1 (en) 2020-07-09
US20200012902A1 (en) 2020-01-09
US20220075670A1 (en) 2022-03-10
US20190327501A1 (en) 2019-10-24
US10635939B2 (en) 2020-04-28
US10884894B2 (en) 2021-01-05
US10970137B2 (en) 2021-04-06
US20200012934A1 (en) 2020-01-09
US20200012662A1 (en) 2020-01-09
US20200065221A1 (en) 2020-02-27
US20210255907A1 (en) 2021-08-19
US10382799B1 (en) 2019-08-13
US20210049054A1 (en) 2021-02-18
US20220318078A1 (en) 2022-10-06
US20200117998A1 (en) 2020-04-16
US20200012811A1 (en) 2020-01-09
US20230273841A1 (en) 2023-08-31
US20230376362A1 (en) 2023-11-23
US20200012933A1 (en) 2020-01-09
US20200012671A1 (en) 2020-01-09
US11836537B2 (en) 2023-12-05
US11385942B2 (en) 2022-07-12
US20200250071A1 (en) 2020-08-06
US10896072B2 (en) 2021-01-19
US10599957B2 (en) 2020-03-24
US11256555B2 (en) 2022-02-22
US20200012891A1 (en) 2020-01-09
US20200012584A1 (en) 2020-01-09
US11126475B2 (en) 2021-09-21
US11574077B2 (en) 2023-02-07
US20200012890A1 (en) 2020-01-09
US20210200604A1 (en) 2021-07-01
US20230205610A1 (en) 2023-06-29
US11687384B2 (en) 2023-06-27
US10592386B2 (en) 2020-03-17
US11604896B2 (en) 2023-03-14
US11237884B2 (en) 2022-02-01
US11513869B2 (en) 2022-11-29
US11210145B2 (en) 2021-12-28
US11687382B2 (en) 2023-06-27
US20210224142A1 (en) 2021-07-22
US20210182126A1 (en) 2021-06-17
US10379995B1 (en) 2019-08-13
US20200012583A1 (en) 2020-01-09
US20220308942A1 (en) 2022-09-29
US20220083402A1 (en) 2022-03-17
US11385943B2 (en) 2022-07-12
US20210120285A9 (en) 2021-04-22
US10459954B1 (en) 2019-10-29
US10983841B2 (en) 2021-04-20
US11372694B2 (en) 2022-06-28
US20230281062A1 (en) 2023-09-07
US11861418B2 (en) 2024-01-02
US10482607B1 (en) 2019-11-19
US20200012540A1 (en) 2020-01-09
US20200012917A1 (en) 2020-01-09
US20200014722A1 (en) 2020-01-09
US20230297446A1 (en) 2023-09-21
US20200012892A1 (en) 2020-01-09
US20200012886A1 (en) 2020-01-09
US20200012900A1 (en) 2020-01-09
US11615208B2 (en) 2023-03-28
US20200051249A1 (en) 2020-02-13
US10664381B2 (en) 2020-05-26
US20220147405A1 (en) 2022-05-12
US11210144B2 (en) 2021-12-28
US20200218637A1 (en) 2020-07-09
US20210365305A1 (en) 2021-11-25
US10599550B2 (en) 2020-03-24
US10460235B1 (en) 2019-10-29
US20200012666A1 (en) 2020-01-09
US20200012935A1 (en) 2020-01-09
US11032585B2 (en) 2021-06-08
US10452455B1 (en) 2019-10-22
US11580261B2 (en) 2023-02-14
US20200293427A1 (en) 2020-09-17

Similar Documents

Publication Title
US20220092419A1 (en) Systems and methods to use neural networks for model transformations
EP3591586A1 (en) Data model generation using generative adversarial networks and fully automated machine learning system which generates and optimizes solutions given a dataset and a desired outcome
US20210097343A1 (en) Method and apparatus for managing artificial intelligence systems
US11810000B2 (en) Systems and methods for expanding data classification using synthetic data generation in machine learning models
US20230289665A1 (en) Failure feedback system for enhancing machine learning accuracy by synthetic data generation
WO2019212857A1 (en) Systems and methods for enriching modeling tools and infrastructure with semantics
CN104679646A (en) Method and device for detecting defects of SQL (structured query language) code
Wu et al. Invalid bug reports complicate the software aging situation
Tufek et al. On the provenance extraction techniques from large scale log files
US20220374401A1 (en) Determining domain and matching algorithms for data systems
KR102110350B1 (en) Domain classifying device and method for non-standardized databases
Marques Intelligent system for associative pattern identification in data

Legal Events

Date Code Title Description
AS Assignment

Owner name: CAPITAL ONE SERVICES, LLC, VIRGINIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:PHAM, VINCENT;TRUONG, ANH;ABAD, FARDIN ABDI TAGHI;AND OTHERS;SIGNING DATES FROM 20181026 TO 20181128;REEL/FRAME:057367/0894

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION