WO2022150343A1 - Generation and evaluation of secure synthetic data - Google Patents

Generation and evaluation of secure synthetic data

Info

Publication number
WO2022150343A1
Authority
WO
WIPO (PCT)
Prior art keywords
dataset
simulation
fields
agent
data
Prior art date
Application number
PCT/US2022/011253
Other languages
English (en)
Inventor
Eiran Shalev
Sandeep Narayanaswami
Matthew Tomaszewicz
Francisco Gutierrez
Omar Sharifali
Nicholas Mccurry
Jesse Anderson
Daniel Finn
Original Assignee
Capital One Services, Llc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US17/142,097 (US11847390B2)
Priority claimed from US17/142,117 (US20220215242A1)
Priority claimed from US17/142,024 (US20220215262A1)
Priority claimed from US17/142,137 (US20220215243A1)
Priority claimed from US17/240,133 (US20220215142A1)
Application filed by Capital One Services, Llc
Priority to EP22737018.6A (published as EP4275343A1)
Publication of WO2022150343A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/20Design optimisation, verification or simulation
    • G06F30/27Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0475Generative networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/094Adversarial learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2111/00Details relating to CAD techniques
    • G06F2111/08Probabilistic or stochastic CAD
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14Network analysis or design
    • H04L41/145Network analysis or design involving simulating, designing, planning or modelling of a network
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00Arrangements for monitoring or testing data switching networks
    • H04L43/50Testing arrangements

Definitions

  • aspects of the disclosure relate generally to databases. More specifically, aspects of the disclosure may provide for enhanced creation and maintenance of one or more data models and their related datasets.
  • training machine learning models based solely on factual data limits the models to environments that have actually existed. For developers seeking to train machine learning models based on environments that are rare or have never existed, doing so is difficult, if not impossible, because training data does not exist.
  • developers may need data for testing applications and/or providing training data for the training of personnel to handle various scenarios with real or realistic data across various domains including, for instance, data science, recruiting, personnel training, and other domains.
  • real data may be available but may require a lengthy security verification process before the real data is released to the developers.
  • the scrubbing process for converting real data into anonymized data may be time-consuming to ensure no real data is inadvertently released.
  • Generative models have been used to generate realistic synthetic data (i.e., data that is not acquired as a result of direct observation but is otherwise indistinguishable from observed behavior, either by statistical testing or human review).
  • conventional generative models are difficult to use, as well as difficult and time-consuming for average developers to modify to create the desired realistic synthetic data.
  • aspects described herein may address these and other problems, and generally improve the quality and quantity of data available for improving the modeling of systems, training machine learning models, and/or other purposes by offering improved generation of synthetic data and/or validation of the models generating the synthetic data.
  • aspects described herein may allow for generation of synthetic datasets comprising factual synthetic data and/or counterfactual synthetic data. This may have the effect of improving the complexity of data available for training machine learning models. According to some aspects, these and other benefits may be achieved by using models to generate the synthetic data. In implementation, the ability to generate a greater variety of data may be effected by using one or more models to describe data, generate synthetic datasets based on those models, and selectively configure the models to improve the modeling of the data and/or generate additional datasets varying from the original dataset.
  • the additional datasets may include data (referred to herein as "factual synthetic data”) closely matching a limited amount of actual data available based on a known environment or data (referred to herein as “counterfactual synthetic data") representing data from a created environment (e.g., an environment that has not occurred).
  • the models may include, but are not limited to, a probabilistic graphical model (PGM) and/or an agent-based model (ABM). Further aspects described herein may provide for scrubbing actual data to create a generative model that does not reveal the content of the underlying true-source data and may provide for validating a generative model.
  • a computer-implemented method may comprise receiving a source dataset, wherein the source dataset may comprise a plurality of records, wherein each record contains data arranged in a plurality of fields; determining one or more parameters for the plurality of fields based on the data of the records in the plurality of fields, wherein the parameters comprise one or more of statistical parameters or correlation parameters; storing the one or more parameters; generating a generative model of the source dataset, wherein the generative model may be configured to generate one or more generated datasets having the one or more parameters; generating, based on the generative model, a generated dataset comprising data arranged in the plurality of fields, wherein the generated dataset may be a synthetic dataset; and outputting the generated dataset.
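The claimed flow above — determine statistical parameters per field, build a generative model from them, then sample a synthetic dataset — can be sketched as follows. This is a minimal, stdlib-only illustration, not the patented implementation; the field names and Gaussian assumption are illustrative.

```python
import random
import statistics

def fit_field_parameters(records, fields):
    """Determine per-field statistical parameters from the source dataset."""
    params = {}
    for field in fields:
        values = [record[field] for record in records]
        params[field] = {"mean": statistics.mean(values),
                         "stdev": statistics.stdev(values)}
    return params

def generate_dataset(params, n, seed=0):
    """Generate a synthetic dataset whose fields follow the stored parameters."""
    rng = random.Random(seed)
    return [{field: rng.gauss(p["mean"], p["stdev"])
             for field, p in params.items()}
            for _ in range(n)]

source = [{"amount": 10.0, "balance": 100.0},
          {"amount": 12.0, "balance": 110.0},
          {"amount": 9.0,  "balance": 95.0},
          {"amount": 11.0, "balance": 105.0}]
params = fit_field_parameters(source, ["amount", "balance"])
synthetic = generate_dataset(params, 1000)
```

Here the "generative model" is simply the stored parameter dictionary plus the sampler; a production system would also capture correlation parameters, as the claim notes.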
  • benefits may be achieved by using a computer-implemented method that may receive a simulation specification comprising an agent having a probability distribution definition, the agent probability distribution definition comprising attribute probability distribution definitions and identifying one or more behaviors to be simulated; receive one or more instantiation parameters; generate, using the simulation specification, a simulation state of an agent-based model, the generating comprising instantiating, via sampling using a random number generator to sample probability distribution definitions of attributes of the agent probability distribution definition, an agent instance comprising first attributes; store the simulation state; simulate, based on the simulation state and the simulation specification, a simulation step comprising performing, via sampling using the random number generator to sample a probability distribution definition of the one or more behaviors associated with the agent instance, an action for the agent instance; store the simulation step; generate, based on the stored simulation step, a synthetic dataset; and output the synthetic dataset.
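The simulation method above — instantiate an agent by sampling its attribute distributions with a random number generator, then sample a behavior per step — can be sketched as follows. All names, distributions, and behavior effects here are hypothetical stand-ins for the claim's "simulation specification", not the actual patented format.

```python
import random

# Hypothetical simulation specification: agent attributes are uniform
# distributions; behaviors form a categorical distribution over weights.
SPEC = {
    "attributes": {"credit_limit": (1000.0, 5000.0),   # uniform(lo, hi)
                   "balance": (0.0, 500.0)},
    "behaviors": {"spend": 0.7, "pay": 0.3},           # behavior weights
}

def instantiate_agent(spec, rng):
    """Instantiate an agent instance by sampling each attribute distribution."""
    return {name: rng.uniform(lo, hi)
            for name, (lo, hi) in spec["attributes"].items()}

def simulate_step(agent, spec, rng):
    """Sample one behavior for the agent instance and apply its effect."""
    behaviors = list(spec["behaviors"])
    weights = list(spec["behaviors"].values())
    action = rng.choices(behaviors, weights=weights)[0]
    if action == "spend":
        agent["balance"] += 50.0
    else:  # pay down the balance
        agent["balance"] = max(0.0, agent["balance"] - 100.0)
    return {"action": action, "balance": agent["balance"]}

rng = random.Random(42)
agent = instantiate_agent(SPEC, rng)
synthetic_dataset = [simulate_step(agent, SPEC, rng) for _ in range(10)]
```

Each stored step record becomes a row of the output synthetic dataset, mirroring the claim's "store the simulation step; generate ... a synthetic dataset".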
  • a computer-implemented method may comprise receiving a true-source dataset comprising a source plurality of records, wherein the source plurality of records may be arranged according to a plurality of fields and each record of the source plurality of records may comprise true-source data for at least one field; categorizing, using a previously-trained model, one or more fields of the plurality of fields; determining, based on the categorizing of the one or more fields of the plurality of fields, a method of scrubbing the source plurality of records; generating, based on the determined method for scrubbing the one or more fields of the plurality of fields of the source plurality of records of the true-source dataset, a scrubbed dataset comprising a scrubbed plurality of records; determining, based on the data of the scrubbed plurality of records of the scrubbed dataset, one or more parameters for the plurality of fields
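The scrubbing step above — categorize each field, then pick a scrubbing method per category — can be sketched as follows. A fixed mapping stands in for the claim's "previously-trained model", and the field names and hash-based scrubbing method are illustrative assumptions.

```python
import hashlib

# Stand-in for the previously-trained field classifier (hypothetical fields).
FIELD_CATEGORIES = {"name": "pii", "ssn": "pii", "amount": "numeric"}

def scrub_value(field, value):
    """Apply a per-category scrubbing method: one-way hash PII fields,
    pass non-sensitive fields through unchanged."""
    if FIELD_CATEGORIES.get(field) == "pii":
        return hashlib.sha256(str(value).encode()).hexdigest()[:12]
    return value

def scrub_dataset(records):
    """Produce a scrubbed plurality of records from the true-source records."""
    return [{field: scrub_value(field, value) for field, value in record.items()}
            for record in records]

true_source = [{"name": "Alice", "ssn": "123-45-6789", "amount": 42.0}]
scrubbed = scrub_dataset(true_source)
```

Statistical parameters could then be determined from `scrubbed` (e.g., over the surviving numeric fields) without exposing true-source content.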
  • a computer-implemented method may comprise receiving a generative model, wherein the generative model may be configured to generate one or more generated datasets having records arranged in one or more fields; generating, based on the generative model, a generated test dataset; receiving one or more input parameters associated with the one or more fields; determining, based on the one or more input parameters, a hypothesis test for the one or more fields; determining, based on data in the one or more fields of the generated test dataset, a parameter, wherein the parameter may be one or more of a statistical parameter or a correlation parameter; determining, based on the parameter, whether the generated test dataset passed the hypothesis test; and outputting the determination whether the generated test dataset passed the hypothesis test.
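The validation method above — determine a statistical parameter of the generated test dataset, then decide whether it passes a hypothesis test derived from the input parameters — can be sketched with a simple two-sided z-test on a field's mean. The test choice and 5% level are illustrative assumptions, not the patent's required test.

```python
import math
import statistics

def mean_hypothesis_test(values, expected_mean, expected_stdev):
    """Two-sided z-test at the 5% level: is the generated field's mean
    consistent with the expected input parameter?"""
    n = len(values)
    standard_error = expected_stdev / math.sqrt(n)
    z = (statistics.mean(values) - expected_mean) / standard_error
    return abs(z) < 1.96   # 1.96 = two-sided 5% critical value

# A generated test dataset whose field matches the parameter passes;
# one shifted away from the parameter fails.
passed = mean_hypothesis_test([100.0] * 100, expected_mean=100.0,
                              expected_stdev=15.0)
failed = mean_hypothesis_test([110.0] * 100, expected_mean=100.0,
                              expected_stdev=15.0)
```

The boolean result corresponds to the claim's output of "the determination whether the generated test dataset passed the hypothesis test".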
  • a framework for agent-based modeling (ABM) simulations may include separating definitions of agents from simulation specifications. This separation permits greater control of each of the agents and the simulations using the agents.
  • the framework with the separate agent definitions permits agents to be extensible across different simulations that otherwise would not be based on common agents.
  • separating agent definitions from simulation specifications permits the creation of extensible, complex agents that include attributes and/or behaviors that may not be used for any given simulation. By permitting agents to grow in complexity independent of any given simulation, agents become more comprehensive in their attributes and behaviors.
  • the complex agents may then streamline the creation of complex simulations by allowing simulations to reuse complex agents rather than requiring agents to be newly defined for each simulation.
  • at least some environmental factors may be represented as agents.
  • a computer-implemented method may comprise storing, in a storage, one or more agent type definitions, wherein each agent type definition may comprise a plurality of attribute probability distribution definitions, receiving, for a first simulation, a first simulation specification, wherein the first simulation specification may comprise a first list of agent type definitions, attribute probability distributions associated with the first list of agent type definitions, and behavior probability distributions associated with the first list of agent type definitions; generating the first simulation via sampling, using a random number generator, the first simulation specification's probability distributions of the first list of agent type definitions; executing steps of the first simulation via sampling, using the random number generator, the first simulation specification's probability distributions of the first list of agent type definitions and the behavior probability distributions associated with the first list of agent type definitions; outputting, based on the first simulation, a first synthetic dataset; receiving, for a second simulation, a second simulation specification, wherein the second simulation specification may comprise a second list of agent type definitions, attribute probability distributions associated with the second
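The separation described above — agent type definitions stored once, reused by multiple simulation specifications — can be sketched as follows. The registry layout, agent type, and behaviors are hypothetical illustrations of the pattern, not the patented data format.

```python
import random

# Agent type definitions stored once (an "agent storage"), separately from
# any simulation; both simulation specifications below reuse the same type.
AGENT_TYPES = {
    "card_user": {"attributes": {"balance": (0.0, 1000.0)}},  # uniform(lo, hi)
}

def run_simulation(spec, rng):
    """Instantiate the listed agent types by sampling their attribute
    distributions, then emit one record per simulation step."""
    agents = [{name: rng.uniform(lo, hi)
               for name, (lo, hi) in AGENT_TYPES[t]["attributes"].items()}
              for t in spec["agent_types"]]
    return [spec["behavior"](agent)
            for agent in agents
            for _ in range(spec["steps"])]

# Two differently parameterized simulations sharing one agent definition.
sim_a = {"agent_types": ["card_user"], "steps": 3,
         "behavior": lambda a: {"event": "spend", "balance": a["balance"]}}
sim_b = {"agent_types": ["card_user"], "steps": 2,
         "behavior": lambda a: {"event": "pay", "balance": a["balance"]}}

rng = random.Random(1)
dataset_a = run_simulation(sim_a, rng)
dataset_b = run_simulation(sim_b, rng)
```

Because the agent definition lives outside both specifications, extending `card_user` with new attributes benefits every simulation that lists it.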
  • FIG. 1 depicts an example of a computing device and system architecture that may be used in implementing one or more aspects of the disclosure in accordance with one or more illustrative aspects discussed herein;
  • FIG. 2 depicts an example of a network comprising servers and databases
  • FIG. 3 depicts a flow chart for a method of generating a dataset
  • FIG. 4 depicts a flow chart for a method of generating a dataset of FIG. 3 with additional steps
  • FIG. 5 depicts a flow chart for a method of generating synthetic data based on parameters and a probabilistic graphical model
  • FIG. 6 depicts a flow chart for a method of generating a user interface for modification of parameters of a probabilistic graphical model
  • FIG. 7 depicts a user interface for selecting and/or modifying parameters of a probabilistic graphical model
  • FIGs. 8A, 8B, and 8C depict examples of probability distribution definitions and various simulation parameters.
  • FIG. 8A depicts an example of an agent probability distribution definition that includes both attributes and behaviors.
  • FIG. 8B depicts an example of an agent probability distribution definition and a separate behavior probability distribution definition.
  • FIG. 8C depicts an example of desired synthetic data to be produced by the agent-based model;
  • FIGs. 9A, 9B, 9C, and 9D depict state diagrams for conducting agent-based model simulations
  • FIG. 10 depicts a flowchart of an execution of an agent-based model simulation
  • FIG. 11 depicts another example flowchart of an execution of an agent-based model simulation
  • FIG. 12A depicts a flowchart of a process of modifying an agent-based model.
  • FIG. 12B depicts a user interface for modifying an agent-based model;
  • FIG. 13 depicts a flow chart for a method of training a model based on true-source data
  • FIGs. 14-16 depict flow charts for a method of training a model based on true-source data of FIG. 13 with additional steps;
  • FIGs. 17-18 depict flow charts for a method of validating synthetic data
  • FIG. 19 depicts a flow chart for a method of generating a user interface for adding hypothesis tests to a process of validating a generative model
  • FIG. 20 depicts a user interface for modifying a data model and for specifying hypothesis tests for validating the generative model
  • FIGs. 21-22 depict sample code for defining an agent-based model using a functional programming language
  • FIG. 23 depicts an agent storage and a simulation specification storage
  • FIG. 24 depicts various simulations
  • FIG. 25 depicts a flowchart of an execution of an agent-based model simulation
  • FIG. 26 depicts another example flowchart of an execution of an agent-based model simulation.
  • FIG. 27A depicts a flowchart of a process of modifying an agent-based model.
  • FIG. 27B depicts a user interface for modifying an agent-based model.
  • aspects discussed herein may relate to methods and techniques for improving creation and/or modification of a database based on synthetic data with relevant distributions. As discussed further herein, this combination of features may allow for improved modeling of a database by basing fields and data structures on source data having relevant distributions pertinent to the modeled fields.
  • synthetic data may refer to any data that is not acquired as a result of direct observation but is otherwise indistinguishable from observed behavior, either by statistical testing or human review
  • an "agent" may refer to a software process behaving like something that may or may not exist in the real world, to be represented in a simulation (e.g., the agent having attributes and being able to execute one or more behaviors; a credit card user may be modeled as a set of attributes including credit score, checking account, credit limit and credit account, and a set of behaviors including pay credit card, spend money, etc.);
  • an "agent-based model” may refer to a model of something in the real world, for example an economy, implemented as multiple software agents interacting with each other;
  • a "behavior” may refer to something a software agent is allowed to do in the context of an agent-based model (e.g., an agent model of a credit card user may have a first behavior to pay a balance on a credit card, and a second behavior to purchase goods or services using the credit card);
  • a "simulation” may refer to a series of steps in an agent-based model where agents interact with each other and execute behaviors to generate synthetic data;
  • a "probability distribution” may refer to a mathematical function defining the probabilities of possible values for sampled data points, agents, or behaviors.
  • FIG. 1 illustrates one example of a computing device 101 that may be used to implement one or more illustrative aspects discussed herein.
  • the computing device 101 may, in some embodiments, implement one or more aspects of the disclosure by reading and/or executing instructions and performing one or more actions based on the instructions.
  • the computing device 101 may represent, be incorporated in, and/or include various devices such as a desktop computer, a computer server, a mobile device (e.g., a laptop computer, a tablet computer, a smart phone, any other types of mobile computing devices, and the like), and/or any other type of data processing device.
  • the computing device 101 may, in some embodiments, operate in a standalone environment. In others, the computing device 101 may operate in a networked environment. As shown in FIG. 1, various network nodes 101, 105, 107, and 109 may be interconnected via a network 103, such as the Internet. Other networks may also or alternatively be used, including private intranets, corporate networks, LANs, wireless networks, personal networks (PAN), and the like. Network 103 is for illustration purposes and may be replaced with fewer or additional computer networks.
  • a local area network (LAN) may have one or more of any known LAN topologies and may use one or more of a variety of different protocols, such as Ethernet.
  • Devices 101, 105, 107, 109, and other devices may be connected to one or more of the networks via twisted pair wires, coaxial cable, fiber optics, radio waves, or other communication media. Additionally or alternatively, the computing device 101 and/or the network nodes 105, 107, and 109 may be a server hosting one or more databases.
  • the computing device 101 may include a processor 111, RAM 113, ROM 115, network interface 117, input/output interfaces 119 (e.g., keyboard, mouse, display, printer, etc.), and memory 121.
  • Processor 111 may include one or more central processing units (CPUs), graphics processing units (GPUs), and/or other processing units such as a processor adapted to perform computations associated with database operations.
  • I/O 119 may include a variety of interface units and drives for reading, writing, displaying, and/or printing data or files. I/O 119 may be coupled with a display such as display 120.
  • Memory 121 may store software for configuring computing device 101 into a special purpose computing device in order to perform one or more of the various functions discussed herein.
  • Memory 121 may store operating system software 123 for controlling overall operation of the computing device 101, control logic 125 for instructing the computing device 101 to perform aspects discussed herein, database creation and manipulation software 127 and other applications 129.
  • Control logic 125 may be incorporated in and may be a part of database creation and manipulation software 127.
  • the computing device 101 may include two or more of any and/or all of these components (e.g., two or more processors, two or more memories, etc.) and/or other components and/or subsystems not illustrated here.
  • Devices 105, 107, 109 may have similar or different architecture as described with respect to the computing device 101.
  • Those of skill in the art will appreciate that the functionality of the computing device 101 (or device 105, 107, 109) as described herein may be spread across multiple data processing devices, for example, to distribute processing load across multiple computers, to segregate transactions based on geographic location, user access level, quality of service (QoS), etc.
  • devices 101, 105, 107, 109, and others may operate in concert to provide parallel computing features in support of the operation of control logic 125 and/or software 127.
  • One or more aspects discussed herein may be embodied in computer-usable or readable data and/or computer-executable instructions, such as in one or more program modules, executed by one or more computers or other devices as described herein.
  • program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types when executed by a processor in a computer or other device.
  • the modules may be written in a source code programming language that is subsequently compiled for execution, or may be written in a scripting language such as (but not limited to) Python or JavaScript.
  • the computer executable instructions may be stored on a computer readable medium such as a hard disk, optical disk, removable storage media, solid-state memory, RAM, etc.
  • the functionality of the program modules may be combined or distributed as desired in various embodiments.
  • the functionality may be embodied in whole or in part in firmware or hardware equivalents such as integrated circuits, field programmable gate arrays (FPGA), and the like.
  • Particular data structures may be used to more effectively implement one or more aspects discussed herein, and such data structures are contemplated within the scope of computer executable instructions and computer- usable data described herein.
  • Various aspects discussed herein may be embodied as a method, a computing device, a data processing system, or a computer program product.
  • FIG. 2 depicts an example of a network of two or more servers each supporting one or more databases having datasets.
  • a database storage server 201, a file system server 202, and a computing server 203 may be connected to each other via network 204.
  • Network 204 may be represented as a single network but may comprise combinations of other networks or subnetworks.
  • the database storage server 201 may include one or more processors 205 and a database 206 comprising metadata 207 for one or more datasets.
  • the file system server 202 may include one or more processors 208, a memory 209 comprising one or more source/uploaded datasets 210, one or more data models 211 (also referred to as "data model objects"), one or more scrubbed datasets 220, and one or more generated datasets 212.
  • the metadata for the source datasets 210 and the synthetic datasets may be stored as metadata 207 in the database storage server 201.
  • the computing server 203 may include one or more processors 213 and a storage 214 comprising data 215.
  • Database storage server 201, file system server 202, and/or computing server 203 may offer services for computing data ingestion, generating a data model object, and generating synthetic data. Those services may include communicating with the other servers as needed to obtain or provide the source datasets, the data model objects, and/or the generated synthetic data as needed.
  • An input data source 219 may make requests of the database storage server 201, the file system server 202, and/or the computing server 203 to obtain generated data.
  • the input data source 219 may be a user and/or outside system account.
  • the new dataset may be created from a first set of rows from a first table and a second set of rows from a second table. Further, the new dataset may obtain content from other new datasets.
  • when designing a new data model object, software engineers consider a number of factors that help them plan how that model should be configured. During the design process, a software engineer attempts to create an abstract model that organizes elements of data to be stored in a file system and standardizes how those data elements relate to each other and to the properties of entities. For example, for a data model object relating to credit card account data, the data model object may include a first data element representing an account holder and a second data element representing the billing address for that credit card account.
  • the term "data model object" is generally used in two separate senses. In a first sense, the term refers to an abstract formulation of the objects and relationships found in a particular domain. In a second sense, the term refers to a set of concepts used to define formalizations in that particular domain. As described herein, the term "data model object" may be used in both senses, as relevant to the description in context. As a variety of performance factors are tied to the data model object (including but not limited to speeds of searches, adding new data, reindexing the database, and the like), correctly modeling data often means repeatedly revising a given model prior to deployment.
  • a software engineer may use synthetic data in datasets to replace the small, sampled source datasets where the synthetic data is expected to be close to ideal for a given numerical field.
  • An issue with the use of synthetic data is the lack of reusability of any generated synthetic data or even the process to generate the synthetic data.
  • when a software engineer develops a process for generating synthetic data for modeling data, that process is highly associated with that data.
  • when the data model object changes, the process for generating additional synthetic data has to be re-created for that new data model object.
  • small, sampled source datasets may be used in machine learning models to train the models to act in a desired way and/or produce predictions based on input data.
  • Machine learning is a process by which computer algorithms improve through experience. Machine learning algorithms build a mathematical model based on sample data, known as "training data”, in order to make predictions or decisions without being explicitly programmed to do so. Machine learning algorithms are used in a wide variety of applications, such as email filtering and computer vision, where it is difficult or infeasible to develop conventional algorithms to perform the needed tasks.
  • One input to these machine learning models may include historical datasets that capture aspects of the operations of the business. For example, high-value business decisions may be automated with machine learning models.
  • a risk in deploying machine learning models into production may include that future events do not necessarily resemble past events. As such, a machine learning model trained only on historical events may make suboptimal decisions on live events. This downside may become significant in the financial industry because of the risk involved in making decisions based on incomplete or unrepresentative data. Machine learning models may benefit from additional data where the data encompasses a wider set of scenarios.
  • a generative model is used to describe models that generate instances of output variables that may be used for machine learning.
  • a generative model may generate synthetic data that may be input into various machine learning models.
  • a generative model may be referred to as a representation of a data distribution that may be used to generate data points.
  • a good generative model may be treated as a source of synthetic data - e.g., data that is realistic but not actual, real-world data.
  • Multiple approaches exist for generating synthetic data including, but not limited to, generative adversarial networks, variational auto encoders, probabilistic graphical models, and agent-based models.
  • a generative adversarial network (GAN) is generally referred to as a machine learning framework in which two neural networks compete against each other (e.g., based on game theory). Based on a training set, the GAN attempts to generate new data with the same statistics as the training set.
  • a variational autoencoder (VAE) attempts to learn an encoding for a set of data by training the network to ignore irrelevant information, thus creating a reduced encoding of an original dataset. The autoencoder attempts to generate, from the reduced encoding, a representation as close as possible to its original dataset.
  • a probabilistic graphical model is a statistical model that represents variables and their associated probabilities as nodes and the relationships (e.g., dependencies and/or correlation) as edges.
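A PGM in this sense can be sketched as a tiny two-node model: one root variable, one dependent variable, and an edge encoded as a conditional probability table. The variables, weights, and structure here are hypothetical illustrations, not taken from the patent.

```python
import random

# Two-node probabilistic graphical model: "income" is a root node and
# "approved" depends on it through a conditional probability table (the edge).
def sample_pgm(rng):
    income = rng.choices(["low", "high"], weights=[0.6, 0.4])[0]
    p_approved = {"low": 0.2, "high": 0.8}[income]   # edge: income -> approved
    return {"income": income, "approved": rng.random() < p_approved}

rng = random.Random(3)
samples = [sample_pgm(rng) for _ in range(2000)]

def approval_rate(level):
    matching = [s for s in samples if s["income"] == level]
    return sum(s["approved"] for s in matching) / len(matching)
```

Sampling the root first and then each dependent node (ancestral sampling) is what lets a PGM double as a generator of synthetic records, and the edge shows up empirically as different approval rates per income level.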
  • An agent-based model is a statistical model that represents individual agents and their behaviors with the probability of the behaviors occurring over time.
  • Recurrent neural networks (RNNs) are artificial neural networks in which connections between nodes form a directed graph along a temporal sequence. This allows RNNs to exhibit temporal dynamic behavior.
  • Derived from feedforward neural networks, RNNs can use their internal state (memory) to process variable-length sequences of inputs. This makes them applicable to tasks such as unsegmented, connected handwriting recognition or speech recognition.
  • GANs tend to be used in modeling where more source data is present, users are experienced in artificial intelligence processes, a goal is to accurately generate target data matching source data, and a detailed explanation or specific control of how data is generated is not required.
  • ABMs tend to be used in modeling where there is less source data, developers are experienced in a given domain of data, a goal is to simulate rare events or previously unexperienced events, and a detailed explanation or specific control of how data is generated may be needed.
  • VAEs and PGMs are generally represented on the spectrum of requirements/goals between GANs and ABMs with VAEs closer to GANs and PGMs closer to ABMs.
  • the source data may have various parameters (e.g., distribution, mean, mode, median, minimum, maximum, standard deviation, symmetry, skewness, kurtosis, correlation, or other parameters), with those parameters possibly being specified and/or determined.
  • the synthetic data may also have parameters, possibly being specified and/or determined.
  • correlations between fields may exist including, but not limited to, covariance, interclass correlation, intraclass correlation, or rank correlation. Independence (e.g., determined, for instance, from a chi-squared test) may also be used to describe relationships between fields of data.
  • Where the parameters of the synthetic data match those of the source data, the synthetic data may be referred to as "factual synthetic data" or grouped as "factual synthetic datasets". Where the parameters of the synthetic data are intentionally different than those of the source data, the synthetic data may be referred to as "counterfactual synthetic data" or grouped as "counterfactual synthetic datasets".
  • one or more of the factual datasets or counterfactual datasets may be used to augment existing historical datasets for the training of machine learning models.
  • a machine learning model trained on counterfactual datasets may be more robust to changes in the distribution of actual and real-time data, and may be expected to do a better job in a wider set of scenarios.
  • counterfactual datasets may be used to train employees in responding to various business scenarios.
  • FIGs. 3-7 describe how to use a generative model to generate synthetic data.
  • the generated data may be factual synthetic data and/or counterfactual synthetic data based on the desired type of synthetic data.
  • FIG. 3 is an example of a flow chart describing a process for creating synthetic data from true-source data.
  • the synthetic dataset may be used to train a machine learning model or may be used to augment existing data and the combination used to train the machine learning model.
  • the method of FIG. 3 may be implemented by a suitable computing system, for instance, as described above with respect to FIGs. 1 and/or 2.
  • the method of FIG. 3 may be implemented by any suitable computing environment by a computing device and/or combination of computing devices, such as computing devices 101, 105, 107, and 109 of FIG. 1.
  • the method of FIG. 3 may be implemented in suitable program instructions, such as in database creation and manipulation software 127, and may operate on a suitable data such as data from database storage server 201 or data from file system server 202 or data from computing server 203.
  • Various generative models may encode the distribution of a dataset by capturing both the individual variations of a variable in the dataset as well as the covariances of pairs of variables.
  • Probabilistic graphical models may be a useful choice among models because of their sparseness and interpretability, permitting modification of the PGMs to represent parameters not found in existing datasets and thus permitting adjustments to comport with a desired counterfactual scenario.
  • users are able to modify specific nodes to adjust parameters of variables (e.g., parameters describing the content of individual cells in fields of a database) and to modify specific edges to adjust correlations between the variables (e.g., correlations describing relationships between fields of the database).
  • In step 301, an initial dataset is received, in which the dataset has records in fields.
  • In step 302, a processor determines one or more statistical parameters of the one or more fields.
  • In step 303, the processor determines one or more correlation parameters between two or more of the fields.
  • In step 304, the statistical parameters and correlation parameters are stored (for instance, in one of the memories or databases of FIGs. 1 and/or 2).
  • In step 305, a generative model (e.g., a probabilistic graphical model) is trained. Using the generative model, a dataset is generated in step 306.
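The flow of steps 301-306 can be sketched end to end. The sketch below makes the simplifying assumption that each field is modeled by an independent Gaussian; the specification's generative model (e.g., a PGM) would additionally capture correlations between fields. All names are illustrative:

```python
# Hypothetical sketch of steps 301-306: receive a dataset, determine and
# store per-field parameters, then generate a synthetic dataset from them.
import random
import statistics

def fit(dataset):
    """Steps 301-304: determine and store statistical parameters per field."""
    f1, f2 = zip(*dataset)
    return {
        "f1": (statistics.mean(f1), statistics.stdev(f1)),
        "f2": (statistics.mean(f2), statistics.stdev(f2)),
    }

def generate(model, n, rng):
    """Steps 305-306: sample a synthetic dataset from the stored parameters.
    (Fields sampled independently here for brevity; a real PGM would also
    reproduce inter-field correlations.)"""
    (m1, s1), (m2, s2) = model["f1"], model["f2"]
    return [(rng.gauss(m1, s1), rng.gauss(m2, s2)) for _ in range(n)]

rng = random.Random(0)
source = [(50 + 10 * rng.random(), 15 + 5 * rng.random()) for _ in range(200)]
model = fit(source)                 # step 301-304
synthetic = generate(model, 1000, rng)  # step 305-306
print(len(synthetic))
```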
  • FIG. 4 describes a process similar to that of FIG. 3 and includes additional outcomes.
  • In step 401, an initial dataset is received, in which the dataset has records arranged in fields.
  • In step 402, a processor determines one or more statistical parameters of the one or more fields.
  • In step 403, the processor determines one or more correlation parameters between two or more of the fields.
  • In step 404, the statistical parameters and correlation parameters are stored (e.g., as metadata 207 in database storage server 201).
  • In step 405, a generative model (e.g., a PGM) is trained on the metadata 207.
  • In step 406, a synthetic dataset (e.g., a probabilistic graphical model dataset) is generated.
  • the generated synthetic dataset may be sent to a user who may have originally requested the generation of the synthetic dataset.
  • a machine learning model may be trained in step 408 on the synthetic dataset generated in step 406 and, in step 409, the machine learning model of step 408 may be used to generate predictions based on, for example, another true-source dataset.
  • the user is able to modify the underlying distribution of a generated synthetic dataset.
  • a range of synthetic datasets, from factual synthetic datasets to counterfactual synthetic datasets may be generated.
  • the synthetic datasets may vary from each other based on different statistical properties of one variable or based on different statistical properties of multiple variables.
  • the user may combine the synthetic datasets with each other and/or with existing true-source datasets.
  • the system may receive modification of parameters and/or distributions, e.g., from a user, in step 411. Based on those modifications received in step 411, the generative model may be modified in step 412 and a synthetic dataset generated, in step 406, based on the modified generative model.
  • In step 413, statistical parameters and/or correlation parameters may be determined from the synthetic dataset as generated in step 406 (and possibly sent to the user). Based on the determination of the parameters in step 413, the system may receive modifications of one or more parameters/distributions in step 411 and, in step 412, modify the generative model, and generate a revised synthetic dataset in step 406.
  • the parameters/distributions of the synthetic dataset may be compared, in step 414, with expected parameters/distributions of the generative model of step 405. Based on the comparison of step 414, the generative model may be modified in step 412 and a revised synthetic dataset generated in step 406.
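The modify/regenerate/compare loop of steps 411-414 might be sketched as follows. The field name, parameter values, and the recession interpretation are illustrative assumptions, not details from the specification:

```python
# Hypothetical sketch of steps 411-414: modify a parameter of the generative
# model, regenerate, then compare realized vs. expected parameters.
import random
import statistics

model = {"balance": {"mean": 1000.0, "stdev": 200.0}}  # trained model (step 405)

def generate(model, n, rng):
    """Step 406: generate a synthetic dataset from the model."""
    p = model["balance"]
    return [rng.gauss(p["mean"], p["stdev"]) for _ in range(n)]

def modify(model, field, **changes):
    """Steps 411-412: apply user-supplied modifications to the model."""
    return {field: {**model[field], **changes}}

rng = random.Random(42)
# A counterfactual modification, e.g., higher balances in a recession scenario.
counterfactual = modify(model, "balance", mean=1500.0)
synthetic = generate(counterfactual, 5000, rng)

realized_mean = statistics.mean(synthetic)              # step 413
expected_mean = counterfactual["balance"]["mean"]
drift = abs(realized_mean - expected_mean)              # step 414 comparison
print(round(drift, 1))
```

If the drift exceeded a tolerance, the loop would feed back into step 412 to modify the generative model and regenerate.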
  • FIG. 5 describes an approach to creating synthetic datasets by capturing the knowledge of a subject matter expert, i.e., permitting the subject matter expert to control the creation of the generative model using supplied parameters.
  • FIG. 5 depicts a flow chart of a process in which statistical parameters and/or correlation parameters are received and used to modify a generative model that then is used to generate a dataset.
  • In step 501, statistical parameters of one or more fields of a dataset are received.
  • In step 502, correlation parameters between two or more fields of the dataset are received.
  • A generative model (e.g., a probabilistic graphical model) is then modified based on the received parameters and used to generate a dataset.
  • FIG. 6 depicts a process of generating a user interface and modifying a generative model (e.g., a probabilistic graphical model) based on a user's interaction with the user interface.
  • In step 601, a system receives a labeled true-source dataset.
  • In step 602, the system (e.g., processor 213 or other processors) generates a data model object based on the labeled dataset.
  • the data model object may be stored as metadata 207.
  • In step 603, the system generates a user interface based on the metadata of step 602.
  • In step 604, the system receives a user's interactions with the user interface modifying the metadata of the data model object and adjusts the metadata in response.
  • In step 605, a generative model is trained based on the metadata adjusted (also referred to as "tuned") by the user.
  • In step 606, the system may receive a user's designation of a quantity of generated datasets to be generated (e.g., through further interactions with a user interface).
  • In step 607, the system generates the quantity of generated datasets requested by the user in step 606 and sends, in step 608, the datasets to the user.
  • In step 609, the system may receive further user interactions with a user interface and, in response, modify the metadata of the data model object and then, based on the modified metadata, train another generative model in step 605.
  • the user may validate the generative model (as described herein, for example) and, based on the results of that validation (step 610), further modify the metadata in step 609 for training of another generative model (or retraining based on the metadata if replacing the existing generative model).
  • the tuning of a data model object may benefit a user by allowing the user to customize generated datasets that are then generated from a generative model trained on the tuned data model object.
  • the modified generative model from step 609 may be subsequently used as described in FIGs. 4 and/or 5.
  • FIG. 7 depicts a possible representation of a user interface, permitting modification of a generative model.
  • the user interface 701 may comprise one or more regions 702 permitting a user to select and/or modify statistical parameters of the generative model and one or more regions 703 permitting a user to select and/or modify correlation parameters of the generative model.
  • the one or more regions 702 permitting selection/modification of statistical parameters may comprise one or more of a node (in the case of a PGM)/field selection/deselection (represented by region 704), a distribution modification option (represented by region 705), a mean modification option (represented by region 706), a mode modification option (represented by region 707), a maximum modification option (represented by region 708), a minimum modification option (represented by region 709), a standard deviation modification option (represented by region 710), a symmetry modification option (represented by region 711), a skewness modification option (represented by region 712), and/or a kurtosis modification option (represented by region 713).
  • Other regions may be added as desired to permit modification of other statistical parameters.
  • the one or more regions 703 permitting selection/modification of correlation parameters may comprise one or more of edge selection (in the case of a PGM) (represented by region 714) and/or the ability to select fields directly, e.g., first field 715 and second field 716, a type of correlation option (represented by region 717), and a degree of correlation option (represented by region 718).
  • Another region 719 may allow a user to identify how many generated datasets are to be generated and sent to the user. For instance, the quantity of desired synthetic datasets may be specified in region 720.
  • By permitting tuning of generative models in this manner, machine learning models trained on data from those generative models may be improved.
  • machine learning models in financial or cybersecurity applications may be particularly vulnerable to changing data distributions.
  • a bank's credit risk model may have been trained on historical data, but the historical data may not capture long-term macroeconomic variations.
  • Such a model may result in incorrect lending decisions when a new macroeconomic event occurs (e.g., an election of a political party with little track record of decisions, a global pandemic, civil unrest in various jurisdictions, and the like).
  • a cybersecurity threat detection model may be used to highlight suspicious behavior.
  • Because attack vectors are constantly evolving, a current attack vector may not have been represented in the cybersecurity threat detection model's training dataset, possibly resulting in false negatives and/or breaches of a secure environment.
  • machine learning models may benefit from varying the content of training datasets by reducing the overemphasis of a specific dataset while permitting a greater variety of scenarios to be encompassed within the training datasets.
  • Counterfactual datasets may also be used for testing use cases. In addition to being able to create machine learning models, the counterfactual datasets may be valuable for testing the performance of existing models against data that those models would not normally encounter in production. During the development of large-scale data processing systems (like databases or stream engines), these datasets may be used to simulate anticipated load patterns.
  • a system based on PGMs may be more user-friendly in terms of its input data requirements.
  • an initial PGM model may be learned from very little data, or be encoded by hand with the help of a subject matter expert in the relevant domain (e.g., a financial services domain or a cyber-security services domain).
  • An issue with merely enlarging an existing dataset for machine learning is that the distributions do not change. Enlarging a dataset replicates the same biases in the existing dataset and does not enhance the learning of the machine learning model but only reinforces the existing biases.
  • GANs are not tunable and are not able to be interpreted to determine what should be modified.
  • Sparse models, like PGMs and ABMs, are easier and more tractable to understand and manipulate, and are thus better suited for generating synthetic datasets ranging from factual to counterfactual.
  • a computer-implemented method may comprise receiving a source dataset, wherein the source dataset may comprise a plurality of records, wherein each record contains data arranged in a plurality of fields; determining one or more parameters for the plurality of fields based on the data of the records in the plurality of fields, wherein the parameters comprise one or more of statistical parameters or correlation parameters; storing the one or more parameters; generating a generative model of the source dataset, wherein the generative model may be configured to generate one or more generated datasets having the one or more parameters; generating, based on the generative model, a generated dataset comprising data arranged in the plurality of fields, wherein the generated dataset may be a synthetic dataset; and outputting the generated dataset.
  • the generated dataset may further comprise data resulting from tuning of the generative model to have a determined variation from one or more of the parameters.
  • the method may further comprise receiving a request for generating a generated dataset; receiving a desired parameter; modifying, based on the desired parameter, the generative model; and generating, based on the modified generative model, a second generated dataset, wherein the second generated dataset may be a synthetic dataset.
  • the method may further comprise receiving, from a user's computing device, a selection of the source dataset, wherein the outputting may comprise sending the generated dataset to the user's computing device.
  • the outputting may further comprise training, based on the generated dataset, a predictive model; and generating one or more predictions based on a second source dataset using the trained predictive model.
  • the method may further comprise receiving user input modifying one or more of the statistical parameters; modifying, based on the modified one or more statistical parameters, the generative model; generating, based on the modified generative model, a second generated dataset; and outputting the second generated dataset.
  • the method may further comprise receiving user input modifying one or more correlation parameters; modifying, based on the modified one or more correlation parameters, the generative model; generating, based on the modified generative model, a second generated dataset; and outputting the second generated dataset.
  • the statistical parameters may be a distribution parameter of one of the plurality of fields of the true-source dataset and comprise one of a normal distribution, uniform distribution, lognormal distribution, Poisson distribution, exponential distribution, beta distribution, gamma distribution, binomial distribution, multinomial distribution, Dirichlet distribution, Bernoulli distribution, chi-squared distribution, Student's t distribution, F distribution, Benford distribution, power distribution, or triangular distribution.
  • the statistical parameters may comprise a minimum, maximum, mean, mode, standard deviation, symmetry, skewness, or kurtosis.
  • the correlation parameters may comprise a degree of correlation between two or more fields of the source dataset.
  • the generative model may comprise a probabilistic graphical model having two or more nodes and one or more edges, wherein at least one of the two or more nodes may be based on the one or more statistical parameters, wherein the one or more edges may be based on the one or more correlation parameters, wherein one of the one or more of the statistical parameters may be a first distribution parameter of one of the plurality of fields of the source dataset.
  • the method may further comprise receiving, from a user's computing device, a second distribution parameter; modifying, based on the receiving, a node of the generative model corresponding to the first distribution parameter to include the second distribution parameter; generating, based on the modified generative model, a second generated dataset; and sending the second generated dataset to the user's computing device.
  • the generative model may comprise a probabilistic graphical model having two or more nodes and one or more edges, at least one of the two or more nodes may be based on the one or more statistical parameters, wherein the one or more edges may be based on the one or more correlation parameters, and wherein one of the one or more of the statistical parameters may be a distribution parameter of one of the plurality of fields of the source dataset.
  • the method may further comprise determining, based on one of the second plurality of fields of the generated dataset, a second distribution parameter; comparing the second distribution parameter with the distribution parameter; modifying, based on the comparing, a node of the generative model, corresponding to the first distribution parameter, to include the modified distribution parameter; and generating, based on the modified generative model, a second generated dataset.
  • the generative model may comprise a probabilistic graphical model having two or more nodes and one or more edges, at least one of the two or more nodes may be based on the one or more statistical parameters, the one or more edges may be based on the one or more correlation parameters, and wherein one of the one or more of the statistical parameters may be a first statistical parameter of one of the plurality of fields of the source dataset.
  • the method may further comprise receiving, from a user's computing device, a second statistical parameter; modifying, based on the receiving, a node of the generative model, corresponding to the first statistical parameter, to include the second statistical parameter; generating, based on the modified generative model, a second generated dataset; and sending the second generated dataset to the user's computing device.
  • the generative model may comprise a probabilistic graphical model having two or more nodes and one or more edges, wherein at least one of the two or more nodes may be based on the one or more statistical parameters, and wherein the one or more edges may be based on the one or more correlation parameters.
  • the method may further comprise determining, based on one of the second plurality of fields of the generated dataset, a second statistical parameter; comparing the second statistical parameter with one of the one or more statistical parameters; modifying, based on comparing the second statistical parameter with the statistical parameter, a node of the generative model corresponding to the first statistical parameter, to include a modified statistical parameter; and generating, based on the modified generative model, a second generated dataset.
  • the method may further comprise receiving, from a user's computing device, a second correlation parameter; modifying, based on the receiving, an edge of the generative model, corresponding to the one or more correlation parameters, to include the second correlation parameter; generating, based on the modified generative model, a second generated dataset; and sending the second generated dataset to the user's computing device.
  • an apparatus may comprise one or more processors; and memory storing instructions that, when executed by the one or more processors, cause the apparatus to receive a source dataset, wherein the source dataset may comprise a plurality of records, wherein each record contains data arranged in a plurality of fields; determine one or more statistical parameters for the plurality of fields based on the data of the records in the plurality of fields; determine one or more correlation parameters based on a correlation between data in the plurality of records in two or more fields of the plurality of fields of the source dataset; store the one or more statistical parameters and the one or more correlation parameters; generate a generative model of the source dataset, wherein the generative model may be configured to generate one or more generated datasets having the one or more statistical parameters and the one or more correlation parameters; cause display of a graphical interface of the generative model, wherein the graphical interface may be configured to display the one or more statistical parameters and the one or more correlation parameters; receive user interactions with graphical interface, wherein the user interactions may be to modify
  • the generative model may comprise a probabilistic graphical model having two or more nodes and one or more edges. At least one of the two or more nodes may be based on the one or more statistical parameters. One or more edges may be based on the one or more correlation parameters.
  • the instructions may further cause the receiving of user interactions to receive modifications of a statistical parameter node of the generative model, cause the modification of the statistical parameter node of the two or more nodes of the generative model, and cause the generation of, based on the modified statistical parameter node of the two or more nodes of the generative model, a second generated dataset.
  • one or more non-transitory media storing instructions that, when executed by one or more processors, cause the one or more processors to perform steps comprising receiving a source dataset, wherein the source dataset may comprise a plurality of records, wherein each record contains data arranged in a plurality fields; determining one or more statistical parameters for the plurality of fields based on the data of the records in the plurality of fields; determining one or more correlation parameters based on a correlation between data in the plurality of records in two or more fields of the plurality of fields of the source dataset; storing the one or more statistical parameters and the one or more correlation parameters; generating a generative model of the source dataset, wherein the generative model may be configured to generate one or more generated datasets having the one or more statistical parameters and the one or more correlation parameters; modifying, based on received inputs adjusting one or more of the statistical parameters or the correlation parameters, the generative model to include one or more of a modified statistical parameter or a modified correlation parameter; generating, based
  • In addition to probabilistic graphical models, synthetic data, ranging from factual data to counterfactual data, may be generated through agent-based models (ABMs).
  • Conventional agent-based models define parameters of the agents and actions performed by the agents in the definition of each agent.
  • Agents and behaviors are composed of probability distribution definitions and together are used to form a simulation specification. The definitions of the agents and behaviors are separate from the simulation of the agents and behaviors. By splitting the two, improved modeling of possible events (e.g., economic events and the like) may be achieved.
  • a set of behaviors may be modified to account for possible economic events before adding in existing agents.
  • Examples of attributes for a first agent definition may include both attributes that, when sampled, generate a specific value (e.g., a specific credit limit for a first instance of the first agent definition) and attributes that generate a distribution to be sampled during each step of a simulation (e.g., a propensity to pay a credit card balance (partial or full), when to pay (soon after receiving a statement, or at or after the due date), and how often to pay (e.g., making two or more payments per month)).
  • the specific value is, unless modified during an action, generally regarded as fixed for that first instance of the first agent and the distribution is generally regarded as varying per simulation step following the distribution pattern identified for that first instance's attribute.
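The distinction above, attributes fixed at instantiation versus attributes re-sampled at every simulation step, might be sketched as follows. All attribute names and probabilities are illustrative assumptions:

```python
# Hypothetical sketch: an agent *definition* is made of distributions only.
# Instantiation samples the fixed attributes once; per-step attributes stay
# as distributions to be re-sampled at each simulation step.
import random

rng = random.Random(7)

cardholder_definition = {
    "credit_limit": lambda: rng.choice([1000, 5000, 10000]),  # fixed on creation
    "pays_this_step": lambda: rng.random() < 0.8,             # re-sampled per step
}

def instantiate(definition):
    """Sample fixed attributes to specific values; keep per-step attributes callable."""
    return {
        "credit_limit": definition["credit_limit"](),    # now a specific value
        "pays_this_step": definition["pays_this_step"],  # still a distribution
    }

agent = instantiate(cardholder_definition)
payments = [agent["pays_this_step"]() for _ in range(12)]  # 12 simulation steps
print(agent["credit_limit"], sum(payments))
```

Note that the definition itself contains no sampled values, which is what makes it reusable across different simulations.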
  • the agent probability definitions may be robustly defined independent of an intended simulation, thus making the agent probability definitions extensible beyond the originally intended simulation.
  • An agent probability definition may, for some attributes or behaviors, define an agent independent of other agents. Other attributes or behaviors may be tied to the actions of other agents or behaviors. For example, an agent probability definition for a home buyer may be linked to an agent probability definition for a home seller, and/or to an agent probability definition for a loan agent/bank.
  • an economy may be represented as an agent probability definition with other agent probability definitions associated with it. This permits the other agent probability definitions to tie to a common economic state and to generate synthetic datasets representing that common economic state.
  • fields of desired synthetic data may be specified as an input to a simulation component. Based on the desired fields, the synthetic data may be generated for those fields. The generated synthetic data may comprise some or all of the state information generated during each step of the simulation. By permitting the identification of desired fields of synthetic data, the system permits a greater degree of flexibility compared to systems that have unalterable identifications of the fields of the data to be generated.
  • Agent-based models may be useful for users who, while having a level of knowledge in a domain and seeking to use real data (true-source data), may not have access to enough real data, or any at all.
  • An agent-based model may address this lack of data by capitalizing on the users' knowledge of the domain to tune the agents and behaviors to generate the desired synthetic data. Further, an agent-based model may be helpful where existing available data does not cover all scenarios in which the users are interested. An agent-based model may address this lack of scenario-specific data by capitalizing on the users' knowledge of the domain to tune the agents and behaviors to generate the scenario-specific synthetic data. Further, as the probability distribution definitions of the agents and behaviors are distinct from the simulation, users create agents and behaviors that are extensible beyond a given simulation of a domain.
  • the ABM samples the simulation specification to generate instances of agents performing actions.
  • the simulation specification may be run as one or more simulation steps to simulate actions taken by the instances of the agents over time.
  • the time may cover a given period (e.g., less than a year, 10 years, 50+ years) or until a goal is achieved (e.g., 30% home ownership for a given age group), or any interval as desired.
  • the users may model a domain of interest as a set of agents and execute a simulation of a process of interest in order to generate synthetic data similar to what would likely have been observed if the real process had occurred.
  • An example of such a use case may include the simulation of credit card payments under different economic conditions, including hypothetical recessions never experienced before.
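A minimal sketch of such a simulation, with a shared economic state influencing agent behavior at each step, is below. The recession effect and all probabilities are illustrative assumptions, not values from the specification:

```python
# Hypothetical sketch: simulate credit card payments under different
# economic conditions, with agents tied to a common economy state.
import random

rng = random.Random(1)

def make_cardholders(n):
    """Initialization phase: sample agent instances from their definitions."""
    return [{"balance": rng.uniform(500, 5000)} for _ in range(n)]

def step(agents, economy):
    """One simulation step: payment propensity depends on the shared economy."""
    pay_prob = 0.9 if economy == "strong" else 0.5  # recession halves propensity
    records = []
    for a in agents:
        paid = rng.random() < pay_prob
        records.append({"balance": round(a["balance"], 2), "paid": paid})
    return records

agents = make_cardholders(1000)
strong = step(agents, "strong")
recession = step(agents, "recession")  # counterfactual scenario
strong_rate = sum(r["paid"] for r in strong) / len(strong)
recession_rate = sum(r["paid"] for r in recession) / len(recession)
print(round(strong_rate, 2), round(recession_rate, 2))
```

The per-step records here are the state information that would be emitted as synthetic data, restricted to whichever fields the user requested.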
  • Agents may be referred to as having "composable” probability distributions, with their attributes composed as functions of simpler distributions.
  • “composable” refers to a type of object or process that may be combined with other objects or processes to make complex instances of the objects or processes.
  • a function may be composed of other functions.
  • a “composable probability distribution” may be a probability distribution that may be combined with other probability distributions to create a more complex probability distribution.
  • Simulations may also be referred to as a complex probability distribution composed of the simpler probability distributions of simulated behaviors.
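One way to read "composable" concretely is to treat samplers as plain functions that combine into more complex samplers. A sketch under that reading (all names and numbers are illustrative assumptions):

```python
# Hypothetical sketch of composable probability distributions: simple
# samplers are plain functions, and a mixture composes two of them.
import random

rng = random.Random(3)

uniform = lambda lo, hi: (lambda: rng.uniform(lo, hi))
bernoulli = lambda p: (lambda: rng.random() < p)

def mixture(p, dist_a, dist_b):
    """Compose two samplers into one: draw from dist_a with probability p."""
    coin = bernoulli(p)
    return lambda: dist_a() if coin() else dist_b()

# A "salary" distribution composed from two simpler uniform distributions.
salary = mixture(0.7, uniform(40_000, 80_000), uniform(80_000, 200_000))
samples = [salary() for _ in range(1000)]
print(min(samples) >= 40_000, max(samples) <= 200_000)
```

A whole simulation can then be viewed the same way: a complex sampler composed from the agents' and behaviors' simpler samplers.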
  • agents and behaviors may be specified precisely as probability distributions without having to sample any data or run the simulation.
  • a simulation state of the simulation may be executed by sampling, with a random number generator, the agent probability distribution definitions and their related behavior probability distribution definitions.
  • the definition of the probability distributions and the execution of the simulation (e.g., the sampling operation) may be kept separate.
  • The simulation specification may be written in a functional language, for example, Haskell.
  • "functional programming" may be described as a programming paradigm where programs are constructed by applying and composing functions.
  • Haskell, as an example of a functional programming language, may be used to define and to execute the simulation.
  • Haskell is described as a polymorphically statically typed, lazy, purely functional language. It is appreciated that other functional programming languages may be used in place of or in addition to Haskell.
  • the functional language may use one or more monads.
  • a "monad” may be considered a design pattern that allows structuring programs generically while automating away boilerplate code needed by the program logic.
  • Monads may achieve this goal by providing their own data type (a particular type for each type of monad), which represents a specific form of computation, along with one procedure to wrap values of any basic type within the monad (yielding a monadic value) and another to compose functions that output monadic values (called monadic functions).
  • each agent may be represented by a probability monad where the agent's probability monad is composed of individual attribute probability monads that describe the probability distribution definition for each attribute.
  • the behaviors of the instances of the agents may also be represented by monads, where each behavior monad is composed of monads representing the behaviors of each instance.
  • the set of all distributions may also be a monad.
  • the subset of probability distributions comprising the behaviors of the agent may also be monads
  • the elementary probability distributions used to define the behaviors may also be monads.
  • Complex monads may be composed from simpler monads, thus allowing complex distributions to be composed of less complex distributions.
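Although the specification's examples use Haskell, the probability-monad idea can be sketched in Python: `unit` wraps a value as a degenerate distribution and `bind` chains a distribution into a distribution-returning function, so a joint distribution is composed from simpler ones. This is a hypothetical illustration, not the specification's implementation:

```python
# Hypothetical sketch of a probability monad: unit wraps a value, bind
# composes distributions, and complex distributions are built from simple ones.
import random

class Dist:
    def __init__(self, sampler):
        self.sampler = sampler  # a zero-argument function returning a value

    @staticmethod
    def unit(x):
        """Wrap a plain value as a degenerate (always-x) distribution."""
        return Dist(lambda: x)

    def bind(self, f):
        """Compose with f: value -> Dist, yielding a new distribution."""
        return Dist(lambda: f(self.sampler()).sampler())

    def sample(self, n):
        return [self.sampler() for _ in range(n)]

rng = random.Random(5)
bernoulli = lambda p: Dist(lambda: rng.random() < p)

# Compose: whether the sprinkler is on depends on whether it is raining.
rain = bernoulli(0.2)
joint = rain.bind(lambda r: bernoulli(0.01 if r else 0.4)
                  .bind(lambda s: Dist.unit((r, s))))

draws = joint.sample(1000)  # sampling happens only here, not at definition
print(len(draws))
```

As in the Haskell discussion above, `joint` remains a pure distribution until it is sampled; defining and sampling are separate operations.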
  • FIGs. 21 and 22 show an example of an agent-based model defined using Haskell to define a Bayesian network as a composition of probability distributions.
  • an example is provided relating to a probability of grass being wet based on the following statement "grass may be wet because it is raining outside or because sprinklers are on.” The probability that it is raining is independent but the probability that the sprinklers are on depends on whether it is raining. The probability that the grass is wet depends on both whether it is raining and whether the sprinklers are on.
  • the model is defined as a composition of probability distributions. That definition may be sampled independently of the definition itself.
  • In FIG. 21, the probability model is defined using a Bayesian probability monad, the types of each of the nodes in the network are declared, and the type for the joint distribution is declared.
  • a distribution monad for whether it is raining is included.
  • a conditional distribution monad for whether the sprinkler is on, given that one knows whether or not it is raining, is included.
  • a joint distribution monad, composed from other distribution monads, is included. At this point, the joint distribution monad is a distribution monad as no sampling has occurred.
  • the function to sample n times from any Bayesian monad is set.
  • the output is no longer a Bayesian monad but a list of items sampled from the Bayesian monad using, for instance, a random number generator.
  • the sampling is used to generate sample data. For instance, the list of Rain items may be sampled multiple times.
  • the list of Joint items may be sampled multiple times. The resulting distribution may be found from combining the results from the samplings.
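The rain/sprinkler/wet-grass network and its repeated sampling can be sketched as follows. This is a hedged Python analogue of the Haskell in FIGs. 21 and 22 (which is not reproduced here); the specific probabilities are the standard textbook values for this example and are assumed, not taken from the patent.

```python
import random

def sample_joint():
    """Draw one (rain, sprinkler, wet) sample from the composed model."""
    rain = random.random() < 0.2                       # P(rain) is independent
    p_sprinkler = 0.01 if rain else 0.4                # sprinkler depends on rain
    sprinkler = random.random() < p_sprinkler
    p_wet = {(True, True): 0.99, (True, False): 0.90,  # wet depends on both
             (False, True): 0.80, (False, False): 0.0}[(sprinkler, rain)]
    wet = random.random() < p_wet
    return rain, sprinkler, wet

def sample_n(n):
    """Sample n times from the joint model, as the sampling function does."""
    return [sample_joint() for _ in range(n)]

random.seed(0)
samples = sample_n(20_000)
# The resulting distribution is found by combining the sampled results.
p_wet_est = sum(wet for _, _, wet in samples) / len(samples)
```

Combining the samples recovers the joint distribution empirically: with these assumed probabilities the analytic P(wet) is about 0.448, and `p_wet_est` converges toward it as the number of samples grows.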
  • agent probability distribution definitions may be sampled to generate agent instances during an initialization phase (the simulation state), and simulation step distributions may be sampled during simulation steps.
  • synthetic data may be generated. This data may be stored for future download or streamed in real-time, depending on user needs.
  • the code to define the simulation may be an interpreted subset of the programming language or may be a simplified domain-specific language to encode the simulation specification.
  • the definitions of the agents and definitions of the behaviors may be stored in the same or different codebases.
  • An agent-based model may be deployed locally and/or across a network (e.g., in the cloud).
  • the agent-based model may simulate what would happen to credit card defaults when the economy is in recession.
  • the user may be an economist attempting to train machine learning models to predict credit card defaults but lacking enough recession data to train the models. For example, while significant data may exist for credit card defaults occurring during strong economies, there may be a lack of data for credit card defaults during economic recessions.
  • because recessions may occur due to various factors, a robust machine learning model may benefit from being trained with data from multiple recessions, including data from recessions that have, in fact, occurred (e.g., actual (true-source) data or factual synthetic data) and data from recessions that have not occurred (e.g., counterfactual synthetic data).
  • the economist, in this example, may know how to define various types of recessions that have not yet occurred.
  • the economist may build a micro-level model to generate macro-level aggregate data (factual synthetic data) that matches existing historical data, adjusting the agents and/or behaviors as desired.
  • the economist may adjust the ABM to emulate other types of recessions that have not, in fact, occurred.
  • the economist may generate counterfactual datasets corresponding to those other recessions. Those counterfactual datasets may be combined with one or more of the actual data or the factual synthetic data. The economist may then use the combined data to train and evaluate the predictive machine learning model. The trained machine learning model may then be deployed to make predictions based on new data.
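The combine-then-train workflow above can be sketched minimally. All numbers, names, and the threshold "model" below are invented stand-ins (a real workflow would use a proper machine learning library); the point is only the flow: actual data plus factual and counterfactual synthetic data are combined into one training set.

```python
# Each row: (unemployment_rate, defaulted). All values are hypothetical.
actual = [(4.0, 0), (5.0, 0), (6.5, 1)]
factual_synthetic = [(4.5, 0), (6.0, 1)]          # matches observed recessions
counterfactual_synthetic = [(9.0, 1), (11.0, 1)]  # recessions never observed

training_data = actual + factual_synthetic + counterfactual_synthetic

def fit_threshold(rows):
    """Trivial stand-in for a predictive model: pick the unemployment
    threshold that minimizes misclassified defaults on the training rows."""
    best, best_err = None, float("inf")
    for t, _ in rows:
        err = sum((x >= t) != bool(y) for x, y in rows)
        if err < best_err:
            best, best_err = t, err
    return best

threshold = fit_threshold(training_data)
predict = lambda x: int(x >= threshold)  # deploy: predict on new data
```

The counterfactual rows extend the model's reach to regimes (9–11% unemployment) absent from the actual data, which is the benefit the economist is after.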
  • FIGs. 8A, 8B, and 8C depict examples of probability distribution definitions and various simulation parameters.
  • FIG. 8A depicts an example of an agent probability distribution definition 801 that includes both attributes 802, 803, 804, 805, and 806 and behaviors 807, 808, 809, 810, and 811.
  • FIG. 8A includes definitions of the attributes and the behaviors in the agent probability distribution definition 801. Each attribute may be associated with no behaviors or one or more behaviors. Attribute 802 is not associated with any behavior. Similarly, behavior 807 is not associated with any attribute. Attribute 803 is associated with behavior 808. Behavior 809 is associated with attributes 804 and 805. Attribute 806 is associated with behaviors 810 and 811.
  • FIG. 8B depicts an example of a first agent probability distribution definition 812, a second agent probability distribution definition 813, and a separate behavior probability distribution definition 814.
  • the first agent probability distribution definition 812 comprises attributes 815, 816, and 817 and the second agent probability distribution definition 813 comprises attributes 818 and 819.
  • the behavior probability distribution definition 814 comprises behaviors 820, 821, 822, 823, and 824. Each attribute may be associated with no behaviors or one or more behaviors. Attribute 815 is not associated with any behavior. Similarly, behavior 820 is not associated with any attribute. Attribute 816 is associated with behavior 821.
  • Behavior 822 is associated with attributes 817 of the first agent probability distribution definition 812 and with attribute 818 of the second agent probability distribution definition 813.
  • Attribute 819 of the second agent probability distribution definition 813 is associated with behaviors 823 and 824. Further, the first and second agent probability distribution definition 812 and 813 may be associated with each other (e.g., one using state information from the other to perform an action associated with a behavior) as shown by the dashed line connecting the agent probability distribution definitions 812 and 813.
  • FIG. 8C depicts an example of desired synthetic data to be produced by the agent-based model.
  • the desired synthetic data 825 comprises one or more fields (represented in FIG. 8C as fields 826, 827, and 828) for which a synthetic dataset is requested to be generated by the simulation of agent probability distribution definitions and behavior probability distribution definitions.
  • the request may be sent by a user of a cloud-based service to the system generating the synthetic datasets.
  • FIGs. 9A, 9B, 9C, and 9D depict state diagrams for conducting agent-based model simulations.
  • the probability distribution definitions of FIGs. 8A-8C may be combined together to form a simulation specification.
  • the simulation specification may be used, with instantiation data, to instantiate instances of agents who are defined in the simulation specification by sampling the simulation specification with a random number generator, resulting in a simulation state. That simulation state may be iteratively sampled, using the random number generator, to perform actions defined in behaviors associated with the instantiated agents. Each sampling of the simulation state may be a simulation step.
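The instantiate-then-iterate pattern just described can be sketched as follows. This is an illustrative sketch, not the patent's implementation: the attribute distributions (income, home ownership) and the home-buying behavior are hypothetical placeholders.

```python
import random

def make_agent(rng):
    """Initialization phase: instantiate an agent instance by sampling the
    attribute probability distribution definitions."""
    return {"income": rng.gauss(50_000, 10_000),
            "owns_home": rng.random() < 0.3}

def step(agent, rng):
    """Simulation step: sample the behavior probability distribution
    associated with the agent instance and perform the resulting action."""
    if not agent["owns_home"] and rng.random() < agent["income"] / 500_000:
        agent["owns_home"] = True            # behavior: buy a home
    return agent

rng = random.Random(42)
# Sampling the specification yields the simulation state (100 instances).
state = [make_agent(rng) for _ in range(100)]
# Each iteration re-samples the state, producing one simulation step.
history = []
for _ in range(10):
    state = [step(a, rng) for a in state]
    history.append(sum(a["owns_home"] for a in state))
```

The per-step counts collected in `history` stand in for the synthetic data fields a real run would record from each stored simulation step.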
  • FIG. 9A includes agent probability distribution definitions 901 (for example, agent probability distribution definition A and agent probability distribution definition B) and behavior probability distribution definitions 902 (for example, behavior probability distribution definition J, behavior probability distribution definition K, and behavior probability distribution definition L).
  • Data 903A relating to the quantity of instances per agent probability distribution definition and desired synthetic data 903B may also be available.
  • the combination of agent probability distribution definitions 901, behavior probability distribution definitions 902 (if separate from 901), instance data 903A, and desired synthetic data 903B may collectively be the simulation specification.
  • the attributes of the agent probability distribution definitions 901 are sampled using a random number generator 905 for the quantity of instances identified in instance data 903A (e.g., two instances of agent probability distribution definition A and three instances of agent probability distribution definition B).
  • the desired synthetic data 903B may also be used to create the simulation state.
  • agent A1 907 represents a first instantiation of agent probability distribution definition A and contains values and parameters.
  • agent A2 was removed and agent B4 was added.
  • Agent B4 918 is the fourth instance of an agent based on the agent probability distribution definition B.
  • the synthetic data 919 may be stored and sent at a later time or streamed to the entity requesting the synthetic data.
  • agent A3 was added and agents B1 and B4 were removed.
  • Agent A3 922 is the third instance of an agent based on the agent probability distribution definition A.
  • the synthetic data 925 may be stored and sent at a later time or streamed to the entity requesting the synthetic data.
  • In FIG. 9D, relationships between various agent instances are shown.
  • the simulation step of agent B2'' is based on the simulation steps of agent instances B1' and B2'.
  • the simulation step of agent instance B3' is based on the simulation step of agent B3.
  • New agent instance A3'' is based on the simulation steps of instances A1', B2', and B4'.
  • FIG. 10 depicts a flowchart of an execution of an agent-based model simulation.
  • simulation definition information and other information is retrieved.
  • agent probability distribution definitions are received.
  • the quantities of agents to be instantiated per agent probability distribution definition are received. If specified separately from the agent probability distribution definitions, the behavior probability distribution definitions are received in step 1003.
  • the desired fields for synthetic data are received.
  • the process may repeat for a set number of iterations, until a given result is obtained (e.g., 30% home ownership), or the simulation reaches a steady state (no significant changes from a previous state - e.g., 99% of the collected states not changing between steps).
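The three stopping criteria named above (iteration cap, target result, steady state) can be sketched in one driver function. This is an illustrative sketch; the function name, signature, and the 99%-unchanged steady-state test follow the example in the text but are otherwise assumptions.

```python
def run_simulation(step_fn, state, max_iters=1_000,
                   target=None, steady_frac=0.99):
    """Advance the simulation until: a target predicate holds (e.g., 30%
    home ownership), a steady state is reached (at least steady_frac of
    agent states unchanged between steps), or max_iters elapses."""
    for i in range(max_iters):
        new_state = step_fn(state)
        unchanged = sum(a == b for a, b in zip(state, new_state))
        if target is not None and target(new_state):
            return new_state, i + 1, "target"
        if unchanged >= steady_frac * len(state):
            return new_state, i + 1, "steady"
        state = new_state
    return state, max_iters, "max_iters"
```

For example, a step function that increments each agent's state up to a cap reaches a steady state once every value hits the cap, while a `target` predicate can end the run as soon as a desired aggregate (such as a total) is observed.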
  • FIG. 11 depicts another example flowchart of an execution of an agent-based model simulation.
  • simulation specification and other information is retrieved.
  • agent probability distribution definitions are received.
  • the quantities of agents to be instantiated per agent probability distribution definition are received. If specified separately from the agent probability distribution definitions, the behavior probability distribution definitions are received in step 1103.
  • the desired fields for synthetic data are received.
  • the simulation specification 1100 is sampled to generate the simulation state. As no previous step of the simulation exists, the simulation state is generated based on the probability distribution definitions and other data of the simulation specification 1100.
  • the process may repeat (next simulation steps) for a set number of iterations, until a given result is obtained (e.g., 30% home ownership), or the simulation reaches a steady state (no significant changes from a previous state - e.g., 99% of the collected states not changing between steps).
  • the stored synthetic dataset may be sent to a user.
  • the generated predictions may be sent (e.g., to the above user or a different user) in step 1110.
  • the synthetic dataset may be used to train a machine-learning model in step 1114 and the trained machine-learning model used to generate predictions in step 1115 based on new true-source data.
  • the system may receive instructions to add a new agent probability distribution definition and/or a new behavior probability distribution definition.
  • the new agent and/or new behavior probability distribution definition may be added to the simulation specification 1100 for the new generation of a specification state.
  • In step 1113, instructions may be received to modify one or more existing agent probability distribution definitions and/or behavior probability distribution definitions and/or instantiation parameters and/or desired synthetic data fields. Based on the information received in step 1113, the corresponding agent probability distribution definitions and/or behavior probability distribution definitions and/or instantiation parameters and/or desired synthetic data fields are modified in step 1116 and the modified simulation specification 1100 used for generation of a new simulation state and subsequent simulation steps.
  • FIG. 12A depicts a flowchart of a process of modifying an agent-based model.
  • In step 12010, agent/behavior probability distribution definitions and/or instantiation parameters and/or desired synthetic data fields are received.
  • a user interface is generated in step 12020.
  • In step 12030, user interactions with the user interface are received.
  • In step 12040, the agent/behavior probability distribution definitions and/or instantiation parameters and/or desired synthetic data fields are modified based on the user interactions of step 12030.
  • FIG. 12B depicts a user interface for modifying an agent-based model.
  • the user interface 1201 may comprise a quantity of regions including a region 1202 permitting selection of an agent probability distribution definition (e.g., agent A probability distribution definition 1204, agent B probability distribution definition 1206, and agent X probability distribution definition 1208) and the quantity of instantiations for the selected agent probability distribution definition to be set (e.g., quantity of instantiations for agent A's probability distribution definition 1205, quantity of instantiations for agent B's probability distribution definition 1207, and/or quantity of instantiations for agent X's probability distribution definition 1209).
  • the user interface 1201 may comprise a region 1203 permitting selection of a behavior probability distribution definition and selectively enabling/disabling that behavior (e.g., region 1217 permitting selection of behavior probability distribution definition J and enable/disable region 1218, region 1219 permitting selection of behavior probability distribution definition K and enable/disable region 1220, and region 1221 permitting selection of behavior probability distribution definition Y and enable/disable region 1222).
  • the user interface 1201 may comprise a region 1210 permitting modification of a selected agent/behavior's probability distribution definition.
  • Region 1210 may comprise a region 1211 for receiving a user's modification of an attribute parameter of the selected agent's probability distribution definition and a region 1212 for receiving the user's modification of a behavior probability distribution definition.
  • Region 1212 may additionally or alternatively permit linking, or breaking a link between, the selected behavior probability distribution definition and an agent probability distribution definition, such that instantiated agents perform the linked behaviors during simulation.
  • where a behavior probability distribution definition comprises one or more parameters that define it, or where each behavior probability distribution definition is comprised of separate actions (that collectively make up the behavior probability distribution definition), the user interface may further comprise a region 1223 that receives user input for modification of the action or the behavior parameter.
  • the user interface 1201 may further comprise a region 1213 for accepting user input for defining a new agent probability distribution definition.
  • Region 1213 may comprise a region 1214 for receiving user input for setting a new attribute probability distribution parameter and a region 1215 for receiving user input for setting a new behavior probability distribution parameter and/or linking the new behavior probability distribution definition with an agent probability distribution definition.
  • the user interface 1201 may further comprise a region 1224 for accepting user input for modifying the fields to be populated with synthetic data for a generated synthetic dataset.
  • Applications of the synthetic data generated by the ABM may include the generation of a dataset when there is no true-source data available. Some datasets of potential interest may not exist anywhere, or are not easily accessible. For example, data on customer behavior under different types of recessions does not exist for recession types that have not occurred. In those instances, to generate relevant data, the ABM may permit a user to simulate customers and simulate behaviors relevant to one or more recessions.
  • applications of the synthetic data generated by the ABM may include the simulation of rare events to augment an existing dataset.
  • applications of the synthetic data generated by the ABM may include the generation of data with a distribution that changes over time.
  • Most generative statistical and machine learning models assume that the data is independently and identically distributed. However, in reality that is rarely the case. For example, spending habits of an individual may vary seasonally, with technological innovation, with life stage, with advertising, and even with mood. Modeling each of these variations in spending habits in a mathematical model might be intractable. However, using an ABM, the variations in spending habits may be obtained by simulating probability distributions while enabling arbitrary complexity to be included in the definition of agents and/or behaviors, without having to specify how the model is executed.
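A distribution whose parameters drift over time, breaking the i.i.d. assumption, can be sketched as follows. The seasonal shape, means, and variances here are invented for illustration; the point is only that each month's sample is drawn from a different distribution.

```python
import math
import random

def spending_distribution(month):
    """Mean spending varies seasonally, so samples are not identically
    distributed across time (peak at month 0, trough at mid-year)."""
    seasonal = 1.0 + 0.5 * math.cos(2 * math.pi * month / 12)
    return lambda rng: max(0.0, rng.gauss(100 * seasonal, 20))

rng = random.Random(7)
# One year of monthly spending samples, each from a shifted distribution.
year = [spending_distribution(m)(rng) for m in range(12)]
# The underlying (non-constant) means, showing the distribution drift.
means = [100 * (1.0 + 0.5 * math.cos(2 * math.pi * m / 12)) for m in range(12)]
```

A generative model that assumed a single fixed distribution would miss this drift, whereas an ABM simply samples the month-dependent distribution at each simulation step.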
  • applications of the synthetic data generated by the ABM may include the training of reinforcement learning agents in a realistic environment.
  • Reinforcement learning agents that learn from interacting with their environment are particularly suited to learn from simulations.
  • because reinforcement learning agents learn from interacting with their environment, increasing the size and complexity of that environment by including examples that rarely occur in the real world permits learning that would not otherwise be possible.
  • One example may include a reinforcement learning agent that learns new ways to commit fraud in a simulation environment. This may allow a company's fraud team to predict potential new fraud vectors and prepare for them before they actually occur in real life.
  • applications of the synthetic data generated by the ABM may be used to define a granular model to explain some aggregate data.
  • a dataset includes summary data, but users may need to understand from where the data originated.
  • ABM simulations may provide the ability to identify the origin of the data by permitting the user to iterate over simple models, and gradually add complexity until the aggregate data matches the distribution of the original dataset. By the step-wise addition of complexity, the user learns how the aggregate data changes based on the user's changes.
  • an ABM may define a simulation specification separately from the execution of the simulation.
  • a simulation definition language that enables the simulation of the ABM may use two monads: a simulation step sequencing monad and a probability distribution monad.
  • the probability distribution monad permits one to compose probability distributions, enabling arbitrary complexity in the definition of agents and behaviors, without having to specify details regarding the execution of the simulation.
  • the probability distribution monad may be used to compose distribution definitions
  • the simulation monad may be used to compose simulation steps. This use of two monads may provide users the flexibility of a general-purpose language, while limiting them to only define a simulation and leaving the execution to the engine behind the simulation.
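The simulation step sequencing monad can be sketched in state-monad style, separate from the probability distribution monad above. This is an illustrative Python analogue, not the patent's Haskell: a "step" is a function from a simulation state to a (result, new state) pair, and `bind_step` sequences steps while threading the state, leaving execution details to the engine.

```python
# A simulation "step" is modeled as: state -> (result, new_state).
def unit_step(value):
    """Wrap a value as a step that leaves the simulation state unchanged."""
    return lambda state: (value, state)

def bind_step(step, f):
    """Sequence two steps: run `step`, feed its result to `f`, then run
    the step `f` returns against the updated state."""
    def composed(state):
        result, new_state = step(state)
        return f(result)(new_state)
    return composed

# Example steps operating on a state that logs simulated events.
def record_event(name):
    return lambda state: (name, state + [name])

# Compose an initialization step with a simulation step.
program = bind_step(record_event("instantiate"),
                    lambda _: record_event("step"))
last, final_state = program([])
```

Users compose `program` purely as a description; nothing runs until the engine applies it to an initial state, mirroring the separation of simulation definition from execution.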
  • a computer-implemented method may receive a simulation specification comprising an agent having a probability distribution definition, the agent probability distribution definition comprising attribute probability distribution definitions and identifying one or more behaviors to be simulated; receive one or more instantiation parameters; generate, using the simulation specification, a simulation state of an agent-based model, the generating comprising instantiating, via sampling using a random number generator to sample probability distribution definitions of attributes of the agent probability distribution definition, an agent instance comprising first attributes; store the simulation state; simulate, based on the simulation state and the simulation specification, a simulation step comprising performing, via sampling using the random number generator to sample a probability distribution definition of the one or more behaviors associated with the agent instance, an action for the agent instance; store the simulation step; generate, based on the stored simulation step, a synthetic dataset; and output the synthetic dataset.
  • the simulation specification may further comprise a second agent having a second agent probability definition comprising second attribute probability distribution definitions and identifying one or more second behaviors to be simulated
  • the generating the simulation state may further comprise instantiating, via sampling using the random number generator to sample the second attribute probability distribution definitions, a second agent instance comprising second attributes
  • the simulating the simulation step may further comprise performing, via sampling using the random number generator to sample a second probability distribution definition of the one or more behaviors associated with the second agent instance, a second action for the second agent instance.
  • the outputting may comprise training, based on the synthetic dataset, a predictive machine-learning model; and generating, using the trained predictive model, one or more predictions based on a true-source dataset.
  • the method may further comprise receiving, before generating the simulation state of the agent-based model, an identification of synthetic data fields, wherein the storing the synthetic data is based on the identification of the synthetic data fields.
  • the generating the synthetic dataset may further comprise iteratively simulating additional simulation steps of the agent.
  • the generating the synthetic dataset may be based on the additional simulation steps.
  • the generated synthetic dataset may comprise synthetic data, of the agent instance, from two or more iterative simulation steps.
  • the outputting may comprise streaming, per simulation step, the synthetic dataset.
  • Additional instructions may be received to modify a quantity of the agent instances to be generated in the simulation state and the method may regenerate, based on the modified quantity of agent instances, the simulation state, and the regenerated simulation state may comprise a count of agent instances corresponding to the received modified quantity.
  • the performing the action for the agent instance may further comprise performing, via sampling using the random number generator to sample the probability distribution definition of the one or more behaviors associated with the agent instance and via sampling using the random number generator to sample a second probability distribution definition of a second behavior associated with a second agent instance, the action for the agent instance.
  • the method may further comprise iteratively simulating, based on the simulation step and the simulation state, additional simulation steps, wherein, in the additional simulation steps, a second agent instance may be instantiated.
  • the agent probability distribution definition may comprise a probability monad, the probability monad may comprise attribute probability monads, and the probability monad may be a complex probability distribution composed of attribute probability distributions of the attribute probability monads.
  • the simulating the agent-based model may comprise a simulation monad, the simulation monad may comprise behavior probability monads, and the simulation monad may be a complex probability distribution composed of behavior probability distributions of the behavior probability monads.
  • the behavior may comprise one or more actions that may comprise action probability distributions.
  • the behavior may be a complex probability distribution composed of the action probability distributions.
  • the one or more of the agent instance's attributes may comprise an attribute value used in performing the action.
  • the agent's attributes may comprise an attribute probability distribution, and the performing the action may comprise sampling, using the random number generator, the attribute probability distribution.
  • the method may further comprise causing display of a graphical interface of the agent- based model, wherein the graphical interface is configured to display the agent's probability distribution definitions and the one or more behaviors; receiving user interactions with the graphical interface, wherein the user interactions are to modify a specific attribute of the agent or a specific behavior of the agent; and modifying, based on the received user interactions, the agent's probability distribution definition; storing, as part of the simulation specification, the modified agent's probability distribution definition, wherein generating the simulation state further comprises generating, using the simulation specification with the modified agent's probability distribution definition, the simulation state.
  • An apparatus may comprise one or more processors; and memory storing instructions that, when executed by the one or more processors, cause the apparatus to receive a simulation specification comprising an agent having a probability distribution definition, the agent probability distribution definition comprising attribute probability distribution definitions and identifying one or more behaviors to be simulated; cause display of a graphical interface of the agent-based model, wherein the graphical interface is configured to display the agent's probability distribution definitions and the one or more behaviors; receive user interactions with the graphical interface, wherein the user interactions are to modify a specific attribute of the agent or a specific behavior of the agent; modify, based on the received user interactions, the agent's probability distribution definition; store, as part of the simulation specification, the modified agent's probability distribution definition; receive one or more instantiation parameters; generate, using the simulation specification, a simulation state of an agent-based model, the generating comprising instantiating, via sampling using a random number generator to sample probability distribution definitions of attributes of the agent probability distribution definition, an agent instance comprising first attributes; store the simulation state; simulate, based on the simulation state and the simulation specification, a simulation step comprising performing, via sampling using the random number generator to sample a probability distribution definition of the one or more behaviors associated with the agent instance, an action for the agent instance; store the simulation step; generate, based on the stored simulation step, a synthetic dataset; and output the synthetic dataset.
  • One or more non-transitory media storing instructions that, when executed by one or more processors, may cause the one or more processors to perform steps comprising receiving a simulation specification comprising an agent having a probability distribution definition, the agent probability distribution definition comprising attribute probability distribution definitions and identifying one or more behaviors to be simulated; receiving one or more instantiation parameters; generating, using the simulation specification, a simulation state of an agent-based model, the generating comprising instantiating, via sampling using a random number generator to sample probability distribution definitions of attributes of the agent probability distribution definition, an agent instance comprising first attributes; storing the simulation state; simulating, based on the simulation state and the simulation specification, a simulation step comprising performing, via sampling using the random number generator to sample a probability distribution definition of the one or more behaviors associated with the agent instance, an action for the agent instance; storing the simulation step; generating, based on the stored simulation step, a synthetic dataset; and outputting the synthetic dataset, wherein the agent probability distribution definition comprises a probability monad, wherein the probability monad comprises attribute probability monads, and the probability monad is a complex probability distribution composed of attribute probability distributions of the attribute probability monads.
  • a computer-implemented method may comprise receiving a simulation specification comprising an agent having a probability distribution definition, the agent probability distribution definition comprising attribute probability distribution definitions and identifying one or more behaviors to be simulated; receiving one or more instantiation parameters; generating, using the simulation specification, a simulation state of an agent-based model, the generating comprising instantiating, via sampling using a random number generator to sample probability distribution definitions of attributes of the agent probability distribution definition, an agent instance comprising first attributes; storing the simulation state; simulating, based on the simulation state and the simulation specification, a simulation step comprising performing, via sampling using the random number generator to sample a probability distribution definition of the one or more behaviors associated with the agent instance, an action for the agent instance; storing the simulation step; generating, based on the stored simulation step, a synthetic dataset; and outputting the synthetic dataset.
  • the simulation specification further may comprise a second agent having a second agent probability definition comprising second attribute probability distribution definitions and identifying one or more second behaviors to be simulated.
  • the generating the simulation state further may comprise instantiating, via sampling using the random number generator to sample the second attribute probability distribution definitions, a second agent instance comprising second attributes.
  • the simulating the simulation step further may comprise performing, via sampling using the random number generator to sample a second probability distribution definition of the one or more behaviors associated with the second agent instance, a second action for the second agent instance.
  • the outputting may comprise training, based on the synthetic dataset, a predictive machine-learning model; and generating, using the trained predictive model, one or more predictions based on a true-source dataset.
  • the method may further comprise receiving, before generating the simulation state of the agent-based model, an identification of synthetic data fields, wherein storing the synthetic data may be based on the identification of the synthetic data fields.
  • the generating the synthetic dataset further may comprise iteratively simulating additional simulation steps of the agent. The generating the synthetic dataset may be based on the additional simulation steps.
  • the generated synthetic dataset may comprise synthetic data, of the agent instance, from two or more iterative simulation steps.
  • the outputting may comprise streaming, per simulation step, the synthetic dataset.
  • the method may further comprise receiving instructions to modify a quantity of the agent instances to be generated in the simulation state; and regenerating, based on the modified quantity of agent instances, the simulation state.
  • the regenerated simulation state may comprise a count of agent instances corresponding to the received modified quantity.
  • the performing the action for the agent instance further may comprise performing, via sampling using the random number generator to sample the probability distribution definition of the one or more behaviors associated with the agent instance and via sampling using the random number generator to sample a second probability distribution definition of a second behavior associated with a second agent instance, the action for the agent instance.
• the method may further comprise iteratively simulating, based on the simulation step and the simulation state, additional simulation steps.
  • a second agent instance may be instantiated.
  • the agent probability distribution definition may comprise a probability monad, the probability monad may comprise attribute probability monads, and the probability monad may be a complex probability distribution composed of attribute probability distributions of the attribute probability monads.
  • the simulation of the agent-based model may comprise a simulation monad, the simulation monad may comprise behavior probability monads, and the simulation monad may be a complex probability distribution composed of behavior probability distributions of the behavior probability monads.
  • the behavior may comprise one or more actions, the one or more actions may comprise action probability distributions, and the behavior may be a complex probability distribution composed of the action probability distributions.
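The monadic composition of distributions described above might be sketched as a minimal sampler monad, where `bind` chains dependent distributions into a complex one (all names and distributions are illustrative assumptions, not the specification's implementation):

```python
import random

class Prob:
    """Minimal probability monad: wraps a sampler; bind chains dependent samplers."""
    def __init__(self, sampler):
        self.sampler = sampler          # sampler: rng -> value
    def bind(self, f):
        # f: value -> Prob; composes a complex distribution from simpler ones
        return Prob(lambda rng: f(self.sampler(rng)).sampler(rng))
    def sample(self, rng):
        return self.sampler(rng)

# Compose: an age distribution feeds an income distribution, yielding a complex
# agent-level distribution built from attribute-level distributions.
age = Prob(lambda rng: rng.randint(18, 80))
agent = age.bind(lambda a: Prob(lambda rng: {"age": a,
                                             "income": a * 1000 + rng.random() * 500}))

rng = random.Random(0)
instance = agent.sample(rng)
```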
  • the agent instance's attributes may comprise an attribute value used in performing the action.
  • the agent's attributes may comprise an attribute probability distribution.
  • the performing the action further may comprise sampling, using the random number generator, the attribute probability distribution.
  • the method may further comprise causing display of a graphical interface of the agent-based model, wherein the graphical interface may be configured to display the agent's probability distribution definitions and the one or more behaviors; receiving user interactions with the graphical interface, wherein the user interactions may be to modify a specific attribute of the agent or a specific behavior of the agent; and modifying, based on the received user interactions, the agent's probability distribution definition; storing, as part of the simulation specification, the modified agent's probability distribution definition, wherein generating the simulation state further may comprise generating, using the simulation specification with the modified agent's probability distribution definition, the simulation state.
• an apparatus may comprise one or more processors; and memory storing instructions that, when executed by the one or more processors, cause the apparatus to receive a simulation specification comprising an agent having a probability distribution definition, the agent probability distribution definition comprising attribute probability distribution definitions and identifying one or more behaviors to be simulated; cause display of a graphical interface of the agent-based model, wherein the graphical interface may be configured to display the agent's probability distribution definitions and the one or more behaviors; receive user interactions with the graphical interface, wherein the user interactions may be to modify a specific attribute of the agent or a specific behavior of the agent; modify, based on the received user interactions, the agent's probability distribution definition; store, as part of the simulation specification, the modified agent's probability distribution definition; receive one or more instantiation parameters; and generate, using the simulation specification, a simulation state of an agent-based model, the generating comprising instantiating, via sampling using a random number generator to sample probability distribution definitions of attributes of the agent probability distribution definition, an agent instance comprising first attributes.
  • one or more non-transitory media storing instructions that, when executed by one or more processors, may cause the one or more processors to perform steps comprising receiving a simulation specification comprising an agent having a probability distribution definition, the agent probability distribution definition comprising attribute probability distribution definitions and identifying one or more behaviors to be simulated; receiving one or more instantiation parameters; generating, using the simulation specification, a simulation state of an agent-based model, the generating comprising instantiating, via sampling using a random number generator to sample probability distribution definitions of attributes of the agent probability distribution definition, an agent instance comprising first attributes; storing the simulation state; simulating, based on the simulation state and the simulation specification, a simulation step comprising performing, via sampling using the random number generator to sample a probability distribution definition of the one or more behaviors associated with the agent instance, an action for the agent instance; storing the simulation step; generating, based on the stored simulation step, a synthetic dataset; and outputting the synthetic dataset.
  • the agent probability distribution definition may comprise a probability monad.
  • the probability monad may comprise attribute probability monads.
  • the probability monad may be a complex probability distribution composed of attribute probability distributions of the attribute probability monads.
  • the simulation of the agent-based model may comprise a simulation monad.
  • the simulation monad may comprise behavior probability monads.
  • the simulation monad may be a complex probability distribution composed of behavior probability distributions of the behavior probability monads.
  • true-source data may exist but making the true-source data available may be prohibited by law and/or by corporate policies.
• HIPAA refers to the Health Insurance Portability and Accountability Act of 1996.
• another developer, seeking to model a database for storing financial-related data, may be prevented, by existing banking regulations, from obtaining individuals' financial data.
• One or more aspects of the disclosure relate to generating synthetic data from true-source data using two machine-learning models.
• the first machine learning model may categorize fields of the true-source dataset, allowing the application to replace the values in identified sensitive fields with randomized data that still follows the same syntax structure as the original true-source dataset, and output a scrubbed dataset.
• the second machine learning model may determine statistical parameters of the fields of the first scrubbed dataset and determine correlations between the fields of the first scrubbed dataset.
• the second machine learning model may next generate a synthetic dataset based on the same learned statistical properties, probabilities, distributions, and relationships as the original, true-source dataset.
  • Benefits of using this approach may include allowing developers to interact with realistic synthetic data that does not risk exposing sensitive customer or company data, thereby protecting customers’ privacy (e.g., health-related privacy concerns and banking-related privacy concerns). Also, by using two models, a first to scrub true- source data and a second to create synthetic data from the scrubbed, true-source data, the developers may obtain synthetic data that would otherwise take weeks or months to obtain due to permission issues or the true-source data being wholly unavailable.
  • the two machine learning models may be a cloud-deployed service that generates realistic synthetic data on demand that matches one or more of statistical probabilities, distributions, or dependencies in real data but does not contain any real records or customer information, thereby protecting customer privacy and an institution's sensitive data without requiring significant manual operation.
• This statistically relevant synthetic data is valuable because it provides access to realistic synthetic data where the true-source data is unavailable or inaccessible.
  • a probabilistic graphical model may be used for one or more of the models.
  • the PGMs may be deployed as a cloud-based microservice to query a source dataset, remove sensitive customer information, automatically train a machine-learning model, and generate synthetic data while minimizing required input from users, thereby increasing the security of the service and reducing the risk of data exposure.
  • the synthetic data is de-identified and anonymized such that recreating the true-source data is effectively impossible by reviewing the synthetic data.
• another approach may permit users to view and tune parameters of the synthetic data generation model (e.g., a PGM or other model).
• FIG. 13 depicts a flow chart for a method of training a model based on true-source data.
  • a true-source dataset is received.
  • the starting of the process may be based on a request from a user or system desiring a synthetic dataset.
• the true-source dataset may be uploaded by the user or system or may be obtained from a remote user or system.
  • the process of FIG. 13 may provide a dataset cleaning service for that user or system, thereby permitting specific datasets to be filtered and/or selected after being uploaded as part of step 1300 but before the machine-learning model is trained to generate the synthetic data.
• step 1300 provides another level of protection against disclosure of sensitive data by having step 1300 find and obtain datasets without exposing the true-source datasets to the requesting user.
• the true-source dataset may comprise a plurality of records, with data of the records arranged in various fields.
  • a previously-trained machine learning model may be obtained in step 1301.
• An example of a previously-trained machine learning model for scrubbing datasets may be found in US 16/151,385, filed October 4, 2018, now US Patent 10,460,235, to Truong et al. entitled "Data Model Generation Using Generative Adversarial Networks", whose contents are expressly incorporated herein by reference.
• a previously-trained machine learning model may, in step 1301, read and suggest labels for the fields of the retrieved dataset (retrieved in step 1300).
  • the labels may identify fields as relating to various content, of which labels may include one or more of persons' names, email addresses, physical addresses, city, state, ZIP Codes, country codes, credit card numbers, Social Security numbers, drivers' license numbers, other identifying numbers, telephone numbers, internet addresses (e.g., IPV4, IPV6), uniform resource locators (URLs), dates, times, combinations of dates and time, months, integers, FICO scores (i.e., a score based on a model provided by the Fair, Isaac, and Company), random data, and noise.
  • Some fields of the retrieved dataset may already be identified as containing sensitive information (e.g., a field of data with a field header of "SSN” or "Social Security Number” or "Address”). Additionally or alternatively, fields having pre-assigned labels may nonetheless be separately scanned to determine whether any sensitive information is in the fields and then the fields may be appropriately labeled (if needing a different label).
• the true-source data may be scrubbed in step 1302, to selectively replace the content of fields based on the labels of the fields. For instance, fields having been labeled with labels identifying sensitive information (e.g., names, addresses, account numbers, etc.) may be replaced with a contextually similar alternative value that follows the same schema as the source field.
  • the replacement technique may be the same for all fields having been labeled with a label identifying the field as containing sensitive information. Alternatively (as described with respect to FIG. 14), the replacement may vary between semantic and syntactic approaches.
• questionable fields may be flagged during one or more of steps 1301 or 1302, requesting review of fields that are not adequately classifiable as containing sensitive information or containing no sensitive information.
• in steps 1301 or 1302, users may be permitted to manually set data types and/or scrubbing policy.
• in step 1303, a scrubbed dataset may be generated.
• in step 1304, statistical parameters and correlation parameters may be determined for the fields in the scrubbed dataset.
  • a machine learning model evaluates the scrubbed true-source data to learn its patterns and distributions both within a field and by evaluating dependencies across fields (for example, income may be influenced by age).
  • dependencies between various fields may be determined. Based on those dependencies, the relationships between the fields may be mapped within the dataset to improve the accuracy of the output.
  • the output of this process is a generative model (step 1305) that generates realistic generated datasets similar to the scrubbed true-source dataset and may be used to generate synthetic data that follows the distributions of that scrubbed data.
  • the generative model may comprise a probabilistic graphical model (PGM), an agent-based model (ABM), or other generative model.
• in step 1306, based on the generative model created in step 1305, a synthetic dataset is generated.
• the synthetic dataset follows the patterns of the true-source data by calling the generative model to generate synthetic data.
• the quantity of records generated may be arbitrarily large and is not limited by the volume of available true-source data.
• Because this data was generated to match patterns rather than being based on real transactions or records, it should not contain any real customer information or sensitive business data, but it will still match the distributions and patterns of the scrubbed, true-source data.
• This synthetic data may then be passed back to the user or application requesting it for display or usage as required for the given use case. Additionally or alternatively, the data may be checked by the user or another entity and flagged where, for instance, any true-source data is found (e.g., un-tokenized credit card numbers) or any datasets whose expected columns (based on, for example, an enterprise data management tool registration) do not match actual columns observed (referred to as schema drift).
  • the process may be deployed as an automatic process with no user intervention.
  • the process may be deployed to include user and/or technician's interactions to review field categorizations (or other items for review) and where users are able to manually tune one or more of data categorization, scrubbing, dependencies, or distributions to obtain the desired synthetic data.
  • Further controls may be placed on the source data from step 1301 to limit the volume of source data obtained from the data source.
• the true-source dataset and the scrubbed dataset may be deleted after the creation of the generative model of step 1305.
• the generative model of step 1305 and any generated synthetic dataset from step 1306 may be deleted a short time (e.g., from one or two days to two weeks, or later as desired) after creation.
• a whitelist of fields that should not be scrubbed may also be used.
• the use of the whitelist in step 1301 to prevent scrubbing of specific fields may permit a finer-grained recognition of which fields are sensitive and which are not, allowing the values in non-sensitive fields to pass through to the scrubbed version of the data and increasing the realism of the scrubbed data, the generative model, and, finally, the generated datasets.
• FIGs. 14-16 depict flow charts for a method of training a model based on true-source data of FIG. 13 with additional steps.
• a true-source dataset is received.
  • the starting of the process may be based on a request from a user or system desiring a synthetic dataset.
  • the true-source dataset may be uploaded by the user or system or may be obtained from a remote user or system.
  • the process of FIG. 14 may provide a dataset cleaning service for that user or system, thereby permitting specific datasets to be filtered and/or selected after being uploaded as part of step 1400 but before the training of the machine-learning model.
  • step 1400 provides another level of protection of disclosure of sensitive data by having step 1400 find and obtain datasets without exposing the datasets to the requesting user.
  • the true-source dataset may comprise a plurality of records, with data of the records arranged in various fields.
  • the size of the true-source data set may be limited. This may be achieved by monitoring the size of the received true-source dataset and, upon reaching a cap, deleting data received above that cap. Additionally or alternatively, the size of received true-source dataset may be determined before being received and datasets above the cap may be refused. Additionally or alternatively, the full size true-source dataset may be sampled to comport with the size limit in step 1401.
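The size-limiting alternatives described above (truncating past a cap, or sampling the full dataset down to the cap) might be sketched as follows; the cap value and function name are hypothetical:

```python
import random

MAX_RECORDS = 10_000   # hypothetical cap on true-source volume

def enforce_cap(records, rng=random.Random(0)):
    """Cap the true-source dataset: keep it as-is when within the cap,
    otherwise sample it down to the cap (the sampling alternative)."""
    if len(records) <= MAX_RECORDS:
        return list(records)
    return rng.sample(records, MAX_RECORDS)
```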
• in step 1402, the fields of the true-source dataset may be labeled to permit scrubbing of sensitive information in step 1403.
  • step 1402 may comprise reading and labeling the fields of the true-source dataset (step 1404).
  • the labels may identify which fields contain sensitive customer information. For example, one or more of the following classifications may be available for sensitive fields: names, email addresses, physical addresses, credit card numbers, and Social Security Numbers.
  • Some fields of the retrieved dataset may already be identified as containing sensitive information (e.g., a field of data with a field header of "SSN" or "Social Security Number" or "Address”). Additionally or alternatively, the fields may be separately scanned to determine whether any sensitive information is in the fields and then the fields may be appropriately labeled.
  • the fields may be labeled even where the field headers were not previously designated as having sensitive information.
  • the fields may be scanned and labels may be applied by a previously-trained machine learning model. Additionally or alternatively, a user (with appropriate credentials) may be permitted (in step 1405) to override the labeling results (e.g., to designate a field as containing sensitive information where it was previously identified not to contain sensitive information) for finer control of the labeling process.
• An example of a previously-trained machine learning model for labeling fields may be found in US 16/151,385, filed October 4, 2018, now US Patent 10,460,235, to Truong et al. entitled "Data Model Generation Using Generative Adversarial Networks", whose contents are expressly incorporated herein by reference.
• Some of the labels may designate fields as having identified sensitive information (e.g., all social security numbers, all zip codes, etc.) or as fields suspected of having sensitive information (e.g., numbers or known alpha-numeric patterns). For instance, based on the labels of some fields, the data in those fields may be treated differently from data in other fields.
  • the data in those fields may be scrubbed by replacing the content with semantically similar data from one or more tables or from one or more lists of the semantically similar data (e.g., for example, replacing a first name with a random name from a list of first names, where the random name was chosen via a random number generator).
• the scrubbing of those fields may comprise replacing each character with a syntactically similar character (e.g., replacing alphabetical characters with a random alphabetical character and replacing a number character with a random number character). Symbol characters may be replaced with a random symbol character or may be permitted to remain unchanged.
• fields 1408 may contain what is believed to be non-sensitive information (e.g., transaction times and dates, account balances, transaction balances, etc.).
• those labels designate the content of those fields to be permitted to remain unscrubbed.
  • the known sensitive fields 1406 may be scrubbed by replacing the content with a semantic equivalent
  • the unknown sensitive fields 1407 may be replaced with a syntactic equivalent
  • the non-sensitive fields 1408 may be retained with no replacement.
• the scrubbing may replace all sensitive values with a realistic alternative value that follows the same schema as the source data. Additionally or alternatively, questionable fields may be flagged during step 1402, requesting review of fields that are not adequately classifiable as containing sensitive information or containing no sensitive information. Additionally or alternatively, in step 1405, users may be permitted to manually set data types and/or scrubbing policy.
• in syntactic replacement, replacement values are selected based on the syntax of the data to be replaced.
  • a schema of the field may be analyzed and each character replaced with another character that would fit the schema. For example, for a field with "ABC123”, the schema is three capital letters followed by three numbers. A possible syntactic replacement would be "HDL537”. Another possible replacement would be "ZQA958".
• the schema is three capital Xs, a dash, two capital Xs, another dash, and a four-digit number. A possible syntactic replacement would be "AAA-AA-9943".
  • the selected replacement character for a given character may be the same across a dataset (e.g., all Xs replaced with As, all 4s replaced with 8s).
  • the selected replacement character may only be consistent for the content of a row of data in a given field (e.g., in a first row, all Xs are replaced with As and, in a second row, all Xs are replaced with Qs).
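A hedged sketch of schema-preserving syntactic replacement with the per-row consistency option described above (the helper name and example value are assumptions): letters are replaced with random letters of the same case, digits with random digits, and symbols are retained.

```python
import random
import string

def syntactic_replace(value, rng, consistent=True):
    """Schema-preserving replacement: letters -> random letters of the same
    case, digits -> random digits, symbols unchanged.  With consistent=True
    the same source character maps to the same replacement within this value
    (the row-level consistency described above)."""
    mapping = {}
    out = []
    for c in value:
        if consistent and c in mapping:
            out.append(mapping[c])
            continue
        if c.isupper():
            r = rng.choice(string.ascii_uppercase)
        elif c.islower():
            r = rng.choice(string.ascii_lowercase)
        elif c.isdigit():
            r = rng.choice(string.digits)
        else:
            r = c                      # dash, space, etc. retained
        mapping[c] = r
        out.append(r)
    return "".join(out)

rng = random.Random(1)
replaced = syntactic_replace("XXX-XX-1234", rng)
```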
  • the replacement value is arbitrarily chosen from a list of non-sensitive values.
• where the type of field refers to a name, the replacement value would be another name.
  • the first name "Adam” may be substituted with one (e.g., Mason) of a list of male names, e.g., Liam, Arthur, William, James, Logan, Benjamin, Mason, Adam, Elijah, etc.
  • the last name "Smith” may be substituted with one (e.g., Brown) of a list of last names, e.g., Jones, Smith, Garcia, Lee, Williams, Johnson, Martinez, Hernandez, Wong, Miller, Brown, etc.
  • the resulting semantic replacement would be "Mason Brown”.
  • the field may be recognized as an address.
• the "450" may be substituted with a three-digit number (e.g., 805), and the street name and type may be substituted with one (e.g., Broadway Ave) of a list of known street names and types (e.g., Saddleback Rd, Riding Ridge Place, Belleview Ct, Broadway Ave, etc.).
  • the resulting semantic substitute would be "805 Broadway Ave”.
  • the selection from each list may be based on a random number generator to help anonymize the data.
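The list-based semantic replacement described above might be sketched as follows, reusing the example name and street lists from the text (the function names are hypothetical; note that the original value is discarded entirely):

```python
import random

# Replacement lists taken from the examples in the text.
FIRST_NAMES = ["Liam", "Arthur", "William", "James", "Logan", "Benjamin",
               "Mason", "Adam", "Elijah"]
LAST_NAMES = ["Jones", "Smith", "Garcia", "Lee", "Williams", "Johnson",
              "Martinez", "Hernandez", "Wong", "Miller", "Brown"]
STREETS = ["Saddleback Rd", "Riding Ridge Place", "Belleview Ct", "Broadway Ave"]

rng = random.Random(3)

def semantic_name(full_name):
    """Replace a recognized name with random picks from the lists;
    the original value is not used."""
    return f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}"

def semantic_address(address):
    """Replace a recognized address: random three-digit number + random street."""
    return f"{rng.randint(100, 999)} {rng.choice(STREETS)}"

anon_name = semantic_name("Adam Smith")       # e.g., a name drawn from the lists
anon_addr = semantic_address("450 Elm St")    # e.g., a street drawn from the list
```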
• in step 1409, statistical parameters and correlation parameters may be determined for the fields in the scrubbed dataset.
• in step 1410, based on the determined statistical parameters and correlation parameters, a generative machine-learning model may be trained.
• in step 1411, a synthetic dataset may be generated based on the generative model trained in step 1410.
• a generative machine learning model evaluates the scrubbed true-source data to learn its patterns and distributions both within a field and by evaluating dependencies across fields (for example, income may be influenced by age).
  • dependencies between various fields may be determined. Based on those dependencies, the relationships between the fields may be mapped within the dataset to improve the accuracy of the output.
• the output of this process is a generative machine learning model that is tied to the scrubbed true-source dataset and may be used in the subsequent step to generate synthetic data that follows the distributions of that scrubbed true-source data.
  • a generative model of the scrubbed dataset is trained in step 1410 and synthetic datasets generated in step 1411.
• the quantity of records generated may be arbitrarily large and is not limited by the volume of available scrubbed, true-source data. Because this generated data was generated to match patterns rather than being based on real transactions or records, it will not contain any real customer information or sensitive business data, but it will still match the distributions and patterns of the scrubbed true-source data. This synthetic data may then be passed back to the user or application requesting it for display or usage as required for the given use case.
  • generated data may be checked by the user or another entity and flagged where, for instance, any sensitive source data is found in any field (e.g., un-tokenized credit card numbers in fields identified as non-sensitive) or any datasets whose expected columns (based on enterprise data management tool registration) do not match actual columns observed (referred to as schema drift).
  • the process may be deployed as an automatic process with no user intervention.
• the process may be deployed to include user and/or technician's interactions to review field categorizations (or other items for review), where users are able to manually tune one or more of data categorization, scrubbing, dependencies, or distributions to obtain the desired synthetic data.
• the source data may be deleted after the creation of the model of step 1410. Additionally or alternatively, the generative model and/or the synthetic data may be deleted after a period of time (e.g., 21 days). Additionally or alternatively, a whitelist of data that should not be scrubbed may also be used. The use of the whitelist in step 1402 to prevent scrubbing of specific fields may permit a finer-grained recognition of which fields are sensitive and to allow the values in those fields to pass through to the synthetic version of data, increasing the realism of the synthetic data.
  • the true-source datasets may comprise a plurality of records, with data of the records arranged in various fields.
  • the true-source dataset may be deleted.
• the true-source dataset may be retained for future comparisons.
• One benefit of deleting the true-source dataset as shown in step 1412 is that the deletion further protects sensitive information of the users whose information may still be contained in or derived from the scrubbed, true-source dataset.
  • the deletion step 1412 may occur after any of the determination of statistical parameters and correlation parameters of the scrubbed dataset (step 1409), after the generation of the generative model (step 1410), or after the generation of the synthetic dataset (step 1411). Further, the generative model from step 1410 and/or any generated datasets from step 1411 may also be deleted.
  • the system may perform one or more steps of FIGs. 15 and 16 as shown by references D and E.
  • the process may proceed via reference D to FIG. 15.
  • the system receives one or more modifications of statistical and/or correlation parameters and proceeds to reference F.
• the scrubbed data model of step 1409 and/or the trained generative model of step 1410 is modified in step 1413 based on the modifications received in step 1503; the generative data model is retrained in step 1410, and another synthetic dataset is generated in step 1411.
• the synthetic dataset may be sent (step 1501) to one or more computing systems.
  • the synthetic dataset may be used in various ways including, for instance, training another machine learning model, modeling a database, or comparing the synthetic dataset with other datasets to possibly determine whether the other datasets represent actual data or synthetic data.
• the system may determine statistical and/or correlation parameters of the generated data in step 1502. After the determination of the statistical or correlation parameters in step 1502, the system may receive modifications (step 1503) of the statistical parameters or correlation parameters of the scrubbed data model and/or the generative model as described above. Alternatively or additionally, after the determination of the statistical and/or correlation parameters of the generated dataset, the parameters of the generated dataset may be compared, in step 1504, with the expected parameters of the scrubbed data model and/or those of the generative model. Based on the comparison of step 1504, modifications of the statistical and/or correlation parameters may be received in step 1503, and the scrubbed data model of step 1409 and/or the generative model of step 1410 may be modified in step 1413 of FIG. 14.
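A minimal sketch of the step 1502/1504 flow described above: determining parameters of a generated dataset and their differences from the expected parameters of the model (all values are invented for illustration):

```python
import statistics

# Expected parameters from the scrubbed data model (hypothetical values).
expected = {"age": {"mean": 40.0, "stdev": 12.0}}

# Parameters determined from the generated dataset (step 1502).
generated_ages = [38, 41, 44, 36, 40, 43, 39, 42]
observed = {"age": {"mean": statistics.mean(generated_ages),
                    "stdev": statistics.stdev(generated_ages)}}

# Step 1504: per-parameter absolute differences for the comparison.
diffs = {field: {stat: abs(observed[field][stat] - expected[field][stat])
                 for stat in expected[field]}
         for field in expected}
```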
  • Another generative model may be trained based on the modified scrubbed data model 1409 and another synthetic dataset generated in step 1411 or, if modifying the generative model directly, the another synthetic dataset may be generated in step 1411 once the generative model has been modified.
  • no modifications of the parameters may be received and another synthetic dataset may be generated in step 1413 (via reference I).
  • the results of the comparison may be sent in step 1505.
• the results of the comparison may be further processed as described with respect to FIG. 16, via reference G.
• the system may determine whether a difference between one or more parameters of the synthetic dataset and related expected parameters of the generative model is greater than a parameter threshold. If the difference or differences are not greater than a parameter threshold, the results of the comparison may be forwarded to one or more computing systems for further evaluation of the determination of step 1601 or use of the second synthetic dataset. If the difference or differences are greater than a parameter threshold, the process may, via reference H, receive modifications of statistical and/or correlation parameters in step 1503, modify the generative model in step 1413, and generate (step 1411) another synthetic dataset based on the modified generative model.
  • a score may be generated in step 1603 and the score sent (in step 1604) to one or more computing systems for further evaluation and/or use of the second synthetic dataset.
  • the score may be compared (in step 1605) against a score threshold. If the score is above the score threshold, the results of the comparison may be sent to the one or more computing devices as described above with respect to step 1602. If the score is below the score threshold, the system may send the results of the comparison in step 1602 and/or receive modification of the statistical/correlation parameters in step 1503 to modify (step 1413) the generative model and generate (step 1411) another synthetic dataset based on the modified generative model.
  • the comparison score from step 1603 may be used to rank the reliability of the generative model and determine whether any human interaction to change the generative machine-learning model is necessary. If the score indicates the model is reliable, then the synthetic dataset from step 1411 may be considered for consumption by other downstream systems.
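The parameter comparison and scoring loop described above (steps 1601-1605) can be sketched as follows. This is an illustrative interpretation, not the patent's exact method: the parameter names, the per-parameter threshold, and the scoring formula (fraction of parameters within threshold) are assumptions for the example.

```python
# Hypothetical sketch of comparing generated-dataset parameters against
# the generative model's expected parameters and producing a score.

def compare_parameters(expected, observed, threshold=0.1):
    """Return per-parameter differences and whether each exceeds the threshold."""
    results = {}
    for name, exp_value in expected.items():
        diff = abs(observed[name] - exp_value)
        results[name] = {"difference": diff, "exceeds": diff > threshold}
    return results

def score_comparison(results):
    """Score = fraction of parameters within the threshold (1.0 is best)."""
    within = sum(1 for r in results.values() if not r["exceeds"])
    return within / len(results)

expected = {"mean_age": 30.0, "std_age": 5.0}   # from the generative model
observed = {"mean_age": 30.4, "std_age": 6.1}   # from the generated dataset
results = compare_parameters(expected, observed, threshold=0.5)
score = score_comparison(results)
# A low score would trigger modification of the generative model (step 1413)
# and regeneration of the synthetic dataset (step 1411).
```

A threshold of 0.5 flags the standard-deviation mismatch while accepting the mean, so the score here is 0.5, which a downstream check might treat as "needs modification".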
  • An example use case may comprise a dataset and information about the dataset being provided with the dataset. For instance, when users are trying to find information about a real dataset, the information may be provided along with a sample of the dataset using synthetic data (e.g., the synthetic dataset).
  • the synthetic dataset may have been previously generated or may be generated in response to the user's request for the information. Because the synthetic dataset contains no real customer information in it, users may be permitted to preview the synthetic dataset with less security or privacy restrictions, allowing the users to evaluate the synthetic dataset’s utility without needing to request and wait for access to the true-source dataset.
  • Another use case may comprise the management of test data.
  • users may be able to request realistic data to be populated into their development and quality assurance environments and applications.
  • the system may use the described process to retrieve true-source data and create a synthetic version of that data that may safely be shared in lower permission environments with reduced risk of exposing customer information. Further, the system permits an arbitrarily large volume of test data to be available regardless of the amount of source data available, helping teams that cannot get enough test data.
  • Creating on-demand synthetic data may permit users to interact with realistic data that does not risk exposing sensitive customer or company data, thereby protecting customers’ privacy. Also, using the system to generate synthetic data based on actual data may permit users to obtain access to realistic data without the legal or corporate delays associated with private information and without violating customer privacy or data sharing policies. Further, by permitting users to use synthetic data for tasks normally requiring actual data, companies may benefit by reducing the quantity of users and/or systems that require actual data to perform tasks, thereby permitting companies to add additional protections on the users and/or systems accessing real data and having less concern about others using the synthetic data (as the synthetic data was generated using two or more models).
  • a computer-implemented method may comprise receiving a true-source dataset comprising a source plurality of records, wherein the source plurality of records may be arranged according to a plurality of fields and each record of the source plurality of records may comprise true-source data for at least one field; categorizing, using a previously-trained model, one or more fields of the plurality of fields; determining, based on the categorizing of the one or more fields of the plurality of fields, a method of scrubbing the source plurality of records; generating, based on the determined method for scrubbing the one or more fields of the plurality of fields of the source plurality of records of the true-source dataset, a scrubbed dataset comprising a scrubbed plurality of records; determining, based on the data of the scrubbed plurality of records of the scrubbed dataset, one or more parameters for the plurality of fields of the scrubbed dataset, wherein the parameters comprise one or more of statistical parameters or correlation parameters; storing the one or more parameters;
  • the categorizing may comprise predicting, using the previously-trained model, a label for one or more of the plurality of fields
  • the generating of the scrubbed dataset may comprise replacing, based on the label for one or more of the plurality of fields, data in the source plurality of records of the true-source dataset with replacement data.
  • the replacing step further may comprise substituting, based on the label for the one or more of the plurality of fields, semantically similar data for the source plurality of records in the true-source dataset.
  • the substituting may comprise selecting, based on the label, a random value from a list of values associated with the label.
  • the replacing step further may comprise substituting, based on the label for the one or more of the plurality of fields, syntactically similar data for the source plurality of records in the true-source dataset.
  • the substituting may comprise replacing, on a character-by-character basis for a first record, any alphabetical characters with random alphabetical characters; and replacing, on a character-by-character basis for the first record, any number characters with random number characters.
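The two substitution strategies described above (semantic replacement from a label's value list, and syntactic character-by-character replacement) can be sketched as follows. The label name "person" and its value list are assumptions for the example; the patent does not prescribe a particular implementation.

```python
import random
import string

# Hypothetical list of semantically similar replacement values per label.
NAMES_FOR_LABEL = {"person": ["Alice Smith", "Bob Jones", "Carol Lee"]}

def substitute_semantic(label):
    """Semantic substitution: select a random value from the list for the label."""
    return random.choice(NAMES_FOR_LABEL[label])

def substitute_syntactic(value):
    """Syntactic substitution: replace letters with random letters and digits
    with random digits, character by character, preserving the format."""
    out = []
    for ch in value:
        if ch.isalpha():
            out.append(random.choice(string.ascii_lowercase if ch.islower()
                                     else string.ascii_uppercase))
        elif ch.isdigit():
            out.append(random.choice(string.digits))
        else:
            out.append(ch)  # punctuation and spacing are preserved as-is
    return "".join(out)

scrubbed = substitute_syntactic("4111-1111-1111-1111")  # same shape, new digits
```

The syntactic variant keeps the length, case pattern, and punctuation of the original value, so a scrubbed credit card number still looks like a credit card number without containing the true digits.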
  • the categorizing further may comprise receiving user input modifying the label of one or more fields of the true-source dataset.
  • the receiving the true-source dataset may comprise limiting a volume of true-source data in the true-source dataset.
  • the method may further comprise deleting, based on the categorizing the one or more fields of the plurality of fields, the true-source dataset.
  • the method may further comprise receiving user input modifying one or more parameters; modifying, based on the modified one or more parameters, the generative model; generating, based on the modified generative model, a second generated dataset; and outputting the second generated dataset.
  • the statistical parameters may comprise a distribution parameter of one of the plurality of fields of the scrubbed dataset, and the distribution parameter may comprise one of a normal distribution, a Benford distribution, binomial distribution, power distribution, or a triangular distribution.
  • the statistical parameters may comprise a minimum, maximum, mean, mode, standard deviation, symmetry, skewness, or kurtosis.
  • the correlation parameters may comprise a degree of correlation between two or more fields of the scrubbed dataset.
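The statistical and correlation parameters listed above can be determined from a scrubbed dataset as sketched below. The field names and values are illustrative; Pearson's r is used as one concrete example of a "degree of correlation between two or more fields".

```python
from statistics import mean, stdev

# Illustrative scrubbed-dataset fields (values are made up for the example).
ages =     [25, 30, 35, 28, 32]
balances = [1200, 1500, 1900, 1350, 1650]

# Statistical parameters for the "ages" field.
stats = {"min": min(ages), "max": max(ages),
         "mean": mean(ages), "stdev": stdev(ages)}

def pearson(xs, ys):
    """Degree of correlation between two fields (Pearson's r)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

r = pearson(ages, balances)  # near 1.0: strongly correlated fields
```

Parameters like these would be stored alongside the scrubbed data model and later compared against the same parameters computed on the generated dataset.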
  • the label may identify the field as containing data of one or more of city, a person, a credit card number, an email address, a phone number, a social security number, or an address.
  • One of the one or more of the statistical parameters may be a first distribution parameter of one of the plurality of fields of the scrubbed dataset.
  • the method may further comprise determining, based on one of the second plurality of fields of the generated dataset, a second distribution parameter; comparing the second distribution parameter with the first distribution parameter; modifying, based on comparing the second distribution parameter with the first distribution parameter, the generative model to include a modified distribution parameter; generating, based on the modified generative model, a second generated dataset; and outputting the second generated dataset.
  • the generative model may comprise a probabilistic graphical model comprising two or more nodes and one or more edges, wherein at least one of the two or more nodes may be based on the one or more statistical parameters, and wherein the one or more edges may be based on the one or more correlation parameters.
  • the method further may comprise generating a graphical user interface representing the probabilistic graphical model; receiving user interactions with the graphical user interface, the user interactions modifying a correlation edge of the one or more edges of the probabilistic graphical model; generating, based on the modified probabilistic graphical model, a second generated dataset; and outputting the second generated dataset.
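A minimal sketch of the probabilistic graphical model described above: nodes carry the statistical parameters, edges carry the correlation parameters, and sampling a record walks the graph. The two fields, the Gaussian distributions, and the correlation-mixing formula are illustrative assumptions, not the patent's specified generative model.

```python
import random

# Nodes: statistical parameters per field. Edges: correlation parameters.
nodes = {
    "age":     {"mean": 30, "stdev": 5},
    "balance": {"mean": 1500, "stdev": 300},
}
edges = [("age", "balance", 0.8)]  # balance correlates with age (rho = 0.8)

def sample_record(nodes, edges, rng):
    # Sample each field independently, then apply edge correlations.
    record = {name: rng.gauss(p["mean"], p["stdev"]) for name, p in nodes.items()}
    for parent, child, rho in edges:
        # Mix the child's value toward the parent's standardized value so the
        # pair has (approximately) the requested correlation coefficient.
        z = (record[parent] - nodes[parent]["mean"]) / nodes[parent]["stdev"]
        record[child] = (nodes[child]["mean"]
                         + rho * z * nodes[child]["stdev"]
                         + (1 - rho ** 2) ** 0.5
                         * (record[child] - nodes[child]["mean"]))
    return record

rng = random.Random(0)
rows = [sample_record(nodes, edges, rng) for _ in range(1000)]
```

A user interaction that modifies a correlation edge (as in the bullet above) would simply change the `rho` value on the edge before regenerating the dataset.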
  • the outputting may further comprise sending the generated dataset to a user's computing device or training a predictive model based on the generated dataset; and generating one or more predictions based on data using the trained predictive model.
  • the instructions that cause the apparatus to output the generated dataset may further cause the apparatus to send the generated dataset to a user's computing device.
  • the instructions that cause the apparatus to output the generated dataset further cause the apparatus to train a predictive model based on the generated dataset; and generate one or more predictions based on data using the trained predictive model.
  • the instructions that cause the outputting further cause the one or more processors to perform sending the generated dataset to a user's computing device.
  • the instructions that cause the outputting further cause the one or more processors to train a predictive model based on the generated dataset; and generate one or more predictions based on data using the trained predictive model.
  • the label may comprise one or more of a person's name, an address, a city, a state, a credit card number, an email address, a telephone number, or a social security number.
  • an apparatus may comprise one or more processors; and memory storing instructions that, when executed by the one or more processors, cause the apparatus to receive a true-source dataset comprising a source plurality of records, wherein the source plurality of records may be arranged according to a plurality of fields and each record of the source plurality of records may comprise true-source data for at least one field; categorize, using a previously-trained model, one or more fields of the plurality of fields; determine, based on the categorizing of the one or more fields of the plurality of fields, a method of scrubbing the source plurality of records; generate, based on the determined method for scrubbing the one or more fields of the plurality of fields of the source plurality of records of the true-source dataset, a scrubbed dataset comprising a scrubbed plurality of records; determine, based on the data of the scrubbed plurality of records of the scrubbed dataset, one or more parameters for the plurality of fields of the scrubbed dataset, wherein
  • one or more non-transitory media storing instructions that, when executed by one or more processors, may cause the one or more processors to perform steps comprising receiving a true-source dataset comprising a source plurality of records, wherein the source plurality of records may be arranged according to a plurality of fields and each record of the source plurality of records may comprise true-source data for at least one field; categorizing, using a previously-trained model, one or more fields of the plurality of fields; determining, based on the categorizing of the one or more fields of the plurality of fields, a method of scrubbing the source plurality of records; generating, based on the determined method for scrubbing the one or more fields of the plurality of fields of the source plurality of records of the true-source dataset, a scrubbed dataset comprising a scrubbed plurality of records; determining, based on the data of the scrubbed plurality of
  • Generating synthetic data may address issues where enough actual data is unavailable.
  • a synthetic model validation process may be deployed locally or as a cloud-based service.
  • Machine learning as a whole typically involves multiple steps with model training and model validation requiring extra attention.
  • Model validation typically involves using measures of predictive accuracy, precision, recall, or a variety of other metrics to justify how well the model performs/predicts.
  • Synthetic model validation is not easily determined as an underlying machine learning model is not actually making a prediction and is instead generating data. Since the model is not involved in any predictive process, it is often unclear how to measure how well a model is performing, and measures like accuracy, F1-score, precision, and recall become obsolete when working with synthetic data.
  • a concept from statistics may be applied to help evaluate generative models: hypothesis testing.
  • Hypothesis testing is a process of accepting or rejecting a hypothesis formed on a specific parameter.
  • systems and processes permit the formation of hypothesis tests and then apply those hypothesis tests to various datasets created by a given generative model. For example, for a financial dataset of actual loans provided to people, one may expect the age of people in this dataset to be, on average, 30 years and that most people are within ±5 years of this average.
  • One may conduct a hypothesis test, specifically a normality test, which determines whether the ages in a generated test dataset are normally distributed with a mean of 30 and standard deviation of 5.
  • the use of hypothesis tests may be applied to validate synthetic data models by creating hypothesis tests to evaluate the generated synthetic datasets.
  • a financial analyst may have a real dataset that has the ages normally distributed with a mean of 30 and standard deviation of 5. After construction of a synthetic data model, the financial analyst may run a normality test on a quantity of synthetic datasets to verify that the synthetic dataset does indeed have the normal distribution that is present in the real dataset. If the user-specified threshold for hypothesis test success is met (for example, 95% of normality tests pass on 100 synthetic datasets), then the synthetic data model may be considered validated directly against the user’s needs.
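The validation loop above can be sketched as follows. For simplicity, a two-sided z-test on the mean (with known sigma) stands in for the full normality test the text describes; the 95% pass threshold, 100 test datasets, and 200 points per dataset are the illustrative values from the example, and the synthetic datasets are simulated here by drawing from the target distribution.

```python
import math
import random

def z_test_mean(sample, mu=30.0, sigma=5.0, alpha=0.05):
    """Two-sided z-test: does the sample mean match mu, given known sigma?"""
    n = len(sample)
    z = (sum(sample) / n - mu) / (sigma / math.sqrt(n))
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return p_value > alpha  # True = fail to reject the null: the test "passes"

rng = random.Random(1)
num_datasets, points_per_test = 100, 200

# Run the hypothesis test once per synthetic test dataset.
passes = sum(
    z_test_mean([rng.gauss(30, 5) for _ in range(points_per_test)])
    for _ in range(num_datasets))

# The model is considered validated if the user-specified threshold is met.
validated = passes / num_datasets >= 0.95
```

With a significance level of 0.05, roughly 95 of the 100 tests are expected to pass when the generated data truly follows the target distribution, which is exactly why 95% is a natural default threshold.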
  • This process of applying hypothesis testing to synthetic data may include a number of advantages including allowing users to validate synthetic data models to their liking, allowing users to specify how strict they want to be in their validation, and permitting the validation process to be applied to tune and retrain the underlying synthetic model to be tailored towards the user’s needs.
  • Hypothesis tests may comprise a 2-sided-T, 1- sided-T, binomial, chi- squared, and/or normality test.
  • the parameters for a hypothesis test may comprise an alpha (also referred to as "α" or the "significance level", representing the probability of rejecting a null hypothesis when true), a quantity of tests to be performed, and a quantity of data points per test. Further, based on the selected hypothesis test to perform, the parameters associated with the test may be different.
  • users who consume synthetic data may be permitted to determine how reliable the generated synthetic data is in order to increase confidence in using the data.
  • FIGs. 17 and 18 depict flow charts for a method of validating synthetic data.
  • a request is received for the generation of synthetic data.
  • a synthetic data model is received.
  • the received generative model of step 1702 may be a previously trained generative data model. Additionally or alternatively, the generative model may be trained based on a data model created from parameters determined in step 1713 from the true source dataset (e.g., a true-source dataset received in step 1712 as described below).
  • a generated test dataset is generated with, for example, rows of data arranged in one or more fields.
  • parameters may be determined for data in one or more of the fields of the generated test dataset.
  • In step 1705, parameters associated with one or more fields are received.
  • In step 1706, hypothesis tests relating to the parameters are determined.
  • In step 1707, the process determines whether parameters of the generated test dataset pass the hypothesis tests determined in step 1706.
  • In step 1708, a score may be generated based on the determination of step 1707 of whether the parameters passed the hypothesis tests.
  • a generated dataset may be subsequently generated (e.g., of a larger size than the generated test dataset) and sent to one or more computing devices for subsequent use. Alternatively or additionally, the score from step 1708 may be sent (step 1710) to the one or more computing devices or to other computing devices for further evaluation.
  • a user may possess a level of sophistication to determine how to modify the data model based on the score from step 1708.
  • the system may receive instructions to modify the data model in step 1711, modify the data model in step 1712, and generate, in step 1703, another generated test dataset based on the data model as modified in step 1712.
  • a user may desire additional aid in evaluating the score from step 1708. As shown by reference J bridging FIGs. 17 and 18, a percent of fields satisfying the hypothesis tests may be determined in step 1801.
  • If, in step 1802, the percent is determined to be greater than a threshold percentage, the generated test dataset or another generated dataset may be sent (step 1803) to one or more computing devices. If, in step 1802, the percent is determined to be less than the threshold, the system may send (step 1804) results of the comparison with the threshold, receive instructions to modify the data model in step 1805, modify the data model in step 1712 (via reference K), and generate (step 1703) a synthetic dataset based on the modified data model of step 1712.
  • the system may determine, in step 1806, whether the score is greater than a score threshold. If the score is greater than the score threshold, then the generated test dataset or another generated dataset (based on the same generative model but, for instance, larger) may be sent, in step 1803, to the one or more computing devices. If the score is determined to be below the score threshold, the results of the comparison may be sent, in step 1804, to one or more computing devices and the steps performed as described above.
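The branch logic of steps 1801-1805 can be sketched as below: compute the percent of fields whose hypothesis tests passed, then either send the dataset onward or route back into model modification. The field names, pass/fail results, and the 75% threshold are illustrative assumptions.

```python
# Hypothetical per-field hypothesis test outcomes from step 1707.
field_results = {"age": True, "balance": True, "zip": False, "income": True}

def percent_passing(results):
    """Percent of fields satisfying their hypothesis tests (step 1801)."""
    return 100.0 * sum(results.values()) / len(results)

def next_action(results, threshold_percent=75.0):
    """Return the step the process takes next, per the FIG. 18 description."""
    if percent_passing(results) >= threshold_percent:
        return "send generated dataset (step 1803)"
    return "send comparison results and modify data model (steps 1804-1805)"

action = next_action(field_results)
```

With three of four fields passing, the percent is exactly 75%, so a 75% threshold sends the dataset onward while a stricter threshold would instead trigger model modification.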
  • the system may receive, in step 1712, a true-source dataset and determine, in step 1713, parameters associated with fields of the true-source dataset.
  • the parameters determined in step 1713 may be used as metadata for a data model of the true-source dataset received in step 1712.
  • the metadata may be used to train a generative data model used in step 1703 to generate the generated test dataset.
  • the parameters determined from step 1713 may also be used in step 1706 to determine hypothesis tests relating to the parameters of the synthetic dataset, which are compared in step 1707 as described above. For example, one may determine a mean of a field in the true-source dataset.
  • a hypothesis test may be created and applied to the related field of the synthetic dataset to validate the data model that created the synthetic dataset.
  • a mean may be determined for the related field of the synthetic dataset, the hypothesis test applied to the mean of the synthetic dataset's field, and the passing of the hypothesis test for that field meaning that the model used to generate the synthetic data appropriately models the true-source data for that field.
  • one or more other statistical hypothesis tests may be created for that field and the field of the generated test dataset tested using those one or more statistical hypothesis tests.
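The flow just described (steps 1712-1713 feeding step 1706) can be sketched as follows: a parameter determined from a true-source field becomes the basis of a hypothesis test applied to the related synthetic field. The data values are made up, and the 95% confidence-interval check on the mean is one simple example of such a test.

```python
import math
from statistics import mean, stdev

# Illustrative true-source field (step 1712) and related synthetic field.
true_ages = [28, 31, 30, 33, 27, 29, 32, 30, 31, 29]
synthetic_ages = [30, 29, 32, 28, 31, 30, 33, 27, 30, 30]

# Step 1713: determine parameters from the true-source field.
mu, sigma = mean(true_ages), stdev(true_ages)

# Steps 1706/1707: hypothesis test -- is the synthetic field's mean within
# a 95% confidence interval of the true-source mean?
n = len(synthetic_ages)
margin = 1.96 * sigma / math.sqrt(n)
passed = abs(mean(synthetic_ages) - mu) <= margin
```

Passing this test for the field suggests the data model reproduces the true-source field's central tendency; additional tests (e.g., on spread or distribution shape) would strengthen the validation, as the bullet above notes.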
  • FIG. 19 describes the process of generating a user interface based on the data model and receiving a user's interactions with the user interface.
  • a data model is received.
  • a user interface based on the data model is generated.
  • the system receives the user's interactions with the user interface and creates hypothesis tests based on those interactions.
  • the metadata model from the user interactions adjusting parameters is stored.
  • the hypothesis tests from step 1903 are added to the model validation process described in FIG.
  • the model is validated against the hypothesis tests (e.g., by training a generative model based on the metadata model, generating generated test datasets from the generative model, determining parameters of the generated test datasets, and comparing the determined parameters with expected parameters of the generated data model).
  • the results of the validation process may be sent to the user.
  • FIG. 20 depicts a user interface for specifying hypothesis tests for the process of FIGs.
  • a user interface 2001 may comprise a region 2002 through which a user may select and/or modify one or more hypothesis tests.
  • the region 2002 may comprise a region 2003 that allows selection of one or more fields, a region 2004 that allows selection of a hypothesis test to perform, a region 2005 that allows the user to input hypothesis test parameters, a region 2006 that allows the user to input a confidence interval for the hypothesis test, and a region 2007 that allows the user to specify the quantity of hypothesis tests to perform.
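The inputs collected by regions 2003-2007 of user interface 2001 can be captured in a single specification record, sketched below. The field names are assumptions introduced for illustration; the patent does not name a data structure.

```python
from dataclasses import dataclass

@dataclass
class HypothesisTestSpec:
    field: str         # region 2003: the field under test
    test: str          # region 2004: e.g., "normality", "2-sided-T", "chi-squared"
    parameters: dict   # region 2005: test parameters, e.g. {"mean": 30, "stdev": 5}
    confidence: float  # region 2006: confidence interval, e.g. 0.95
    num_tests: int     # region 2007: quantity of hypothesis tests to perform

spec = HypothesisTestSpec("age", "normality", {"mean": 30, "stdev": 5}, 0.95, 100)
```

A record like this would be created from the user's interactions (step 1903) and then handed to the validation process of FIGs. 17 and 18.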
  • a computer-implemented method may comprise receiving a generative model, wherein the generative model may be configured to generate one or more generated datasets having records arranged in one or more fields; generating, based on the generative model, a generated test dataset; receiving one or more input parameters associated with the one or more fields; determining, based on the one or more input parameters, a hypothesis test for the one or more fields; determining, based on data in the one or more fields of the generated test dataset, a parameter, wherein the parameter may be one or more of a statistical parameter or a correlation parameter; determining, based on the parameter, whether the generated test dataset passed the hypothesis test; and outputting the determination whether the generated test dataset passed the hypothesis test.
  • the method may further comprise receiving, based on the determination whether the generated test dataset passed the hypothesis test, an instruction; modifying, based on the instruction, the generative model; generating, based on the modified generative model, a second generated test dataset; determining, based on data in the fields of the second generated test dataset, a second parameter of the one or more fields; determining, based on the second parameter, whether the second generated test dataset passed the hypothesis test; and outputting the determination whether the second generated test dataset passed the hypothesis test.
  • the outputting may comprise sending, to a requesting device, the determination that the generated test dataset passed the hypothesis test, wherein the input parameters may be received from the requesting device.
  • the parameter may be a statistical parameter, and the method further may comprise determining, based on data in two or more fields of the generated test dataset, a correlation parameter between two or more fields of the generated test dataset; and determining, based on the correlation parameter, whether the generated test dataset passed the hypothesis test, wherein the hypothesis test may comprise a statistical hypothesis test using the statistical parameter and further may comprise a correlation hypothesis test using a correlation parameter.
  • the correlation parameter may comprise one of covariance, interclass correlation, intraclass correlation, or rank.
  • the method may further comprise receiving a true-source dataset comprising records, wherein each record contains true-source data arranged in the one or more fields; and determining, based on the third data in one or more fields of the true-source dataset, one or more third statistical parameters of the one or more fields of the true-source dataset.
  • the receiving one or more input parameters associated with the one or more fields may comprise receiving the one or more third statistical parameters.
  • the statistical parameter may comprise one or more of a minimum, a maximum, a mean, a mode, a standard deviation, symmetry, skewness, kurtosis, or distribution.
  • the method may further comprise receiving a true-source dataset comprising records, wherein each record contains true-source data arranged in the one or more fields; and determining, based on the third data in two or more fields of the true-source dataset, a correlation parameter between two or more fields of the true-source dataset.
  • the receiving one or more input parameters associated with the one or more fields may comprise receiving the correlation parameter.
  • the method may further comprise generating, based on the determination that the generated test dataset passed the hypothesis test, an output dataset; and sending, to a requesting device, the generated output dataset.
  • the method may further comprise generating an additional test dataset; determining, based on data in the one or more fields of the additional generated test dataset, a second parameter, wherein the second parameter may be one or more of a statistical parameter or a correlation parameter; determining, based on the second parameter, whether the additional generated test dataset passed the hypothesis test; and sending the determination to a requesting device.
  • the request for the generated dataset may be received via an application programming interface.
  • the input parameters comprise a distribution parameter for a field, a mean parameter for the field, and a standard deviation for the field.
  • the determining whether the generated test dataset passed the hypothesis test may comprise obtaining a confidence interval percent; and determining whether a percent of fields of the generated test dataset satisfying the hypothesis test is within the confidence interval percent.
  • the method may further comprise receiving a true-source dataset comprising records, wherein each record contains true-source data; and determining, based on the true-source data, an independence parameter between two or more fields of the true-source dataset, wherein receiving one or more input parameters associated with the one or more fields may comprise receiving the independence parameter.
  • an apparatus may comprise one or more processors; and memory storing instructions that, when executed by the one or more processors, cause the apparatus to receive a true-source dataset having true-source data arranged in fields; generate, based on fields of the true-source dataset, a data model; generate, based on the data model, a user interface; receive user interactions with the user interface, the user interactions defining relationships between the fields of the data model; generate, based on the relationships, a generative model, wherein the generative model may be configured to generate generated datasets having records arranged in the fields; generate, based on the generative model, a generated test dataset; receive an identification of a selected hypothesis test of a plurality of hypothesis tests; receive one or more input parameters associated with the one or more fields; determine, based on the one or more input parameters, a hypothesis test for the one or more fields; determine, based on data in the one or more fields of the generated test dataset, a parameter, wherein the parameter may be one or more of a statistical parameter or
  • the instructions may further control the apparatus to generate, based on the determination whether the additional generated test datasets passed the hypothesis test, a first score; and send, to a user's device, the first score, wherein the user interactions may be from the user's device.
  • the parameter may be a statistical parameter and the instructions further control the apparatus to determine, based on the generated data in two or more fields of the one or more generated test datasets, a correlation parameter between two or more fields of the one or more generated test datasets; and determine, based on the correlation parameter, whether each of the one or more generated test datasets passed the hypothesis test, wherein the hypothesis test may comprise a statistical hypothesis test using the first statistical parameter and may comprise a correlation hypothesis test using the correlation parameter.
  • the instructions to determine whether each of the one or more first generated test datasets passed the hypothesis test may cause the apparatus to obtain a confidence interval percent, and determine whether a percent of fields of each of the one or more generated test datasets satisfying the hypothesis test may be within the confidence interval percent.
  • the instructions may further cause the apparatus to receive an identification of a quantity of generated datasets to be generated; generating the quantity of generated datasets; and sending the quantity of generated datasets.
  • the correlation parameter may comprise one of covariance, interclass correlation, intraclass correlation, or rank.
  • one or more non-transitory media storing instructions that, when executed by one or more processors, may cause the one or more processors to perform steps comprising receiving a data model of a true-source dataset with true-source data arranged in fields; generating, based on the data model, a user interface; receiving user interactions with the user interface, the user interactions defining relationships between the fields of the data model; generating, based on the relationships, a generative model, wherein the generative model may be configured to generate generated datasets having records arranged in the fields; generating, based on the generative model, a generated test dataset; receiving one or more input parameters associated with the one or more fields; determining, based on the one or more input parameters, a hypothesis test for the one or more fields; determining, based on data in the one or more fields of the generated test dataset, a parameter, wherein the parameter may be one or more of a statistical parameter or a correlation parameter; determining, based on the parameter, whether the generated test dataset passed the
  • agent definitions from individual simulation specifications may provide additional capabilities beyond simulations in which the agents are defined in the simulation specification. This separation may permit greater control of the agents (including their probability distribution definitions and probability of behaviors) while enabling efficient creation of complex simulations.
  • the framework with the separate agent definitions permits agent definitions to be extensible across different simulations that otherwise would not be based on common agent definitions.
  • separating agent definitions from individual simulation specifications permits the creation of extensible, complex agent definitions that include attribute probability distributions and/or behavior probability distributions that may not be used for any given simulation (at least initially). By permitting agent definitions to grow in complexity independent of any given simulation, agent definitions become more comprehensive in their attribute probability distributions and behavior probability distributions.
  • the complex agent definitions may then be used across different simulations by selecting the existing agent definitions for a given simulation specification (in lieu of creating all agents in that simulation specification). This ability to reference existing agent definitions may help streamline the creation of complex simulations by allowing simulations to reuse complex agent definitions rather than requiring agent definitions to be newly defined for each simulation.
  • At least some environmental factors may be represented as agent definitions. This may permit easier creation of varied simulations for a given scenario. For example, one may desire to model an economic environment where certain parameters for that economic environment are specified as a probability distribution (e.g., inflation rate, savings rate, unemployment rate, gross domestic product, employment, income and wealth, health care, international trade, etc.).
  • a conventional economic ABM simulation specification would have specified specific economic parameters as values as part of the simulation specification. Multiple simulations of that simulation specification (that defined the economy as part of the simulation specification) would have resulted in synthetic data based on instantiated agents in that specified economy.
  • the developer would need to modify the existing simulation specification's specific economic factors for each new set of simulations.
  • the instantiation of the economy may be based on a sampling, during instantiation of the economy agent definition, of those probability distributions, thus varying each simulation execution to have different economic values of attributes/behaviors.
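The instantiation-time sampling described here can be sketched as follows; the economic parameters and their ranges are assumptions chosen purely for illustration:

```python
import random

# Environmental factors modeled as an agent definition whose parameters are
# probability distributions rather than fixed values (ranges are illustrative).
ECONOMY_DEFINITION = {
    "inflation_rate": lambda rng: rng.uniform(0.01, 0.08),
    "unemployment_rate": lambda rng: rng.uniform(0.03, 0.10),
    "savings_rate": lambda rng: rng.uniform(0.02, 0.12),
}

def instantiate_economy(seed):
    """Sample each distribution once so every run gets a different economy."""
    rng = random.Random(seed)
    return {name: dist(rng) for name, dist in ECONOMY_DEFINITION.items()}

# Each simulation execution samples its own economic attribute values.
economy_run_1 = instantiate_economy(seed=1)
economy_run_2 = instantiate_economy(seed=2)
```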
  • Separating agents' probability definitions from a simulation specification permits development of complex agents. For instance, defining the attributes and behaviors of agents separately encourages richer modeling of agents than would normally be achieved by developing agents specifically for a given simulation (or even where the definitions of the agents and the simulation are integrated into a single specification). Decoupling the agent definitions from the simulation specifications may provide the ability to separately define agents as including disparate characteristics that would normally not be modeled in agent definitions for a specific simulation.
  • agent definitions may be defined as having financial attributes and behaviors (one or more credit cards, paying each off in part or in full, the average balance carried on each, spending habits including size of purchases and how often purchases are made, etc.), mortgage and/or foreclosure-related attributes and behaviors (one or more checking/savings/credit card accounts, one or more mortgages on their home, possible second homes, employment status and income, job security, etc.), and transportation-related attributes and behaviors (how often one commutes to an office or goes shopping, miles driven, train or bus used for part of the travel, likelihood of a vehicular breakdown in their lane of travel, time leaving for travel, direction of travel, etc.).
  • the richer agent definitions may be used for individual simulations by specifying, in a simulation specification, which existing agent definitions to use and the desired attributes and/or behaviors, thus bringing new simulation specifications on-line more efficiently.
  • creating agent definitions solely in a simulation specification may require significant time spent defining each new agent definition, thus consuming significantly more time specifying the details of each agent definition and/or tuning the agent definitions' probability distributions to make the new agents internally consistent (e.g., minimizing or eliminating contradictory attributes and/or behaviors resulting in nonsensical situations, like obtaining new loans daily with no income or wealth) and/or externally consistent (e.g., being able to model existing known data).
  • simulation specifications may be enhanced by being able to better simulate events (when executed) that may be caused by transient events not normally modeled when creating a new simulation specification.
  • the foreclosure simulation may be better able to identify situations where an agent's instance is in a car accident and cannot work. Because they cannot work, their credit card debt increases and their home is eventually foreclosed upon by a lending institution. Knowing the cause of foreclosures may help provide better insights to explain foreclosure data from the simulation. Without the complex agent instances' transportation-related information, it may be difficult or impossible to explain the significance and/or origins of data in generated datasets using only simulation specifications specifying agent instances with only limited attributes and behaviors.
  • FIG. 23 depicts an agent definition storage and multiple simulation specifications using one or more agent definitions.
  • An agent definition storage 2300 may comprise a memory with one or more databases storing multiple agent definitions with their associated attribute probability distributions and behavior probability distributions.
  • the agent definitions stored in agent storage 2300 are defined separately from simulation specifications 2313 in which the agent definitions may be used. In short, the agent definitions are not simulation-specific.
  • One or more aspects of these agent definitions is the ability for the agent definitions to be used in different simulation specifications and resulting simulations where some attributes and/or behaviors are used and not others.
  • One advantage of decoupling the agent definitions from any given simulation specification is that the agent definitions may be reused across simulation specifications that would normally require the creation of simulation- specific agents.
  • Agent storage 2300 may comprise a plurality of agent types 2301, 2302, 2303, 2304, 2305, and 2306, with each agent type including attribute probability distributions and/or behavior probability distributions. Additionally or alternatively, the attribute probability distributions and behavior probability distributions may be specified separate from each other and referenced by a given agent type definition.
  • agent type definition 1 2301 includes six attribute probability distributions 2307 and three behavior probability distributions 2308.
  • Agent type definition 2 2302 includes eight attribute probability distributions 2309 and two behavior probability distributions 2310.
  • Agent type definition 3 2303 includes four attribute probability distributions 2311 and five behavior probability distributions 2312.
  • Agent type definition 4 2304, agent type definition 5 2305, and agent type definition 6 2306 are also shown in agent storage 2300 to represent other agents with attribute probability distributions and behavior probability distributions (not shown). It is appreciated that yet other agents may be included in agent storage 2300 or stored elsewhere. Further, in some examples, agent probability distributions and/or behavior probability distributions may be defined in simulation specifications.
  • FIG. 23 also shows multiple simulation specifications 2313 using one or more agent type definitions from the agent storage 2300.
  • the simulation specifications 2313 may be stored with agent type definition storage 2300 in a storage 2314. Additionally or alternatively, the simulation specifications may be stored separately from a collective storage 2314.
  • each agent type definition may be instantiated 10, 100, 1000, or more times in a simulation based on the simulation specification. It is appreciated that the number of agent types and the number of instantiations of each agent type for a given simulation varies and may be adjusted as desired.
  • Simulation specifications may include, for instance as shown in FIG.
  • Each simulation specification may comprise information that defines the simulation specification.
  • the information may comprise, for instance, a list 2314 of agent types to be used in the simulation, an initialization state or state 0 2315 that identifies a number of instances of the various agent types to be initially present in an execution of the simulation, a list 2316 of actions/rules per step, a number of steps to be performed (and/or termination conditions) 2317, probabilities 2318 of creating new/killing off various agents in a next step, and storage information 2319 of a current simulation state.
  • a random number seed value 2320 may be used to provide fine tuning of a simulation specification (e.g., enabling a developer to adjust various parts of the simulation specification while using the same random number seed).
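A simulation specification carrying the fields listed above might be recorded as a simple structure like the following; every field name and value is a hypothetical stand-in (including the storage path), chosen only to mirror the reference numerals of FIG. 23:

```python
# Hypothetical simulation-specification record; the keys parallel the
# elements 2314-2320 described for FIG. 23 and are not from the application.
simulation_spec = {
    "agent_types": ["cardholder", "merchant"],               # list 2314
    "initial_state": {"cardholder": 1000, "merchant": 50},   # state 0, 2315
    "actions_per_step": ["make_purchase", "pay_balance"],    # list 2316
    "num_steps": 365,                                        # steps/termination 2317
    "spawn_kill_probabilities": {                            # probabilities 2318
        "cardholder": {"new": 0.01, "kill": 0.005},
    },
    "state_storage": "sim-state/run-001/",                   # storage info 2319 (hypothetical path)
    "random_seed": 42,                                       # seed value 2320
}
```

Holding the seed fixed while editing other fields is what lets a developer attribute output changes to the edit rather than to sampling noise.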
  • the other simulation specifications may include one or more of the agent type definitions and specify other probability definitions not used in the credit card simulation specification.
  • an initial infection rate R may be specified and a likelihood of taking public transportation may be specified.
  • the likelihood of taking public transportation may also be specified.
  • Various probability distributions of attributes and/or behaviors of the agent type definitions may be specified as part of the individual simulation specifications. Behavior probability distributions 2321, related to agent types, may be identified in the simulation specifications 2313. Additionally or alternatively, the definitions of one or more agent types may be obtained from another source (e.g., another agent storage) or created and/or modified in the specification of the simulation.
  • FIG. 24 depicts various simulations based on the stored simulation specifications.
  • FIG. 24 shows storage 2314 of FIG. 23.
  • the simulation specifications are sampled to execute the simulations.
  • FIG. 24 comprises a credit card transaction simulation 2401 that is based on the credit card transaction simulation specification of FIG. 23.
  • a random number generator 2410 samples the various probability distributions of the credit card transaction simulation specification to begin the credit card transaction simulation 2401.
  • the credit card transaction simulation 2401 comprises one or more agent instances that are based on specific agent definitions.
  • the credit card transaction simulation 2401 includes agent type 1 instance 1 2402 with attribute 2408 as a value (having been sampled from an attribute probability distribution definition by the random number generator 2410) and attribute PD 2409 that is sampled in each new step of execution of the credit card transaction simulation 2401 (e.g., at each new time t, the behavior probability distributions are sampled and the result used in that step of the simulation as a value A).
  • attribute probability distributions are only sampled whenever a new agent is being instantiated and, during each step in the execution of a simulation, the attributes may be changed by the behaviors (where the behaviors are sampled from behavior probability distributions).
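As a toy illustration of this sampling split (all numbers and distributions below are invented for the sketch), an attribute can be drawn once when the agent is instantiated, while behavior distributions are drawn at every step and update that attribute:

```python
import random

rng = random.Random(7)

# Attribute probability distribution: sampled once, at instantiation.
balance = rng.uniform(0, 500)  # illustrative starting card balance

history = [balance]
for step in range(5):
    # Behavior probability distributions: sampled at every step; the sampled
    # behaviors then modify the attribute value.
    purchase = max(0.0, rng.gauss(40, 10))       # spend behavior
    payment = balance * rng.uniform(0.1, 0.5)    # pay-down behavior
    balance = balance + purchase - payment
    history.append(balance)
```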
  • the credit card transaction simulation 2401 may also comprise other instances 2 ... 100 of the agent type 1 definition (as agent type 1 instance 2 2403 ... agent type 1 instance 100 2404).
  • the credit card transaction simulation 2401 may include instances of other agent type definitions as well (e.g., agent type 2 instance 1 2405 with attributes 2410, 2411, 2412, agent type 2 instance 2 2406 with one or more attributes, and instances 1-Y 2407 of agent type 3 with attributes).
  • Other simulations may also use one or more of the agent definitions of FIG. 23 and as used by the credit card transaction simulation 2401.
  • a transportation simulation 1 2411 may use instances of agent type 1 2415, agent type 2 2416, and agent type Q 2417.
  • the sampled attributes of the instances of agent types 1, 2, and Q are not shown for simplicity but understood to be created through sampling of the probability distributions of the attribute probability definitions of each of the agent definitions.
  • a transportation simulation 2 2412 may also be executed using another sampling of the agent definitions as specified in the transportation simulation specification of the simulation specifications 2313 of FIG. 23. Further, FIG. 24 shows two additional simulations (foreclosure simulation 1 2413 and foreclosure simulation 2 2414) executed based on the foreclosure simulation specification of the simulation specifications 2313 of FIG. 23.
  • the credit card transaction simulation specification 2401 may further include behavior probability distributions 2418 relating to actions performed by the instances of the agent types 2402-2407. It is appreciated that the behavior probability distributions may be particular to specific agent types and/or apply to some agent types and not others.
  • the transportation simulation specifications 2411 and 2412 may include behavior probability distributions 2419 that may be the same across both transportation simulation specifications 2411 and 2412 or may have some behavior probability distributions 2419 in common and not others.
  • FIG. 25 depicts a flowchart of an execution of an agent-based model simulation.
  • a simulation specification is retrieved.
  • the simulation specification 2500 may comprise agent probability distribution definitions 2501, quantities of agents to be instantiated per agent type definition 2502, behavior probability distribution definitions 2503 (additionally or alternatively received separately from the agent type definitions 2502), desired synthetic data 2504, and other simulation specification items as needed.
  • Step 2505 shows the execution of the simulation specification received in step 2500.
  • a step of the simulation is performed using a most recent execution state of the simulation (if any).
  • the next simulation state is generated in the simulation 2505.
  • the simulation may continue until termination criteria or set number of steps have been simulated (e.g., specified in the simulation specification termination criteria/number of steps 2317 of FIG. 23). For instance, the process may repeat for a set number of iterations, until a given result is obtained (e.g., 30% home ownership), or the simulation reaches a steady state (no significant changes from a previous state - e.g., 99% of the collected states not changing between steps).
  • Synthetic data 2509 may be generated at the termination of simulation 2505 or generated at one or more states during the execution of the simulation 2505.
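The loop described above (run a step from the most recent state, then stop on a step budget or a steady state) can be sketched as follows; the 99%-unchanged steady-state test mirrors the example given, while the step function and agent states are invented for illustration:

```python
import random

def run_simulation(step_fn, initial_state, max_steps=1000, stable_fraction=0.99):
    """Advance the simulation until max_steps is reached or a steady state
    occurs (here: at least `stable_fraction` of per-agent states unchanged
    between consecutive steps)."""
    state = initial_state
    for step in range(max_steps):
        new_state = step_fn(state)
        unchanged = sum(1 for a, b in zip(state, new_state) if a == b)
        state = new_state
        if unchanged / len(state) >= stable_fraction:
            return state, step + 1
    return state, max_steps

# Toy step function: each agent flips to "owner" with some probability and
# then stays there, so the run converges to a steady state.
rng = random.Random(0)
def step_fn(state):
    return [s if s == "owner" or rng.random() > 0.2 else "owner"
            for s in state]

final_state, steps_taken = run_simulation(step_fn, ["renter"] * 100)
```

The synthetic dataset would then be collected either from the terminal state or from intermediate states of the loop, as the paragraph above notes.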
  • FIG. 26 depicts another example flowchart of an execution of an agent-based model simulation.
  • a simulation specification is received.
  • the simulation specification may comprise agent type definitions 2601, quantities of agent types to be instantiated 2602, behavior probability distribution definitions 2603, and desired synthetic data fields 2604.
  • the simulation specification 2600 is executed.
  • the probability distributions specified in the simulation specification 2600 are sampled via a random number generator. As no previous step of the simulation exists, the simulation state is generated based on the probability distribution definitions and other data of the simulation specification 2600.
  • the process may repeat (next simulation steps) for a set number of iterations, until a given result is obtained (e.g., 30% home ownership), or the simulation reaches a steady state (no significant changes from a previous state - e.g., 99% of the collected states not changing between steps).
  • the synthetic dataset 2610 may be sent to a user.
  • the synthetic dataset 2610 may be used to train a machine-learning model in step 2612 and the trained machine-learning model used to generate predictions in step 2613 based on new true- source data.
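Steps 2612-2613 can be illustrated with a deliberately tiny stand-in model; the dataset, feature, and threshold "model" below are all invented for the sketch, and a real pipeline would use an actual machine-learning library:

```python
import random
import statistics

rng = random.Random(5)

# Stand-in for simulation output: synthetic (balance, defaulted) records.
balances = [rng.uniform(0, 1000) for _ in range(200)]
synthetic = [(bal, bal > 600) for bal in balances]

# "Training" (step 2612): place a decision threshold midway between the
# class means of the synthetic data.
pos = [bal for bal, label in synthetic if label]
neg = [bal for bal, label in synthetic if not label]
threshold = (statistics.mean(pos) + statistics.mean(neg)) / 2

def predict(balance):
    """Prediction on new true-source data (step 2613)."""
    return balance > threshold

predictions = [predict(b) for b in (100, 900)]
```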
  • the system may receive instructions to add a new agent probability distribution definition and/or a new behavior probability distribution definition.
  • the new agent probability distribution and/or new behavior probability distribution definition may be added to the simulation specification 2600 for a new simulation based on the updated simulation specification 2600.
  • In step 2616, instructions may be received to modify one or more existing agent probability distribution definitions and/or behavior probability distribution definitions and/or instantiation parameters and/or desired synthetic data fields. Based on the information received in step 2616, the corresponding agent probability distribution definitions and/or behavior probability distribution definitions and/or instantiation parameters and/or desired synthetic data fields are modified in step 2617, and the modified simulation specification 2600 is used for execution of a new simulation.
  • FIG. 27A depicts a flowchart of a process of modifying an agent-based model.
  • an identification of a simulation to be performed is received.
  • In step 27010, agent/behavior probability distribution definitions and/or instantiation parameters and/or desired synthetic data fields are received.
  • Steps 27000 and 27010 may be performed serially and/or in parallel. Based on the received information, a user interface is generated in step 27020. In step 27030, user interactions with the user interface are received. In step 27040, the agent/behavior probability distribution definitions and/or instantiation parameters and/or desired synthetic data fields are modified based on the user interactions of step 27030.
  • FIG. 27B depicts a user interface for modifying an agent-based model.
  • the user interface 2701 may comprise a quantity of regions including a region 2702 permitting selection of an agent probability distribution definition (e.g., agent A probability distribution definition 2704, agent B probability distribution definition 2706, and agent X probability distribution definition 2708) and the quantity of instantiations for the selected agent probability distribution definition to be set (e.g., quantity of instantiations for agent A's probability distribution definition 2705, quantity of instantiations for agent B's probability distribution definition 2707, and/or quantity of instantiations for agent X's probability distribution definition 2709).
  • the user interface 2701 may comprise a region 2703 permitting selection of a behavior probability distribution definition and selectively enabling/disabling that behavior (e.g., region 2717 permitting selection of behavior probability distribution definition J and enable/disable region 2718, region 2719 permitting selection of behavior probability distribution definition K and enable/disable region 2720, and region 2721 permitting selection of behavior probability distribution definition Y and enable/disable region 2722).
  • the user interface 2701 may comprise a region 2710 permitting modification of a selected agent/behavior's probability distribution definition.
  • Region 2710 may comprise a region 2711 for receiving a user's modification of an attribute parameter of the selected agent's probability distribution definition, a region 2712 for receiving the user's modification of a behavior probability distribution definition.
  • Region 2712 may additionally or alternatively separately permit linking or breaking a link between the selected behavior probability distribution definition such that instantiated agents perform the linked behaviors during simulation.
  • Where a behavior probability distribution definition comprises one or more parameters that define the behavior probability distribution definition, or where each behavior probability distribution definition is comprised of separate actions (that collectively make up the behavior probability distribution definition), the user interface may further comprise a region 2723 that receives user input for modification of the action or the behavior parameter.
  • the user interface 2701 may further comprise a region 2713 for accepting user input for defining a new agent probability distribution definition.
  • Region 2713 may comprise a region 2714 for receiving user input for setting a new attribute probability distribution parameter and a region 2715 for receiving user input for setting a new behavior probability distribution parameter and/or linking the new behavior probability distribution definition with an agent probability distribution definition.
  • the user interface 2701 may further comprise a region 2724 for accepting user input for modifying the fields to be populated with synthetic data for a generated synthetic dataset.
  • the user interface 2701 may further comprise a region 2725 for accepting user input for selecting and/or modifying a simulation to be executed.
  • a computer- implemented method may comprise storing, in a storage, one or more agent type definitions, wherein each agent type definition may comprise a plurality of attribute probability distribution definitions, receiving, for a first simulation, a first simulation specification, wherein the first simulation specification may comprise a first list of agent type definitions, attribute probability distributions associated with the first list of agent type definitions, and behavior probability distributions associated with the first list of agent type definitions; generating the first simulation via sampling, using a random number generator, the first simulation specification's probability distributions of the first list of agent type definitions; executing steps of the first simulation via sampling, using the random number generator, the first simulation specification's probability distributions of the first list of agent type definitions and the behavior probability distributions associated with the first list of agent type definitions; outputting, based on the first simulation, a first synthetic dataset; receiving, for a second simulation, a second simulation specification, wherein the second simulation specification may comprise a second list of agent type definitions, attribute probability distributions associated with the second list of
  • a difference between the first simulation specification and the second simulation specification may comprise, for the at least one common agent type definition, a different combination of attribute probability distributions. Additionally or alternatively, a difference between the first simulation specification and the second simulation specification may comprise, for the at least one common agent type definition, a different combination of the behavior probability distributions to be simulated. Fields of the first synthetic data set may be different from fields of the second synthetic data set.
  • the plurality of attribute probability distribution definitions may comprise at least one attribute probability distribution definition that, after generation of the first simulation or generation of the second simulation, is a value. Additionally or alternatively, the plurality of attribute probability distribution definitions may comprise at least one behavior probability distribution definition that is sampled at each step to modify an attribute value.
  • the first simulation specification may comprise a first set of synthetic data fields to be output
  • the second simulation specification may comprise a second set of synthetic data fields to be output
  • the first set of synthetic data fields may be different from the second set of synthetic data fields.
  • the first simulation may further comprise iteratively sampling probability distributions, associated with the first list of agent type definitions, of the attributes and behaviors.
  • the first synthetic data set may be streamed.
  • the first synthetic dataset may comprise synthetic data based on execution of multiple steps.
  • One or more aspects may further include receiving instructions to modify, for the first simulation specification, a quantity of instances of agent type definitions to be instantiated; modifying, based on the instructions, the first simulation specification; generating a third simulation via sampling, using the random number generator, the modified first simulation specification's probability distributions of the first list of agent type definitions; executing steps of the third simulation via sampling, using the random number generator, the modified first simulation specification's probability distributions of the first list of agent type definitions and the behavior probability distributions associated with the first list of agent type definitions; and outputting, based on the modified first simulation, a third synthetic dataset.
  • the agent type definitions may comprise probability monads, the probability monads may comprise attribute probability monads, and the probability monads may be a complex probability distribution composed of attribute probability distributions of the attribute probability monads.
  • generating the first simulation may comprise generating a simulation monad, wherein the simulation monad may comprise behavior probability monads, and the simulation monad may be a complex probability distribution composed of behavior probability distributions of the behavior probability monads.
  • the behavior probability distributions associated with at least one of the first list of agent type definitions and the second list of agent type definitions describe one or more actions, the one or more actions may comprise action probability distributions, and the action probability distributions may be a complex probability distribution composed of the behavior probability distributions.
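The monadic composition referenced in these aspects (cf. the cited work on probabilistic programming with monads) can be sketched in miniature: a distribution is represented as a sampling function, `unit` wraps a constant, and `bind` composes a distribution with a distribution-valued function, yielding a complex distribution from simpler ones. All names and ranges here are illustrative, not from the application:

```python
import random

def unit(value):
    """Wrap a constant as a (degenerate) distribution."""
    return lambda rng: value

def bind(dist, kleisli):
    """Sample `dist`, feed the sample to `kleisli` to obtain the next
    distribution, then sample that. Composition builds a complex
    probability distribution out of simpler component distributions."""
    return lambda rng: kleisli(dist(rng))(rng)

def uniform(lo, hi):
    return lambda rng: rng.uniform(lo, hi)

# An attribute distribution composed with a dependent distribution:
# income is drawn first, then spending depends on the drawn income.
income = uniform(30_000, 120_000)
spending = bind(income, lambda inc: uniform(0.2 * inc, 0.6 * inc))

rng = random.Random(3)
sample = spending(rng)  # one draw from the composed distribution
```

Under this reading, an "attribute probability monad" is such a sampling function and a simulation step is a chain of `bind`s over attribute and behavior distributions.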
  • An apparatus in accordance with one or more aspects may include one or more processors and memory storing instructions that, when executed by the one or more processors, cause the apparatus to receive, for a first simulation, a first simulation specification, wherein the first simulation specification may comprise a first list of agent type definitions, attribute probability distributions associated with the first list of agent type definitions, and behavior probability distributions associated with the first list of agent type definitions; generate the first simulation via sampling, using a random number generator, the first simulation specification's probability distributions of the first list of agent type definitions; execute steps of the first simulation via sampling, using the random number generator, the first simulation specification's probability distributions of the first list of agent type definitions and the behavior probability distributions associated with the first list of agent type definitions; output, based on the first simulation, a first synthetic dataset; receive, for a second simulation, a second simulation specification, wherein the second simulation specification may comprise a second list of agent type definitions, attribute probability distributions associated with the second list of agent type definitions, and behavior probability distributions
  • a difference between the first simulation specification and the second simulation specification may comprise, for the at least one common agent type definition, a different combination of attribute probability distributions.
  • a difference between the first simulation specification and the second simulation specification may comprise, for the at least one common agent type definition, a different combination of the behavior probability distributions to be simulated.
  • Fields of the first synthetic data set may be different from fields of the second synthetic data set.
  • the plurality of attribute probability distribution definitions may comprise at least one attribute probability distribution definition that, after generation of the first simulation or generation of the second simulation, may comprise a value.
  • one or more non-transitory media may store instructions that, when executed by one or more processors, cause the one or more processors to perform steps comprising storing one or more agent type definitions, wherein each agent type definition may comprise a plurality of attribute probability distribution definitions; receiving, for a first simulation, a first simulation specification, wherein the first simulation specification may comprise a first list of agent type definitions, attribute probability distributions associated with the first list of agent type definitions, and behavior probability distributions associated with the first list of agent type definitions; causing display of a graphical interface of the first simulation specification, wherein the graphical interface may be configured to display one or more agent type definitions, of the first list the agent's probability distribution definitions and the behaviors; receiving user interactions with the graphical interface, wherein the user interactions may be to modify a specific attribute probability distribution of the one or more agent type definitions; modifying, based on the received user interactions, the one or more agent type definitions' probability distribution definition in the first simulation specification; and generating the first

Abstract

The invention concerns a system, method, and computer-readable medium for generating factual and/or counterfactual data. This can have the effect of improving the richness of the data available for training machine-learning models. The models may comprise agent-based models (ABMs) in which agent definitions are decoupled from the simulation. According to one or more aspects, some agents may have associated attributes and behaviors that permit them to be reused across different ABMs to simulate different systems.
PCT/US2022/011253 2021-01-05 2022-01-05 Génération et évaluation de données synthétiques sécurisées WO2022150343A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP22737018.6A EP4275343A1 (fr) 2021-01-05 2022-01-05 Génération et évaluation de données synthétiques sécurisées

Applications Claiming Priority (10)

Application Number Priority Date Filing Date Title
US17/142,097 US11847390B2 (en) 2021-01-05 2021-01-05 Generation of synthetic data using agent-based simulations
US17/142,117 US20220215242A1 (en) 2021-01-05 2021-01-05 Generation of Secure Synthetic Data Based On True-Source Datasets
US17/142,117 2021-01-05
US17/142,024 US20220215262A1 (en) 2021-01-05 2021-01-05 Augmenting Datasets with Synthetic Data
US17/142,137 2021-01-05
US17/142,024 2021-01-05
US17/142,137 US20220215243A1 (en) 2021-01-05 2021-01-05 Risk-Reliability Framework for Evaluating Synthetic Data Models
US17/142,097 2021-01-05
US17/240,133 US20220215142A1 (en) 2021-01-05 2021-04-26 Extensible Agents in Agent-Based Generative Models
US17/240,133 2021-04-26

Publications (1)

Publication Number Publication Date
WO2022150343A1 true WO2022150343A1 (fr) 2022-07-14

Family

ID=82358125

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2022/011253 WO2022150343A1 (fr) 2021-01-05 2022-01-05 Génération et évaluation de données synthétiques sécurisées

Country Status (2)

Country Link
EP (1) EP4275343A1 (fr)
WO (1) WO2022150343A1 (fr)

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200012933A1 (en) * 2018-07-06 2020-01-09 Capital One Services, Llc Systems and methods for synthetic data generation

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SCIBIOR, ADAM; GHAHRAMANI, ZOUBIN; GORDON, ANDREW D.: "Practical probabilistic programming with monads", Haskell 2015, ACM, 30 August 2015 - 4 September 2015, pages 165-176, XP058071676, ISBN: 978-1-4503-3808-0, DOI: 10.1145/2804302.2804317 *
MANNINO, MIRO; ABOUZIED, AZZA: "Is this Real? Generating Synthetic Data that Looks Real", User Interface Software and Technology (UIST) 2019, ACM, 17 October 2019 - 23 October 2019, pages 549-561, XP058450656, ISBN: 978-1-4503-6816-2, DOI: 10.1145/3332165.3347866 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220207536A1 (en) * 2020-12-29 2022-06-30 Visa International Service Association System, Method, and Computer Program Product for Generating Synthetic Data
US11640610B2 (en) * 2020-12-29 2023-05-02 Visa International Service Association System, method, and computer program product for generating synthetic data

Also Published As

Publication number Publication date
EP4275343A1 (fr) 2023-11-15

Similar Documents

Publication Publication Date Title
Loi et al. Transparency as design publicity: explaining and justifying inscrutable algorithms
EP3072089A1 (fr) Procédés, systèmes et articles de manufacture pour la gestion et l'identification de connaissances causales
WO2007005975A2 (fr) Systeme de modelisation des risques
US20210303970A1 (en) Processing data using multiple neural networks
Mishra Machine learning in the AWS cloud: Add intelligence to applications with Amazon Sagemaker and Amazon Rekognition
CA3133729A1 (fr) Systeme et methode d'essai de l'equite de l'apprentissage automatique
US20210342743A1 (en) Model aggregation using model encapsulation of user-directed iterative machine learning
WO2020036590A1 (fr) Évaluation et développement de modèles de prise de décision
US20220215242A1 (en) Generation of Secure Synthetic Data Based On True-Source Datasets
US20220215243A1 (en) Risk-Reliability Framework for Evaluating Synthetic Data Models
US20230237583A1 (en) System and method for implementing a trust discretionary distribution tool
Weinzierl et al. Detecting workarounds in business processes-a deep learning method for analyzing event logs
Mehmood et al. A Novel Approach to Improve Software Defect Prediction Accuracy Using Machine Learning
US20220215142A1 (en) Extensible Agents in Agent-Based Generative Models
US11847390B2 (en) Generation of synthetic data using agent-based simulations
EP4275343A1 (fr) Generation and evaluation of secure synthetic data
US20220215262A1 (en) Augmenting Datasets with Synthetic Data
US10896034B2 (en) Methods and systems for automated screen display generation and configuration
Strickland Data analytics using open-source tools
US20230076559A1 (en) Explainable artificial intelligence based decisioning management system and method for processing financial transactions
US11314488B2 (en) Methods and systems for automated screen display generation and configuration
Savickas et al. An approach to business process simulation using mined probabilistic models
Sunkle et al. Incorporating directives into enterprise TO-BE architecture
Ruzgar et al. Rough sets and logistic regression analysis for loan payment
Ho Big data machine learning

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22737018

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2022737018

Country of ref document: EP

Effective date: 20230807