US12554871B2 - Systems, methods, and computer-readable media for secure and private data valuation and transfer - Google Patents
- Publication number
- US12554871B2 (application US17/712,952)
- Authority
- US
- United States
- Prior art keywords
- entity
- data samples
- synthetic data
- sample
- dataset
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active, expires
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/94—Hardware or software architectures specially adapted for image or video understanding
- G06V10/95—Hardware or software architectures specially adapted for image or video understanding structured as a network, e.g. client-server architectures
Definitions
- the present disclosure is related to systems, methods, and computer-readable media for a fair, secure and private data valuation and transfer.
- Machine learning (ML) technology has revolutionized many areas, such as computer vision, natural language processing, and autonomous driving, and has achieved state-of-the-art performance in them.
- the creation of accurate ML models can be highly dependent on access to large quantities of high quality and diverse training data.
- data aggregators, also known as data sellers or data providers.
- data processing organizations, also known as data seekers or buyers.
- data processing organizations need these datasets to extract valuable business insights or train machine learning models.
- data processing organizations seek datasets from data aggregators, in exchange for compensation.
- a central problem in a data marketplace is the discovery of potentially useful datasets for a given buyer.
- An easy way to find potentially useful datasets is using data attributes like size of dataset (volume), attribute names, target names, etc.
- these attributes do not capture the quality of a dataset, and often correlate poorly with the buyer task performance.
- a more promising solution that has been proposed is the utility-based data marketplace, where the business value a dataset can bring to the buyer's intended task is evaluated using data valuation.
- a test dataset provided by the buyer, known as the buyer task dataset, is used to measure the utility of a particular seller dataset.
- the utility of a seller's dataset D Si can be computed based on a buyer task dataset D B that is provided by the buyer.
- the utility is computed using a function U(D B , D Si ) which estimates the utility of a seller dataset D Si to help solve the machine learning task represented by the buyer task dataset D B .
- a buyer can make an informed decision about which datasets to acquire from which sellers.
- sellers can use the utility information to price their assets, allowing transparent price discovery.
- Utility computation requires interaction between the datasets of the sellers and the buyer. Specifically, it requires the buyer task dataset and the seller's dataset as inputs. Hence, this computation has to happen at a computer system that is controlled by one of the sellers, the buyer, or a third-party facilitator. This requires one (or more) of the market participants to transfer its dataset (assets) to other participants for data valuation.
- IP violation risks can include data-item level IP risks that pertain to IP rights in individual data items of the dataset (e.g., “Data-Item IP”) and dataset-level IP risks such as statistical information that is inherent in a dataset as a whole (e.g., “Statistical IP”).
- Such a dataset contains two pieces of potentially-tradeable information of high value.
- individual data items may also contain sensitive private information about the data contributors which needs to be protected. For example, a picture of a group of people contains sensitive information such as the facial expressions, clothes, location, interactions, and activities of those people.
- Unauthorized use of datasets can be difficult to detect and prove after the fact. For instance, it can be difficult to convincingly demonstrate whether a certain dataset was used in training a machine learning model, or to enforce copyright after a high-value image has been leaked online. Similarly, once certain sensitive information (related to individual privacy) is leaked, it is hard (if not impossible) to reverse the damage it has caused. Hence, sellers and buyers are hesitant to transfer their high-value and private assets (datasets or trained models) to each other or to the facilitator. Accordingly, there is a need for systems and methods that can enable secure and private data valuation and transfer of datasets among parties who may not have established trusting relationships.
- a computer implemented method that includes generating, by a first entity, a set of synthetic data samples that represent a corresponding set of original data samples; sending, by the first entity, the set of synthetic data samples for use by a second entity to generate a set of second entity predictions for the set of synthetic data samples using a machine learning (ML) model that has been trained using a second entity dataset; sending, by the first entity, for a third entity, a set of trusted labels corresponding to the set of original data samples; and receiving, by the first entity, from the third entity, valuation information for the second entity dataset that is based on a comparison by the third entity of the set of trusted labels and the set of second entity predictions.
- ML machine learning
- the method further includes receiving, by the first entity, the second entity dataset from the second entity upon completion by the first entity of a predetermined transfer requirement.
- the first entity, second entity, and third entity each comprise a respective controlled access computer system and (i) neither the second entity nor the third entity has access to the set of original data samples, (ii) the second entity does not have access to the set of trusted labels, and (iii) the first entity does not have access to the second entity dataset prior to the completion by the first entity of the predetermined transfer requirement.
- the second entity is one of a plurality of second entities
- sending, by the first entity, the set of synthetic data samples comprises sending the set of synthetic data samples for use by each of the plurality of second entities to generate a respective set of second entity predictions for the set of synthetic data samples using a respective trained machine learning (ML) model that has been trained using a respective second entity dataset that is unique to the second entity, and receiving, by the first entity, from the third entity, valuation information comprises receiving, by the first entity, valuation information from the third entity for each of the respective second entity datasets.
- generating, by the first entity, the set of synthetic data samples comprises synthesizing a respective data sample for each original data sample based on optimizing both a utility objective that enables consistent valuation information to be generated by the third entity for each of the respective second entity datasets and a security objective that differentiates the synthetic data sample from the original data sample.
- generating, by the first entity, the set of synthetic data samples includes training multiple training ML models, wherein each training ML model is trained based on a common model architecture and training algorithm as used to train the second entity machine learning (ML) model and each training ML model is trained using a respective randomized version of the set of original data samples.
- the respective synthetic data sample is synthesized by randomly initializing the synthetic data sample; (a) using a plurality of the multiple training ML models to generate respective model outputs for both the synthetic data sample and the original data sample; (b) updating the synthetic data sample based on: (i) a first gradient computed by the first entity based on a prediction difference between the respective model outputs for the synthetic data sample and the respective model outputs for the original data sample across the multiple training ML models, and (ii) a second gradient computed by the first entity based on a sample distance between the synthetic data sample and the original data sample in a sample space; and (c) repeating (a) and (b) with an objective of minimizing the prediction difference and maximizing the sample distance, until a defined completion criterion is achieved.
- the original data samples are image samples
- the sample space is a pixel space
- the respective model outputs are final layer activations.
- the second entity is one of a plurality of second entities that each generate a respective set of second entity predictions for the set of synthetic data samples using a respective trained machine learning (ML) model that has been trained using a respective second entity dataset that is unique to the second entity.
- the valuation information for each respective second entity dataset comprises: an individual utility value that is based on an individual comparison of the set of trusted labels and the set of second entity predictions generated for the second entity, and a marginal utility value that is based on a marginal increase in utility of the respective second entity predictions compared to predictions that include a plurality of the second entity predictions.
- the method includes sending to each second entity in the plurality of second entities an indication of a common model architecture and training algorithm for application by the second entity for training its respective ML model, wherein generating, by the first entity, the set of synthetic data samples is based on the common model architecture and training algorithm.
- the method includes receiving by the second entity, an indication of a common model architecture and training algorithm for application by the second entity for training the respective ML model, and training, by the second entity, the respective ML model based on the common model architecture and training algorithm using the second entity dataset as a training dataset.
- a method includes: receiving, by a facilitator entity, a set of trusted labels from a first entity; receiving, by the facilitator entity, a plurality of sets of second entity predictions provided by a plurality of second entities, the sets of second entity predictions having been generated by respective trained machine learning (ML) models for a common set of input samples, wherein each of the respective trained ML models has a common model architecture and has been trained using a common training algorithm based on a respective unique second entity dataset; computing, by the third entity, valuation information for each of the second entity datasets based on a comparison of the set of trusted labels with the sets of second entity predictions; sending, by the third entity, for the first entity, the valuation information for each of the second entity datasets; and sending by the third entity, for each second entity in the plurality of second entities, the valuation information for at least one of the second entity datasets.
- the present disclosure provides a computer-readable medium storing instructions for execution by a processing system.
- the instructions when executed cause the system to perform any of the aspects of the method described above.
- FIG. 2 is a flow diagram illustrating an example of a process performed by a synthetic data generation operation of the buyer entity of FIG. 1 .
- FIG. 6 is a flow diagram illustrating an example of a process performed among the entities of FIG. 1 .
- FIG. 7 is a block diagram of a computer system that can be used to implement the entities of FIG. 1 .
- Methods, systems and computer-readable media for secure and private data valuation and transfer of datasets are disclosed.
- the disclosed solution enables datasets to be accurately evaluated without disclosing information that would enable unauthorised use or copying of the datasets. This is achieved by limiting data access among different entities and also by generating synthetic data for use as a proxy for proprietary datasets during an evaluation process.
- the present disclosure describes a computer-implemented solution that can, in some examples, be applied in the context of a data marketplace where a first enterprise (referred to hereinafter as a “buyer”) wants to evaluate and potentially acquire one or more datasets from one or more second enterprises (referred to hereinafter as “seller(s)”) with the assistance of a third enterprise (referred to hereinafter as a “facilitator”).
- An enterprise can for example be a company, an institution, a governmental body, non-governmental body, a charity, a firm, a group or other type of organization, or an individual.
- task dataset can refer to a dataset that includes a collection of data samples and respective trusted labels that correspond to a target ML task.
- a set of images and classification labels for those images can be a task dataset for an ML model image classification task.
- a set of images and object detection labels for those images can be a task dataset for an ML model object detection task.
- NLP natural language processing
- the set of classes C, i.e., the possible labels in the seller dataset labels {y j} and the possible labels in the buyer dataset trusted labels {y j}
- each seller trains a supervised classification machine learning (ML) model g θi on its individual dataset D Si using a standard machine learning pipeline.
- FIG. 1 illustrates a data valuation and exchange network 5 (hereafter network 5 ) that includes a collection of participating entities 10 , 20 , 30 ( 1 ) to 30 (M) to which methods, systems, and computer readable mediums that are disclosed herein can be applied.
- entity can refer to a set of resources that is associated with or under the control of an enterprise.
- the set of resources can, for example, include one or more computer systems (including computer hardware, computer software, databases and datasets) that are part of or communicate with an enterprise network, including through a virtual private network.
- the entities of network 5 can include a first entity 10 (hereinafter referred to as “buyer entity” 10 ); at least one second entity 30 ( i ) (hereinafter referred to as “seller entity” 30 ( i )); and a third entity 20 (hereinafter referred to as “facilitator entity” 20 ).
- Buyer entity 10 can include one or more computer systems associated with or controlled by a buyer that controls a task dataset D B . The buyer wants to acquire additional data that can be used to train an effective ML model to perform the same ML model prediction task that is represented in the task dataset D B . The buyer may use the acquired data to train a machine learning model from scratch, or to improve the performance of an existing machine learning model using a bigger and more diverse dataset.
- each seller entity 30 ( i ) may for example include one or more computer systems associated with or controlled by a seller i that desires to sell a respective seller dataset D Si .
- each discrete seller entity 30 ( i ) includes a respective seller dataset D Si .
- a single seller entity 30 ( i ) may include multiple respective seller datasets that can each be individually evaluated using the systems and methods described herein.
- Facilitator entity 20 may for example include one or more computer systems associated with or controlled by a facilitator that manages a service platform for intermediating between buyers and sellers.
- each of the networked entities 10 , 20 , 30 ( 1 ) to 30 (M) includes a respective controlled access computer system for storing data and performing the respective processes that are described below.
- Controlled access means that access to the enterprise resources is limited to authorized parties or devices that meet pre-defined access criteria.
- each of the networked entities 10 , 20 , 30 ( 1 ) to 30 (M) is deployed in its own secure and physically separated environment and exchanges information about datasets only when mandated by the evaluation process protocols disclosed herein.
- buyer task dataset D B is a stored resource of buyer entity 10 .
- each task data sample x k can be an image and the trusted label y k can identify a class label for the image from a set of possible class labels.
- buyer entity 10 and other buyer entities, as well as seller entities 30 ( 1 ) to 30 (M) and possibly other seller entities, can register with and submit respective metadata to a coordinator module 22 of the facilitator entity 20 .
- buyer metadata can include definitions and descriptions of the task dataset D B , desired features of a seller dataset, and an intended ML task. This can include information such as an ordered set of class labels, class frequency distribution, loss functions, evaluation metrics, task definition, and required dataset size, among other things.
- seller metadata can include definitions and descriptions of the seller dataset D s including information defining possible tasks, class labels, dataset size, and dataset distribution information among other things.
- the facilitator entity 20 can perform searches and select a subset of seller datasets (e.g., seller datasets D S1 to D SM ) which best match the buyer entity requirements based on the buyer metadata. For example, class labels, task definition, volume and other measurable constraints can be used to select the seller datasets using, for example, a k-nearest type of matching.
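- The metadata matching described above can be sketched as follows. This is a minimal illustration only: the `Metadata` fields, the Jaccard-overlap score, the minimum-size filter, and the function name `select_sellers` are all assumptions introduced here, since the text says only that class labels, task definition, volume and similar constraints feed a k-nearest type of matching.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metadata:
    """Non-sensitive descriptive information only; no data samples."""
    name: str
    class_labels: frozenset
    size: int


def match_score(buyer, seller):
    """Jaccard overlap of class labels; 0 if the seller dataset is too small."""
    if seller.size < buyer.size:  # buyer.size is read as the required dataset size
        return 0.0
    union = buyer.class_labels | seller.class_labels
    if not union:
        return 0.0
    return len(buyer.class_labels & seller.class_labels) / len(union)


def select_sellers(buyer, sellers, k):
    """Return up to k seller records that best match the buyer requirements."""
    ranked = sorted(sellers, key=lambda s: match_score(buyer, s), reverse=True)
    return [s for s in ranked[:k] if match_score(buyer, s) > 0.0]


buyer = Metadata("task", frozenset({"cat", "dog", "bird"}), 1000)
sellers = [
    Metadata("S1", frozenset({"cat", "dog", "bird"}), 5000),
    Metadata("S2", frozenset({"cat", "car"}), 8000),
    Metadata("S3", frozenset({"cat", "dog"}), 500),
]
selected = select_sellers(buyer, sellers, k=2)
```

In this toy run S3 is filtered out for being smaller than the required size, and S1 ranks ahead of S2 because its class labels overlap the buyer's completely.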
- the exchanged metadata does not include actual data samples from the task dataset or seller datasets, but rather only non-sensitive descriptive information about tasks and datasets as required to enable the facilitator entity 20 to match potential seller datasets D S to the buyer entity 10 .
- the coordinator module 22 of the facilitator entity 20 can initiate a data valuation process that includes a first step of sending a buyer-side protocol (BP) 40 for the evaluation process to the buyer entity 10 and a seller-side protocol (SP) 42 to the participating seller entities 30 ( i ).
- BP buyer-side protocol
- SP seller-side protocol
- the SP 42 defines a model training and evaluation process, which includes a specific ML model architecture g and a training algorithm (denoted hereafter as “learn”) that each seller entity 30 ( i ) uses to train respective machine learning models g ei on their local datasets D Si .
- the BP 40 can specify a detailed algorithm for proxy data synthesis, which also includes a model training and evaluation process, for instance, model architecture g and training algorithm learn and other parameters.
- the defined model architecture and training algorithms can be selected for an evaluation process based on the intended prediction task.
- another entity for example the buyer entity 10 , can select the defined model architecture g and training algorithm learn for an evaluation process and convey it to the facilitator in an additional step.
- the same model architecture g and training algorithm learn are specified in both SP 42 and BP 40 for use by the seller entities 30 ( i ) and the buyer entity 10 .
- the model architecture g and training algorithm learn include definitive steps to train a particular machine learning model.
- g may be a two-layer DNN with 50 hidden units in each layer connected using ReLU activations, followed by a final classification layer with C units.
- the training algorithm learn contains comprehensive code for training machine learning models, such as code for randomly initializing the weights of the model instantiated as per the architecture g, code for iteratively updating the weights using, for example, a gradient-based algorithm like Stochastic Gradient Descent, and stopping criteria based on validation error or a fixed number of epochs.
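- As an illustration of the example architecture g (two hidden layers of 50 units with ReLU, then a C-unit classification layer), the following is a minimal sketch of random weight initialization and a forward pass. The function names `init_model` and `forward`, the He-style weight scaling, and the input dimension are assumptions made for this sketch; the weight-update loop of the training algorithm learn is deliberately left out.

```python
import random


def init_model(input_dim, num_classes, hidden=50, seed=0):
    """Randomly initialise g: two hidden layers plus a C-unit output layer."""
    rng = random.Random(seed)
    sizes = [input_dim, hidden, hidden, num_classes]
    layers = []
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        scale = (2.0 / fan_in) ** 0.5  # He-style scaling, an illustrative choice
        weights = [[rng.gauss(0.0, scale) for _ in range(fan_in)]
                   for _ in range(fan_out)]
        biases = [0.0] * fan_out
        layers.append((weights, biases))
    return layers


def forward(layers, x):
    """Return the final-layer activations (logits) of g for one input vector."""
    activation = list(x)
    for index, (weights, biases) in enumerate(layers):
        pre = [sum(w * a for w, a in zip(row, activation)) + b
               for row, b in zip(weights, biases)]
        is_last = index == len(layers) - 1
        # ReLU on the hidden layers, raw scores on the classification layer
        activation = pre if is_last else [max(0.0, v) for v in pre]
    return activation


model = init_model(input_dim=8, num_classes=3)
logits = forward(model, [0.1] * 8)
```

Because the seed fixes the initial weights, a buyer and a seller that apply the same protocol with different seeds obtain differently initialized models of the same architecture.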
- a buyer entity 10 applies the BP 40 for proxy data synthesis to generate a proxy dataset D p in which the task data samples {x k} included in the original task dataset D B are replaced with respective synthetic data samples.
- Each respective seller entity 30 ( i ) applies the SP 42 to train a respective ML model g θi using its respective seller dataset D Si .
- the buyer entity 10 provides the independent variables (i.e., the synthetic data samples) of the proxy dataset D p to each respective seller entity 30 ( i ).
- Each respective seller entity 30 ( i ) then generates a set of respective label predictions {y k(i)} for the proxy dataset D p using its respective ML model g θi (which is trained on its own seller dataset D Si ). Each respective seller entity 30 ( i ) sends its set of respective label predictions {y k(i)} to facilitator entity 20 . Additionally, the buyer entity 10 provides the trusted labels {y k}, unchanged from the original task dataset D B , to the facilitator entity 20 . The facilitator entity 20 then computes a utility for each respective seller entity 30 ( i ) based on a comparison of the trusted labels {y k} to their counterparts in the respective label predictions {y k(i)}. These computed utilities can then be used by the entities to determine which seller datasets are useful for the buyer entity's machine learning task, and to discover a fair monetary value of such datasets for the seller entities.
- a consideration in the overall work process of network 5 is generating the proxy dataset D p in such a manner that the synthetic data samples balance competing utility and data security objectives.
- the utility objective requires that the synthetic data samples tend to produce the same label predictions as the original data points {x k} for all seller models g θi .
- the utility information, computed by the facilitator entity 20 with the synthetic dataset D p is approximately similar to that of the original dataset D B .
- the data security objective requires that the synthetic data samples of proxy dataset D p be sufficiently different from the original task data samples to render them perceptually unintelligible (for example, in the case of data samples that can be observed by humans). Since the individual synthetic data samples are perceptually unintelligible (e.g., they look like random noise), the data-item IP and privacy of individual data items are protected. Additionally, in order to protect statistical IP, the security objective requires preventing ML models trained with architectures other than the evaluation process architecture g from making accurate label predictions for the synthetic data samples, and preventing machine learning models trained using the proxy dataset D P from being effective for inference on the original dataset D B . Because the proxy dataset is ineffective for statistical analysis such as training and inference, statistical IP is also protected.
- buyer entity 10 includes a synthetic data generation module 12 that is configured to generate the proxy dataset D p with a goal of optimizing these competing utility and data security objectives.
- Synthetic data generation module 12 applies a synthetic data generation process that is based on the BP 40 sent by the facilitator entity 20 , to generate a proxy dataset D p .
- FIG. 2 shows a block diagram overview of the synthetic data generation process 200
- FIG. 3 which shows a pseudocode outline (“Algorithm 1”) of the synthetic data generation process 200 .
- the inputs to synthetic data generation process 200 include: (i) the buyer task dataset D B (which can be stored by buyer entity 10 ); and (ii) BP 40 (received from facilitator entity 20 ) that identifies the model architecture “g” and training algorithm learn for training and evaluating ML models and other parameters to be applied by the synthetic data generation process 200 .
- the task dataset D B is used to learn a training set of ML models Θ TR and a validation set of ML models Θ V with random initialization, using the model architecture and training algorithm specified in the BP 40 .
- the network 5 does not allow either the buyer entity 10 or the facilitator entity 20 to have access to either the seller datasets D S or the trained seller ML models g θi (which can embed high-value intellectual property and sensitive private information about the seller's dataset D S ) during the synthetic data generation process.
- the synthetic data generation process 200 relies on simulated access to a distribution of ML models from which the seller ML models g θi are assumed to be sampled. Using this distributional access, the synthetic data generation process uses statistical optimization to synthesize data points that can satisfy the utility and security goals for all members of the distribution.
- the buyer entity uses a finite sample from the distribution for the statistical optimisation. In order to generate the finite sample, it learns a training set of ML models Θ TR by using the buyer task dataset D B with random noise and random initialization to emulate seller ML models g θi of varying ground truth utility with respect to the buyer's ML task. Buyer entity 10 learns a validation set of ML models Θ V in a similar manner to verify how well the synthesized data points generalise to models outside the training set Θ TR . As explained below, the trained sets of ML models Θ TR and Θ V are used during further steps in the synthetic data generation process 200 .
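- A rough sketch of how the training and validation model sets could be produced is given here. It assumes that the "random noise" is label noise applied at a varying rate, which is only one possible reading; `randomized_copy`, `build_model_sets` and `toy_learn` are hypothetical names, and `toy_learn` is a trivial stand-in for the protocol's training algorithm learn.

```python
import random


def randomized_copy(dataset, flip_rate, classes, rng):
    """Copy (x, y) pairs, replacing each label by a random class at the given rate."""
    noisy = []
    for x, y in dataset:
        if rng.random() < flip_rate:
            y = rng.choice(classes)
        noisy.append((x, y))
    return noisy


def build_model_sets(dataset, classes, learn, n_train, n_val, seed=0):
    """Train n_train + n_val emulated seller models on randomized copies."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_train + n_val):
        flip_rate = rng.uniform(0.0, 0.5)    # emulates varying ground-truth utility
        model_seed = rng.randrange(2 ** 31)  # random initialization per model
        noisy = randomized_copy(dataset, flip_rate, classes, rng)
        models.append(learn(noisy, model_seed))
    return models[:n_train], models[n_train:]


def toy_learn(data, seed):
    """Stand-in for `learn`: it only records the seed and the label counts."""
    counts = {}
    for _, y in data:
        counts[y] = counts.get(y, 0) + 1
    return {"seed": seed, "counts": counts}


task = [([float(i)], i % 2) for i in range(20)]
theta_tr, theta_v = build_model_sets(task, [0, 1], toy_learn, n_train=4, n_val=2)
```

The validation models are trained the same way as the training models but are held out, so they can be used to check whether synthesized samples behave consistently on models that the optimization never saw.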
- blocks 208 to 218 are repeated until a respective synthetic data point, consisting of a synthetic data sample and its trusted label y k , is added to the proxy dataset D P for each (x k , y k ) data point included in the original buyer task dataset D B .
- the synthetic data sample of the data point is initialized by sampling from a random distribution N(0,1).
- a solution to an optimization problem is then computed by performing a set of iterations to synthesize a data sample until a stop criterion is reached, either by meeting a loss-stopping criterion or by reaching a defined number (T) of iterations.
- Each iteration includes:
- the gradients are evaluated based on optimizing the following empirical risk minimisation loss objective:
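- One form of such an objective that is consistent with the two terms discussed in this description is sketched here. It is an assumption for illustration, not the equation of the disclosure: the symbols \tilde{x} (the synthetic data sample), \ell (a prediction difference between model outputs), d (a pixel-wise sample distance), \lambda (a trade-off weight) and \Theta_{TR} (the training set of ML models) are notation introduced for this sketch.

```latex
\tilde{x}_k \;=\; \arg\min_{\tilde{x}} \;
\frac{1}{|\Theta_{TR}|} \sum_{\theta_i \in \Theta_{TR}}
\ell\bigl(g_{\theta_i}(\tilde{x}),\, g_{\theta_i}(x_k)\bigr)
\;-\; \lambda \, d(\tilde{x},\, x_k)
```

Minimizing the first term pulls the model outputs for the synthetic sample toward those for the original sample, while the negative sign on the second term rewards a larger distance from the original sample.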
- the first term of the above equation strives to synthesize a data point in D P whose synthetic data sample produces approximately the same output class distribution under g θi as the original sample x k , for all i. Consequently, if the entropy of g θi (x k ), for all i, is sufficiently low, which is expected from a well-trained and confident model, the argmax prediction and the final verdict will be the same. Hence, intuitively, the aggregated utility (across all k) of D p and D B can be expected to be approximately similar.
- the data-item level pixel-wise distance may not necessarily ensure perceptual incomprehensibility. However, due to the nature of the models g θi , which are chosen to be over-parameterized deep neural networks, the above equation has infinitely many solutions, the majority of which are not in the manifold of real images. This is simply because the real image manifold is extremely small relative to the space of all possible images. Hence, the pixel-wise distance term is enough to guide the optimisation away from the small probability of ending up in the manifold of real image solutions. Consequently, if the synthesized images are not in the real image manifold, they are perceptually incomprehensible. Because the images are perceptually incomprehensible, protection can be ensured in terms of privacy, data-item level (image) IP and dataset-level (statistical) IP.
- the proxy dataset D P is updated to include the newly learned synthetic data sample, paired with the trusted label y k , as a data point.
- the blocks 208 to 218 are then repeated until proxy dataset D P is fully synthesized.
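- The per-sample optimization loop described above can be sketched as follows. This is a toy illustration under stated assumptions: the models are hypothetical linear stand-ins rather than deep networks, the prediction difference is a squared difference of final-layer activations, the sample distance is a squared Euclidean distance weighted by an assumed factor `lam`, and only the fixed-iteration stop (T = `steps`) is implemented. The name `synthesize` and all parameter values are illustrative.

```python
import random


def logits(weights, x):
    """Final-layer activations of a linear stand-in model."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def synthesize(x_orig, models, steps=200, lr=0.05, lam=0.01, seed=0):
    """Descend on: mean prediction difference - lam * squared sample distance."""
    rng = random.Random(seed)
    x_syn = [rng.gauss(0.0, 1.0) for _ in x_orig]  # initialise from N(0, 1)
    targets = [logits(w, x_orig) for w in models]
    for _ in range(steps):
        grad = [0.0] * len(x_syn)
        # First gradient: prediction difference across the training models.
        for w, target in zip(models, targets):
            diff = [a - b for a, b in zip(logits(w, x_syn), target)]
            for row, d in zip(w, diff):
                for j, w_j in enumerate(row):
                    grad[j] += 2.0 * d * w_j / len(models)
        # Second gradient: pushes the sample away from the original.
        for j, (s, o) in enumerate(zip(x_syn, x_orig)):
            grad[j] -= 2.0 * lam * (s - o)
        x_syn = [s - lr * g for s, g in zip(x_syn, grad)]
    return x_syn


# Two toy models that only look at the first two input coordinates.
models = [
    [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]],
    [[1.0, 1.0, 0.0, 0.0], [1.0, -1.0, 0.0, 0.0]],
]
x_orig = [0.2, 0.8, 0.5, 0.1]
x_syn = synthesize(x_orig, models)
```

In this toy setting the synthetic sample ends up matching the original on the coordinates the models use, so their outputs agree, while it stays different on the coordinates the models ignore. Over-parameterized networks leave far more such freedom than these linear stand-ins do.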
- the set of synthetic data samples (also referred to as the independent variables) can be provided (without any corresponding labels) through a communications channel (which can be a secure channel) to each of the seller entities 30 ( 1 ) to 30 (M).
- the trusted labels {y k} can be provided (without data samples) through a communications channel (which can be a secure channel) to the facilitator entity 20 .
- seller entities 30 ( 1 ) to 30 (M) each include a respective ML model training module 32 and a trained ML model inference module 34 .
- a process performed at an example i th seller entity 30 ( i ) will now be described with respect to FIG. 4 , according to an example of the disclosure.
- seller entity 30 ( i ) is provided with a SP 42 by facilitator entity 20 .
- the SP 42 defines an ML model architecture g and training algorithm learn (block 402 ).
- Seller entity 30 ( i ) applies the SP 42 using its ML model training module 32 to train an ML model on its own local seller dataset D Si , resulting in trained ML model g θi (block 404 ).
- the seller entity 30 ( i ) receives the proxy dataset D p (only the independent variables) from buyer entity 10 (block 406 ).
- Seller entity 30 ( i ) applies the trained ML model g ⁇ i to the proxy dataset D p using trained ML model inference module 34 to output a set of seller predictions ⁇ (y k(i) ) ⁇ for the set of synthetic data samples ⁇ ( ) ⁇ (block 408 ).
- the set of seller predictions ⁇ (y k(i) ) ⁇ are then sent to the facilitator entity 30 (block 410 ).
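The seller-side steps of blocks 404 to 410 can be illustrated with a minimal sketch. It assumes a nearest-centroid classifier as a stand-in for the ML model architecture g and training algorithm learn defined by the SP 42; the function names `train_model` and `predict_labels` and the toy arrays are hypothetical and are not taken from the disclosure.

```python
import numpy as np

def train_model(features, labels):
    """Stand-in for the SP's training step (block 404): one centroid per class."""
    classes = np.unique(labels)
    centroids = np.stack([features[labels == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict_labels(model, samples):
    """Block 408: label each unlabeled proxy sample with the nearest centroid's class."""
    classes, centroids = model
    dists = np.linalg.norm(samples[:, None, :] - centroids[None, :, :], axis=2)
    return classes[np.argmin(dists, axis=1)]

# Hypothetical local seller dataset: two classes in a 2-D feature space.
seller_x = np.array([[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.9, 1.1]])
seller_y = np.array([0, 0, 1, 1])
model = train_model(seller_x, seller_y)

# Proxy samples arrive without labels (block 406).
proxy_x = np.array([[0.1, 0.2], [0.8, 1.0]])

# Only these predictions leave the seller (block 410); the dataset stays local.
seller_predictions = predict_labels(model, proxy_x)
```

In this sketch the seller's own data and model never leave the seller; only `seller_predictions` would be forwarded to the facilitator entity.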
- the generated predictions {(y k(i) )} should be close to those that would be generated for the original buyer task dataset D B .
- the synthetic data samples provide security (intellectual property protection and privacy) that the original task dataset D B cannot provide.
- facilitator entity 20 includes a utility computation module 24 .
- a process performed by the utility computation module 24 will now be described with respect to FIG. 5 , according to an example of the disclosure.
- utility computation module 24 computes a respective individual utility (U i ) as an evaluation metric for each of the respective sets of seller predictions {(y k(i) )} (and hence their corresponding seller datasets D si ) (block 504 ).
- the individual utility U i can be computed using a utility function that is based on a comparison of the seller predictions {(y k(i) )} to the ground truth task labels {(y k )}. In the case of a classification task, this can be classification accuracy. The higher the accuracy, the higher the individual utility U i for the seller dataset D si .
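For a classification task, the individual utility can be sketched as plain accuracy. This is a minimal illustration under that assumption; the function name `individual_utility` is hypothetical.

```python
import numpy as np

def individual_utility(seller_predictions, trusted_labels):
    """Fraction of proxy samples whose predicted label matches the trusted label."""
    preds = np.asarray(seller_predictions)
    labels = np.asarray(trusted_labels)
    if preds.shape != labels.shape:
        raise ValueError("predictions and labels must have the same shape")
    return float(np.mean(preds == labels))

# Three of four predictions agree with the trusted labels.
u_i = individual_utility([0, 1, 1, 0], [0, 1, 0, 0])
```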
- a generic representation of a possible individual utility function is:
- the utility computation module 24 can compute a marginal utility contribution of an individual seller dataset D si with respect to all of the other seller datasets (block 506 ). This can give an indication of how much utility is improved (on average) if the prediction results for an individual seller dataset D si are added to all possible subsets of the prediction results of the other seller datasets.
- a higher marginal utility for a particular seller dataset means that the dataset adds unique, independent information with respect to the other seller datasets. Such a dataset can be useful for a buyer's target ML task to get relatively hard examples correctly classified.
- Such an analysis can be performed, for example, by applying Shapley value analysis.
- a generic representation of a possible marginal utility function for the i th seller is:
- U(s) is the combined best-case utility of all seller predictions in the subset s of all sellers S; U(s ∪ {i}) is the combined best-case utility of all seller datasets in the set {s+i}; and S is the set of all M seller datasets being considered by the utility computation module 24 .
- the best-case utility U(s) is computed such that if, for a particular data sample, the prediction is correct for at least one seller in s, that data sample is regarded as correctly classified.
- This information can be used by participating entities to objectively assess the utility and value of the seller datasets.
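The marginal utility computation can be sketched with the standard Shapley value weighting, in which each marginal gain U(s ∪ {i}) − U(s) is weighted by |s|!(M − |s| − 1)!/M!. The disclosure's own formula is not reproduced here, so this weighting is an assumption, as are the names `best_case_utility` and `shapley_values` and the toy predictions.

```python
from itertools import combinations
from math import factorial
import numpy as np

def best_case_utility(preds, labels, subset):
    """A sample counts as correct if at least one seller in `subset` predicts it correctly."""
    if not subset:
        return 0.0
    correct_any = np.zeros(len(labels), dtype=bool)
    for i in subset:
        correct_any |= preds[i] == labels
    return float(correct_any.mean())

def shapley_values(preds, labels):
    """Exact Shapley value of each seller's predictions under the best-case utility."""
    m = len(preds)
    values = []
    for i in range(m):
        others = [j for j in range(m) if j != i]
        phi = 0.0
        for r in range(m):  # r = size of the subset s that excludes seller i
            weight = factorial(r) * factorial(m - r - 1) / factorial(m)
            for s in combinations(others, r):
                gain = (best_case_utility(preds, labels, s + (i,))
                        - best_case_utility(preds, labels, s))
                phi += weight * gain
        values.append(phi)
    return values

# Hypothetical trusted labels and predictions from three sellers.
labels = np.array([0, 1, 1, 0])
preds = [np.array([0, 1, 0, 0]),   # seller 0: samples 0, 1, 3 correct
         np.array([0, 1, 0, 1]),   # seller 1: samples 0, 1 correct
         np.array([1, 0, 1, 1])]   # seller 2: only sample 2 correct
phi = shapley_values(preds, labels)
```

The enumeration is still exponential in the number of sellers, but each subset is evaluated on stored predictions only, so no model is retrained per subset. In the toy data, seller 2 has the lowest individual accuracy yet ties with seller 1 on marginal utility because it alone classifies sample 2 correctly.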
- coordinator module 22 of the facilitator entity 20 can be configured to intermediate the distribution of the data valuation metrics and provide an escrow service to facilitate payment for and exchange of datasets.
- FIG. 6 illustrates a price discovery and deal finalization process that can be performed among the participating entities according to an example embodiment.
- each seller entity 30 ( i ) can be provided, by the facilitator entity 20 , with the evaluation metrics for its own respective seller dataset D si . Based on this information, each seller entity 30 ( i ) can set a Willing-To-Sell (WTS) cost for its respective seller dataset D si , and provide that information to the facilitator entity 20 .
- the facilitator entity 20 can assemble the WTS data from participating seller entities 30 ( 1 ) to 30 (M) together with the evaluation metrics for the respective seller datasets D s(1) to D s(M) and provide the assembled information to buyer entity 10 (block 604 ).
- Buyer entity 10 can analyze the received WTS and seller data valuation metrics and then send an indication (e.g., a buy list) to the facilitator entity 20 identifying the seller datasets D s that the buyer entity 10 wants to acquire (block 606 ).
- the facilitator entity 20 can then initiate a closing protocol between the buyer entity 10 and each seller entity 30 ( i ) that is included in the buy list, which may include additional price negotiation (block 608 ).
- the facilitator entity 20 can facilitate a transfer of assets and payments (block 610 ) such as: (1) facilitator entity 20 receives payment in escrow for the dataset D si from the buyer entity 10 and informs the seller entity 30 ( i ) of the received payment; (2) seller entity 30 ( i ) sends dataset D si directly to the buyer entity 10 through a secure channel, thereby completing a predetermined transfer requirement; (3) buyer entity 10 confirms it has received dataset D si by sending acknowledgments to facilitator entity 20 and seller entity 30 ( i ); and (4) facilitator entity 20 transfers payment to seller entity 30 ( i ).
- the facilitator entity 20 may in some examples deduct a commission fee from the payment as compensation for services.
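The four-step transfer of block 610, including the optional commission, can be sketched as a small state machine. The class name `EscrowDeal`, the state names and the `commission_rate` parameter are hypothetical illustrations and are not part of the disclosure.

```python
from dataclasses import dataclass, field

@dataclass
class EscrowDeal:
    price: float
    commission_rate: float = 0.0   # hypothetical facilitator fee, as a fraction of price
    state: str = "OPEN"
    log: list = field(default_factory=list)

    def _advance(self, expected, new_state, event):
        # Enforce the ordering of the steps: each one is valid in exactly one state.
        if self.state != expected:
            raise RuntimeError(f"cannot record '{event}' while in state {self.state}")
        self.state = new_state
        self.log.append(event)

    def buyer_deposits(self, amount):
        """Step 1: the facilitator holds the buyer's payment in escrow."""
        if amount != self.price:
            raise ValueError("deposit must equal the agreed price")
        self._advance("OPEN", "FUNDED", "payment held in escrow")

    def seller_sends_dataset(self):
        """Step 2: the seller sends the dataset directly to the buyer."""
        self._advance("FUNDED", "SENT", "dataset sent to buyer")

    def buyer_acknowledges(self):
        """Step 3: the buyer confirms receipt to facilitator and seller."""
        self._advance("SENT", "ACKNOWLEDGED", "receipt acknowledged")

    def release_payment(self):
        """Step 4: the facilitator pays the seller, net of any commission."""
        self._advance("ACKNOWLEDGED", "CLOSED", "payment released")
        fee = self.price * self.commission_rate
        return self.price - fee, fee

deal = EscrowDeal(price=100.0, commission_rate=0.05)
deal.buyer_deposits(100.0)
deal.seller_sends_dataset()
deal.buyer_acknowledges()
payout, fee = deal.release_payment()
```

Rejecting out-of-order calls means, for instance, that payment cannot be released before the buyer has acknowledged receipt.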
- the synthetic data generation process 200 in combination with the use of a common ML model architecture g and training algorithm learn for the randomly initialized ML model sets ΩTR , ΩV trained by the seller entity 20 and the respective seller ML models gθi can provide one or more of the following advantageous features in at least some applications:
- the utility metrics computed based on predictions made for the synthetic proxy dataset D P can be approximately the same as if the original dataset D B was processed by the seller ML models. Thus, in at least some applications, accurate utility information can be obtained with the synthetic proxy dataset D P without either the seller entities or the facilitator entity ever having access to the actual data samples {(x k )} of the buyer task dataset D B .
- ML models fθ trained with the proxy dataset D P will not perform well when applied to the original buyer task dataset D B .
- proxy dataset D P is not effective in training models which perform well on the real dataset, hence protecting the statistical training information.
- the data samples of the proxy dataset D P will be at a high distance (in the pixel space) from the original data samples of the buyer task dataset D B .
- the distance measured in semantic space using image quality assessment (IQA) metrics like FID, SSIM, FSIM, Content Loss, etc., will be high.
- new dataset images in the proxy dataset will protect the data-item IP and attribute privacy (visual information) of the original images in the buyer task dataset.
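The pixel-space separation between proxy and original samples can be checked with a simple distance computation. This is a minimal sketch that assumes the Euclidean (L2) norm over flattened pixels; the disclosure does not fix a particular norm here, and the function name `mean_pixel_distance` is hypothetical.

```python
import numpy as np

def mean_pixel_distance(original, proxy):
    """Average per-sample L2 distance between paired original and proxy images."""
    original = np.asarray(original, dtype=float)
    proxy = np.asarray(proxy, dtype=float)
    if original.shape != proxy.shape:
        raise ValueError("original and proxy batches must have the same shape")
    diffs = (proxy - original).reshape(len(original), -1)  # one row per image
    return float(np.linalg.norm(diffs, axis=1).mean())

# Two 2x2 images: all-zero originals against all-one proxies.
distance = mean_pixel_distance(np.zeros((2, 2, 2)), np.ones((2, 2, 2)))
```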
- the data valuation and exchange network 5 can enable a secure, private and fair data trading network that can achieve, in at least some application scenarios, one or more of the following properties: (i) Security: the network includes protections against leakage of the proprietary rights (both data-item level and dataset level IP) of the buyer's and sellers' dataset assets; (ii) Privacy: the network can ensure that buyers and sellers do not lose the privacy of their data items (attribute level privacy of visual images); (iii) Computational efficiency: the overall network can be computationally efficient with regard to utility estimation of seller datasets (no need for computationally inefficient encryption or for training an exponentially large number of models for Shapley value estimation); (iv) Versatility: the network can be applied in the context of high-dimensional data and can work for a variety of machine learning models (in contrast to existing solutions, for example differential privacy based approaches); (v) Fairness: the network is fair with respect to the sellers' capability to solve the buyer task, and enables an accurate estimate of the seller dataset value (performance). Also, utility information can be transparent
- the systems and methods disclosed herein assume no trust between sellers, buyers, and the platform. This is addressed in two ways.
- the network 5 provides protection due to limited data accessibility for each party.
- the buyer entity 10 only sends the sanitized independent variable of the proxy data to seller entities 30 ( 1 ) to 30 (M) for ranking computation based on utility.
- the facilitator entity 20 only receives the seller entity predictions and, from the buyer entity 10 , the ground truth task labels in order to compute rankings (individual and marginal utilities).
- Seller entities 30 ( 1 ) to 30 (M) do not share their respective seller datasets with any other parties.
- the data synthesis process 200 converts the original buyer task dataset D B into proxy dataset D P which provides protection against intellectual property theft and privacy violations by hiding information.
- the data sanitization that is effected by data synthesis process 200 causes image obfuscation allowing attribute level privacy, resulting in images that look like random noise such that an adverse party can't obtain any visually identifiable information (e.g., faces and details of data can be hidden).
- the disclosed systems and methods do not rely on inefficient encryption or exponential model training.
- the data synthesis process 200 solves an iterative optimization problem.
- the iterative optimization can be further optimized using a one-shot process with a specifically trained neural network.
- the disclosed systems and methods can be used with standard deep learning pipelines and unstructured datasets like MNIST, CIFAR-10, among other examples. This is in contrast to existing prior art, such as differentially private generative adversarial network approaches, which perform poorly on high dimensional datasets.
- Although image classification has been mentioned above, the disclosed systems and methods described herein, including data synthesis process 200 , are independent of the underlying learning problem and are easily extendible to other learning problems like object detection, natural language processing, etc.
- the disclosed systems and methods offer transparency that supports fairness for sellers.
- the optimization used for data synthesis has an explicit term which ensures data utility for seller models is accurate (close to utility with original data). Hence, an accurate estimate of the utility of each seller dataset can be provided.
- the use of Shapley value analysis to compute marginal utility gain for each seller dataset can give an indication of the importance of a particular seller dataset with respect to other seller datasets, which can be very useful in price discovery for sellers.
- seller dataset utility is computed by the facilitator entity, hence removing the possibility of seller entities lying about their utility.
- both the marginal utility gain and individual utility gain are computed for each seller dataset to determine a holistic picture of each seller dataset's value to the buyer task. This utility information is shared with both sellers and buyers for a fair and transparent price discovery through negotiations.
- FIG. 7 illustrates an example of a processing system 700 that may be used to implement a respective entity (for example a buyer entity 10 , facilitator entity 20 and/or a seller entity 30 ( i )) in the network 5 .
- the processing system 700 includes one or more processors 710 .
- the one or more processors 710 may include a central processing unit (CPU), a graphical processing unit (GPU), a tensor processing unit (TPU), a neural processing unit (NPU), a digital signal processor, and/or another computational element.
- the processor(s) 710 are coupled to an electronic storage(s) 720 and to one or more input and output (I/O) interfaces or devices 730 such as network interfaces, user output devices such as displays, user input devices such as touchscreens, and so on.
- the electronic storage 720 may include any suitable volatile and/or non-volatile storage and retrieval device(s), including for example flash memory, random access memory (RAM), read only memory (ROM), hard disk, optical disc, subscriber identity module (SIM) card, memory stick, secure digital (SD) memory card, and other state storage devices.
- the electronic storage 720 of the processing system 700 stores instructions 722 (executable by the processor(s) 710 ) and supporting data 724 for implementing one or more of the various modules described above.
- Although the present disclosure is described, at least in part, in terms of methods, a person of ordinary skill in the art will understand that the present disclosure is also directed to the various components for performing at least some of the aspects and features of the described methods, be it by way of hardware components, software or any combination of the two. Accordingly, the technical solution of the present disclosure may be embodied in the form of a software product.
- a suitable software product may be stored in a pre-recorded storage device or other similar non-volatile or non-transitory computer readable medium, including DVDs, CD-ROMs, USB flash disk, a removable hard disk, or other storage media, for example.
- the software product includes instructions tangibly stored thereon that enable a processing device (e.g., a personal computer, a server, or a network device) to execute examples of the methods disclosed herein.
Description
- ΩTR, ΩV—training set of ML models, and evaluation set of ML models, respectively;
- DB, DP—original task dataset and synthesized proxy dataset, respectively;
- (xk, yk), (, yk)—original and synthesized kth data points, respectively; and
- fθ—trained ML model with parameters θ, selected from training set ΩTR of ML models.
- (i) block 210: randomly sample the training ML model set ΩTR to select a subset of training ML models {(fθ)}, and use each of the training ML models in the subset {(fθ)} to generate a respective projection (e.g., a label prediction) for the synthetic data sample and for the real task data sample xk;
- (ii) block 212: compute a gradient that combines (a) a utility gradient based on a loss that corresponds to the aggregate differences between the final-layer activations for the synthetic data sample relative to the final-layer activations for the real task data sample xk; and (b) a security gradient based on a difference between the synthetic data sample and the real task data sample xk in the pixel space, for example, as per the following equation:
where the first term is the utility gradient and the second term is the security gradient (the optimization objective for the gradient is discussed in greater detail below);
- (iii) block 214: update the synthetic data sample based on the computed gradient;
- (iv) block 216: determine if a stopping criterion (i.e., a loss-stopping criterion or a maximum number of iterations) has been reached in respect of the synthetic data sample. In example embodiments, the loss-stopping criterion is computed using an unseen set of models ΩV which have not been observed during the optimisation (computation of gradients and updates). Specifically, the loss-stopping criterion is calculated based on an analysis of the predicted labels generated by the set of validation models ΩV for the recently updated synthetic data sample and the task sample xk (see Algorithm 1 lines 12, 13, 14 of FIG. 3 ). In some example embodiments, the loss-stopping criterion can be reached when further iterations do not result in a defined threshold improvement to a loss computed based on a defined loss objective.
where the first term represents a utility loss that seeks to minimize differences between the sample ML models' final-layer activations for the synthetic and true data samples, and the second term represents a security loss that seeks to maximize a difference between the synthetic and true data samples in the pixel space. For each (xk, yk)∈DB, the first term of the above equation strives to synthesize a data point with label yk in DP that produces approximately the same output class distribution under gθi as xk does, for all i. Consequently, if the entropy of gθi(xk), for all i, is sufficiently low, which is expected from a well-trained and confident model, the argmax prediction and the final verdict will be the same. Hence, intuitively, the aggregated utility (across all k) of DP and DB can be expected to be approximately similar. On the other hand, the data-item level pixel-wise distance (in the case of image samples) may not necessarily ensure perceptual incomprehensibility. However, due to the subjective nature of the gθi, which are chosen to be over-parameterized deep neural networks, the above equation has infinitely many solutions, the majority of which are not in the manifold of real images. This is simply because the real image manifold is extremely small relative to the space of all possible images. Hence, the pixel-wise distance term is enough to guide the optimisation away from the small probability of ending up in the manifold of real image solutions. Consequently, if the synthesized images are not in the real image manifold, they are perceptually incomprehensible. Because the images are perceptually incomprehensible, we can safely ensure protection in terms of privacy, data-item level (image) IP and dataset-level (statistical) IP.
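The trade-off between the utility term and the security term, together with the loop of blocks 210 to 216, can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the models are linear maps sharing one feature matrix `P` (a stand-in for models trained with a common architecture), both terms use squared Euclidean distances, and the names `synthesize`, `omega_tr`, `omega_v`, `lam`, `lr` and `tol` are hypothetical rather than taken from Algorithm 1.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, FEAT, CLASSES = 64, 16, 10

# Shared feature map: a toy proxy for models that learned similar features.
P = rng.normal(size=(FEAT, DIM)) / np.sqrt(DIM)

def make_models(n):
    """Each model is a linear map (random head composed with the shared features)."""
    return [rng.normal(size=(CLASSES, FEAT)) / np.sqrt(FEAT) @ P for _ in range(n)]

omega_tr = make_models(8)   # training model set, used for gradients
omega_v = make_models(4)    # validation model set, used only for stopping

def utility_loss(models, x_syn, x_real):
    """Mean squared gap between final-layer activations for the two samples."""
    return float(np.mean([np.sum((W @ (x_syn - x_real)) ** 2) for W in models]))

def synthesize(x_real, lam=0.002, lr=0.05, max_iters=2000, tol=1e-4, subset=3):
    x_syn = rng.uniform(size=x_real.shape)   # random starting point
    for _ in range(max_iters):
        # Block 210: sample a subset of the training models.
        idx = rng.choice(len(omega_tr), size=subset, replace=False)
        diff = x_syn - x_real
        # Block 212: utility gradient pulls activations together ...
        util_grad = np.mean(
            [2 * omega_tr[i].T @ (omega_tr[i] @ diff) for i in idx], axis=0)
        # ... while the security gradient (of -lam * ||diff||^2) pushes pixels apart.
        sec_grad = -2 * lam * diff
        # Block 214: gradient step on the synthetic sample.
        x_syn = x_syn - lr * (util_grad + sec_grad)
        # Block 216: stop once the unseen validation models agree closely enough.
        if utility_loss(omega_v, x_syn, x_real) < tol:
            break
    return x_syn

x_real = np.random.default_rng(1).uniform(size=DIM)
x_syn = synthesize(x_real)
```

In this toy setting the synthetic sample ends up with nearly the same activations as the real one while remaining far from it in pixel space, because the models are insensitive to directions outside the shared feature map. Real over-parameterized networks are not linear, so the sketch only conveys the shape of the optimisation, not its behaviour on images.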
Where the utility function can be any of a number of standard functions for assigning a comparison value to two variables, such as classification accuracy.
Where U(s) is the combined best-case utility of all seller predictions in the subset s of all sellers S; U(s ∪ {i}) is the combined best-case utility of all seller datasets in the set {s+i}; and S is the set of all M seller datasets being considered by the utility computation module 24. The best-case utility U(s) is computed such that if, for a particular data sample, the prediction is correct for at least one seller in s, that data sample is regarded as correctly classified.
Claims (20)
Priority Applications (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US17/712,952 US12554871B2 (en) | 2022-04-04 | 2022-04-04 | Systems, methods, and computer-readable media for secure and private data valuation and transfer |
| CN202380014041.3A CN118176500A (en) | 2022-04-04 | 2023-04-04 | Systems, methods, and computer-readable media for secure and private data evaluation and transfer |
| PCT/CN2023/086139 WO2023193703A1 (en) | 2022-04-04 | 2023-04-04 | Systems, methods, and computer-readable media for secure and private data valuation and transfer |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US17/712,952 US12554871B2 (en) | 2022-04-04 | 2022-04-04 | Systems, methods, and computer-readable media for secure and private data valuation and transfer |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| US20230315885A1 (en) | 2023-10-05 |
| US12554871B2 (en) | 2026-02-17 |
Family
ID=88194350
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US17/712,952 Active 2044-04-30 US12554871B2 (en) | 2022-04-04 | 2022-04-04 | Systems, methods, and computer-readable media for secure and private data valuation and transfer |
Country Status (3)
| Country | Link |
|---|---|
| US (1) | US12554871B2 (en) |
| CN (1) | CN118176500A (en) |
| WO (1) | WO2023193703A1 (en) |
Families Citing this family (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US12306938B2 (en) | 2023-02-16 | 2025-05-20 | Capital One Services, Llc | Spurious-data-based detection related to malicious activity |
| US12393681B2 (en) | 2023-02-16 | 2025-08-19 | Capital One Services, Llc | Generation of effective spurious data for model degradation |
| US12395529B2 (en) * | 2023-02-16 | 2025-08-19 | Capital One Services, Llc | Layered cybersecurity using spurious data samples |
Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2017222902A1 (en) * | 2016-06-22 | 2017-12-28 | Microsoft Technology Licensing, Llc | Privacy-preserving machine learning |
| US20200193309A1 (en) * | 2018-12-13 | 2020-06-18 | Diveplane Corporation | Synthetic Data Generation in Computer-Based Reasoning Systems |
| US20210256383A1 (en) * | 2020-02-13 | 2021-08-19 | Northeastern University | Computer-implemented methods and systems for privacy-preserving deep neural network model compression |
| US20220067450A1 (en) * | 2020-09-01 | 2022-03-03 | International Business Machines Corporation | Determining system performance without ground truth |
| US11507836B1 (en) * | 2019-12-20 | 2022-11-22 | Apple Inc. | Federated learning using local ground truth estimation |
| US11544406B2 (en) * | 2020-02-07 | 2023-01-03 | Microsoft Technology Licensing, Llc | Privacy-preserving data platform |
Family Cites Families (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20180246774A1 (en) * | 2017-02-28 | 2018-08-30 | NURV Ltd. | Intelligent Networked Architecture for Real-Time Remote Events Using Machine Learning |
| CN109376549B (en) * | 2018-10-25 | 2021-09-10 | 广州电力交易中心有限责任公司 | Electric power transaction big data publishing method based on differential privacy protection |
| WO2020204690A1 (en) * | 2019-03-29 | 2020-10-08 | Trace Blue Sdn. Bhd. | Data brokerage and valuation system and method |
| CN113919886A (en) * | 2021-11-11 | 2022-01-11 | 重庆邮电大学 | Data characteristic combination pricing method and system based on summer pril value and electronic equipment |
- 2022-04-04: US application US17/712,952, patent US12554871B2 (en), status Active
- 2023-04-04: WO application PCT/CN2023/086139, publication WO2023193703A1 (en), status Ceased
- 2023-04-04: CN application CN202380014041.3A, publication CN118176500A (en), status Pending
Non-Patent Citations (48)
| Title |
|---|
| Abufadda (N.P.L "A Survey of Synthetic Data Generation for Machine Learning"), Jan. 17, 2022 (Year: 2022). * |
| Abufadda, A Survey of Synthetic Data Generation for Machine Learning. Dec. 23, 2021 (Year: 2021). * |
| Azcoitia, Santiago Andrés, et al. "Try Before You Buy: A practical data purchasing algorithm for real-world data marketplaces." arXiv preprint arXiv:2012.08874 2020. |
| Cao, Tianshi, et al. "Don't Generate Me: Training Differentially Private Generative Models with Sinkhorn Divergence." Advances in Neural Information Processing Systems 34 2021. |
| Carlini, Nicholas, et al. "An Attack on InstaHide: Is Private Learning Possible with Instance Encoding ?. " arXiv preprint arXiv:2011.05315 2020. |
| Chen, Dingfan et al, Gs-wgan: A gradient-sanitized approach for learning differentially private generators. arXiv preprint arXiv:2006.08265 2020. |
| Chen, Lingjiao et al., "Towards model-based pricing for machine learning in a data marketplace." Proceedings of the 2019 International Conference on Management of Data. 2019. |
| Fernandez, RC et al. "Data market platforms: Trading data assets to solve data problems." Proceedings of the VLDB Endowment 13.12 2020. |
| Ghiassi, TrustNet: Learning from Trusted Data Against (A)symmetric Label Noise, Jul. 13, 2020 (Year: 2020). * |
| Ghorbani, Amirata et al. "Data shapley: Equitable valuation of data for machine learning." arXiv preprint arXiv:1904.02868 2019. |
| Halevi, Shai et al., "Algorithms in helib." Annual Cryptology Conference. Springer, Berlin, Heidelberg 2014. |
| Harder, Frederik et al., "DP-MERF: Differentially Private Mean Embeddings with RandomFeatures for Practical Privacy-preserving Data Generation." International Conference on Artificial Intelligence and Statistics. PMLR 2021. |
| Heckman, Judd Randolph et al. "A pricing model for data markets." iConference 2015 Proceedings 2015. |
| Huang, Yangsibo, et al. "Instahide: Instance-hiding schemes for private distributed learning." International Conference on Machine Learning. PMLR 2020. |
| Jordon, James et al., "PATE-GAN: Generating synthetic data with differential privacy guarantees." International conference on learning representations. 2018. |
| Koutsos, Vlasis, et al. "Agora: A privacy-aware data marketplace." 2020 IEEE 40th International Conference on Distributed Computing Systems (ICDCS). IEEE, 2020. |
| Liu, Jinfei, et al. "Dealer: an end-to-end model marketplace with differential privacy." Proceedings of the VLDB Endowment 14.6 2021. |
| Liu, Zhijian, et al. "DataMix: Efficient Privacy-Preserving Edge-Cloud Inference." Computer Vision-ECCV 2020: 16th European Conference, Glasgow, UK, Aug. 23-28, 2020, Proceedings, Part XI 16. Springer International Publishing 2020. |
| Near, Differentially Private Synthetic Data, May 3, 2021 (Year: 2021). * |
| Pei, J. "A Survey on Data Pricing: from Economics to Data Science." arXiv preprint arXiv:2009.04462 2020. |
| Ruoxi Jia, et al. "Towards efficient data valuation based on the shapley value." arXiv preprint arXiv:1902.10275 Aug. 2020. |
| Yang, Jian et al. "Big data market optimization pricing model based on data quality." Complexity 2019 2019. |
| Yu, Haifei et al. "Data pricing strategy based on data quality." Computers & Industrial Engineering 112 2017. |
| Zhang, Hongyi, et al. "mixup: Beyond empirical risk minimization." arXiv preprint arXiv:1710.09412 2017. |
Also Published As
| Publication number | Publication date |
|---|---|
| US20230315885A1 (en) | 2023-10-05 |
| CN118176500A (en) | 2024-06-11 |
| WO2023193703A1 (en) | 2023-10-12 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | FEPP | Fee payment procedure | Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
| | AS | Assignment | Owner name: HUAWEI CLOUD COMPUTING TECHNOLOGIES CO., LTD., CHINA; Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:SINGH, GURSIMRAN;AYUB, AHNAF TAZWAR;WANG, CHENDI;AND OTHERS;SIGNING DATES FROM 20220425 TO 20220504;REEL/FRAME:060136/0711 |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: NON FINAL ACTION MAILED |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: ADVISORY ACTION COUNTED, NOT YET MAILED |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: ADVISORY ACTION MAILED |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: ALLOWED -- NOTICE OF ALLOWANCE NOT YET MAILED; Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT RECEIVED; Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT VERIFIED |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT VERIFIED |
| | STCF | Information on status: patent grant | Free format text: PATENTED CASE |