WO2023221888A1 - Method, Apparatus and System for Training a Model - Google Patents

Method, Apparatus and System for Training a Model

Info

Publication number
WO2023221888A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
representative
model
training
actual
Prior art date
Application number
PCT/CN2023/093818
Other languages
English (en)
French (fr)
Inventor
吕灵娟
Original Assignee
索尼集团公司 (Sony Group Corporation)
吕灵娟
Priority date
Filing date
Publication date
Application filed by 索尼集团公司 (Sony Group Corporation) and 吕灵娟
Publication of WO2023221888A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules, to a system of files or objects, e.g. a local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes

Definitions

  • the present disclosure relates generally to privacy protection, and specifically to privacy protection during model training.
  • neural network models have been widely deployed on various systems and devices, including edge computing devices, for real-time interaction with users.
  • since neural network models contain many parameters, they usually require a large amount of data and computation to train, and most commercial edge computing devices do not support model training of high computational complexity.
  • one technical route is to collect data to a server with high computing power for training.
  • data sharing may lead to unpredictable privacy leaks.
  • the existing privacy laws of many countries do not allow data to leave the local premises.
  • Another technical route is to use actual data for model training locally.
  • fine-tuning is performed on the model downloaded from the cloud to improve the performance of the model on actual data.
  • model training consumes too many resources and is therefore not suitable for low-power edge computing devices.
  • Privacy processing includes data encryption and privacy attribute decoupling.
  • the former uses traditional numerical encryption methods to encrypt pictures or other data samples to ensure data availability, but it also brings a high computational burden and is not suitable for low-power devices.
  • the latter requires training a feature extraction model with decoupling capability in advance and learning, in a supervised way, to remove the private information contained in the features during training; on the one hand this relies on privacy policies defined in advance, and on the other hand it may also remove information that is important for the learning task, reducing the quality of the features.
  • the method includes: obtaining an approximate data set composed of open world data that is similar to an actual data set; and training a model using the approximate data set.
  • the data processing method includes: obtaining an actual data set; receiving a representative data sample set of the open world data set; performing feature matching between the representative data samples and the actual data; and returning the matching results between the representative data samples and the actual data.
  • a method of generating a model includes performing steps of the method for training a model according to an embodiment of the present disclosure to generate the model.
  • the apparatus includes a training data acquisition module configured to acquire an approximate data set composed of open world data that is similar to the actual data set; and a training module configured to use the approximate data set to train the model.
  • the data processing device includes a data acquisition module configured to obtain an actual data set; an interaction module configured to receive a representative data sample set of the open world data set; and a feature matching module configured to perform feature matching between the representative data samples and the actual data; wherein the interaction module is further configured to return the matching results between the representative data samples and the actual data.
  • a system for training a model includes a training device according to an embodiment of the present disclosure; and a data processing device according to an embodiment of the present disclosure.
  • Yet another aspect of the present disclosure relates to a computer-readable storage medium storing one or more instructions.
  • the one or more instructions when executed by the processor, may cause the processor to perform steps of methods according to embodiments of the present disclosure.
  • Yet another aspect of the present disclosure relates to a computer program product including one or more instructions.
  • the one or more instructions when executed by the processor, may cause the processor to perform steps of methods according to embodiments of the present disclosure.
  • FIG. 1A is a flowchart illustrating an example of steps of a method for training a model according to an embodiment of the present disclosure.
  • FIG. 1B is a flowchart illustrating an example of sub-steps of a step of obtaining an approximate data set in accordance with an embodiment of the present disclosure.
  • FIG. 1C is a flowchart illustrating an example of sub-steps of the step of screening representative data samples in accordance with an embodiment of the present disclosure.
  • FIG. 2 is a flowchart illustrating an example of steps of a data processing method according to an embodiment of the present disclosure.
  • FIG. 3 is a schematic diagram illustrating an example of the configuration of a system for training a model and a training device and a data processing device therein according to an embodiment of the present disclosure, illustrating the main functional modules that make up the device and information interaction.
  • FIGS. 4A-4C are schematic diagrams illustrating multiple learning scenarios of a scheme for training a model according to embodiments of the present disclosure.
  • FIG. 5 illustrates an example block diagram of a computer that may be implemented as a training device, a data processing device, or a system for training a model according to an embodiment of the present disclosure.
  • a model trained using local actual data can be more accurately applied to specific local application scenarios.
  • using actual data to perform model training directly on the local device is not suitable for low-power computing devices (local edge devices such as sensors), while collecting actual data to computing devices with high computing power (such as cloud servers) for training may lead to unpredictable privacy leaks.
  • models trained directly using actual data may leak private information in the actual data. For example, in some cases, one can determine whether target data was used to train the model by observing the model's output on the target data, thereby obtaining membership privacy information about the target data. More specifically, if face image data captured by a camera is directly used for model training, the trained model may be used maliciously, for example by using the model output to infer whether someone has ever entered the camera's observation range, such as an unmanned supermarket.
  • the inventors also realized that increasing the amount of training data is conducive to completing effective learning to obtain a high-precision model.
  • existing privacy-preserving learning methods usually limit the sharing of data to a certain extent.
  • the applicant proposes in this disclosure to sample a large number of data samples that match actual data from massive and diverse open world data, and train a model on these non-privacy-sensitive data samples.
  • the actual data collected locally is only used to match data locally and is not used by other modules.
  • the model trained based on a large amount of open world data will not leak the privacy information of the actual data, and can ensure the practical application value of the model at the same time.
  • the disclosed solution is suitable for all scenarios that require privacy protection for the actual data used to train the model and require a large amount of data, such as smart cities, smart supermarkets, etc.
  • the solution of the present disclosure can introduce the timing information of multiple frames of the surveillance video so as to be suitable for privacy protection of user action trajectories, for example.
  • the solution of the present disclosure is also suitable for scenarios in which high-energy-consuming learning tasks are transferred from edge devices with low computing power to servers with high computing power.
  • Figure 1A illustrates a flowchart of an example of steps of a method for training a model according to an embodiment of the present disclosure.
  • FIGS. 1B-1C illustrate examples of flowcharts of sub-steps of some steps of a method for training a model according to an embodiment of the present disclosure.
  • the method for training a model according to embodiments of the present disclosure may be executed by any device including a processing device, for example, may be executed by a high computing power server (such as a cloud server).
  • a method 100 for training a model may mainly include the following steps:
  • in step 110, an approximate data set composed of open world data that is similar to the actual data set is obtained; and
  • in step 120, the approximate data set is used to train the model.
  • open-world data can be understood as public data resources that can be legally obtained from any channel, including Internet pictures, public data sets, etc.
  • an approximate data set composed of open world data and similar to the actual data set is obtained, and the model is trained using the approximate data set.
  • the "similarity" between the approximate data set and the actual data set can be understood as the approximate data set and the actual data set have similar data distribution.
  • the criterion can be that the loss obtained by training on the approximate data set is similar to the loss obtained by training on the actual data set. Therefore, using approximate data sets to train models can improve the performance of the model on real data.
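  • as a hedged formalization of this criterion (not spelled out in the disclosure itself): with D_c denoting the approximate data set, S_p the actual data set, f_θ the model, ℓ a training loss, and ε a small tolerance, the similarity requirement can be sketched as

    $$\Big|\ \mathbb{E}_{x \sim D_c}\,\ell\big(f_\theta(x)\big) \;-\; \mathbb{E}_{x \sim S_p}\,\ell\big(f_\theta(x)\big)\ \Big| \;\le\; \varepsilon$$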
  • the method 100 for training a model may further include obtaining a pre-trained model (step 130).
  • training the model using the approximate data set (step 120) includes adjusting the pre-trained model using the approximate data set.
  • data can be collected from the open world, and the model can be pre-trained on the collected pre-training data set to obtain a pre-trained model.
  • data from the popular ImageNet can be used for pre-training.
  • a pretrained model can be downloaded from the cloud.
  • the pretrained model can be downloaded from an external storage device.
  • examples of the sub-steps of the step of obtaining an approximate data set (step 110) will be illustrated in detail below in conjunction with the flowchart of FIG. 1B.
  • the method of obtaining an approximate data set shown in FIG. 1B is only an example, and the present disclosure is not limited thereto.
  • Those skilled in the art can also, in combination with the ideas disclosed in the present disclosure, utilize various existing methods of collecting, sampling, selecting, screening, and filtering open world data to obtain an approximate data set that is similar to the actual data set.
  • obtaining an approximate data set may include abstracting the open world data set D to obtain a representative data sample set S_q (step 112).
  • the open world data set D can be collected according to the application scenario to which the actual data belongs.
  • the application scenario is a smart city or a smart unmanned supermarket
  • data samples of a class suitable for the application scenario can be selected to form the open world data set D.
  • this approach enables faster and more accurate matching of sensor data distributions and identification of similar data sets.
  • data samples can be randomly collected from the open world to form the open world data set D.
  • the open world dataset D may have the same or similar data distribution as the pre-training dataset used to obtain the pre-trained model.
  • alternatively, the open world dataset D can have a data distribution different from that of the pre-training dataset.
  • abstracting the open world data set D can refer to various methods of distilling the data in the huge open world data set to obtain representative data samples, including but not limited to unsupervised clustering methods.
  • the feature layers of a neural network, especially the penultimate layer, have inherent feature extraction and data clustering capability and can be used to extract features from and cluster the data.
  • the step 112 of abstracting the open-world dataset includes utilizing a pre-trained model to perform feature extraction and clustering on the data in the open-world dataset.
  • the feature extraction and clustering in step 112 can be performed using the pretrained model obtained in step 130.
  • the present disclosure is not limited thereto.
  • suppose the pretrained neural network is f_θ.
  • by removing the last layer of this neural network and taking the retained penultimate feature layer as the output, the neural network φ can be obtained. φ can be used as a feature extractor to extract hidden-layer features of the open world data set D and perform clustering, obtaining the C cluster centers of a total of C clusters, where each cluster center is regarded as the representative data sample of its cluster; all representative data samples can constitute the above representative data sample set S_q.
  • the present disclosure is not limited thereto, and other feature extractors may also be used to extract features of the open world data set D and perform clustering.
  • since open world data can be considered unbounded, the number C of clusters obtained in step 112 is not necessarily equal to the number of classes corresponding to all open world data.
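  • as an illustration of step 112, the following is a minimal sketch (not the disclosure's own implementation) of abstracting an open world dataset with a truncated pretrained network and k-means clustering; the `.fc` head attribute and the use of KMeans are assumptions for a torchvision-style classifier:

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

def build_feature_extractor(pretrained: nn.Module) -> nn.Module:
    """Remove the last layer of f_theta so the retained penultimate
    feature layer phi becomes the output (assumes a `.fc` head, e.g. ResNet)."""
    pretrained.fc = nn.Identity()
    return pretrained.eval()

@torch.no_grad()
def abstract_open_world(phi: nn.Module, loader, num_clusters: int):
    """Extract hidden-layer features of the open world dataset D and
    cluster them; the C cluster centers form the representative set S_q."""
    feats = torch.cat([phi(x) for x, *_ in loader], dim=0).cpu().numpy()
    kmeans = KMeans(n_clusters=num_clusters, n_init=10).fit(feats)
    return kmeans.cluster_centers_  # S_q, represented by extracted features
```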
  • obtaining the approximate data set may include screening the representative data sample set for representative data samples that characteristically match the actual data (step 114).
  • by downloading the representative data sample set to the local device, comparing it with the actual data, and returning the comparison results, feature-matched representative data samples can be screened without sharing the local actual data, so the risk of privacy leakage caused by uploading local actual data is avoided.
  • the step 114 of screening representative data samples that characteristically match the actual data may include exchanging information with a data processing device that has access to the actual data to determine the representative data samples that characteristically match the actual data, where the exchanged information does not contain the actual data.
  • an example of screening representative data samples (step 114) will be illustrated in detail below in conjunction with the flowchart of FIG. 1C. Those skilled in the art can easily understand that the screening method shown in FIG. 1C is only an example, and the present disclosure is not limited thereto. Those skilled in the art can also utilize various existing screening methods in combination with the ideas disclosed in this disclosure to obtain representative data samples that match the characteristics of the actual data.
  • the step 114 of filtering representative data samples that characteristically match the actual data may include the following sub-steps.
  • in step 1142, the representative data sample set is sent to the data processing device.
  • the sent representative data samples may be represented by extracted features for the purpose of one or more of improving communication efficiency, reducing communication overhead, and reducing local storage burden.
  • the present disclosure is not limited thereto.
  • the sent representative data sample may also be a data sample in the original format.
  • in step 1144, the matching results for each representative data sample are received from the data processing device.
  • the matching result of each representative data sample is based on the statistical similarity in characteristics between the representative data sample and each actual data.
  • the features of the actual data can also be extracted using the pre-trained model. To this end, the neural network φ can also be sent to the data processing device that has access to the actual data.
  • the matching result for each representative data sample may be based on the statistical sum or average of the similarity in features between the representative data sample and all actual data.
  • the similarity between a representative data sample and any actual data can be expressed as a numerical value, where the size of the numerical value reflects the degree of feature matching with the actual data.
  • the similarity between a representative data sample and any actual datum can be represented by a discrete value of 0 or 1, where for any actual datum, the similarity of the representative data sample that best matches the characteristics of that actual datum is 1, and the similarity of the other representative data samples is 0.
  • in this case, the statistical similarity of the representative data sample c can be expressed as follows (Equation 1):

    $$v_c \;=\; \sum_{x \in S_p} \mathbb{1}\!\left[\, c = \arg\max_{c' \in S_q} \operatorname{sim}\big(\varphi(x), \varphi(c')\big) \right]$$

  • where S_p represents the actual data set, and v_c represents the number of votes obtained by the representative data sample c as the best match for the actual data, which is used to indicate its statistical similarity.
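  • a minimal sketch of this voting rule follows; cosine similarity is an assumption standing in for sim(·,·), which the disclosure does not pin down, and the two feature matrices stand in for φ(S_q) and φ(S_p):

```python
import numpy as np

def vote_counts(rep_feats: np.ndarray, actual_feats: np.ndarray) -> np.ndarray:
    """Equation 1: each actual datum casts one vote for the representative
    sample whose features it matches best; returns v_c for every c."""
    reps = rep_feats / np.linalg.norm(rep_feats, axis=1, keepdims=True)
    acts = actual_feats / np.linalg.norm(actual_feats, axis=1, keepdims=True)
    best = (acts @ reps.T).argmax(axis=1)          # best match per actual datum
    return np.bincount(best, minlength=len(reps))  # vote counts v_c
```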
  • the inventor realizes that the statistical similarity in characteristics between each representative data sample and the actual data may also contain part of the private information of the actual data.
  • the matching results of each representative data sample may be obtained by performing differential privacy processing A_ε(v_c) or other privacy protection processing on the corresponding statistical similarity.
  • the present disclosure is not limited thereto.
  • the matching result can be directly represented by the corresponding statistical similarity.
  • in step 1146, representative data samples with relatively poor matching results are filtered out from the set of representative data samples.
  • the matching results of each representative data sample can be sorted, and representative data samples with lower rankings of the matching results are filtered out from the set of representative data samples.
  • representative data samples whose matching result values are smaller than a preset threshold can be filtered out from the set of representative data samples. For example, in the case of obtaining statistical similarity through the above equation 1, representative data samples whose matching results are 0 or are close to 0 due to the addition of random perturbations can be filtered out.
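  • a hedged sketch of this perturb-and-filter step, assuming the differential privacy processing A_ε is instantiated as the Gaussian mechanism (one common choice; the disclosure leaves the concrete mechanism open):

```python
import numpy as np

def dp_matching_results(votes: np.ndarray, sigma: float, rng=None) -> np.ndarray:
    """Noise the vote counts before they leave the device; Gaussian noise
    stands in here for the unspecified differential privacy processing A_eps."""
    rng = rng or np.random.default_rng()
    return votes + rng.normal(0.0, sigma, size=votes.shape)

def screen_representatives(matching: np.ndarray, threshold: float) -> np.ndarray:
    """Keep indices whose (noised) matching result exceeds a preset
    threshold; results at or near zero are filtered out."""
    return np.flatnonzero(matching > threshold)
```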
  • since the representative data samples screened out in step 114 match the actual data in features, the classes corresponding to the screened representative data samples can be considered classes included in the actual data set.
  • however, since the number of classes in the actual data is unknown, the number of screened classes is not necessarily equal to the number of classes in the actual data.
  • alternatively, in some embodiments, the received matching results relate only to representative data samples that have already been screened; in that case, step 1146 does not need to be performed.
  • the approximate data set can also include other open world data that matches the characteristics of the screened representative data samples, so as to provide a large amount of data for model training and thereby ensure the accuracy of the trained model.
  • obtaining the approximate data set may also include supplementarily collecting relevant open world data based on the screened representative data samples to expand the approximate data set (step 116).
  • the relevant open world data is open world data that characteristically matches the screened representative data samples.
  • Pre-trained models can also be used here for clustering and feature extraction.
  • data can be randomly sampled within the class of that representative data sample.
  • data that are closer to the screened representative data samples can be prioritized based on the degree of feature matching.
  • data collection may continue within the open world dataset D.
  • alternatively, supplementary data can be collected from another open world dataset.
  • supplementary collection could be conducted within a larger open-world dataset to ensure data adequacy and richness.
  • the matching results of the screened representative data samples can basically reflect the distribution proportions of the actual data. Specifically, where the value of the matching result is positively correlated with the degree of matching, the higher a representative data sample's matching result, the higher the proportion in the actual data of the class to which that representative data sample belongs is likely to be. Therefore, supplementing the collected data in relation to the matching results is conducive to obtaining an approximate data set whose data distribution is more similar to that of the actual data set.
  • step 116 of expanding the approximate data set includes collecting, in relation to the matching result of each screened representative data sample, data that characteristically matches that representative data sample.
  • here, "in relation to" may include "in proportion to" or "approximately in proportion to"; a sketch of such proportional collection follows this item.
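  • a minimal sketch of step 116 under the proportional reading; the budget split and nearest-first selection are illustrative assumptions rather than the disclosure's prescribed procedure:

```python
import numpy as np

def allocate_budget(matching: np.ndarray, total: int) -> np.ndarray:
    """Split a total collection budget across the screened representatives
    approximately in proportion to their matching results."""
    weights = np.clip(matching, 0.0, None)
    return np.floor(total * weights / weights.sum()).astype(int)

def collect_for_representative(pool_feats: np.ndarray, rep_feat: np.ndarray,
                               budget: int) -> np.ndarray:
    """Pick the `budget` open world samples whose features are closest to
    one screened representative sample (nearest-first preference)."""
    dists = np.linalg.norm(pool_feats - rep_feat, axis=1)
    return np.argsort(dists)[:budget]  # indices into the open world pool
```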
  • through step 116, an approximate data set D_c composed of a large amount of open world data that is close to the actual data set can be obtained.
  • in step 120, the approximate data set D_c is used to train the model.
  • training the model in step 120 may include one or more of retraining the model, adjusting the model, updating the model, and the like.
  • assuming the initial model is the pre-trained model f_θ, where θ are the pre-trained model parameters and the training learning rate is r, the model can be trained with the approximate data set D_c by the following update (Equation 2):

    $$\theta \;\leftarrow\; \theta - r\,\nabla_{\theta}\,\mathcal{L}(B;\theta)$$

  • where B is a batch of samples randomly drawn from D_c; repeating this fine-tuning update several times trains the model.
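  • a minimal runnable sketch of this update, assuming a standard supervised cross-entropy loss and a loader that yields batches B drawn from D_c:

```python
import itertools
import torch

def fine_tune(model: torch.nn.Module, approx_loader, lr: float, steps: int):
    """Repeat the Equation 2 update: theta <- theta - r * grad L(B; theta)."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    model.train()
    for x, y in itertools.islice(itertools.cycle(approx_loader), steps):
        opt.zero_grad()
        loss_fn(model(x), y).backward()  # L(B; theta) on batch B from D_c
        opt.step()
    return model
```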
  • steps 110-120 can be performed repeatedly to adjust the trained model against updated actual data, so as to better adapt to the actual situation and reduce the workload required to retrain the model each time.
  • the method 100 for training a model further includes performing model compression on the trained model (step 140 ).
  • using model compression technology to compress the model before deployment can not only further reduce the model size, making the model more suitable for devices such as edge computing devices, but can also further reduce model privacy leakage.
  • the method 100 for training a model further includes distributing the trained model to relevant data processing devices (step 150).
  • being "relevant" to the actual data set means that the data to be processed by the data processing device using the trained model has the same distribution as the actual data set. For example, the visiting characteristics of customers may differ among different merchandise areas of the same supermarket, so that the customer data collected by image sensors arranged in different areas differ; the data to be processed by an image sensor located in a given area has the same distribution as the actual customer data set collected by the image sensor of that area. As a result, the trained model can have better performance on the data to be processed.
  • the method 100 for training a model further includes obtaining a trained model obtained by performing the above steps.
  • the method of generating a model may include performing the steps of the method for training a model according to the embodiment of the present disclosure to generate the model.
  • FIG. 2 illustrates a flowchart of an example of steps of a data processing method according to an embodiment of the present disclosure.
  • the content described above in conjunction with FIGS. 1A-1C may also be applied to the corresponding features, and the description of some repeated content will be omitted.
  • the data processing method according to the embodiment of the present disclosure may be executed by any device including a processing device, for example, may be executed by a low-power computing device (a local edge device such as an image sensor).
  • the data processing method 200 may mainly include the following steps:
  • in step 210, the actual data set is obtained;
  • in step 220, a representative data sample set of the open world data set is received;
  • in step 230, feature matching is performed between the representative data samples and the actual data; and
  • in step 240, the matching results between the representative data samples and the actual data are returned.
  • the actual data set S_p can be obtained by directly collecting data.
  • real image data can be obtained through image capture.
  • the actual data set S_p may be obtained from an external device in a manner that is secure for private information.
  • the data processing method 200 may also include performing feature extraction on the acquired actual data set to facilitate feature matching.
  • the above-mentioned neural network φ, obtained by removing the last layer of the pre-trained neural network f_θ and retaining its penultimate layer, can be used as a feature extractor.
  • the data processing method 200 may further include receiving the neural network φ.
  • the present disclosure is not limited thereto.
  • the representative data sample set S_q of the open world data set D may be received from a computing device with high computing power (such as a cloud server).
  • the representative data sample set S_q can be obtained by abstracting the open world data set D.
  • the representative data sample set S_q may include the cluster centers of a total of C clusters obtained by performing feature extraction and clustering using a feature extractor such as the neural network φ, but the present disclosure is not limited thereto.
  • the received representative data samples S_q may be data samples represented by extracted features, but the disclosure is not limited thereto.
  • performing feature matching includes calculating a statistical similarity in features between each representative data sample and each actual data.
  • the statistical sum or average of the similarity in features between each representative data sample and all actual data can be calculated as the statistical similarity.
  • a numerical value can be used to represent the similarity between a representative data sample and any actual data, where the size of the numerical value reflects the degree of feature matching with the actual data.
  • a discrete value of 0 or 1 can be used to represent the similarity between a representative data sample and any actual datum, where for any actual datum, the similarity of the representative data sample that best matches the characteristics of that actual datum is set to 1, and the similarity of the other representative data samples is set to 0.
  • the greater the statistical sum or average value (statistical similarity) of a representative data sample's similarities, the more actual data that representative data sample best matches, i.e., the higher the degree of matching.
  • the statistical similarity of representative data samples can be calculated using the number of votes in Equation 1 above. However, those skilled in the art will easily understand that the present disclosure is not limited thereto.
  • performing feature matching further includes performing differential privacy processing on statistical similarities.
  • the matching results of each representative data sample can be obtained by performing differential privacy processing A_ε(v_c) on the corresponding statistical similarity. For example, random Gaussian perturbations can be added to the statistical similarity of each representative data sample, and the corresponding differential privacy risk can be calculated. If the privacy loss is within an acceptable range, the matching result is obtained.
  • the present disclosure is not limited thereto.
  • other privacy protection technologies may also be used for statistical similarity or the statistical similarity may be directly determined as the matching result.
  • the data processing method 200 may further include labeling at least part of the representative data samples (step 250), for example labeling the representative data samples that match the characteristics of the actual data. Accordingly, in some embodiments, the data processing method 200 may further include, after returning the matching results between the representative data samples and the actual data (step 240), receiving the representative data samples that characteristically match the actual data, or information related thereto.
  • labeling a representative data sample may include separately calculating the class similarity in features between the representative data sample and the actual data of each class, and determining, based on a ranking of the class similarities, the class to which the representative data sample belongs.
  • the class similarity in characteristics between a representative data sample and the actual data of a given class can be expressed as the number of actual data of that class that match the representative data sample in features.
  • representative data samples can be labeled using nearest neighbor pseudo-labeling methods.
  • v_k represents the number of actual data of the k-th class (the number of votes) that meet the nearest-neighbor requirements of the representative data sample x, and indicates the class similarity of the sample x.
  • A_ε(v_k) performs noise-based privacy processing on the vote counts, and the corresponding privacy risk ε is calculated.
  • f(x) represents the selected class that receives the largest number of votes, i.e., f(x) = argmax_k A_ε(v_k), the class with which the representative data sample x will be labeled.
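  • a hedged sketch of this nearest-neighbor pseudo-labeling rule; the k-nearest-neighbor vote and the Laplace noise below are illustrative assumptions standing in for the unspecified nearest-neighbor requirement and for A_ε (inputs are NumPy arrays, with integer class labels):

```python
import numpy as np

def pseudo_label(x_feat, actual_feats, actual_labels, num_classes,
                 k_neighbors=5, noise_scale=1.0, rng=None):
    """Count per-class votes v_k among the k nearest local actual data,
    noise them (A_eps), and return f(x) = argmax_k A_eps(v_k)."""
    rng = rng or np.random.default_rng()
    dists = np.linalg.norm(actual_feats - x_feat, axis=1)
    nearest = np.argsort(dists)[:k_neighbors]    # nearest local actual data
    votes = np.bincount(actual_labels[nearest], minlength=num_classes)  # v_k
    noised = votes + rng.laplace(0.0, noise_scale, num_classes)  # A_eps(v_k)
    return int(noised.argmax())  # f(x): the pseudo label assigned to x
```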
  • the above annotation method can use local actual data labels to provide nearest-neighbor pseudo labels for the screened unlabeled open world data, which advantageously reduces the workload of the annotation task and improves the efficiency of model training.
  • the data processing method 200 may further include deploying a trained model obtained by performing the steps of the method for training a model according to an embodiment of the present disclosure to process data (step 260).
  • deploying the trained model helps improve the model's adaptability and accuracy in actual application scenarios.
  • FIG. 3 also illustrates the main functional modules of each device and their information interaction.
  • the system 300 for training a model may include a training device 310 and a data processing device 320.
  • the training device 310 is a computing device with high computing power (such as a cloud server)
  • the data processing device 320 is a computing device with low power consumption (a local edge device such as an image sensor).
  • the training device 310 can send a representative data sample set abstracted from the open world data set to the data processing device 320, generate an approximate data set based on the returned matching results with the actual data, and utilize the approximate data set to train the model, thereby improving the model's performance on real data.
  • the open world can advantageously provide a large number of data samples.
  • the data processing device 320 only needs to perform feature matching processing on the actual data locally, and does not need to share the actual data, nor does it need to perform model training with high computing power requirements.
  • the trained model can be deployed on the data processing device 320. As a result, the process of training the model will not leak the private information of the actual data, while the accuracy and practical application value of the trained model can be ensured.
  • the training device 310 may be configured to perform the steps of the method for training a model according to embodiments of the present disclosure.
  • the content described above in conjunction with FIGS. 1A-1C and FIG. 2 may also be applied to the corresponding features, and the description of some repeated content will be omitted.
  • the training device 310 may include:
  • a training data acquisition module 312 configured to acquire an approximate data set composed of open world data that is similar to the actual data set
  • Training module 314 is configured to train the model using the approximate data set.
  • training data acquisition module 312 may include abstract sub-module 3122.
  • the abstraction sub-module 3122 may be configured to abstract the open world data set to obtain a representative data sample set.
  • the abstraction sub-module 3122 may be configured to utilize a pre-trained model to perform feature extraction and clustering of data in the open world dataset.
  • the training data acquisition module 312 may include a screening sub-module 3124.
  • the screening sub-module 3124 may be configured to screen the set of representative data samples for representative data samples that characteristically match the actual data.
  • the screening sub-module 3124 may be configured to interact with a data processing device that has access to the actual data, and to determine representative data samples that characteristically match the actual data, where the exchanged information does not include the actual data; that is, the processing of the actual data is performed only locally, and the actual data is not uploaded.
  • although the data processing device illustrated in FIG. 3 for information interaction with the screening sub-module 3124 is the data processing device 320 according to an embodiment of the present disclosure, the present disclosure is not limited thereto. Those skilled in the art can easily understand that the data processing device may be any data processing device that collects and/or stores actual data such that the actual data can be accessed securely without leaking private information.
  • the screening sub-module 3124 may be configured to send the representative data sample set to the data processing device and to receive the matching results for each representative data sample from the data processing device.
  • the matching result of each representative data sample may be based on the statistical similarity in characteristics between the representative data sample and each actual datum. The screening sub-module 3124 may then be configured to filter out representative data samples with relatively poor matching results from the set of representative data samples.
  • alternatively, the screening sub-module 3124 may receive matching results only for representative data samples that have already been screened, i.e., relatively poor matching results have already been filtered out; in that case, the screening sub-module 3124 need not perform the filtering operation.
  • the training data acquisition module 312 may also include an expansion sub-module 3126.
  • the expansion sub-module 3126 can be configured to supplementarily collect relevant open world data based on the screened representative data samples and to expand the approximate data set.
  • the expansion sub-module 3126 may be configured to collect, in relation to the matching result of each screened representative data sample, data that characteristically matches that representative data sample.
  • the training device 310 may also include a pre-training module 316.
  • the pre-training module 316 may be configured to obtain a pre-trained model.
  • the training module 314 may be configured to tune the pre-trained model using the approximate data set.
  • the training device 310 may also include a model distribution module 318.
  • Model distribution module 318 is configured to distribute the trained model to devices associated with the actual data set.
  • relevant to the actual data set means that the data that the device is to process with the trained model has the same distribution as the actual data set.
  • the relevant device may be a data processing device that interacts with the training device 310 to provide matching results, but the disclosure is not limited thereto.
  • the training device 310 may also include a model compression module (not shown).
  • the model compression module is configured to compress models using model compression techniques.
  • the data processing device 320 may be configured to perform steps of the data processing method according to embodiments of the present disclosure.
  • the content described above in conjunction with FIGS. 1A-1C and FIG. 2 may also be applied to the corresponding features, and the description of some repeated content will be omitted.
  • the data processing device 320 may include:
  • the data collection module 322 is configured to obtain the actual data set
  • an interaction module 324 configured to receive a representative set of data samples of the open world dataset
  • the feature matching module 326 is configured to perform feature matching between the representative data sample and the actual data
  • the interaction module 324 is also configured to return a matching result between the representative data sample and the actual data.
  • the data acquisition module 322 may also be configured to perform feature extraction on the acquired actual data set.
  • data processing apparatus 320 may also include a deployment module 328.
  • the deployment module 328 may be configured to deploy the trained model obtained by performing the steps of the method for training a model according to the embodiment of the present disclosure, that is, the trained model obtained by the training device 310, to process data.
  • the feature matching module 326 may be configured to calculate a statistical similarity in features between each representative data sample and each actual datum, and to determine the matching result based on the statistical similarity. In some embodiments, the feature matching module 326 may also be configured to perform differential privacy processing on the statistical similarities. The relevant content has been discussed in detail above and will not be repeated here.
  • the data processing device 320 may return the matching results through the interaction module 324. Therefore, the data processing device 320 only needs to perform feature matching processing on the actual data locally, and does not need to share the actual data, nor does it need to perform model training with high computing power requirements.
  • the data processing device 320 may also include an annotation module (not shown).
  • the labeling module may be configured to label at least some of the representative data samples.
  • the annotation module may be configured to respectively calculate the class similarity in features between the representative data sample and the actual data of each type; and based on the size ranking of the class similarity, determine the class to which the representative data belongs. .
  • the annotation module may also be configured to perform differential privacy processing on class similarities before ranking. The relevant content has been discussed in detail above and will not be described again here.
  • although the training device 310 and the data processing device 320 are depicted together in FIG. 3, those skilled in the art can easily understand that the present disclosure is not limited thereto. In fact, in many cases, the training device 310 is located in the cloud while the data processing device 320 is located locally.
  • system 300 may include multiple data processing devices 320.
  • the training device 310 can respectively perform feature matching to determine similar data sets by means of each of these data processing devices 320, and distribute and deploy the models trained on the similar data sets to the corresponding devices.
  • a "corresponding" device means that the device deploying the model is to process data whose distribution is similar to the distribution of similar data sets used to train the model.
  • the system 300 illustrated in Figure 3 can be applied in various scenarios, especially business scenarios of various computer vision classification tasks.
  • FIGS. 4A-4C illustrate examples of learning scenarios for schemes for training models according to embodiments of the present disclosure.
  • FIGS. 4A-4C simplify some functional modules and their interactions, but those skilled in the art can easily understand that the present disclosure is not limited thereto.
  • unlabeled open world data is annotated using local actual data labels, such as providing nearest neighbor pseudo-labels.
  • the model training scheme according to the embodiment of the present disclosure can perform semi-supervised learning to train the model.
  • feature matching can be used to filter out cloud data with too different distributions, reduce the data that needs to be labeled, and reduce the noise of the nearest neighbor pseudo-label algorithm caused by out-of-distribution samples, thereby improving the efficiency of semi-supervised learning.
  • the open world data being processed is labeled.
  • the model training solution of the embodiment can first perform model pre-training on open world data (such as ImageNet) whose distribution differs from that of the actual data, and then perform supervised adjustment training.
  • both the open-world and real-world data processed are unlabeled.
  • the model training solution according to the embodiment of the present disclosure can utilize outsourcing services for manual annotation after screening, thereby performing semi-supervised learning of a small number of labeled samples.
  • the number of public samples that need to be annotated and labor costs can be reduced.
  • Embodiments of the present disclosure also provide a computer-readable storage medium storing one or more instructions. These instructions, when executed by a processor, can cause the processor to perform the steps of the method for training a model or of the data processing method in the above embodiments.
  • Embodiments of the present disclosure also provide a computer program product including one or more instructions, which, when executed by a processor, cause the processor to perform the steps of the method for training a model or the data processing method in the above embodiments.
  • instructions in a computer-readable storage medium may be configured to perform operations corresponding to the system and method embodiments described above.
  • Embodiments of computer-readable storage media will be apparent to those skilled in the art when referring to the above system and method embodiments, and therefore will not be described again.
  • Computer-readable storage media for carrying or including the above instructions are also within the scope of the present disclosure.
  • Such computer-readable storage media may include, but are not limited to, floppy disks, optical disks, magneto-optical disks, memory cards, memory sticks, and the like.
  • Embodiments of the present disclosure also provide various devices including components or units for performing the steps of the method for training a model or the data processing method in the above embodiments.
  • each of the above components or units may be implemented as independent physical entities, or may also be implemented by a single entity (for example, a processor (CPU or DSP, etc.), an integrated circuit, etc.).
  • a plurality of functions included in one unit in the above embodiments may be implemented by separate devices.
  • multiple functions implemented by multiple units in the above embodiments may be implemented by separate devices respectively.
  • one of the above functions may be implemented by multiple units.
  • the above series of processes and devices can also be implemented through software and/or firmware.
  • the program constituting the software is installed from a storage medium or a network to a computer with a dedicated hardware structure, such as the general-purpose computer 500 shown in FIG. 5.
  • when various programs are installed on the computer, the computer is able to execute various functions and so on.
  • FIG. 5 illustrates an example block diagram of a computer that may be implemented as a training device, a data processing device, or a system for training a model according to embodiments of the present disclosure.
  • a central processing unit (CPU) 501 performs various processes according to a program stored in a read-only memory (ROM) 502 or a program loaded from a storage section 508 into a random access memory (RAM) 503 .
  • in the RAM 503, data required when the CPU 501 performs various processes and the like is also stored as necessary.
  • CPU 501, ROM 502 and RAM 503 are connected to each other via bus 504.
  • Input/output interface 505 is also connected to bus 504.
  • the following components are connected to the input/output interface 505: an input part 506, including a keyboard, a mouse, etc.; an output part 507, including a display, such as a cathode ray tube (CRT) or liquid crystal display (LCD), and a speaker, etc.; a storage part 508, including a hard disk, etc.; and a communication part 509, including a network interface card such as a LAN card, a modem, etc.
  • the communication section 509 performs communication processing via a network such as the Internet.
  • Driver 510 is also connected to input/output interface 505 as needed.
  • Removable media 511 such as magnetic disks, optical disks, magneto-optical disks, semiconductor memories, etc. are installed on the drive 510 as needed, so that computer programs read therefrom are installed into the storage section 508 as needed.
  • the program constituting the software is installed from a network such as the Internet or a storage medium such as the removable medium 511.
  • this storage medium is not limited to the removable medium 511 shown in FIG. 5 in which the program is stored and distributed separately from the device to provide the program to the user.
  • the removable media 511 include magnetic disks (including floppy disks (registered trademark)), optical disks (including compact disk read-only memory (CD-ROM) and digital versatile disks (DVD)), magneto-optical disks (including MiniDiscs (MD) (registered trademark)), and semiconductor memories.
  • the storage medium may be a ROM 502, a hard disk contained in the storage section 508, or the like, in which the programs are stored and distributed to the user together with the device containing them.
  • a method for training a model comprising:
  • screening, in the representative data sample set, representative data samples that match the characteristics of the actual data.
  • obtaining the approximate data set further includes:
  • based on the screened representative data samples, supplementarily collecting relevant open world data to expand the approximate data set.
  • screening representative data samples that match the characteristics of actual data includes:
  • exchanging information with a data processing device that has access to the actual data, and determining representative data samples that characteristically match the actual data, wherein the exchanged information does not include the actual data.
  • augmenting the approximate data set includes:
  • collecting, in relation to the matching result of each screened representative data sample, data that characteristically match that representative data sample, where the matching result of each representative data sample is based on the statistical similarity in features between the representative data sample and each actual datum.
  • using the approximate data set to train the model includes using the approximate data set to adjust the pre-trained model.
  • a data processing method including:
  • performing feature matching includes:
  • performing feature matching further includes:
  • labeling representative data further includes:
  • Differential privacy processing is performed on class similarities before ranking.
  • a training device comprising:
  • a training data acquisition module configured to acquire an approximate data set consisting of open world data that is similar to the actual data set
  • the training module is configured to train the model using the approximate data set.
  • a data processing device comprising:
  • a data collection module configured to obtain actual data sets
  • an interaction module configured to receive a representative set of data samples from the open world dataset
  • a feature matching module configured to perform feature matching between representative data samples and actual data
  • the interactive module is further configured to return a matching result between the representative data sample and the actual data.
  • a system for training a model comprising:
  • a computer-readable storage medium having one or more instructions stored thereon that, when executed by a processor, cause the processor to perform the steps of the method according to any one of items 1-7 and/or the steps of the method according to any one of items 8-14.
  • a computer program product comprising one or more instructions which, when executed by a processor, cause the processor to perform the steps of the method according to any one of items 1-7 and/or the steps of the method according to any one of items 8-14.
  • a method of generating a model comprising: performing the steps of the method according to any one of items 1-7 to generate the model.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Bioethics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure relates to a method, apparatus and system for training a model. Various embodiments relating to model training are described. In one embodiment, a method for training a model includes: obtaining an approximate data set composed of open world data that is similar to an actual data set; and training a model using the approximate data set.

Description

Method, Apparatus and System for Training a Model
CROSS-REFERENCE TO RELATED APPLICATIONS
This application is based on, and claims priority to, Chinese application No. 202210554969.8 filed on May 19, 2022, the disclosure of which is incorporated herein by reference in its entirety.
TECHNICAL FIELD
The present disclosure relates generally to privacy protection, and specifically to privacy protection during model training.
BACKGROUND
With the continuous development of deep learning technology, neural network models have been widely deployed on various systems and devices, including edge computing devices, for real-time interaction with users.
However, because neural network models contain many parameters, they usually require a large amount of data and computation to train, and most commercial edge computing devices do not support model training of high computational complexity. To solve this problem, one technical route is to collect the data to a server with high computing power for training. However, data sharing may lead to unpredictable privacy leaks; in particular, the existing privacy laws of many countries do not allow data to leave the local premises. Another technical route is to train the model locally with the actual data, in particular by fine-tuning a model downloaded from the cloud to improve its performance on the actual data. However, model training consumes too many resources and is therefore not suitable for low-power edge computing devices.
There are currently two privacy-protection routes for model training. The first is to use differential privacy techniques; however, such methods have a large impact on performance, demand high computing power, and converge slowly during training, and thus cannot meet practical application needs. The other is to privatize the local data and then share the data with the server for model training. Privacy processing includes data encryption and privacy-attribute decoupling. The former uses traditional numerical encryption methods to encrypt pictures or other data samples, which guarantees data usability but also brings a high computational burden and is not suitable for low-power devices. The latter requires training a feature extraction model with decoupling capability in advance and learning, in a supervised way, to remove the private information contained in the features during training; on the one hand this relies on privacy policies defined in advance, and on the other hand it may also remove information important to the learning task, reducing the quality of the features.
Therefore, there is a need to effectively protect the privacy of the local actual data while guaranteeing the performance of model training.
SUMMARY
One aspect of the present disclosure relates to a method for training a model. According to an embodiment of the present disclosure, the method includes: obtaining an approximate data set composed of open world data that is similar to an actual data set; and training a model using the approximate data set.
One aspect of the present disclosure relates to a data processing method. According to an embodiment of the present disclosure, the data processing method includes: obtaining an actual data set; receiving a representative data sample set of an open world data set; performing feature matching between the representative data samples and the actual data; and returning matching results between the representative data samples and the actual data.
One aspect of the present disclosure relates to a method of generating a model. According to an embodiment of the present disclosure, the method of generating a model includes performing the steps of the method for training a model according to an embodiment of the present disclosure to generate the model.
One aspect of the present disclosure relates to a training device. According to an embodiment of the present disclosure, the device includes a training data acquisition module configured to acquire an approximate data set composed of open world data that is similar to the actual data set; and a training module configured to train a model using the approximate data set.
One aspect of the present disclosure relates to a data processing device. According to an embodiment of the present disclosure, the data processing device includes a data acquisition module configured to obtain an actual data set; an interaction module configured to receive a representative data sample set of an open world data set; and a feature matching module configured to perform feature matching between the representative data samples and the actual data; wherein the interaction module is further configured to return matching results between the representative data samples and the actual data.
One aspect of the present disclosure relates to a system for training a model. According to an embodiment of the present disclosure, the system for training a model includes a training device according to an embodiment of the present disclosure and a data processing device according to an embodiment of the present disclosure.
Yet another aspect of the present disclosure relates to a computer-readable storage medium storing one or more instructions. In some embodiments, the one or more instructions, when executed by a processor, may cause the processor to perform the steps of the methods according to embodiments of the present disclosure.
Yet another aspect of the present disclosure relates to a computer program product including one or more instructions. In some embodiments, the one or more instructions, when executed by a processor, may cause the processor to perform the steps of the methods according to embodiments of the present disclosure.
The above overview is provided to summarize some exemplary embodiments so as to provide a basic understanding of aspects of the subject matter described herein. Accordingly, the above features are merely examples and should not be construed as narrowing the scope or spirit of the subject matter described herein in any way. Other features, aspects and advantages of the subject matter described herein will become apparent from the following detailed description taken in conjunction with the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
A better understanding of the present disclosure can be obtained when the following detailed description of the embodiments is considered in conjunction with the accompanying drawings. The same or similar reference numerals are used throughout the drawings to denote the same or similar components. The drawings, together with the detailed description below, are incorporated into and form a part of the specification, and serve to illustrate embodiments of the present disclosure and to explain the principles and advantages of the present disclosure. In the drawings:
FIG. 1A is a flowchart illustrating an example of steps of a method for training a model according to an embodiment of the present disclosure.
FIG. 1B is a flowchart illustrating an example of sub-steps of the step of obtaining an approximate data set according to an embodiment of the present disclosure.
FIG. 1C is a flowchart illustrating an example of sub-steps of the step of screening representative data samples according to an embodiment of the present disclosure.
FIG. 2 is a flowchart illustrating an example of steps of a data processing method according to an embodiment of the present disclosure.
FIG. 3 is a schematic diagram illustrating an example of the configuration of a system for training a model, and of the training device and the data processing device therein, according to an embodiment of the present disclosure, illustrating the main functional modules making up the devices and their information interaction.
FIGS. 4A-4C are schematic diagrams illustrating multiple learning scenarios of the scheme for training a model according to embodiments of the present disclosure.
FIG. 5 illustrates an example block diagram of a computer that can be implemented as a training device, a data processing device, or a system for training a model according to embodiments of the present disclosure.
Although the embodiments described in the present disclosure may be susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and are described in detail herein. It should be understood, however, that the drawings and the detailed description thereof are not intended to limit the embodiments to the particular forms disclosed; on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope of the claims.
DETAILED DESCRIPTION
The following describes representative applications of various aspects of the devices and methods according to the present disclosure. These examples are described merely to add context and aid in the understanding of the described embodiments. It will therefore be apparent to those skilled in the art that the embodiments described below may be practiced without some or all of the specific details. In other instances, well-known process steps have not been described in detail in order to avoid unnecessarily obscuring the described embodiments. Other applications are also possible, and the solutions of the present disclosure are not limited to these examples.
The inventors have recognized that, compared with a general-purpose model, a model trained with local actual data can be applied more accurately to a specific local application scenario. However, using the actual data to train a model directly on the local device is not suitable for low-power computing devices (local edge devices such as sensors), while collecting the actual data to a computing device with high computing power (such as a cloud server) for training may lead to unpredictable privacy leaks. In addition, a model trained directly with the actual data may leak private information contained in the actual data. For example, in some cases, it can be determined whether target data was used to train the model by observing the model's output on the target data, thereby obtaining membership privacy information about the target data. More specifically, if face image data captured by a camera is used directly for model training, the trained model may be used maliciously, for example to infer from the model output whether someone has ever entered the camera's observation range, such as an unmanned supermarket.
The inventors have also recognized that increasing the amount of training data is conducive to effective learning and obtaining a high-precision model; however, existing privacy-preserving learning methods usually limit data sharing to some extent.
Therefore, it is necessary to obtain a large amount of valuable training data while protecting private information and guaranteeing the computing power required for training.
To this end, the applicant proposes in the present disclosure to sample, from massive and diverse open world data, a large number of data samples that match the actual data, and to train the model on these non-privacy-sensitive data samples. On the one hand, the locally collected actual data is used only for matching data locally and is not used by other modules; on the other hand, the local device also does not need to perform model training with high computing-power requirements. As a result, a model trained on a large amount of open world data will not leak the private information of the actual data, while the practical application value of the model can be guaranteed.
The solution of the present disclosure is applicable to all scenarios that require privacy protection of the actual data used to train the model and require a large amount of data, such as smart cities and smart supermarkets. In some embodiments, the solution of the present disclosure can introduce the timing information of multiple frames of surveillance video so as to be applicable, for example, to privacy protection of user movement trajectories.
The solution of the present disclosure is also applicable to scenarios in which high-energy-consumption learning tasks are transferred from edge devices with low computing power to servers with high computing power.
FIG. 1A illustrates a flowchart of an example of steps of a method for training a model according to an embodiment of the present disclosure. FIGS. 1B-1C illustrate flowcharts of examples of sub-steps of some steps of the method for training a model according to embodiments of the present disclosure. The method for training a model according to embodiments of the present disclosure may be executed by any apparatus including a processing device, for example by a server with high computing power (such as a cloud server).
As shown in FIG. 1A, according to an embodiment of the present disclosure, a method 100 for training a model may mainly include the following steps:
in step 110, obtaining an approximate data set composed of open world data that is similar to the actual data set; and
in step 120, training the model using the approximate data set.
Here, open-world data can be understood as public data resources that can be legally obtained from any channel, including Internet pictures, public data sets, and so on. The inventors have recognized that although massive open world data can provide rich and diverse data samples, training the model directly on open world data is disadvantageous: on the one hand, the scattered data distribution is not conducive to fast convergence of the model; on the other hand, for a given application scenario, a large amount of data whose distribution differs greatly from the actual data will affect the accuracy of the model.
Therefore, in embodiments according to the present disclosure, an approximate data set composed of open world data and similar to the actual data set is obtained, and the model is trained using the approximate data set.
Here, the approximate data set being "similar" to the actual data set can be understood as the approximate data set and the actual data set having similar data distributions. The criterion can be that the loss obtained by training on the approximate data set is close to the loss obtained by training on the actual data set. Therefore, training the model with the approximate data set can improve the model's performance on the actual data.
Optionally, in some embodiments, the method 100 for training a model may further include obtaining a pre-trained model (step 130). Accordingly, training the model using the approximate data set (step 120) includes adjusting the pre-trained model using the approximate data set.
In some embodiments, data can be collected from the open world, and the model can be pre-trained on the collected pre-training data set to obtain the pre-trained model. For example, data from the common ImageNet data set can be used for pre-training. However, the present disclosure is not limited thereto. For example, in some embodiments, the pre-trained model can be downloaded from the cloud; alternatively, in some embodiments, the pre-trained model can be downloaded from an external storage device. Those skilled in the art will readily understand that the manner of obtaining the pre-trained model is not limited thereto and can be selected as needed.
Examples of the sub-steps of the step of obtaining the approximate data set (step 110) will be illustrated in detail below in conjunction with the flowchart of FIG. 1B. Those skilled in the art will readily understand that the method of obtaining an approximate data set shown in FIG. 1B is only an example and the present disclosure is not limited thereto; those skilled in the art can also, in combination with the ideas disclosed in the present disclosure, utilize various existing methods of collecting, sampling, selecting, screening and filtering open world data to obtain an approximate data set similar to the actual data set.
As shown in FIG. 1B, in some embodiments, obtaining the approximate data set may include abstracting the open world data set D to obtain a representative data sample set S_q (step 112).
In some embodiments, the open world data set D can be collected according to the application scenario to which the actual data belongs. For example, where the application scenario is a smart city or a smart unmanned supermarket, data samples of classes suited to that application scenario can be selected to form the open world data set D. Advantageously, this approach enables faster and more accurate matching of the sensor data distribution and identification of the similar data set. Those skilled in the art will readily understand, however, that the present disclosure is not limited thereto; for example, data samples can be randomly collected from the open world to form the open world data set D.
In some examples, the open world data set D can have the same or a similar data distribution as the pre-training data set used to obtain the pre-trained model. Alternatively, the open world data set D can have a data distribution different from that of the pre-training data set.
Here, "abstracting" the open world data set D can refer to various methods of distilling the data in the huge open world data set to obtain representative data samples, including but not limited to unsupervised clustering methods.
The inventors have recognized that the feature layers of a neural network, especially the penultimate layer, have inherent feature extraction and data clustering capability and can be used to extract features from and cluster the data.
Therefore, in some embodiments, the step 112 of abstracting the open world data set includes using the pre-trained model to perform feature extraction and clustering on the data in the open world data set.
For example, the pre-trained model obtained in step 130 can be used to perform the feature extraction and clustering in step 112. However, the present disclosure is not limited thereto.
Suppose the pre-trained neural network is f_θ. By removing the last layer of this neural network and taking the retained penultimate feature layer as the output, a neural network φ can be obtained. φ can be used as a feature extractor to extract hidden-layer features of the open world data set D and perform clustering, obtaining the C cluster centers of a total of C clusters; each cluster center is regarded as the representative data sample of its cluster, and all the representative data samples can constitute the above representative data sample set S_q. However, the present disclosure is not limited thereto, and other feature extractors can also be used to extract features of the open world data set D and perform clustering.
Those skilled in the art will readily understand that the above process is only one example of abstraction; the present disclosure is not limited thereto, and those skilled in the art can also, in combination with the ideas disclosed in the present disclosure, utilize various existing abstraction methods to obtain the representative data sample set S_q.
It is worth noting that, since open world data can be considered unbounded, the number C of clusters obtained in step 112 is not necessarily equal to the number of classes corresponding to all open world data.
As shown in FIG. 1B, in some embodiments, obtaining the approximate data set may include screening, in the representative data sample set, representative data samples that match the actual data in features (step 114).
The inventors have recognized that by downloading the representative data sample set to the local device, comparing it with the actual data, and returning the comparison results, the screening of feature-matched representative data samples can be achieved without sharing the local actual data, thereby advantageously avoiding the privacy leakage risk brought about by uploading the local actual data.
Therefore, in some embodiments, the step 114 of screening representative data samples that match the actual data in features may include exchanging information with a data processing device that has access to the actual data to determine the representative data samples that match the actual data in features, where the exchanged information does not contain the actual data.
An example of screening representative data samples (step 114) will be illustrated in detail below in conjunction with the flowchart of FIG. 1C. Those skilled in the art will readily understand that the screening method shown in FIG. 1C is only an example and the present disclosure is not limited thereto; those skilled in the art can also, in combination with the ideas disclosed in the present disclosure, utilize various existing screening methods to obtain representative data samples that match the actual data in features.
As shown in FIG. 1C, in some embodiments, the step 114 of screening representative data samples that match the actual data in features may include the following sub-steps.
In step 1142, the representative data sample set is sent to the data processing device.
In some embodiments, for one or more of the purposes of improving communication efficiency, reducing communication overhead and reducing the local storage burden, the sent representative data samples can be represented by extracted features. However, the present disclosure is not limited thereto; for example, the sent representative data samples can also be data samples in their original format.
In step 1144, the matching results of the respective representative data samples are received from the data processing device, where the matching result of each representative data sample is based on the statistical similarity in features between that representative data sample and the respective actual data.
In some embodiments, the features of the actual data can also be extracted using the pre-trained model. To this end, the neural network φ can also be sent to the data processing device that has access to the actual data.
例如,每个代表性数据样本的匹配结果可以基于该代表性数据样本与所有实际数据在特征上的相似度的统计和值或平均值。在一些示例中,代表性数据样本与任一实际数据的相似度可以用数值表示,其中数值的大小反映与该实际数据在特征上的匹配程度。例如,代表性数据样本与任一实际数据的相似度可以用离散数值0或1表示,其中对于任一实际数据而言,与该实际数据在特征上最匹配的代表性数据样本的相似度为1,其它代表性数据样本的相似度为0。在这种情况下,代表性数据样本的相似度的统计和值或平均值(统计相似度)越大,则表示该代表性数据样本与更多的实际数据是最匹配的,即匹配程度越高。在该示例下,可如下表示代表性数据样本c的统计相似度:
[Formula 1]

$V_c = \left|\left\{\, x \in S_p \;:\; c = \arg\max_{c' \in S_q} \operatorname{sim}\big(\varphi(x), \varphi(c')\big) \right\}\right|$

where S_p denotes the actual dataset, and V_c denotes the number of best-match votes that the representative data sample c receives from the actual data, which indicates its statistical similarity.
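Purely as an illustration of Formula 1, the sketch below computes the vote counts V_c on the side holding the actual data, assuming cosine similarity as the feature-space similarity sim; the similarity choice and the function name vote_counts are assumptions of this example.

```python
import numpy as np

def vote_counts(rep_feats: np.ndarray, actual_feats: np.ndarray) -> np.ndarray:
    """rep_feats: (C, d) features phi(c) of the representative samples S_q.
    actual_feats: (N, d) features phi(x) of the local actual data S_p.
    Returns the vote count V_c of Formula 1 for every representative c."""
    a = actual_feats / np.linalg.norm(actual_feats, axis=1, keepdims=True)
    r = rep_feats / np.linalg.norm(rep_feats, axis=1, keepdims=True)
    sim = a @ r.T                    # (N, C) cosine similarities
    best = sim.argmax(axis=1)        # best-matching representative per datum
    # Each actual datum casts a single vote for its best match.
    return np.bincount(best, minlength=rep_feats.shape[0])
```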
The inventors have recognized that the statistical similarities, in feature space, between the individual representative data samples and the actual data may themselves contain part of the private information of the actual data.

Therefore, in some embodiments, the matching result of each representative data sample may be obtained by applying differential privacy processing A(v_c), or other privacy-preserving processing, to the corresponding statistical similarity. However, the present disclosure is not limited thereto; for example, the matching result may also be represented directly by the corresponding statistical similarity.

A specific example of determining the matching results will be described in detail later with reference to FIG. 2.

Those skilled in the art will readily appreciate that the definition of the matching result described above is merely an example and the present disclosure is not limited thereto; in light of the ideas disclosed herein, the matching result may also be defined in other ways.

In step 1146, the representative data samples with relatively poor matching results are filtered out of the representative data sample set.

In some embodiments, the matching results of the individual representative data samples may be ranked, and the representative data samples whose matching results rank low may be filtered out of the representative data sample set. Alternatively, in some embodiments, the representative data samples whose matching-result values are smaller than a preset threshold may be filtered out of the representative data sample set. For example, where the statistical similarity is obtained via Formula 1 above, representative data samples whose matching result is 0, or close to 0 owing to added random perturbation, may be filtered out, as in the sketch below.
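A minimal sketch of the threshold variant of this filtering, assuming (possibly noised) vote counts as the matching results; the threshold value 0.5 is an arbitrary assumption chosen so as to drop results that are 0 or near 0 after perturbation.

```python
def filter_representatives(reps, results, threshold: float = 0.5):
    """Keep only representatives whose matching result exceeds the threshold;
    results at or near 0 (possibly after added noise) are filtered out."""
    return [rep for rep, v in zip(reps, results) if v > threshold]
```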
Since the representative data samples screened in step 114 match the actual data in feature space, the classes corresponding to the screened representative data samples can be regarded as classes included in the actual dataset. It is worth noting, however, that since the number of classes of the actual data is unknown, the number of screened classes is not necessarily equal to the number of classes of the actual data.

Alternatively, in some embodiments, the received matching results relate only to representative data samples that have already been screened. In that case, step 1146 need not be performed.

The inventors have recognized that, beyond the screened representative data samples themselves, the approximate dataset may also include other open-world data that matches the screened representative data samples in feature space, so as to provide a large amount of data for model training and thereby ensure the accuracy of the trained model.

Therefore, as shown in FIG. 1B, in some embodiments, obtaining the approximate dataset may further include supplementarily collecting related open-world data based on the screened representative data samples to expand the approximate dataset (step 116).
In some embodiments, the related open-world data is open-world data that matches the screened representative data samples in feature space. The pre-trained model may again be used here for the clustering and feature extraction.

For example, for each screened representative data sample, data may be randomly sampled within the class of that representative data sample. Further, data closer to the screened representative data sample may be preferentially selected based on the degree of feature match.

In some embodiments, the supplementary data may continue to be collected from the open-world dataset D. Alternatively, the supplementary data may be collected from another open-world dataset. For example, the supplementary collection may be performed on a larger open-world dataset to ensure the sufficiency and richness of the data.

The inventors have further recognized that the matching results of the screened representative data samples can broadly reflect the distribution proportions of the actual data. Specifically, where the matching-result value is positively correlated with the degree of match, the higher the matching result of a representative data sample, the higher the proportion, within the actual data, of the class to which that representative data sample belongs is likely to be. Supplementarily collecting data in correlation with the matching results is therefore conducive to obtaining a similar dataset whose data distribution more closely resembles that of the actual dataset.

Therefore, in some embodiments, step 116 of expanding the approximate dataset includes collecting, in correlation with the matching result of each screened representative data sample, data that matches that representative data sample in feature space. Where the matching-result value is positively correlated with the degree of match, "in correlation with" may include "in proportion to" or "approximately in proportion to". Thus, the better a representative data sample matches the actual data, the more open-world data is collected within the class of that representative data sample.
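The following sketch illustrates such proportional expansion, under the assumptions that each screened representative sample carries the list of open-world candidates in its class and a non-negative matching result, and that a total sampling budget is given; all names here are assumptions of the example.

```python
import random

def expand_approximate_dataset(class_members, results, budget: int):
    """class_members: per-representative lists of candidate open-world data.
    results: matching results of the screened representative samples.
    Samples from each class approximately in proportion to its result."""
    total = sum(results)
    approx = []
    for members, v in zip(class_members, results):
        n = round(budget * v / total)          # proportional share
        approx.extend(random.sample(members, min(n, len(members))))
    return approx
```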
Through step 116, an approximate dataset D_c that is composed of a large amount of open-world data and approximates the actual dataset can be obtained.

As shown in FIG. 1A, in step 120, the model is trained using the approximate dataset D_c.

In various embodiments, training the model in step 120 may include one or more of retraining the model, adjusting the model, and updating the model.

Assume the initial model is the pre-trained model f_θ, where θ are the pre-trained model parameters, and let the training learning rate be r. Using the approximate dataset D_c, the model can then be trained as follows:
[Formula 2]

$\theta \leftarrow \theta - r \cdot \nabla_{\theta}\, \frac{1}{|B|} \sum_{(x,y) \in B} \mathcal{L}\big(f_{\theta}(x), y\big)$

where B is a batch of samples drawn at random from D_c, and $\mathcal{L}$ denotes the training loss.
Repeating the above fine-tuning update a number of times accomplishes the training of the model.
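For illustration, a minimal PyTorch sketch of the fine-tuning loop of Formula 2, assuming a classification task with cross-entropy as the loss and plain SGD so that the parameter update matches the formula; the step count and the function name are assumptions of the example.

```python
import torch
import torch.nn.functional as F

def fine_tune(f_theta, loader_Dc, r: float = 1e-3, steps: int = 1000):
    """Repeats the Formula 2 update: theta <- theta - r * gradient of the
    batch-averaged loss over batches B drawn at random from D_c."""
    opt = torch.optim.SGD(f_theta.parameters(), lr=r)
    f_theta.train()
    batches = iter(loader_Dc)
    for _ in range(steps):
        try:
            x, y = next(batches)
        except StopIteration:          # restart the loader when exhausted
            batches = iter(loader_Dc)
            x, y = next(batches)
        loss = F.cross_entropy(f_theta(x), y)   # batch-averaged loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return f_theta
```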
In addition, in some embodiments, steps 110-120 may be performed repeatedly to adjust the trained model for updated actual data, thereby adapting better to actual conditions and reducing the effort required to retrain the model from scratch each time.

As shown in FIG. 1A, optionally, in some embodiments, the method 100 for training a model further includes performing model compression on the trained model (step 140).

Advantageously, compressing the model with model compression techniques before deployment not only further reduces the model size, making the model more suitable for apparatuses such as edge computing devices, but also further reduces model privacy leakage.

In some embodiments, the method 100 for training a model further includes distributing the trained model to the relevant data processing apparatus (step 150).

Here, being "relevant" to the actual dataset means that the data which the data processing apparatus is to process with the trained model has the same distribution as the actual dataset. For example, the visiting patterns of customers may differ between different sections of the same supermarket, so that the customer data collected by image sensors deployed in different sections differ; the data to be processed by the image sensor located in a given section has the same distribution as the actual customer dataset collected by the image sensor of that section. The trained model can thereby perform better on the data to be processed.

In some embodiments, the method 100 for training a model further includes obtaining the trained model resulting from performing the above steps.

According to an embodiment of the present disclosure, a method of producing a model may include performing the steps of the method for training a model according to an embodiment of the present disclosure to produce the model.
FIG. 2 illustrates a flowchart of an example of the steps of a data processing method according to an embodiment of the present disclosure. The content described above in connection with FIGS. 1A-1C also applies to the corresponding features, and some repeated descriptions will be omitted.

The data processing method according to an embodiment of the present disclosure may be performed by any apparatus including a processing device, for example, by a low-power computing device (such as a local edge device like an image sensor).

As shown in FIG. 2, according to an embodiment of the present disclosure, a data processing method 200 may mainly include the following steps:
In step 210, obtaining an actual dataset;

In step 220, receiving a representative data sample set of an open-world dataset;

In step 230, performing feature matching between the representative data samples and the actual data; and

In step 240, returning the matching results of the representative data samples against the actual data.
In some embodiments, in step 210, the actual dataset S_p may be obtained by collecting data directly. For example, in some examples, real image data may be obtained by image capture. Alternatively, in some embodiments, the actual dataset S_p may be obtained from an external apparatus in a manner that is secure with respect to private information.

In some embodiments, the data processing method 200 may further include performing feature extraction on the obtained actual dataset to facilitate the feature matching.

In some embodiments, the network φ described above, obtained by removing the last layer of the pre-trained neural network f_θ and retaining its penultimate layer, may be used as the feature extractor. Accordingly, the data processing method 200 may further include receiving the network φ. Those skilled in the art will readily appreciate, however, that the present disclosure is not limited thereto.

In some embodiments, in step 220, the representative data sample set S_q of the open-world dataset D may be received from a high-compute computing device (such as a cloud server).

As described above, in some embodiments, the representative data sample set S_q may be obtained by performing abstraction processing on the open-world dataset D. For example, the representative data sample set S_q may include the cluster centers of a total of C clusters obtained through feature extraction and clustering using a feature extractor such as the network φ, although the present disclosure is not limited thereto.

As described above, in some embodiments, the received representative data samples S_q may be data samples represented by their extracted features, although the present disclosure is not limited thereto.

In some embodiments, performing the feature matching (step 230) includes computing the statistical similarity, in feature space, between each representative data sample and the individual actual data.
For example, the statistical sum or the average of the feature similarities between each representative data sample and all actual data may be computed as its statistical similarity. As described above, in some examples, the similarity between a representative data sample and any given actual datum may be expressed as a numerical value whose magnitude reflects the degree of feature match with that actual datum. For example, the similarity may take the discrete value 0 or 1, where, for any given actual datum, the representative data sample that best matches that actual datum in feature space is assigned similarity 1 and all other representative data samples are assigned similarity 0. In that case, the larger the statistical sum or average of a representative data sample's similarities (its statistical similarity), the more actual data that representative data sample best matches, i.e., the higher its degree of match. Under this example, the vote count of Formula 1 above may be used to compute the statistical similarity of a representative data sample. Those skilled in the art will readily appreciate, however, that the present disclosure is not limited thereto.

In some embodiments, performing the feature matching further includes applying differential privacy processing to the statistical similarities.

The matching result of each representative data sample may be obtained by applying differential privacy processing A(v_c) to the corresponding statistical similarity. For example, random Gaussian perturbation may be added to the statistical similarity of each representative data sample and the corresponding differential-privacy risk computed; if the privacy loss is within an acceptable range, the matching results are obtained. However, the present disclosure is not limited thereto; for example, other privacy-preserving techniques may be applied to the statistical similarities, or the statistical similarities may be determined directly as the matching results.
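As one concrete illustration of A(v_c), the sketch below applies the classical Gaussian mechanism to the vote counts, under the assumption of unit L2 sensitivity (adding or removing one actual datum changes a single vote by 1); the noise scale sigma and the parameter delta are assumptions of the example.

```python
import numpy as np

def gaussian_mechanism(votes: np.ndarray, sigma: float, delta: float = 1e-5):
    """Adds Gaussian perturbation to the statistical similarities and reports
    the corresponding (epsilon, delta) differential-privacy loss."""
    noised = votes + np.random.normal(0.0, sigma, size=votes.shape)
    # Classical Gaussian-mechanism bound with unit L2 sensitivity:
    # sigma >= sqrt(2 * ln(1.25 / delta)) / epsilon
    epsilon = np.sqrt(2.0 * np.log(1.25 / delta)) / sigma
    return noised, epsilon
```

The caller can then check whether epsilon is within the acceptable range before releasing the noised results as the matching results.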
Those skilled in the art will readily appreciate that the feature matching method described above is merely an example and the present disclosure is not limited thereto; in light of the ideas disclosed herein, various feature matching methods may be used.

Where the processed open-world data is unlabeled while the actual data is labeled, in some embodiments the data processing method 200 may further include labeling at least some of the representative data samples (step 250), for example, labeling the representative data samples that match the actual data in feature space. Accordingly, in some embodiments, the data processing method 200 may further include receiving, after returning the matching results of the representative data samples against the actual data (step 240), the representative data samples that match the actual data in feature space, or information related thereto.

In some embodiments, labeling a representative data sample (step 250) may include separately computing the class similarities, in feature space, between that representative data sample and the actual data of each class; and determining, based on a ranking of the class similarities by magnitude, the class to which that representative data sample belongs.

Here, the class similarity, in feature space, between a representative data sample and the actual data of a given class may be represented by the number of actual data of that class that match the representative data sample in feature space.

For example, a nearest-neighbor pseudo-labeling method may be used to label the representative data samples.
Assume there are K classes of actual data. For a representative data sample x, let $N_m(x)$ denote the set of the m nearest neighbors of x among the actual data, where m is an optional hyperparameter. Labeling can then be performed according to Formula 3 below:
[Formula 3]

$V_k = \big|\{\, x' \in N_m(x) : y(x') = k \,\}\big|, \qquad f(x) = \arg\max_{k} A(V_k)$

where y(x') denotes the class label of the actual datum x', and V_k denotes the number (vote count) of class-k actual data that satisfy the nearest-neighbor requirement of the representative data sample x, representing the class similarity of the sample x. A(v_k) applies noise privatization to the vote counts, with the corresponding privacy risk ε being computed. f(x) denotes the selected class receiving the largest vote count, i.e., the class with which the representative data sample x will be labeled.
The labeling method above can use the labels of the local actual data to provide nearest-neighbor pseudo-labels for the screened, unlabeled open-world data, advantageously reducing the labeling workload and improving the efficiency of model training.
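A sketch of this pseudo-labeling step for a single representative sample, assuming Euclidean distance in the feature space of φ and reusing Gaussian perturbation as one possible choice of A; the parameter defaults and names are assumptions of the example.

```python
import numpy as np

def pseudo_label(x_feat, actual_feats, actual_labels, K: int,
                 m: int = 10, sigma: float = 1.0) -> int:
    """Labels a representative sample x by noised m-nearest-neighbor voting,
    following Formula 3. actual_labels is an integer array in [0, K)."""
    d = np.linalg.norm(actual_feats - x_feat, axis=1)
    neighbors = np.argsort(d)[:m]                    # N_m(x)
    votes = np.bincount(actual_labels[neighbors], minlength=K)   # V_k
    noised = votes + np.random.normal(0.0, sigma, size=K)        # A(V_k)
    return int(np.argmax(noised))                    # f(x)
```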
Optionally, in some embodiments, the data processing method 200 may further include deploying a trained model obtained by performing the steps of the method for training a model according to an embodiment of the present disclosure, in order to process data (step 260). Given that the similar dataset used to obtain this trained model is similar to the actual dataset, and that the data to be processed also has the same data distribution as the actual dataset, deploying the trained model helps improve the model's adaptability and accuracy with respect to the actual application scenario.

It is worth noting that the boundaries between the individual steps of the methods described above are merely illustrative. In actual operation, the steps may be combined arbitrarily, or even merged into a single step. Moreover, the execution order of the steps is not limited by the order of description, and some steps may be omitted. The operational steps of the various embodiments may also be combined with one another in any suitable order, thereby similarly implementing more or fewer operations than described.

A system for training a model, a training apparatus, and a data processing apparatus according to embodiments of the present disclosure are described below by way of example with reference to FIG. 3. For ease of understanding, FIG. 3 also illustrates the main functional modules of the individual apparatuses and the information exchanged between them.

According to an embodiment of the present disclosure, a system 300 for training a model may include a training apparatus 310 and a data processing apparatus 320. In some embodiments, the training apparatus 310 is a high-compute computing device (such as a cloud server), while the data processing apparatus 320 is a low-power computing device (such as a local edge device like an image sensor).

In some embodiments, the training apparatus 310 may send the representative data sample set abstracted from the open-world dataset to the data processing apparatus 320, generate a similar dataset according to the returned results of matching against the actual data, and train the model using that similar dataset, thereby improving the model's performance on the actual data, with the open world advantageously providing a large number of data samples. Correspondingly, the data processing apparatus 320 only needs to perform the feature matching on the actual data locally; it neither needs to share the actual data nor needs to perform computationally demanding model training. The trained model can be deployed on the data processing apparatus 320. As a result, the process of training the model does not leak the private information of the actual data, while the accuracy and practical application value of the trained model are ensured at the same time.
In particular, in various embodiments, the training apparatus 310 may be configured to perform the steps of the method for training a model according to embodiments of the present disclosure. The content described above in connection with FIGS. 1A-1C and FIG. 2 also applies to the corresponding features, and some repeated descriptions will be omitted.

In an embodiment of the present disclosure, as shown in FIG. 3, the training apparatus 310 may include:

a training data obtaining module 312 configured to obtain an approximate dataset that is composed of open-world data and similar to the actual dataset; and

a training module 314 configured to train the model using the approximate dataset.
In some embodiments, the training data obtaining module 312 may include an abstraction sub-module 3122. The abstraction sub-module 3122 may be configured to perform abstraction processing on the open-world dataset to obtain a representative data sample set. For example, the abstraction sub-module 3122 may be configured to perform feature extraction and clustering on the data in the open-world dataset using the pre-trained model.

In some embodiments, the training data obtaining module 312 may include a screening sub-module 3124. The screening sub-module 3124 may be configured to screen, from the representative data sample set, the representative data samples that match the actual data in feature space. For example, the screening sub-module 3124 may be configured to exchange information with a data processing apparatus that has access to the actual data to determine the representative data samples that match the actual data in feature space, wherein the exchanged information does not contain the actual data. That is, the processing of the actual data is performed only locally, and the actual data is not uploaded. It is worth noting that, although the data processing apparatus illustrated in FIG. 3 as exchanging information with the screening sub-module 3124 is the data processing apparatus 320 according to an embodiment of the present disclosure, the present disclosure is not limited thereto. Those skilled in the art will readily appreciate that this data processing apparatus may be any data processing apparatus that collects and/or stores actual data and can therefore access the actual data securely without leaking private information.

In some examples, the screening sub-module 3124 may be configured to send the representative data sample set to the data processing apparatus and to receive the matching result of each representative data sample from the data processing apparatus, wherein the matching result of each representative data sample may be based on the statistical similarity, in feature space, between that representative data sample and the individual actual data. The screening sub-module 3124 may then be configured to filter the representative data samples with relatively poor matching results out of the representative data sample set.

Alternatively, in some embodiments, the screening sub-module 3124 may receive only the matching results of representative data samples that have already been screened, i.e., the relatively poor matching results have already been filtered out. In that case, the screening sub-module 3124 need not perform the filtering operation.

In some embodiments, the training data obtaining module 312 may further include an expansion sub-module 3126. The expansion sub-module 3126 may be configured to supplementarily collect related open-world data based on the screened representative data samples to expand the approximate dataset. For example, the expansion sub-module 3126 may be configured to collect, in correlation with the matching result of each screened representative data sample, data that matches that representative data sample in feature space.
In some embodiments, the training apparatus 310 may further include a pre-training module 316. The pre-training module 316 may be configured to obtain a pre-trained model. Accordingly, the training module 314 may be configured to adjust the pre-trained model using the approximate dataset.

In some embodiments, the training apparatus 310 may further include a model distribution module 318. The model distribution module 318 is configured to distribute the trained model to apparatuses relevant to the actual dataset. Here, being "relevant" to the actual dataset means that the data which the apparatus is to process with the trained model has the same distribution as the actual dataset. For example, the relevant apparatus may be the data processing apparatus that exchanges information with the training apparatus 310 to provide the matching results, although the present disclosure is not limited thereto.

Optionally, in some embodiments, the training apparatus 310 may further include a model compression module (not shown). The model compression module is configured to compress the model using model compression techniques.
In particular, in various embodiments, the data processing apparatus 320 may be configured to perform the steps of the data processing method according to embodiments of the present disclosure. The content described above in connection with FIGS. 1A-1C and FIG. 2 also applies to the corresponding features, and some repeated descriptions will be omitted.

In an embodiment of the present disclosure, as shown in FIG. 3, the data processing apparatus 320 may include:

a data collection module 322 configured to obtain an actual dataset;

an interaction module 324 configured to receive a representative data sample set of an open-world dataset; and

a feature matching module 326 configured to perform feature matching between the representative data samples and the actual data;

wherein the interaction module 324 is further configured to return the matching results of the representative data samples against the actual data.
In some embodiments, the data collection module 322 may be further configured to perform feature extraction on the obtained actual dataset.

In some embodiments, the data processing apparatus 320 may further include a deployment module 328. The deployment module 328 may be configured to deploy a trained model obtained by performing the steps of the method for training a model according to an embodiment of the present disclosure, i.e., the trained model obtained by the training apparatus 310, in order to process data.

In some embodiments, the feature matching module 326 may be configured to compute the statistical similarity, in feature space, between each representative data sample and the individual actual data, and to determine the matching results based on the statistical similarities. In some embodiments, the feature matching module 326 may be further configured to apply differential privacy processing to the statistical similarities. These matters have been discussed in detail above and are not repeated here.

The data processing apparatus 320 may then return the matching results via the interaction module 324. Thus, the data processing apparatus 320 only needs to perform the feature matching on the actual data locally; it neither needs to share the actual data nor needs to perform computationally demanding model training.

In some embodiments, the data processing apparatus 320 may further include a labeling module (not shown). The labeling module may be configured to label at least some of the representative data samples.

In some embodiments, the labeling module may be configured to separately compute the class similarities, in feature space, between a representative data sample and the actual data of each class, and to determine, based on a ranking of the class similarities by magnitude, the class to which that representative data sample belongs. In some embodiments, the labeling module may be further configured to apply differential privacy processing to the class similarities before the ranking. These matters have been discussed in detail above and are not repeated here.
Although the training apparatus 310 and the data processing apparatus 320 are depicted together in FIG. 3, those skilled in the art will readily appreciate that the present disclosure is not limited thereto. In fact, in many cases, the training apparatus 310 is deployed in the cloud while the data processing apparatus 320 is deployed locally.

Furthermore, those skilled in the art will appreciate that, although only one data processing apparatus 320 is illustrated in FIG. 3, the number of data processing apparatuses is not limited thereto. For example, in some embodiments, the system 300 may include a plurality of data processing apparatuses 320. The training apparatus 310 may, with the aid of each of these data processing apparatuses 320, separately perform the feature matching to determine a similar dataset, and distribute and deploy the model trained on that similar dataset to the corresponding apparatus. Here, the "corresponding" apparatus means that the distribution of the data to be processed by the apparatus on which the model is deployed is similar to the distribution of the similar dataset used to train that model.

The system 300 illustrated in FIG. 3 can be applied in a variety of scenarios, particularly business scenarios involving all kinds of computer-vision classification tasks.
FIGS. 4A-4C illustrate examples of learning scenarios for the model training solution according to embodiments of the present disclosure. For ease of understanding, FIGS. 4A-4C simplify some functional modules and their interactions, but those skilled in the art will readily appreciate that the present disclosure is not limited thereto.

In the example shown in FIG. 4A, the labels of the local actual data are used to label the unlabeled open-world data, for example, by providing nearest-neighbor pseudo-labels. In this way, the model training solution according to embodiments of the present disclosure can perform semi-supervised learning to train the model. In this example, cloud data whose distribution differs too greatly can be screened out through feature matching, which reduces the amount of data that needs labeling and reduces the noise that out-of-distribution samples introduce into the nearest-neighbor pseudo-labeling algorithm, thereby improving the efficiency of semi-supervised learning.

In the example shown in FIG. 4B, the processed open-world data is labeled. In this way, the model training solution according to embodiments of the present disclosure can first pre-train the model on open-world data whose distribution differs from that of the actual data (e.g., ImageNet) and then perform supervised adjustment training.

In the example shown in FIG. 4C, both the processed open-world data and the actual data are unlabeled. In this way, the model training solution according to embodiments of the present disclosure can, after the screening, use an outsourced service for manual labeling and thus perform semi-supervised learning with a small number of labeled samples. Advantageously, by screening out only the data that needs labeling, the number of public samples to be labeled and the associated labor cost can be reduced.

Those skilled in the art will appreciate that the application of the solution for training a model is not limited to the above examples.
Embodiments of the present disclosure further provide a computer-readable storage medium storing one or more instructions which, when executed by a processor, cause the processor to perform the steps of the method for training a model or of the data processing method in the above embodiments.

Embodiments of the present disclosure further provide a computer program product including one or more instructions which, when executed by a processor, cause the processor to perform the steps of the method for training a model or of the data processing method in the above embodiments.

It should be understood that the instructions in the computer-readable storage medium according to embodiments of the present disclosure may be configured to perform operations corresponding to the system and method embodiments described above. Embodiments of the computer-readable storage medium will be apparent to those skilled in the art upon reference to the system and method embodiments above and are therefore not described repeatedly. Computer-readable storage media carrying or including the above instructions also fall within the scope of the present disclosure. Such computer-readable storage media may include, but are not limited to, floppy disks, optical discs, magneto-optical discs, memory cards, memory sticks, and the like.

Embodiments of the present disclosure further provide various apparatuses including components or units for performing the steps of the method for training a model or of the data processing method in the above embodiments.

It should be noted that the above components or units are merely logical modules divided according to the specific functions they implement and are not intended to limit the specific implementation; they may be implemented, for example, in software, in hardware, or in a combination of software and hardware. In actual implementation, the above components or units may be implemented as independent physical entities, or may be implemented by a single entity (e.g., a processor (CPU, DSP, or the like) or an integrated circuit). For example, multiple functions included in one unit in the above embodiments may be implemented by separate devices. Alternatively, multiple functions implemented by multiple units in the above embodiments may each be implemented by separate devices. In addition, one of the above functions may be implemented by multiple units.
In addition, it should be understood that the series of processes and devices described above may also be implemented by software and/or firmware. Where they are implemented by software and/or firmware, a program constituting the software is installed from a storage medium or a network onto a computer having a dedicated hardware structure, for example, the general-purpose computer 500 shown in FIG. 5, which, with the various programs installed, is capable of performing various functions and the like. FIG. 5 shows an example block diagram of a computer that can be implemented as the training apparatus, the data processing apparatus, and the system according to embodiments of the present disclosure.

In FIG. 5, a central processing unit (CPU) 501 performs various processes according to a program stored in a read-only memory (ROM) 502 or a program loaded from a storage section 508 into a random access memory (RAM) 503. Data required when the CPU 501 performs the various processes and the like is also stored in the RAM 503 as needed.

The CPU 501, the ROM 502, and the RAM 503 are connected to one another via a bus 504. An input/output interface 505 is also connected to the bus 504.

The following components are connected to the input/output interface 505: an input section 506, including a keyboard, a mouse, and the like; an output section 507, including a display such as a cathode ray tube (CRT) or a liquid crystal display (LCD), a speaker, and the like; a storage section 508, including a hard disk and the like; and a communication section 509, including a network interface card such as a LAN card, a modem, and the like. The communication section 509 performs communication processing via a network such as the Internet.

A drive 510 is also connected to the input/output interface 505 as needed. A removable medium 511, such as a magnetic disk, an optical disc, a magneto-optical disc, or a semiconductor memory, is mounted on the drive 510 as needed, so that a computer program read therefrom is installed into the storage section 508 as needed.

Where the series of processes described above is implemented by software, the program constituting the software is installed from a network such as the Internet or from a storage medium such as the removable medium 511.

Those skilled in the art will understand that such a storage medium is not limited to the removable medium 511 shown in FIG. 5, which stores the program and is distributed separately from the device in order to provide the program to the user. Examples of the removable medium 511 include magnetic disks (including floppy disks (registered trademark)), optical discs (including compact disc read-only memories (CD-ROMs) and digital versatile discs (DVDs)), magneto-optical discs (including MiniDiscs (MD) (registered trademark)), and semiconductor memories. Alternatively, the storage medium may be the ROM 502, a hard disk included in the storage section 508, or the like, in which the program is stored and which is distributed to the user together with the device including it.
Exemplary embodiments of the present disclosure have been described above with reference to the drawings, but the present disclosure is of course not limited to the above examples. Those skilled in the art may arrive at various alterations and modifications within the scope of the appended claims, and it should be understood that such alterations and modifications naturally fall within the technical scope of the present disclosure.

Although the present disclosure and its advantages have been described in detail, it should be understood that various changes, substitutions, and alterations can be made without departing from the spirit and scope of the present disclosure as defined by the appended claims. Moreover, the terms "comprise", "include", or any other variant thereof in the embodiments of the present disclosure are intended to cover non-exclusive inclusion, such that a process, method, article, or device that includes a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "comprising a ..." does not preclude the presence of additional identical elements in the process, method, article, or device that includes that element.
Embodiments of the present disclosure further include:
1. A method for training a model, comprising:
obtaining an approximate dataset that is composed of open-world data and similar to an actual dataset; and
training the model using the approximate dataset.
2. The method according to item 1, wherein obtaining the approximate dataset comprises:
performing abstraction processing on an open-world dataset to obtain a representative data sample set; and
screening, from the representative data sample set, representative data samples that match the actual data in feature space.
3. The method according to item 2, wherein obtaining the approximate dataset further comprises:
supplementarily collecting related open-world data based on the screened representative data samples to expand the approximate dataset.
4. The method according to item 2, wherein performing abstraction processing on the open-world dataset comprises:
performing feature extraction and clustering on the data in the open-world dataset using a pre-trained model.
5. The method according to item 2, wherein screening the representative data samples that match the actual data in feature space comprises:
exchanging information with a data processing apparatus that has access to the actual data to determine the representative data samples that match the actual data in feature space, wherein the exchanged information does not contain the actual data.
6. The method according to item 3, wherein expanding the approximate dataset comprises:
collecting, in correlation with the matching result of each screened representative data sample, data that matches that representative data sample in feature space, wherein the matching result of each representative data sample is based on the statistical similarity, in feature space, between that representative data sample and the individual actual data.
7. The method according to item 1, further comprising obtaining a pre-trained model, and
wherein training the model using the approximate dataset comprises adjusting the pre-trained model using the approximate dataset.
8. A data processing method, comprising:
obtaining an actual dataset;
receiving a representative data sample set of an open-world dataset;
performing feature matching between the representative data samples and the actual data; and
returning matching results of the representative data samples against the actual data.
9. The method according to item 8, further comprising: deploying a trained model obtained by performing the method according to any one of items 1-7, to process data.
10. The method according to item 8, wherein performing feature matching comprises:
computing the statistical similarity, in feature space, between each representative data sample and the individual actual data, and determining the matching results based on the statistical similarities.
11. The method according to item 10, wherein performing feature matching further comprises:
applying differential privacy processing to the statistical similarities.
12. The method according to item 8, further comprising:
labeling at least some of the representative data samples.
13. The method according to item 12, wherein labeling the representative data comprises:
separately computing class similarities, in feature space, between the representative data sample and the actual data of each class; and
determining, based on a ranking of the class similarities by magnitude, the class to which the representative data belongs.
14. The method according to item 13, wherein labeling the representative data further comprises:
applying differential privacy processing to the class similarities before the ranking.
15. A training apparatus, comprising:
a training data obtaining module configured to obtain an approximate dataset that is composed of open-world data and similar to an actual dataset; and
a training module configured to train a model using the approximate dataset.
16. A data processing apparatus, comprising:
a data collection module configured to obtain an actual dataset;
an interaction module configured to receive a representative data sample set of an open-world dataset; and
a feature matching module configured to perform feature matching between the representative data samples and the actual data;
wherein the interaction module is further configured to return matching results of the representative data samples against the actual data.
17. A system for training a model, comprising:
the training apparatus according to item 15; and
the data processing apparatus according to item 16.
18. A computer-readable storage medium having stored thereon one or more instructions which, when executed by a processor, cause the processor to perform the steps of the method according to any one of items 1-7 and/or the steps of the method according to any one of items 8-14.
19. A computer program product comprising one or more instructions which, when executed by a processor, cause the processor to perform the steps of the method according to any one of items 1-7 and/or the steps of the method according to any one of items 8-14.
20. A method of producing a model, comprising: performing the steps of the method according to any one of items 1-7 to produce the model.

Claims (20)

  1. A method for training a model, comprising:
    obtaining an approximate dataset that is composed of open-world data and similar to an actual dataset; and
    training the model using the approximate dataset.
  2. The method according to claim 1, wherein obtaining the approximate dataset comprises:
    performing abstraction processing on an open-world dataset to obtain a representative data sample set; and
    screening, from the representative data sample set, representative data samples that match the actual data in feature space.
  3. The method according to claim 2, wherein obtaining the approximate dataset further comprises:
    supplementarily collecting related open-world data based on the screened representative data samples to expand the approximate dataset.
  4. The method according to claim 2, wherein performing abstraction processing on the open-world dataset comprises:
    performing feature extraction and clustering on the data in the open-world dataset using a pre-trained model.
  5. The method according to claim 2, wherein screening the representative data samples that match the actual data in feature space comprises:
    exchanging information with a data processing apparatus that has access to the actual data to determine the representative data samples that match the actual data in feature space, wherein the exchanged information does not contain the actual data.
  6. The method according to claim 3, wherein expanding the approximate dataset comprises:
    collecting, in correlation with the matching result of each screened representative data sample, data that matches that representative data sample in feature space, wherein the matching result of each representative data sample is based on the statistical similarity, in feature space, between that representative data sample and the individual actual data.
  7. The method according to claim 1, further comprising obtaining a pre-trained model, and
    wherein training the model using the approximate dataset comprises adjusting the pre-trained model using the approximate dataset.
  8. A data processing method, comprising:
    obtaining an actual dataset;
    receiving a representative data sample set of an open-world dataset;
    performing feature matching between the representative data samples and the actual data; and
    returning matching results of the representative data samples against the actual data.
  9. The method according to claim 8, further comprising: deploying a trained model obtained by performing the method according to any one of claims 1-7, to process data.
  10. The method according to claim 8, wherein performing feature matching comprises:
    computing the statistical similarity, in feature space, between each representative data sample and the individual actual data, and determining the matching results based on the statistical similarities.
  11. The method according to claim 10, wherein performing feature matching further comprises:
    applying differential privacy processing to the statistical similarities.
  12. The method according to claim 8, further comprising:
    labeling at least some of the representative data samples.
  13. The method according to claim 12, wherein labeling the representative data comprises:
    separately computing class similarities, in feature space, between the representative data sample and the actual data of each class; and
    determining, based on a ranking of the class similarities by magnitude, the class to which the representative data belongs.
  14. The method according to claim 13, wherein labeling the representative data further comprises:
    applying differential privacy processing to the class similarities before the ranking.
  15. A training apparatus, comprising:
    a training data obtaining module configured to obtain an approximate dataset that is composed of open-world data and similar to an actual dataset; and
    a training module configured to train a model using the approximate dataset.
  16. A data processing apparatus, comprising:
    a data collection module configured to obtain an actual dataset;
    an interaction module configured to receive a representative data sample set of an open-world dataset; and
    a feature matching module configured to perform feature matching between the representative data samples and the actual data;
    wherein the interaction module is further configured to return matching results of the representative data samples against the actual data.
  17. A system for training a model, comprising:
    the training apparatus according to claim 15; and
    the data processing apparatus according to claim 16.
  18. A computer-readable storage medium having stored thereon one or more instructions which, when executed by a processor, cause the processor to perform the steps of the method according to any one of claims 1-7 and/or the steps of the method according to any one of claims 8-14.
  19. A computer program product comprising one or more instructions which, when executed by a processor, cause the processor to perform the steps of the method according to any one of claims 1-7 and/or the steps of the method according to any one of claims 8-14.
  20. A method of producing a model, comprising: performing the steps of the method according to any one of claims 1-7 to produce the model.
PCT/CN2023/093818 2022-05-19 2023-05-12 Method, apparatus, and system for training a model WO2023221888A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210554969.8 2022-05-19
CN202210554969.8A CN115115047A (zh) 2022-05-19 Method, apparatus, and system for training a model

Publications (1)

Publication Number Publication Date
WO2023221888A1 true WO2023221888A1 (zh) 2023-11-23

Family

ID=83325647

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/093818 WO2023221888A1 (zh) 2022-05-19 2023-05-12 用于训练模型的方法、装置和系统

Country Status (2)

Country Link
CN (1) CN115115047A (zh)
WO (1) WO2023221888A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115115047A (zh) * 2022-05-19 2022-09-27 索尼集团公司 用于训练模型的方法、装置和系统

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108470187A (zh) * 2018-02-26 2018-08-31 South China University of Technology A classification method for class-imbalance problems based on an expanded training dataset
US20190012581A1 (en) * 2017-07-06 2019-01-10 Nokia Technologies Oy Method and an apparatus for evaluating generative machine learning model
CN112116022A (zh) * 2020-09-27 2020-12-22 China Academy of Space Technology Data generation method and apparatus based on a continuous mixed latent distribution model
US20210295205A1 (en) * 2020-03-19 2021-09-23 International Business Machines Corporation Generating quantitatively assessed synthetic training data
CN115115047A (zh) * 2022-05-19 2022-09-27 Sony Group Corporation Method, apparatus, and system for training a model


Also Published As

Publication number Publication date
CN115115047A (zh) 2022-09-27

Similar Documents

Publication Publication Date Title
CN110431560B (zh) Target person search method and apparatus, device, and medium
WO2020001083A1 (zh) A face recognition method based on feature reuse
WO2011097041A2 (en) Recommending user image to social network groups
CN107209860A (zh) Optimizing multi-class image classification using block features
CN111401344A (zh) Face recognition method and apparatus, and training method and apparatus for a face recognition system
WO2023221888A1 (zh) Method, apparatus, and system for training a model
CN114565807A (zh) Method and apparatus for training a target image retrieval model
CN114821237A (zh) Unsupervised ship re-identification method and system based on multi-level contrastive learning
Yang et al. A multimedia semantic retrieval mobile system based on HCFGs
CN111738341B (zh) Distributed large-scale face clustering method and apparatus
CN114693624A (zh) Image detection method, apparatus, device, and readable storage medium
CN109670423A (zh) Deep-learning-based image recognition system, method, and medium
CN108959664A (zh) Distributed file system based on graphics processors
JP2009157442A (ja) Data retrieval apparatus and method
Yao Key frame extraction method of music and dance video based on multicore learning feature fusion
Kim et al. TVDP: Translational visual data platform for smart cities
CN110489613A (zh) Collaborative visual data recommendation method and apparatus
Baraka et al. Weakly-supervised temporal action localization: a survey
WO2018120575A1 (zh) Method and apparatus for identifying the main image of a web page
WO2023143449A1 (zh) Method, apparatus, and system for privacy protection
US9014420B2 (en) Adaptive action detection
JP2014078100A (ja) Delivery apparatus and computer program
CN116958724A (zh) Training method for a product classification model and related apparatus
CN113395584B (zh) Video data processing method, apparatus, device, and medium
KR102444172B1 (ko) Intelligent mining method and processing system for video big data

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23806840

Country of ref document: EP

Kind code of ref document: A1