CN114692894A - Implementation method of machine learning model supporting dynamic addition and deletion of user data - Google Patents
- Publication number: CN114692894A (application CN202210353332.2A)
- Authority: CN (China)
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06N20/00 — Machine learning
- G06F16/23 — Information retrieval; database structures therefor; updating
- G06F7/36 — Processing data by operating upon its order or content; combined merging and sorting
- G06N5/041 — Inference or reasoning models; abduction
Abstract
The invention discloses a method for implementing a machine learning model that supports dynamic addition and deletion of user data, comprising the following steps: step 1, collecting data at the server side; step 2, partitioning the user data set into a series of mutually independent data subsets, where each piece of user data is stored in exactly one subset; step 3, training a submodel on each data subset obtained by the partition, and performing joint inference over the trained submodels through an integrated model; step 4, responding to a user's data update request to achieve incremental learning of the newly added user data; step 5, responding to a user's data deletion request to achieve forgetting of that user's data. The invention can perform machine learning on any amount of user data while producing accurate model inference results.
Description
Technical Field
The invention relates to a method for implementing model training and updating in an online machine learning service, and in particular to a method for implementing a machine learning model that supports dynamic addition and deletion of user data.
Background
With the rapid development of machine learning, and deep learning in particular, many enterprises have begun to train deep neural network models and offer them as user-facing machine learning services, such as image classification, face recognition, and speech translation, which bring many conveniences to people's lives through online machine learning. At the same time, however, to improve service quality, service providers need to use real data as the training set of their models, thereby improving model usability. Some machine learning services therefore require users to upload part of their private data, on which an accurate and efficient machine learning model is trained to better satisfy the users' model inference requests.
However, this approach also leaves the user's data stored on the server: even if the user stops using the service, the previously uploaded data still lives in the online service model, which increases the risk of leaking the user's private data. To address this, the server must not only respond to users' data update requests but also satisfy their data deletion requests, permanently removing the relevant user data from the server. In a machine learning scenario, the user's data resides not only in the model's training set; the machine learning model also memorizes the data it was trained on, so the model itself contains the user's information, and a machine unlearning (forgetting) operation must be performed.
However, the inventors of the present application found that the above techniques suffer from at least the following technical problem: the model structure of existing online machine learning services does not support machine unlearning of partial data, which can only be achieved by completely retraining the model; as a result, data deletion requires a large amount of computational overhead and a long retraining time.
Therefore, for the online machine learning service scenario, a flexible machine learning model needs to be constructed that supports dynamic addition and deletion of user data while maintaining the model's usability and update efficiency.
Disclosure of Invention
The technical problem the invention aims to solve is to provide, against the defects of the prior art, a method for implementing a machine learning model that supports dynamic addition and deletion of user data.
The application provides a method for implementing a machine learning model supporting dynamic addition and deletion of user data, characterized by comprising the following steps:
step 1, the server collects user data;
step 2, the server partitions the user data set to obtain a series of mutually independent data subsets, where each piece of user data is stored in exactly one subset;
step 3, the server trains a submodel on each data subset obtained by the partition and uses an integrated model to perform joint inference over the trained submodels;
step 4, the server responds to a user's data update request to achieve incremental learning of the newly added user data; the user's data update request comprises the user interacting with the server, uploading the user's private data, and requesting the server to learn the newly uploaded data;
step 5, the server responds to a user's data deletion request to achieve machine forgetting of the existing user data; the user's data deletion request comprises the user interacting with the server, uploading the user's identity, requesting the server to delete all data under that identity, and removing the trained contribution of that data from the machine learning model.
The technical scheme of the invention is further defined as follows: in step 1, the user data comprises a user's original data samples, the data label of each sample, the identity of the user each sample belongs to, and the available time of each sample;
the users' original data samples: each user has several data samples, and a single data sample is denoted x;
the data labels of the samples: each sample has a corresponding data label indicating its specific data category, denoted y;
the identities of the users: data from the same user carries the same identity, denoted u;
the available time of the data: specified by the user, denoted t.
Preferably, in step 2, the user data set consists of multiple pieces of user data; a user data set containing n pieces of data is represented as D_u = {(x_i, y_i, u_i, t_i) | 1 ≤ i ≤ n}.
Preferably, the data partitioning process in step 2 includes splitting the user data set according to the user identity attribute, and calculating an average available time; then sorting the user data, and dividing the sorted data into a series of mutually independent data subsets in sequence; the data subsets contain data from multiple users, with different data subsets being independent of each other.
Preferably, step 2 comprises the following specific steps:
step 201, for the user data set D_u, dividing the data into several single-user data sets according to the identity attribute of the data, where each single-user data set contains exactly the data from one specific user; the data set of the j-th user can be represented as D_j^u = {(x_i, y_i, u_i, t_i) | u_i = u_j};
step 202, for each single-user data set, calculating its average available time T_j as the mean of the available times t_i of its samples;
step 203, sorting the single-user data sets by their average available time with a sorting algorithm, obtaining an ordered group of single-user data sets such that any two adjacent sets satisfy T_j ≤ T_{j+1};
step 204, dividing the sorted user data set into m subsets in the same order, each subset containing the data of an equal share of the users, denoted S_j where 1 ≤ j ≤ m, which completes the division of the user data set.
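The partitioning procedure of steps 201-204 can be sketched in plain Python as follows. This is an illustrative sketch, not the patent's implementation; the function name and the record layout (x, y, user_id, available_time) are assumptions for the example.

```python
from collections import defaultdict

def partition_user_data(records, m):
    """Split (x, y, user_id, available_time) records into m independent
    subsets, grouping users with similar average available time."""
    # Step 201: split into single-user data sets by identity attribute.
    per_user = defaultdict(list)
    for rec in records:
        per_user[rec[2]].append(rec)
    # Step 202: average available time of each single-user data set.
    avg_time = {u: sum(r[3] for r in rs) / len(rs) for u, rs in per_user.items()}
    # Step 203: sort the single-user data sets by average available time.
    ordered_users = sorted(per_user, key=avg_time.get)
    # Step 204: assign consecutive users, in sorted order, to m subsets.
    subsets = [[] for _ in range(m)]
    users_per_subset = -(-len(ordered_users) // m)  # ceiling division
    for i, u in enumerate(ordered_users):
        subsets[i // users_per_subset].extend(per_user[u])
    return subsets
```

Because users are assigned whole, every piece of user data lands in exactly one subset, which is the invariant the later deletion steps rely on.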
Preferably, the model training process of step 3 comprises training of the submodels and training of the integrated model. The training of a submodel uses a machine learning algorithm on a data subset together with an auxiliary data set; running the training algorithm on every data subset yields the corresponding submodels. The training of the integrated model uses a machine learning algorithm over all submodels and all data sets to obtain the integrated model.
Preferably, step 3 specifically comprises the following steps: step 301, initializing a submodel list for storing the parameters of each submodel;
step 302, setting the basic structure and corresponding hyper-parameters of submodel training, including the number of training rounds N, the learning rate lr, and the loss function L of the model;
step 303, for the j-th model, initializing an individual set of model parameters, denoted W_j;
step 304, using the j-th data subset S_j obtained by the division and the auxiliary data set D_o saved by the server as the training set of the current model;
step 305, according to the model's hyper-parameters, training the model for N rounds with the corresponding learning rate lr, updating the model parameters with the loss function in every training round, i.e. W_j ← W_j − lr·∇L;
step 306, repeating steps 303 to 305 for j = 1 to j = m and saving each resulting W_j into the submodel list, obtaining the submodels corresponding to all subsets;
step 307, setting the basic structure and corresponding hyper-parameters of integrated-model training, including the number of training rounds N, the learning rate lr, and the loss function L of the model;
step 308, initializing the parameters of the integrated model, denoted E;
step 309, using the full user data set D_u together with the auxiliary data set D_o as the original training set of the integrated model;
step 310, according to the integrated model's hyper-parameters, training the integrated model for N rounds with the corresponding learning rate lr, updating the model parameters with the loss function in every training round, i.e. E ← E − lr·∇L(y, E(W_1(x), …, W_m(x))), where W_1(x), …, W_m(x) denote the computation results of the submodels on the input data x;
step 311, storing the parameters of the integrated model, which ends the training process of the model;
step 312, using the submodels and the integrated model to perform inference on unknown data; the inference process of the model can be represented as E(W_1(x), …, W_m(x)), that is, all submodels first compute on the input data, all computation results are then fed into the integrated model for aggregation, and the integrated model outputs the final inference result.
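The combined inference of step 312, E(W_1(x), …, W_m(x)), can be sketched as follows. The stand-in submodels and the averaging aggregator are purely illustrative assumptions; the patent trains real submodels and a small fully-connected network as the integrated model.

```python
def ensemble_predict(submodels, integrate, x):
    """Run every submodel on the input, then aggregate the per-submodel
    outputs with the integrated model: E(W_1(x), ..., W_m(x))."""
    sub_outputs = [w(x) for w in submodels]
    return integrate(sub_outputs)

# Illustrative stand-ins: each submodel returns class scores, and the
# aggregator here simply averages them column-wise.
submodels = [lambda x: [0.8, 0.2], lambda x: [0.6, 0.4], lambda x: [0.7, 0.3]]
integrate = lambda outs: [sum(col) / len(outs) for col in zip(*outs)]
scores = ensemble_predict(submodels, integrate, x=None)
```

The key point is that each submodel computes independently, so replacing one submodel (after an update or a deletion) leaves the others untouched.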
Preferably, the model update process in step 4 comprises: the server responds to a user's data update request and receives the data samples uploaded by the user; the server then uses a machine learning algorithm to update the submodel and the integrated model, completing the learning of the newly added data. The user's data update request comprises the user interacting with the server and uploading k pieces of private data, the uploaded data set being D_update = {(x_i, y_i, u_i, t_i) | 1 ≤ i ≤ k}. The specific processing steps are as follows:
step 401, after receiving the data set, the server merges it into the existing user data set, i.e. D_u ← D_u ∪ D_update, and calculates the average available time T_update of the newly added data;
step 402, according to the average available time T_update calculated above, adding the new data to the j-th data subset, i.e. S_j ← S_j ∪ D_update where T_{j−1} < T_update ≤ T_j, so that the average available time of the receiving data subset is as close as possible to that of the newly added data;
step 403, repeating step 305 with the updated data subset S_j to train its submodel so that the submodel learns the newly added data, then saving the parameters of the newly trained submodel;
step 404, repeating step 310 with the updated data set and the updated submodel to train the integrated model so that it learns the new data, then saving the parameters of the newly trained integrated model;
step 405, the data update process ends; the inference process of the model is the same as step 312 above.
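The routing decision in steps 401-402, which picks the subset whose average available time bracket contains T_update, can be sketched as follows. The function name and the convention that later-than-everything data falls into the last subset are assumptions for illustration.

```python
def route_update(subset_avg_times, t_update):
    """Pick the subset index j satisfying T_{j-1} < t_update <= T_j.
    subset_avg_times is assumed sorted in ascending order; data whose
    available time exceeds every subset's goes into the last subset."""
    for j, t_j in enumerate(subset_avg_times):
        if t_update <= t_j:
            return j
    return len(subset_avg_times) - 1
```

Only the chosen subset's submodel (plus the lightweight integrated model) then needs retraining, which is what keeps updates cheap.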
Preferably, the model update process in step 5 comprises: the server responds to a user's data deletion request and receives the identity uploaded by the user; the server then retrains the parameters of the affected submodel and the integrated model with a machine unlearning algorithm, completing the forgetting of the existing data. The specific steps are as follows:
step 501, after receiving the identity, the server screens out from the user data set all user data with that identity, recorded as D_remove = {(x_i, y_i, u_i, t_i) | 1 ≤ i ≤ k, u_i = u}, where u is the identity of the user;
step 502, the server searches all data subsets for the data related to D_remove; because data with the same user identity was placed into the same data subset during the model's partitioning process, the server can determine the unique data subset containing this user's data, recorded as S_j;
step 503, the server removes that data from S_j, i.e. S_j ← S_j \ D_remove, and then discards the original submodel;
step 504, repeating step 305 with the reduced data subset S_j to retrain the submodel so that it no longer contains any information about the removed data, then saving the parameters of the newly trained submodel;
step 505, the server removes D_remove from the full user data set, i.e. D_u ← D_u \ D_remove, and then discards the original integrated model;
step 506, repeating step 310 with the reduced user data set to retrain the integrated model, removing the information related to the user's data from it, then saving the parameters of the integrated model;
step 507, the data deletion process ends; the inference process of the model is the same as step 312 above.
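The data-side bookkeeping of steps 501-503 can be sketched as follows. Because each user's data lives in exactly one subset, locating and removing it identifies the single submodel that must be retrained. Names and the record layout (x, y, user_id, available_time) are illustrative assumptions.

```python
def forget_user(subsets, user_id):
    """Find the unique subset holding the user's data, remove that data
    there (step 503), and report which submodel needs retraining.
    Returns (index of affected subset, removed records)."""
    for j, subset in enumerate(subsets):
        removed = [r for r in subset if r[2] == user_id]
        if removed:
            # Drop the user's records from this subset only; all other
            # subsets and their submodels are untouched.
            subsets[j] = [r for r in subset if r[2] != user_id]
            return j, removed
    return None, []
```

Steps 504-506 then fully retrain that one submodel and the integrated model on the reduced data, so no parameter derived from the deleted data survives.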
The technical scheme provided in the embodiments of the application has at least the following technical effects or advantages: 1) the invention provides, for the online service scenario, an implementation method of a machine learning model that supports dynamic addition and deletion of user data.
2) The invention supports processing both users' data update requests and data deletion requests, and achieves efficient processing and complete data forgetting by adjusting and updating only part of the network parameters in the model.
3) Compared with a scheme that completely retrains the machine learning model, the method greatly reduces the computational overhead required for the model's data updates and data forgetting while keeping the model's accuracy unchanged, and can therefore adapt to frequent data changes and respond to user requests more quickly.
4) The method can be applied to most online machine learning services, especially in fields closely related to user privacy, and has broad application scenarios.
Drawings
Fig. 1 is a schematic diagram of a network structure of a server and multiple users according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating a server data set and a machine learning model structure according to an embodiment of the present invention.
Fig. 3 is a schematic flow chart of the division of a user data set into data subsets according to an embodiment of the present invention.
FIG. 4 is a diagram illustrating a training process of a machine learning model according to an embodiment of the present invention.
Fig. 5 is a schematic diagram of a data updating process of the online machine learning service according to the embodiment of the present invention.
FIG. 6 is a diagram illustrating model structure update of an online machine learning service according to an embodiment of the present invention.
Fig. 7 is a schematic diagram of a data deletion process of the online machine learning service according to an embodiment of the present invention.
FIG. 8 is a diagram illustrating model structure updating of an online machine learning service according to an embodiment of the present invention.
Detailed Description
The invention provides a machine learning model implementation method for an online service scenario that supports continuous data updating and deletion. In this method, the online machine learning service uses a centralized machine learning model: the server collects and collates data and trains a machine learning model for users to use, while the user side uploads part of its data and uses the model to perform inference tasks. As their intentions change, users can dynamically upload new data to the server while the service is running, or require the server to delete data they uploaded earlier.
In order to better understand the technical solution, the technical solution will be described in detail with reference to the drawings and specific examples.
Examples
Fig. 1 is a schematic diagram of the network structure of the user-server model of the present invention. In the online machine learning service, a central node serves as the server; the server side stores the data uploaded by users and trains a machine learning model on it. The remaining user nodes only exchange data with the server. Taking an image classification task as an example, the data interaction mainly comprises three kinds of requests. The first is a user's model inference request: the user uploads some unlabeled image samples to the server and requests the server to run inference with the machine learning model; the server returns the classification inference results to the user and immediately deletes the images the user uploaded. The second is a user's data update request: the user uploads the original image data x, the image category y, the user identity u, and the data available time t together as a four-tuple to the server, and a single request can contain multiple data tuples at once, i.e. D_update = {(x_i, y_i, u_i, t_i) | 1 ≤ i ≤ k}; after receiving the request, the server stores the data into its local image data set and updates the machine learning model, thereby gaining classification inference capability for the newly uploaded images. The third is a user's data deletion request: the user uploads a personal identity u to the server and requests the server to delete all data related to that identity; after receiving the request, the server must delete the image data with that identity from the locally stored data set and update the machine learning model to remove the related information from the model, thereby achieving complete machine forgetting.
FIG. 2 is a diagram of the server's data sets and the machine learning model structure of the present invention. The server-side data mainly comprises an auxiliary data set and a user data set. The auxiliary data set consists of public data and some data owned by the server itself, so it contains only data samples and data labels, and user data update and deletion concerns do not apply to it. The user data set consists of data uploaded by multiple users, each piece of which comprises a data sample, a data label, a user identity, and a data available time. The machine learning model shown in the figure also consists of two parts. One part is a set of submodels that share the same parameter structure but are trained separately on different user data; for example, in fig. 2 there are 6 users divided into 3 data subsets, trained by 3 submodels respectively, and to improve the training effect the auxiliary data set is also used when training the submodels. After submodel training is complete, the invention uses a single integrated model to aggregate the submodels' outputs and obtain the final inference result.
Fig. 3 is a schematic flow chart of partitioning the user data set into data subsets according to the present invention. As an example, the server provides an image classification service on the CIFAR series, where CIFAR100 serves as the auxiliary data set and CIFAR10 serves as the data uploaded by 100 users: each user randomly owns part of the CIFAR10 data, and a user identity and an available time are added to each piece of data. CIFAR10 contains 50000 data samples in total. The data partitioning method used by the server is as follows:
step 201, the server divides the data set into a series of single-user data sets according to the users' identities. In this example, each single-user data set represents the data owned by one user, so the division yields 100 single-user data sets, each containing 500 data samples; the j-th user's data set is D_j^u, where 1 ≤ j ≤ 100;
step 202, for each single-user data set, calculating the average available time of the data in the set. In this example, by counting the available times in the 100 single-user data sets, the average available time of each data set is obtained, denoted T_j, where 1 ≤ j ≤ 100;
step 203, sorting the single-user data sets by their average available time with a sorting algorithm, obtaining an ordered group of single-user data sets such that any two adjacent sets satisfy T_j ≤ T_{j+1};
step 204, dividing the ordered single-user data sets into m subsets in sequence. In this example, the data is divided into 10 subsets in total, each containing the data of 10 users; the j-th data subset can be expressed as S_j, where 1 ≤ j ≤ 10, and each subset holds 5000 data samples. This group of data subsets is the division result, completing the data set division step.
FIG. 4 is a schematic diagram of the training process of the machine learning model of the present invention. In this example, the server side trains a submodel for each of the 10 divided data subsets, and then aggregates the submodels' results through the integrated model to obtain the prediction output of the overall model. The specific implementation steps are as follows:
step 301, the server initializes an empty submodel list for storing the training parameters of each submodel.
Step 302, the server selects a suitable model structure for each data subset and sets the corresponding training hyper-parameters. In this example, the server may use a ResNet network as the submodel structure for training on the CIFAR10 data set; the related hyper-parameters may be set to training rounds N = 30 and learning rate lr = 0.1, and the loss function uses cross entropy, L(y, p) = −Σ_c y_c·log(p_c), where y_c is a one-hot label vector (1 only in the c-th dimension) and p_c is the model's predicted probability for class c;
step 303, initializing a separate submodel W_j for each data subset. In this example, the server randomly initializes a new ResNet model;
step 304, the server combines the j-th data subset S_j and the auxiliary data set D_o as the training data set of the current model. In this example, each data subset contains one tenth of the original data set, i.e. 5000 user data samples, plus the complete auxiliary data set;
step 305, using a machine learning algorithm, training the model for N rounds with the model's hyper-parameters and the corresponding learning rate lr, updating the model parameters with the loss function in every round, i.e. W_j ← W_j − lr·∇L. In this embodiment, the server trains the initialized ResNet model for 30 rounds, and after training finishes the submodel's parameters are stored into the list;
step 306, repeating steps 303 to 305 with the different data subsets for j = 1 to j = m, completing the submodel training process. In this example, after training finishes the server obtains 10 different ResNet models; the models are mutually independent, have learned different users' information, and can each perform the data inference process on their own;
step 307, the server selects a suitable integrated model and sets the related hyper-parameters. In this example, the server may use a fully-connected network as the structure of the integrated model; since a fully-connected network contains few parameters, the model's training time is relatively short. Here the number of training rounds is N = 10, the learning rate is lr = 0.01, and the loss function is again cross entropy;
step 308, the server randomly initializes the parameters E of the integrated model;
step 309, the server trains the integrated model on the full user data set. In this example, the server trains the integrated model for 10 rounds with a learning rate of 0.01, all 50000 pieces of user data participate in each round, and the model's update process is E ← E − lr·∇L(y, E(W_1(x), …, W_m(x))), where W_1(x), …, W_m(x) denote the submodels' computation results on the input data x;
step 310, storing the parameters of the integrated model after training, which ends the model's training process.
Step 311, using the above submodels and the integrated model, inference on unknown data can be performed. The model's inference process can be represented as E(W_1(x), …, W_m(x)): all submodels first compute on the input data, all computation results are then fed into the integrated model for aggregation, and the integrated model outputs the final inference result.
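The cross-entropy loss used in the submodel and integrated-model training above can be written out directly. This is a minimal pure-Python sketch of the formula L(y, p) = −Σ_c y_c·log(p_c); real training would use a framework's implementation, and the small epsilon guarding log(0) is an assumption of this example.

```python
import math

def cross_entropy(y, p, eps=1e-12):
    """L(y, p) = -sum_c y_c * log(p_c), where y is a one-hot label
    vector and p the predicted probability distribution."""
    return -sum(y_c * math.log(p_c + eps) for y_c, p_c in zip(y, p))
```

For a one-hot label the sum collapses to −log of the probability assigned to the true class, so a confident correct prediction gives a loss near zero and a diffuse prediction gives a larger loss.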
Fig. 5 and fig. 6 are schematic diagrams of the data update process and the model structure update of the online machine learning service of the present invention. A user data update request comprises the user interacting with the server and uploading k pieces of the user's own private data; the uploaded data set is D_update = {(x_i, y_i, u_i, t_i) | 1 ≤ i ≤ k}. After the server receives the request, the specific processing procedure is as follows:
step 401, the server first saves this data into the local user data set, i.e. D_u ← D_u ∪ D_update, and calculates the average available time T_update of the new data. Taking fig. 6 as an example, building on fig. 2, the server here receives the update data of user 7;
step 403, among all the data subsets, the server finds the data subset whose average available time is closest to that of the new data and adds the new data to it, i.e. S_j ← S_j ∪ D_update with T_{j−1} < T_update ≤ T_j. As shown in fig. 6, the server finds that subset 3 is closest to the new data's available time and adds the new data to subset 3;
step 404, the server updates the submodel corresponding to that data subset. As shown in fig. 6, since the data of subset 3 was updated, the server updates the submodel corresponding to subset 3. For this update the server can proceed in an incremental-learning manner, continuing the machine learning training process on the previously trained submodel's parameters with the updated data subset, so that the submodel learns the distribution of the update data. After training finishes, the server saves the trained submodel's parameters and replaces the original submodel;
step 405, the server updates the final integrated model. Since the submodel's parameters were updated and its output accordingly changes, the server needs to further update the integrated model's parameters; here the updated data set and the updated submodel can be used to train the integrated model so that it learns the new data, after which the newly trained integrated model's parameters are saved and replace the original integrated model;
step 406, the data update process ends. As shown in FIG. 6, the parts of the model that need to change comprise one submodel and the integrated model, so compared with updating the whole model, the data update cost required by the machine learning scheme of the present invention is only about 1/m of the cost of updating all models, where m represents the number of submodels.
Fig. 7 and 8 are schematic diagrams of a data deletion process and model structure update of the online machine learning service according to the present invention. The data deleting request mainly relates to the interaction between a user and a server, the user uploads the own identity mark u, the server is requested to delete all data under the identity mark, and training parameters of the part of data are removed in a machine learning model. After the server receives the request, the specific processing procedure is as follows:
step 501, after receiving the identity identifier, the server screens out from the user data set all user data with that identity identifier, recorded as D_remove = {(x_i, y_i, u_i, t_i) | 1 ≤ i ≤ k, u_i = u}. As in fig. 8, user 4 requests deletion of its data, so the server searches out all data related to user 4;
step 502, the server searches for the data subset in which this data resides; since data with the same user identity identifier were assigned to the same data subset during the partitioning of the model, the server can determine the unique data subset containing the user's data, denoted D_j. In the figure, subset 2 contains the relevant data;
step 503, the server removes this part of the data from D_j, i.e. D_j ← D_j \ D_remove, and likewise removes it from the overall user data set; the original sub-model and integrated model are then discarded. As in fig. 8, the relevant data of user 4 are removed from subset 2 and from the overall user data set, so that the training data of the model no longer contain the related data;
step 504, the server retrains the sub-model using the reduced data subset D_j, then saves and updates the parameters of the sub-model. For a data deletion request, the server completely retrains the relevant model to ensure that the retrained model contains no information related to the deleted data, thereby realizing complete machine forgetting;
in step 505, the server retrains the integrated model using the reduced user data set, then saves and updates the parameters of the integrated model;
step 506, the data deletion process ends. As shown in fig. 8, the models that need retraining comprise one sub-model and the integrated model, so compared with retraining all models, the data deletion cost required by the machine learning scheme of the present invention is approximately 1/m of the cost of retraining all models, where m represents the number of sub-models.
Furthermore, because the data are sorted by average available time during partitioning, user data that are more likely to be forgotten are grouped together, and deletions occurring at adjacent times can be executed together; this reduces the number of model retrainings required and further lowers the retraining overhead of the model.
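The saving from time-sorted grouping can be illustrated with a small sketch. The boundaries, request times, and the mapping function are all invented for the illustration; the point is only that requests with adjacent available times hit the same subset and thus share one retraining.

```python
# Illustrative sketch: deletion requests with adjacent available times
# land in the same data subset, so each affected sub-model is retrained
# once per batch rather than once per request. Values are made up.

def shard_of(user_time, boundaries):
    """Map a user's average available time to a subset index."""
    for j, upper in enumerate(boundaries):
        if user_time <= upper:
            return j
    return len(boundaries)  # later than every boundary -> last subset

boundaries = [10, 20, 30]       # subset upper bounds on available time
requests = [8, 9, 7, 25, 26]    # deletion requests, by available time

# The five requests touch only two subsets -> two retrainings, not five.
affected = {shard_of(t, boundaries) for t in requests}
print(len(requests), len(affected))  # prints "5 2"
```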
The foregoing is only a preferred embodiment of this invention; it should be noted that those skilled in the art can make modifications without departing from the principle of the invention, and such modifications should also be considered to fall within the protection scope of the invention.
Claims (9)
1. A method for realizing a machine learning model supporting dynamic addition and deletion of user data is characterized by comprising the following steps:
step 1, collecting data at the server side; the collected data comprise an auxiliary data set, consisting of related public data sets and data collected in advance by the server, and the user data uploaded by users;
step 2, the server divides the user data set to obtain a series of mutually independent data subsets, and each piece of user data is only stored in one data subset;
step 3, the server trains the data subsets obtained by division through the submodels, and the trained submodels are subjected to combined reasoning through the integrated model;
step 4, the server responds to the data updating request of the user to realize incremental learning of the data of the newly added user; the data updating request of the user comprises the steps that the user interacts with the server, the private data of the user is uploaded, and the server is requested to learn the newly uploaded data;
step 5, the server responds to the data deletion request of the user, and machine forgetting of the existing user data is realized; the data deleting request of the user comprises the steps that the user interacts with the server, the identity of the user is uploaded, the server is requested to delete all data under the identity, and training parameters of the part of data are removed from the machine learning model.
2. The method of claim 1 for implementing a machine learning model that supports dynamic addition and deletion of user data, comprising: in step 1, the user data comprises an original data sample of a user, a data tag of the sample, a user identity of the sample, and available time of the sample;
the original data samples of the users: each user has several data samples, and a single data sample is denoted x;
the data labels of the samples: each sample has a corresponding data label indicating its specific data category, denoted y;
the identity identifiers of the users: data from the same user carry the same identity identifier, denoted u;
the available time of the data: specified by the user, denoted t.
3. The method of claim 1 for implementing a machine learning model that supports dynamic addition and deletion of user data, comprising: in step 2, the user data set is composed of multiple pieces of user data; a user data set containing n pieces of data is represented as D_u = {(x_i, y_i, u_i, t_i) | 1 ≤ i ≤ n}.
4. The method of claim 3 for implementing a machine learning model that supports dynamic addition and deletion of user data, comprising: the data partitioning process of the step 2 comprises splitting a user data set according to the identity attribute of the user, and calculating the average available time; then sorting the user data, and dividing the sorted data into a series of mutually independent data subsets in sequence; the data subsets contain data from multiple users, with different data subsets being independent of each other.
5. The method of claim 4 for implementing a machine learning model that supports dynamic addition and deletion of user data, comprising: the step 2 specifically comprises the following steps:
step 201, for the user data set D_u, the data are split by their identity attribute into a number of single-user data sets, where each single-user data set uniquely contains the data from one specific user; the data set of the jth user can be represented as D^(j) = {(x_i, y_i, u_i, t_i) ∈ D_u | u_i = u_j};
Step 202, the average available time of each single-user data set is calculated, the jth being T^(j), the mean of the available times t_i of that user's data;
Step 203, the single-user data sets are sorted by their average available time with a sorting algorithm, giving an ordered group of single-user data sets in which any two adjacent sets satisfy T^(j) ≤ T^(j+1);
Step 204, the sorted single-user data sets are divided in sequence into m mutually independent data subsets D_1, …, D_m.
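The partitioning steps above can be sketched as follows. The record layout (x, y, u, t), the chunking rule, and all names are assumptions made for the illustration, not the claimed algorithm verbatim.

```python
# Minimal sketch of the partitioning of step 2: group by user identity,
# compute each user's average available time, sort, and deal the sorted
# single-user sets into m independent subsets. Toy records (x, y, u, t).

def partition(records, m):
    # Split into single-user data sets keyed by identity u.
    per_user = {}
    for rec in records:
        per_user.setdefault(rec[2], []).append(rec)
    # Average available time t of each single-user set.
    avg_time = {u: sum(r[3] for r in recs) / len(recs)
                for u, recs in per_user.items()}
    # Sort the single-user sets by average available time.
    ordered = sorted(per_user, key=avg_time.get)
    # Divide the sorted sets in sequence into m subsets, keeping all of
    # one user's data inside a single subset.
    subsets = [[] for _ in range(m)]
    per_subset = -(-len(ordered) // m)  # ceiling division
    for i, u in enumerate(ordered):
        subsets[i // per_subset].extend(per_user[u])
    return subsets

records = [(0, 0, 'u1', 30), (1, 0, 'u1', 10),  # u1: avg t = 20
           (2, 1, 'u2', 5),                     # u2: avg t = 5
           (3, 1, 'u3', 40)]                    # u3: avg t = 40
subsets = partition(records, 2)
```

Because the split is by whole users, every piece of a user's data ends up in exactly one subset, which is what makes single-subset deletion possible later.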
6. The method of claim 1 for implementing a machine learning model that supports dynamic addition and deletion of user data, comprising: the model training process of the step 3 comprises the training of the submodel and the training of the integrated model, wherein the training process of the submodel comprises the steps of training a data subset and an auxiliary data set by using a machine learning algorithm, and executing the training algorithm on each data subset to obtain a corresponding submodel; the training process of the integrated model comprises the step of using a machine learning algorithm, using all sub-models and all data sets, and training to obtain the integrated model.
7. The method of claim 6 for implementing a machine learning model that supports dynamic addition and deletion of user data, comprising: the step 3 specifically comprises the following steps: step 301, initializing a list of submodels for storing parameters of each submodel;
step 302, setting a basic structure of sub-model training and corresponding hyper-parameters, including a model training turn N, a learning rate lr of a model and a loss function L of the model;
step 303, for the jth model, a single model's parameters are initialized, denoted W_j;
Step 304, the jth data subset D_j obtained by the division and the auxiliary data set D_o saved by the server are used as the training set for the current model;
step 305, according to the hyper-parameters of the model, the model is trained for N rounds with the corresponding learning rate lr, and in each training round the model's parameters are updated with the loss function, i.e. W_j ← W_j − lr·∇L(W_j);
Step 306, for j = 1 to m, steps 303 to 305 are repeated and the corresponding W_j is saved into the list of sub-models, obtaining the sub-models corresponding to all subsets;
step 307, setting a basic structure of integrated model training and corresponding hyper-parameters, including a model training turn N, a learning rate lr of a model and a loss function L of the model;
step 308, initializing parameters of the integrated model, and recording the parameters as E;
step 309, the full user data set D_u together with the auxiliary data set D_o is used as the original training set of the integrated model;
step 310, according to the hyper-parameters of the integrated model, the integrated model is trained for N rounds with the corresponding learning rate lr, and in each training round the model's parameters are updated with the loss function, i.e. E ← E − lr·∇L(E(W_1(x), …, W_m(x)), y), where W_1(x), …, W_m(x) represent the calculation results of the sub-models on the input data x;
step 311, saving the parameters of the integrated model, and ending the training process of the model;
step 312, the sub-models and the integrated model are used to realize inference on unknown data; the inference process of the model can be represented as E(W_1(x), …, W_m(x)), that is, all sub-models compute on the input data, all calculation results are fed into the integrated model for aggregation, and the integrated model outputs the final inference result.
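The training loop of steps 303-312 can be rendered as a toy example. Here each sub-model is assumed to be a one-parameter linear model y = w·x trained by the gradient step W_j ← W_j − lr·∇L, and the integrated model is simplified to an equal-weight average of the sub-model outputs; the data, hyper-parameters, and function names are all invented for the sketch.

```python
# Toy rendering of steps 303-312: per-subset gradient training of
# one-parameter models, then joint inference E(W_1(x), ..., W_m(x)),
# with E simplified here to an equal-weight average.

def train_submodel(data, n_rounds=200, lr=0.05):
    w = 0.0                                  # step 303: initialize W_j
    for _ in range(n_rounds):                # step 305: N training rounds
        # Gradient of the mean squared error L = mean((w*x - y)^2).
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad                       # W_j <- W_j - lr * grad
    return w

def integrated(submodel_outputs):
    # Stand-in for E(W_1(x), ..., W_m(x)): equal-weight aggregation.
    return sum(submodel_outputs) / len(submodel_outputs)

subsets = [[(1.0, 2.0), (2.0, 4.0)],         # consistent with y = 2x
           [(1.0, 2.1), (3.0, 5.9)]]         # close to y = 1.98x
W = [train_submodel(d) for d in subsets]     # steps 303-306

x = 5.0
prediction = integrated([w * x for w in W])  # step 312: joint inference
```

The key property for this scheme is that each W_j depends only on its own subset (plus the auxiliary data), so editing one subset never forces retraining of the others.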
8. The method of claim 7 for implementing a machine learning model that supports dynamic addition and deletion of user data, comprising: the model update process in step 4 comprises the server responding to a data update request of a user and receiving the data samples uploaded by the user; the server then updates the parameters of the sub-model and the integrated model with a machine learning algorithm to complete learning of the newly added data; the data update request comprises the user interacting with the server and uploading k pieces of private data, the uploaded data set being D_update = {(x_i, y_i, u_i, t_i) | 1 ≤ i ≤ k}; the specific processing steps are as follows:
step 401, after receiving the data set, the server merges it into the existing user data set, i.e. D_u ← D_u ∪ D_update, and calculates the average available time of the newly added data, T_update = (t_1 + … + t_k)/k;
Step 402, according to the average available time T_update calculated above, the newly added data are placed into the jth data subset, i.e. D_j ← D_j ∪ D_update, where j is chosen such that T_{j-1} < T_update ≤ T_j; that is, the newly added data join the data subset whose average available time is as close as possible to their own;
step 403, step 305 is repeated using the updated data subset D_j to train the sub-model so that it learns the newly added data, and the parameters of the newly trained sub-model are then saved;
step 404, step 310 is repeated, using the updated data set and the updated sub-models to train the integrated model so that it learns the new data, and the parameters of the newly trained integrated model are then saved;
step 405, the data updating process is ended, and the inference process of the model is consistent with the aforementioned step 312.
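The placement rule of step 402 — find the subset j with T_{j-1} < T_update ≤ T_j — can be sketched with a standard binary search. The boundary values and function names are illustrative only; `bisect_left` on the sorted per-subset averages yields exactly the first index whose average is ≥ T_update.

```python
# Sketch of step 402's placement rule over sorted per-subset average
# available times T_1 <= ... <= T_m (0-indexed here). Values invented.
from bisect import bisect_left

T = [10.0, 20.0, 30.0, 40.0]        # sorted per-subset average times

def target_subset(t_update):
    j = bisect_left(T, t_update)    # first j with T[j] >= t_update
    return min(j, len(T) - 1)       # later than all -> last subset

new_times = [12.0, 7.0, 3.0]        # available times of uploaded samples
t_update = sum(new_times) / len(new_times)  # average of the new batch
j = target_subset(t_update)         # subset 0, since t_update <= T[0]
```

Binary search makes this lookup O(log m), so routing an update request is cheap compared with the retraining that follows it.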
9. The method of claim 7 for implementing a machine learning model that supports dynamic addition and deletion of user data, comprising: the model updating process of the step 5 comprises that the server responds to a data deletion request of a user and receives an identity uploaded by the user; then the server retrains parameters of the submodel and the integrated model by using a machine forgetting algorithm to finish forgetting the existing data; the method comprises the following specific steps:
step 501, after receiving the identity identifier, the server screens out from the user data set all user data with that identity identifier, recorded as D_remove = {(x_i, y_i, u_i, t_i) | 1 ≤ i ≤ k, u_i = u}, where u is the identity identifier of the user;
step 502, the server searches all data subsets for the data related to D_remove; since data with the same user identity identifier were assigned to the same data subset during the partitioning of the model, the server can determine the unique data subset containing the user's data, denoted D_j;
Step 503, the server removes this part of the data from D_j, i.e. D_j ← D_j \ D_remove, and then discards the original sub-model;
step 504, step 305 is repeated using the reduced data subset D_j to retrain the sub-model so that it no longer contains any information about the removed data, and the parameters of the newly trained sub-model are then saved;
step 505, the server removes D_remove from the full user data set, i.e. D_u ← D_u \ D_remove, and then discards the original integrated model;
step 506, step 310 is repeated using the reduced user data set to retrain the integrated model, removing the information related to the user's data from the integrated model, and the parameters of the integrated model are then saved;
step 507, ending the data deleting process, wherein the reasoning process of the model is consistent with the step 312.
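The deletion flow of steps 501-507 can be sketched end to end with the same toy mean-valued "models" used earlier; the record layout (x, y, u, t), the user id 'u4', and all function names are invented for the illustration.

```python
# End-to-end sketch of steps 501-507: filter D_remove by identity,
# drop it from its (unique) subset, retrain, and re-aggregate.

def train(values):
    return sum(values) / len(values)  # stand-in for model training

# Records are (x, y, u, t); user 'u4' lives only in subset index 1.
subsets = [[(1, 1.0, 'u1', 5), (2, 2.0, 'u2', 6)],
           [(3, 3.0, 'u4', 20), (4, 4.0, 'u4', 21), (5, 5.0, 'u3', 22)]]

# Step 501: collect D_remove, all records whose identity equals u.
u = 'u4'
D_remove = [r for s in subsets for r in s if r[2] == u]

# Step 502: the unique subset holding that user's data.
j = next(i for i, s in enumerate(subsets) if any(r[2] == u for r in s))

# Step 503/505: remove the data; old sub-model and ensemble are discarded.
subsets[j] = [r for r in subsets[j] if r[2] != u]

# Steps 504/506: retrain from scratch on the reduced data, so no trace
# of the deleted records remains in any parameters. (Rebuilt wholesale
# here for brevity; only subset j actually changed.)
submodels = [train([r[1] for r in s]) for s in subsets]
ensemble = train(submodels)
```

Retraining from scratch, rather than "subtracting" the deleted data from existing parameters, is what gives the complete-forgetting guarantee claimed above.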
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210353332.2A CN114692894A (en) | 2022-04-02 | 2022-04-02 | Implementation method of machine learning model supporting dynamic addition and deletion of user data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114692894A true CN114692894A (en) | 2022-07-01 |
Cited By (2)

Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---|
CN116522007A (en) * | 2023-07-05 | 2023-08-01 | 中国科学技术大学 | Recommendation system model-oriented data forgetting learning method, device and medium |
CN116522007B (en) * | 2023-07-05 | 2023-10-20 | 中国科学技术大学 | Recommendation system model-oriented data forgetting learning method, device and medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||