WO2020035075A1 - Method and system for performing machine learning under data privacy protection

Method and system for performing machine learning under data privacy protection

Info

Publication number
WO2020035075A1
Authority
WO
WIPO (PCT)
Prior art keywords
target
machine learning
learning model
data
source
Prior art date
Application number
PCT/CN2019/101441
Other languages
English (en)
French (fr)
Inventor
郭夏玮
涂威威
姚权铭
陈雨强
戴文渊
杨强
Original Assignee
第四范式(北京)技术有限公司
Priority date
Filing date
Publication date
Priority claimed from CN201811136436.8A external-priority patent/CN110990859B/zh
Priority claimed from CN201910618274.XA external-priority patent/CN110858253A/zh
Application filed by 第四范式(北京)技术有限公司
Priority to EP19849826.3A (EP3839790A4)
Publication of WO2020035075A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 - Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 - Protecting data
    • G06F21/62 - Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 - Protecting access to data via a platform to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 - Protecting personal data, e.g. for financial or medical purposes

Definitions

  • the present invention relates generally to data security technologies in the field of artificial intelligence, and more particularly, to a method and system for performing machine learning under data privacy protection, and to a corresponding method and system for performing prediction using a machine learning model with data privacy protection.
  • a method for performing machine learning under data privacy protection may include: obtaining a target data set including a plurality of target data records; obtaining a plurality of migration items about a source data set, wherein each migration item of the plurality of migration items is used to migrate knowledge of a corresponding part of the source data set to the target data set under source data privacy protection; using each migration item of the plurality of migration items to obtain a first target machine learning model corresponding to that migration item, so as to obtain a plurality of first target machine learning models; and obtaining a second target machine learning model using the plurality of first target machine learning models, wherein, in the process of obtaining the plurality of first target machine learning models and/or the second target machine learning model, all or part of the target data records among the plurality of target data records are used in a target data privacy protection mode.
  • a method for performing prediction using a machine learning model with data privacy protection may include: obtaining a plurality of first target machine learning models and a second target machine learning model as described above; obtaining a prediction data record; dividing the prediction data record into a plurality of sub-prediction data; for each sub-prediction data in the prediction data record, performing prediction using the first target machine learning model corresponding to it, to obtain a prediction result for each sub-prediction data; and inputting the plurality of prediction results corresponding to the prediction data record, obtained by the plurality of first target machine learning models, into the second target machine learning model to obtain a prediction result for the prediction data record.
  • a computer-readable storage medium storing instructions, wherein, when the instructions are executed by at least one computing device, the at least one computing device may be caused to execute the method for performing machine learning under data privacy protection as described above and/or the method for performing prediction using a machine learning model with data privacy protection as described above.
  • a system including at least one computing device and at least one storage device storing instructions, wherein the instructions, when executed by the at least one computing device, cause the at least one computing device to execute the method for performing machine learning under data privacy protection as described above and/or the method for performing prediction using a machine learning model with data privacy protection as described above.
  • a system for performing machine learning under data privacy protection may include: a target data set acquisition device configured to acquire a target data set including a plurality of target data records; a migration item acquisition device configured to obtain a plurality of migration items about a source data set, wherein each migration item among the plurality of migration items is used to transfer knowledge of a corresponding part of the source data set to the target data set under source data privacy protection; a first target machine learning model obtaining device configured to use each migration item of the plurality of migration items to obtain a first target machine learning model corresponding to that migration item, so as to obtain a plurality of first target machine learning models; and a second target machine learning model obtaining device configured to obtain a second target machine learning model using the plurality of first target machine learning models, wherein, in the process in which the first target machine learning model obtaining device obtains the plurality of first target machine learning models and/or the second target machine learning model obtaining device obtains the second target machine learning model, all or part of the target data records among the plurality of target data records are used in a target data privacy protection mode.
  • a system for performing prediction using a machine learning model with data privacy protection may include: a target machine learning model acquisition device configured to acquire a plurality of first target machine learning models and a second target machine learning model as described above; a prediction data record acquisition device configured to obtain a prediction data record; a dividing device configured to divide the prediction data record into a plurality of sub-prediction data; and a prediction device configured to perform prediction for each sub-prediction data in the prediction data record using the first target machine learning model corresponding to it, to obtain a prediction result for each sub-prediction data, and to input the plurality of prediction results corresponding to the prediction data record, obtained by the plurality of first target machine learning models, into the second target machine learning model to obtain a prediction result for the prediction data record.
  • a method for performing machine learning under data privacy protection may include: obtaining a target data set; obtaining a migration item about a source data set, wherein the migration item is used to transfer knowledge of the source data set to the target data set in a source data privacy protection mode, in order to train a target machine learning model on the target data set; and, in a target data privacy protection mode, training the target machine learning model based on the target data set in combination with the migration item.
  • a computer-readable storage medium storing instructions, wherein, when the instructions are executed by at least one computing device, the at least one computing device is caused to perform the method for performing machine learning under data privacy protection as described above.
  • a system including at least one computing device and at least one storage device storing instructions, wherein the instructions, when executed by the at least one computing device, cause the at least one computing device to execute the method for performing machine learning under data privacy protection as described above.
  • a system for performing machine learning under data privacy protection may include: a target data set acquisition device configured to acquire a target data set; a migration item acquisition device configured to obtain a migration item about a source data set, wherein the migration item is used to transfer knowledge of the source data set to the target data set in a source data privacy protection mode, in order to train a target machine learning model on the target data set; and a target machine learning model training device configured to train the target machine learning model based on the target data set in combination with the migration item in a target data privacy protection mode.
  • the method and system for performing machine learning under data privacy protection can not only achieve privacy protection of the source data and the target data, but can also transfer knowledge in the source data set to the target data set, thereby enabling a target machine learning model with better model performance to be trained based on the target data set in combination with the transferred knowledge.
  • the method and system for performing machine learning in a data privacy protection mode can not only ensure that data privacy information is not leaked, but can also, while ensuring the availability of the privacy-protected data, effectively utilize source data from different data sources for machine learning, which makes the machine learning model perform better.
  • FIG. 1 is a block diagram illustrating a system for performing machine learning under data privacy protection according to an exemplary embodiment of the present disclosure
  • FIG. 2 is a flowchart illustrating a method of performing machine learning in a data privacy protection mode according to an exemplary embodiment of the present disclosure
  • FIG. 3 is a schematic diagram illustrating a method for performing machine learning in a data privacy protection mode according to a first exemplary embodiment of the present disclosure
  • FIG. 4 is a schematic diagram illustrating a method for performing machine learning in a data privacy protection mode according to a second exemplary embodiment of the present disclosure
  • FIG. 5 is a schematic diagram illustrating a method for performing machine learning in a data privacy protection mode according to a third exemplary embodiment of the present disclosure
  • FIG. 6 is a schematic diagram illustrating a method for performing machine learning in a data privacy protection mode according to a fourth exemplary embodiment of the present disclosure
  • FIG. 7 is a schematic diagram illustrating a concept of performing machine learning in a data privacy protection manner according to an exemplary embodiment of the present disclosure
  • FIG. 8 is a block diagram illustrating a system for performing machine learning under data privacy protection according to an exemplary embodiment of the present disclosure
  • FIG. 9 is a flowchart illustrating a method for performing machine learning in a data privacy protection mode according to an exemplary embodiment of the present disclosure
  • FIG. 10 is a schematic diagram illustrating a concept of performing machine learning in a data privacy protection manner according to an exemplary embodiment of the present disclosure.
  • FIG. 1 is a block diagram illustrating a system (hereinafter, simply referred to as a “machine learning system”) for performing machine learning under data privacy protection according to an exemplary embodiment of the present disclosure.
  • the machine learning system 100 may include a target data set acquisition device 110, a migration item acquisition device 120, a first target machine learning model acquisition device 130, and a second target machine learning model acquisition device 140.
  • the target data set acquiring device 110 may acquire a target data set including a plurality of target data records.
  • the target data set may be any data set that can be used for machine learning model training, and optionally, the target data set may further include a label of the target data record with respect to the machine learning target (prediction target).
  • the target data record may include multiple data attribute fields (e.g., user ID, age, gender, historical credit record, etc.) that reflect various attributes of an object or event, and the target data record's mark for the machine learning target may be, for example, whether the user is able to repay a loan, whether the user accepts recommended content, etc., but is not limited thereto.
  • the mark of the target data record is not limited to a mark about a single machine learning target, but may include marks about one or more machine learning targets; that is, one target data record is not limited to corresponding to one mark, but may correspond to one or more marks.
  • the target data set may involve various personal privacy information that the user does not expect to be known to others (for example, the user's name, ID number, mobile phone number, total property, loan records, etc.), and may also include other relevant information that does not involve personal privacy.
  • the target data record can come from different data sources (for example, network operators, banking institutions, medical institutions, etc.), and the target data set can be used by specific institutions or organizations with user authorization, but it is often desirable that the personal privacy information involved not become known to other organizations or individuals.
  • “privacy” may refer generally to any attribute involving a single individual.
  • the target data set acquisition device 110 may acquire the target data set from the target data source at one time or in batches, and may acquire the target data set manually, automatically, or semi-automatically.
  • the target data set acquisition device 110 may acquire the target data records in the target data set and/or their marks regarding machine learning targets in real time or offline, and the target data set acquisition device 110 may acquire the target data records and their marks regarding machine learning targets at the same time, or the time at which the marks are obtained may lag behind the time at which the target data records are obtained.
  • the target data set obtaining device 110 may obtain the target data set from the target data source in an encrypted form or directly utilize the target data set that has been stored locally.
  • the machine learning system 100 may further include a device for decrypting the target data, and may further include a data processing device to process the target data into a form suitable for the current machine learning. It should be noted that the present disclosure places no restriction on the types, forms, contents, or acquisition methods of the target data records and their marks in the target data set; any data that can be used for machine learning, obtained by any means, can be used as the target data set mentioned above.
  • the premise of the migration is to ensure that the privacy information involved in the data sets of other data sources (in this disclosure, referred to as "source data sets") is not leaked, that is, to ensure privacy protection of the source data.
  • the migration item acquisition device 120 may acquire a plurality of migration items regarding the source data set.
  • each of the plurality of migration items may be used to migrate knowledge of a corresponding part of the source data set to the target data set under the protection of source data privacy.
  • the corresponding part of the source data set may refer to the part of the source data set corresponding to each migration item; that is, each migration item is only used to transfer, in a source data privacy protection mode, the knowledge of the part of the source data set corresponding to it.
  • the knowledge of the entire source data set is transferred to the target data set through the plurality of migration items.
  • each migration item may be any information related to the knowledge contained in a part of the source data set corresponding to the migration item obtained when the source data is subjected to privacy protection (that is, in a source data privacy protection mode).
  • the disclosure does not limit the specific content and form of each migration item, as long as it can transfer the knowledge of a corresponding part of the source data set to the target data set under the source data privacy protection mode.
  • each migration item may involve samples of the corresponding part of the source data set, features of the corresponding part of the source data set, a model obtained based on the corresponding part of the source data set, an objective function used for model training based on the corresponding part of the source data set, statistics of the corresponding part of the source data set, etc.
  • the corresponding part of the source data set may be a corresponding source data subset obtained by dividing the source data set according to a data attribute field.
  • the source data set may include multiple source data records, and optionally, each source data record may also include a tag about the machine learning target.
  • each source data record may also include multiple data attribute fields (eg, user ID, age, gender, historical credit record, historical loan record, etc.) that reflect various attributes of the object or event.
  • dividing by data attribute field may refer to grouping the multiple data attribute fields included in each source data record of the source data set, so that each data record after division (that is, each sub-data record) includes at least one data attribute field, and the set composed of data records having the same data attribute fields is a corresponding source data subset obtained by dividing the source data set according to the data attribute field. That is, each data record in a corresponding source data subset includes the same data attribute fields, and each such data record may include one or more data attribute fields.
  • the number of data attribute fields included in data records in different source data subsets may be the same or different.
  • each source data record may include the following five data attribute fields: user ID, age, gender, historical credit record, and historical loan record
  • the five data attribute fields may be divided into, for example, three data attribute field groups
  • the first data attribute field group may include two data attribute fields of user ID and age
  • the second data attribute field group may include two data attribute fields of gender and historical credit record
  • the third data attribute field group may include one data attribute field: historical loan record.
  • the corresponding source data subsets obtained by dividing the source data set according to the data attribute field may then be, for example, the first source data subset composed of the data records including the data attribute fields in the first data attribute field group, and so on.
  • the method of dividing the source data set is explained in combination with the example above; however, it is clear to those skilled in the art that neither the number and content of the data attribute fields included in a source data record, nor the specific method of dividing the source data set, is limited to the above example.
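  • As an illustration only (not part of the patent disclosure), the division of data records by data attribute fields described above can be sketched in Python as follows; the field-group definitions and record layout are assumptions chosen to match the example:

    # Sketch: divide a data set "vertically" by data attribute field groups.
    # The three groups mirror the example: (user ID, age), (gender,
    # historical credit record), (historical loan record).
    FIELD_GROUPS = [
        ["user_id", "age"],
        ["gender", "historical_credit_record"],
        ["historical_loan_record"],
    ]

    def divide_by_attribute_fields(records, field_groups=FIELD_GROUPS):
        """Return K data subsets; the k-th subset keeps only the fields in
        the k-th group, while each record's mark (label) is not divided."""
        subsets = [[] for _ in field_groups]
        for record in records:  # record: dict of field name -> value
            for k, group in enumerate(field_groups):
                subsets[k].append({f: record[f] for f in group})
        return subsets

    records = [{"user_id": 1, "age": 30, "gender": "F",
                "historical_credit_record": "good",
                "historical_loan_record": "none"}]
    subsets = divide_by_attribute_fields(records)
    # subsets[0] == [{"user_id": 1, "age": 30}]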
  • the migration item acquisition device 120 may receive a plurality of migration items regarding the source data set from the outside.
  • the migration item acquisition device 120 may obtain the migration item from an entity owning the source data set, or an entity authorized to perform related processing on the source data set (for example, a service provider providing a machine learning related service).
  • each migration item may be obtained by the entity owning the source data set, or by an entity authorized to perform related processing on the source data set, by performing machine-learning-related processing based on the corresponding source data subset described above.
  • the obtained migration items may be sent to the migration item acquisition device 120 by these entities.
  • the migration item acquisition device 120 may also obtain multiple migration items on the source dataset by performing machine learning related processing on the source dataset.
  • the acquisition and use of the source data set by the migration item acquisition device 120 may be authorized or protected, so that it can perform corresponding processing on the acquired source data set.
  • the migration item obtaining device 120 may first obtain a source data set including a plurality of source data records.
  • the source data set may be any data set related to the target data set.
  • the above description about the composition of the target data set, the method of obtaining the target data set, and the like are applicable to the source data set, and are not repeated here.
  • the source data record and the target data record may include the same data attribute field.
  • although the source data set is described here as being acquired by the migration item acquisition device 120, it should be noted that the target data set acquisition device 110 may also perform the operation of acquiring the source data set, or the above two devices may jointly acquire the source data set; the disclosure is not limited in this respect.
  • the acquired target data set, source data set, and migration items can all be stored in a storage device (not shown) of the machine learning system.
  • the stored target data, source data, or migration items can be physically isolated or access-isolated to ensure the safe use of the data.
  • the machine learning system 100 cannot directly use the obtained source data set together with the target data set to perform machine learning, but needs to ensure that the source data is used for machine learning only under privacy protection.
  • the migration item acquisition device 120 may obtain multiple migration items on the source data set by performing processing related to machine learning on the source data set in a source data privacy protection mode.
  • the migration item acquisition device 120 may divide the source data set into multiple source data subsets according to the data attribute field and, in the source data privacy protection mode, train a source machine learning model corresponding to each source data subset for the first prediction target based on each source data subset, and use the parameters of each trained source machine learning model as the migration item related to that source data subset.
  • the data record in each source data subset may include at least one data attribute field. Since the method of dividing the source data set according to the data attribute field has been explained above in conjunction with the examples, it will not be repeated here.
  • the source data set may include, in addition to the plurality of source data records, the marks of the source data records regarding the machine learning target; in this case, the above-mentioned division of the source data set according to the data attribute field is limited to dividing the source data records according to the data attribute field, and the marks of the source data records regarding the machine learning target are not divided. That is, the mark of each sub-data record (including at least one data attribute field) obtained by dividing a source data record is still the mark, regarding the machine learning target, of the source data record before division.
  • training the source machine learning model corresponding to each source data subset for the first prediction target may refer to training, based on each source data subset (i.e., the data records included in the subset and their corresponding marks), a source machine learning model corresponding to that subset, where the mark of each data record (obtained by dividing a source data record) for the first prediction target is the mark of the original source data record for the first prediction target.
  • the first prediction target may be, but is not limited to, predicting whether the transaction is a fraudulent transaction, predicting whether a user is capable of paying off a loan, and the like.
  • the migration item related to each source data subset can be any information related to the knowledge contained in the source data subset obtained in the privacy protection mode of the source data.
  • the migration item related to each source data subset may relate to model parameters, an objective function, and/or statistical information about the data in the source data subset, obtained during the process of performing machine-learning-related processing based on the source data subset, but is not limited thereto.
  • the operation of performing machine-learning-related processing based on a source data subset may include training the source machine learning model corresponding to each source data subset based on that subset in the source data privacy protection mode described above, and may also include machine-learning-related processing such as performing feature processing or statistical analysis of the data in the source data subset.
  • the above-mentioned model parameters, objective functions, and/or statistical information about the source data subset may be information obtained directly in the process of performing machine-learning-related processing based on the source data subset, or may be information obtained after further transformation or processing of such information; this is not limited in this disclosure.
  • the migration term involving model parameters may be parameters of the source machine learning model or statistical information of the parameters of the source machine learning model, but is not limited thereto.
  • the objective function involved in the migration item may refer to an objective function constructed in order to train the source machine learning model; in the case where the parameters of the source machine learning model themselves are not migrated, the objective function may not be actually solved separately, but this disclosure is not limited thereto.
  • the migration item related to the statistical information about the source data subset may be data distribution information about the source data subset and / or data distribution change information obtained in a source data privacy protection mode, but is not limited thereto.
  • the source data privacy protection manner may be a protection manner that follows the definition of differential privacy, but is not limited thereto; it may be any privacy protection manner, existing now or appearing in the future, that can protect the source data.
  • Equation 1 (the definition of ε-differential privacy): a randomized mechanism M satisfies ε-differential privacy if, for any two data sets D and D' differing in at most one record and for any set S of possible outputs, $\Pr[M(D) \in S] \leq e^{\varepsilon} \cdot \Pr[M(D') \in S]$. In Equation 1, the smaller the ε, the better the degree of privacy protection; conversely, the larger the ε, the weaker the protection.
  • the specific value of ε can be set according to the user's requirements for the degree of data privacy protection.
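  • As an illustration only, the guarantee in Equation 1 can be demonstrated with the classic Laplace mechanism (this is not the mechanism claimed in this disclosure, which perturbs the training objective instead); the query and its sensitivity below are assumptions:

    import numpy as np

    def laplace_mechanism(true_value, sensitivity, epsilon, rng):
        """Release true_value with Laplace noise of scale sensitivity/epsilon;
        the resulting output distribution satisfies epsilon-differential
        privacy in the sense of Equation 1."""
        scale = sensitivity / epsilon
        return true_value + rng.laplace(loc=0.0, scale=scale)

    rng = np.random.default_rng(0)
    # Example: a counting query (sensitivity 1) under privacy budget 0.5;
    # a smaller epsilon means larger noise and stronger privacy protection.
    noisy_count = laplace_mechanism(true_value=42, sensitivity=1.0,
                                    epsilon=0.5, rng=rng)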
  • the source data privacy protection manner may be adding random noise in the process of training the source machine learning model as described above.
  • random noise can be added so that the above-mentioned differential privacy protection definition is followed.
  • the definition of privacy protection is not limited to the definition of differential privacy, but can be another definition of privacy protection, such as k-anonymity, l-diversity, t-closeness, etc.
  • the source machine learning model may be, for example, a generalized linear model, for example, a logistic regression model, but is not limited thereto.
  • the migration item acquisition device 120 may construct the objective function for training the source machine learning model to include at least a loss function and a noise term.
  • the noise term can be used to add random noise in the process of training the source machine learning model, so that the privacy protection of the source data can be achieved.
  • the objective function used to train the source machine learning model can be constructed to include a loss function and a noise term, and can also be constructed to include other constraint terms used to constrain the model parameters; for example, it can also be constructed to include a regularization term to prevent over-fitting of the model or to prevent the model parameters from being too complex, and a compensation term for privacy protection.
  • for example, the source data privacy protection method may be a protection method that follows the definition of differential privacy, and the source machine learning model may be a generalized linear model.
  • in Equation 2 below, q_k is a scaling constant (specifically, an upper bound limiting the 2-norm of the samples in the k-th data subset), the set of scaling constants needs to satisfy a predetermined norm constraint, c is a constant, {λ_k} is a set of constants, and ε is the privacy budget in Equation 1 above; the data set is divided by data attribute fields into K data subsets, where the data records formed by extracting the data attribute fields belonging to G_k constitute the k-th data subset;
  • Equation 2 can be used to train, under data privacy protection, the machine learning model corresponding to the k-th data subset for the prediction target:

    $$J_k(w) = \frac{1}{n}\sum_{i=1}^{n} \ell\big(w,\, x_i^{(k)},\, y_i\big) + \lambda_k\, g_k(w) + \frac{b^{\top} w}{n} + \Delta_k(w) \qquad \text{(Equation 2)}$$

  • in Equation 2, w is the parameter of the machine learning model, ℓ is the loss function, g_k(w) is a regularization function, the term $b^{\top} w / n$ (with b a random vector) is a noise term used to add random noise in the process of training the machine learning model to achieve data privacy protection, Δ_k(w) is a compensation term used for privacy protection, λ_k is a constant used to control the regularization strength, and J_k(w) is the objective function constructed for training the k-th machine learning model;
  • the value of w that minimizes the objective function, $w_k^* = \arg\min_w J_k(w)$, is the parameter of the k-th machine learning model finally solved.
  • mechanism A₂ can be used to solve both the parameters of the source machine learning model and the parameters of the target machine learning model.
  • for Equation 2 to satisfy the ε-differential privacy definition, the following predetermined condition needs to be met: the regularization function g_k(w) needs to be a 1-strongly convex and twice-differentiable function.
  • the mechanism A₂ for solving the parameters of the machine learning model described above can be used to solve the parameters of the source machine learning model.
  • for example, the regularization function of each source machine learning model can be made equal to the same function; that is, for k ∈ {1, ..., K}, let the regularization function be g_sk(w) (where g_sk(w) corresponds to g_k(w) in Equation 2 above). In this case, the mechanism A₂ for solving the parameters of the machine learning model described above can be used to finally solve the parameters $w_{s1}^{*}, \ldots, w_{sK}^{*}$ of the K source machine learning models, taking as inputs the source data set, the privacy budget ε_s of the source data privacy protection method, the set S_G of data attribute fields included in each source data record, the constants λ_sk (i.e., λ_k in Equation 2 above), the regularization functions g_sk (i.e., g_k(w) in Equation 2 above), and the scaling constants q_sk (i.e., the q_k described above).
  • the parameters of the source machine learning model corresponding to each source data subset solved according to the above mechanism A 2 not only satisfy the privacy protection of the source data, but also carry the knowledge of the corresponding source data subset. Subsequently, the parameters of each source machine learning model trained can be used as a migration term related to each source data subset to transfer the knowledge of the source data subset to the target data set.
  • since a corresponding source machine learning model is trained for each source data subset to obtain the migration items, instead of a single source machine learning model being trained on the entire source data set to obtain a migration item, the random noise added during the training process can be effectively reduced, so that the parameters of the source machine learning model corresponding to each source data subset (i.e., the migration items related to each source data subset) not only protect the privacy information in the corresponding source data subset, but also ensure the availability of the migration items.
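  • As an illustration only, the following Python sketch trains one per-subset source model by objective perturbation in the spirit of Equation 2; the noise calibration (gamma-distributed noise magnitude with a uniform direction, as commonly used in objective perturbation for logistic regression with samples scaled to norm at most 1) is a simplifying assumption, and the compensation term is omitted:

    import numpy as np
    from scipy.optimize import minimize

    def train_private_logreg(X, y, lam, epsilon, rng):
        """Solve min_w (1/n)*sum log(1+exp(-y_i w.x_i)) + (lam/2)||w||^2
        + (b.w)/n, where the random vector b is the privacy noise term."""
        n, d = X.shape
        # Sample noise b: uniform direction, gamma-distributed magnitude.
        direction = rng.normal(size=d)
        direction /= np.linalg.norm(direction)
        b = rng.gamma(shape=d, scale=2.0 / epsilon) * direction

        def objective(w):
            margins = y * (X @ w)
            loss = np.mean(np.log1p(np.exp(-margins)))
            return loss + 0.5 * lam * (w @ w) + (b @ w) / n

        w0 = np.zeros(d)
        return minimize(objective, w0, method="L-BFGS-B").x

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    X /= np.maximum(1, np.linalg.norm(X, axis=1, keepdims=True))  # scale samples
    y = np.where(X @ np.array([1.0, -2.0, 0.5]) > 0, 1.0, -1.0)
    w_sk = train_private_logreg(X, y, lam=0.1, epsilon=1.0, rng=rng)  # migration item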
  • the first target machine learning model obtaining device 130 may use each migration item among the multiple migration items to obtain a first target machine learning model corresponding to that migration item, so as to obtain a plurality of first target machine learning models.
  • the first target machine learning model obtaining device 130 may directly use each migration item as the parameters of the first target machine learning model corresponding to it, without using the target data set (for convenience of description, this method of obtaining the first target machine learning model is referred to below simply as the "method of directly obtaining the first target machine learning model"). That is, supposing the parameters of the multiple first target machine learning models are $w_{t1}, \ldots, w_{tK}$, the first target machine learning model and the source machine learning model can be the same type of machine learning model, and one can directly let $w_{tk} = w_{sk}^{*}$ for each k, thereby obtaining the first target machine learning model corresponding to each migration item.
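  • A minimal sketch of this direct method, continuing the illustrative code above (the variable names are assumptions):

    # Migration items w_s[k] are the privately trained source parameters;
    # the direct method simply reuses them as first target model parameters.
    w_s = [w_sk]                     # one migration item per source data subset
    w_t = [w.copy() for w in w_s]    # w_t[k] = w_s[k]; no target data is used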
  • alternatively, the first target machine learning model obtaining device 130 may use the following method (for convenience of description, hereinafter simply referred to as the "method of obtaining the first target machine learning model through training") to obtain the first target machine learning model corresponding to each migration item. Specifically, the first target machine learning model obtaining device 130 may first divide the target data set or the first target data set into multiple first target data subsets according to the data attribute field, in the same manner as the source data set was divided, and then, in the target data privacy protection mode, based on each first target data subset, combined with the migration item related to the source data subset corresponding to that first target data subset, train for the second prediction target the first target machine learning model corresponding to the migration item.
  • the first target data set may include a part of the target data records included in the target data set, and the data records in each of the first target data subset and the source data subset corresponding thereto may include the same data attribute field.
  • since the target data record and the source data record include the same data attribute fields, the target data set or the first target data set can be divided according to the data attribute fields, in the same manner as the source data set, into a plurality of first target data subsets.
  • each target data record also includes the following five data attribute fields: user ID, age, gender, historical credit record, and historical loan record
  • the target data set or the first target data set is divided in the same exemplary division manner as was used to divide the source data records.
  • the five data attribute fields are also divided into three data attribute field groups.
  • the first data attribute field group may include two data attribute fields of user ID and age
  • the second data attribute field group may include two data attribute fields: gender and historical credit record
  • the third data attribute field group may include one data attribute field: historical loan record.
  • then, among the plurality of first target data subsets obtained by dividing the target data set or the first target data set according to the data attribute field, the first one may be composed of the data records including the data attribute fields in the first data attribute field group.
  • the source data subset corresponding to the above first first target data subset is the first source data subset mentioned when describing the division of the source data set, and the data records in the first first target data subset and the first source data subset include the same data attribute fields (that is, both include the user ID and age data attribute fields), and so on.
  • the above-mentioned target data privacy protection method may be the same as the source data privacy protection method, for example, it may also be a protection method that follows the definition of differential privacy, but is not limited thereto.
  • the first target machine learning model may belong to the same type of machine learning model as the source machine learning model.
  • the first target machine learning model may also be a generalized linear model, such as a logistic regression model, but is not limited thereto.
  • it may be any linear model that meets a predetermined condition.
  • the target data privacy protection method here may also be a privacy protection method different from the source data privacy protection method, and the first target machine learning model may also belong to a different type of machine learning model from the source machine learning model; this application places no restriction on this.
  • the above-mentioned target data privacy protection manner may be adding random noise in the process of obtaining the first target machine learning model.
  • the first target machine learning model obtaining device 130 may construct an objective function for training the first target machine learning model to include at least a loss function and a noise term.
  • specifically, the first target machine learning model obtaining device 130 may construct the objective function for training the first target machine learning model to include at least a loss function and a noise term, as well as a term reflecting the difference between the parameters of the first target machine learning model and the migration item corresponding to it; then, in the target data privacy protection mode, based on each first target data subset, combined with the migration item related to the source data subset corresponding to that first target data subset, the first target machine learning model obtaining device 130 may train, for the second prediction target, the first target machine learning model corresponding to the migration item by solving the constructed objective function.
  • in this way, the knowledge in the source data subset is transferred to the target data set, so that the training process can jointly use the knowledge of both the source data set and the target data set, and the first target machine learning model trained in this way performs better.
  • the second prediction target may be the same as the first prediction target targeted when training the source machine learning model described above (for example, both predict whether a transaction is fraudulent), or similar to it (for example, the first prediction target may be to predict whether a transaction is fraudulent, and the second prediction target may be to predict whether a transaction is suspected of being illegal).
  • the above objective function can also be constructed to include a regularization term to prevent over-fitting of the trained first target machine learning model, or to include other constraint terms according to actual task requirements, for example, a compensation term for privacy protection; this is not limited in this application, as long as the constructed objective function can effectively achieve privacy protection of the target data while transferring the knowledge of the corresponding source data subset to the target data set.
  • hereinafter, an example is described in which the source machine learning model is a logistic regression model, the first target machine learning model is a generalized linear model, and the target data privacy protection mode is a protection mode that follows the definition of differential privacy.
  • the target data set $D_t$ or the first target data set $D_{t1}$ (where $D_{t1}$ includes a part of the target data records included in the target data set $D_t$; for example, all target data records in $D_t$ may be divided into the first target data set $D_{t1}$ and the second target data set $D_{t2}$ according to a p : (1-p) ratio) is divided according to the data attribute field into a plurality of first target data subsets, in the same manner as the source data set.
  • for example, the data attribute field set S_G included in the source data records is divided into K non-overlapping data attribute field groups G_1, G_2, ..., G_K.
  • the regularization function in the objective function used to train the k-th first target machine learning model can be made, for example, $g_{tk}(u) = \frac{1}{2}\,\|u - w_{sk}^{*}\|^{2}$, where u is the parameter of the k-th first target machine learning model and $w_{sk}^{*}$ is the k-th migration item;
  • the mechanism A₂ for solving the parameters of the machine learning model described above can then be used by replacing w with u, replacing the data set with $D_t$ or $D_{t1}$, replacing g_k(w) with g_tk(u), replacing λ_k with λ_tk (a constant used to control the regularization strength in the objective function used to train the first target machine learning model), and replacing q_k with q_tk (the scaling constant used to scale the samples in the k-th first target data subset), so as to obtain the parameters $w_{tk}^{*}$ of the k-th first target machine learning model corresponding to the k-th migration item;
  • if the previously divided target data set (used to train the first target machine learning models) is $D_t$ and the target data set subsequently used to train the second target machine learning model completely or partially overlaps with it, the parameters $w_{t1}^{*}, \ldots, w_{tK}^{*}$ of the K first target machine learning models are obtained under the privacy budget p·ε_t (where p·ε_t is the privacy budget corresponding to the noise term included in the objective function used to train the first target machine learning model, p is the ratio of that privacy budget to the privacy budget ε_t of the entire target data privacy protection method, and 0 < p < 1); if the previously divided target data set is the first target data set $D_{t1}$ and the target data set subsequently used to train the second target machine learning model is $D_{t2}$, with no overlap at all, the parameters of the K first target machine learning models are obtained under the privacy budget ε_t (where ε_t is the larger of the privacy budget corresponding to the noise term included in the objective function used to train the first target machine learning model and the privacy budget corresponding to the noise term included in the objective function used to train the second target machine learning model);
  • since the regularization function g_tk(u) contains the term $\|u - w_{sk}^{*}\|^{2}$, the objective function for training the first target machine learning model is constructed to reflect the difference between the parameters of the first target machine learning model and the migration item corresponding to it, thereby effectively transferring the knowledge of the corresponding source data subset to the target data set.
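  • As an illustration only, this transfer-regularized training can be sketched by reusing the private solver above with a regularizer biased toward the migration item; the function name and noise calibration continue the earlier sketch and are assumptions:

    def train_private_transfer_logreg(X, y, w_src, lam, epsilon, rng):
        """Like train_private_logreg, but regularize toward the migration
        item w_src, so the objective reflects ||u - w_src||^2."""
        n, d = X.shape
        direction = rng.normal(size=d)
        direction /= np.linalg.norm(direction)
        b = rng.gamma(shape=d, scale=2.0 / epsilon) * direction

        def objective(u):
            margins = y * (X @ u)
            loss = np.mean(np.log1p(np.exp(-margins)))
            diff = u - w_src
            return loss + 0.5 * lam * (diff @ diff) + (b @ u) / n

        return minimize(objective, w_src.copy(), method="L-BFGS-B").x

    # k-th first target model, trained on the k-th first target data subset
    # under budget p*eps_t and pulled toward the k-th migration item w_sk:
    w_tk = train_private_transfer_logreg(X, y, w_src=w_sk, lam=0.1,
                                         epsilon=0.5, rng=rng)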
  • neither the source machine learning model nor the first target machine learning model is limited to a logistic regression model; each may be, for example, any linear model that satisfies the predetermined condition described above, or even any other suitable model.
  • the second target machine learning model obtaining device 140 may use the plurality of first target machine learning models to obtain a second target machine learning model.
  • the first target machine learning models and the second target machine learning model usually form upper and lower layers: the first target machine learning models may correspond to the first-layer machine learning models, and the second target machine learning model may correspond to the second-layer machine learning model.
  • the second target machine learning model obtaining device 140 may obtain the second target machine learning model in the following manner (hereinafter, for convenience of description, simply referred to as the "method of obtaining the second target machine learning model through training"): first, the second target machine learning model obtaining device 140 may divide the target data set into multiple target data subsets according to the data attribute fields, in the same manner as the source data set; here, the data records in each target data subset and the corresponding source data subset include the same data attribute fields.
  • the second target machine learning model obtaining device 140 may perform prediction on each target data subset by using the first target machine learning model corresponding to the target data subset to obtain a prediction result for each data record in each target data subset.
  • a second target machine learning model is trained for the third prediction target based on a set of training samples composed of multiple prediction results obtained corresponding to each target data record.
  • the label of the training sample is the label of the target data record for the third prediction target.
  • for example, if the K first target machine learning models obtained are all logistic regression models whose parameters are $w_{t1}^{*}, \ldots, w_{tK}^{*}$ respectively (K is also the number of target data subsets obtained by division), then the training sample composed of the multiple prediction results corresponding to the i-th target data record in the target data set can be expressed as $\big(\sigma(w_{t1}^{*\top} x_{1i}), \ldots, \sigma(w_{tK}^{*\top} x_{Ki})\big)$, where $x_{ki}$ is the i-th data record in the k-th (k ∈ {1, ..., K}) target data subset and $\sigma(\cdot)$ is the prediction function of the logistic regression model; these K prediction results correspond to the i-th target data record in the target data set, and they can constitute the feature portion of a training sample for the second target machine learning model.
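  • As an illustration only, the construction of these second-layer training samples can be sketched as follows, continuing the earlier code (the sigmoid prediction function and the list-of-subsets layout are assumptions):

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def build_second_layer_samples(subset_features, first_layer_params):
        """subset_features[k] is the feature matrix of the k-th target data
        subset; row i of the result holds the K prediction results for the
        i-th target data record, used as features of the second model."""
        preds = [sigmoid(Xk @ wk)                       # shape (n,)
                 for Xk, wk in zip(subset_features, first_layer_params)]
        return np.stack(preds, axis=1)                  # shape (n, K)

    Z = build_second_layer_samples([X], [w_tk])
    # Z[i], together with the i-th record's mark for the third prediction
    # target, forms one training sample of the second target model.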
  • the first target machine learning model and the second target machine learning model may belong to the same type of machine learning model.
  • the second target machine learning model may also be a generalized linear model (eg, a logistic regression model).
  • the target data privacy protection method here may be a protection method that follows the definition of differential privacy, but is not limited to this.
  • the target data privacy protection method may be adding random noise in the process of obtaining the second target machine learning model.
  • the second target machine learning model obtaining device 140 may construct an objective function for training the second target machine learning model to include at least a loss function and a noise term.
  • the second target machine learning model can be trained according to the mechanism A₁ for training a machine learning model described below, where A₁ solves the parameters of the machine learning model under the condition of satisfying the definition of differential privacy protection.
  • the implementation process of mechanism A₁ is as follows: Equation 4 below is used to train the machine learning model under data privacy protection, so as to obtain model parameters that satisfy the definition of differential privacy protection:

    $$J(w) = \frac{1}{n}\sum_{i=1}^{n} \ell(w,\, x_i,\, y_i) + \lambda\, g(w) + \frac{b^{\top} w}{n} + \Delta(w) \qquad \text{(Equation 4)}$$

  • in Equation 4, w is the parameter of the machine learning model, ℓ is the loss function, g(w) is the regularization function, $b^{\top} w / n$ is a noise term used to add random noise in the process of training the machine learning model to achieve data privacy protection, Δ(w) is a compensation term used for privacy protection, λ is a constant used to control the regularization strength, and J(w) is the objective function constructed for training the machine learning model. According to Equation 4, the value of w that minimizes the objective function, $w^{*} = \arg\min_w J(w)$, is the parameter of the machine learning model finally solved.
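  • In terms of the illustrative code above, mechanism A₁ corresponds to applying the same private solver once to the whole second-layer training set rather than per subset; a minimal sketch under the earlier assumptions:

    # Train the second target machine learning model on the stacked
    # prediction features Z under budget (1-p)*eps_t (overlap case).
    y2 = y                              # marks for the third prediction target
    w_second = train_private_logreg(Z, y2, lam=0.1, epsilon=0.5, rng=rng)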
  • the first target machine learning model is not limited to being a logistic regression model, and the second target machine learning model may be any machine learning model of the same or different type as the first target machine learning model.
  • the third prediction target here may be the same as or similar to the second prediction target mentioned in the training of the first target machine learning model described above.
  • each target data record in the target data set may actually correspond to two marks, namely the mark of the target data record for the second prediction target and the mark of the target data record for the third prediction target.
  • in the case where the first target machine learning model obtaining device 130 obtains the plurality of first target machine learning models through the above-described "method of obtaining the first target machine learning model through training",
  • the second target machine learning model obtaining device 140 may obtain the second target machine learning model by performing the following operations (hereinafter, for convenience of description, this method of obtaining the second target machine learning model is simply referred to as the "method of obtaining the second target machine learning model directly"): the rule of the second target machine learning model is set to obtain the prediction result for each prediction data record based on the multiple prediction results corresponding to that record, obtained in the following manner: a prediction data record is obtained and divided into a plurality of sub-prediction data according to the data attribute field, in the same manner as the source data set is divided; then, for each sub-prediction data in the prediction data record, prediction is performed using the first target machine learning model corresponding to it.
  • the prediction data record may include the same data attribute fields as the previously described target data records and source data records, except that the prediction data record does not include a mark; since the manner of dividing data records according to the data attribute field has been described above with an example for the source data, how the prediction data record is divided into multiple sub-prediction data is not described in detail here.
  • each sub-prediction data may include at least one data attribute field.
  • the process of performing prediction using each first target machine learning model corresponding to each target data subset to obtain the prediction result for each data record in each target data subset has been described above.
  • obtaining the prediction result of the second target machine learning model for each prediction data record based on the obtained multiple prediction results corresponding to that record may be done by averaging the multiple prediction results, taking their maximum value, or voting on them.
  • for example, if the multiple prediction results are probability values of 20%, 50%, 60%, 70%, and 80%, the prediction result of the second target machine learning model for the prediction data record may be the probability value obtained by averaging them (i.e., 56%).
  • or, if the multiple prediction results are "transaction is fraud", "transaction is non-fraud", "transaction is fraud", "transaction is fraud", and "transaction is fraud", respectively, then the prediction result of the second target machine learning model for the prediction data record, obtained by voting, is "transaction is fraud".
  • that is, the second target machine learning model of the present disclosure is not limited to a model obtained through machine learning, but can be any suitable mechanism for processing data (for example, integrating the multiple prediction results described above to get the prediction result for each prediction data record).
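  • As an illustration only, the direct combination rules above (average, maximum, or majority vote) can be sketched as follows; the encoding of class predictions is an assumption:

    from collections import Counter

    def combine_predictions(preds, rule="average"):
        """Integrate the K per-part prediction results for one prediction
        data record into the second target model's final result."""
        if rule == "average":               # e.g. mean of probability values
            return sum(preds) / len(preds)
        if rule == "max":                   # take the maximum value
            return max(preds)
        if rule == "vote":                  # majority vote on class labels
            return Counter(preds).most_common(1)[0][0]
        raise ValueError(rule)

    combine_predictions([0.2, 0.5, 0.6, 0.7, 0.8])            # -> 0.56
    combine_predictions(["fraud", "non-fraud", "fraud",
                         "fraud", "fraud"], rule="vote")      # -> "fraud"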
  • in the above "method of obtaining the first target machine learning model through training", the first target machine learning model obtaining device 130 can use either the target data set $D_t$ or the first target data set $D_{t1}$ to obtain the multiple first target machine learning models. In the case where the first target machine learning model obtaining device 130 uses the target data set $D_t$, the second target machine learning model obtaining device 140 may perform prediction on each first target data subset using the first target machine learning model corresponding to it, to obtain a prediction result for each data record in each first target data subset, and, in the target data privacy protection mode, train a second target machine learning model for the third prediction target based on the set of training samples composed of the multiple prediction results obtained corresponding to each target data record.
  • the above process is similar to the "method of obtaining the second target machine learning model through training" described previously, except that, in the "method of obtaining the first target machine learning model through training", the target data set has already been divided into multiple first target data subsets; therefore, there is no need to divide the data set again here, and prediction can be performed directly on each first target data subset using the corresponding first target machine learning model.
  • ⁇ t is the privacy budget of the entire target data privacy protection method
  • (1-p) ⁇ t is the privacy budget corresponding to the noise term included in the objective function used to train the second target machine learning model.
  • in the case where the first target machine learning model obtaining device 130 uses the first target data set $D_{t1}$ in the above-described "method of obtaining the first target machine learning model through training", the second target machine learning model obtaining device 140 may divide the second target data set into multiple second target data subsets according to the data attribute field, in the same manner as the source data set.
  • the second target data set may include at least the target data records remaining after the first target data set is excluded from the target data set, where the target data records in the second target data set have the same data attribute fields as the source data records.
  • the second target data set may include only the target data records remaining after the first target data set is excluded from the target data set (i.e., the second target data set may be the above-mentioned $D_{t2}$), or the second target data set may include, in addition to those remaining target data records, a part of the target data records in the first target data set.
  • the method of dividing the source data set according to the data attribute field has been described above, and therefore, the operation of dividing the second target data set is not described herein again.
  • the second target machine learning model obtaining device 140 may then perform prediction on each second target data subset using the first target machine learning model corresponding to it, to obtain a prediction result for each data record in each second target data subset, and, in the target data privacy protection mode, train a second target machine learning model for the third prediction target based on the set of training samples composed of the multiple prediction results obtained corresponding to each target data record (each target data record in the second target data set).
  • the third prediction target may be the same as or similar to the second prediction target mentioned in the training of the first target machine learning model described above.
• the second prediction target may be to predict whether a transaction is suspected of fraud.
• the third prediction target may be to predict whether the transaction is suspected of being illegal, or to predict whether the transaction is fraudulent.
  • the second target machine learning model may be any machine learning model that is the same as or different from the first target machine learning model, and the second target machine learning model may be used to execute business decisions.
  • the business decision may involve at least one of transaction anti-fraud, account opening anti-fraud, intelligent marketing, intelligent recommendation, and loan evaluation, but is not limited thereto.
• the trained target machine learning model can also be used for business decisions related to physiological conditions. In fact, the present disclosure places no restriction on the types of specific business decisions to which the target machine learning model can be applied, as long as the business is suitable for making decisions using a machine learning model.
  • the first target machine learning model obtaining device 130 may construct an objective function for training the first target machine learning model to include at least a loss function and a noise term
• the second target machine learning model obtaining device 140 may construct an objective function for training the second target machine learning model to include at least a loss function and a noise term
• the privacy budget of the target data privacy protection method may depend on the sum of the privacy budget corresponding to the noise term included in the objective function used to train the first target machine learning model and the privacy budget corresponding to the noise term included in the objective function used to train the second target machine learning model, or on the larger of the two privacy budgets.
• when the target data set used in the process of training the first target machine learning model is exactly the same as, or partially the same as, the target data set used in the process of training the second target machine learning model (for example, the target data set used in the process of training the first target machine learning model is the first target data set, while the target data set used in the process of training the second target machine learning model includes both the remaining target data records after the first target data set is excluded from the target data set and a part of the first target data set), the privacy budget of the target data privacy protection method may depend on the sum of the privacy budget corresponding to the noise term included in the objective function used to train the first target machine learning model and the privacy budget corresponding to the noise term included in the objective function used to train the second target machine learning model.
• when the target data set used during the training of the first target machine learning model is completely different from (i.e., does not overlap at all with) the target data set used during the training of the second target machine learning model (for example, the target data set may be divided by target data record into a first target data set and a second target data set, the first target data set being used in the process of training the first target machine learning model and the second target data set being used in the process of training the second target machine learning model), the privacy budget of the target data privacy protection method may depend on the larger of the privacy budget corresponding to the noise term included in the objective function used to train the first target machine learning model and the privacy budget corresponding to the noise term included in the objective function used to train the second target machine learning model. The two cases are summarized below.
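• In differential-privacy terms, these two cases correspond to the standard sequential and parallel composition properties. With ε₁ and ε₂ denoting the privacy budgets of the two noise terms (notation assumed here for illustration):

\[
\varepsilon_t =
\begin{cases}
\varepsilon_1 + \varepsilon_2, & \text{overlapping training data (sequential composition)},\\
\max(\varepsilon_1, \varepsilon_2), & \text{disjoint training data (parallel composition)}.
\end{cases}
\]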
• in the above manner, the machine learning system 100 can successfully transfer the knowledge of each corresponding part of the source data set to the target data set in the source data privacy protection mode while ensuring the availability of the transferred knowledge, so that it can further integrate this knowledge in the target data privacy protection mode to train a second target machine learning model with better model performance, to be applied to appropriate business decisions.
• although the machine learning system is described above as being divided into devices that respectively perform corresponding processing (for example, the target data set acquisition device 110, the migration item acquisition device 120, the first target machine learning model acquisition device 130, and the second target machine learning model acquisition device 140), it is clear to those skilled in the art that the processing performed by each device described above may also be performed by the machine learning system without any specific device division, or without clear delimitation between the devices.
• the machine learning system 100 described above with reference to FIG. 1 is not limited to the devices described above; other devices (for example, a prediction device, a storage device, and/or a model update device, etc.) may be added as needed, or the above devices may be combined.
• the prediction device may obtain a prediction data set including at least one prediction data record, divide the prediction data set into a plurality of prediction data subsets according to the data attribute fields in the same manner as the source data set is divided, perform prediction for each prediction data subset using the trained first target machine learning model corresponding to it to obtain a prediction result for each data record in each prediction data subset, and obtain a prediction result for each prediction data record based on the obtained multiple prediction results corresponding to each prediction data record.
• the multiple prediction results obtained corresponding to each prediction data record may be directly integrated to obtain the prediction result for each prediction data record; alternatively, a trained second target machine learning model may be used to perform prediction on a prediction sample composed of the multiple prediction results obtained corresponding to each prediction data record, so as to obtain the prediction result for each prediction data record.
• a system using a machine learning model with data privacy protection for prediction may include a target machine learning model acquisition device, a prediction data record acquisition device, a division device, and a prediction device.
  • the target machine learning model obtaining device may obtain the plurality of first target machine learning models and the second target machine learning model described above.
• the target machine learning model acquisition device may acquire the multiple first target machine learning models according to the above-mentioned "method of directly obtaining the first target machine learning model" or "method of obtaining the first target machine learning model through training".
• the target machine learning model acquisition device may acquire the second target machine learning model according to the "method of obtaining the second target machine learning model through training" or the "method of directly obtaining the second target machine learning model". That is, the target machine learning model acquisition device may itself perform the operations described above for obtaining the first target machine learning models and the second target machine learning model; in this case, the target machine learning model acquisition device may correspond to the machine learning system 100 described above.
• alternatively, if the machine learning system 100 has already obtained the multiple first target machine learning models and the second target machine learning model in the above manner, the target machine learning model obtaining device may also directly obtain them from the machine learning system 100 for subsequent prediction.
  • the prediction data record acquisition device can acquire a prediction data record.
  • the predicted data record may include the same data attribute fields as the previously described source data record and target data record.
  • the prediction data record acquisition device may obtain the prediction data records one by one in real time, or may obtain the prediction data records in batches offline.
  • the dividing device may divide the prediction data record into a plurality of sub-prediction data.
• the dividing device may divide the prediction data record into a plurality of sub-prediction data according to the data attribute fields in the same manner as the previously described division of the source data set, and each sub-prediction data may include at least one data attribute field.
• the division method has been described above in conjunction with an example, so it will not be repeated here; the only difference is that the object divided here is the prediction data record.
• the prediction device may, for each sub-prediction data in each prediction data record, perform prediction using the first target machine learning model corresponding to it to obtain a prediction result for each sub-prediction data. For example, if a sub-prediction data includes the two data attribute fields of gender and historical credit record, then the first target machine learning model trained based on the set of data records that includes the same data attribute fields as that sub-prediction data (i.e., the first target data subset mentioned above) is the first target machine learning model corresponding to that sub-prediction data.
  • the prediction result herein may be, for example, a confidence value, but is not limited thereto.
  • the prediction device may input a plurality of prediction results corresponding to each prediction data record obtained by the plurality of first target machine learning models into a second target machine learning model to obtain a prediction result for each prediction data record.
• the prediction device may obtain the prediction result of the second target machine learning model for each prediction data record based on the multiple prediction results according to a set rule of the second target machine learning model, for example, by averaging the multiple prediction results, taking their maximum value, or voting among them.
• the prediction device may use a previously trained second target machine learning model (for the specific training process, refer to the description of training the second target machine learning model given previously) to perform prediction on a prediction sample composed of the multiple prediction results, so as to obtain the prediction result for each prediction data record.
• in the above manner, the prediction system may obtain the multiple prediction results corresponding to each prediction data record by performing prediction using the multiple first target machine learning models after dividing the prediction data records, and may further use the second target machine learning model based on the multiple prediction results to obtain the final prediction result, thereby improving the prediction effect of the model; a sketch of this flow follows below.
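• Putting the prediction flow together as a minimal sketch (the function and parameter names are illustrative assumptions; `field_groups` is the same attribute-field division used for the source data set):

```python
from typing import Callable, Dict, List, Sequence

def predict_record(record: Dict[str, float],
                   field_groups: Sequence[Sequence[str]],
                   first_models: Sequence[Callable[[List[float]], float]],
                   second_model: Callable[[List[float]], float]) -> float:
    # Divide the prediction data record into sub-prediction data by field group.
    sub_records = [[record[f] for f in group] for group in field_groups]
    # One prediction result (e.g., a confidence value) per first target model.
    partial_results = [m(x) for m, x in zip(first_models, sub_records)]
    # The second target model turns the per-part results into the final result.
    return second_model(partial_results)
```

• Here `second_model` may be a trained second target machine learning model or a simple set rule such as averaging, as described above.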
• machine learning mentioned in the present disclosure can be implemented in the form of "supervised learning", "unsupervised learning", or "semi-supervised learning"; the exemplary embodiments of the present disclosure do not specifically limit the specific form of machine learning.
  • FIG. 2 is a flowchart illustrating a method of performing machine learning in a data privacy protection mode (hereinafter, it is simply referred to as a “machine learning method”) according to an exemplary embodiment of the present disclosure.
  • the machine learning method shown in FIG. 2 may be executed by the machine learning system 100 shown in FIG. 1, or may be implemented entirely in software by a computer program or instruction, or may be performed by a computing system or computing device with a specific configuration.
• for example, the machine learning method may be executed by a system including at least one computing device and at least one storage device storing instructions, wherein the instructions, when executed by the at least one computing device, cause the at least one computing device to execute the above machine learning method.
• for convenience of description, it is assumed that the method shown in FIG. 2 is executed by the machine learning system 100 shown in FIG. 1, and that the machine learning system 100 has the configuration shown in FIG. 1.
• in step S210, the target data set acquisition device 110 may acquire a target data set including a plurality of target data records. Everything described above regarding the acquisition of the target data set in the description of the target data set acquisition device 110 with reference to FIG. 1 applies here as well, and therefore will not be repeated.
  • the migration item obtaining device 120 may obtain multiple migration items about the source data set.
• each migration item among the multiple migration items may be used to transfer the knowledge of a corresponding part of the source data set to the target data set under the protection of source data privacy.
  • the corresponding part of the source data set may be a source data subset obtained by dividing the source data set according to data attribute fields. The source data set, the migration item, the corresponding source data subset, and the method of dividing the source data set have been described in the description of the migration item obtaining device 120 in FIG. 1, and are not repeated here.
  • the migration item acquisition device 120 may receive multiple migration items regarding the source data set from the outside.
  • the migration item acquisition device 120 may obtain a plurality of migration items about the source data set by itself performing a machine learning process on the source data set.
  • the migration item obtaining device 120 may first obtain a source data set including a plurality of source data records.
  • the source data record and the target data record may include the same data attribute field.
  • the migration item obtaining device 120 may divide the source data set into multiple source data subsets according to the data attribute fields, where the data records in each source data subset include at least one data attribute field.
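• As a minimal illustration of this division (the grouping of attribute fields below is a hypothetical example, not taken from the disclosure), each data record is projected onto groups of attribute fields, one group per source data subset; the target data set and, later, the prediction data records are divided with the same groups:

```python
from typing import Dict, List, Sequence

def divide_by_fields(records: List[Dict[str, object]],
                     field_groups: Sequence[Sequence[str]]) -> List[List[Dict[str, object]]]:
    # One data subset per field group; each data record in a subset keeps
    # only that group's attribute fields (at least one field per group).
    return [[{f: r[f] for f in group} for r in records] for group in field_groups]

# Hypothetical grouping of the attribute fields into four subsets.
field_groups = [["gender", "historical_credit"], ["occupation", "salary"],
                ["property"], ["historical_loan_amount"]]
```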
• the migration item acquisition device 120 may train a source machine learning model corresponding to each source data subset for the first prediction target based on each source data subset, and use the parameters of each trained source machine learning model as the migration item related to each source data subset.
  • the source data privacy protection method may be a protection method following the definition of differential privacy protection, but is not limited thereto.
  • the source data privacy protection method may be to add random noise in the process of performing processing related to machine learning based on the source data set to achieve privacy protection of the source data.
  • the source data privacy protection method may be adding random noise in the process of training the source machine learning model.
  • an objective function for training a source machine learning model may be constructed in the source data privacy protection manner to include at least a loss function and a noise term.
  • the noise term is used to add random noise in the process of training the source machine learning model, thereby achieving privacy protection of the source data.
  • the objective function may be configured to include other constraint terms for constraining model parameters.
• the source machine learning model may be a generalized linear model (for example, a logistic regression model), but is not limited thereto; it may be any linear model that meets a predetermined condition, or even any suitable model that meets a certain condition.
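• As a concrete sketch in the style of objective perturbation for differentially private empirical risk minimization (the notation below is assumed for illustration and is not the disclosure's): for a source data subset D_s^(i) with n_i records (x_j, y_j), the objective function described above may take the form

\[
J\bigl(\theta_s^{(i)}\bigr) \;=\; \frac{1}{n_i}\sum_{j=1}^{n_i} \ell\bigl(\theta_s^{(i)\top} x_j,\; y_j\bigr) \;+\; \frac{\lambda}{2}\,\bigl\lVert \theta_s^{(i)} \bigr\rVert^2 \;+\; \frac{1}{n_i}\, b^{\top}\theta_s^{(i)},
\]

where ℓ is the loss function, the middle term is an optional constraint term on the model parameters, and b is a random noise vector whose distribution is calibrated to the source privacy budget ε_s.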
• in step S230, the first target machine learning model obtaining device 130 may use each migration item among the plurality of migration items to obtain a first target machine learning model corresponding to each migration item, thereby obtaining multiple first target machine learning models.
  • the second target machine learning model obtaining device 140 may use the plurality of first target machine learning models obtained in step S230 to obtain a second target machine learning model.
• the target data privacy protection method may also be a protection method that follows the definition of differential privacy, but is not limited thereto; it may be another data privacy protection method that is the same as or different from the source data privacy protection method.
  • the target data privacy protection method may be adding random noise in the process of obtaining the first target machine learning model and / or the second target machine learning model.
  • FIG. 3 is a schematic diagram illustrating a method of performing machine learning in a data privacy protection mode according to a first exemplary embodiment of the present disclosure.
• first, the obtained source data set is divided into a plurality of source data subsets according to the data attribute fields.
• for example, referring to FIG. 3, D_s is a source data set, which is divided into four source data subsets D_s^(1), D_s^(2), D_s^(3), and D_s^(4) according to the data attribute fields. Then, in the source data privacy protection mode, based on each source data subset, a source machine learning model corresponding to each source data subset is trained for the first prediction target, and the parameters of each trained source machine learning model serve as the migration item related to that source data subset. In FIG. 3, θ_s^(1), θ_s^(2), θ_s^(3), and θ_s^(4) denote the parameters of the source machine learning models corresponding to the source data subsets D_s^(1) to D_s^(4), and are used as the migration items related to those source data subsets.
• in this case, the first target machine learning model obtaining device 130 may directly use each migration item as the parameters of the first target machine learning model corresponding to it, without using the target data set.
  • the second target machine learning model obtaining device 140 may divide the target data set into multiple target data subsets in the same manner as the source data set according to the data attribute field, where each target data subset The data record contains the same data attribute field as the data record in the corresponding source data subset.
• for example, referring to FIG. 3, the target data set D_t may be divided into four target data subsets D_t^(1), D_t^(2), D_t^(3), and D_t^(4) in the same manner as the source data set D_s is divided.
• then, the second target machine learning model obtaining device 140 may perform prediction on each target data subset using the first target machine learning model corresponding to that target data subset to obtain a prediction result for each data record in each target data subset. For example, referring to FIG. 3, p_1 is the prediction result set obtained by using the first target machine learning model whose parameters are the migration item θ_s^(1) to perform prediction on the target data subset D_t^(1); that is, p_1 includes the prediction results of that first target machine learning model for each data record in D_t^(1). Similarly, p_2, p_3, and p_4 are the prediction result sets obtained by performing prediction on D_t^(2), D_t^(3), and D_t^(4) using the first target machine learning models whose parameters are θ_s^(2), θ_s^(3), and θ_s^(4), respectively.
• next, the second target machine learning model obtaining device 140 may, in the target data privacy protection mode, train a second target machine learning model for the third prediction target based on a set of training samples composed of the multiple prediction results obtained corresponding to each target data record. For example, each target data record in the target data set D_t has a corresponding prediction result in each of the prediction result sets p_1, p_2, p_3, and p_4; these four prediction results can constitute a training sample corresponding to that target data record, and such a set of training samples can be used to train a second target machine learning model for the third prediction target under the protection of target data privacy.
• in this exemplary embodiment, all data records among the multiple target data records are utilized in the target data privacy protection mode; a code sketch of the embodiment follows below.
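• The flow of this embodiment can be summarized in a short sketch (all names, and the simple Laplace draw used for the noise term, are illustrative assumptions rather than the disclosure's calibrated mechanism): the migration items serve directly as the first target models' parameters, each target data subset is scored by its corresponding model, and the stacked prediction results train the second target machine learning model under a noise-perturbed logistic-regression objective.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_second_target_model(target_subsets, migration_items, labels,
                              eps_t, lam=1.0, lr=0.1, steps=500):
    # First target models: their parameters are the migration items (FIG. 3).
    # p_i = prediction result set of the i-th model on target data subset D_t^(i).
    P = np.column_stack([sigmoid(X_i @ theta_i)
                         for X_i, theta_i in zip(target_subsets, migration_items)])
    n, d = P.shape
    # Noise term for target data privacy (illustrative Laplace draw, scale ~ 1/eps_t).
    b = np.random.laplace(scale=2.0 / eps_t, size=d)
    w = np.zeros(d)
    for _ in range(steps):
        # Gradient of: logistic loss + (lam/2)*||w||^2 + (1/n) * b.T @ w
        grad = P.T @ (sigmoid(P @ w) - labels) / n + lam * w + b / n
        w -= lr * grad
    return w  # parameters of the second target machine learning model
```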
  • FIG. 4 illustrates a method for performing machine learning in a data privacy protection mode according to a second exemplary embodiment of the present disclosure.
  • the first target machine learning model obtaining device 130 may divide the target data set into a plurality of first target data subsets in the same manner as the source data set according to the data attribute field.
• the data records in each first target data subset and the corresponding source data subset include the same data attribute fields.
• for example, referring to FIG. 4, the target data set D_t may be divided into four first target data subsets D_t^(1), D_t^(2), D_t^(3), and D_t^(4) in the same manner as the source data set D_s is divided, where each first target data subset D_t^(i) corresponds to the source data subset D_s^(i).
• the first target machine learning model obtaining device 130 may, in the target data privacy protection mode, based on each first target data subset, combine the migration item related to the source data subset corresponding to that first target data subset to train, for the second prediction target, a first target machine learning model corresponding to that migration item. For example, referring to FIG. 4, based on each first target data subset D_t^(i), the related migration item θ_s^(i) is combined to train a first target machine learning model corresponding to that migration item for the second prediction target; as shown in FIG. 4, the parameters of the trained first target machine learning models are θ_t^(1), θ_t^(2), θ_t^(3), and θ_t^(4).
• in this case, the second target machine learning model obtaining device 140 may set a rule of the second target machine learning model for obtaining the prediction result of the second target machine learning model for each prediction data record, as follows: obtain a prediction data record, and divide the prediction data record into multiple sub-prediction data according to the data attribute fields in the same manner as the source data set is divided; for each sub-prediction data in each prediction data record, use the first target machine learning model corresponding to it to perform prediction to obtain a prediction result for each sub-prediction data.
• the prediction data record may be a data record that needs to be predicted during real-time prediction or batch prediction. Referring to FIG. 4, the obtained prediction data record D_p is divided into four sub-prediction data D_p^(1), D_p^(2), D_p^(3), and D_p^(4) in the same manner as the source data set is divided, and the parameters of the corresponding first target machine learning models are θ_t^(1), θ_t^(2), θ_t^(3), and θ_t^(4), respectively.
• the second target machine learning model obtaining device 140 may perform prediction for each sub-prediction data using the first target machine learning model corresponding to it, so as to obtain a prediction result for each sub-prediction data. For example, for the sub-prediction data D_p^(1), the first target machine learning model whose parameters are θ_t^(1) may be used to perform prediction to obtain the prediction result p_1, and similarly for p_2, p_3, and p_4.
• the second target machine learning model obtaining device 140 may set the rule of the second target machine learning model to: obtain, based on the obtained multiple prediction results corresponding to each prediction data record, the prediction result of the second target machine learning model for each prediction data record. For example, by averaging the above four prediction results corresponding to each prediction data record, the prediction result of the second target machine learning model for each prediction data record can be obtained; however, the method of obtaining the prediction result for each prediction data record is not limited to this.
• for example, the prediction result of the second target machine learning model for each prediction data record can also be obtained by voting.
• in the above manner, all of the plurality of target data records in the target data set are utilized in the target data privacy protection mode; a sketch of such combination rules follows below.
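• A minimal sketch of such set rules (hypothetical helper names; the inputs are the per-part results produced by the first target machine learning models):

```python
from collections import Counter
from statistics import mean

def combine_by_average(partial_scores):
    # Rule 1: average the confidence values produced by the first target models.
    return mean(partial_scores)

def combine_by_vote(partial_labels):
    # Rule 2: majority vote over the labels predicted by the first target models.
    return Counter(partial_labels).most_common(1)[0][0]
```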
  • FIG. 5 is a schematic diagram illustrating a method of performing machine learning in a data privacy protection mode according to a third exemplary embodiment of the present disclosure.
• in step S240, the second target machine learning model obtaining device 140 may directly perform prediction on each first target data subset divided in step S230 using the corresponding first target machine learning model to obtain a prediction result for each data record in each first target data subset, and, in the target data privacy protection mode, train a second target machine learning model for the third prediction target based on a set of training samples composed of the multiple prediction results obtained corresponding to each target data record.
• referring to FIG. 5, for the first target data subset D_t^(1), the first target machine learning model whose parameters are θ_t^(1) is used to perform prediction to obtain a prediction result set p_1, where p_1 includes the prediction results of that model for each data record in D_t^(1). Similarly, for the first target data subsets D_t^(2), D_t^(3), and D_t^(4), the first target machine learning models whose parameters are θ_t^(2), θ_t^(3), and θ_t^(4) are used to perform prediction to obtain the prediction result sets p_2, p_3, and p_4, respectively.
• each target data record in the target data set D_t has a corresponding prediction result in each of the prediction result sets p_1, p_2, p_3, and p_4; these four prediction results can constitute a training sample corresponding to that target data record, and such a set of training samples can be used to train a second target machine learning model for the third prediction target under the protection of target data privacy.
• in the above manner, all of the multiple target data records in the target data set obtained in step S210 are utilized in the target data privacy protection mode.
  • FIG. 6 is a schematic diagram illustrating a method of performing machine learning in a data privacy protection mode according to a fourth exemplary embodiment of the present disclosure.
• in this exemplary embodiment, it is not the entire target data set that is divided into a plurality of first target data subsets according to the data attribute fields in the same manner as the source data set; instead, a first target data set (for example, D_t1 in FIG. 6) is divided, according to the data attribute fields and in the same manner as the source data set, into a plurality of first target data subsets (for example, D_t1^(1), D_t1^(2), D_t1^(3), and D_t1^(4)). Here, the first target data set may include a part of the target data records included in the target data set, and the data records in each first target data subset and the corresponding source data subset include the same data attribute fields.
• next, the first target machine learning model obtaining device 130 may, in the target data privacy protection mode, based on each first target data subset, combine the migration item related to the source data subset corresponding to that first target data subset to train, for the second prediction target, a first target machine learning model corresponding to that migration item.
• subsequently, when the second target machine learning model obtaining device 140 obtains a second target machine learning model using the plurality of first target machine learning models, it does not use a target data set exactly the same as that used in step S230; instead, a second target data set different from the first target data set is used.
• specifically, the second target machine learning model obtaining device 140 may divide the second target data set (for example, D_t2 in FIG. 6) into multiple second target data subsets (for example, D_t2^(1), D_t2^(2), D_t2^(3), and D_t2^(4)) according to the data attribute fields in the same manner as the source data set is divided.
  • the second target data set is different from the first target data set and includes at least the remaining target data records after excluding the first target data set in the target data set.
• then, the second target machine learning model obtaining device 140 performs prediction on each second target data subset using the first target machine learning model corresponding to it to obtain a prediction result for each data record in each second target data subset, and, in the target data privacy protection mode, trains a second target machine learning model for the third prediction target based on a set of training samples composed of the multiple prediction results obtained corresponding to each target data record.
• in the above manner, part of the multiple target data records in the target data set obtained in step S210 is utilized in the target data privacy protection mode.
• to sum up, in the process of obtaining the multiple first target machine learning models and/or the process of obtaining the second target machine learning model, all or part of the multiple target data records are utilized in the target data privacy protection mode.
• in the process described above, the objective function for training the first target machine learning model and/or the objective function for training the second target machine learning model may be constructed to include at least a loss function and a noise term. In this case, the privacy budget of the target data privacy protection method may be determined by the sum of the privacy budget corresponding to the noise term included in the objective function used to train the first target machine learning model and the privacy budget corresponding to the noise term included in the objective function used to train the second target machine learning model.
• alternatively, the privacy budget of the target data privacy protection method may depend on the larger of the privacy budget corresponding to the noise term included in the objective function used to train the first target machine learning model and the privacy budget corresponding to the noise term included in the objective function used to train the second target machine learning model.
• for example, the privacy budget of the target data privacy protection method in the exemplary embodiment of FIG. 5 described above depends on the sum of the two, while the privacy budget of the target data privacy protection method in the exemplary embodiment of FIG. 6 depends on the larger of the two.
  • the source machine learning model and the first target machine learning model may belong to the same type of machine learning model, and / or the first prediction target and the second prediction target are the same or similar.
  • the same type of machine learning model is a logistic regression model.
• specifically, the first target machine learning model may be trained by: constructing the objective function for training the first target machine learning model to include at least a loss function and a noise term, and to reflect the difference between the parameters of the first target machine learning model and the migration term corresponding to the first target machine learning model; and, in the target data privacy protection mode, based on each first target data subset, combining the migration term related to the source data subset corresponding to each first target data subset to solve the constructed objective function, so as to train a first target machine learning model corresponding to that migration term for the second prediction target. A sketch of such an objective function follows below.
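• A sketch of such an objective function (the notation is assumed for illustration): for a first target data subset D_t^(i) with n_i records (x_j, y_j) and the related migration item θ_s^(i),

\[
J\bigl(\theta_t^{(i)}\bigr) \;=\; \frac{1}{n_i}\sum_{j=1}^{n_i} \ell\bigl(\theta_t^{(i)\top} x_j,\; y_j\bigr) \;+\; \frac{\eta}{2}\,\bigl\lVert \theta_t^{(i)} - \theta_s^{(i)} \bigr\rVert^2 \;+\; \frac{1}{n_i}\, b^{\top}\theta_t^{(i)},
\]

where the middle term reflects the difference between the parameters of the first target machine learning model and the corresponding migration term, and b is the noise term added for target data privacy.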
  • the first target machine learning model and the second target machine learning model may belong to the same type of machine learning model, and / or the second prediction target and the third prediction target may be the same or similar.
  • a second target machine learning model may be used to perform business decisions.
  • the business decision may involve at least one of transaction anti-fraud, account opening anti-fraud, intelligent marketing, intelligent recommendation, and loan evaluation, but is not limited thereto.
• as can be seen from the above description, the method for performing machine learning in a data privacy protection mode according to the exemplary embodiments of the present disclosure can not only ensure that the privacy of the source data and the privacy of the target data are not leaked, but can also transfer the knowledge of the source data set to the target data set through multiple migration items. Moreover, because each migration item is only used to transfer the knowledge of a corresponding part of the source data set to the target data set, the noise added in the source data privacy protection mode during the process of obtaining the first target machine learning models is relatively small, which can both ensure the availability of the migration items and effectively transfer the knowledge to the target data set. Likewise, the noise added to achieve the protection of target data privacy will also be relatively small, so that target data privacy is achieved while a target machine learning model with a better model effect is obtained.
• in addition, steps S210 and S220 described above can be performed in the reverse order or in parallel; that is, the multiple migration items about the source data set can be obtained before the target data set is obtained, or the target data set and the migration items can be obtained at the same time.
• moreover, while step S230 is performed, step S210 or step S220 may also be performed; that is, in the process of obtaining the first target machine learning models, a new target data set or new migration items may be obtained at the same time for use in, for example, update operations of subsequent target machine learning models.
• although the machine learning method according to the present disclosure has been described above with reference only to FIGS. 3 to 6, the machine learning method according to the present disclosure is not limited to the above exemplary embodiments; more exemplary embodiments may be obtained through appropriate variants.
• according to an exemplary embodiment of the present disclosure, a method for performing prediction using a machine learning model with data privacy protection may be provided (for convenience of description, this method is referred to as a "prediction method").
  • the prediction method may be executed by the “prediction system” described above, may also be implemented entirely in software by a computer program or instructions, and may also be executed by a specially configured computing system or computing device.
  • the "prediction method” is performed by the above-mentioned "prediction system”
  • the prediction system includes a target machine learning model acquisition device, a prediction data record acquisition device, a division device, and a prediction device.
• the target machine learning model obtaining device may, after step S240 described above, obtain the plurality of first target machine learning models and the second target machine learning model that have been obtained through steps S210 to S240.
• alternatively, the target machine learning model acquisition device may itself obtain the multiple first target machine learning models and the second target machine learning model by performing steps S210 to S240; the specific way of obtaining the first target machine learning models and the second target machine learning model has been described above with reference to FIGS. 2 to 6 and will not be repeated here. That is, the "prediction method" here can be either a continuation of the above "machine learning method" or a completely independent prediction method.
  • the prediction data record acquisition device may obtain a prediction data record.
  • the predicted data record may include the same data attribute fields as the previously described source data record and target data record.
• the prediction data record acquisition device can obtain the prediction data records one by one in real time, or can obtain the prediction data records in batches offline.
  • the dividing device may divide the prediction data record into a plurality of sub-prediction data.
• the dividing device may divide the prediction data record into a plurality of sub-prediction data according to the data attribute fields in the same manner as the previously described division of the source data set, and each sub-prediction data may include at least one data attribute field.
• next, the prediction device may, for each sub-prediction data in each prediction data record, perform prediction using the first target machine learning model corresponding to it to obtain a prediction result for each sub-prediction data.
  • the prediction device may input a plurality of prediction results corresponding to each prediction data record obtained by the plurality of first target machine learning models into a second target machine learning model to obtain a prediction result for each prediction data record.
• in the above manner, prediction is performed using the multiple first target machine learning models after dividing the prediction data records to obtain the multiple prediction results corresponding to each prediction data record, and the second target machine learning model is further used based on the obtained multiple prediction results to obtain the final prediction result, which can improve the prediction effect of the model.
  • FIG. 7 is a schematic diagram illustrating a concept of performing machine learning in a data privacy protection manner according to an exemplary embodiment of the present disclosure.
• machine learning plays an indispensable role in many stages of the financial ecosystem.
  • banks can use machine learning to decide whether to approve loan applications from loan applicants.
• a bank's own records of historical financial activities about a loan applicant may not be sufficient to fully reflect the true credit or loan repayment ability of the loan applicant, in which case the bank may wish to obtain records of the loan applicant's historical financial activities at other institutions.
• however, due to customer privacy concerns, it is difficult for the bank to use records of loan applicants' historical financial activities owned by other institutions.
• according to the exemplary embodiments of the present disclosure, the data of multiple institutions can be fully utilized, while protecting the privacy of user data, to help banks more accurately determine whether to approve a loan applicant's loan application, thereby reducing financial risks.
  • a target data source 710 may send a target data set including a plurality of target data records related to a user's historical financial activities to a machine learning system 730.
  • each target data record may include a plurality of data attribute fields such as the user's name, nationality, occupation, salary, property, credit record, and historical loan amount, but is not limited thereto.
  • each target data record may include, for example, flag information about whether the user paid off the loan on time.
  • the machine learning system 730 may be the machine learning system 100 described above with reference to FIG. 1.
  • the machine learning system 730 may be provided by an entity (eg, a machine learning service provider) that specializes in providing machine learning services, or may be constructed by the target data source 710 itself.
  • the machine learning system 730 can be installed in the cloud (such as a public cloud, a private cloud, or a hybrid cloud) or a local system of a banking institution.
  • the machine learning system 730 is set in a public cloud and is constructed by a machine learning service provider.
  • the first banking institution may, for example, reach an agreement with the source data source 720 (eg, the second institution) to share data with each other while protecting user data privacy.
  • the source data source 720 may send the source data set that it owns including multiple source data records to the machine learning system 730.
• the source data set may be, for example, a data set related to users' financial activities similar to the target data set described above, and the source data records and the target data records may include the same data attribute fields; for example, the source data records may also include multiple data attribute fields such as the user's name, nationality, occupation, salary, property, credit history, and historical loan amount.
  • the machine learning system 730 may divide the source data set into multiple source data subsets according to the data attribute field as described above with reference to FIGS. 1 to 6, and in the privacy protection mode of the source data, based on each The source data subset trains a corresponding source machine learning model for the first prediction target, and uses the parameters of each source machine learning model trained as a migration term related to each source data subset.
• the source machine learning model may be, for example, a machine learning model for predicting a user's loan risk index or loan repayment ability, a machine learning model with a similar prediction target, or a machine learning model for another prediction target related to the loan evaluation business.
  • the machine learning system 730 may also obtain the migration item directly from the source data source 720.
• for example, the source data source 720 may, in advance, through its own machine learning system or by entrusting another machine learning service provider, in the source data privacy protection mode, perform machine learning related processing based on each source data subset obtained by dividing the source data set according to the data attribute fields, so as to obtain the migration item related to each source data subset, and send the multiple migration items to the machine learning system 730.
• alternatively, the source data source 720 may also choose to send the source data set or the multiple migration items to the target data source 710, and then the target data source 710 provides the source data set or the multiple migration items together with the target data set to the machine learning system 730 for subsequent machine learning.
  • the machine learning system 730 may use each migration item of the plurality of migration items to obtain a first target machine learning model corresponding to each migration item to obtain a plurality of first target machine learning models.
  • the first target machine learning model may also be a machine learning model used to predict a user's loan risk index or loan solvency.
  • the machine learning system 730 may further use the plurality of first target machine learning models to obtain a second target machine learning model.
  • the second target machine learning model may belong to the same type of machine learning model as the first target machine learning model.
  • the second target machine learning model may be a machine learning model used to predict a user's loan risk index or loan solvency, or may be a machine learning model used to predict whether a user's loan behavior is suspected of fraud.
• according to the concept of the present disclosure as described above with reference to FIGS. 1 to 6, in the process of obtaining the multiple first target machine learning models and/or the process of obtaining the second target machine learning model, all or part of the multiple target data records in the target data set are used in the target data privacy protection mode.
• after the second target machine learning model is obtained, the target data source 710 may send a prediction data set including at least one prediction data record of at least one loan applicant to the machine learning system 730.
• the prediction data record may include the same data attribute fields as the source data records and target data records mentioned above; for example, it may also include multiple data attribute fields such as the user's name, nationality, occupation, salary, property, credit history, and historical loan amount.
• the machine learning system 730 may divide the prediction data set into multiple prediction data subsets according to the data attribute fields in the same manner as the source data set is divided, and, for each prediction data subset, use the first target machine learning model corresponding to it to perform prediction to obtain a prediction result for each data record in each prediction data subset.
  • the machine learning system 730 may obtain a prediction result of the second target machine learning model for each prediction data record based on the obtained multiple prediction results corresponding to each prediction data record.
• alternatively, the machine learning system 730 may use the second target machine learning model trained in the target data privacy protection mode to perform prediction on a prediction sample composed of the multiple prediction results obtained corresponding to each prediction data record. The prediction result here may be, for example, each loan applicant's loan risk index or loan repayment ability score, or whether each loan applicant's loan behavior is suspected of fraud.
• subsequently, the machine learning system 730 may feed the prediction result back to the target data source 710.
  • the target data source 710 may determine whether to approve the loan application made by the loan applicant based on the received prediction result.
• although the concept of the present disclosure has been introduced above by taking the application of machine learning in the field of finance as an example, it is clear to those skilled in the art that the methods and systems for performing machine learning under data privacy protection according to the exemplary embodiments of the present disclosure are neither limited to applications in the financial sector nor limited to business decisions such as loan evaluation; instead, they can be applied to any domain and any business decision involving data security and machine learning.
• for example, the method and system for performing machine learning under data privacy protection according to exemplary embodiments of the present disclosure can also be applied to transaction anti-fraud, account opening anti-fraud, smart marketing, smart recommendation, prediction of physiological data in the field of public health, and the like.
• the machine learning method and the machine learning system according to the exemplary embodiments of the present disclosure have been described above with reference to FIGS. 1 to 7. It should be understood, however, that the devices and systems shown in the figures may be individually configured as software, hardware, firmware, or any combination thereof, to perform particular functions. For example, these systems and devices may correspond to dedicated integrated circuits, to pure software code, or to modules combining software and hardware. In addition, one or more functions implemented by these systems or devices may be uniformly performed by components in a physical entity device (for example, a processor, a client, or a server).
• according to an exemplary embodiment of the present disclosure, a computer-readable storage medium storing instructions may be provided, wherein the instructions, when run by at least one computing device, cause the at least one computing device to perform the following steps: obtaining a target data set including multiple target data records; obtaining multiple migration items about a source data set, wherein each migration item among the multiple migration items is used to migrate the knowledge of a corresponding part of the source data set to the target data set under the protection of source data privacy; using each migration item among the plurality of migration items to obtain a first target machine learning model corresponding to each migration item, so as to obtain a plurality of first target machine learning models; and using the plurality of first target machine learning models to obtain a second target machine learning model, wherein, in the process of obtaining the plurality of first target machine learning models and/or the process of obtaining the second target machine learning model, all or part of the plurality of target data records are utilized in the target data privacy protection mode.
• the instructions stored in the computer-readable storage medium can be run in an environment deployed in a computer device such as a client, a host, a proxy device, or a server. It should be noted that the instructions can also be used to perform additional steps beyond the above steps, or to perform more specific processing when performing the above steps; the content of these additional steps and further processing has been mentioned in the description of the machine learning method with reference to FIGS. 2 to 6, and will not be repeated here in order to avoid repetition.
• furthermore, the machine learning system according to the exemplary embodiments of the present disclosure may rely entirely on the running of computer programs or instructions to implement the corresponding functions; that is, each device corresponds to a step in the functional architecture of the computer program, so that the entire system is called through a special software package (for example, a lib library) to implement the corresponding functions.
• on the other hand, program code or code segments for performing the corresponding operations may be stored in a computer-readable medium such as a storage medium, so that at least one processor or at least one computing device can perform the corresponding operations by reading and running the corresponding program code or code segments.
• according to an exemplary embodiment of the present disclosure, a system including at least one computing device and at least one storage device storing instructions may be provided, wherein the instructions, when executed by the at least one computing device, cause the at least one computing device to perform the following steps: obtaining a target data set including a plurality of target data records; obtaining a plurality of migration items about a source data set, wherein each of the plurality of migration items is used to transfer the knowledge of a corresponding part of the source data set to the target data set under source data privacy protection; using each migration item of the plurality of migration items to obtain a first target machine learning model corresponding to each migration item, so as to obtain a plurality of first target machine learning models; and obtaining a second target machine learning model using the plurality of first target machine learning models, wherein, in the process of obtaining the plurality of first target machine learning models and/or the process of obtaining the second target machine learning model, all or part of the target data records are utilized in the target data privacy protection mode.
  • the above system may be deployed in a server or a client, or may be deployed on a node in a distributed network environment.
  • the system may be a PC computer, a tablet device, a personal digital assistant, a smart phone, a web application, or other device capable of executing the above instruction set.
  • the system may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.).
  • all components of the system may be connected to each other via a bus and / or a network.
  • the system does not have to be a single system, but may also be an assembly of any device or circuit capable of executing the above-mentioned instructions (or instruction sets) individually or jointly.
  • the system may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that is interconnected with a local or remote (e.g., via wireless transmission) interface.
  • the at least one computing device may include a central processing unit (CPU), a graphics processor (GPU), a programmable logic device, a special-purpose processor system, a microcontroller, or a microprocessor.
  • the at least one computing device may further include an analog processor, a digital processor, a microprocessor, a multi-core processor, a processor array, a network processor, and the like.
  • the computing device may execute instructions or code stored in one of the storage devices, wherein the storage device may also store data. Instructions and data can also be sent and received over a network via a network interface device, which can employ any known transmission protocol.
  • the storage device may be integrated with the computing device, for example, the RAM or the flash memory is arranged in an integrated circuit microprocessor or the like.
  • the storage device may include a stand-alone device, such as an external disk drive, a storage array, or other storage device usable by any database system.
  • the storage device and the computing device may be operatively coupled, or may communicate with each other, for example, through an I / O port, a network connection, or the like, so that the computing device can read instructions stored in the storage device.
  • FIG. 8 is a block diagram illustrating a system (hereinafter, simply referred to as a “machine learning system”) for performing machine learning under data privacy protection according to an exemplary embodiment of the present disclosure.
  • the machine learning system 800 may include a target data set acquisition device 810, a migration item acquisition device 820, and a target machine learning model training device 830.
  • the target data set acquisition device 810 can acquire a target data set.
  • the target data set may be any data set that can be used for training of the target machine learning model, and may include multiple target data records and / or the results of the target data records after various data processing or feature processing.
  • the target data set may further include a label of the target data record with respect to the machine learning target.
• the target data record may include at least one attribute field (e.g., user ID, age, gender, historical credit record, etc.) reflecting various attributes of an object or event, and the target data record's mark regarding the machine learning target may be, for example, whether the user has the ability to repay a loan, whether the user accepts recommended content, etc., but is not limited thereto; a hypothetical example follows below.
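• For illustration only (the field names below are assumptions, not the disclosure's), a target data record with its mark might look like:

```python
# A target data record: attribute fields plus a mark for the machine learning target.
target_record = {
    "user_id": "u_0001",
    "age": 35,
    "gender": "F",
    "historical_credit": 0.82,   # attribute fields reflecting the object/event
    "repaid_on_time": 1,         # mark: whether the user repaid the loan on time
}
```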
• the target data set may involve various personal privacy information that the user does not expect to be known to others (for example, the user's name, ID number, mobile phone number, total property, loan records, etc.), and may also include group-related information that does not involve personal privacy.
• the target data records can come from different data sources (for example, network operators, banking institutions, medical institutions, etc.), and the target data set can be used by specific institutions or organizations with user authorization; however, users often expect that their personal privacy information will not further become known to other organizations or individuals.
  • "privacy” may refer generally to any attribute involving a single individual.
  • the target data set acquisition device 810 may acquire the target data set from the target data source at one time or in batches, and may acquire the target data set manually, automatically, or semi-automatically.
  • the target data set acquisition device 810 can acquire the target data records and/or the labels of the target data records in real time or offline; it may acquire a target data record and its label simultaneously, or the time at which the label of a target data record is obtained may lag behind the time at which the target data record itself is obtained.
  • the target data set obtaining device 810 may obtain the target data set from the target data source in an encrypted form or directly utilize the target data set that has been stored locally.
  • the machine learning system 800 may further include a device for decrypting the target data, and may further include a data processing device to process the target data into a form suitable for current machine learning. It should be noted that the present disclosure places no restrictions on the types, forms, or contents of the target data records and their labels in the target data set, nor on the manner of acquiring them; any data that can be used for machine learning, obtained by any means, can serve as the target data set mentioned above.
  • the premise of the migration is to ensure that the privacy information involved in the data sets of other data sources (in this disclosure, referred to as "source data sets") is not leaked; that is, the privacy of the source data must be protected.
  • the migration item obtaining device 820 may obtain a migration item regarding the source data set.
  • the migration term can be used to transfer the knowledge of the source data set to the target data set under the privacy protection mode of the source data to train the target machine learning model on the target data set.
  • the migration item may be any information related to the knowledge contained in the source data set obtained when the source data is privacy-protected (that is, in the privacy protection mode of the source data).
  • the content and form of the migration item are not limited, as long as it can transfer the knowledge of the source data set to the target data set under the privacy protection method of the source data.
  • the migration item may involve samples of the source data set, characteristics of the source data set, a model obtained based on the source data set, the objective function used for model training, statistical information about the source data, and the like.
  • the migration item acquisition device 820 may receive a migration item regarding the source data set from the outside.
  • the migration item acquisition device 820 may obtain the migration item from an entity that owns the source data set, or an entity authorized to perform related processing on the source data source (for example, a service provider that provides machine learning related services).
  • the migration item may be obtained by an entity that owns the source data set, or by an entity authorized to perform related processing on the source data source, based on machine learning related processing performed on the source data set, and these entities send the obtained migration items to the migration item acquisition device 820.
  • the prediction target of the machine learning related processing performed based on the source data set and the prediction target of the target machine learning model on the target data set may be the same target (for example, both predict whether a transaction is a fraudulent transaction) or related targets (for example, classification problems with a certain degree of similarity, such as predicting whether a transaction is fraudulent and predicting whether a transaction is suspected of being illegal).
  • the migration item obtaining device 820 may also obtain the migration item about the source dataset by performing machine learning related processing on the source dataset.
  • the acquisition and use of the source data set by the migration item acquisition device 820 may be authorized or protected, so that it can perform corresponding processing on the acquired source data set.
  • the migration item acquisition device 820 may first acquire a source data set.
  • the source data set may be any data set related to the target data set.
  • the above description about the composition of the target data set, the method of obtaining the target data set, and the like are applicable to the source data set, and are not repeated here.
  • although the source data set is described as being acquired by the migration item acquisition device 820, it should be noted that the target data set acquisition device 810 may also perform the operation of acquiring the source data set, or the two devices may acquire the source data set jointly; the disclosure is not limited in this regard.
  • the acquired target data set, source data set, and migration items can be stored in a storage device (not shown) of the machine learning system.
  • the stored target data, source data, or migration items can be isolated physically or in terms of access permissions, to ensure the safe use of the data.
  • the machine learning system 800 cannot directly use the obtained source data set together with the target data set to perform machine learning; it may use the source data for machine learning only under the guarantee that privacy protection is applied to it.
  • the migration item acquisition device 820 may perform processing related to machine learning based on the source data set in the privacy protection mode of the source data, and acquire the migration items about the source data set in the course of performing that processing.
  • the source data privacy protection method may be a protection method that follows the definition of differential privacy, but is not limited to this; it may be any privacy protection method, existing now or appearing in the future, that can protect the source data.
  • for ease of understanding, the protection method that follows the definition of differential privacy is briefly described. Suppose M is a random mechanism (for example, the training process of a machine learning model), and consider any two input data sets $\mathcal{D}$ and $\mathcal{D}'$ that differ by only one sample. If the probabilities that the output of M equals any value t satisfy the following Equation 1 (where $\varepsilon$ is the privacy protection degree constant, or privacy budget), then M can be considered to satisfy $\varepsilon$-differential privacy protection for any input:

$$\Pr[M(\mathcal{D}) = t] \;\le\; e^{\varepsilon}\,\Pr[M(\mathcal{D}') = t] \qquad \text{(Equation 1)}$$
  • in Equation 1, the smaller $\varepsilon$ is, the stronger the privacy protection, and the larger $\varepsilon$ is, the weaker it is. The specific value of $\varepsilon$ can be set according to the user's requirements for the degree of data privacy protection.
  • the source data protection manner may be adding random noise in a process of performing processing related to machine learning based on the source data set.
  • random noise can be added so that the above-mentioned differential privacy protection definition is followed.
  • the definition of privacy protection is not limited to the definition of differential privacy protection; it can be another definition of privacy protection such as k-anonymity, l-diversity, or t-closeness.
  • the migration item can be any information related to the knowledge contained in the source data set obtained in the privacy protection mode of the source data.
  • the migration term may involve a model parameter, an objective function, and / or statistical information about the source data obtained during the process of performing machine learning-related processing based on the source data set, but is not limited to this.
  • the operation of performing machine learning related processing based on the source data set may include training the source machine learning model based on the source data set in the privacy protection manner of the source data, but is not limited to this; it may also include, for example, performing machine learning related processing such as feature processing or statistical data analysis on the source data set.
  • the model parameters, objective function, and/or statistical information about the source data may be the above-mentioned information itself, obtained directly in the process of performing machine learning related processing based on the source data, or may be information obtained after further transformation or processing of that information; this disclosure places no limit on this.
  • the migration term involving model parameters may be a parameter of the source machine learning model, for example, a model parameter of the source machine learning model obtained in a process of training the source machine learning model in a source data protection mode satisfying a definition of differential privacy protection, In addition, it may be, for example, statistical information of parameters of the source machine learning model, but is not limited thereto.
  • the objective function involved in the migration term may refer to an objective function constructed in order to train the source machine learning model; in the case that the parameters of the source machine learning model itself are not migrated, the objective function may not actually be solved separately, but this disclosure is not limited to this.
  • the migration item involving statistical information about the source data may be data distribution information and/or data distribution change information about the source data obtained under a source data privacy protection mode (for example, a protection mode that satisfies the differential privacy protection definition), but is not limited to this.
  • the migration item acquisition device 820 may train the source machine learning model based on the source data set in a source data privacy protection mode.
  • the source machine learning model may be, for example, a generalized linear model, for example, a logistic regression model, but is not limited thereto.
  • the migration term acquisition device 820 may construct an objective function for training the source machine learning model to include at least a loss function and a noise term.
  • the noise term can be used to add random noise in the process of training the source machine learning model, so that the privacy protection of the source data can be achieved.
  • the objective function used to train the source machine learning model can be constructed to include a loss function and a noise term, and can also be constructed to include other constraint terms used to constrain the model parameters.
  • for example, it can also be constructed to include regular terms that prevent over-fitting of the model or prevent the model parameters from becoming too complex, and compensation terms for privacy protection.
  • for ease of description, assume in the following that the source data privacy protection method is a protection method that follows the definition of differential privacy, and that the source machine learning model is a generalized linear model.
  • under these assumptions, Equation 2 below can be used to train the source machine learning model based on the source data set in the privacy protection mode of the source data. Before solving the parameters of the source machine learning model with Equation 2, a noise vector b is sampled: the two-norm $\|b\|$ of b may first be sampled from a Gamma distribution, and then, based on a uniformly randomly sampled direction u, b is obtained as $b = \|b\|\,u$.
  • Equation 2 (reconstructed from the surrounding definitions; the original typeset formula was an image) is as follows:

$$w^{*} \;=\; \arg\min_{w}\;\frac{1}{n}\sum_{i=1}^{n}\ell\big(w^{\top}x_i,\,y_i\big) \;+\; \lambda\,g(w) \;+\; \frac{1}{n}\,b^{\top}w \;+\; \frac{\Delta}{2}\,\|w\|^{2} \qquad \text{(Equation 2)}$$

  • in Equation 2, w is a parameter of the source machine learning model, $\ell$ is the loss function, g(w) is the regularization function, $\frac{1}{n}b^{\top}w$ is a noise term used to add random noise in the process of training the source machine learning model so as to achieve privacy protection of the source data, $\frac{\Delta}{2}\|w\|^{2}$ is a compensation term used for privacy protection, and $\lambda$ is a constant used to control the strength of regularization; the whole expression is the objective function constructed for training the source machine learning model. According to Equation 2 above, the value of w at which the objective function is smallest is the finally solved parameter w* of the source machine learning model.
  • for the solution to satisfy differential privacy, the regularization function g(w) needs to be a 1-strongly convex function and twice differentiable, and, for all z, the loss function needs to satisfy $|\ell'(z)| \le 1$ and $|\ell''(z)| \le c$, where $\ell'$ and $\ell''$ are the first and second derivatives of the loss function, respectively. That is, as long as the model is a generalized linear model meeting the above conditions, parameters of the source machine learning model satisfying differential privacy protection can be obtained through Equation 2 above.
  • for example, when the source machine learning model is a logistic regression model, Equation 2 above can be used to solve the parameters of the source machine learning model; the parameters solved in this manner satisfy the privacy protection of the source data and carry the knowledge of the source data set.
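To make the mechanism around Equation 2 concrete, the following is a minimal Python sketch (not from the patent) of objective-perturbation training for a logistic regression source model. The function name is hypothetical, and the $\varepsilon'$, $\Delta$, and Gamma-noise constants follow the published differentially private ERM mechanism that Equation 2 mirrors; where the patent's typeset formulas are unavailable, these constants are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def train_dp_logreg(X, y, eps, lam, rng=None):
    """Objective-perturbation sketch for Equation 2 (hypothetical helper).

    X: (n, d) samples, scaled so that ||x_i|| <= 1; y: labels in {-1, +1};
    eps: privacy budget; lam: regularization strength. The eps', Delta, and
    Gamma-noise constants are assumptions taken from the standard DP-ERM
    mechanism, since the patent's own formulas were images."""
    rng = rng or np.random.default_rng(0)
    n, d = X.shape
    c = 0.25  # bound on |loss''| for the logistic loss
    eps_prime = eps - np.log(1 + 2 * c / (n * lam) + c**2 / (n * lam) ** 2)
    if eps_prime > 0:
        delta = 0.0
    else:
        delta = c / (n * (np.exp(eps / 4) - 1)) - lam
        eps_prime = eps / 2
    # Noise vector b: two-norm sampled from a Gamma distribution,
    # direction u sampled uniformly at random, b = ||b|| * u.
    norm_b = rng.gamma(shape=d, scale=2.0 / eps_prime)
    u = rng.normal(size=d)
    b = norm_b * u / np.linalg.norm(u)

    def objective(w):
        loss = np.mean(np.logaddexp(0.0, -y * (X @ w)))  # logistic loss
        reg = lam * 0.5 * (w @ w)                        # g(w) = ||w||^2 / 2
        noise = (b @ w) / n                              # noise term
        comp = 0.5 * delta * (w @ w)                     # compensation term
        return loss + reg + noise + comp

    # w*: privacy-protected parameters, usable as a migration item
    return minimize(objective, np.zeros(d)).x
```

Because the noise enters through the objective rather than through the released parameters, the minimizer w* can be handed to the target side as a migration item without exposing individual source records.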
  • the parameters of the source machine learning model can then be used as migration terms to transfer the knowledge of the source dataset to the target dataset to train the target machine learning model on the target dataset.
  • the source data and the target data may be data from any one or more of the following entities, respectively:
  • Data from banks such as user registration information, bank transaction flow information, deposit information, financial product purchase information, bill information (image), etc .;
  • Data from insurance institutions such as information about policyholders, insurance policies, and insurance information;
  • Data from medical institutions such as medical record information, diagnosis information, treatment information, etc .;
  • Data from securities companies and other financial institutions such as user registration information, financial product transaction information, and financial product price fluctuation information;
  • Data from Internet entities, such as user registration information from e-commerce platforms or app operating entities, user network behavior (search, browse, favorite, purchase, click, payment, etc.), and related data such as network video, audio, pictures, and text;
  • Data from telecommunication operators such as mobile user communication data, fixed network or mobile network traffic related data, etc .;
  • industrial control data such as grid-related operating data, wind turbine control data, air conditioning system control data, mine group control data, and so on.
  • the source data and the target data involved in the embodiments of the present invention may be video data, image data, voice data, text data, formatted form data, and the like.
  • the target machine learning model training device 830 may train the target machine learning model based on the target data set and the migration term in the target data privacy protection mode.
  • the target machine learning model may be applied to any of the following scenarios:
  • Image processing scenarios, including optical character recognition (OCR), face recognition, object recognition, and picture classification; more specifically, for example, OCR can be used for bill (such as invoice) recognition and handwriting recognition, face recognition can be applied to security protection, object recognition can be applied to traffic sign recognition in autonomous driving scenarios, and picture classification can be applied to "photograph to purchase" and "find the same item" features on e-commerce platforms;
  • Voice recognition scenarios including products that can be used for human-computer interaction through voice, such as voice assistants for mobile phones (such as Siri for Apple phones), smart speakers, etc .;
  • Natural language processing scenarios including: reviewing text (such as contracts, legal documents, and customer service records, etc.), spam identification (such as spam text recognition), and text classification (emotions, intentions, topics, etc.);
  • Automatic control scenarios, including mine group adjustment operation prediction, wind turbine adjustment operation prediction, and air conditioning system adjustment operation prediction; specifically, for a mine group, a group of adjustment operations with a high predicted mining rate; for a wind turbine, a group of adjustment operations with high predicted power generation efficiency; and for an air conditioning system, a group of adjustment operations that meets demand while saving energy consumption;
  • Intelligent question and answer scenarios including: chatbots and intelligent customer service;
  • Fintech fields include: marketing (e.g., coupon usage prediction, advertising click behavior prediction, user portrait mining, etc.) and customer acquisition, anti-fraud, anti-money laundering, underwriting and credit scoring, and commodity price prediction;
  • Medical fields include: disease screening and prevention, personalized health management and auxiliary diagnosis;
  • Municipal fields, including: social governance and regulatory enforcement, resource environment and facility management, industrial development and economic analysis, public services and livelihood security, and smart cities (the allocation and management of various urban resources such as public transportation, ride-hailing, shared bicycles, etc.);
  • Search scenarios including: web search, image search, text search, video search, etc.
  • Scenarios for abnormal behavior detection including detection of abnormal behaviors of power consumption by customers of the State Grid, detection of malicious network traffic, and detection of abnormal behaviors in operation logs.
  • the target data privacy protection method may be the same as the source data privacy protection method.
  • it may also be a protection method that follows the definition of differential privacy, but is not limited to this.
  • the target machine learning model may be based on the same type of machine learning model as the source machine learning model.
  • the target machine learning model may also be a generalized linear model, such as a logistic regression model, but is not limited thereto.
  • it may be any linear model that meets a predetermined condition.
  • the target data privacy protection method may also be a privacy protection method different from the source data privacy protection method, and the target machine learning model may also be a different type of machine learning model from the source machine learning model; this application places no limit on either.
  • the target data privacy protection manner may be adding random noise in the process of training the target machine learning model.
  • the target machine learning model training device 830 may construct an objective function for training the target machine learning model to include at least a loss function and a noise term.
  • the target machine learning model training device 830 may be used to train the target machine learning model based on the target data set in combination with the migration term in the target data privacy protection mode.
  • the objective function for training the target machine learning model is constructed to further reflect the difference between the parameters of the target machine learning model and the migration term; the target machine learning model can then be trained by solving the constructed objective function based on the target data set.
  • in this way, the knowledge in the source data set can be transferred to the target data set, so that the training process jointly utilizes the knowledge from the source data set and the target data set, and the trained target machine learning model therefore has a better effect.
  • the objective function can also be constructed to include regular terms to prevent over-fitting of the trained machine learning model, or to include other constraint terms according to actual task requirements, for example, a compensation term for privacy protection; this application places no limit here, as long as the constructed objective function can effectively achieve privacy protection of the target data while transferring the knowledge of the source data set to the target data set.
  • for ease of description, assume in the following that the source machine learning model is a logistic regression model, the target machine learning model is a generalized linear model, and the target data privacy protection mode is a protection mode that follows the definition of differential privacy protection.
  • with the regularization function of the source machine learning model chosen as described above, the parameters of the source machine learning model can be solved using the process described earlier; the solved parameter is the w* of Equation 2 above, where A1 denotes the solution mechanism described by Equation 2, and its inputs are the source data set, the privacy protection degree constant that the source data set needs to meet, the constant used to control the strength of regularization, and the regularization function in the objective function used to train the source machine learning model. Subsequently, after obtaining the parameters of the source machine learning model, the regularization function $g_t(w)$ in the objective function of the target machine learning model can be constructed to penalize the difference between the target model parameters and the migrated source parameters (the original typeset formula was an image; by the description below, it involves, for example, a term proportional to $\|w - w^{*}\|^{2}$).
  • the target machine learning model can then be trained by replacing g(w) in Equation 2 with $g_t(w)$ and following the process of training the source machine learning model described above, in a way that satisfies the definition of differential privacy protection, so as to solve the parameters of the target machine learning model at which the objective function used for its training takes its minimum; here the inputs are the target data set, the privacy protection constant that the target data set needs to meet, the constant controlling the regularization strength in the objective function used to train the target machine learning model, and the regularization function.
  • since the objective function used for training the target machine learning model (Equation 3) is constructed to reflect the difference between the parameters of the target machine learning model and the migration term (that is, the parameters of the source machine learning model), the transfer of knowledge from the source data set to the target data set is effectively implemented.
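Continuing the earlier sketch, the transfer step can be illustrated by swapping the regularizer. Since the patent's own formula image for $g_t(w)$ is not recoverable, the sketch assumes the 1-strongly convex choice $g_t(w) = \|w - w_{src}\|^2/2$, which penalizes deviation from the migrated source parameters while keeping the differential-privacy conditions stated earlier.

```python
import numpy as np
from scipy.optimize import minimize

def train_dp_target(X, y, w_src, eps, lam, rng=None):
    """Sketch: same objective-perturbation recipe as the source model, with
    g(w) replaced by g_t(w) = ||w - w_src||^2 / 2, so the objective reflects
    the difference between the target parameters and the migration item
    w_src (an assumed form of the transfer regularizer)."""
    rng = rng or np.random.default_rng(1)
    n, d = X.shape
    c = 0.25  # logistic loss curvature bound, as in the source sketch
    eps_prime = eps - np.log(1 + 2 * c / (n * lam) + c**2 / (n * lam) ** 2)
    delta = 0.0 if eps_prime > 0 else c / (n * (np.exp(eps / 4) - 1)) - lam
    eps_prime = eps_prime if eps_prime > 0 else eps / 2
    norm_b = rng.gamma(shape=d, scale=2.0 / eps_prime)
    u = rng.normal(size=d)
    b = norm_b * u / np.linalg.norm(u)

    def objective(w):
        loss = np.mean(np.logaddexp(0.0, -y * (X @ w)))
        g_t = 0.5 * (w - w_src) @ (w - w_src)   # transfer regularizer
        return loss + lam * g_t + (b @ w) / n + 0.5 * delta * (w @ w)

    return minimize(objective, w_src.astype(float)).x
```

Minimizing this objective pulls the target parameters toward the source parameters exactly where the target data is uninformative, which is how the knowledge migration described above takes effect.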
  • the source machine learning model and the target machine learning model are not limited to logistic regression models; each may be, for example, any linear model that satisfies a predetermined condition as described above, or even any other appropriate model.
  • the trained target machine learning model may be used to execute a business decision, wherein the business decision involves at least one of transaction anti-fraud, account opening anti-fraud, smart marketing, smart recommendation, and loan evaluation, but it is not limited to this.
  • the trained target machine learning model can also be used for business decisions related to physiological conditions.
  • the target machine learning model training device 830 can successfully migrate the knowledge in the source data set to the target data set while both the source data privacy and the target data privacy are protected, thereby making it possible to synthesize more knowledge to train a target machine learning model with a better model effect and apply it to the corresponding business decisions.
  • the machine learning system 800 has been described above with reference to FIG. 8. It should be noted that, although the machine learning system is described above as being divided into devices that respectively perform corresponding processes (for example, the target data set acquisition device 810, the migration item acquisition device 820, and the target machine learning model training device 830), it is clear to those skilled in the art that the processes performed by the above devices may also be performed in the machine learning system without any specific device division, or without clear demarcation between devices.
  • the machine learning system 800 described above with reference to FIG. 8 is not limited to the devices described above; other devices (for example, a prediction device, a storage device, and/or a model update device, etc.) may be added as needed, or the above devices may be combined.
  • machine learning mentioned in the present disclosure can be implemented in the form of "supervised learning", "unsupervised learning", or "semi-supervised learning"; the exemplary embodiments of the present invention do not specifically limit the specific form of machine learning.
  • FIG. 9 is a flowchart illustrating a method of performing machine learning in a data privacy protection mode (hereinafter, simply referred to as a “machine learning method”) according to an exemplary embodiment of the present disclosure.
  • the machine learning method shown in FIG. 9 may be executed by the machine learning system 800 shown in FIG. 8, may be implemented entirely in software by a computer program or instructions, or may be executed by a computing system or computing device with a specific configuration.
  • for convenience, assume that the method shown in FIG. 9 is executed by the machine learning system 800 shown in FIG. 8, and that the machine learning system 800 has the configuration shown in FIG. 8.
  • in step S910, the target data set acquisition device 810 may acquire a target data set. Everything described above regarding the acquisition of the target data set in the description of the target data set acquisition device 810 with reference to FIG. 8 applies here as well and is therefore not repeated.
  • in step S920, the migration item acquisition device 820 may acquire a migration item regarding the source data set.
  • the migration term can be used to transfer the knowledge of the source data set to the target data set under the privacy protection mode of the source data to train the target machine learning model on the target data set.
  • the migration item acquisition device 820 may receive the migration item from the outside.
  • the migration item acquisition device 820 may obtain a migration item about the source data set by itself performing machine learning related processing on the source data set.
  • the migration item acquisition device 820 may first acquire the source data set, and then, in the privacy protection mode of the source data, perform processing related to machine learning based on the source data set and obtain the migration items about the source data set in the course of that processing.
  • the source data privacy protection method may be a protection method following the definition of differential privacy protection, but is not limited thereto.
  • the source data privacy protection method may be to add random noise in the process of performing processing related to machine learning based on the source data set to achieve privacy protection of the source data.
  • performing processing related to machine learning based on the source data set may include training the source machine learning model based on the source data set in the source data privacy protection mode, but is not limited to this; for example, it may also include performing statistical analysis or feature processing on the source data set in the source data privacy protection mode.
  • an objective function for training a source machine learning model may be constructed in the source data privacy protection manner to include at least a loss function and a noise term.
  • the noise term is used to add random noise in the process of training the source machine learning model, thereby achieving privacy protection of the source data.
  • the objective function may be configured to include other constraint terms for constraining model parameters.
  • the migration term may involve a model parameter, an objective function, and / or statistical information about the source data obtained during the process of performing machine learning-related processing based on the source data set.
  • the migration term may be a parameter of the source machine learning model, that is, a parameter of the source machine learning model trained in a source data privacy protection mode.
  • the source machine learning model may be a generalized linear model (for example, a logistic regression model), but is not limited to this; for example, it may be any linear model that meets a predetermined condition, or even any suitable model that meets a certain condition.
  • the process by which the migration item acquisition device 820 trains the source machine learning model based on the source data set in the source data privacy protection mode to obtain the migration item (i.e., the parameters of the source machine learning model) has been described above and is therefore not repeated here.
  • all descriptions of the source data set, the source data privacy protection method, the migration item, and the objective function given in the description of the migration item acquisition device 820 with reference to FIG. 8 are applicable to FIG. 9 and are not repeated here; the same or similar content in the descriptions of the migration item acquisition device 820 and step S920 may be referred to interchangeably.
  • in step S930, the target machine learning model training device 830 may train the target machine learning model based on the target data set, in combination with the migration item, in the target data privacy protection mode.
  • the target data privacy protection method may also be a protection method that follows the definition of differential privacy, but is not limited to this, but may be another data privacy protection method that is the same as or different from the source data privacy protection method.
  • the target data privacy protection mode may be adding random noise in the process of training the target machine learning model to achieve privacy protection of the target data.
  • an objective function for training an objective machine learning model may be constructed to include at least a loss function and a noise term, but is not limited thereto.
  • the objective function may be constructed to include other constraint terms used to constrain the model, such as regular terms used to limit model parameter complexity or prevent the model from over-fitting, compensation terms used for privacy protection, and so on.
  • the target machine learning model may be the same type of machine learning model as the source machine learning model, for example, both may be logistic regression models, but it is not limited to this; it may be, for example, any linear model that satisfies a predetermined condition. It should be noted that the target machine learning model may also be a machine learning model of a type different from the source machine learning model.
  • the target machine learning model training device 830 may construct the objective function for training the target machine learning model to further reflect the difference between the parameters of the target machine learning model and the migration term; the target machine learning model can then be trained by solving the constructed objective function based on the target data set.
  • the specific process of training the target machine learning model by using the constructed objective function has been described above with reference to FIG. 8 in combination with the mathematical representation, and therefore, it will not be repeated here.
  • the target machine learning model trained in the above manner can be used to execute business decisions.
  • the business decision may involve at least one of transaction anti-fraud, account opening anti-fraud, smart marketing, smart recommendation, and loan evaluation, but is not limited to these.
  • the disclosure does not place any restrictions on the types of specific business decisions to which the target machine learning model can be applied, as long as it is a business suitable for making decisions using the machine learning model.
  • the method for performing machine learning in a data privacy protection mode described above can not only ensure that the privacy of the source data and the privacy of the target data are not leaked, but also transfer the knowledge of the source data to the target data set, thereby facilitating the use of data from more data sources for machine learning, so that the trained target machine learning model can achieve a better model effect.
  • although the steps in FIG. 9 have been described in order above, it is clear to those skilled in the art that the steps in the above method are not necessarily performed in that order; they may be performed in reverse order or in parallel. For example, steps S910 and S920 may be performed in reverse order or in parallel; that is, the migration items about the source data set may be obtained before the target data set is obtained, or the target data set and the migration items may be obtained at the same time.
  • while step S930 is being performed, step S910 or step S920 may also be performed; that is, in the process of training the target machine learning model using the already acquired target data set and migration items, a new target data set or new migration items may be obtained at the same time, for example, for subsequent update operations of the target machine learning model.
  • FIG. 10 is a schematic diagram illustrating a concept of performing machine learning in a data privacy protection manner according to an exemplary embodiment of the present disclosure.
  • a target data source 310 (for example, a first banking institution) may send a target data set that it owns, involving users' historical financial activities, to a machine learning system 330.
  • each target data record in the target data set may include various attribute information such as the user's name, nationality, occupation, salary, property, credit history, historical loan amount, and so on.
  • the target data record may include, for example, flag information about whether the user paid off the loan on time.
  • the machine learning system 330 may be the machine learning system 800 described above with reference to FIG. 8.
  • the machine learning system 330 may be provided by an entity (eg, a machine learning service provider) that specializes in providing machine learning services, or may be constructed by the target data source 310 itself.
  • the machine learning system 330 may be provided in the cloud (such as a public cloud, a private cloud, or a hybrid cloud) or a local system of a banking institution.
  • for convenience of description, assume that the machine learning system 330 is set in a public cloud and is constructed by a machine learning service provider.
  • the first banking institution may, for example, reach an agreement with the source data source 320 (e.g., a second institution) to share data with each other while protecting the privacy of users' data.
  • the source data source 320 may send the source data set it owns to the machine learning system 330.
  • the source data set may be, for example, a data set similar to the target data set described above that involves users' financial activities.
  • the machine learning system 330 may perform machine learning related processing based on the source data set in the source data privacy protection mode, as described above with reference to FIGS. 8 and 9, to obtain migration items.
  • the machine learning system 330 may train a source machine learning model based on the source data set, and use the parameters of the trained source machine learning model as migration terms.
  • the source machine learning model may be, for example, a machine learning model for predicting a user's loan risk index or loan solvency, a machine learning model with a similar prediction target, or a machine learning model for another prediction target related to the loan evaluation business.
  • the machine learning system 330 may also obtain the migration item directly from the source data source 320.
  • for example, the source data source 320 may obtain the migration items in advance, through its own machine learning system or by entrusting another machine learning service provider to perform machine learning related processing based on the source data set in the source data privacy protection mode, and then send the migration items to the machine learning system 330.
  • the source data source 320 may also choose to send the source data set / migration item to the target data source, and then the source data set / migration item and the target data set are provided to the machine learning system 330 by the target data source to Used for machine learning.
  • the machine learning system 330 trains the target machine learning model based on the target data set and the acquired migration terms in the target data privacy protection mode.
  • the target machine learning model may be, for example, a machine learning model for predicting a user's loan risk index or loan solvency.
  • the target data source 310 may send a to-be-predicted data set involving at least one loan applicant to the machine learning system 330.
  • the machine learning system 330 may use the trained target machine learning model to provide a loan risk index or loan solvency score for each loan applicant in the data set to be predicted, and feed the prediction results back to the target data source 310.
  • the target data source 310 may determine whether to approve the loan application made by the loan applicant based on the received prediction result.
  • in this way, banking institutions can use machine learning, while protecting user data privacy, to combine other institutions' data with their own data and obtain more accurate judgment results, thereby avoiding unnecessary financial risks.
  • although the concept of the present disclosure has been introduced by taking the application of machine learning in the field of finance as an example, it is clear to those skilled in the art that the method and system for performing machine learning under data privacy protection according to the exemplary embodiments of the present disclosure are not limited to applications in the financial field, nor limited to business decisions such as loan evaluation; they can be applied to any domain and any business decision involving data security and machine learning.
  • the method and system for performing machine learning under data privacy protection according to exemplary embodiments of the present disclosure can also be applied to transaction anti-fraud, account opening anti-fraud, intelligent marketing, intelligent recommendation, and the like.
  • the method and system for performing machine learning under data privacy protection may also be applied to the public health field, for example, for performing prediction of physiological data.
  • for example, when a single medical institution has insufficient data about a certain health indicator, the effect of its prediction model may be poor; however, many other medical institutions may have corresponding data, and if the data of those other medical institutions can be used, the prediction effect of the medical institution's prediction model for that health indicator can be improved. In this case, the concept of the present disclosure can be utilized to protect the user data privacy of each medical institution while integrating the data of the medical institutions to provide more accurate prediction results using machine learning.
  • scenarios applicable to the target model in this application include, but are not limited to, the following: image processing scenarios, speech recognition scenarios, natural language processing scenarios, automatic control scenarios, intelligent question answering scenarios, business decision scenarios, recommendation business scenarios, search scenarios, and abnormal behavior detection scenarios.
  • for more specific application scenarios under the above scenarios, see the previous description.
  • the method and system for performing machine learning under data privacy protection of this application can be applied to any of the above scenarios; when applied to different scenarios, there is no difference in the overall execution scheme, only in the data targeted in each scenario. Therefore, based on the foregoing disclosure of the scheme, those skilled in the art can apply the scheme of this application to different scenarios without any obstacle, so each scenario need not be explained one by one.
  • the machine learning method and the machine learning system according to exemplary embodiments of the present disclosure have been described above with reference to FIGS. 8 and 9, and the concept of the present disclosure has been schematically described with reference to FIG. 10.
  • the devices and systems shown in the figures may be individually configured as software, hardware, firmware, or any combination of the foregoing, that perform particular functions.
  • these systems and devices can correspond to dedicated integrated circuits, can also correspond to pure software codes, and can also correspond to modules combining software and hardware.
  • one or more functions implemented by these systems or devices may be uniformly performed by components in a physical entity device (for example, a processor, a client, or a server, etc.).
  • a computer-readable storage medium storing instructions may be provided, wherein, when the instructions are run by at least one computing device, the at least one computing device is caused to perform the following steps: acquiring a target data set; acquiring a migration item about a source data set, wherein the migration item is used to transfer the knowledge of the source data set to the target data set in a source data privacy protection mode so as to train a target machine learning model on the target data set; and, in the target data privacy protection mode, training the target machine learning model based on the target data set in combination with the migration item.
  • the instructions stored in the computer-readable storage medium can be run in an environment deployed in a computer device such as a client, a host, a proxy device, or a server. It should be noted that the instructions can also be used to perform additional steps beyond the above steps, or to perform more specific processing when performing the above steps; these additional steps and further processing have been mentioned in the description of the related method with reference to FIG. 9 and are not repeated here.
  • a machine learning system according to an exemplary embodiment of the present disclosure may rely entirely on the running of a computer program or instructions to implement the corresponding functions; that is, each device corresponds to a step in the functional architecture of the computer program, so that the entire system implements the corresponding functions through a dedicated software package (for example, a lib library).
  • program code or code segments for performing the corresponding operations may be stored in a computer-readable medium such as a storage medium, so that at least one processor or at least one computing device can perform the corresponding operations by reading and running the corresponding program code or code segments.
  • a system including at least one computing device and at least one storage device storing instructions may be provided, wherein the instructions, when executed by the at least one computing device, cause the at least one A computing device performs the following steps: obtaining a target data set; obtaining a migration item about a source data set, wherein the migration item is used to transfer knowledge of the source data set to the target data set in a source data privacy protection mode to Training a target machine learning model on the target data set; and training the target machine learning model based on the target data set in combination with the migration term under the target data privacy protection mode.

Abstract

Provided are a method and system for performing machine learning under data privacy protection. The method includes: acquiring a target data set including a plurality of target data records; acquiring a plurality of migration items about a source data set, wherein each of the plurality of migration items is used to migrate the knowledge of a corresponding part of the source data set to the target data set under source data privacy protection; using each of the plurality of migration items, respectively, to obtain a first target machine learning model corresponding to each migration item, so as to obtain a plurality of first target machine learning models; and using the plurality of first target machine learning models to obtain a second target machine learning model, wherein, in the process of obtaining the plurality of first target machine learning models and/or in the process of obtaining the second target machine learning model, all or part of the plurality of target data records are utilized in a target data privacy protection mode.

Description

Method and system for performing machine learning under data privacy protection

TECHNICAL FIELD

The present invention relates generally to data security technology in the field of artificial intelligence, and more particularly, to a method and system for performing machine learning under data privacy protection, and a method and system for performing prediction using a machine learning model with data privacy protection.

BACKGROUND

As is well known, machine learning often requires large amounts of data in order to mine valuable latent information from that data by computational means. Although massive amounts of data have been produced with the development of information technology, people in the current environment attach increasing importance to the privacy protection of data. As a result, even though a great deal of data could in theory be used for machine learning, different data sources, out of concern for the privacy of the data they own, are unwilling or unable to share their data directly with other data users who need it, so that the data actually available for machine learning may still be insufficient. This makes it impossible to effectively use machine learning to mine, on the basis of more related data, information that could create more value. In addition, even if data containing privacy information has been obtained from other data sources, or an institution itself owns data containing privacy information, a machine learning model trained on such data may still leak the privacy information of the data.

Furthermore, although there currently exist some ways to protect the privacy of data, in practice it is often difficult to take into account both data privacy protection and the subsequent usability of the privacy-protected data, which leads to poor machine learning results.

In view of this, a technique is needed that both ensures that the privacy information in data is not leaked and, while guaranteeing the subsequent usability of the privacy-protected data, effectively uses data from different data sources for machine learning.
SUMMARY

According to an exemplary embodiment of the present disclosure, a method for performing machine learning under data privacy protection is provided. The method may include: acquiring a target data set including a plurality of target data records; acquiring a plurality of migration items about a source data set, wherein each of the plurality of migration items is used to migrate the knowledge of a corresponding part of the source data set to the target data set under source data privacy protection; using each of the plurality of migration items, respectively, to obtain a first target machine learning model corresponding to each migration item, so as to obtain a plurality of first target machine learning models; and using the plurality of first target machine learning models to obtain a second target machine learning model, wherein, in the process of obtaining the plurality of first target machine learning models and/or in the process of obtaining the second target machine learning model, all or part of the plurality of target data records are utilized in a target data privacy protection mode.

According to another exemplary embodiment of the present disclosure, a method for performing prediction using a machine learning model with data privacy protection is provided. The method may include: acquiring the plurality of first target machine learning models and the second target machine learning model as described above; acquiring a prediction data record; dividing the prediction data record into a plurality of pieces of sub-prediction data; for each piece of sub-prediction data in each prediction data record, performing prediction using the first target machine learning model corresponding to it to obtain a prediction result for each piece of sub-prediction data; and inputting the plurality of prediction results, obtained by the plurality of first target machine learning models and corresponding to each prediction data record, into the second target machine learning model to obtain a prediction result for each prediction data record.

According to another exemplary embodiment of the present disclosure, a computer-readable storage medium storing instructions is provided, wherein, when the instructions are run by at least one computing device, the at least one computing device is caused to perform the method for performing machine learning under data privacy protection as described above and/or the method for performing prediction using a machine learning model with data privacy protection as described above.

According to another exemplary embodiment of the present disclosure, a system including at least one computing device and at least one storage device storing instructions is provided, wherein the instructions, when run by the at least one computing device, cause the at least one computing device to perform the method for performing machine learning under data privacy protection as described above and/or the method for performing prediction using a machine learning model with data privacy protection as described above.

According to another exemplary embodiment of the present disclosure, a system for performing machine learning under data privacy protection is provided. The system may include: a target data set acquisition device configured to acquire a target data set including a plurality of target data records; a migration item acquisition device configured to acquire a plurality of migration items about a source data set, wherein each of the plurality of migration items is used to migrate the knowledge of a corresponding part of the source data set to the target data set under source data privacy protection; a first target machine learning model obtaining device configured to use each of the plurality of migration items, respectively, to obtain a first target machine learning model corresponding to each migration item, so as to obtain a plurality of first target machine learning models; and a second target machine learning model obtaining device configured to use the plurality of first target machine learning models to obtain a second target machine learning model, wherein, in the process in which the first target machine learning model obtaining device obtains the plurality of first target machine learning models and/or the process in which the second target machine learning model obtaining device obtains the second target machine learning model, all or part of the plurality of target data records are utilized in a target data privacy protection mode.

According to another exemplary embodiment of the present disclosure, a system for performing prediction using a machine learning model with data privacy protection is provided. The system may include: a target machine learning model acquisition device configured to acquire the plurality of first target machine learning models and the second target machine learning model as described above; a prediction data record acquisition device configured to acquire a prediction data record; a dividing device configured to divide the prediction data record into a plurality of pieces of sub-prediction data; and a prediction device configured to, for each piece of sub-prediction data in each prediction data record, perform prediction using the first target machine learning model corresponding to it to obtain a prediction result for each piece of sub-prediction data, and to input the plurality of prediction results, obtained by the plurality of first target machine learning models and corresponding to each prediction data record, into the second target machine learning model to obtain a prediction result for each prediction data record.

According to an exemplary embodiment of the present disclosure, a method for performing machine learning under data privacy protection is provided. The method may include: acquiring a target data set; acquiring a migration item about a source data set, wherein the migration item is used to transfer the knowledge of the source data set to the target data set in a source data privacy protection mode so as to train a target machine learning model on the target data set; and, in the target data privacy protection mode, training the target machine learning model based on the target data set in combination with the migration item.

According to another exemplary embodiment of the present disclosure, a computer-readable storage medium storing instructions is provided, wherein, when the instructions are run by at least one computing device, the at least one computing device is caused to perform the method for performing machine learning under data privacy protection as described above.

According to another exemplary embodiment of the present disclosure, a system including at least one computing device and at least one storage device storing instructions is provided, wherein the instructions, when run by the at least one computing device, cause the at least one computing device to perform the method for performing machine learning under data privacy protection as described above.

According to another exemplary embodiment of the present disclosure, a system for performing machine learning under data privacy protection is provided. The system may include: a target data set acquisition device configured to acquire a target data set; a migration item acquisition device configured to acquire a migration item about a source data set, wherein the migration item is used to transfer the knowledge of the source data set to the target data set in a source data privacy protection mode so as to train a target machine learning model on the target data set; and a target machine learning model training device configured to, in the target data privacy protection mode, train the target machine learning model based on the target data set in combination with the migration item.

The method and system for performing machine learning under data privacy protection according to exemplary embodiments of the present disclosure can not only achieve privacy protection for the source data and the target data, but can also migrate the knowledge in the source data set to the target data set, and can thus train, based on the target data set and in combination with the migrated knowledge, a target machine learning model with a better model effect.

The method and system for performing machine learning in a data privacy protection mode according to exemplary embodiments of the present disclosure can not only ensure that data privacy information is not leaked, but can also, while guaranteeing the usability of the privacy-protected data, effectively use data from different data sources for machine learning, so that the machine learning model achieves a better effect.
BRIEF DESCRIPTION OF THE DRAWINGS

These and/or other aspects and advantages of the present invention will become clearer and easier to understand from the following detailed description of embodiments of the present invention in conjunction with the accompanying drawings, in which:

FIG. 1 is a block diagram illustrating a system for performing machine learning under data privacy protection according to an exemplary embodiment of the present disclosure;

FIG. 2 is a flowchart illustrating a method of performing machine learning in a data privacy protection mode according to an exemplary embodiment of the present disclosure;

FIG. 3 is a schematic diagram illustrating a method of performing machine learning in a data privacy protection mode according to a first exemplary embodiment of the present disclosure;

FIG. 4 is a schematic diagram illustrating a method of performing machine learning in a data privacy protection mode according to a second exemplary embodiment of the present disclosure;

FIG. 5 is a schematic diagram illustrating a method of performing machine learning in a data privacy protection mode according to a third exemplary embodiment of the present disclosure;

FIG. 6 is a schematic diagram illustrating a method of performing machine learning in a data privacy protection mode according to a fourth exemplary embodiment of the present disclosure;

FIG. 7 is a schematic diagram illustrating the concept of performing machine learning in a data privacy protection mode according to an exemplary embodiment of the present disclosure;

FIG. 8 is a block diagram illustrating a system for performing machine learning under data privacy protection according to an exemplary embodiment of the present disclosure;

FIG. 9 is a flowchart illustrating a method of performing machine learning in a data privacy protection mode according to an exemplary embodiment of the present disclosure;

FIG. 10 is a schematic diagram illustrating the concept of performing machine learning in a data privacy protection mode according to an exemplary embodiment of the present disclosure.

DETAILED DESCRIPTION

In order to enable those skilled in the art to better understand the present invention, exemplary embodiments of the present invention are described in further detail below in conjunction with the accompanying drawings and specific implementations. It should be noted here that "and/or" appearing in the present disclosure covers three parallel cases. For example, "including A and/or B" means including at least one of A and B, that is, the following three parallel cases: (1) including A; (2) including B; (3) including A and B. Similarly, "including A, B, and/or C" means including at least one of A, B, and C. As another example, "performing step one and/or step two" means performing at least one of step one and step two, that is, the following three parallel cases: (1) performing step one; (2) performing step two; (3) performing step one and step two.
FIG. 1 is a block diagram illustrating a system 100 for performing machine learning under data privacy protection (hereinafter, for convenience of description, simply referred to as the "machine learning system") according to an exemplary embodiment of the present disclosure. Referring to FIG. 1, the machine learning system 100 may include a target data set acquisition device 110, a migration item acquisition device 120, a first target machine learning model obtaining device 130, and a second target machine learning model obtaining device 140.

Specifically, the target data set acquisition device 110 may acquire a target data set including a plurality of target data records. Here, the target data set may be any data set that can be used for training the target machine learning model, and, optionally, the target data set may further include labels of the target data records with respect to a machine learning target (prediction target). For example, a target data record may include a plurality of data attribute fields reflecting various attributes of an object or event (for example, user ID, age, gender, historical credit record, etc.), and the label of the target data record with respect to the machine learning target may be, for example, whether the user is able to repay a loan, whether the user accepts recommended content, and so on, but is not limited to this. Here, the label of a target data record with respect to the machine learning target is not limited to a label with respect to a single machine learning target; a target data record may have labels with respect to one or more machine learning targets, that is, one target data record is not limited to corresponding to one label but may correspond to one or more labels. In addition, the target data set may involve various kinds of personal privacy information that users do not expect to be known to others (for example, the user's name, ID number, mobile phone number, total assets, loan records, etc.), and may also include other related information that does not involve personal privacy. Here, the target data records may come from different data sources (for example, network operators, banking institutions, medical institutions, etc.), and the target data set may be used by a specific institution or organization with user authorization, but it is often expected that information involving personal privacy will not be further known to other organizations or individuals. It should be noted that, in the present disclosure, "privacy" may refer generally to any attribute involving a single individual.

As an example, the target data set acquisition device 110 may acquire the target data set from a target data source at one time or in batches, and may acquire the target data set manually, automatically, or semi-automatically. In addition, the target data set acquisition device 110 may acquire the target data records and/or their labels with respect to the machine learning target in real time or offline; it may acquire a target data record and its label simultaneously, or the time at which the label is obtained may lag behind the time at which the target data record is obtained. Furthermore, the target data set acquisition device 110 may acquire the target data set from the target data source in encrypted form, or may directly utilize a target data set it has already stored locally. If the acquired target data set is encrypted data, the machine learning system 100 may optionally further include a device for decrypting the target data, and may further include a data processing device to process the target data into a form suitable for current machine learning. It should be noted that the present disclosure places no restrictions on the types, forms, or contents of the target data records and their labels in the target data set, nor on the manner of acquiring the target data set; any data that can be used for machine learning, obtained by any means, can serve as the above-mentioned target data set.

However, as described in the background of the present disclosure, for machine learning that is expected to mine more valuable information, the acquired target data set alone may in practice be insufficient to learn a machine learning model that meets actual task requirements or achieves a predetermined effect. Therefore, one may seek to obtain related information from other data sources so as to migrate knowledge from those data sources to the target data set, thereby performing machine learning jointly with the target data set and the knowledge from the other data sources and improving the effect of the machine learning model. However, the premise of such migration is to ensure that the privacy information involved in the data sets of the other data sources (which may be referred to as "source data sets" in the present disclosure) is not leaked; that is, the source data must be privacy-protected.

To this end, according to an exemplary embodiment of the present disclosure, the migration item acquisition device 120 may acquire a plurality of migration items about a source data set. Specifically, each of the plurality of migration items may be used to migrate the knowledge of a corresponding part of the source data set to the target data set under source data privacy protection. Here, the corresponding part of the source data set may refer to the part of the data set corresponding to each migration item; that is, each migration item is only used to migrate, in the source data privacy protection mode, the knowledge of the part of the source data set corresponding to it to the target data set, and ultimately the knowledge of the entire source data set is migrated to the target data set through the plurality of migration items. Specifically, each migration item may be any information, obtained while the source data is privacy-protected (that is, in the source data privacy protection mode), related to the knowledge contained in the part of the source data set corresponding to that migration item. The present disclosure places no restriction on the specific content and form of each migration item, as long as it can migrate the knowledge of the corresponding part of the source data set to the target data set in the source data privacy protection mode; for example, each migration item may involve samples of the corresponding part of the source data set, features of the corresponding part of the source data set, a model obtained based on the corresponding part of the source data set, an objective function used for model training based on the corresponding part of the source data set, statistical information about the corresponding part of the source data set, and so on.

According to an exemplary embodiment, the corresponding part of the source data set may be a corresponding source data subset obtained by dividing the source data set according to data attribute fields. Similar to the target data set, the source data set may include a plurality of source data records and, optionally, a label of each source data record with respect to a machine learning target. In addition, similar to a target data record, each source data record may also include a plurality of data attribute fields reflecting various attributes of an object or event (for example, user ID, age, gender, historical credit record, historical loan record, etc.). Here, "dividing according to data attribute fields" may mean grouping the plurality of data attribute fields included in each source data record of the source data set, so that each data record after division (that is, each sub data record obtained by the division) includes at least one data attribute field, and the set formed by the data records having the same data attribute fields is the corresponding source data subset obtained by dividing the source data set according to data attribute fields. That is to say, each data record in a corresponding source data subset may include the same data attribute fields, and each data record may include one or more data attribute fields. In addition, the numbers of data attribute fields included in the data records of different source data subsets may be the same or different. For example, as described above, assuming each source data record includes the following five data attribute fields: user ID, age, gender, historical credit record, and historical loan record, these five data attribute fields may be divided, for example, into three data attribute field groups, where, for example, the first data attribute field group may include the two fields user ID and age, the second data attribute field group may include the two fields gender and historical credit record, and the third data attribute field group may include the single field historical loan record. In this case, the corresponding source data subsets obtained by dividing the source data set according to data attribute fields may be a first source data subset composed of data records including the data attribute fields in the first group, a second source data subset composed of data records including the data attribute fields in the second group, and a third source data subset composed of data records including the data attribute fields in the third group. The manner of dividing the source data set has been explained above with an example; however, it is clear to those skilled in the art that neither the number and content of the data attribute fields included in a source data record nor the specific manner of dividing the source data set is limited to the above example.

As an example, the migration item acquisition device 120 may receive the plurality of migration items about the source data set from the outside. For example, the migration item acquisition device 120 may acquire the migration items from an entity that owns the source data set, or from an entity authorized to perform related processing on the source data set (for example, a service provider providing machine learning related services). In this case, each migration item may be obtained by the entity that owns the source data set, or by the entity authorized to perform related processing on the source data set, by performing machine learning related processing based on the corresponding source data subset described above, and these entities may send the obtained migration items to the migration item acquisition device 120.

Unlike acquiring the migration items directly from the outside, the migration item acquisition device 120 may, alternatively, acquire the plurality of migration items about the source data set by itself performing machine learning related processing on the source data set. Here, the acquisition and use of the source data set by the migration item acquisition device 120 may be authorized or subject to protection measures, so that it can perform corresponding processing on the acquired source data set. Specifically, the migration item acquisition device 120 may first acquire a source data set including a plurality of source data records. Here, the source data set may be any data set related to the target data set, and accordingly the above descriptions of the composition of the target data set, the manner of acquiring the target data set, and so on are all applicable to the source data set and are not repeated here. In addition, according to an exemplary embodiment, the source data records and the target data records may include the same data attribute fields. Moreover, although for convenience of description the source data set is described as being acquired by the migration item acquisition device 120, it should be noted that the operation of acquiring the source data set may also be performed by the target data set acquisition device 110, or jointly by the two; the present disclosure is not limited in this regard. Furthermore, the acquired target data set, source data set, and migration items may all be stored in a storage device (not shown) of the machine learning system. Optionally, the stored target data, source data, or migration items may be isolated physically or in terms of access permissions, to ensure the safe use of the data.

When the source data set has been acquired, out of privacy considerations, the machine learning system 100 cannot directly use the acquired source data set together with the target data set for machine learning; it may use the source data for machine learning only under the guarantee that privacy protection is applied to it. To this end, the migration item acquisition device 120 may acquire the plurality of migration items about the source data set by performing machine learning related processing on the source data set in the source data privacy protection mode. Specifically, according to an exemplary embodiment, the migration item acquisition device 120 may divide the source data set into a plurality of source data subsets according to data attribute fields, and, in the source data privacy protection mode, train, based on each source data subset, a source machine learning model corresponding to each source data subset for a first prediction target, and use the parameters of each trained source machine learning model as the migration item related to that source data subset. Here, the data records in each source data subset may include at least one data attribute field. Since the manner of dividing the source data set according to data attribute fields has already been explained above with an example, it is not repeated here.

It should be noted here that, optionally, in addition to the plurality of source data records, the source data set may further include labels of the source data records with respect to a machine learning target. In the case where the source data set includes source data records and their labels, the division of the source data set according to data attribute fields mentioned above is limited to dividing the source data records in the source data set, and does not divide the labels of the source data records with respect to the machine learning target. Moreover, the label, with respect to the machine learning target, of each data record obtained by dividing a source data record (each such record including at least one data field) remains the label of that source data record with respect to the machine learning target before division. Accordingly, here, training the source machine learning model corresponding to each source data subset for the first prediction target may be training the source machine learning model corresponding to each source data subset based on that source data subset (that is, each data record included in the source data subset and its corresponding label), where the label of each data record (obtained by dividing a source data record) for the first prediction target is the label of that source data record for the first prediction target. As an example, the first prediction target may be predicting whether a transaction is fraudulent, predicting whether a user is able to repay a loan, and so on, but is not limited to this.

In addition, it should be noted that although the parameters of each trained source machine learning model are taken above as the migration item related to each source data subset, this is merely an example. In fact, the migration item related to each source data subset may be any information, obtained in the source data privacy protection mode, related to the knowledge contained in that source data subset. Specifically, according to an exemplary embodiment of the present disclosure, the migration item related to each source data subset may involve model parameters, an objective function, and/or statistical information about the data in that source data subset obtained in the process of performing machine learning related processing based on that source data subset, but is not limited to this. In addition, besides training, in the source data privacy protection mode, the source machine learning model corresponding to each source data subset based on that subset as described above, the operation of performing machine learning related processing based on a source data subset may also include machine learning related processing such as feature processing or statistical data analysis performed on the source data subset. Moreover, the above model parameters, objective function, and/or statistical information about the source data subset may be the information itself obtained directly in the process of performing machine learning related processing based on the source data subset, or may be information obtained after further transformation or processing of that information; the present disclosure places no limit on this. As an example, a migration item involving model parameters may be the parameters of the source machine learning model, statistical information of the parameters of the source machine learning model, etc., but is not limited to this. As an example, the objective function involved in a migration item may refer to the objective function constructed for training the source machine learning model; in the case where the parameters of the source machine learning model themselves are not migrated, this objective function may not actually be solved separately, but the present disclosure is not limited to this. As an example, a migration item involving statistical information about a source data subset may be data distribution information and/or data distribution change information about the source data subset acquired in the source data privacy protection mode, but is not limited to this.
According to an exemplary embodiment, the source data privacy protection mode may be a protection mode that follows the definition of differential privacy, but is not limited to this; it may be any privacy protection method, already existing or appearing in the future, that can protect the privacy of the source data.

For ease of understanding, the protection mode that follows the definition of differential privacy is now briefly described. Suppose there is a random mechanism $M$ (for example, $M$ is the training process of a machine learning model). For $M$, let any two input data sets $\mathcal{D}$ and $\mathcal{D}'$ differ by only one sample, and let the probabilities that the output equals $t$ be $\Pr[M(\mathcal{D}) = t]$ and $\Pr[M(\mathcal{D}') = t]$, respectively. If the following Equation 1 is satisfied (where $\varepsilon$ is the privacy budget), then $M$ can be considered to satisfy $\varepsilon$-differential privacy protection for any input:

$$\Pr[M(\mathcal{D}) = t] \;\le\; e^{\varepsilon}\,\Pr[M(\mathcal{D}') = t] \qquad \text{(Equation 1)}$$

In Equation 1 above, the smaller $\varepsilon$ is, the better the degree of privacy protection, and conversely the worse it is. The specific value of $\varepsilon$ can be set according to the user's requirements for the degree of data privacy protection. Suppose there is a user for whom it makes very little difference to the output whether or not his personal data is input to the mechanism $M$ (suppose the data set before his personal data is input is $\mathcal{D}$, the data set after his personal data is input is $\mathcal{D}'$, and $\mathcal{D}$ and $\mathcal{D}'$ differ only by that personal data), where the influence is defined by the size of $\varepsilon$; then $M$ can be considered to protect his privacy. If $\varepsilon = 0$, whether or not this user inputs his own data to $M$ has no influence whatsoever on the output of $M$, so the user's privacy is completely protected.
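As a concrete illustration of Equation 1 (not from the patent), the classic Laplace mechanism on a counting query satisfies $\varepsilon$-differential privacy: for two data sets differing in one record, the ratio of output densities never exceeds $e^{\varepsilon}$. A small numeric check in Python:

```python
import numpy as np

eps = 0.5
count_D, count_Dp = 100, 101  # neighboring data sets: counts differ by one record

def laplace_pdf(t, mu, scale):
    """Density of the mechanism M(D) = count(D) + Lap(scale), scale = 1/eps."""
    return np.exp(-abs(t - mu) / scale) / (2 * scale)

for t in [99.0, 100.5, 103.0]:
    ratio = laplace_pdf(t, count_D, 1 / eps) / laplace_pdf(t, count_Dp, 1 / eps)
    # Equation 1: ratio <= e^eps (and >= e^-eps by symmetry)
    print(f"t={t}: ratio={ratio:.3f}, bound={np.exp(eps):.3f}")
```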
According to an exemplary embodiment, the source data privacy protection mode may be adding random noise in the process of training the source machine learning model as described above. For example, random noise may be added in such a way that the above definition of differential privacy protection is followed. It should be noted, however, that the definition of privacy protection is not limited to the definition of differential privacy protection; it may be another definition of privacy protection such as k-anonymity, l-diversity, or t-closeness.

According to an exemplary embodiment, the source machine learning model may be, for example, a generalized linear model, for example, a logistic regression model, but is not limited to this. In addition, in the source data privacy protection mode, the migration item acquisition device 120 may construct the objective function used for training the source machine learning model to include at least a loss function and a noise term. Here, the noise term may be used to add random noise in the process of training the source machine learning model, so that privacy protection of the source data can be achieved. In addition to being constructed to include a loss function and a noise term, the objective function used for training the source machine learning model may also be constructed to include other constraint terms for constraining the model parameters, for example, a regular term for preventing over-fitting of the model or preventing the model parameters from being too complex, a compensation term for privacy protection, and so on.
To facilitate a more intuitive understanding of the process, described above, of training in the source data privacy protection mode the source machine learning model corresponding to each source data subset for the first prediction target, that process is further explained below. For convenience of description, it is assumed here that the source data privacy protection mode is a protection mode that follows the definition of differential privacy, and that the source machine learning model is a generalized linear model.

Specifically, suppose a data set $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$, where $x_i$ is a sample and $y_i$ is the label of the sample (that is, the label of $x_i$ for the prediction target), with $x_i \in \mathbb{R}^d$, where $n$ is the number of samples in the data set, $d$ is the dimension of the sample space, and $\mathbb{R}^d$ is the $d$-dimensional sample space. Furthermore, suppose the set $S_G$ of data attribute fields included in the data records of the data set is divided into $K$ non-overlapping data attribute field groups $G_1, G_2, \ldots, G_K$ (that is, $S_G = \{G_1, \ldots, G_K\}$), where each group $G_k$ includes at least one data attribute field. Under the above assumptions, the machine learning model corresponding to each data subset can be trained through the following process.

For each $k$ (where $k = 1, \ldots, K$), the following operations are performed to obtain $w_k^*$:

1. Set the scaling constant $q_k$ (specifically, an upper bound used to limit the two-norm of the samples in each data subset), where the set of scaling constants $\{q_k\}$ needs to satisfy a constraint involving the constant $c$, the constants $\lambda_k$, and the privacy budget $\varepsilon$ of Equation 1 above;

2. For $G_k \in S_G$, obtain $\mathcal{D}_k$, the data subset formed by extracting the data attribute fields belonging to $G_k$ from the data set $\mathcal{D}$, in which every data record includes the data attribute fields in $G_k$; that is, $\mathcal{D}_k$ is the $k$-th data subset obtained by dividing the data set $\mathcal{D}$ according to data attribute fields;

3. If $\varepsilon' > 0$, then $\Delta = 0$; otherwise, $\Delta$ is set to a corresponding compensation constant and $\varepsilon' = \varepsilon / 2$;

4. Scale the samples included in the data subset $\mathcal{D}_k$ so that $\|x_i\| \le q_k$ for every $x_i$ in $\mathcal{D}_k$;

5. Sample $b$ from the noise density function; specifically, the two-norm $\|b\|$ of $b$ may first be sampled from a Gamma distribution, and then, based on a uniformly randomly sampled direction $u$ of $b$, $b = \|b\|\,u$ is obtained;

6. Using Equation 2, in the data privacy protection mode, train the machine learning model corresponding to the data subset $\mathcal{D}_k$ for the prediction target (Equation 2 is reconstructed here from the surrounding definitions; the original typeset formula was an image):

$$w_k^* \;=\; \arg\min_{w}\; \frac{1}{n}\sum_{(x_i,\,y_i)\in\mathcal{D}_k} \ell\big(w^{\top}x_i,\, y_i\big) \;+\; \lambda_k\, g_k(w) \;+\; \frac{1}{n}\, b^{\top}w \;+\; \frac{\Delta}{2}\,\|w\|^2 \qquad \text{(Equation 2)}$$

In Equation 2, $w$ is the parameter of the machine learning model, $\ell$ is the loss function, $g_k(w)$ is the regularization function, $\frac{1}{n}b^{\top}w$ is the noise term used to add random noise in the process of training the machine learning model so as to achieve data privacy protection, $\frac{\Delta}{2}\|w\|^2$ is the compensation term used for privacy protection, and $\lambda_k$ is a constant used to control the strength of regularization; the whole expression is the objective function constructed for training the $k$-th machine learning model. According to Equation 2 above, the value of $w$ at which the objective function takes its minimum is the finally solved parameter $w_k^*$ of the $k$-th machine learning model.

The mechanism for solving the parameters of a machine learning model according to the process described above can be defined as $A_2$. It should be noted that $A_2$ can be used both to solve the parameters of the source machine learning model and to solve the parameters of the target machine learning model.

For the $w_k^*$ solved according to Equation 2 above to satisfy the definition of $\varepsilon$-differential privacy, the following predetermined conditions need to be met: the regularization function $g_k(w)$ needs to be a 1-strongly convex function and twice differentiable, and, for all $z$, the loss function needs to satisfy $|\ell'(z)| \le 1$ and $|\ell''(z)| \le c$, where $\ell'$ and $\ell''$ are the first and second derivatives of the loss function, respectively. That is, as long as the model is a generalized linear model meeting the above conditions, the parameters of a machine learning model satisfying differential privacy protection can be obtained through Equation 2 above.
例如，对于逻辑回归模型，其损失函数 $\ell(z)=\log\left(1+e^{-z}\right)$。如果令常数c等于1/4，正则化函数 $g_k(w)=\frac{1}{2}\|w\|^2$，则正则化函数 $g_k(w)$ 满足是1-强凸函数并且二阶可微，并且对于所有的z，损失函数满足 $|\ell'(z)|\le 1$ 并且 $|\ell''(z)|\le\frac{1}{4}$。
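上述关于逻辑回归损失函数的导数界，可用如下示意性的数值检查加以验证（仅为说明，并非本公开的组成部分）：

```python
import numpy as np

# 逻辑回归损失 l(z) = log(1 + e^{-z}) 的一阶与二阶导数
z = np.linspace(-50.0, 50.0, 200001)
sigma = 1.0 / (1.0 + np.exp(-z))
l1 = sigma - 1.0            # l'(z) = -1/(1+e^z) = sigma(z) - 1
l2 = sigma * (1.0 - sigma)  # l''(z) = e^{-z}/(1+e^{-z})^2

print(np.abs(l1).max() <= 1.0)    # |l'(z)| <= 1
print(l2.max() <= 0.25 + 1e-12)   # |l''(z)| <= c = 1/4
```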
因此，当源机器学习模型是逻辑回归模型时，可利用以上描述的求解机器学习模型参数的机制 $A_2$ 来求解源机器学习模型的参数。具体地，可令每个源机器学习模型的正则化函数等于 $\frac{1}{2}\|w\|^2$，即对于 $k\in\{1,\ldots,K\}$，令正则化函数 $g_{sk}(w)=\frac{1}{2}\|w\|^2$（这里的 $g_{sk}(w)$ 即为以上等式2中的 $g_k(w)$）。在这种情况下，可利用以上描述的求解机器学习模型的参数 $w_k^*$ 的机制 $A_2$ 最终求解出K个源机器学习模型的参数 $\{w_{s1}^*,\ldots,w_{sK}^*\}=A_2\big(\mathcal{D}_s,\varepsilon_s,S_G,\{\lambda_{sk},g_{sk},q_{sk}\}_{k=1}^{K}\big)$，其中，$\mathcal{D}_s$ 为源数据集、$\varepsilon_s$ 为源数据隐私保护方式的隐私预算、$S_G$ 为每条源数据记录包括的数据属性字段的集合，$\{\lambda_{sk},g_{sk},q_{sk}\}_{k=1}^{K}$ 为用于控制正则化强度的常数 $\lambda_{sk}$（即，以上等式2中的 $\lambda_k$）、正则化函数 $g_{sk}$（即，以上等式2中的 $g_k(w)$）和缩放常数 $q_{sk}$（即，以上描述的 $q_k$）的集合。而按照以上机制 $A_2$ 求解出的与每个源数据子集对应的源机器学习模型的参数既满足了对源数据的隐私保护，又携带了对应的源数据子集的知识。随后，训练出的每个源机器学习模型的参数可作为与每个源数据子集相关的迁移项被用于将该源数据子集的知识迁移到目标数据集。
如上所述,由于按照数据属性字段对源数据集划分之后针对每个源数据子集来训练对应的源机器学习模型以获取迁移项,而不是针对整个源数据集来训练源机器学习模型以获取迁移项,因此,可有效地减小在训练过程中添加的随机噪声,从而使得按照以上方式训练出的与每个源数据子集对应的源机器学习模型的参数(作为与每个源数据子集相关的迁移项)不仅实现了对对应的源数据子集中的隐私信息的保护,同时能够确保迁移项的可用性。
需要说明的是,尽管以上以广义线性模型(例如,逻辑回归模型)为例介绍了求解源机器学习模型的参数的过程,但是,事实上,只要是满足以上提及的关于正则化函数和损失函数的限制条件的线性模型均可利用等式2来求解源机器学习模型的参数,作为迁移项。
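结合上面的机制 $A_2$，"按数据属性字段划分源数据集并逐子集训练源机器学习模型、从而得到K个迁移项"的流程可用如下草图示意。其中 dp_train_linear 沿用前文草图中假设的函数，字段分组 field_groups 仅为虚构的示例：

```python
import numpy as np

# 假设的字段分组（列索引表示数据属性字段），即 S_G = {G_1, G_2, G_3}，
# 例如 {用户ID, 年龄}、{性别, 历史信用记录}、{历史贷款记录}
field_groups = [[0, 1], [2, 3], [4]]

def train_source_models(X_s, y_s, eps_s, lams, qs, rng=None):
    """对每个源数据子集 D_s^k 训练对应的源机器学习模型（调用前文草图中的
    dp_train_linear），返回参数列表 [w_s1*, ..., w_sK*] 作为K个迁移项。"""
    transfer_items = []
    for k, cols in enumerate(field_groups):
        X_sub = X_s[:, cols]  # 按第k个字段组提取第k个源数据子集
        transfer_items.append(
            dp_train_linear(X_sub, y_s, eps_s, lams[k], qs[k], rng=rng))
    return transfer_items
```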
在迁移项获取装置120获取到关于源数据集的多个迁移项之后，第一目标机器学习模型获得装置130可分别利用所述多个迁移项之中的每个迁移项来获得与每个迁移项对应的第一目标机器学习模型，以获得多个第一目标机器学习模型。具体地，作为示例，第一目标机器学习模型获得装置130可在不使用目标数据集的情况下，直接将每个迁移项作为与其对应的第一目标机器学习模型的参数（为描述方便，以下将这种获得第一目标机器学习模型的方式简称为"第一目标机器学习模型直接获得方式"）。也就是说，假设多个第一目标机器学习模型的参数分别为 $u_1,\ldots,u_K$，则可令第一目标机器学习模型与源机器学习模型是相同类型的机器学习模型，并且可直接令 $u_k=w_{sk}^*$（$k=1,\ldots,K$），从而获得与每个迁移项对应的第一目标机器学习模型。
可选地,第一目标机器学习模型获得装置130可通过以下方式(为描述方便,以下将这种获得第一目标机器学习模型的方式简称为“通过训练的第一目标机器学习模型获得方式”)来获得与每个迁移项对应的第一目标机器学习模型。具体地,第一目标机器学习模型获得装置130可首先将目标数据集或第一目标数据集按照数据属性字段以与划分源数据集相同的方式划分为多个第一目标数据子集,随后,在目标数据隐私保护方式下,基于每个第一目标数据子集,结合和与每个第一目标数据子集对应的源数据子集相关的迁移项,针对第二预测目标训练与该迁移项对应的第一目标机器学习模型。
这里,第一目标数据集可包括目标数据集中所包括的部分目标数据记录,并且每个第一目标数据子集和与其对应的源数据子集中的数据记录可包括相同的数据属性字段。如上所述,目标数据记录和源数据记录包括相同的数据属性字段,在这种情况下,可将目标数据集或第一目标数据集按照数据属性字段以与划分源数据集相同的方式划分为多个第一目标数据子集。例如,与以上描述的源数据记录的示例相同,假设每条目标数据记录也包括以下五个数据属性字段:用户ID、年龄、性别、历史信用记录和历史贷款记录,则可按照与以上描述的划分源数据记录的示例性划分方式相同的划分方式来对目标数据集或第一目标数据集进行划分。具体地,将这五个数据属性字段也划分为三个数据属性字段组,其中,例如,第一数据属性字段组可包括用户ID和年龄这两个数据属性字段,第二数据属性字段组可包括性别和历史信用记录这两个数据属性字段,第三数据属性字段组可包括历史贷款记录这一个数据属性字段。在这种情况下,通过将目标数据集或第一目标数据集按照数据属性字段划分而获得的多个第一目标数据子集便可以是由包括第一数据属性字段组中的数据属性字段的数据记录构成的第一目标数据子集、由包括第二数据属性字段组中的数据属性字段的数据记录构成的第一目标数据子集和由包括第三数据属性字段组中的数据属性字段的数据记录构成的第一目标数据子集。在这种情况下,例如,与以上的第一个第一目标数据子集对应的源数据子集便为在描述源数据集的划分时所提及的第一源数据子集,并且该第一目标数据子集和第一源数据子集中的数据记录包括相同的数据属性字段(即,均包括用户ID和年龄这两个数据属性字段),以此类推。
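上述"按照数据属性字段、以与划分源数据集相同的方式划分目标数据集"的操作可用如下示意性代码表示，其中的字段名与分组沿用上文示例，均为假设：

```python
import pandas as pd

# 与划分源数据集相同的字段分组（字段名仅为沿用上文示例的假设）
FIELD_GROUPS = [
    ["用户ID", "年龄"],
    ["性别", "历史信用记录"],
    ["历史贷款记录"],
]

def split_by_field_groups(df: pd.DataFrame):
    """将数据集按数据属性字段划分为多个数据子集：
    第k个子集中的每条数据记录仅包括第k个字段组中的数据属性字段。"""
    return [df[cols].copy() for cols in FIELD_GROUPS]

# 用法：target_subsets = split_by_field_groups(target_df)
# target_subsets[k] 即第k个第一目标数据子集
```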
根据示例性实施例,上述目标数据隐私保护方式可与源数据隐私保护方式相同,例如,也可以是遵循差分隐私定义的保护方式,但不限于此。此外,第一目标机器学习模型可与源机器学习模型属于相同类型的机器学习模型。例如,第一目标机器学习模型也可以是广义线性模型,例如,逻辑回归模型,但不限于此,例如,可以是满足预定条件的任何线性模型。需要说明的是,这里的目标数据隐私保护方式也可以是与源数据隐私保护方式不同的隐私保护方式,并且第一目标机器学习模型也可以与源机器学习模型属于不同类型的机器学习模型,本申请对此均无限制。
此外,根据示例性实施例,上述目标数据隐私保护方式可以是在获得第一目标机器学习模型的过 程中添加随机噪声。作为示例,在目标数据隐私保护方式中,第一目标机器学习模型获得装置130可将用于训练第一目标机器学习模型的目标函数构造为至少包括损失函数和噪声项。除了将目标函数构造为至少包括损失函数和噪声项之外,第一目标机器学习模型获得装置130可将用于训练第一目标机器学习模型的目标函数构造为至少包括损失函数和噪声项并反映第一目标机器学习模型的参数与对应于该第一目标机器学习模型的迁移项之间的差值,然后,第一目标机器学习模型获得装置130可在目标数据隐私保护方式下,基于每个第一目标数据子集,结合和与每个第一目标数据子集对应的源数据子集相关的迁移项,通过求解构造的目标函数来针对第二预测目标训练与该迁移项对应的第一目标机器学习模型。通过在用于训练第一目标机器学习模型的目标函数中反映第一目标机器学习模型的参数与对应于该第一目标机器学习模型的迁移项之间的差值,可将与该迁移项对应的源数据子集中的知识迁移到目标数据集,从而使得该训练过程能够共同利用源数据集上的知识和目标数据集,因而训练出的第一目标机器学习模型的效果更佳。
需要说明的是,这里,第二预测目标可与以上所述的训练源机器学习模型所针对的第一预测目标相同(例如,两者均为预测交易是否为欺诈交易)或相似(例如,第一预测目标可以是预测交易是否为欺诈交易,第二预测目标可以是预测交易是否涉嫌违法)。此外,根据实际需要,上述目标函数还可被构造为包括用于防止训练出的第一目标机器学习模型出现过拟合现象的正则项等,或还可根据实际任务需求被构造为包括其他约束项,例如,用于隐私保护的补偿项,本申请对此并不限制,只要构造的目标函数能够有效地实现对目标数据的隐私保护,同时能够将对应的源数据子集上的知识迁移到目标数据集即可。
以下,为便于更加直观地理解上述内容,将进一步对第一目标机器学习模型获得装置130训练与每个迁移项对应的第一目标机器学习模型的上述过程进行说明。
这里,为描述方便,假设源机器学习模型是逻辑回归模型,第一目标机器学习模型是广义线性模型,并且目标数据隐私保护方式为遵循差分隐私保护定义的保护方式。
首先，将目标数据集 $\mathcal{D}_t$ 或第一目标数据集 $\mathcal{D}_{t1}$（其中，$\mathcal{D}_{t1}$ 是包括 $\mathcal{D}_t$ 中所包括的部分目标数据记录的目标数据集，例如，可将 $\mathcal{D}_t$ 中的所有目标数据记录按照 $1:(1-p)$ 的比例划分为第一目标数据集 $\mathcal{D}_{t1}$ 和第二目标数据集 $\mathcal{D}_{t2}$）按照数据属性字段以与划分源数据集相同的方式划分为多个第一目标数据子集。如上所述，源数据记录包括的数据属性字段集合 $S_G$ 被划分为不重叠的K个数据字段组 $G_1,G_2,\ldots,G_K$，同样，目标数据记录包括的数据属性字段集合也可以是 $S_G$，并且 $S_G=\{G_1,\ldots,G_K\}$。
其次，对于每个 $k\in\{1,\ldots,K\}$，可令用于训练第k个第一目标机器学习模型的目标函数中的正则化函数为：

$$g_{tk}(u)=\frac{\eta_k}{2}\left\|u-w_{sk}^*\right\|^2+\frac{1-\eta_k}{2}\|u\|^2\qquad\text{（等式3）}$$

其中，$0\le\eta_k\le 1$，u为第k个第一目标机器学习模型的参数，$w_{sk}^*$ 是K个源机器学习模型的参数 $\{w_{s1}^*,\ldots,w_{sK}^*\}$ 中的第k个源机器学习模型的参数。由于 $g_{tk}(u)$ 是1-强凸函数并且二阶可微，并且逻辑回归模型的损失函数满足上述预定条件中关于损失函数的要求，因此，可利用以上描述的求解机器学习模型的参数 $w_k^*$ 的机制 $A_2$，通过将w替换为u，将 $\mathcal{D}$ 替换为 $\mathcal{D}_t$ 或 $\mathcal{D}_{t1}$，将 $g_k(w)$ 替换为 $g_{tk}(u)$，并将 $\lambda_k$ 替换为 $\lambda_{tk}$（用于训练第一目标机器学习模型的目标函数中的用于控制正则化强度的常数），将 $q_k$ 替换为 $q_{tk}$（用于缩放第k个第一目标数据子集中的样本的缩放常数）来获得与第k个迁移项 $w_{sk}^*$ 对应的第k个第一目标机器学习模型的参数 $u_k^*$。
具体地，假设令整个目标数据隐私保护方式的隐私预算为 $\varepsilon_t$，则在先前被划分的目标数据集是 $\mathcal{D}_t$ 且后续用于训练第二目标机器学习模型的目标数据集与 $\mathcal{D}_t$ 完全重叠或部分重叠的情况下，获得的K个第一目标机器学习模型的参数 $\{u_1^*,\ldots,u_K^*\}=A_2\big(\mathcal{D}_t,\,p\varepsilon_t,\,S_G,\,\{\lambda_{tk},g_{tk},q_{tk}\}_{k=1}^{K}\big)$（其中，$p\varepsilon_t$ 是与用于训练第一目标机器学习模型的目标函数所包括的噪声项对应的隐私预算，其中，p为与用于训练第一目标机器学习模型的目标函数所包括的噪声项对应的隐私预算与整个目标数据隐私保护方式的隐私预算的比，并且 $0\le p\le 1$）；在先前被划分的目标数据集是第一目标数据集 $\mathcal{D}_{t1}$ 而后续用于训练第二目标机器学习模型的目标数据集与 $\mathcal{D}_{t1}$ 完全不重叠的情况下，获得的K个第一目标机器学习模型的参数 $\{u_1^*,\ldots,u_K^*\}=A_2\big(\mathcal{D}_{t1},\,\varepsilon_t,\,S_G,\,\{\lambda_{tk},g_{tk},q_{tk}\}_{k=1}^{K}\big)$（其中，$\varepsilon_t$ 是与用于训练第一目标机器学习模型的目标函数所包括的噪声项对应的隐私预算和与用于训练第二目标机器学习模型的目标函数所包括的噪声项对应的隐私预算两者之中较大的隐私预算）。
如上所述，在等式3中，正则化函数 $g_{tk}(u)$ 中含有 $\left\|u-w_{sk}^*\right\|^2$ 项，使得用于第一目标机器学习模型的训练的目标函数被构造为反映了第一目标机器学习模型的参数与对应于该第一目标机器学习模型的迁移项之间的差值，从而有效地实现了对应的源数据子集上的知识到目标数据集的迁移。
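等式3的正则化函数可按如下方式示意性地实现（按照本文对等式3的重构形式，属于示意性假设）。通过最小化含 $g_{tk}(u)$ 的目标函数，参数u同时被拉向源模型参数 $w_{sk}^*$（实现知识迁移）与原点（控制模型复杂度）：

```python
import numpy as np

def g_tk(u, w_sk_star, eta_k):
    """等式3的正则化函数（示意）：
    g_tk(u) = eta_k/2 * ||u - w_sk*||^2 + (1 - eta_k)/2 * ||u||^2，
    其中 0 <= eta_k <= 1；eta_k 越大，参数越倾向于源模型参数，迁移力度越强。"""
    diff = u - w_sk_star
    return 0.5 * eta_k * np.dot(diff, diff) + 0.5 * (1.0 - eta_k) * np.dot(u, u)
```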
需要说明的是，以上虽然重点以逻辑回归模型为例介绍了在目标数据隐私保护方式下训练第一目标机器学习模型的过程，但是，本领域技术人员应清楚的是，本公开中的源机器学习模型和第一目标机器学习模型均不限于逻辑回归模型，而是可以是例如满足如上所述的预定条件的任何线性模型，甚至还可以是其他任何适当的模型。
在获得多个第一目标机器学习模型(例如,按照以上提及的“第一目标机器学习模型直接获得方式”或“通过训练的第一目标机器学习模型”获得多个第一目标机器学习模型)的情况下,第二目标机器学习模型获得装置140可利用所述多个第一目标机器学习模型获得第二目标机器学习模型。这里,第一目标机器学习模型和第二目标机器学习模型通常为上下层的结构,例如,第一目标机器学习模型可对应于第一层机器学习模型,第二目标机器学习模型可对应于第二层机器学习模型。
具体地,在第一目标机器学习模型获得装置130通过以上描述的“第一目标机器学习模型直接获得方式”获得了多个第一目标机器学习模型的情况下,第二目标机器学习模型获得装置140可按照如下方式(以下为描述方便,将该方式简称为“通过训练的第二目标机器学习模型获得方式”)获得第二目标机器学习模型:首先,第二目标机器学习模型获得装置140可将目标数据集按照数据属性字段以与划分源数据集相同的方式划分为多个目标数据子集。这里,每个目标数据子集和与其对应的源数据子集中的数据记录包括相同的数据属性字段。以上已经在描述获得第一目标机器学习模型的“通过训练的第一目标机器学习模型”中描述了如何按照数据属性字段以与划分源数据集相同的方式划分目标数据集,因此这里不再赘述,具体内容可参见上面的描述。其次,第二目标机器学习模型获得装置140可针对每个目标数据子集,利用与其对应的第一目标机器学习模型执行预测以获取针对每个目标数据子集中的每条数据记录的预测结果。最后,在目标数据隐私保护方式下,基于由获取的与每条目标数据记录对应的多个预测结果构成的训练样本的集合,针对第三预测目标训练第二目标机器学习模型。这里,所述训练样本的标记即为目标数据记录针对第三预测目标的标记。以下将详细描述训练样本的特征的产生过程。
具体地，例如，可假设获得的K个第一目标机器学习模型均为逻辑回归模型，并且K个第一目标机器学习模型的参数分别为 $u_1^*,\ldots,u_K^*$（K同时也是划分出的多个目标数据子集的数量），则可将由获取的与目标数据集中的每条目标数据记录对应的多个预测结果构成的训练样本表示为：

$$\big(\sigma(u_1^{*\top}x_{1i}),\ \sigma(u_2^{*\top}x_{2i}),\ \ldots,\ \sigma(u_K^{*\top}x_{Ki})\big)$$

其中，$\sigma(\cdot)$ 是sigmoid函数，$x_{ki}$ 是第k（其中，$k\in\{1,\ldots,K\}$）个目标数据子集中的第i个数据记录。作为示例，$\sigma(u_1^{*\top}x_{1i})$ 为K个第一目标机器学习模型中的第一个第一目标机器学习模型针对K个目标数据子集中的第一个目标数据子集中的第i个数据记录的预测结果（这里，例如，该预测结果可以为该第一目标机器学习模型针对第i个数据记录输出的预测概率值（即，置信度值）），以此类推，便可获得K个第一目标机器学习模型分别针对对应的目标数据子集中的第i个数据记录的预测结果 $\sigma(u_1^{*\top}x_{1i}),\ldots,\sigma(u_K^{*\top}x_{Ki})$。而上述K个预测结果便为与目标数据集中的第i个目标数据记录对应的K个预测结果，这K个预测结果可构成第二目标机器学习模型的训练样本的特征部分。
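由K个第一目标机器学习模型的预测结果构造第二目标机器学习模型训练样本特征的过程，可用如下草图示意（以sigmoid输出预测概率，属于对逻辑回归预测结果的一种假设性表示）：

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def build_stacking_features(subsets, params):
    """subsets[k]: 第k个目标数据子集的样本矩阵，形状 (n, d_k)；
    params[k]: 第k个第一目标机器学习模型的参数 u_k*。
    返回形状 (n, K) 的矩阵，其第i行即与第i条目标数据记录对应的K个预测结果，
    作为第二目标机器学习模型训练样本的特征部分。"""
    preds = [sigmoid(X_k @ u_k) for X_k, u_k in zip(subsets, params)]
    return np.stack(preds, axis=1)
```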
根据示例性实施例，第一目标机器学习模型和第二目标机器学习模型可属于相同类型的机器学习模型。例如，第二目标机器学习模型也可以为广义线性模型（例如，逻辑回归模型）。此外，这里的目标数据隐私保护方式可以是遵循差分隐私定义的保护方式，但不限于此。具体地，所述目标数据隐私保护方式可以是在获得第二目标机器学习模型的过程中添加随机噪声。例如，在所述目标数据隐私保护方式中，第二目标机器学习模型获得装置140可将用于训练第二目标机器学习模型的目标函数构造为至少包括损失函数和噪声项。在这种情况下，可按照以下所描述的训练机器学习模型的机制 $A_1$ 来训练第二目标机器学习模型，其中，$A_1$ 是在满足差分隐私保护定义的情况下求解机器学习模型的参数的机制。具体地，机制 $A_1$ 的实现过程如下：
假设数据集 $\mathcal{D}=\{(x_i,y_i)\}_{i=1}^{n}$，其中，$x_i$ 是样本，$y_i$ 是样本的标记，$x_i\in\mathcal{X}$，$y_i\in\{-1,+1\}$，其中，n为样本数量，d是样本空间的维度，$\mathcal{X}\subseteq\mathbb{R}^d$ 是d维样本空间，则可基于数据集 $\mathcal{D}$ 利用以下等式4来训练机器学习模型，从而获得满足差分隐私保护定义的机器学习模型的参数。
具体地，在利用等式4求解机器学习模型的参数之前，可令：

1、对数据集 $\mathcal{D}$ 进行缩放，使得对于任意i均满足 $\|x_i\|\le 1$，其中，$\|x_i\|$ 表示 $x_i$ 的二范数；

2、$\varepsilon'=\varepsilon-2\log\!\left(1+\frac{c}{n\lambda}\right)$，其中，c和λ为常数，ε是以上等式1中的隐私预算；

3、如果 $\varepsilon'>0$，则 $\Delta=0$，否则，$\Delta=\frac{c}{n\,(e^{\varepsilon/4}-1)}-\lambda$，并且 $\varepsilon'=\varepsilon/2$；

4、从密度函数 $\nu(b)\propto e^{-\frac{\varepsilon'}{2}\|b\|}$ 采样b，具体地，可首先从Gamma分布 $\Gamma\!\left(d,\frac{2}{\varepsilon'}\right)$ 采样b的二范数 $\|b\|$，然后基于均匀随机采样b的方向u便可获得 $b=\|b\|u$。
接下来，可利用等式4，在数据隐私保护方式下，基于数据集 $\mathcal{D}$ 训练机器学习模型，等式4如下：

$$w^*=\arg\min_{w}\ \frac{1}{n}\sum_{i=1}^{n}\ell\big(w^{\top}x_i,\,y_i\big)+\lambda g(w)+\frac{b^{\top}w}{n}+\frac{\Delta}{2}\|w\|^2\qquad\text{（等式4）}$$

在等式4中，w是机器学习模型的参数，$\ell(\cdot,\cdot)$ 是损失函数，$g(w)$ 是正则化函数，$\frac{b^{\top}w}{n}$ 是用于在训练机器学习模型的过程中添加随机噪声以实现数据隐私保护的噪声项，$\frac{\Delta}{2}\|w\|^2$ 是用于隐私保护的补偿项，λ是用于控制正则化强度的常数，$f(w)=\frac{1}{n}\sum_{i=1}^{n}\ell(w^{\top}x_i,y_i)+\lambda g(w)+\frac{b^{\top}w}{n}+\frac{\Delta}{2}\|w\|^2$ 便为构造的用于训练机器学习模型的目标函数。根据以上等式4，在目标函数的取值最小时的w值便为最终求解出的机器学习模型的参数 $w^*$。
在训练第二目标机器学习模型时，按照以上机制 $A_1$，可通过令以上的 $\mathcal{D}=\{(x_i,y_i)\}_{i=1}^{n}$（其中，$x_i$ 是如上所述的训练样本 $\big(\sigma(u_1^{*\top}x_{1i}),\ldots,\sigma(u_K^{*\top}x_{Ki})\big)$，$y_i$ 是 $x_i$ 针对第三预测目标的标记，$\mathcal{D}$ 是由这样的训练样本构成的训练样本的集合），$\lambda=\lambda_v$（其中，$\lambda_v$ 是用于训练第二目标机器学习模型的目标函数中用于控制正则化强度的常数），正则化函数 $g(w)=\frac{1}{2}\|w\|^2$，并且 $\varepsilon=\varepsilon_t$（$\varepsilon_t$ 为训练第二目标机器学习模型时使用的目标数据隐私保护方式的隐私预算）来利用等式4求解出第二目标机器学习模型的参数 $w_v^*$。
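作为一个示意性的用法示例（基于合成数据，且沿用前文草图中假设的 dp_train_linear 函数；机制 $A_1$ 可视为机制 $A_2$ 在K=1、q=1时的特例），第二目标机器学习模型的训练可表示为：

```python
import numpy as np

rng = np.random.default_rng(42)

# 合成示例：n条目标数据记录，K=3个第一目标机器学习模型的预测结果作为特征
n, K = 1000, 3
Z = rng.uniform(size=(n, K))               # 相当于 build_stacking_features 的输出
y = np.where(Z.mean(axis=1) > 0.5, 1, -1)  # 针对第三预测目标的标记（仅为示例）

# 按机制A1（q=1的特例）直接复用前文草图中的 dp_train_linear：
# 相当于 w_v* = A_1(D, eps_t, lambda_v, g)，其中 g(w)=||w||^2/2
w_v = dp_train_linear(Z, y, eps=1.0, lam=0.1, q=1.0, rng=rng)
print(w_v)
```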
需要说明的是,尽管以上以第一目标机器学习模型和第二目标机器学习模型均为逻辑回归模型为例对训练第二目标机器学习模型的过程进行了描述,但是,第一目标机器学习模型和第二目标机器学习模型均不限于是逻辑回归模型,并且第二目标机器学习模型可以是与第一目标机器学习模型相同或不同类型的任何机器学习模型。此外,这里的第三预测目标可以与以上描述第一目标机器学习模型的训练时提及的第二预测目标相同或相似。另外,需要说明的是,当第二预测目标与第三预测目标不完全相同时,目标数据集中的每条目标数据记录事实上可对应于两个标记,这两个标记分别为目标数据记录关于第二预测目标的标记和目标数据记录关于第三预测目标的标记。
此外,可选地,根据本公开另一示例性实施例,在第一目标机器学习模型获得装置130通过以上描述的“通过训练的第一目标机器学习模型获得方式”获得了多个第一目标机器学习模型的情况下,第二目标机器学习模型获得装置140可通过以下操作来获得第二目标机器学习模型(以下,为描述方便,将这种获得第二目标机器学习模型的方式简称为“第二目标机器学习模型直接获得方式”):将第二目标机器学习模型的规则设置为:基于通过以下方式获取的与每条预测数据记录对应的多个预测结果来获得第二目标机器学习模型针对所述每条预测数据记录的预测结果,其中,所述方式包括:获取预测数据记录,并将预测数据记录按照数据属性字段以与划分源数据集相同的方式划分为多个子预测数据;针对每条预测数据记录之中的每个子预测数据,利用与其对应的第一目标机器学习模型执行预测以获取针对每个子预测数据的预测结果。这里,预测数据记录可与先前描述的目标数据记录和源数据记录包括相同的数据属性字段,不同之处在于预测数据记录不包括标记,并且以上已经通过示例对按照数据属性字段以与划分源数据集相同的方式划分数据记录的方式进行了描述,因此,这里不再对如何将预测数据记录划分为多个子预测数据进行赘述。这里,每个子预测数据可包括至少一个数据属性字段。另外,以上也已经对针对每个目标数据子集利用与其对应的第一目标机器学习模型执行预测以获取针对每个目标数据子集中的每条数据记录的预测结果的过程进行了描述,因此,这里不再对针对每个子预测数据利用与其对应的第一目标机器学习模型执行预测以获取针对每条预测数据记录中划分出的每个子预测数据的预测结果的过程进行赘述,不同之处仅在于这里预测过程所针对的对象是划分出的子预测数据。作为示例,基于获取的与每条预测数据记录对应的多个预测结果来获得第二目标机器学习模型针对所述每条预测数据记录的预测结果可以是对所述多个预测结果求平均、取最大值或对所述多个预测结果进行投票等方式来获得第二目标机器学习模型针对所述每条预测数据记录的预测结果。作为示例,如果所述多个预测结果为五个预测结果(即,所述多个第一目标机器学习模型的数量为五个)并且分别是交易为欺诈的概率为20%、50%、60%、70%和80%,则第二目标机器学习模型针对预测数据记录的预测结果可以是将20%、50%、60%、70%和80%求平均之后的所获得的概率值。作为另一示例,如果所述多个预测结果分别是“交易为欺诈”、“交易非欺诈”、“交易为欺诈”、“交易为欺诈”、“交易为欺诈”,则按照投票方式可获得第二目标机器学习模型针对预测数据记录的预测结果是“交易为欺诈”。
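上述对多个预测结果求平均或投票的规则，可用如下示意性代码表示（数值对应文中20%、50%、60%、70%、80%的示例）：

```python
from collections import Counter

def aggregate_by_average(probs):
    """对多个第一目标机器学习模型输出的预测概率求平均。"""
    return sum(probs) / len(probs)

def aggregate_by_vote(labels):
    """对多个离散预测结果进行多数投票。"""
    return Counter(labels).most_common(1)[0][0]

print(aggregate_by_average([0.20, 0.50, 0.60, 0.70, 0.80]))  # 0.56
print(aggregate_by_vote(["交易为欺诈", "交易非欺诈", "交易为欺诈",
                         "交易为欺诈", "交易为欺诈"]))       # 交易为欺诈
```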
需要说明的是,本公开的第二目标机器学习模型不限于通过机器学习而获得的模型,而是可以泛指对数据进行处理的任何适当的机制(例如,以上所述的综合多个预测结果来获得针对每条预测数据记录的预测结果的规则)。
如上所述，第一目标机器学习模型获得装置130在以上的"通过训练的第一目标机器学习模型获得方式"中既可利用目标数据集 $\mathcal{D}_t$ 来获得多个第一目标机器学习模型，也可利用 $\mathcal{D}_t$ 中的第一目标数据集 $\mathcal{D}_{t1}$ 来获得多个第一目标机器学习模型。在第一目标机器学习模型获得装置130在以上描述的"通过训练的第一目标机器学习模型获得方式"中利用目标数据集 $\mathcal{D}_t$ 来获得多个第一目标机器学习模型的情况下，可选地，根据本公开另一示例性实施例，第二目标机器学习模型获得装置140可针对每个第一目标数据子集，利用与其对应的第一目标机器学习模型执行预测以获取针对每个第一目标数据子集中的每条数据记录的预测结果；并且在目标数据隐私保护方式下，基于由获取的与每条目标数据记录对应的多个预测结果构成的训练样本的集合，针对第三预测目标训练第二目标机器学习模型。以上过程与先前描述的"通过训练的第二目标机器学习模型获得方式"类似，所不同的是，由于在获得第一目标机器学习模型的"通过训练的第一目标机器学习模型获得方式"中已经将目标数据集划分为了多个第一目标数据子集，因此，这里无需再进行数据集的划分，而是可直接针对每个目标数据子集，利用与其对应的第一目标机器学习模型执行预测操作，并进而基于由与目标数据集中的每条目标数据记录对应的多个预测结果构成的训练样本的集合来训练第二目标机器学习模型。对于具体的预测操作以及训练第二目标机器学习模型的过程，由于以上已经在先前"通过训练的第二目标机器学习模型获得方式"中进行过描述，因此，这里不再赘述。最终，可获得第二目标机器学习模型的参数 $w_v^*=A_1\big(\mathcal{D}_t,\,(1-p)\varepsilon_t,\,\lambda_v,\,g_v\big)$，其中，$\varepsilon_t$ 为整个目标数据隐私保护方式的隐私预算，$(1-p)\varepsilon_t$ 是与用于训练第二目标机器学习模型的目标函数所包括的噪声项对应的隐私预算。
可选地，根据本公开另一示例性实施例，在第一目标机器学习模型获得装置130在以上描述的"通过训练的第一目标机器学习模型获得方式"中利用第一目标数据集 $\mathcal{D}_{t1}$ 来获得多个第一目标机器学习模型的情况下，第二目标机器学习模型获得装置140可将第二目标数据集按照数据属性字段以与划分源数据集相同的方式划分为多个第二目标数据子集。这里，第二目标数据集可至少包括目标数据集中排除第一目标数据集之后的剩余目标数据记录，其中，第二目标数据集中的目标数据记录与源数据记录具有相同的属性字段。作为示例，第二目标数据集可仅包括目标数据集中排除第一目标数据集之后的剩余目标数据记录（即，第二目标数据集可以是以上提及的 $\mathcal{D}_{t2}$），或者，第二目标数据集除了包括目标数据集中排除第一目标数据集之后的剩余目标数据记录之外还可包括第一目标数据集中的部分目标数据记录。此外，以上已经对按照数据属性字段划分源数据集的方式进行过描述，因此，这里不再对划分第二目标数据集的操作进行赘述。随后，第二目标机器学习模型获得装置140可针对每个第二目标数据子集，利用与其对应的第一目标机器学习模型执行预测以获取针对每个第二目标数据子集中的每条数据记录的预测结果，并且在目标数据隐私保护方式下，基于由获取的与每条目标数据记录（第二目标数据集中的每条目标数据记录）对应的多个预测结果构成的训练样本的集合，针对第三预测目标训练第二目标机器学习模型。由于以上已经对针对每个目标数据子集利用与其对应的第一目标机器学习模型执行预测以获取针对每个目标数据子集中的每条数据记录的预测结果的过程进行了描述，因此，这里不再对针对每个第二目标数据子集执行预测的过程进行赘述，不同之处仅在于这里预测过程所针对的对象是每个第二目标数据子集。最终，获得的第二目标机器学习模型的参数可被表示为 $w_v^*=A_1\big(\mathcal{D}_{t2},\,\varepsilon_t,\,\lambda_v,\,g_v\big)$。
在以上各种示例性实施例中,第三预测目标可以与以上描述第一目标机器学习模型的训练时提及的第二预测目标相同或相似,例如,第二预测目标可以是预测交易是否涉嫌违法,第三预测目标可以是预测交易是否涉嫌违法或者预测交易是否为欺诈。另外,第二目标机器学习模型可以是与第一目标机器学习模型相同或不同类型的任何机器学习模型,并且,第二目标机器学习模型可用于执行业务决策。这里,所述业务决策可涉及交易反欺诈、账户开通反欺诈、智能营销、智能推荐、贷款评估之中的至少一项,但不限于此,例如,训练出的目标机器学习模型还可用于与生理状况相关的业务决策等。事实上,本公开对目标机器学习模型可被应用于的具体业务决策的类型并无任何限制,只要是适于利用机器学习模型进行决策的业务即可。
从以上所描述的获得第一目标机器学习模型的过程和获得第二目标机器学习模型的过程可知,在获得所述多个第一目标机器学习模型的过程中和/或获得第二目标机器学习模型的过程中,在目标数据隐私保护方式下利用了获取的目标数据集中所包括的多条目标数据记录中的全部或部分。
另外,如上所述,在目标数据隐私保护方式中,第一目标机器学习模型获得装置130可将用于训练第一目标机器学习模型的目标函数构造为至少包括损失函数和噪声项,并且第二目标机器学习模型获得装置140可将用于训练第二目标机器学习模型的目标函数构造为至少包括损失函数和噪声项,而所述目标数据隐私保护方式的隐私预算可取决于与用于训练第一目标机器学习模型的目标函数所包括的噪声项对应的隐私预算和与用于训练第二目标机器学习模型的目标函数所包括的噪声项对应的隐私预算两者之和或两者之中较大的隐私预算。具体地,在训练第一目标机器学习模型的过程中使用的目标数据集与训练第二目标机器学习模型的过程中使用的目标数据集完全相同或部分相同(例如,训练第一目标机器学习模型的过程中使用的目标数据集是第一目标数据集,而训练第二目标机器学习模型的过程中使用的目标数据集包括目标数据集中排除第一目标数据集之后的剩余目标数据记录以及第一目标数据集之中的部分目标数据记录)的情况下,所述目标数据隐私保护方式的隐私预算可取决于与用于训练第一目标机器学习模型的目标函数所包括的噪声项对应的隐私预算和与用于训练第二目标机器学习模型的目标函数所包括的噪声项对应的隐私预算两者之和。在训练第一目标机器学习模型的过程中使用的目标数据集与在训练第二目标机器学习模型的过程中使用的目标数据集完全不同或完全不重叠(例如,目标数据集可按照目标数据记录被划分为第一目标数据集和第二目标数据集,在训练第一目标机器学习模型的过程中使用第一目标数据集,而在训练第二目标机器学习模型的过程中使用第二目标数据集)的情况下,所述目标数据隐私保护方式的隐私预算可取决于与用于训练第一目标机器学习模型的目标函数所包括的噪声项对应的隐私预算和与用于训练第二目标机器学习模型的目标函数所包括的噪声项对应的隐私预算两者之中较大的隐私预算。
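上述隐私预算的组合规则（两阶段训练所用目标数据集重叠时取两者之和，完全不重叠时取两者之中较大者）可归纳为如下示意性辅助函数：

```python
def total_privacy_budget(eps_first: float, eps_second: float,
                         datasets_overlap: bool) -> float:
    """目标数据隐私保护方式的总隐私预算（示意）：
    - 两阶段训练所用的目标数据集完全或部分重叠时，按串行组合取两者之和；
    - 完全不重叠时，按并行组合取两者之中较大的隐私预算。"""
    return eps_first + eps_second if datasets_overlap else max(eps_first, eps_second)

# 例如：p*eps_t 与 (1-p)*eps_t 在重叠情形下合计恰为 eps_t
print(total_privacy_budget(0.3, 0.7, True))   # 1.0
print(total_privacy_budget(1.0, 1.0, False))  # 1.0
```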
以上,已经结合图1对根据本公开示例性实施例的机器学习系统100进行了描述,根据以上示例性实施例,机器学习系统100可分别在源数据隐私保护方式下将对应的一部分源数据子集中的知识成功迁移到目标数据集,并且同时可确保迁移的知识的可用性,从而使得能够进一步在目标数据隐私保护方式下综合更多知识来训练出模型效果更佳的第二目标机器学习模型,以应用于相应的业务决策。
需要说明的是,尽管以上在描述机器学习系统时将其划分为用于分别执行相应处理的装置(例如,目标数据集获取装置110、迁移项获取装置120、第一目标机器学习模型获得装置130和第二目标机器学习模型获得装置140),然而,本领域技术人员清楚的是,上述各装置执行的处理也可以在机器学习系统不进行任何具体装置划分或者各装置之间并无明确划界的情况下执行。此外,以上参照图1所描述的机器学习系统100并不限于包括以上描述的装置,而是还可以根据需要增加一些其他装置(例如,预测装 置、存储装置和/或模型更新装置等),或者以上装置也可被组合。例如,在机器学习系统100包括预测装置的情况下,预测装置可获取包括至少一个预测数据记录的预测数据集并将预测数据集按照数据属性字段以与划分源数据集相同的方式划分为多个预测数据子集,针对每个预测数据子集利用训练出的与其对应的第一目标机器学习模型执行预测以获取针对每个预测数据子集中的每条数据记录的预测结果,并且基于获取的与每条预测数据记录对应的多个预测结果来获得针对所述每条预测数据记录的预测结果。例如,可直接综合由获取的与每条预测数据记录对应的多个预测结果(例如,对所述多个预测结果求平均)来获得针对所述每条预测数据记录的预测结果,或者,可针对由获取的与每条预测数据记录对应的多个预测结果构成的预测样本利用训练好的第二目标机器学习模型执行预测来获得针对所述每条预测数据记录的预测结果。
具体地,根据示例性实施例,利用具有数据隐私保护的机器学习模型进行预测的系统(以下,为描述方便将其简称为“预测系统”)可包括目标机器学习模型获取装置、预测数据记录获取装置、划分装置和预测装置。这里,目标机器学习模型获取装置可获取以上所描述的多个第一目标机器学习模型和第二目标机器学习模型。具体地,目标机器学习模型获取装置可按照以上提及“第一目标机器学习模型直接获得方式”或“通过训练的第一目标机器学习模型获得方式”获取多个第一目标机器学习模型。相应地,目标机器学习模型获取装置可按照“通过训练的第二目标机器学习模型获得方式”或“第二目标机器学习模型直接获得方式”获取第二目标机器学习模型。也就是说,目标机器学习模型获取装置可本身执行以上描述的获得第一目标机器学习模型和第二目标机器学习模型的操作来获取多个第一目标机器学习模型和第二目标机器学习模型,在这种情况下,目标机器学习模型获取装置可相应于以上所描述的机器学习系统100。可选地,目标机器学习模型获取装置也可在机器学习系统100已经通过上述方式分别获得了多个第一目标机器学习模型和第二目标机器学习模型的情况下,从机器学习系统100直接获取所述多个第一目标机器学习模型和第二目标机器学习模型以进行后续预测。
预测数据记录获取装置可获取预测数据记录。这里,预测数据记录可与先前描述的源数据记录和目标数据记录包括相同的数据属性字段。此外,预测数据记录获取装置可实时地逐条获取预测数据记录,或者可离线地批量获取预测数据记录。划分装置可将预测数据记录划分为多个子预测数据。作为示例,划分装置可按照数据属性字段以与先前描述的划分源数据集相同的方式将预测数据记录划分为多个子预测数据,并且每个子预测数据可包括至少一个数据属性字段。以上已经结合示例对该划分方式进行了描述,因此,这里不再赘述,不同之处在于这里所划分的对象是预测数据记录。
预测装置可针对每条预测数据记录之中的每个子预测数据,利用与其对应的第一目标机器学习模型执行预测以获取针对每个子预测数据的预测结果。例如,如果子预测数据包括性别和历史信用记录这两个数据属性字段,则基于与该子预测数据包括相同数据属性字段的数据记录的集合(即,以上提及的第一目标数据子集)训练出的第一目标机器学习模型便为与该子数据记录对应的第一目标机器学习模型。此外,这里的预测结果可以是例如置信度值,但不限于此。
随后,预测装置可将由多个第一目标机器学习模型获取的与每条预测数据记录对应的多个预测结果输入第二目标机器学习模型,以得到针对所述每条预测数据记录的预测结果。例如,预测装置可按照设置的第二目标机器学习模型的规则基于所述多个预测结果来得到第二目标机器学习模型针对所述每条预测数据记录的预测结果,例如,通过对所述多个预测结果求平均、取最大值或进行投票来获得针对所述每条预测数据记录的预测结果。可选地,预测装置可利用事先训练的第二目标机器学习模型(具体训练过程参见先前描述的训练第二目标机器学习模型的相关描述)针对由所述多个预测结果构成的预测样本执行预测来获得针对所述每条预测数据记录的预测结果。
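上述预测系统的整体流程（划分预测数据记录、逐子预测数据执行预测、再由第二目标机器学习模型汇总）可用如下草图概括，其中的函数与参数名均为沿用前文草图的假设：

```python
import numpy as np

def predict_record(record: dict, field_groups, params, w_v):
    """对单条预测数据记录执行预测（示意）：
    1) 按数据属性字段将记录划分为多个子预测数据；
    2) 每个子预测数据由与其对应的第一目标机器学习模型给出预测结果（置信度）；
    3) 将多个预测结果输入第二目标机器学习模型，得到最终预测结果。"""
    sub_preds = []
    for cols, u_k in zip(field_groups, params):
        x_k = np.array([record[c] for c in cols], dtype=float)  # 第k个子预测数据
        sub_preds.append(1.0 / (1.0 + np.exp(-x_k @ u_k)))      # 第k个预测结果
    z = np.asarray(sub_preds) @ w_v              # 第二目标机器学习模型（逻辑回归）
    return 1.0 / (1.0 + np.exp(-z))
```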
根据本公开示例性实施例的预测系统可通过在划分预测数据记录之后,利用多个第一目标机器学习模型执行预测以获得与每条预测数据记录对应的多个预测结果,并进一步基于多个预测结果利用第二目标机器学习模型获得最终的预测结果,从而可提高模型预测效果。
另外,需要说明的是,本公开中所提及的“机器学习”可被实现为“有监督学习”、“无监督学习”或“半监督学习”的形式,本发明的示例性实施例对具体的机器学习形式并不进行特定限制。
图2是示出根据本公开示例性实施例的在数据隐私保护方式下执行机器学习的方法(以下,为描述方便,将其简称为“机器学习方法”)的流程图。
这里,作为示例,图2所示的机器学习方法可由图1所示的机器学习系统100来执行,也可完全通过计算机程序或指令以软件方式实现,还可通过特定配置的计算系统或计算装置来执行,例如,可通过包括至少一个计算装置和至少一个存储指令的存储装置的系统来执行,其中,所述指令在被所述至少一个计算装置运行时,促使所述至少一个计算装置执行上述机器学习方法。为了描述方便,假设图2所示的方法由图1所示的机器学习系统100来执行,并假设机器学习系统100可具有图1所示的配置。
参照图2，在步骤S210，目标数据集获取装置110可获取包括多条目标数据记录的目标数据集。以上在参照图1描述目标数据集获取装置110时描述的与获取目标数据集有关的任何内容均适用于此，因此，这里不再对其进行赘述。
在获取到目标数据集之后,在步骤S220,迁移项获取装置120可获取关于源数据集的多个迁移项,这里,所述多个迁移项之中的每个迁移项可用于在源数据隐私保护下将对应的一部分源数据集的知识迁移到目标数据集,作为示例,所述对应的一部分源数据集可以是通过将源数据集按照数据属性字段划分而获得的源数据子集。关于源数据集、迁移项和对应的源数据子集以及源数据集的划分方式等内容,已经在描述图1的迁移项获取装置120时进行过描述,这里不再赘述。
具体地,在步骤S220,迁移项获取装置120可从外部接收关于源数据集的多个迁移项。或者,迁移项获取装置120可通过自身对源数据集执行机器学习处理来获取关于源数据集的多个迁移项。具体地,迁移项获取装置120可首先获取包括多条源数据记录的源数据集,这里,源数据记录和目标数据记录可包括相同的数据属性字段。随后,迁移项获取装置120可将源数据集按照数据属性字段划分为多个源数据子集,其中,每个源数据子集中的数据记录包括至少一个数据属性字段。接下来,迁移项获取装置120可在源数据隐私保护方式下,基于每个源数据子集,针对第一预测目标训练与每个源数据子集对应的源机器学习模型,并将训练出的每个源机器学习模型的参数作为与每个源数据子集相关的迁移项。
这里,作为示例,源数据隐私保护方式可以是遵循差分隐私保护定义的保护方式,但不限于此。另外,源数据隐私保护方式可以是在基于源数据集执行与机器学习相关的处理的过程中添加随机噪声,以实现对源数据的隐私保护。例如,源数据隐私保护方式可以是在训练源机器学习模型的过程中添加随机噪声。根据示例性实施例,在所述源数据隐私保护方式中可将用于训练源机器学习模型的目标函数构造为至少包括损失函数和噪声项。这里,噪声项用于在训练源机器学习模型的过程中添加随机噪声,从而实现对源数据隐私保护。此外,可选地,在所述源数据隐私保护方式中还可将目标函数构造为包括其他用于约束模型参数的约束项。根据示例性实施例,源机器学习模型可以是广义线性模型(例如,逻辑回归模型),但不限于此,例如,可以是满足预定条件的任何线性模型,甚至还可以是满足一定条件的任何适当模型。
以上在参照图1描述迁移项获取装置120时已经对获取迁移项的细节进行过描述,因此这里不再赘述。此外,需要说明的是,参照图1在描述迁移项获取装置120时提及的关于源数据隐私保护方式、目标函数等的所有描述均适用于图2,因此,这里不再赘述。
在获得了关于源数据集的多个迁移项之后,在步骤S230,第一目标机器学习模型获得装置130可分别利用所述多个迁移项之中的每个迁移项来获得与每个迁移项对应的第一目标机器学习模型,以获得多个第一目标机器学习模型。
随后,在步骤S240,第二目标机器学习模型获得装置140可利用在步骤S230获得的所述多个第一目标机器学习模型来获得第二目标机器学习模型。这里,作为示例,目标数据隐私保护方式也可以是遵循差分隐私定义的保护方式,但不限于此,而是可以是与源数据隐私保护方式相同或不同的其他数据隐私保护方式。此外,目标数据隐私保护方式可以是在获得第一目标机器学习模型和/或第二目标机器学习模型的过程中添加随机噪声。
以下,将参照图3至图6详细描述根据本公开示例性实施例的在数据隐私保护下执行机器学习的方法的示例。
图3是示出根据本公开第一示例性实施例的在数据隐私保护方式下执行机器学习的方法的示意图。
具体地，根据本公开第一示例性实施例，在步骤S220，将获取的源数据集按照数据属性字段划分为多个源数据子集。例如，参照图3，$D_s$ 是源数据集，其被按照数据属性字段划分为四个源数据子集 $D_s^1$、$D_s^2$、$D_s^3$ 和 $D_s^4$。随后，在源数据隐私保护方式下，基于每个源数据子集，针对第一预测目标训练与每个源数据子集对应的源机器学习模型，并将训练出的每个源机器学习模型的参数作为与每个源数据子集相关的迁移项。在图3中，$w_{s1}^*$、$w_{s2}^*$、$w_{s3}^*$ 和 $w_{s4}^*$ 分别是与源数据子集 $D_s^1$、$D_s^2$、$D_s^3$ 和 $D_s^4$ 对应的源机器学习模型的参数，并被分别作为与源数据子集 $D_s^1$、$D_s^2$、$D_s^3$ 和 $D_s^4$ 相关的迁移项。
在步骤S230，第一目标机器学习模型获得装置130可在不使用目标数据集的情况下，直接将每个迁移项作为与其对应的第一目标机器学习模型的参数。例如，参照图3，$u_1$、$u_2$、$u_3$ 和 $u_4$ 分别是与迁移项 $w_{s1}^*$、$w_{s2}^*$、$w_{s3}^*$ 和 $w_{s4}^*$ 对应的第一目标机器学习模型的参数，并且 $u_1=w_{s1}^*$，$u_2=w_{s2}^*$，$u_3=w_{s3}^*$，$u_4=w_{s4}^*$。
随后，在步骤S240，第二目标机器学习模型获得装置140可将目标数据集按照数据属性字段以与划分源数据集相同的方式划分为多个目标数据子集，其中，每个目标数据子集和与其对应的源数据子集中的数据记录包括相同的数据属性字段。例如，参照图3，按照与划分源数据集 $D_s$ 相同的划分方式，可将目标数据集 $D_t$ 划分为四个目标数据子集 $D_t^1$、$D_t^2$、$D_t^3$ 和 $D_t^4$，其中，$D_t^1$ 和 $D_s^1$ 中的数据记录包括相同的数据属性字段，类似的，$D_t^2$ 和 $D_s^2$ 中的数据记录包括相同的数据属性字段，$D_t^3$ 和 $D_s^3$ 中的数据记录包括相同的数据属性字段，$D_t^4$ 和 $D_s^4$ 中的数据记录包括相同的数据属性字段。随后，在步骤S240中，第二目标机器学习模型获得装置140可针对每个目标数据子集，利用与其对应的第一目标机器学习模型执行预测以获取针对每个目标数据子集中的每条数据记录的预测结果。例如，参照图3，针对目标数据子集 $D_t^1$、$D_t^2$、$D_t^3$ 和 $D_t^4$，分别利用参数为 $u_1$、$u_2$、$u_3$ 和 $u_4$ 的第一目标机器学习模型执行预测，其中，$p_1$ 是利用参数为 $u_1$ 的第一目标机器学习模型针对目标数据子集 $D_t^1$ 执行预测的预测结果集，其包括参数为 $u_1$ 的第一目标机器学习模型针对 $D_t^1$ 中的每条数据记录的预测结果。类似地，$p_2$、$p_3$ 和 $p_4$ 分别是利用参数为 $u_2$ 的第一目标机器学习模型、参数为 $u_3$ 的第一目标机器学习模型和参数为 $u_4$ 的第一目标机器学习模型执行预测的预测结果集。接下来，在步骤S240，第二目标机器学习模型获得装置140可在目标数据隐私保护方式下，基于由获取的与每条目标数据记录对应的多个预测结果构成的训练样本的集合，针对第三预测目标训练第二目标机器学习模型。例如，对于目标数据集 $D_t$ 中的每条目标数据记录，在预测结果集 $p_1$、$p_2$、$p_3$ 和 $p_4$ 中均有与其对应的一个预测结果，而这四个预测结果便可构成与每条目标数据记录对应的训练样本，而这样的训练样本的集合可用于在目标数据隐私保护下针对第三预测目标训练第二目标机器学习模型。
如图3所示,在获得第二目标机器学习模型的过程中,在目标数据隐私保护方式下利用了所述多条目标数据记录中的全部数据记录。
图4示出根据本公开第二示例性实施例的在数据隐私保护方式下执行机器学习的方法的示意图。
图4与图3的不同在于步骤S230和步骤S240。具体地，在第二示例性实施例中，在步骤S220获取到关于源数据集的多个迁移项（例如，$w_{s1}^*$、$w_{s2}^*$、$w_{s3}^*$ 和 $w_{s4}^*$）之后，在步骤S230，第一目标机器学习模型获得装置130可将目标数据集按照数据属性字段以与划分源数据集相同的方式划分为多个第一目标数据子集，其中，每个第一目标数据子集和与其对应的源数据子集中的数据记录包括相同的数据属性字段。参照图4，例如，可将目标数据集 $D_t$ 按照与划分源数据集 $D_s$ 相同的方式划分为四个第一目标数据子集 $D_t^1$、$D_t^2$、$D_t^3$ 和 $D_t^4$，其中，$D_t^1$、$D_t^2$、$D_t^3$ 和 $D_t^4$ 分别对应于源数据子集 $D_s^1$、$D_s^2$、$D_s^3$ 和 $D_s^4$。
随后，第一目标机器学习模型获得装置130可在目标数据隐私保护方式下，基于每个第一目标数据子集，结合和与每个第一目标数据子集对应的源数据子集相关的迁移项，针对第二预测目标训练与该迁移项对应的第一目标机器学习模型。例如，参照图4，基于每个第一目标数据子集 $D_t^1$、$D_t^2$、$D_t^3$ 和 $D_t^4$，分别结合迁移项 $w_{s1}^*$、$w_{s2}^*$、$w_{s3}^*$ 和 $w_{s4}^*$ 来针对第二预测目标训练与各迁移项对应的第一目标机器学习模型。如图4中所示，基于第一目标数据子集 $D_t^1$ 结合迁移项 $w_{s1}^*$ 训练出的第一目标机器学习模型的参数是 $u_1^*$，基于第一目标数据子集 $D_t^2$ 结合迁移项 $w_{s2}^*$ 训练出的第一目标机器学习模型的参数是 $u_2^*$，基于第一目标数据子集 $D_t^3$ 结合迁移项 $w_{s3}^*$ 训练出的第一目标机器学习模型的参数是 $u_3^*$，基于第一目标数据子集 $D_t^4$ 结合迁移项 $w_{s4}^*$ 训练出的第一目标机器学习模型的参数是 $u_4^*$。
接下来，在步骤S240，第二目标机器学习模型获得装置140可将第二目标机器学习模型的规则设置为：基于通过以下方式获取的与每条预测数据记录对应的多个预测结果来获得第二目标机器学习模型针对所述每条预测数据记录的预测结果，其中，所述方式包括：获取预测数据记录，并将预测数据记录按照数据属性字段以与划分源数据集相同的方式划分为多个子预测数据；针对每条预测数据记录之中的每个子预测数据，利用与其对应的第一目标机器学习模型执行预测以获取针对每个子预测数据的预测结果。这里，预测数据记录可以是实时预测或批量预测时需要进行预测的数据记录。参照图4，例如，获取的预测数据记录 $D_p$ 被按照与划分源数据集相同的方式划分为四个子预测数据 $D_p^1$、$D_p^2$、$D_p^3$ 和 $D_p^4$，并且与 $D_p^1$、$D_p^2$、$D_p^3$ 和 $D_p^4$ 对应的第一目标机器学习模型的参数分别为 $u_1^*$、$u_2^*$、$u_3^*$ 和 $u_4^*$。随后，第二目标机器学习模型获得装置140可针对每个子预测数据，利用与其对应的第一目标机器学习模型执行预测以获取针对每个子预测数据的预测结果。例如，参照图4，针对子预测数据 $D_p^1$，可利用参数为 $u_1^*$ 的第一目标机器学习模型执行预测来获得预测结果 $p_1$。类似地，$p_2$、$p_3$ 和 $p_4$ 分别是利用参数为 $u_2^*$ 的第一目标机器学习模型针对 $D_p^2$ 执行预测的预测结果、利用参数为 $u_3^*$ 的第一目标机器学习模型针对 $D_p^3$ 执行预测的预测结果、以及利用参数为 $u_4^*$ 的第一目标机器学习模型针对 $D_p^4$ 执行预测的预测结果。第二目标机器学习模型获得装置140可将第二目标机器学习模型的规则设置为：基于获取的与每条预测数据记录对应的多个预测结果来获得第二目标机器学习模型针对所述每条预测数据记录的预测结果。例如，通过将与每条预测数据记录对应的以上四个预测结果求平均可获得第二目标机器学习模型针对每条预测数据记录的预测结果，但获得第二目标机器学习模型针对每条预测数据记录的预测结果的方式不限于此，例如，还可以通过投票的方式来获得第二目标机器学习模型针对每条预测数据记录的预测结果。
如图4所示,在获得第一目标机器学习模型的过程中,在目标数据隐私保护方式下利用了目标数据集中的所述多条目标数据记录中的全部。
图5是示出根据本公开第三示例性实施例的在数据隐私保护方式下执行机器学习的方法的示意图。
图5中在步骤S220获得关于源数据的多个迁移项的方式以及在步骤S230获得多个第一目标机器学习模型的方式与图4完全相同，这里不再赘述。与图4不同的是，在图5的示例性实施例中，在步骤S240，第二目标机器学习模型获得装置140可直接针对在步骤S230划分出的每个第一目标数据子集，利用与其对应的第一目标机器学习模型执行预测以获取针对每个第一目标数据子集中的每条数据记录的预测结果，并且，在目标数据隐私保护方式下，基于由获取的与每条目标数据记录对应的多个预测结果构成的训练样本的集合，针对第三预测目标训练第二目标机器学习模型。例如，参照图5，针对第一目标数据子集 $D_t^1$，利用参数为 $u_1^*$ 的第一目标机器学习模型执行预测来获得预测结果集 $p_1$，其中，$p_1$ 包括参数为 $u_1^*$ 的第一目标机器学习模型针对 $D_t^1$ 中的每条数据记录的预测结果。类似地，针对第一目标数据子集 $D_t^2$，利用参数为 $u_2^*$ 的第一目标机器学习模型执行预测来获得预测结果集 $p_2$；针对第一目标数据子集 $D_t^3$，利用参数为 $u_3^*$ 的第一目标机器学习模型执行预测来获得预测结果集 $p_3$；针对第一目标数据子集 $D_t^4$，利用参数为 $u_4^*$ 的第一目标机器学习模型执行预测来获得预测结果集 $p_4$。此外，对于目标数据集 $D_t$ 中的每条目标数据记录，在预测结果集 $p_1$、$p_2$、$p_3$ 和 $p_4$ 中均有与其对应的一个预测结果，而这四个预测结果便可构成与每条目标数据记录对应的训练样本，而这样的训练样本的集合可用于在目标数据隐私保护下针对第三预测目标训练第二目标机器学习模型。
如图5所示,在获得第一目标机器学习模型的过程中和获得第二目标机器学习模型的过程中,在目标数据隐私保护方式下利用了在步骤S210获取的目标数据集中的多条目标数据记录中的全部。
图6是示出根据本公开第四示例性实施例的在数据隐私保护方式下执行机器学习的方法的示意图。
与图5不同的是，在图6的示例性实施例中，在获得第一目标机器学习模型的步骤S230中，并不是将目标数据集按照数据属性字段以与划分源数据集相同的方式划分为多个第一目标数据子集，而是将目标数据集中的第一目标数据集（例如，图6中的 $D_{t1}$）按照数据属性字段以与划分源数据集相同的方式划分为多个第一目标数据子集（例如，图6中的 $D_{t1}^1$、$D_{t1}^2$、$D_{t1}^3$ 和 $D_{t1}^4$），其中，第一目标数据集可包括目标数据集中所包括的部分目标数据记录，每个第一目标数据子集和与其对应的源数据子集中的数据记录包括相同的数据属性字段。随后，第一目标机器学习模型获得装置130可在目标数据隐私保护方式下，基于每个第一目标数据子集，结合和与每个第一目标数据子集对应的源数据子集相关的迁移项，针对第二预测目标训练与该迁移项对应的第一目标机器学习模型。接下来，与图5不同，在图6的示例性实施例中，在步骤S240，第二目标机器学习模型获得装置140在利用所述多个第一目标机器学习模型获得第二目标机器学习模型的过程中并非使用与在步骤S230中使用的目标数据集完全相同的目标数据集，而是使用与第一目标数据集不同的第二目标数据集。具体地，在步骤S240，第二目标机器学习模型获得装置140可将第二目标数据集（例如，图6中的 $D_{t2}$）按照数据属性字段以与划分源数据集相同的方式划分为多个第二目标数据子集（例如，图6中的 $D_{t2}^1$、$D_{t2}^2$、$D_{t2}^3$ 和 $D_{t2}^4$）。这里，第二目标数据集不同于第一目标数据集并至少包括目标数据集中排除第一目标数据集之后的剩余目标数据记录。随后，第二目标机器学习模型获得装置140针对每个第二目标数据子集，利用与其对应的第一目标机器学习模型执行预测以获取针对每个第二目标数据子集中的每条数据记录的预测结果，最后，在目标数据隐私保护方式下，基于由获取的与每条目标数据记录对应的多个预测结果构成的训练样本的集合，针对第三预测目标训练第二目标机器学习模型。
如图6所示,在获得第一目标机器学习模型的过程和获得第二目标机器学习模型的过程中,在目标数据隐私保护方式下利用了在步骤S210获取的目标数据集中的所述多条目标数据记录中的部分。
综上所述,在获得所述多个第一目标机器学习模型的过程中和/或获得第二目标机器学习模型的过程中,在目标数据隐私保护方式下利用了目标数据集中的所述多条目标数据记录中的全部或部分。
此外,在以上示例性实施例中提及的目标数据隐私保护方式中,可将用于训练第一目标机器学习模型的目标函数和/或用于训练第二目标机器学习模型的目标函数构造为至少包括损失函数和噪声项,而所述目标数据隐私保护方式的隐私预算可取决于与用于训练第一目标机器学习模型的目标函数所包括的噪声项对应的隐私预算和与用于训练第二目标机器学习模型的目标函数所包括的噪声项对应的隐私预算两者之和或两者之中较大的隐私预算。具体地,在训练第一目标机器学习模型的过程中使用的目标数据集与在训练第二目标机器学习模型的过程中使用的目标数据集完全重叠或部分重叠的情况下,目标数据隐私保护方式的隐私预算可取决于与用于训练第一目标机器学习模型的目标函数所包括的噪声项对应的隐私预算和与用于训练第二目标机器学习模型的目标函数所包括的噪声项对应的隐私预算两者之和。然而,在训练第一目标机器学习模型的过程中使用的目标数据集与在训练第二目标机器学习模型的过程中使用的目标数据集完全不重叠的情况下,目标数据隐私保护方式的隐私预算可取决于与用于训练第一目标机器学习模型的目标函数所包括的噪声项对应的隐私预算和与用于训练第二目标机器学习模型的目标函数所包括的噪声项对应的隐私预算两者之中较大的隐私预算。例如,在以上描述的图5的示例性实施例中目标数据隐私保护方式的隐私预算取决于以上两者之和,在图6的示例性实施例中目标数据隐私保护方式的隐私预算取决于以上两者之中较大的隐私预算。
此外,源机器学习模型和第一目标机器学习模型可属于相同类型的机器学习模型,并且/或者,第一预测目标和第二预测目标相同或相似。作为示例,所述相同类型的机器学习模型为逻辑回归模型。在这种情况下,在步骤S230中,可通过以下方式来训练第一目标机器学习模型:将用于训练第一目标机器学习模型的目标函数构造为至少包括损失函数和噪声项并反映第一目标机器学习模型的参数与对应于 该第一目标机器学习模型的迁移项之间的差值;在目标数据隐私保护方式下,基于每个第一目标数据子集,结合和与每个第一目标数据子集对应的源数据子集相关的迁移项,通过求解构造的目标函数来针对第二预测目标训练与该迁移项对应的第一目标机器学习模型。
另外,根据示例性实施例,第一目标机器学习模型和第二目标机器学习模型可属于相同类型的机器学习模型,并且/或者,第二预测目标和第三预测目标可以相同或相似。另外,在本公开中,第二目标机器学习模型可用于执行业务决策。作为示例,所述业务决策可涉及交易反欺诈、账户开通反欺诈、智能营销、智能推荐、贷款评估之中的至少一项,但不限于此。
以上描述的根据本公开示例性实施例的在数据隐私保护方式下执行机器学习的方法,既可以确保源数据隐私和目标数据隐私的不被泄露,同时能够通过多个迁移项将源数据集上的知识迁移到目标数据集,并且由于每个迁移项仅用于将对应的一部分源数据集的知识迁移到目标数据集,使得在源数据隐私保护方式下获得第一目标机器学习模型的过程中为了实现源数据隐私而添加的噪声相对小,从而既可保证迁移项的可用性,又可有效地将知识迁移到目标数据集。相应地,在目标数据隐私保护下获得第二目标机器学习模型的过程中为实现目标数据隐私保护而添加的噪声也会相对小,从而既实现了目标数据隐私,又可获得模型效果更佳的目标机器学习模型。
需要说明的是，尽管以上在描述图2时，按顺序对图2中的步骤进行了描述，但是，本领域技术人员清楚的是，上述方法中的各个步骤不一定按顺序执行，而是可按照相反的顺序或并行地执行，例如，以上描述的步骤S210与步骤S220便可按照相反顺序或并行执行，也就是说，可在获取目标数据集之前获取关于源数据集的多个迁移项，或者可同时获取目标数据集和迁移项。另外，在执行步骤S230的同时，也可执行步骤S210或步骤S220，也就是说，在获得第一目标机器学习模型的过程中，可同时获取新的目标数据集或迁移项，以用于例如后续目标机器学习模型的更新操作等。此外，尽管以上仅参照图3至图6描述了根据本公开的机器学习方法的四个示例性实施例，但是根据本公开的机器学习方法并不限于以上示例性实施例，而是可通过适当的变形获得更多的示例性实施例。
此外，根据本公开另一示例性实施例，可提供一种利用具有数据隐私保护的机器学习模型进行预测的方法（为便于描述，将该方法简称为"预测方法"）。作为示例，该预测方法可由以上描述的"预测系统"来执行，也可完全通过计算机程序或指令以软件方式实现，还可通过特定配置的计算系统或计算装置来执行。为描述方便，假设"预测方法"由上述"预测系统"执行，并假设预测系统包括目标机器学习模型获取装置、预测数据记录获取装置、划分装置和预测装置。
具体地,目标机器学习模型获取装置可在以上描述的步骤S240之后,获取通过上述步骤S210至S240已经获得的多个第一目标机器学习模型和第二目标机器学习模型。可选地,目标机器学习模型获取装置也可本身通过执行步骤S210至S240来获得多个第一目标机器学习模型和第二目标机器学习模型,关于获得第一目标机器学习模型和第二目标机器学习模型的具体方式,以上已经参照图2至图6进行过描述,因此这里不再赘述。也就是说,这里的“预测方法”既可以是上述“机器学习方法”的继续,也可以是完全独立的预测方法。
在获取到多个第一目标机器学习模型和第二目标机器学习模型之后,预测数据记录获取装置可获取预测数据记录。这里,预测数据记录可与先前描述的源数据记录和目标数据记录包括相同的数据属性字段。此外,预测数据记录获取装置可实时地逐条获取预测数据记录,并且可离线地批量获取预测数据记录。接下来,划分装置可将预测数据记录划分为多个子预测数据。作为示例,划分装置可按照数据属性字段以与先前描述的划分源数据集相同的方式将预测数据记录划分为多个子预测数据,并且每个子预测数据可包括至少一个数据属性字段。随后,预测装置可针对每条预测数据记录之中的每个子预测数据,利用与其对应的第一目标机器学习模型执行预测以获取针对每个子预测数据的预测结果。最后,预测装置可将由多个第一目标机器学习模型获取的与每条预测数据记录对应的多个预测结果输入第二目标机器学习模型,以得到针对所述每条预测数据记录的预测结果。
根据以上预测方法,通过在划分预测数据记录之后利用多个第一目标机器学习模型执行预测以获得与每条预测数据记录对应的多个预测结果,并进一步基于获得的多个预测结果利用第二机器学习模型获得最终的预测结果,从而可提高模型预测效果。
图7是示出根据本公开示例性实施例的在数据隐私保护方式下执行机器学习的构思的示意图。
为便于更清楚且直观地理解本公开的构思,以下结合图7以金融领域中的贷款审核场景为例(即,目标机器学习模型将用于贷款审核这一业务决策),对根据本公开示例性实施例的在数据隐私保护下执行机器学习的构思进行简要描述。
如今,随着机器学习的不断发展,其在金融领域开始发挥着日益重要的作用,从审批贷款到资产管理,再到风险评估和信贷反欺诈等,机器学习在金融生态系统的许多阶段都起着不可或缺的作用。例如,银行可利用机器学习来决定是否批准贷款申请者的贷款申请。但是,单个银行自身所能获得的关于贷款申请者的历史金融活动相关记录可能并不足以全面地反映该贷款申请者的真实信用或贷款偿还能力等情 况,在这种情况下,该银行可能期望能够获得该贷款申请者在其他机构的历史金融活动相关记录。然而,出于客户隐私保护的考虑,该银行很难利用其他机构所拥有的贷款申请者的历史金融活动相关记录。然而,根据利用本公开的构思则可实现在用户数据保护隐私的情况下充分利用多个机构的数据来帮助银行更准确地判断是否批准贷款申请者的贷款申请,进而减少金融风险。
参照图7,目标数据源710(例如,第一银行机构)可将其拥有的涉及用户历史金融活动的包括多条目标数据记录的目标数据集发送给机器学习系统730。这里,每条目标数据记录可包括例如用户的姓名、国籍、职业、薪酬、财产、信用记录、历史贷款金额的多个数据属性字段,但不限于此。此外,每条目标数据记录还可包括例如关于用户是否按时清偿贷款的标记信息。
这里,机器学习系统730可以是以上参照图1描述的机器学习系统100。作为示例,机器学习系统730可以由专门提供机器学习服务的实体(例如,机器学习服务提供商)提供,或者也可由目标数据源710自己构建。相应地,机器学习系统730既可设置在云端(如公有云、私有云或混合云),也可以设置在银行机构的本地系统。这里,为描述方便,假设机器学习系统730被设置在公有云端,并且由机器学习服务提供商构建。
为了更准确地预测用户的贷款风险指数或者用户的贷款偿还能力,第一银行机构可例如与源数据源720(例如,第二机构)达成彼此在保护用户数据隐私的情况下共享数据的协议。在这种情况下,基于该协议,作为示例,在相应安全措施下,源数据源720可将其所拥有的包括多条源数据记录的源数据集发送给机器学习系统730,这里,源数据集例如可以是与以上描述的目标数据集类似的涉及用户金融活动的数据集,并且源数据记录和目标数据记录可包括相同的数据属性字段,例如,源数据记录也可包括例如用户的姓名、国籍、职业、薪酬、财产、信用记录、历史贷款金额的多个数据属性字段。根据在本公开的构思,机器学习系统730可如以上参照图1至6所述将源数据集按照数据属性字段划分为多个源数据子集,并在源数据隐私保护方式下,基于每个源数据子集针对第一预测目标训练对应的源机器学习模型,并将训练出的每个源机器学习模型的参数作为与每个源数据子集相关的迁移项。这里,源机器学习模型可以是例如用于预测用户贷款风险指数或贷款清偿能力的机器学习模型或者其他类似预测目标的机器学习模型,或者是与贷款估计业务相关的针对其他预测目标的机器学习模型。
或者,机器学习系统730也可从源数据源720直接获取迁移项。在这种情况下,例如,源数据源720可事先通过其自身的机器学习系统或者委托其他机器学习服务提供商在源数据隐私保护方式下基于通过按照数据属性字段划分源数据集而获得的每个源数据子集执行机器学习相关处理来获取与每个源数据子集相关的迁移项,并将多个迁移项发送给机器学习系统730。可选地,源数据源720也可选择将源数据集/多个迁移项发送给目标数据源710,然后,由目标数据源710将源数据集/多个迁移项与目标数据集一起提供给机器学习系统730,以用于后续机器学习。
随后,机器学习系统730可分别利用所述多个迁移项之中的每个迁移项来获得与每个迁移项对应的第一目标机器学习模型,以获得多个第一目标机器学习模型。例如,第一目标机器学习模型也可以是用于预测用户贷款风险指数或贷款清偿能力的机器学习模型。然后,机器学习系统730可进一步利用所述多个第一目标机器学习模型获得第二目标机器学习模型。具体地获得第一目标机器学习模型和第二目标机器学习模型的方式可参见图1至图6的描述。这里,第二目标机器学习模型可与第一目标机器学习模型属于相同类型的机器学习模型。例如,第二目标机器学习模型可以是用于预测用户贷款风险指数或贷款清偿能力的机器学习模型,或者可以是用于预测用户贷款行为是否涉嫌欺诈的机器学习模型。根据本公开的构思,如以上参照图1至图6所述,在获得多个第一目标机器学习模型的过程中和/或获得第二目标机器学习模型的过程中,在目标数据隐私保护方式下利用了目标数据集中的多条目标数据记录中的全部或部分。
在获得了目标机器学习模型(包括第一目标机器学习模型和第二目标机器学习模型)之后,目标数据源710可将涉及至少一个贷款申请者的包括至少一条预测数据记录的预测数据集发送给机器学习系统730。这里,预测数据记录可与以上提及的源数据记录和目标数据记录包括相同的数据属性字段,例如,也可包括用户的姓名、国籍、职业、薪酬、财产、信用记录、历史贷款金额的多个数据属性字段。机器学习系统730可将预测数据集按照数据属性字段以与划分源数据集相同的方式划分为多个预测数据子集,并且针对每个预测数据子集,利用与其对应的第一目标机器学习模型执行预测以获取针对每个预测数据子集中的每条数据记录的预测结果。随后,机器学习系统730可基于获取的与每条预测数据记录对应的多个预测结果来获得第二目标机器学习模型针对所述每条预测数据记录的预测结果。或者,可选地,机器学习系统730可在目标数据隐私保护方式下,利用训练出的第二目标机器学习模型执行预测来提供针对由获取的与每条预测数据记录对应的多个预测结果构成的预测样本的预测结果。这里,预测结果可以是每个贷款申请者的贷款风险指数或贷款清偿能力评分,或者可以是每个贷款申请者的贷款行为是否涉嫌欺诈。此外,机器学习系统730可将预测结果反馈给目标数据源710。随后,目标数据源710可基于接收到的预测结果判断是否批准贷款申请者提出的贷款申请。通过以上方式,银行机构可以利用机器学 习在保护用户数据隐私的同时利用其他机构的数据和自身拥有的数据获得更准确的判断结果,从而可避免不必要的金融风险。
需要说明的是,尽管以上以机器学习在金融领域中的贷款估计应用为例介绍了本公开的构思,但是,本领域人员清楚的是,根据本公开示例性实施例的在数据隐私保护下执行机器学习的方法和系统不限于应用于金融领域,也不限于用于执行贷款估计这样的业务决策。而是,可应用于任何涉及数据安全和机器学习的领域和业务决策。例如,根据本公开示例性实施例的在数据隐私保护下执行机器学习的方法和系统还可应用于交易反欺诈、账户开通反欺诈、智能营销、智能推荐、以及公共卫生领域中生理数据的预测等。
以上已参照图1至图7描述了根据本公开示例性实施例的机器学习方法和机器学习系统。然而,应理解的是:附图中示出的装置和系统可被分别配置为执行特定功能的软件、硬件、固件或上述项的任意组合。例如,这些系统、装置可对应于专用的集成电路,也可对应于纯粹的软件代码,还可对应于软件与硬件相结合的模块。此外,这些系统或装置所实现的一个或多个功能也可由物理实体设备(例如,处理器、客户端或服务器等)中的组件来统一执行。
此外,上述方法可通过记录在计算机可读存储介质上的指令来实现,例如,根据本申请的示例性实施例,可提供一种存储指令的计算机可读存储介质,其中,当所述指令被至少一个计算装置运行时,促使所述至少一个计算装置执行以下步骤:获取包括多条目标数据记录的目标数据集;获取关于源数据集的多个迁移项,其中,所述多个迁移项之中的每个迁移项用于在源数据隐私保护下将对应的一部分源数据集的知识迁移到目标数据集;分别利用所述多个迁移项之中的每个迁移项来获得与每个迁移项对应的第一目标机器学习模型,以获得多个第一目标机器学习模型;利用所述多个第一目标机器学习模型获得第二目标机器学习模型,其中,在获得所述多个第一目标机器学习模型的过程中和/或获得第二目标机器学习模型的过程中,在目标数据隐私保护方式下利用了所述多条目标数据记录中的全部或部分。
上述计算机可读存储介质中存储的指令可在诸如客户端、主机、代理装置、服务器等计算机设备中部署的环境中运行,应注意,所述指令还可用于执行除了上述步骤以外的附加步骤或者在执行上述步骤时执行更为具体的处理,这些附加步骤和进一步处理的内容已经在参照图2至图6进行机器学习方法的描述过程中提及,因此这里为了避免重复将不再进行赘述。
应注意,根据本公开示例性实施例的机器学习系统可完全依赖计算机程序或指令的运行来实现相应的功能,即,各个装置在计算机程序的功能架构中与各步骤相应,使得整个系统通过专门的软件包(例如,lib库)而被调用,以实现相应的功能。
另一方面,当图1所示的系统和装置以软件、固件、中间件或微代码实现时,用于执行相应操作的程序代码或者代码段可以存储在诸如存储介质的计算机可读介质中,使得至少一个处理器或至少一个计算装置可通过读取并运行相应的程序代码或者代码段来执行相应的操作。
例如,根据本申请示例性实施例,可提供一种包括至少一个计算装置和至少一个存储指令的存储装置的系统,其中,所述指令在被所述至少一个计算装置运行时,促使所述至少一个计算装置执行下述步骤:获取包括多条目标数据记录的目标数据集;获取关于源数据集的多个迁移项,其中,所述多个迁移项之中的每个迁移项用于在源数据隐私保护下将对应的一部分源数据集的知识迁移到目标数据集;分别利用所述多个迁移项之中的每个迁移项来获得与每个迁移项对应的第一目标机器学习模型,以获得多个第一目标机器学习模型;利用所述多个第一目标机器学习模型获得第二目标机器学习模型,其中,在获得所述多个第一目标机器学习模型的过程中和/或获得第二目标机器学习模型的过程中,在目标数据隐私保护方式下利用了所述多条目标数据记录中的全部或部分。
具体说来,上述系统可以部署在服务器或客户端中,也可以部署在分布式网络环境中的节点上。此外,所述系统可以是PC计算机、平板装置、个人数字助理、智能手机、web应用或其他能够执行上述指令集合的装置。此外,所述系统还可包括视频显示器(诸如,液晶显示器)和用户交互接口(诸如,键盘、鼠标、触摸输入装置等)。另外,所述系统的所有组件可经由总线和/或网络而彼此连接。
这里,所述系统并非必须是单个系统,还可以是任何能够单独或联合执行上述指令(或指令集)的装置或电路的集合体。所述系统还可以是集成控制系统或系统管理器的一部分,或者可被配置为与本地或远程(例如,经由无线传输)以接口互联的便携式电子装置。
在所述系统中,所述至少一个计算装置可包括中央处理器(CPU)、图形处理器(GPU)、可编程逻辑装置、专用处理器系统、微控制器或微处理器。作为示例而非限制,所述至少一个计算装置还可包括模拟处理器、数字处理器、微处理器、多核处理器、处理器阵列、网络处理器等。计算装置可运行存储在存储装置之一中的指令或代码,其中,所述存储装置还可以存储数据。指令和数据还可经由网络接口装置而通过网络被发送和接收,其中,所述网络接口装置可采用任何已知的传输协议。
存储装置可与计算装置集成为一体,例如,将RAM或闪存布置在集成电路微处理器等之内。此外,存储装置可包括独立的装置,诸如,外部盘驱动、存储阵列或任何数据库系统可使用的其他存储装置。 存储装置和计算装置可在操作上进行耦合,或者可例如通过I/O端口、网络连接等互相通信,使得计算装置能够读取存储在存储装置中的指令。
图8是示出根据本公开示例性实施例的在数据隐私保护下执行机器学习的系统(以下,为描述方便,将其简称为“机器学习系统”)800的框图。参照图8,机器学习系统800可包括目标数据集获取装置810、迁移项获取装置820和目标机器学习模型训练装置830。
具体说来,目标数据集获取装置810可获取目标数据集。这里,目标数据集可以是任何可被用于目标机器学习模型训练的数据集,并且可包括多条目标数据记录和/或目标数据记录经过各种数据处理或特征处理之后的结果。此外,可选地,目标数据集还可包括目标数据记录关于机器学习目标的标记(label)。例如,目标数据记录可包括反映对象或事件的各种属性的至少一个属性字段(例如,用户ID、年龄、性别、历史信用记录等),目标数据记录关于机器学习目标的标记可以是例如用户是否有能力偿还贷款、用户是否接受推荐的内容等,但不限于此。此外,目标数据集可涉及用户不期望被他人获知的各种个人隐私信息(例如,用户的姓名、身份证号码、手机号码、财产总额、贷款记录等),并且也可包括不涉及个人隐私的群体相关信息。这里,目标数据记录可来源于不同的数据源(例如,网络运营商、银行机构、医疗机构等),并且目标数据集可被特定机构或组织在获得用户授权的情况下使用,但是用户往往期望其涉及个人隐私的信息不再进一步被其他组织或个人获知。需要说明的是,在本公开中,“隐私”可泛指涉及单个个体的任何属性。
作为示例,目标数据集获取装置810可一次性或分批次地从目标数据源获取目标数据集,并且可以手动、自动或半自动方式获取目标数据集。此外,目标数据集获取装置810可实时或离线地获取目标数据集中的目标数据记录和/或关于目标数据记录的标记,并且目标数据集获取装置810可同时获取目标数据记录和关于目标数据记录的标记,或者获取关于目标数据记录的标记的时间可滞后于获取目标数据记录的时间。此外,目标数据集获取装置810可以以加密的形式从目标数据源获取目标数据集或者直接利用其本地已经存储的目标数据集。如果获取的目标数据集是加密的数据,则可选地,机器学习系统800还可包括对目标数据进行解密的装置,并还可包括数据处理装置以将目标数据处理为适用于当前机器学习的形式。需要说明的是,本公开对目标数据集中的目标数据记录及其标记的种类、形式、内容、目标数据集的获取方式等均无限制,采用任何手段获取的可用于机器学习的数据均可作为以上提及的目标数据集。
然而,如本公开背景技术所述,对于期望挖掘出更多有价值信息的机器学习而言,实际中,仅基于获取的目标数据集可能不足以学习出满足实际任务需求或达到预定效果的机器学习模型,因此,可设法获取来自其他数据源的相关信息,以将来自其他数据源的知识迁移到目标数据集,从而结合目标数据集与来自其他数据源的知识共同进行机器学习,进而可提高机器学习模型的效果。但是,迁移的前提是需要确保:其他数据源的数据集(在本公开中,可被称为“源数据集”)中所涉及的隐私信息不被泄露,即,需要对源数据进行隐私保护。
为此,迁移项获取装置820可获取关于源数据集的迁移项。这里,迁移项可用于在源数据隐私保护方式下将源数据集的知识迁移到目标数据集以在目标数据集上训练目标机器学习模型。具体地,迁移项可以是在源数据被进行隐私保护的情况下(即,在源数据隐私保护方式下)获得的任何与源数据集所包含的知识有关的信息,本公开对迁移项的具体内容和形式不作限制,只要其能够在源数据隐私保护方式下将源数据集的知识迁移到目标数据集即可,例如,迁移项可涉及源数据集的样本、源数据集的特征、基于源数据集获得的模型、用于模型训练的目标函数、关于源数据的统计信息等。
作为示例,迁移项获取装置820可从外部接收关于源数据集的迁移项。例如,迁移项获取装置820可从拥有源数据集的实体、或者授权可对源数据源执行相关处理的实体(例如,提供机器学习相关服务的服务提供商)获取所述迁移项。在这种情况下,迁移项可以是由拥有源数据集的实体或者授权可对源数据源执行相关处理的实体基于源数据集执行机器学习相关处理而获得的,并且可由这些实体将获得的迁移项发送给迁移项获取装置820。这里,根据本发明的示例性实施例,基于源数据集执行机器学习相关处理所针对的预测目标与目标数据集上的目标机器学习模型所针对的预测目标可以是相同的目标(例如,均为预测交易是否为欺诈交易)或相关的目标(例如,具有一定程度近似性的分类问题,例如,预测交易是否为欺诈交易与预测交易是否涉嫌违法)。
与直接从外部获取迁移项不同,可选地,迁移项获取装置820也可通过对源数据集执行机器学习相关处理来获取关于源数据集的迁移项。这里,迁移项获取装置820对源数据集的获取和使用可以是经过授权或经过保护措施的,使得其能够对获取的源数据集进行相应的处理。具体说来,迁移项获取装置820可首先获取源数据集。这里,源数据集可以是与目标数据集有关的任何数据集,相应地,以上关于目标数据集的构成、目标数据集的获取方式等的描述均适用于源数据集,这里不再赘述。另外,尽管为了描述方便,将源数据集描述为由迁移项获取装置820获取,但是,需要说明的是,也可由目标数据集获取装置810来执行获取源数据集的操作,或者,由以上两者共同获取源数据集,本公开对此并不限制。此 外,获取的目标数据集、源数据集和迁移项均可存储在机器学习系统的存储装置(未示出)中。作为可选方式,以上存储的目标数据、源数据或迁移项可进行物理或访问权限上的隔离,以确保数据的安全使用。
在获取了源数据集的情况下,出于隐私保护的考虑,机器学习系统800并不能够直接利用获取的源数据集连同目标数据集一起进行机器学习,而是需要在保证源数据被执行隐私保护的情况下才可利用其进行机器学习。为此,迁移项获取装置820可在源数据隐私保护方式下,基于源数据集执行与机器学习相关的处理,并且在基于源数据集执行与机器学习相关的处理的过程中获取关于源数据集的迁移项。根据示例性实施例,源数据隐私保护方式可以是遵循差分隐私定义的保护方式,但不限于此,而是可以是任何已经存在的或未来可能出现的能够对源数据进行隐私保护的任何隐私保护方式。
为便于理解，现在对遵循差分隐私定义的保护方式进行简要描述。假设有一随机机制M（例如，M可以是机器学习模型），对于M而言，输入的任意两个仅相差一个样本的数据集 $\mathcal{D}$ 和 $\mathcal{D}'$ 的输出等于t的概率分别为 $P[M(\mathcal{D})=t]$ 和 $P[M(\mathcal{D}')=t]$，并且满足以下等式1（其中，ε是隐私保护程度常数或隐私预算），则可认为M对于任意输入是满足ε差分隐私保护的。

$$P[M(\mathcal{D})=t]\le e^{\varepsilon}\cdot P[M(\mathcal{D}')=t]\qquad\text{（等式1）}$$

在以上等式1中，ε越小，隐私保护程度越好，反之则越差。ε的具体取值，可根据用户对数据隐私保护程度的要求进行相应地设置。假设有一个用户，对于他而言，是否输入他的个人数据给机制M（假设该个人数据输入前的数据集是 $\mathcal{D}$，该个人数据输入后的数据集是 $\mathcal{D}'$，$\mathcal{D}$ 和 $\mathcal{D}'$ 仅相差该个人数据），对于输出的影响很小（其中，影响由ε的大小来定义），那么可以认为M对于他的隐私起到了保护作用。假设ε=0，则这个用户是否输入自己的数据给M，对M的输出没有任何影响，所以用户的隐私完全被保护。
根据示例性实施例，源数据隐私保护方式可以是在基于源数据集执行与机器学习相关的处理的过程中添加随机噪声。例如，可通过添加随机噪声，使得遵循上述差分隐私保护定义。但是，需要说明的是，关于隐私保护的定义并不仅限于差分隐私保护定义这一种定义方式，而是可以是例如k-匿名化、l-多样化、t-closeness等其他关于隐私保护的定义方式。
如上所述,迁移项可以是在源数据隐私保护方式下获得的任何与源数据集所包含的知识有关的信息。具体地,根据本公开示例性实施例,迁移项可涉及在基于源数据集执行与机器学习相关的处理的过程中得到的模型参数、目标函数和/或关于源数据的统计信息,但不限于此。作为示例,基于源数据集执行与机器学习相关的处理的操作可包括:在源数据隐私保护方式下基于源数据集训练源机器学习模型,但不限于此,而是还可包括例如对源数据集执行特征处理或数据统计分析等机器学习相关处理。此外,需要说明的是,上述模型参数、目标函数和/或关于源数据的统计信息均既可以是在基于源数据执行与机器学习相关的处理的过程中直接获得的上述信息本身,也可以是对这些信息进行进一步变换或处理之后所获得的信息,本公开对此并无限制。
作为示例,涉及模型参数的迁移项可以是源机器学习模型的参数,例如,在满足差分隐私保护定义的源数据保护方式下训练源机器学习模型的过程中获得的源机器学习模型的模型参数,此外,还可以是例如源机器学习模型的参数的统计信息等,但不限于此。作为示例,迁移项所涉及的目标函数可以是指为了训练源机器学习模型而构建出的目标函数,在源机器学习模型本身的参数并不进行迁移的情况下,该目标函数可并不单独进行实际求解,但本公开不限于此。作为示例,涉及关于源数据的统计信息的迁移项可以是在源数据隐私保护方式(例如,满足差分隐私保护定义的保护方式)下获取的关于源数据的数据分布信息和/或数据分布变化信息,但不限于此。
如上所述,迁移项获取装置820可在源数据隐私保护方式下基于源数据集训练源机器学习模型。根据示例性实施例,源机器学习模型可以是例如广义线性模型,例如,逻辑回归模型,但不限于此。此外,在源数据隐私保护方式中,迁移项获取装置820可将用于训练源机器学习模型的目标函数构造为至少包括损失函数和噪声项。这里,噪声项可用于在训练源机器学习模型的过程中添加随机噪声,从而使得可实现对源数据的隐私保护。此外,用于训练源机器学习模型的目标函数除了被构造为包括损失函数和噪声项之外,还可被构造为包括其他用于对模型参数进行约束的约束项,例如,还可被构造为包括用于防止模型过拟合现象或防止模型参数过于复杂的正则项、用于隐私保护的补偿项等。
为了便于更直观地理解在源数据隐私保护方式下基于源数据集训练源机器学习模型以获得关于源数据集的迁移项的过程,下面将进一步结合数学表示对该过程进行解释。为描述方便,这里,假设源数据 隐私保护方式是遵循差分隐私定义的保护方式,并且源机器学习模型是广义线性模型。
具体地，假设源数据集 $\mathcal{D}=\{(x_i,y_i)\}_{i=1}^{n}$，其中，$x_i$ 是样本，$y_i$ 是样本的标记，$x_i\in\mathcal{X}$，$y_i\in\{-1,+1\}$，其中，n为样本数量，d是样本空间的维度，$\mathcal{X}\subseteq\mathbb{R}^d$ 是d维样本空间，则可基于源数据集 $\mathcal{D}$ 利用以下等式2来训练源机器学习模型，从而获得满足差分隐私保护的关于源数据集的迁移项（在该示例性实施例中为源机器学习模型的参数）。
具体地，在利用等式2求解源机器学习模型的参数之前，可令：

1、对源数据集 $\mathcal{D}$ 进行缩放，使得对于任意i均满足 $\|x_i\|\le 1$，其中，$\|x_i\|$ 表示 $x_i$ 的二范数；

2、$\varepsilon'=\varepsilon-2\log\!\left(1+\frac{c}{n\lambda}\right)$，其中，c和λ为常数，ε是以上等式1中的隐私保护程度常数；

3、如果 $\varepsilon'>0$，则 $\Delta=0$，否则，$\Delta=\frac{c}{n\,(e^{\varepsilon/4}-1)}-\lambda$，并且 $\varepsilon'=\varepsilon/2$；

4、从密度函数 $\nu(b)\propto e^{-\frac{\varepsilon'}{2}\|b\|}$ 采样b，具体地，例如，可首先从Gamma分布 $\Gamma\!\left(d,\frac{2}{\varepsilon'}\right)$ 采样b的二范数 $\|b\|$，然后基于均匀随机采样b的方向u便可获得 $b=\|b\|u$。
接下来，可利用等式2，在源数据隐私保护方式下，基于源数据集训练源机器学习模型，等式2如下：

$$w^*=\arg\min_{w}\ \frac{1}{n}\sum_{i=1}^{n}\ell\big(w^{\top}x_i,\,y_i\big)+\lambda g(w)+\frac{b^{\top}w}{n}+\frac{\Delta}{2}\|w\|^2\qquad\text{（等式2）}$$

在等式2中，w是源机器学习模型的参数，$\ell(\cdot,\cdot)$ 是损失函数，$g(w)$ 是正则化函数，$\frac{b^{\top}w}{n}$ 是用于在训练源机器学习模型的过程中添加随机噪声以实现源数据隐私保护的噪声项，$\frac{\Delta}{2}\|w\|^2$ 是用于隐私保护的补偿项，λ是用于控制正则化强度的常数，$f(w)=\frac{1}{n}\sum_{i=1}^{n}\ell(w^{\top}x_i,y_i)+\lambda g(w)+\frac{b^{\top}w}{n}+\frac{\Delta}{2}\|w\|^2$ 便为构造的用于训练源机器学习模型的目标函数。根据以上等式2，在目标函数的取值最小时的w值便为最终求解出的源机器学习模型的参数 $w^*$。
要使按照以上等式2求解出的 $w^*$ 满足ε差分隐私定义，则需要满足以下预定条件：首先，正则化函数 $g(w)$ 需要是1-强凸函数并且二阶可微；其次，对于所有的z，损失函数需要满足 $|\ell'(z)|\le 1$ 并且 $|\ell''(z)|\le c$，其中，$\ell'(z)$ 和 $\ell''(z)$ 分别是损失函数的一阶导数和二阶导数。也就是说，只要是满足以上条件的广义线性模型，均可通过上面的等式2来获得满足差分隐私保护的源机器学习模型的参数。
例如，对于逻辑回归模型，其损失函数 $\ell(z)=\log\left(1+e^{-z}\right)$。如果令常数c等于1/4，正则化函数 $g(w)=\frac{1}{2}\|w\|^2$，则正则化函数 $g(w)$ 满足是1-强凸函数并且二阶可微，并且对于所有的z，损失函数满足 $|\ell'(z)|\le 1$ 并且 $|\ell''(z)|\le\frac{1}{4}$。因此，当源机器学习模型是逻辑回归模型时，可利用以上等式2来求解源机器学习模型的参数，而按照以上方式求解出的源机器学习模型的参数既满足了对源数据的隐私保护，又携带了源数据集的知识。随后，源机器学习模型的参数可作为迁移项被用于将源数据集的知识迁移到目标数据集以在目标数据集上训练目标机器学习模型。
需要说明的是,尽管以上以广义线性模型(例如,逻辑回归模型)为例介绍了求解源机器学习模型的参数的过程,但是,事实上,只要是满足以上提及的关于正则化函数和损失函数的限制条件的线性模型均可利用等式2来求解源机器学习模型的参数,作为迁移项。
在发明的实施例中,所述源数据和目标数据可分别是来自如下实体中的任一个或多个的数据:
来自银行的数据:如用户的登记信息、银行交易流水信息、存款信息、金融产品购买信息、票据信息(图像)等;
来自保险机构的数据:如投保人信息、保单信息、赔付保险的信息等;
来自医疗机构的数据:如病历信息、确诊信息、治疗信息等;
来自证券公司等其他金融机构的数据;如用户登记信息、金融产品交易信息、金融产品价格浮动信 息等;
来自学校的数据:如生源信息、升学率、就业率、教学信息、教师信息等;
来自政府部门的数据:如社保信息、人力资源信息、市政资源信息、市政项目相关信息、财政相关信息、教育相关信息等;
来自互联网实体的数据:如用来自电商平台或app运营实体的用户登记信息、用户网络行为(搜索、浏览、收藏、购买、点击、支付等)信息,或来自搜索引擎的网络视频、音频、图片、文本等相关的数据等;
来自电信运营商的数据:如移动用户通信数据、固定网络或移动网络流量相关数据等;
来自传统工业企业的数据:工业控制数据如电网相关操作数据、风力发电机组操控数据、空调系统操控数据、矿井组操控数据等等。
从类型上,在本发明的实施例中涉及的源数据和目标数据可以是视频数据、图像数据、语音数据、文本数据、格式化的表单数据等。
在迁移项获取装置820获取到迁移项之后,目标机器学习模型训练装置830可在目标数据隐私保护方式下,基于目标数据集,结合所述迁移项来训练目标机器学习模型。
在本发明的实施例中,所述目标机器学习模型可被应用于如下场景中的任一场景:
图像处理场景,包括:光学字符识别OCR、人脸识别、物体识别和图片分类;更具体地举例来说,OCR可应用于票据(如发票)识别、手写字识别等,人脸识别可应用安防等领域,物体识别可应用于自动驾驶场景中的交通标志识别,图片分类可应用于电商平台的“拍照购”、“找同款”等。
语音识别场景,包括可通过语音进行人机交互的产品,如手机的语音助手(如苹果手机的Siri)、智能音箱等;
自然语言处理场景,包括:审查文本(如合同、法律文书和客服记录等)、垃圾内容识别(如垃圾短信识别)和文本分类(情感、意图和主题等);
自动控制场景,包括:矿井组调节操作预测、风力发电机组调节操作预测和空调系统调节操作预测;具体的对于矿井组可预测开采率高的一组调节操作,对于风力发电机组可预测发电效率高的一组调节操作,对于空调系统,可以预测满足需求的同时节省能耗的一组调节操作;
智能问答场景,包括:聊天机器人和智能客服;
业务决策场景,包括:金融科技领域、医疗领域和市政领域的场景,其中:
金融科技领域包括:营销(如优惠券使用预测、广告点击行为预测、用户画像挖掘等)与获客、反欺诈、反洗钱、承保和信用评分、商品价格预测;
医疗领域包括:疾病筛查和预防、个性化健康管理和辅助诊断;
市政领域包括:社会治理与监管执法、资源环境和设施管理、产业发展和经济分析、公众服务和民生保障、智慧城市(公交、网约车、共享单车等各类城市资源的调配和管理);
推荐业务场景,包括:新闻、广告、音乐、咨询、视频和金融产品(如理财、保险等)的推荐;
搜索场景,包括:网页搜索、图像搜索、文本搜索、视频搜索等;
异常行为检测场景,包括:国家电网客户用电异常行为检测、网络恶意流量检测、操作日志中的异常行为检测等。
根据示例性实施例,目标数据隐私保护方式可与源数据隐私保护方式相同,例如,也可以是遵循差分隐私定义的保护方式,但不限于此。此外,目标机器学习模型可与源机器学习模型属于基于相同类型的机器学习模型。例如,目标机器学习模型也可以是广义线性模型,例如,逻辑回归模型,但不限于此,例如,可以是满足预定条件的任何线性模型。需要说明的是,目标数据隐私保护方式也可以是与源数据隐私保护方式不同的隐私保护方式,并且目标机器学习模型也可以与源机器学习模型属于不同类型的机器学习模型,本申请对此均无限制。
根据示例性实施例,目标数据隐私保护方式可以是在训练目标机器学习模型的过程中添加随机噪声。例如,目标机器学习模型训练装置820可将用于训练目标机器学习模型的目标函数构造为至少包括损失函数和噪声项。可选地,除了将目标函数构造为至少包括损失函数和噪声项之外,在目标数据隐私保护方式下基于目标数据集结合迁移项来训练目标机器学习模型时,目标机器学习模型训练装置830可将用于训练目标机器学习模型的目标函数构造为还反映目标机器学习模型的参数与所述迁移项之间的差值,然后,可基于目标数据集,通过求解构造的目标函数来训练目标机器学习模型。通过在用于训练目标机器学习模型的目标函数中反映目标机器学习模型的参数与所述迁移项之间的差值,可将源数据集中的知识迁移到目标数据集,从而使得该训练过程能够共同利用源数据集上的知识和目标数据集,因而可训练出的目标机器学习模型的效果更佳。
此外,根据实际需要,目标函数还可被构造为包括用于防止训练出的机器学习模型出现过拟合现象的正则项等,或还可根据实际任务需求被构造为包括其他约束项,例如,用于隐私保护的补偿项,本申 请对此并不限制,只要构造的目标函数能够有效地实现对目标数据的隐私保护,同时能够将源数据集上的知识迁移到目标数据集即可。
以下,为便于更加直观地理解上述内容,将进一步结合数学表示对目标机器学习模型训练装置830训练目标机器学习模型的上述过程进行说明。
这里,为描述方便,假设源机器学习模型是逻辑回归模型,目标机器学习模型是广义线性模型,并且目标数据隐私保护方式为遵循差分隐私保护定义的保护方式。
首先，在令源机器学习模型的正则化函数 $g_s(w)=\frac{1}{2}\|w\|^2$ 的情况下，可利用以上描述的求解源机器学习模型参数的过程求解出源机器学习模型的参数 $w_s^*=A_1\big(\mathcal{D}_s,\varepsilon_s,\lambda_s,g_s\big)$（这里的 $w_s^*$ 即为以上等式2中的 $w^*$），其中，$A_1$ 为如以上等式2所述的求解机制，$\mathcal{D}_s$、$\varepsilon_s$、$\lambda_s$ 和 $g_s$ 分别为源数据集、源数据集需要满足的隐私保护程度常数、用于训练源机器学习模型的目标函数中的用于控制正则化强度的常数和正则化函数。随后，在获得源机器学习模型的参数后，可令用于目标机器学习模型的目标函数中的正则化函数为：

$$g_t(w)=\frac{\eta}{2}\left\|w-w_s^*\right\|^2+\frac{1-\eta}{2}\|w\|^2\qquad\text{（等式3）}$$

其中，$0\le\eta\le 1$。由于 $g_t(w)$ 是1-强凸函数并且二阶可微，并且逻辑回归模型的损失函数 $\ell(z)=\log(1+e^{-z})$ 满足上述预定条件中关于损失函数的要求，因此，可通过将等式2中的 $g(w)$ 替换为 $g_t(w)$，并按照以上描述的训练源机器学习模型的过程，利用等式2基于目标数据集在满足差分隐私保护定义的方式下训练目标机器学习模型，从而在用于目标机器学习模型的训练的目标函数取最小值时求解出目标机器学习模型的参数 $w_t^*=A_1\big(\mathcal{D}_t,\varepsilon_t,\lambda_t,g_t\big)$，其中，$\mathcal{D}_t$、$\varepsilon_t$、$\lambda_t$ 和 $g_t$ 分别是目标数据集、目标数据集需要满足的隐私保护程度常数、用于训练目标机器学习模型的目标函数中的控制正则化强度的常数和正则化函数。
此外，在等式3中，由于含有 $\left\|w-w_s^*\right\|^2$ 项，使得用于目标机器学习模型的训练的目标函数被构造为反映了目标机器学习模型的参数与迁移项（即，源机器学习模型的参数）之间的差值，从而有效地实现了源数据集上的知识到目标数据集的迁移。
需要说明的是，以上虽然重点以逻辑回归模型为例介绍了在目标数据隐私保护方式下训练目标机器学习模型的过程，但是，本领域技术人员应清楚的是，本公开中的源机器学习模型和目标机器学习模型均不限于逻辑回归模型，而是可以是例如满足如上所述的预定条件的任何线性模型，甚至还可以是其他任何适当的模型。
根据示例性实施例,训练出的目标机器学习模型可用于执行业务决策,其中,所述业务决策涉及交易反欺诈、账户开通反欺诈、智能营销、智能推荐、贷款评估之中的至少一项,但不限于此,例如,训练出的目标机器学习模型还可用于与生理状况相关的业务决策等。
根据以上示例性实施例,目标机器学习模型训练装置830可在源数据隐私和目标数据隐私均被保护的情况下将源数据集中的知识成功迁移到目标数据集,从而使得能够综合更多知识来训练出模型效果更佳的目标机器学习模型,以应用于相应的业务决策。
以上,已经参照图8描述了根据本申请示例性实施例的机器学习系统800,需要说明的是,尽管以上在描述机器学习系统时将其划分为用于分别执行相应处理的装置(例如,目标数据集获取装置810、迁移项获取装置820和目标机器学习模型训练装置830),然而,本领域技术人员清楚的是,上述各装置执行的处理也可以在机器学习系统不进行任何具体装置划分或者各装置之间并无明确划界的情况下执行。此外,以上参照图8所描述的机器学习系统800并不限于包括以上描述的装置,而是还可以根据需要增加一些其他装置(例如,预测装置、存储装置和/或模型更新装置等),或者以上装置也可被组合。
另外,需要说明的是,本公开中所提及的“机器学习”可被实现为“有监督学习”、“无监督学习”或“半监督学习”的形式,本发明的示例性实施例对具体的机器学习形式并不进行特定限制。
图9是示出根据本公开示例性实施例的在数据隐私保护方式下执行机器学习的方法(以下,为描述方便,将其简称为“机器学习方法”)的流程图。
这里,作为示例,图9所示的机器学习方法可由图8所示的机器学习系统800来执行,也可完全通过计算机程序或指令以软件方式实现,还可通过特定配置的计算系统或计算装置来执行。为了描述方便, 假设图9所示的方法由图8所示的机器学习系统800来执行,并假设机器学习系统800可具有图8所示的配置。
参照图9，在步骤S910，目标数据集获取装置810可获取目标数据集。以上在参照图8描述目标数据集获取装置810时描述的与获取目标数据集有关的任何内容均适用于此，因此，这里不再对其进行赘述。
在获取到目标数据集之后,在步骤S920,迁移项获取装置820可获取关于源数据集的迁移项。这里,迁移项可用于在源数据隐私保护方式下将源数据集的知识迁移到目标数据集以在目标数据集上训练目标机器学习模型。具体地,在步骤S920,迁移项获取装置820可从外部接收所述迁移项。或者,迁移项获取装置820可通过自身对源数据集执行机器学习相关处理来获取关于源数据集的迁移项。具体地,迁移项获取装置820可首先获取源数据集,然后,在源数据隐私保护方式下,基于源数据集执行与机器学习相关的处理,并且在基于源数据集执行与机器学习相关的处理的过程中获取关于源数据集的迁移项。
这里,作为示例,源数据隐私保护方式可以是遵循差分隐私保护定义的保护方式,但不限于此。另外,源数据隐私保护方式可以是在基于源数据集执行与机器学习相关的处理的过程中添加随机噪声,以实现对源数据的隐私保护。这里,基于源数据集执行与机器学习相关的处理可包括在源数据隐私保护方式下基于源数据集训练源机器学习模型,但不限于此,例如,还可以是对在源数据隐私保护方式下对源数据集进行统计分析或特征处理等。根据示例性实施例,在所述源数据隐私保护方式中可将用于训练源机器学习模型的目标函数构造为至少包括损失函数和噪声项。这里,噪声项用于在训练源机器学习模型的过程中添加随机噪声,从而实现对源数据隐私保护。此外,可选地,在所述源数据隐私保护方式中还可将目标函数构造为包括其他用于约束模型参数的约束项。
根据示例性实施例,迁移项可涉及在基于源数据集执行与机器学习相关的处理的过程中得到的模型参数、目标函数和/或关于源数据的统计信息。作为示例,迁移项可以是源机器学习模型的参数,即,在源数据隐私保护方式下训练出的源机器学习模型的参数。根据示例性实施例,源机器学习模型可以是广义线性模型(例如,逻辑回归模型),但不限于此,例如,可以是满足预定条件的任何线性模型,甚至还可以是满足一定条件的任何适当模型。
由于以上已经参照图8结合数学表示描述了关于迁移项获取装置820在源数据隐私保护方式下基于源数据集训练源机器学习模型以获得迁移项(即,源机器学习模型的参数)的过程,因此这里不再赘述。此外,需要说明的是,参照图8在描述迁移项获取装置820时提及的关于源数据集、源数据隐私保护方式、迁移项、目标函数等的所有描述均适用于图9,因此,这里不再赘述,并且在描述迁移项获取装置820和步骤S920时相同或相似的内容可相互参考。
在目标数据集和关于源数据集的迁移项被获取到之后,在步骤S930,目标机器学习模型训练装置830可在目标数据隐私保护方式下,基于目标数据集,结合所述迁移项来训练目标机器学习模型。这里,作为示例,目标数据隐私保护方式也可以是遵循差分隐私定义的保护方式,但不限于此,而是可以是与源数据隐私保护方式相同或不同的其他数据隐私保护方式。此外,所述目标数据隐私保护方式可以是在训练目标机器学习模型的过程中添加随机噪声,以实现对目标数据的隐私保护。具体地,例如,在目标数据隐私保护方式中可将用于训练目标机器学习模型的目标函数构造为至少包括损失函数和噪声项,但是不限于此,例如,可将目标函数构造为还包括其他用于约束模型的约束项,例如,用于限制模型参数复杂度或防止模型过拟合的正则项、用于隐私保护的补偿项等。此外,目标机器学习模型可与源机器学习模型属于基于相同类型的机器学习模型,例如,所述相同类型的机器学习模型可以是逻辑回归,但不限于此,而是可以是例如满足预定条件的任何线性模型。需要说明的是,目标机器学习模型也可以是与源机器学习模型属于不同类型的机器学习模型。
除了在目标数据隐私保护方式中将用于训练目标机器学习模型的目标函数构造为至少包括损失函数和噪声项之外,根据示例性实施例,在步骤S930,目标机器学习模型训练装置830可将用于训练目标机器学习模型的目标函数构造为还反映目标机器学习模型的参数与所述迁移项之间的差值,随后,可基于目标数据集,通过求解构造的目标函数来训练目标机器学习模型。关于利用构造的目标函数训练目标机器学习模型的具体过程,以上已参照图8结合数学表示进行过描述,因此,这里不再赘述。
按照以上方式训练出的目标机器学习模型可用于执行业务决策,例如,所述业务决策可涉及交易反欺诈、账户开通反欺诈、智能营销、智能推荐、贷款评估之中的至少一项,但不限于此。事实上,本公开对目标机器学习模型可被应用于的具体业务决策的类型并无任何限制,只要是适于利用机器学习模型进行决策的业务即可。
以上描述的根据本公开示例性实施例的在数据隐私保护方式下执行机器学习的方法,既可以确保源数据隐私和目标数据隐私的不被泄露,同时能够通过迁移项将源数据的知识迁移到目标数据集,从而便于利用更多数据源的数据进行机器学习来训练机器学习模型,使得训练出的目标机器学习模型的效果能 够具有更佳的模型效果。
需要说明的是，尽管以上在描述图9时，按顺序对图9中的步骤进行了描述，但是，本领域技术人员清楚的是，上述方法中的各个步骤不一定按顺序执行，而是可按照相反的顺序或并行地执行，例如，以上描述的步骤S910与步骤S920便可按照相反顺序或并行执行，也就是说，可在获取目标数据集之前获取关于源数据集的迁移项，或者可同时获取目标数据集和迁移项。另外，在执行步骤S930的同时，也可执行步骤S910或步骤S920，也就是说，在利用已经获取的目标数据集和迁移项训练目标机器学习模型的过程中，可同时获取新的目标数据集或迁移项，以用于例如后续目标机器学习模型的更新操作等。
图10是示出根据本公开示例性实施例的在数据隐私保护方式下执行机器学习的构思的示意图。
为便于更清楚且直观地理解本公开的构思,以下结合图10以金融领域中的贷款审核场景为例(即,目标机器学习模型将用于贷款审核这一业务决策),对根据本公开示例性实施例的在数据隐私保护下执行机器学习的构思进行简要描述。
如今,随着机器学习的不断发展,其在金融领域开始发挥着日益重要的作用,从审批贷款到资产管理,再到风险评估,机器学习在金融生态系统的许多阶段都起着不可或缺的作用。例如,银行可利用机器学习来决定是否批准贷款申请者的贷款申请。但是,单个银行自身所能获得的关于贷款申请者的历史金融活动相关记录可能并不足以全面地反映该贷款申请者的真实信用或贷款偿还能力等情况,在这种情况下,该银行可能期望能够获得该贷款申请者在其他机构的历史金融活动相关记录。然而,出于客户隐私保护的考虑,该银行很难利用其他机构所拥有的贷款申请者的历史金融活动相关记录。然而,利用本公开的构思则可实现在用户数据保护隐私的情况下充分利用多个机构的数据来帮助银行更准确地判断是否批准贷款申请者的贷款申请,进而减少金融风险。
参照图10,目标数据源310(例如,第一银行机构)可将其拥有的涉及用户历史金融活动的目标数据集发送给机器学习系统330。这里,目标数据集中的每条目标数据记录可包括例如用户的姓名、国籍、职业、薪酬、财产、信用记录、历史贷款金额等多种属性信息。此外,目标数据记录还可包括例如关于用户是否按时清偿贷款的标记信息。
这里,机器学习系统330可以是以上参照图8描述的机器学习系统800。作为示例,机器学习系统330可以由专门提供机器学习服务的实体(例如,机器学习服务提供商)提供,或者也可由目标数据源310自己构建。相应地,机器学习系统330既可设置在云端(如公有云、私有云或混合云),也可以设置在银行机构的本地系统。这里,为描述方便,假设机器学习系统330被设置在公有云端,并且由机器学习服务提供商构建。
为了更准确地预测用户的贷款风险指数或者用户的贷款偿还能力,第一银行机构可例如与源数据源320(例如,第二机构)达成彼此在保护用户数据隐私的情况下共享数据的协议。在这种情况下,基于该协议,作为示例,在相应安全措施下,源数据源320可将其所拥有的源数据集发送给机器学习系统330,这里,源数据集例如可以是与以上描述的目标数据集类似的涉及用户金融活动的数据集。然后,机器学习系统330可如以上参照图8和图9所述在源数据隐私保护方式下基于源数据集执行机器学习相关处理,并在执行机器学习处理的过程中获取关于源数据集的迁移项,以将源数据集上的知识迁移到目标数据集。例如,机器学习系统330可基于源数据集训练源机器学习模型,并将训练的源机器学习模型的参数作为迁移项。这里,源机器学习模型可以是例如用于预测用户贷款风险指数或贷款清偿能力的机器学习模型或者其他类似预测目标的机器学习模型,或者是与贷款估计业务相关的针对其他预测目标的机器学习模型。
或者,机器学习系统330也可从源数据源320直接获取迁移项。在这种情况下,例如,源数据源320可事先通过其自身的机器学习系统或者委托其他机器学习服务提供商在源数据隐私保护方式下基于源数据集执行机器学习相关处理来获取迁移项,并将迁移项发送给机器学习系统330。可选地,源数据源320也可选择将源数据集/迁移项发送给目标数据源,然后,由目标数据源将源数据集/迁移项与目标数据集一起提供给机器学习系统330,以用于机器学习。
随后,机器学习系统330在目标数据隐私保护方式下,基于目标数据集,结合获取的迁移项来训练目标机器学习模型。目标数据机器学习模型可以是例如用于预测用户贷款风险指数或贷款清偿能力的机器学习模型。在目标机器学习模型被训练出之后,目标数据源310可将涉及至少一个贷款申请者的的待预测数据集发送给机器学习系统330。机器学习系统330可利用训练出的目标机器学习模型针对待预测数据集提供关于每个贷款申请者的贷款风险指数或贷款清偿能力评分,并将预测结果反馈给目标数据源310。随后,目标数据源310可基于接收到的预测结果判断是否批准贷款申请者提出的贷款申请。通过以上方式,银行机构可以利用机器学习在保护用户数据隐私的同时利用其他机构的数据和自身拥有的数据获得更准确的判断结果,从而可避免不必要的金融风险。
需要说明的是,尽管以上以机器学习在金融领域中的贷款估计应用为例介绍了本公开的构思,但是,本领域人员清楚的是,根据本公开示例性实施例的在数据隐私保护下执行机器学习的方法和系统不限于 应用于金融领域,也不限于用于执行贷款估计这样的业务决策。而是,可应用于任何涉及数据安全和机器学习的领域和业务决策。例如,根据本公开示例性实施例的在数据隐私保护下执行机器学习的方法和系统还可应用于交易反欺诈、账户开通反欺诈、智能营销、智能推荐等。
作为另一示例,根据本公开示例性实施例的在数据隐私保护下执行机器学习的方法和系统还可应用于公共卫生领域,例如,用于执行生理数据的预测。例如,一家医疗机构希望建立起对某项健康指标的预测模型,但是只用本医疗机构的数据进行训练,则预测模型的效果可能欠佳。而事实上,可能很多其他医疗机构都拥有相应的数据,如果可以利用其它医疗机构的数据,则可以提升该医疗机构的针对某项健康指标的预测模型的预测效果。此时,便可利用本公开的构思在保护各医疗机构的用户数据隐私的情况下,综合各医疗结构的数据利用机器学习提供更加准确的预测结果。
更进一步来说,基于本申请中的目标模型可应用于的场景包括但不限于以下场景:图像处理场景、语音识别场景、自然语言处理场景、自动控制场景、智能问答场景、业务决策场景、推荐业务场景、搜索场景和异常行为检测场景。上述各类场景下的更具体应用场景详见前面的描述。
因此,本申请的在数据隐私保护下执行机器学习的方法和系统,也可以应用于上述的任一场景,并且本申请的在数据隐私保护下执行机器学习的方法和系统,在应用于不同的场景时,总体执行方案并无差别,只是在不同场景下针对的数据不同,因此本领域的技术人员基于前述的方案公开可以毫无障碍地将本申请的方案应用于不同的场景,因此不需要对每个场景一一进行说明。
以上已参照图8和图9描述了根据本公开示例性实施例的机器学习方法和机器学习系统,并参照图10示意性地描述了本公开的构思。然而,应理解的是:附图中示出的装置和系统可被分别配置为执行特定功能的软件、硬件、固件或上述项的任意组合。例如,这些系统、装置可对应于专用的集成电路,也可对应于纯粹的软件代码,还可对应于软件与硬件相结合的模块。此外,这些系统或装置所实现的一个或多个功能也可由物理实体设备(例如,处理器、客户端或服务器等)中的组件来统一执行。
此外,上述方法可通过记录在计算机可读存储介质上的指令来实现,例如,根据本申请的示例性实施例,可提供一种存储指令的计算机可读存储介质,其中,当所述指令被至少一个计算装置运行时,促使所述至少一个计算装置执行以下步骤:获取目标数据集;获取关于源数据集的迁移项,其中,所述迁移项用于在源数据隐私保护方式下将源数据集的知识迁移到目标数据集以在目标数据集上训练目标机器学习模型;以及在目标数据隐私保护方式下,基于目标数据集,结合所述迁移项来训练目标机器学习模型。
上述计算机可读存储介质中存储的指令可在诸如客户端、主机、代理装置、服务器等计算机设备中部署的环境中运行,应注意,所述指令还可用于执行除了上述步骤以外的附加步骤或者在执行上述步骤时执行更为具体的处理,这些附加步骤和进一步处理的内容已经在参照图9进行相关方法的描述过程中提及,因此这里为了避免重复将不再进行赘述。
应注意,根据本公开示例性实施例的机器学习系统可完全依赖计算机程序或指令的运行来实现相应的功能,即,各个装置在计算机程序的功能架构中与各步骤相应,使得整个系统通过专门的软件包(例如,lib库)而被调用,以实现相应的功能。
另一方面,当图8所示的系统和装置以软件、固件、中间件或微代码实现时,用于执行相应操作的程序代码或者代码段可以存储在诸如存储介质的计算机可读介质中,使得至少一个处理器或至少一个计算装置可通过读取并运行相应的程序代码或者代码段来执行相应的操作。
例如,根据本申请示例性实施例,可提供一种包括至少一个计算装置和至少一个存储指令的存储装置的系统,其中,所述指令在被所述至少一个计算装置运行时,促使所述至少一个计算装置执行下述步骤:获取目标数据集;获取关于源数据集的迁移项,其中,所述迁移项用于在源数据隐私保护方式下将源数据集的知识迁移到目标数据集以在目标数据集上训练目标机器学习模型;以及在目标数据隐私保护方式下,基于目标数据集,结合所述迁移项来训练目标机器学习模型。
以上描述了本申请的各示例性实施例,应理解,上述描述仅是示例性的,并非穷尽性的,本申请不限于所披露的各示例性实施例。在不偏离本申请的范围和精神的情况下,对于本技术领域的普通技术人员来说许多修改和变更都是显而易见的。因此,本申请的保护范围应该以权利要求的范围为准。

Claims (58)

  1. 一种由至少一个计算装置在数据隐私保护下执行机器学习的方法,包括:
    获取包括多条目标数据记录的目标数据集;
    获取关于源数据集的多个迁移项,其中,所述多个迁移项之中的每个迁移项用于在源数据隐私保护下将对应的一部分源数据集的知识迁移到目标数据集;
    分别利用所述多个迁移项之中的每个迁移项来获得与每个迁移项对应的第一目标机器学习模型,以获得多个第一目标机器学习模型;
    利用所述多个第一目标机器学习模型获得第二目标机器学习模型,
    其中,在获得所述多个第一目标机器学习模型的过程中和获得第二目标机器学习模型的过程中的至少一个过程中,在目标数据隐私保护方式下利用了所述多条目标数据记录中的全部或部分。
  2. 如权利要求1所述的方法,其中,所述对应的一部分源数据集是通过将源数据集按照数据属性字段划分而获得的源数据子集。
  3. 如权利要求1所述的方法,其中,获取关于源数据集的多个迁移项的步骤包括:从外部接收关于源数据集的多个迁移项。
  4. 如权利要求2所述的方法,其中,获取关于源数据集的多个迁移项的步骤包括:
    获取包括多条源数据记录的源数据集,其中,源数据记录和目标数据记录包括相同的数据属性字段;
    将源数据集按照数据属性字段划分为多个源数据子集,其中,每个源数据子集中的数据记录包括至少一个数据属性字段;
    在源数据隐私保护方式下,基于每个源数据子集,针对第一预测目标训练与每个源数据子集对应的源机器学习模型,并将训练出的每个源机器学习模型的参数作为与每个源数据子集相关的迁移项。
  5. 如权利要求4所述的方法,其中,获得与每个迁移项对应的第一目标机器学习模型的步骤包括:
    在不使用目标数据集的情况下,直接将每个迁移项作为与其对应的第一目标机器学习模型的参数。
  6. 如权利要求4所述的方法,其中,获得与每个迁移项对应的第一目标机器学习模型的步骤包括:
    将目标数据集或第一目标数据集按照数据属性字段以与划分源数据集相同的方式划分为多个第一目标数据子集,其中,第一目标数据集包括目标数据集中所包括的部分目标数据记录,每个第一目标数据子集和与其对应的源数据子集中的数据记录包括相同的数据属性字段;
    在目标数据隐私保护方式下,基于每个第一目标数据子集,结合和与每个第一目标数据子集对应的源数据子集相关的迁移项,针对第二预测目标训练与该迁移项对应的第一目标机器学习模型。
  7. 如权利要求5所述的方法,其中,获得第二目标机器学习模型的步骤包括:
    将目标数据集按照数据属性字段以与划分源数据集相同的方式划分为多个目标数据子集,其中,每个目标数据子集和与其对应的源数据子集中的数据记录包括相同的数据属性字段;
    针对每个目标数据子集,利用与其对应的第一目标机器学习模型执行预测以获取针对每个目标数据子集中的每条数据记录的预测结果;
    在目标数据隐私保护方式下,基于由获取的与每条目标数据记录对应的多个预测结果构成的训练样本的集合,针对第三预测目标训练第二目标机器学习模型。
  8. 如权利要求6所述的方法,其中,获得第二目标机器学习模型的步骤包括:
    将第二目标机器学习模型的规则设置为:基于通过以下方式获取的与每条预测数据记录对应的多个预测结果来获得第二目标机器学习模型针对所述每条预测数据记录的预测结果,其中,所述方式包括:获取预测数据记录,并将预测数据记录按照数据属性字段以与划分源数据集相同的方式划分为多个子预测数据;针对每条预测数据记录之中的每个子预测数据,利用与其对应的第一目标机器学习模型执行预测以获取针对每个子预测数据的预测结果;或者
    针对每个第一目标数据子集,利用与其对应的第一目标机器学习模型执行预测以获取针对每个第一目标数据子集中的每条数据记录的预测结果;并且在目标数据隐私保护方式下,基于由获取的与每条目标数据记录对应的多个预测结果构成的训练样本的集合,针对第三预测目标训练第二目标机器学习模型;或者
    将第二目标数据集按照数据属性字段以与划分源数据集相同的方式划分为多个第二目标数据子集,其中,第二目标数据集至少包括目标数据集中排除第一目标数据集之后的剩余目标数据记录;针 对每个第二目标数据子集,利用与其对应的第一目标机器学习模型执行预测以获取针对每个第二目标数据子集中的每条数据记录的预测结果;在目标数据隐私保护方式下,基于由获取的与每条目标数据记录对应的多个预测结果构成的训练样本的集合,针对第三预测目标训练第二目标机器学习模型。
  9. 如权利要求4所述的方法,其中,所述源数据隐私保护方式和所述目标数据隐私保护方式中的至少一个为遵循差分隐私定义的保护方式。
  10. 如权利要求8所述的方法,其中,该方法包括如下中的至少一个:
    所述源数据隐私保护方式为在训练源机器学习模型的过程中添加随机噪声;
    所述目标数据隐私保护方式为在获得第一目标机器学习模型和第二目标机器学习模型中的至少一个的过程中添加随机噪声。
  11. 如权利要求10所述的方法,其中,该方法包括如下中的至少一个:
    在所述源数据隐私保护方式中将用于训练源机器学习模型的目标函数构造为至少包括损失函数和噪声项;
    在所述目标数据隐私保护方式中将用于训练第一目标机器学习模型的目标函数构造为至少包括损失函数和噪声项;
    在所述目标数据隐私保护方式中将用于训练第二目标机器学习模型的目标函数构造为至少包括损失函数和噪声项。
  12. 如权利要求11所述的方法,其中,所述目标数据隐私保护方式的隐私预算取决于与用于训练第一目标机器学习模型的目标函数所包括的噪声项对应的隐私预算和与用于训练第二目标机器学习模型的目标函数所包括的噪声项对应的隐私预算两者之和或两者之中较大的隐私预算。
  13. 如权利要求11所述的方法,其中,包括如下中的至少一项:
    源机器学习模型和第一目标机器学习模型属于相同类型的机器学习模型;
    第一预测目标和第二预测目标相同或相似。
  14. 如权利要求13所述的方法,其中,所述相同类型的机器学习模型为逻辑回归模型,其中,训练第一目标机器学习模型的步骤包括:将用于训练第一目标机器学习模型的目标函数构造为至少包括损失函数和噪声项并反映第一目标机器学习模型的参数与对应于该第一目标机器学习模型的迁移项之间的差值;在目标数据隐私保护方式下,基于每个第一目标数据子集,结合和与每个第一目标数据子集对应的源数据子集相关的迁移项,通过求解构造的目标函数来针对第二预测目标训练与该迁移项对应的第一目标机器学习模型。
  15. 如权利要求8所述的方法,其中,包括如下中的至少一项:
    第一目标机器学习模型和第二目标机器学习模型属于相同类型的机器学习模型;
    第二预测目标和第三预测目标相同或相似。
  16. 如权利要求1所述的方法,其中,第二目标机器学习模型用于执行业务决策,其中,所述业务决策涉及交易反欺诈、账户开通反欺诈、智能营销、智能推荐、贷款评估之中的至少一项。
  17. 一种由至少一个计算装置利用具有数据隐私保护的机器学习模型进行预测的方法,包括:
    获取如权利要求1至16中的任一权利要求所述的多个第一目标机器学习模型和第二目标机器学习模型;
    获取预测数据记录;
    将预测数据记录划分为多个子预测数据;
    针对每条预测数据记录之中的每个子预测数据,利用与其对应的第一目标机器学习模型执行预测以获取针对每个子预测数据的预测结果;以及
    将由多个第一目标机器学习模型获取的与每条预测数据记录对应的多个预测结果输入第二目标机器学习模型,以得到针对所述每条预测数据记录的预测结果。
  18. 一种存储指令的计算机可读存储介质,其中,当所述指令被至少一个计算装置运行时,促使所述至少一个计算装置执行如权利要求1至16中的任一权利要求所述的在数据隐私保护下执行机器学习的方法或如权利要求17所述的利用具有数据隐私保护的机器学习模型进行预测的方法。
  19. 一种在数据隐私保护下执行机器学习的系统,包括:
    目标数据集获取装置,被配置为获取包括多条目标数据记录的目标数据集;
    迁移项获取装置,被配置为获取关于源数据集的多个迁移项,其中,所述多个迁移项之中的每个迁移项用于在源数据隐私保护下将对应的一部分源数据集的知识迁移到目标数据集;
    第一目标机器学习模型获得装置,被配置为分别利用所述多个迁移项之中的每个迁移项来获得与每个迁移项对应的第一目标机器学习模型,以获得多个第一目标机器学习模型;
    第二目标机器学习模型获得装置,被配置为利用所述多个第一目标机器学习模型获得第二目标机 器学习模型,
    其中,在第一目标机器学习模型获得装置获得所述多个第一目标机器学习模型的过程中和第二目标机器学习模型获得装置获得第二目标机器学习模型的过程中的至少一个过程中,在目标数据隐私保护方式下利用了所述多条目标数据记录中的全部或部分。
  20. 一种包括至少一个计算装置和至少一个存储指令的存储装置的系统,其中,所述指令在被所述至少一个计算装置运行时,促使所述至少一个计算装置执行在数据隐私保护下执行机器学习的以下步骤:
    获取包括多条目标数据记录的目标数据集;
    获取关于源数据集的多个迁移项,其中,所述多个迁移项之中的每个迁移项用于在源数据隐私保护下将对应的一部分源数据集的知识迁移到目标数据集;
    分别利用所述多个迁移项之中的每个迁移项来获得与每个迁移项对应的第一目标机器学习模型,以获得多个第一目标机器学习模型;
    利用所述多个第一目标机器学习模型获得第二目标机器学习模型,
    其中,在获得所述多个第一目标机器学习模型的过程中和获得第二目标机器学习模型的过程中的至少一个过程中,在目标数据隐私保护方式下利用了所述多条目标数据记录中的全部或部分。
  21. 如权利要求20所述的系统,其中,所述对应的一部分源数据集是通过将源数据集按照数据属性字段划分而获得的源数据子集。
  22. 如权利要求20所述的系统,其中,迁移项获取装置被配置为从外部接收关于源数据集的多个迁移项。
  23. 如权利要求21所述的系统,其中,获取关于源数据集的多个迁移项的步骤包括:
    获取包括多条源数据记录的源数据集,其中,源数据记录和目标数据记录包括相同的数据属性字段;
    将源数据集按照数据属性字段划分为多个源数据子集,其中,每个源数据子集中的数据记录包括至少一个数据属性字段;
    在源数据隐私保护方式下,基于每个源数据子集,针对第一预测目标训练与每个源数据子集对应的源机器学习模型,并将训练出的每个源机器学习模型的参数作为与每个源数据子集相关的迁移项。
  24. 如权利要求23所述的系统,其中,获得与每个迁移项对应的第一目标机器学习模型的步骤包括:
    在不使用目标数据集的情况下,直接将每个迁移项作为与其对应的第一目标机器学习模型的参数。
  25. 如权利要求23所述的系统,其中,获得与每个迁移项对应的第一目标机器学习模型的步骤包括:
    将目标数据集或第一目标数据集按照数据属性字段以与划分源数据集相同的方式划分为多个第一目标数据子集,其中,第一目标数据集包括目标数据集中所包括的部分目标数据记录,每个第一目标数据子集和与其对应的源数据子集中的数据记录包括相同的数据属性字段;
    在目标数据隐私保护方式下,基于每个第一目标数据子集,结合和与每个第一目标数据子集对应的源数据子集相关的迁移项,针对第二预测目标训练与该迁移项对应的第一目标机器学习模型。
  26. 如权利要求24所述的系统,其中,获得第二目标机器学习模型的步骤包括:
    将目标数据集按照数据属性字段以与划分源数据集相同的方式划分为多个目标数据子集,其中,每个目标数据子集和与其对应的源数据子集中的数据记录包括相同的数据属性字段;
    针对每个目标数据子集,利用与其对应的第一目标机器学习模型执行预测以获取针对每个目标数据子集中的每条数据记录的预测结果;
    在目标数据隐私保护方式下,基于由获取的与每条目标数据记录对应的多个预测结果构成的训练样本的集合,针对第三预测目标训练第二目标机器学习模型。
  27. 如权利要求25所述的系统,其中,获得第二目标机器学习模型的步骤包括:
    将第二目标机器学习模型的规则设置为:基于通过以下方式获取的与每条预测数据记录对应的多个预测结果来获得第二目标机器学习模型针对所述每条预测数据记录的预测结果,其中,所述方式包括:获取预测数据记录,并将预测数据记录按照数据属性字段以与划分源数据集相同的方式划分为多个子预测数据;针对每条预测数据记录之中的每个子预测数据,利用与其对应的第一目标机器学习模型执行预测以获取针对每个子预测数据的预测结果;或者
    针对每个第一目标数据子集,利用与其对应的第一目标机器学习模型执行预测以获取针对每个第一目标数据子集中的每条数据记录的预测结果;并且在目标数据隐私保护方式下,基于由获取的与每 条目标数据记录对应的多个预测结果构成的训练样本的集合,针对第三预测目标训练第二目标机器学习模型;或者
    将第二目标数据集按照数据属性字段以与划分源数据集相同的方式划分为多个第二目标数据子集,其中,第二目标数据集不同于第一目标数据集并至少包括目标数据集中排除第一目标数据集之后的剩余目标数据记录;针对每个第二目标数据子集,利用与其对应的第一目标机器学习模型执行预测以获取针对每个第二目标数据子集中的每条数据记录的预测结果;在目标数据隐私保护方式下,基于由获取的与每条目标数据记录对应的多个预测结果构成的训练样本的集合,针对第三预测目标训练第二目标机器学习模型。
  28. 如权利要求23所述的系统,其中,所述源数据隐私保护方式和所述目标数据隐私保护方式中的至少一个为遵循差分隐私定义的保护方式。
  29. 如权利要求27所述的系统,其中,包括如下中的至少一项:
    所述源数据隐私保护方式为在训练源机器学习模型的过程中添加随机噪声;
    所述目标数据隐私保护方式为在获得第一目标机器学习模型和第二目标机器学习模型中的至少一个的过程中添加随机噪声。
  30. 如权利要求29所述的系统,其中,包括如下中的至少一项:
    在所述源数据隐私保护方式中迁移项获取装置将用于训练源机器学习模型的目标函数构造为至少包括损失函数和噪声项;
    在所述目标数据隐私保护方式中将用于训练第一目标机器学习模型的目标函数构造为至少包括损失函数和噪声项;
    在所述目标数据隐私保护方式中将用于训练第二目标机器学习模型的目标函数构造为至少包括损失函数和噪声项。
  31. 如权利要求30所述的系统,其中,所述目标数据隐私保护方式的隐私预算取决于与用于训练第一目标机器学习模型的目标函数所包括的噪声项对应的隐私预算和与用于训练第二目标机器学习模型的目标函数所包括的噪声项对应的隐私预算两者之和或两者之中较大的隐私预算。
  32. 如权利要求30所述的系统,其中,包括如下中的至少一项:
    源机器学习模型和第一目标机器学习模型属于相同类型的机器学习模型;
    第一预测目标和第二预测目标相同或相似。
  33. 如权利要求32所述的系统,其中,所述相同类型的机器学习模型为逻辑回归模型,其中,训练第一目标机器学习模型的步骤包括:将用于训练第一目标机器学习模型的目标函数构造为至少包括损失函数和噪声项并反映第一目标机器学习模型的参数与对应于该第一目标机器学习模型的迁移项之间的差值;在目标数据隐私保护方式下,基于每个第一目标数据子集,结合和与每个第一目标数据子集对应的源数据子集相关的迁移项,通过求解构造的目标函数来针对第二预测目标训练与该迁移项对应的第一目标机器学习模型。
  34. 如权利要求27所述的系统,其中,包括如下中的至少一项:
    第一目标机器学习模型和第二目标机器学习模型属于相同类型的机器学习模型;
    第二预测目标和第三预测目标相同或相似。
  35. 如权利要求20所述的系统,其中,第二目标机器学习模型用于执行业务决策,其中,所述业务决策涉及交易反欺诈、账户开通反欺诈、智能营销、智能推荐、贷款评估之中的至少一项。
  36. 一种包括至少一个计算装置和至少一个存储指令的存储装置的系统,其中,所述指令在被所述至少一个计算装置运行时,促使所述至少一个计算装置执行利用具有数据隐私保护的机器学习模型进行预测的以下步骤:
    获取如权利要求20至35中的任一权利要求所述的多个第一目标机器学习模型和第二目标机器学习模型;
    获取预测数据记录;
    将预测数据记录划分为多个子预测数据;
    针对每条预测数据记录之中的每个子预测数据,利用与其对应的第一目标机器学习模型执行预测以获取针对每个子预测数据的预测结果,并且将由多个第一目标机器学习模型获取的与每条预测数据记录对应的多个预测结果输入第二目标机器学习模型,以得到针对所述每条预测数据记录的预测结果。
  37. 一种由至少一个计算装置在数据隐私保护下执行机器学习的方法,包括:
    获取目标数据集;
    获取关于源数据集的迁移项,其中,所述迁移项用于在源数据隐私保护方式下将源数据集的知识 迁移到目标数据集以在目标数据集上训练目标机器学习模型;以及
    在目标数据隐私保护方式下,基于目标数据集,结合所述迁移项来训练目标机器学习模型。
  38. 如权利要求37所述的方法,其中,
    获取关于源数据集的迁移项的步骤包括:从外部接收所述迁移项;
    或者,
    获取关于源数据集的迁移项的步骤包括:获取源数据集;在源数据隐私保护方式下,基于源数据集执行与机器学习相关的处理;以及在基于源数据集执行与机器学习相关的处理的过程中获取关于源数据集的迁移项。
  39. 如权利要求38所述的方法,其中,所述源数据隐私保护方式和所述目标数据隐私保护方式中的至少一个为遵循差分隐私定义的保护方式。
  40. 如权利要求38所述的方法,其中,所述迁移项涉及在基于源数据集执行与机器学习相关的处理的过程中得到的模型参数、目标函数和关于源数据的统计信息中的至少一个。
  41. 如权利要求38所述的方法,其中,包括如下中的至少一项:
    所述源数据隐私保护方式为在基于源数据集执行与机器学习相关的处理的过程中添加随机噪声;
    所述目标数据隐私保护方式为在训练目标机器学习模型的过程中添加随机噪声。
  42. 如权利要求41所述的方法,其中,在源数据隐私保护方式下基于源数据集执行与机器学习相关的处理包括:在源数据隐私保护方式下基于源数据集训练源机器学习模型。
  43. 如权利要求42所述的方法,其中,源机器学习模型与目标机器学习模型属于基于相同类型的机器学习模型。
  44. 如权利要求43所述的方法，其中，包括如下中的至少一项：
    在所述源数据隐私保护方式中将用于训练源机器学习模型的目标函数构造为至少包括损失函数和噪声项;
    在所述目标数据隐私保护方式中将用于训练目标机器学习模型的目标函数构造为至少包括损失函数和噪声项。
  45. 如权利要求44所述的方法,其中,所述相同类型的机器学习模型为逻辑回归模型,并且,所述迁移项为源机器学习模型的参数,
    其中,在目标数据隐私保护方式下,基于目标数据集,结合所述迁移项来训练目标机器学习模型的步骤包括:将用于训练目标机器学习模型的目标函数构造为还反映目标机器学习模型的参数与所述迁移项之间的差值;基于目标数据集,通过求解构造的目标函数来训练目标机器学习模型。
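Read concretely, and purely as an illustration (the weight $\lambda$ and the distribution of the noise vector $\mathbf{b}$ would be fixed by the chosen privacy analysis), the constructed objective of claim 45 can take the form

    \min_{\mathbf{w}}\; \frac{1}{n}\sum_{i=1}^{n}\ell\big(y_i,\, \mathbf{w}^{\top}\mathbf{x}_i\big)
    \;+\; \frac{\lambda}{2}\,\lVert \mathbf{w}-\mathbf{w}_{s}\rVert^{2}
    \;+\; \frac{1}{n}\,\mathbf{b}^{\top}\mathbf{w},

where $\ell$ is the logistic loss, $\mathbf{w}_{s}$ is the migration item (the source model parameters), and $\mathbf{b}^{\top}\mathbf{w}$ is the noise term.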
  46. The method of claim 37, wherein the target machine learning model is used to make business decisions, wherein the business decisions involve at least one of transaction anti-fraud, account-opening anti-fraud, intelligent marketing, intelligent recommendation, and loan assessment.
  47. A computer-readable storage medium storing instructions, wherein the instructions, when executed by at least one computing device, cause the at least one computing device to perform the method of performing machine learning under data privacy protection of any one of claims 37 to 46.
  48. A system for performing machine learning under data privacy protection, the system comprising:
    a target data set acquisition device configured to obtain a target data set;
    a migration item acquisition device configured to obtain a migration item about a source data set, wherein the migration item is used to migrate knowledge of the source data set to the target data set in a source data privacy protection mode so as to train a target machine learning model on the target data set; and
    a target machine learning model training device configured to train, in a target data privacy protection mode, the target machine learning model based on the target data set in combination with the migration item.
  49. A system comprising at least one computing device and at least one storage device storing instructions, wherein the instructions, when executed by the at least one computing device, cause the at least one computing device to perform the following steps of performing machine learning under data privacy protection:
    obtaining a target data set;
    obtaining a migration item about a source data set, wherein the migration item is used to migrate knowledge of the source data set to the target data set in a source data privacy protection mode so as to train a target machine learning model on the target data set; and
    training, in a target data privacy protection mode, the target machine learning model based on the target data set in combination with the migration item.
  50. The system of claim 49, wherein
    the step of obtaining the migration item about the source data set comprises: receiving the migration item from outside;
    or, the step of obtaining the migration item about the source data set comprises: obtaining the source data set; performing machine-learning-related processing based on the source data set in the source data privacy protection mode; and acquiring the migration item about the source data set in the course of performing the machine-learning-related processing based on the source data set.
  51. The system of claim 50, wherein at least one of the source data privacy protection mode and the target data privacy protection mode is a protection mode that follows the definition of differential privacy.
  52. The system of claim 50, wherein the migration item involves at least one of model parameters, an objective function, and statistical information about the source data obtained in the course of performing the machine-learning-related processing based on the source data set.
  53. The system of claim 50, comprising at least one of the following:
    the source data privacy protection mode is adding random noise in the course of performing the machine-learning-related processing based on the source data set;
    the target data privacy protection mode is adding random noise in the course of training the target machine learning model.
  54. The system of claim 53, wherein the operation of performing the machine-learning-related processing based on the source data set in the source data privacy protection mode comprises: training a source machine learning model based on the source data set in the source data privacy protection mode.
  55. The system of claim 54, wherein the source machine learning model and the target machine learning model are machine learning models of the same type.
  56. The system of claim 55, comprising at least one of the following:
    in the source data privacy protection mode, the migration item acquisition device constructs the objective function used to train the source machine learning model to include at least a loss function and a noise term;
    in the target data privacy protection mode, the target machine learning model training device constructs the objective function used to train the target machine learning model to include at least a loss function and a noise term.
  57. The system of claim 56, wherein the machine learning models of the same type are logistic regression models and the migration item is the parameters of the source machine learning model,
    wherein the step of training the target machine learning model based on the target data set in combination with the migration item in the target data privacy protection mode comprises: constructing the objective function used to train the target machine learning model to further reflect the difference between the parameters of the target machine learning model and the migration item; and training the target machine learning model based on the target data set by solving the constructed objective function.
  58. The system of claim 49, wherein the target machine learning model is used to make business decisions, wherein the business decisions involve at least one of transaction anti-fraud, account-opening anti-fraud, intelligent marketing, intelligent recommendation, and loan assessment.
PCT/CN2019/101441 2018-08-17 2019-08-19 Method and system for performing machine learning under data privacy protection WO2020035075A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP19849826.3A EP3839790A4 (en) 2018-08-17 2019-08-19 METHOD AND SYSTEM FOR PERFORMING MACHINE LEARNING UNDER DATA PRIVACY PROTECTION

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
CN201810939380.3 2018-08-17
CN201810939380 2018-08-17
CN201811136436.8 2018-09-28
CN201811136436.8A CN110990859B (zh) 2018-09-28 2018-09-28 Method and system for performing machine learning under data privacy protection
CN201910618274.XA CN110858253A (zh) 2018-08-17 2019-07-10 Method and system for performing machine learning under data privacy protection
CN201910618274.X 2019-07-10

Publications (1)

Publication Number Publication Date
WO2020035075A1 (zh) 2020-02-20

Family

ID=69524689

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/101441 WO2020035075A1 (zh) Method and system for performing machine learning under data privacy protection

Country Status (2)

Country Link
EP (1) EP3839790A4 (zh)
WO (1) WO2020035075A1 (zh)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160283735A1 (en) * 2015-03-24 2016-09-29 International Business Machines Corporation Privacy and modeling preserved data sharing
CN106791195A (zh) * 2017-02-20 2017-05-31 努比亚技术有限公司 Operation processing method and device
CN107704930A (zh) * 2017-09-25 2018-02-16 阿里巴巴集团控股有限公司 Modeling method, apparatus and system based on shared data, and electronic device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3839790A4 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111340555A (zh) * 2020-02-29 2020-06-26 重庆百事得大牛机器人有限公司 Suggestion decision system and method based on a legal-field user profile model
CN111340555B (zh) * 2020-02-29 2023-07-18 重庆百事得大牛机器人有限公司 Suggestion decision system and method based on a legal-field user profile model
CN111444958A (zh) * 2020-03-25 2020-07-24 北京百度网讯科技有限公司 Model migration training method, apparatus, device, and storage medium
CN111444958B (zh) * 2020-03-25 2024-02-13 北京百度网讯科技有限公司 Model migration training method, apparatus, device, and storage medium
CN112241549A (zh) * 2020-05-26 2021-01-19 中国银联股份有限公司 Secure privacy computing method, server, system, and storage medium
CN114238583A (zh) * 2021-12-21 2022-03-25 润联软件系统(深圳)有限公司 Natural language processing method, apparatus, computer device, and storage medium
CN114238583B (zh) * 2021-12-21 2024-01-02 华润数字科技有限公司 Natural language processing method, apparatus, computer device, and storage medium
CN117094032A (zh) * 2023-10-17 2023-11-21 成都乐超人科技有限公司 User information encryption method and system based on privacy protection
CN117094032B (zh) * 2023-10-17 2024-02-09 成都乐超人科技有限公司 User information encryption method and system based on privacy protection

Also Published As

Publication number Publication date
EP3839790A4 (en) 2022-04-27
EP3839790A1 (en) 2021-06-23

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 19849826
    Country of ref document: EP
    Kind code of ref document: A1
NENP Non-entry into the national phase
    Ref country code: DE
ENP Entry into the national phase
    Ref document number: 2019849826
    Country of ref document: EP
    Effective date: 20210317