CN114912139A - Method, apparatus, storage medium, and processor for determining model training data - Google Patents

Method, apparatus, storage medium, and processor for determining model training data

Info

Publication number
CN114912139A
Authority
CN
China
Prior art keywords
model
desensitization
data
training
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210334436.9A
Other languages
Chinese (zh)
Inventor
沈丽忠
陈晗
李婉华
谢立东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Construction Bank Corp
Original Assignee
China Construction Bank Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Construction Bank Corp filed Critical China Construction Bank Corp
Priority to CN202210334436.9A
Publication of CN114912139A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules, to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Bioethics (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Electric Propulsion And Braking For Vehicles (AREA)
  • Train Traffic Observation, Control, And Security (AREA)

Abstract

An embodiment of the present application provides a method for determining model training data. The method comprises the following steps: inputting sample data that has not been desensitized into a neural network model to train the neural network model and obtain a first model; desensitizing the sample data with multiple desensitization methods to obtain desensitization data corresponding to each desensitization method; inputting the desensitization data corresponding to each desensitization method into the neural network model, respectively, to train it and obtain a plurality of second models; determining model parameters of the first model and of each second model; comparing the model parameters of the first model with those of each second model, respectively, to determine a model difference value between the first model and each second model; determining the second model with the smallest model difference value as the target model; and determining the desensitization method corresponding to the target model as the target desensitization method, and desensitizing sample data with the target desensitization method so as to reduce differences in model training.

Description

Method, apparatus, storage medium, and processor for determining model training data
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method, an apparatus, a storage medium, and a processor for determining model training data.
Background
Desensitizing sensitive data is an effective means of preventing leaks and reliably protecting it. Common desensitization methods include substitution, shuffling, numerical transformation, and encryption. However, desensitizing sensitive data with different methods and then training a model on the desensitized data produces markedly different training results. For example, desensitizing sensitive data by substitution can cause information loss in the desensitized data, and training a model on data with such information loss degrades the training effect to some extent.
Disclosure of Invention
An object of the embodiments of the present application is to provide a method, an apparatus, a storage medium, and a processor for determining model training data.
To achieve the above object, a first aspect of the present application provides a method for determining model training data, comprising:
inputting sample data that has not been desensitized into the neural network model to train the neural network model and obtain a first model;
desensitizing the sample data with a plurality of desensitization methods to obtain desensitization data corresponding to each desensitization method;
inputting the desensitization data corresponding to each desensitization method into the neural network model, respectively, to train the neural network model and obtain a plurality of second models;
determining model parameters of the first model and of each second model;
comparing the model parameters of the first model with the model parameters of each second model, respectively, to determine a model difference value between the first model and each second model;
determining the second model with the smallest model difference value as a target model; and
determining the desensitization method corresponding to the target model as a target desensitization method, and desensitizing the sample data with the target desensitization method to obtain training data for the neural network model.
In an embodiment of the present application, the model difference values comprise AUC values between models, and the model difference value Y_i of each second model from the first model is determined according to formula (1):

[Formula (1) is available in the original publication only as an image.]

wherein Y_i is the model difference value between the first model and the second model trained using sample data obtained by desensitization with the ith desensitization method, V_s is the AUC value of the first model, V_ei is the AUC value of the second model trained on sample data obtained by desensitization with the ith desensitization method, and the final term of formula (1) is the degree-of-overfitting value between that second model and the first model.
In an embodiment of the present application, inputting sample data that has not been desensitized into the neural network model to train it and obtain the first model comprises: submitting a machine learning pipeline to a first operating environment, and inputting the non-desensitized sample data into the neural network model of the machine learning pipeline in the first operating environment to train the neural network model and obtain the first model. Inputting the desensitization data corresponding to each desensitization method into the neural network model to train it and obtain a plurality of second models comprises: submitting the machine learning pipeline to a second operating environment, and inputting the desensitization data corresponding to each desensitization method into the neural network model of the machine learning pipeline in the second operating environment, respectively, to train the neural network model and obtain the plurality of second models.
In an embodiment of the application, the method further comprises: acquiring target training parameters of the target model, wherein the target training parameters comprise the environmental operating parameters of the second operating environment in which the target model was trained, the model parameters of the target model, and the desensitization method used for the desensitization data that trained the target model; and determining the target training parameters as the training parameters for subsequent model training.
In an embodiment of the application, the first operating environment is a trusted environment whose data includes sensitive data, and the second operating environment is a debugging environment whose data is desensitized data.
In an embodiment of the present application, the model parameters used when training the neural network model with desensitized sample data are consistent with the model parameters used when training it with non-desensitized sample data.
In an embodiment of the present application, the model difference values include the lift and/or KS statistic between the models.
A second aspect of the application provides a processor configured to perform the above-described method for determining model training data.
A third aspect of the application provides an apparatus for determining model training data, the apparatus comprising:
the first training module is configured to input sample data which is not subjected to desensitization processing into the neural network model so as to train the neural network model to obtain a first model;
the data desensitization module is configured to desensitize the sample data through a plurality of desensitization methods to obtain desensitization data corresponding to each desensitization method;
a second training module configured to input desensitization data corresponding to each desensitization method to the neural network model, respectively, to train the neural network model to obtain a plurality of second models;
a model comparison module configured to determine model parameters of the first model and each of the second models; comparing the model parameters of the first model with the model parameters of each second model respectively to determine a model difference value between the first model and each second model;
the model selection module is configured to determine the second model with the smallest model difference value as the target model, determine the desensitization method corresponding to the target model as the target desensitization method, and desensitize the sample data with the target desensitization method to obtain training data for the neural network model.
In an embodiment of the application, the model difference values comprise AUC values between the models, and the model comparison module is further configured to determine the model difference value Y_i of each second model from the first model according to formula (1):

[Formula (1) is available in the original publication only as an image.]

wherein Y_i is the model difference value between the first model and the second model trained using sample data obtained by desensitization with the ith desensitization method, V_s is the AUC value of the first model, V_ei is the AUC value of that second model, and the final term of formula (1) is the degree-of-overfitting value between that second model and the first model.
In an embodiment of the application, the first training module is further configured to: submit a machine learning pipeline to a first operating environment, and input sample data that has not been desensitized into the neural network model of the machine learning pipeline in the first operating environment to train the neural network model and obtain the first model. The second training module is further configured to: submit the machine learning pipeline to a second operating environment, and input the desensitization data corresponding to each desensitization method into the neural network model of the machine learning pipeline in the second operating environment, respectively, to train the neural network model and obtain the plurality of second models.
In an embodiment of the application, the model selection module is further configured to: acquire target training parameters of the target model, wherein the target training parameters comprise the environmental operating parameters of the second operating environment in which the target model was trained, the model parameters of the target model, and the desensitization method used for the desensitization data that trained the target model; and determine the target training parameters as the training parameters for subsequent model training.
In an embodiment of the application, the first operating environment is a trusted environment whose data includes sensitive data, and the second operating environment is a debugging environment whose data is desensitized data.
In an embodiment of the present application, the model parameters used when training the neural network model with desensitized sample data are consistent with the model parameters used when training it with non-desensitized sample data.
In an embodiment of the present application, the model difference values include the lift and/or KS statistic between the models.
A fourth aspect of the present application provides a machine-readable storage medium having instructions stored thereon which, when executed by a processor, cause the processor to perform the above-described method for determining model training data.
A fifth aspect of the application provides a computer program product comprising a computer program which, when executed by a processor, implements the method for determining model training data described above.
With this technical scheme, a target desensitization method can be determined and used to desensitize non-desensitized data into model training data, reducing the variance in model training effects and thereby improving the training effect of the model.
Additional features and advantages of embodiments of the present application will be described in detail in the detailed description which follows.
Drawings
The accompanying drawings, which are included to provide a further understanding of the embodiments of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the embodiments of the disclosure, but are not intended to limit the embodiments of the disclosure. In the drawings:
FIG. 1 schematically illustrates a flow diagram of a method for determining model training data in accordance with an embodiment of the present application;
FIG. 2 schematically illustrates another flow diagram of a method for determining model training data according to an embodiment of the present application;
FIG. 3 schematically illustrates an application environment diagram of a method for determining model training data according to an embodiment of the present application;
FIG. 4 schematically illustrates an application environment diagram of a method for determining model training data according to another embodiment of the present application;
FIG. 5 schematically illustrates an application environment for a method for determining model training data according to yet another embodiment of the present application;
FIG. 6 schematically illustrates a block diagram of an apparatus for determining model training data according to an embodiment of the present application;
fig. 7 schematically shows an internal structure diagram of a computer device according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it should be understood that the specific embodiments described herein are only used for illustrating and explaining the embodiments of the present application and are not used for limiting the embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Fig. 1 schematically shows a flow diagram of a method for determining model training data according to an embodiment of the present application. As shown in FIG. 1, in one embodiment of the present application, a method for determining model training data is provided, comprising the steps of:
step 101, inputting sample data which is not subjected to desensitization processing into a neural network model to train the neural network model, so as to obtain a first model.
And 102, desensitizing the sample data by a plurality of desensitizing methods to obtain desensitizing data corresponding to each desensitizing method.
103, inputting desensitization data corresponding to each desensitization method into the neural network model respectively to train the neural network model to obtain a plurality of second models.
Step 104, determining model parameters of the first model and each second model.
And 105, comparing the model parameters of the first model with the model parameters of each second model respectively to determine a model difference value between the first model and each second model.
And 106, determining the second model with the minimum model difference value as the target model.
And step 107, determining a desensitization method corresponding to the target model as a target desensitization method, and performing desensitization processing on sample data by using the target desensitization method to obtain data trained on the neural network model.
When a model is trained, the training data may be of two types: data that has been desensitized by a desensitization method, and data that has not been desensitized. Since the non-desensitized data may include sensitive data, the model can be trained separately with each type of training data. The processor may input the non-desensitized sample data into the neural network model to train it and obtain the first model. The non-desensitized sample data may include sensitive data, that is, data that could cause harm to society or to individuals if leaked. Specifically, sensitive data may include personal privacy data, such as names, identification numbers, addresses, telephone numbers, and bank card account numbers, as well as data that an enterprise or social institution is unwilling to publish, such as the enterprise's business data. The first model refers to the model obtained by training the neural network model on non-desensitized sample data.
The processor may desensitize the non-desensitized sample data with a plurality of desensitization methods, such as substitution, shuffling, numerical transformation, and encryption, to obtain desensitization data corresponding to each method. The processor may then input the desensitization data corresponding to each desensitization method into the neural network model, respectively, to train it and obtain a plurality of second models; that is, training the neural network model on the desensitization data of each desensitization method yields one second model. A second model refers to a model obtained by training the neural network model on desensitization data.
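As an illustration of the desensitization methods named above, the following is a minimal Python sketch of substitution, shuffling, and numerical-transformation desensitizers. The function names, the truncated-hash surrogate, and the noise scale are illustrative assumptions, not details from the patent.

```python
import hashlib
import random

def desensitize_substitute(values):
    # Substitution: replace each sensitive value with an irreversible
    # surrogate (here a truncated SHA-256 digest; the length is an assumption).
    return [hashlib.sha256(str(v).encode()).hexdigest()[:12] for v in values]

def desensitize_shuffle(values, seed=42):
    # Shuffling: keep the value distribution but break the row-to-person link.
    shuffled = list(values)
    random.Random(seed).shuffle(shuffled)
    return shuffled

def desensitize_numeric(values, scale=0.05, seed=42):
    # Numerical transformation: perturb each number with small relative noise.
    rng = random.Random(seed)
    return [v * (1 + rng.uniform(-scale, scale)) for v in values]

DESENSITIZERS = {
    "substitute": desensitize_substitute,
    "shuffle": desensitize_shuffle,
    "numeric": desensitize_numeric,
}
```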
The processor may determine the model parameters of the first model and of each second model, where model parameters may refer to the weights in the neural network and the like. The processor may compare the model parameters of the first model with those of each second model, respectively, to determine a model difference value between the first model and each second model. To keep the subsequent comparison meaningful, the training configuration of the first model and of each second model may be kept identical. The model difference may be measured by the AUC value, lift, KS statistic, and the like between the models. The processor may determine the second model with the smallest model difference value as the target model, determine the desensitization method corresponding to the target model as the target desensitization method, and then desensitize sample data with the target desensitization method to obtain training data for the neural network model.
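The selection procedure just described can be sketched as follows. This is a minimal sketch, not the patent's implementation: a logistic regression stands in for the neural network model, and the model difference is reduced to an absolute AUC gap because formula (1) is available only as an image.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def train_and_score(X_train, y_train, X_test, y_test):
    # Stand-in for the patent's neural network model; a logistic regression
    # keeps the sketch small while still producing an AUC to compare.
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

def select_target_desensitization(raw_splits, desensitized_splits):
    # raw_splits: (X_train, y_train, X_test, y_test) on non-desensitized data.
    # desensitized_splits: {method_name: (X_train, y_train, X_test, y_test)}.
    v_s = train_and_score(*raw_splits)            # first model's AUC
    differences = {}
    for method, splits in desensitized_splits.items():
        v_e = train_and_score(*splits)            # ith second model's AUC
        # Simplified difference value; the patent's formula (1) may also
        # include a degree-of-overfitting term.
        differences[method] = abs(v_s - v_e)
    return min(differences, key=differences.get)  # target desensitization method
```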
With this technical scheme, a suitable desensitization method can be obtained and used to desensitize non-desensitized data into model training data, reducing the variance in model training effects and thereby improving the training effect of the model.
In one embodiment, the model difference values comprise AUC values between the models, and the model difference value Y_i of each second model from the first model is determined according to formula (1):

[Formula (1) is available in the original publication only as an image.]

wherein Y_i is the model difference value between the first model and the second model trained using sample data obtained by desensitization with the ith desensitization method, V_s is the AUC value of the first model, V_ei is the AUC value of that second model, and the final term of formula (1) is the degree-of-overfitting value between that second model and the first model.
The model difference values for the first model and each second model may include AUC values between the models, where 0 ≤ AUC ≤ 1. The processor may determine the model difference value Y_i of each second model from the first model according to formula (1), with the quantities defined as above.
In one embodiment, inputting non-desensitized sample data into the neural network model to train it and obtain the first model includes: submitting a machine learning pipeline to a first operating environment, and inputting the non-desensitized sample data into the neural network model of the machine learning pipeline in the first operating environment to train the neural network model and obtain the first model.
The processor may submit the machine learning pipeline to the first operating environment, where non-desensitized sample data can be input into the pipeline's neural network model for training. A machine learning pipeline refers to an executable workflow composed of individual steps; for example, it may include data extraction, data validation, data preparation, model training, model evaluation, and model validation. The first operating environment refers to a trusted environment: an operating environment in which non-desensitized data can be accessed directly while the data information fed back to developers remains controlled. Taking identification numbers as non-desensitized data as an example, what a modeler sees at the front end may be desensitized data, while in the first operating environment the machine learning pipeline is allowed to read the non-desensitized data. Thus, in the first operating environment, a modeler can have the pipeline extract information helpful for model training, such as gender and place of birth, from the encoding rules of the identification number. The first model refers to the model obtained by training the neural network model on non-desensitized sample data.
In one embodiment, inputting the desensitization data corresponding to each desensitization method into the neural network model to train it and obtain a plurality of second models includes: submitting the machine learning pipeline to a second operating environment, and inputting the desensitization data corresponding to each desensitization method into the neural network model of the machine learning pipeline in the second operating environment, respectively, to train the neural network model and obtain the plurality of second models.
The processor may submit the machine learning pipeline to the second operating environment, where the desensitization data corresponding to each desensitization method can be input into the pipeline's neural network model for training. The second operating environment refers to a debugging environment, specifically a machine learning pipeline debugging environment: an operating environment that can access only desensitized data. A second model refers to a model obtained by training the neural network model on desensitization data.
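The trusted/debugging split could be wired up as in the sketch below. The PipelineRunner class, the environment names, and the data-store methods are all hypothetical, introduced only to make the environment-specific data-access rules concrete.

```python
class DataStore:
    """Hypothetical data store for the model development data area."""

    def __init__(self, raw_rows, desensitized_rows):
        self._raw = raw_rows
        self._desensitized = desensitized_rows

    def read_raw(self):
        return self._raw

    def read_desensitized(self):
        return self._desensitized


class PipelineRunner:
    """Hypothetical runner that submits a machine learning pipeline to an
    operating environment and enforces environment-specific data access."""

    def __init__(self, environment):
        if environment not in ("trusted", "debugging"):
            raise ValueError(f"unknown environment: {environment}")
        self.environment = environment

    def load_training_data(self, store):
        if self.environment == "trusted":
            # Trusted environment: the pipeline itself may read raw
            # (sensitive) data; only desensitized views reach developers.
            return store.read_raw()
        # Debugging environment: only desensitized data is reachable.
        return store.read_desensitized()
```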
In one embodiment, the method further comprises: acquiring target training parameters of the target model, wherein the target training parameters comprise the environmental operating parameters of the second operating environment in which the target model was trained, the model parameters of the target model, and the desensitization method used for the desensitization data that trained the target model; and determining the target training parameters as the training parameters for subsequent model training.
Once the second model with the smallest model difference value has been determined as the target model, the processor may acquire its target training parameters. The target training parameters may include the environmental operating parameters of the second operating environment in which the target model was trained (for example, the machine learning pipeline code in that environment), the model parameters of the target model (such as the weights of the neural network), and the desensitization method used for the desensitization data that trained the target model (such as substitution, shuffling, numerical transformation, or encryption). Having acquired the target training parameters, the processor may adopt them as the training parameters for subsequent model training.
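As a concrete (and entirely hypothetical) shape for the retained configuration, the record below bundles the three kinds of target training parameters named above; every field name is an illustrative assumption, not patent terminology.

```python
from dataclasses import dataclass
from typing import Any, Dict

@dataclass
class TargetTrainingParameters:
    # Hypothetical record of what is kept once the target model is chosen.
    environment_parameters: Dict[str, Any]  # e.g. pipeline code revision of the second operating environment
    model_parameters: Dict[str, Any]        # e.g. neural network weights / hyperparameters
    desensitization_method: str             # e.g. "substitute", "shuffle", "numeric"

def retain_for_subsequent_training(registry: list, params: TargetTrainingParameters) -> None:
    # Persist the winning configuration so later training runs can reuse it.
    registry.append(params)
```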
In one embodiment, the first operating environment is a trusted environment, the data in the trusted environment includes sensitive data, and the second operating environment is a debugging environment, and the data in the debugging environment is desensitized data.
The first operating environment may refer to a trusted environment, and the data in the trusted environment may include sensitive data, i.e., data that has not been desensitized. The second runtime environment may refer to a debugging environment, and in particular, may be a machine learning pipeline debugging environment. The data in the debug environment may be desensitized data.
In one embodiment, the model parameters under which the neural network model is trained with desensitized sample data are consistent with the model parameters under which the neural network model is trained with non-desensitized sample data.
A plurality of second models can be obtained by training the neural network model with desensitized sample data, and the first model can be obtained by training the neural network model with non-desensitized sample data. The model parameters used when training with desensitized sample data may be kept consistent with those used when training with non-desensitized sample data; that is, the model parameters of each second model may be consistent with the model parameters of the first model. Model parameters may refer to the weights of the neural network and the like.
In one embodiment, the model difference values include the lift and/or KS statistic between the models.
The lift and the KS statistic are model evaluation indexes: the lift evaluates the predictive power of the model, and the KS statistic evaluates the accuracy of the model's predictions. The processor may determine the model difference value of each second model from the first model based on the lift and/or KS statistic between the models.
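For these alternative metrics, the sketch below computes the KS statistic and the lift from scored predictions using their standard definitions (KS as the maximum TPR−FPR gap along the ROC curve, lift as the positive rate in the top-scored fraction relative to the overall rate); neither computation is patent-specific.

```python
import numpy as np
from sklearn.metrics import roc_curve

def ks_statistic(y_true, y_score):
    # KS statistic: maximum separation between the cumulative distributions
    # of positive and negative scores, i.e. max(TPR - FPR) along the ROC curve.
    fpr, tpr, _ = roc_curve(y_true, y_score)
    return float(np.max(tpr - fpr))

def lift_at(y_true, y_score, fraction=0.1):
    # Lift: positive rate inside the top-scored fraction divided by the
    # overall positive rate (fraction=0.1 means the top decile).
    n_top = max(1, int(len(y_score) * fraction))
    top_idx = np.argsort(y_score)[::-1][:n_top]
    top_rate = np.mean(np.asarray(y_true)[top_idx])
    overall_rate = np.mean(y_true)
    return float(top_rate / overall_rate)
```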
In one embodiment, as shown in FIG. 2, a flow diagram of another method for determining model training data is provided. As shown in FIG. 2, the processor may first analyze and explore the data. It may then develop a machine learning pipeline and test it. If the pipeline fails the test, the data is analyzed and explored again. If the pipeline passes the test, it may be run to train its model. After the model of the machine learning pipeline is trained, the trained models can be compared and analyzed to determine the target model. The processor then judges whether the target model meets the business requirements; if it does not, the data is analyzed and explored again.
In one embodiment, as shown in FIG. 3, a schematic diagram of an application environment for a method for determining model training data is provided.
The processor can analyze and explore data distribution statistics and sampling detail data. The data distribution statistics may be obtained from the model development data area and do not involve sensitive data. The sampling detail data may be obtained from the detail data of the model development data area; the detail data include sampling detail data and full detail data, both of which may involve sensitive data. When the sampling detail data is read, the model development data access controller can apply a preset sensitive-data identification and replacement strategy to substitute sensitive values in the sampling detail data before returning it.
In one embodiment, as shown in FIG. 4, a schematic diagram of an application environment for another method for determining model training data is provided.
In developing the machine learning pipeline, the processor may submit it to a machine learning pipeline commissioning environment. A machine learning pipeline refers to an executable workflow composed of individual steps; for example, it may include data extraction, data validation, data preparation, model training, model evaluation, and model validation. The commissioning environment comprises a machine learning pipeline debugging environment and a machine learning pipeline trusted environment.
If the pipeline is submitted to the machine learning pipeline debugging environment, the processor can read sampled desensitization data from the model development data area and input it into the pipeline to train the model there. The sampled desensitization data is obtained by taking sampling detail data from the detail data in the model development data area and desensitizing it with different desensitization methods. After the model is trained, the processor may evaluate and validate its training effect in the machine learning pipeline.
If the pipeline is submitted to the machine learning pipeline trusted environment, the processor can read sampled sensitive data from the detail data of the model development data area and input it into the pipeline to train the model there. After the model is trained, the processor may evaluate and validate its training effect in the machine learning pipeline.
The processor may compare the models trained in the machine learning pipeline debugging environment and in the machine learning pipeline trusted environment to determine the model differences between them; specifically, the difference between models may be determined from model indexes. The processor may then determine the model with the smallest difference as the target model and identify the desensitization method corresponding to it. The processor may return the running-result information of that model for the model developers to review; the running-result information may include the model index information, the machine learning pipeline code, the model training parameters, and the desensitization method corresponding to the model with the smallest difference in model index.
In one embodiment, as shown in FIG. 5, a schematic diagram of an application environment for another method for determining model training data is also provided.
The processor may submit the machine learning pipeline to a machine learning pipeline execution environment. As before, the pipeline may include data extraction, data validation, data preparation, model training, model evaluation, and model validation. The processor may read the full detail data from the detail data in the model development data area and desensitize it with the desensitization method corresponding to the target model. The processor may then input the desensitized full detail data into the machine learning pipeline to train the model on it. After the model is trained, the processor may evaluate and validate its training effect, and once the model is verified, return the running-result information for the model developers to review the pipeline's results.
With this technical scheme, a suitable desensitization method can be obtained and used to desensitize non-desensitized data into model training data, reducing the variance in model training effects and thereby improving the training effect of the model. At the same time, desensitizing non-desensitized data with this method improves the safety of data desensitization and greatly reduces the cost of security control over sensitive data.
FIGS. 1 and 2 are schematic flow diagrams of a method for determining model training data in an embodiment. It should be understood that, although the steps in the flowcharts of FIGS. 1 and 2 are shown in the sequence indicated by the arrows, they are not necessarily performed in that sequence. Unless explicitly stated otherwise, there is no strict ordering constraint on these steps, and they may be performed in other orders. Moreover, at least some of the steps in FIGS. 1 and 2 may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different times, and whose order of execution is not necessarily sequential; they may be performed in turns or alternately with other steps or with at least some of the sub-steps or stages of other steps.
An embodiment of the present application provides a processor configured to run a program, wherein the program, when running, performs the above method for determining model training data.
In one embodiment, as shown in FIG. 6, an apparatus for determining model training data is provided, comprising a first training module, a data desensitization module, a second training module, a model comparison module, and a model selection module, wherein:
the first training module 601 is configured to input sample data which is not subjected to desensitization processing into the neural network model to train the neural network model, so as to obtain a first model.
A data desensitization module 602 configured to perform desensitization processing on the sample data by a plurality of desensitization methods to obtain desensitization data corresponding to each desensitization method.
A second training module 603 configured to input desensitization data corresponding to each desensitization method to the neural network model, respectively, to train the neural network model to obtain a plurality of second models.
A model comparison module 604 configured to determine model parameters of the first model and of each second model, and to compare the model parameters of the first model with those of each second model, respectively, to determine a model difference value between the first model and each second model.
A model selection module 605 configured to determine the second model with the smallest model difference value as the target model, determine the desensitization method corresponding to the target model as the target desensitization method, and desensitize the sample data with the target desensitization method to obtain training data for the neural network model.
In one embodiment, the model difference values comprise AUC values between the models, and the model comparison module is further configured to determine the model difference value Y_i of each second model from the first model according to formula (1):

[Formula (1) is available in the original publication only as an image.]

wherein Y_i is the model difference value between the first model and the second model trained using sample data obtained by desensitization with the ith desensitization method, V_s is the AUC value of the first model, V_ei is the AUC value of that second model, and the final term of formula (1) is the degree-of-overfitting value between that second model and the first model.
In one embodiment, the first training module is further configured to: submit a machine learning pipeline to a first operating environment, and input non-desensitized sample data into the neural network model of the machine learning pipeline in the first operating environment to train the neural network model and obtain the first model; and the second training module is further configured to: submit the machine learning pipeline to a second operating environment, and input the desensitization data corresponding to each desensitization method into the neural network model of the machine learning pipeline in the second operating environment, respectively, to train the neural network model and obtain the plurality of second models.
In one embodiment, the model selection module is further configured to: acquire target training parameters of the target model, wherein the target training parameters comprise the environmental operating parameters of the second operating environment in which the target model was trained, the model parameters of the target model, and the desensitization method used for the desensitization data that trained the target model; and determine the target training parameters as the training parameters for subsequent model training.
In one embodiment, the first operating environment is a trusted environment whose data includes sensitive data, and the second operating environment is a debugging environment whose data is desensitized data.
In one embodiment, the model parameters used when training the neural network model with desensitized sample data are consistent with the model parameters used when training it with non-desensitized sample data.
In one embodiment, the model difference values include the lift and/or KS statistic between the models.
Embodiments of the present application provide a storage medium on which a program is stored, which when executed by a processor implements the above-described method for determining model training data.
In one embodiment, a computer device is provided, which may be a server, whose internal structure may be as shown in FIG. 7. The computer device includes a processor A01, a network interface A02, a memory (not shown), and a database (not shown) connected by a system bus. The processor A01 of the computer device provides computing and control capabilities. The memory of the computer device comprises an internal memory A03 and a non-volatile storage medium A04. The non-volatile storage medium A04 stores an operating system B01, a computer program B02, and a database (not shown in the figure). The internal memory A03 provides an environment for the operation of the operating system B01 and the computer program B02 in the non-volatile storage medium A04. The database of the computer device is used for storing data such as sample data and model parameters. The network interface A02 of the computer device is used for communicating with an external terminal through a network connection. The computer program B02 is executed by the processor A01 to implement a method for determining model training data.
Those skilled in the art will appreciate that the architecture shown in FIG. 7 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply; a particular computing device may include more or fewer components than those shown, combine certain components, or have a different arrangement of components.
An embodiment of the present application provides a device comprising a processor, a memory, and a program stored on the memory and runnable on the processor, wherein the processor, when executing the program, implements the following steps: inputting sample data that has not been desensitized into the neural network model to train it and obtain a first model; desensitizing the sample data with a plurality of desensitization methods to obtain desensitization data corresponding to each desensitization method; inputting the desensitization data corresponding to each desensitization method into the neural network model, respectively, to train it and obtain a plurality of second models; determining model parameters of the first model and of each second model; comparing the model parameters of the first model with those of each second model, respectively, to determine a model difference value between the first model and each second model; determining the second model with the smallest model difference value as the target model; and determining the desensitization method corresponding to the target model as the target desensitization method, and desensitizing the sample data with the target desensitization method to obtain training data for the neural network model.
In one embodiment, the model difference values comprise AUC values between models, and the model difference value Y_i of each second model from the first model is determined according to formula (1):

[Formula (1) is available in the original publication only as an image.]

wherein Y_i is the model difference value between the first model and the second model trained using sample data obtained by desensitization with the ith desensitization method, V_s is the AUC value of the first model, V_ei is the AUC value of that second model, and the final term of formula (1) is the degree-of-overfitting value between that second model and the first model.
In one embodiment, inputting non-desensitized sample data into the neural network model to train it and obtain the first model includes: submitting a machine learning pipeline to a first operating environment, and inputting the non-desensitized sample data into the neural network model of the machine learning pipeline in the first operating environment to train the neural network model and obtain the first model; and inputting the desensitization data corresponding to each desensitization method into the neural network model to train it and obtain a plurality of second models includes: submitting the machine learning pipeline to a second operating environment, and inputting the desensitization data corresponding to each desensitization method into the neural network model of the machine learning pipeline in the second operating environment, respectively, to train the neural network model and obtain the plurality of second models.
In one embodiment, the method further comprises: acquiring target training parameters of the target model, wherein the target training parameters comprise the environmental operating parameters of the second operating environment in which the target model was trained, the model parameters of the target model, and the desensitization method used for the desensitization data that trained the target model; and determining the target training parameters as the training parameters for subsequent model training.
In one embodiment, the first operating environment is a trusted environment whose data includes sensitive data, and the second operating environment is a debugging environment whose data is desensitized data.
In one embodiment, the model parameters used when training the neural network model with desensitized sample data are consistent with the model parameters used when training it with non-desensitized sample data.
In one embodiment, the model difference values include the lift and/or KS statistic between the models.
The present application further provides a computer program product adapted to perform, when executed on a data processing device, a program initializing the following method steps: inputting sample data that has not been desensitized into the neural network model to train it and obtain a first model; desensitizing the sample data with a plurality of desensitization methods to obtain desensitization data corresponding to each desensitization method; inputting the desensitization data corresponding to each desensitization method into the neural network model, respectively, to train it and obtain a plurality of second models; determining model parameters of the first model and of each second model; comparing the model parameters of the first model with those of each second model, respectively, to determine a model difference value between the first model and each second model; determining the second model with the smallest model difference value as the target model; and determining the desensitization method corresponding to the target model as the target desensitization method, and desensitizing the sample data with the target desensitization method to obtain training data for the neural network model.
In one embodiment, the model difference values comprise AUC values between models, and the model difference value Y_i of each second model from the first model is determined according to formula (1):

[Formula (1) is available in the original publication only as an image.]

wherein Y_i is the model difference value between the first model and the second model trained using sample data obtained by desensitization with the ith desensitization method, V_s is the AUC value of the first model, V_ei is the AUC value of that second model, and the final term of formula (1) is the degree-of-overfitting value between that second model and the first model.
In one embodiment, inputting non-desensitized sample data into the neural network model to train it and obtain the first model includes: submitting a machine learning pipeline to a first operating environment, and inputting the non-desensitized sample data into the neural network model of the machine learning pipeline in the first operating environment to train the neural network model and obtain the first model; and inputting the desensitization data corresponding to each desensitization method into the neural network model to train it and obtain a plurality of second models includes: submitting the machine learning pipeline to a second operating environment, and inputting the desensitization data corresponding to each desensitization method into the neural network model of the machine learning pipeline in the second operating environment, respectively, to train the neural network model and obtain the plurality of second models.
In one embodiment, the method further comprises: acquiring target training parameters of the target model, wherein the target training parameters comprise the environmental operating parameters of the second operating environment in which the target model was trained, the model parameters of the target model, and the desensitization method used for the desensitization data that trained the target model; and determining the target training parameters as the training parameters for subsequent model training.
In one embodiment, the first operating environment is a trusted environment whose data includes sensitive data, and the second operating environment is a debugging environment whose data is desensitized data.
In one embodiment, the model parameters used when training the neural network model with desensitized sample data are consistent with the model parameters used when training it with non-desensitized sample data.
In one embodiment, the model difference values include the lift and/or KS statistic between the models.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media, such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above are merely examples of the present application and are not intended to limit it. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present application should be included within the scope of the claims of the present application.

Claims (17)

1. A method for determining model training data, the method comprising:
inputting sample data that has not been subjected to desensitization processing into the neural network model to train the neural network model and obtain a first model;
performing desensitization processing on the sample data by a plurality of desensitization methods to obtain desensitized data corresponding to each desensitization method;
inputting the desensitized data corresponding to each desensitization method into the neural network model respectively to train the neural network model and obtain a plurality of second models;
determining model parameters of the first model and of each second model;
comparing the model parameters of the first model with the model parameters of each second model respectively to determine a model difference value between the first model and each second model;
determining the second model with the minimum model difference value as a target model; and
determining the desensitization method corresponding to the target model as a target desensitization method, and performing desensitization processing on the sample data using the target desensitization method to obtain data for training the neural network model.
2. The method for determining model training data of claim 1, wherein the model difference values comprise AUC values between models, and the model difference value Y_i between each second model and the first model is determined according to equation (1):

[equation (1) is filed as an image (FDA0003574015280000011) and is not reproduced in this text]

wherein Y_i denotes the model difference value between the first model and the second model trained using sample data desensitized by the ith desensitization method, V_s is the AUC value of the first model, V_ei is the AUC value of the second model trained using sample data desensitized by the ith desensitization method, and the remaining symbol (filed as image FDA0003574015280000012) denotes a process fit value between that second model and the first model.
3. The method for determining model training data of claim 1, wherein inputting sample data that has not been subjected to desensitization processing into the neural network model to train the neural network model and obtain the first model comprises: submitting a machine learning pipeline to a first operating environment, and inputting the un-desensitized sample data into the neural network model of the machine learning pipeline in the first operating environment, so as to train the neural network model and obtain the first model; and
inputting the desensitized data corresponding to each desensitization method into the neural network model to train the neural network model and obtain a plurality of second models comprises: submitting the machine learning pipeline to a second operating environment, and inputting the desensitized data corresponding to each desensitization method into the neural network model of the machine learning pipeline in the second operating environment, so as to train the neural network model and obtain the plurality of second models.
4. The method for determining model training data of claim 3, further comprising:
acquiring target training parameters of the target model, wherein the target training parameters comprise operating parameters of the second operating environment in which the target model was trained, model parameters of the target model, and the desensitization method used to produce the desensitized data on which the target model was trained; and
determining the target training parameters as the training parameters for subsequent model training.
5. The method for determining model training data of claim 3, wherein the first operating environment is a trusted environment in which the data comprises sensitive data, and the second operating environment is a debugging environment in which the data is desensitized data.
6. The method for determining model training data of claim 1, wherein the model parameters used when training the neural network model with desensitized sample data are consistent with the model parameters used when training the neural network model with un-desensitized sample data.
7. The method for determining model training data of claim 1, wherein the model difference values comprise the lift and/or the Kolmogorov-Smirnov (KS) statistic between models.
8. A processor configured to perform the method for determining model training data according to any one of claims 1 to 7.
9. An apparatus for determining model training data, the apparatus comprising:
a first training module configured to input sample data that has not been subjected to desensitization processing into the neural network model to train the neural network model and obtain a first model;
a data desensitization module configured to perform desensitization processing on the sample data by a plurality of desensitization methods to obtain desensitized data corresponding to each desensitization method;
a second training module configured to input the desensitized data corresponding to each desensitization method into the neural network model respectively to train the neural network model and obtain a plurality of second models;
a model comparison module configured to determine model parameters of the first model and of each second model, and to compare the model parameters of the first model with the model parameters of each second model respectively to determine a model difference value between the first model and each second model; and
a model selection module configured to determine the second model with the minimum model difference value as a target model, determine the desensitization method corresponding to the target model as a target desensitization method, and perform desensitization processing on the sample data using the target desensitization method to obtain data for training the neural network model.
10. The apparatus of claim 9, wherein the model difference values comprise AUC values between models, and the model comparison module is further configured to determine the model difference value Y_i between each second model and the first model according to equation (1):

[equation (1) is filed as an image (FDA0003574015280000031) and is not reproduced in this text]

wherein Y_i denotes the model difference value between the first model and the second model trained using sample data desensitized by the ith desensitization method, V_s is the AUC value of the first model, V_ei is the AUC value of the second model trained using sample data desensitized by the ith desensitization method, and the remaining symbol (filed as image FDA0003574015280000041) denotes a process fit value between that second model and the first model.
11. The apparatus for determining model training data of claim 9, wherein the first training module is further configured to: submit a machine learning pipeline to a first operating environment, and input sample data that has not been subjected to desensitization processing into the neural network model of the machine learning pipeline in the first operating environment, so as to train the neural network model and obtain the first model; and
the second training module is further configured to: submit the machine learning pipeline to a second operating environment, and input the desensitized data corresponding to each desensitization method into the neural network model of the machine learning pipeline in the second operating environment, so as to train the neural network model and obtain the plurality of second models.
12. The apparatus for determining model training data of claim 11, wherein the model selection module is further configured to:
acquire target training parameters of the target model, wherein the target training parameters comprise operating parameters of the second operating environment in which the target model was trained, model parameters of the target model, and the desensitization method used to produce the desensitized data on which the target model was trained; and
determine the target training parameters as the training parameters for subsequent model training.
13. The apparatus for determining model training data of claim 11, wherein the first operating environment is a trusted environment in which the data comprises sensitive data, and the second operating environment is a debugging environment in which the data is desensitized data.
14. The apparatus for determining model training data of claim 9, wherein the model parameters used when training the neural network model with desensitized sample data are consistent with the model parameters used when training the neural network model with un-desensitized sample data.
15. The apparatus for determining model training data of claim 9, wherein the model difference values comprise the lift and/or the Kolmogorov-Smirnov (KS) statistic between models.
16. A machine-readable storage medium having instructions stored thereon which, when executed by a processor, cause the processor to perform the method for determining model training data according to any one of claims 1 to 7.
17. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the method for determining model training data according to any one of claims 1 to 7.
CN202210334436.9A 2022-03-30 2022-03-30 Method, apparatus, storage medium, and processor for determining model training data Pending CN114912139A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210334436.9A CN114912139A (en) 2022-03-30 2022-03-30 Method, apparatus, storage medium, and processor for determining model training data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210334436.9A CN114912139A (en) 2022-03-30 2022-03-30 Method, apparatus, storage medium, and processor for determining model training data

Publications (1)

Publication Number Publication Date
CN114912139A true CN114912139A (en) 2022-08-16

Family

ID=82763054

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210334436.9A Pending CN114912139A (en) 2022-03-30 2022-03-30 Method, apparatus, storage medium, and processor for determining model training data

Country Status (1)

Country Link
CN (1) CN114912139A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115514564A (en) * 2022-09-22 2022-12-23 窦彦彬 Data security processing method and system based on data sharing


Similar Documents

Publication Publication Date Title
CN108876133B (en) Risk assessment processing method, device, server and medium based on business information
CN109344906B (en) User risk classification method, device, medium and equipment based on machine learning
US11354749B2 (en) Computing device for machine learning based risk analysis
CN112527321B (en) Deep learning-based application online method, system, device and medium
CN112765659A (en) Data leakage protection method for big data cloud service and big data server
CN111222994A (en) Client risk assessment method, device, medium and electronic equipment
AU2020419020A1 (en) Creating predictor variables for prediction models from unstructured data using natural language processing
CN110888625A (en) Method for controlling code quality based on demand change and project risk
CN114912139A (en) Method, apparatus, storage medium, and processor for determining model training data
CN115936895A (en) Risk assessment method, device and equipment based on artificial intelligence and storage medium
CN118445412A (en) Method and device for detecting risk text, storage medium and electronic equipment
CN114971638A (en) Transaction authentication method and device based on risk identification
CN115099988A (en) Model training method, data processing method, device and computer medium
CN116957828A (en) Method, equipment, storage medium and device for checking account
CN115827290A (en) Processing strategy determination method and device, storage medium and electronic equipment
CN111737090B (en) Log simulation method and device, computer equipment and storage medium
CN114792007A (en) Code detection method, device, equipment, storage medium and computer program product
CN110827144A (en) Application risk evaluation method and application risk evaluation device for user and electronic equipment
CN113591932B (en) User abnormal behavior processing method and device based on support vector machine
Orony et al. Automobile odometer fraud prevention with the implementation of blockchain and deep learning
CN118378299A (en) Document desensitizing method, device, computer program product and electronic equipment
CN114757788A (en) User transaction behavior identification method and device
CN116205745A (en) Financial system safety processing method and system based on artificial intelligence
Kayabaşı A credit classification application with machine learning methods: German credit dataset example
CN118096370A (en) Transaction behavior detection method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination