CN117236420B

CN117236420B - Method and system for debugging vertical federation learning abnormal data based on data subset

Info

Publication number: CN117236420B
Application number: CN202311509786.5A
Authority: CN
Inventors: 韩培义; 刘川意; 郭蕴哲; 段少明
Original assignee: Harbin Institute Of Technology shenzhen Shenzhen Institute Of Science And Technology Innovation Harbin Institute Of Technology
Current assignee: Harbin Institute Of Technology shenzhen Shenzhen Institute Of Science And Technology Innovation Harbin Institute Of Technology
Priority date: 2023-11-14
Filing date: 2023-11-14
Publication date: 2024-03-26
Anticipated expiration: 2043-11-14
Also published as: CN117236420A

Abstract

The invention discloses a method and a system for debugging abnormal data in vertical federated learning based on data subsets. The method comprises the following steps: an initiator builds a model based on vertical federated learning and trains the federated model; a problem data subset in the data set is obtained using the trained federated model, wherein the prediction accuracy of the federated model on the problem data subset is lower than on the other data subsets; the problem data subsets are screened based on combinations of feature descriptions to obtain problem data subsets with anomaly descriptions; and the initiator or a participant traces and corrects data based on the problem data subsets with anomaly descriptions and retrains the federated model after correction. The invention provides a privacy-preserving federated data subset evaluation technique that correctly computes evaluation metrics for federated data subsets while preserving data privacy, forming a data-subset-based federated learning debugging method that automatically locates abnormal data and resolves abnormal performance of the federated learning model.

Description

Method and system for debugging vertical federation learning abnormal data based on data subset

Technical Field

The invention belongs to the technical field of computers, and particularly relates to a method and a system for debugging longitudinal federal learning abnormal data based on a data subset.

Background

Federated learning techniques involving multiple data holders are currently used in areas such as financial management and intelligent healthcare. However, debugging techniques for federated learning models remain largely unexplored, for three main reasons: 1) Privacy protection: because the data required to train a federated learning model comes from two or more participants, the initiator of model training can hardly access the data and feature information of the other data providers, so data problems are difficult to identify and debugging is difficult to perform; 2) Debugging technology: existing debugging techniques mainly target centralized training, in which the initiator of model training can access all data and debug according to the data, and their execution flow is difficult to adapt to the data distribution in federated learning; 3) Participant cooperation: when a federated learning model encounters problems, the initiator and the data providers often have to find the problems manually, which is time-consuming and labor-intensive, and excessive manual involvement can further cause privacy leakage.

Vertical federated learning targets vertically distributed data, also called sample-aligned data distribution. In this scenario the data is spread across the databases of multiple participants; the participants hold highly overlapping data IDs but have little or no overlap in data features. To run a data analysis task, the participants first compute the intersection of the data IDs they each hold and then extract the records with those shared IDs from each participant's data for the subsequent analysis and query tasks. Vertical data distribution therefore mainly applies when the participants' data sets overlap heavily in users and only slightly in feature dimensions. Vertical distribution of data across multiple parties is common in industry. For example, in a financial risk-control scenario, a bank as a data holder has pre-loan information on some of its customers and can use it to train a pre-loan risk-control model. In a typical federated modeling scenario, the bank brings in operator data to supplement its pre-loan risk control, in which case only customers who appear in both the bank's data and the operator's data can take part in federated modeling. The data held by the bank and the operator is then vertically distributed across the parties: the data IDs participating in federated modeling are present in both the bank's and the operator's data.
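The ID intersection step described above can be sketched as follows. This is a minimal illustration only: the table contents, the column names, and the helper `align_ids` are hypothetical, and the plain set intersection used here exposes the IDs to whoever runs it, whereas a real deployment would be expected to use a privacy-preserving intersection protocol.

```python
# Hypothetical toy tables: each party indexes its own feature columns by data ID.
bank_rows = {
    "u01": {"loan_amount": 12000},
    "u02": {"loan_amount": 5000},
    "u03": {"loan_amount": 30000},
}
operator_rows = {
    "u02": {"monthly_bill": 88},
    "u03": {"monthly_bill": 45},
    "u04": {"monthly_bill": 120},
}


def align_ids(*parties):
    """Return the sorted intersection of the data IDs held by all parties."""
    common = set(parties[0])
    for rows in parties[1:]:
        common &= set(rows)
    return sorted(common)


aligned_ids = align_ids(bank_rows, operator_rows)

# Each party keeps only its own columns for the aligned IDs;
# no feature values are exchanged between the parties.
bank_view = [bank_rows[i] for i in aligned_ids]
operator_view = [operator_rows[i] for i in aligned_ids]

print(aligned_ids)  # ['u02', 'u03']
```

After alignment, both parties refer to the same samples by position, which is what later allows per-sample vectors of equal length to be combined.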

During vertical federated learning, multiple participants train a federated learning model using the data each of them owns. When the data held by every party is normal and free of errors, the performance of the federated learning model usually meets the business requirements. In many cases, however, erroneous or missing data at one participant makes the model perform poorly, so the business requirements cannot be met. For example, in a vertical federated learning scenario between an insurance company and a medical institution, a staff member at the medical institution may, through negligence while operating the database, mistakenly set the test result of some medical indicator to positive for patients aged 30 to 40; such data errors degrade the performance of the federated model. Without looking at the medical institution's data, it is difficult for the insurance company's staff to locate this problem. Existing solutions cannot guarantee that the participants' data stays local while debugging succeeds, nor that debugging can be completed without manual data inspection. Both shortcomings pose a serious risk of data leakage.

A federated learning model is trained by two or more participants, and the data involved must be prepared before training. With two participants, for example, the participant that holds the labels is called the initiator and is usually also the party with the actual business need; the participant without labels is called the collaborator, which usually holds a large number of data features and hopes to earn revenue by providing data services. In a business scenario, however, for various privacy and legal reasons, a participant has the right to query, view, modify, and add only the data it controls and cannot perform these operations on another participant's data. In this scenario, therefore, the personnel operating federated learning must not see data belonging to other participants, and no data participating in federated learning may be sent from its local site to other participants.

In a federated learning scenario, data anomalies may cause the federated model's prediction error rate on part of the test data to rise sharply, resulting in model anomalies. Existing research either has difficulty locating data anomalies under federated learning or is difficult to apply directly to federated learning scenarios.

Disclosure of Invention

The invention aims at the problems and provides a method and a system for debugging longitudinal federal learning abnormal data based on a data subset.

According to a first aspect of embodiments of the present disclosure, there is provided a method for debugging vertical federal learning abnormal data based on a subset of data, the method comprising:

the initiator models based on longitudinal federal learning and performs federal model training;

acquiring a problem data subset in a data set by using the trained federation model, wherein the prediction accuracy of the problem data subset in the federation model is lower than that of other data subsets in the federation model;

screening the problem data subsets based on feature description combinations to obtain problem data subsets with abnormal descriptions;

and the initiator performs data tracing and correction based on the problem data subset with the anomaly description, and retrains the federal model after correction.

In an embodiment, before the screening based on feature description combination is performed on the problem data subset, the discrete features are classified according to the categories of the discrete data, and the continuous features are subjected to data interval segmentation.

In one embodiment, the screening based on feature description combinations uses a secure multiparty computation method to protect the data IDs, and forms the data anomaly description by merging the data IDs by interval.

In an embodiment, the method for debugging vertical federation learning abnormal data further includes a method for protecting federation data subset privacy based on a mask vector, specifically including:

performing ID alignment on the data ID before longitudinal federal learning modeling;

the mask vector is an array whose length equals that of the full intersection data; the ID set corresponding to a feature in each data subset corresponds to an independent mask vector, which exists only at the party whose data set holds that feature; when the data subset involves a plurality of features, the mask vectors held by the respective feature owners are multiplied element-wise using a secure multiparty computation method to determine the true mask vector of the current data subset;

and calculating the evaluation index of the current data subset according to the real mask vector.

In an embodiment, the screening based on feature description combinations further includes training a machine learning model for screening the problem data subsets. During training of the machine learning model, a plurality of public data sets are used, rules are randomly generated to select data subsets whose labels are corrupted, the data subsets that truly have problems are marked as positive samples for model training and the rest as negative samples, and feature data of the different data subsets is obtained through a data subset discovery technique to form the data set for the machine learning model; after training on this data set, the machine learning model is able to finely discriminate problem data subsets; and the problem data subsets are input into the trained machine learning model to obtain the problem data subsets with anomaly descriptions.
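The way training samples for the screening model could be produced is sketched below. It is an illustration under stated assumptions, not the patented procedure: the toy rows, the equality-rule format `(feature, value)`, the full label flip, and the helpers `random_rule` and `corrupt_labels` are all hypothetical, and the per-subset features the source mentions (such as size and effect size) are omitted.

```python
import random


def random_rule(rows, rng):
    """Pick a random (feature, value) equality rule from the data."""
    feature = rng.choice(sorted(rows[0]))
    value = rng.choice(sorted({r[feature] for r in rows}))
    return feature, value


def corrupt_labels(rows, labels, rule):
    """Flip the binary label of every row that matches the rule."""
    feature, value = rule
    return [1 - y if r[feature] == value else y for r, y in zip(rows, labels)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
rows = [
    {"gender": "male", "job": "clerk"},
    {"gender": "female", "job": "clerk"},
    {"gender": "male", "job": "driver"},
    {"gender": "female", "job": "driver"},
]
labels = [1, 0, 1, 0]

rule = random_rule(rows, rng)               # the subset that gets corrupted
noisy_labels = corrupt_labels(rows, labels, rule)

# Every candidate subset becomes one training sample for the screening model:
# positive if it is the subset that was really corrupted, negative otherwise.
candidates = [(f, v) for f in sorted(rows[0]) for v in sorted({r[f] for r in rows})]
screening_labels = [1 if c == rule else 0 for c in candidates]

print(rule, screening_labels)
```

Repeating this over several public data sets and many random rules would yield a labeled collection of subsets on which a classifier can be trained.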

In an embodiment, the secure multiparty computation method uses secret sharing as its underlying technique, the secret sharing technique comprising: randomly splitting a number into two or more shares, distributing the shares to different computing parties, and performing privacy-preserving arithmetic on the shares held by the parties.
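A minimal sketch of additive secret sharing, one common way to realize the splitting described above, is given here. The modulus `MODULUS` and the function names are assumptions chosen for illustration; the source does not specify a particular sharing scheme or parameters.

```python
import secrets

MODULUS = 2**61 - 1  # assumed modulus for this toy example


def share(value, n_parties=2):
    """Split value into n random additive shares modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % MODULUS
    return shares + [last]


def reconstruct(shares):
    """Recombine all shares; any strict subset of them looks random."""
    return sum(shares) % MODULUS


def add_shares(x_shares, y_shares):
    """Each party adds the two shares it holds locally, without communication."""
    return [(x + y) % MODULUS for x, y in zip(x_shares, y_shares)]


x_shares = share(25)
y_shares = share(17)
z_shares = add_shares(x_shares, y_shares)

print(reconstruct(z_shares))  # 42
```

Addition works share by share because the shares sum to the secret; multiplication needs extra machinery, which is why it is usually the costlier operation in such protocols.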

According to a second aspect of embodiments of the present disclosure, there is provided a longitudinal federal learning exception data debug system based on a subset of data, the system comprising:

the federal model training unit is used for the initiator to model and train the federal model based on longitudinal federal learning;

the problem data subset obtaining unit is used for obtaining a problem data subset in a data set by utilizing the trained federation model, and the prediction accuracy of the problem data subset in the federation model is lower than that of other data subsets in the federation model;

the problem data subset screening unit is used for screening the problem data subsets based on feature description combinations and acquiring the problem data subsets with abnormal descriptions;

and the problem data subset correction unit is used for enabling the initiator to trace and correct data based on the problem data subset with the abnormal description, and retraining the federation model after correction.

In an embodiment, the filtering based on the feature description combination in the problem data subset filtering unit further includes training a machine learning model for filtering the problem data subset, in the training process of the machine learning model, using a plurality of public data sets, randomly generating rules to select the data subset for label destruction, marking the data subset actually having the problem as a positive sample label for model training, and the rest as a negative sample label, and acquiring feature data of different data subsets through a data subset discovery technology to form a data set for the machine learning model; after training using the dataset, the machine learning model has the ability to finely resolve the subset of problem data; and inputting the problem data subset into the machine learning model after training is completed, and obtaining the problem data subset with abnormal description.

According to a third aspect of embodiments of the present disclosure, there is provided an electronic device including a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the above-described data subset-based longitudinal federal learning exception data debugging method when executing the program.

According to a fourth aspect of embodiments of the present disclosure, there is provided a non-transitory computer-readable storage medium having stored thereon computer instructions that, when executed by a processor, implement the above-described data subset-based vertical federal learning exception data debugging method.

The method and system for debugging abnormal data in vertical federated learning based on data subsets provided by the embodiments of the present disclosure use a privacy-preserving federated data subset evaluation technique to correctly compute the relevant evaluation metrics of federated data subsets while guaranteeing data privacy. This forms a federated learning debugging framework based on federated data subsets that automatically locates abnormal data and helps personnel resolve anomalies in the federated learning model. During the initiator's debugging process, debugging personnel do not need access to key private information such as the data and features of the federated learning participants, and the participants (or federated learning applications) do not need to disclose or send the data, features, and other information they hold to one another. Specifically:

When the federated model performs poorly because of data errors, the debugging process of the federated model is executed; erroneous data can be found automatically while the participants' data never leaves its local site, and the federated model retrained after the erroneous data is deleted can perform normally.

A privacy-preserving problem data subset search technique is used: without manual involvement and without the data held by any participant leaving its local site, problem subsets in the global data set are found by means of secure multiparty computation. For a problem subset, the prediction accuracy of the federated model on that subset is far lower than its prediction accuracy on the rest of the data set, and the problem data subsets found by this technique can be described by interpretable conditions that are easy for humans to understand.

A screening method for problem data subsets is introduced. After the problem data subsets have been screened with the machine learning model, an erroneous-data tracing technique is used to find the data in the data set that causes label errors in the federated model; finally, debugging is completed by removing the traced erroneous data and retraining the federated model.

A machine-learning-based problem subset filtering technique is used, which screens the many problem data subsets that have been found, discards those that do not meet the actual requirements, and keeps those that actually have a negative impact on the performance of the federated model. The machine learning model used in this technique extracts features and labels from information such as the sizes and effect sizes of multiple existing problem data subsets, which then serve as the training data set; the trained machine learning model is able to finely discriminate problem data subsets.

An erroneous-data tracing technique is used. The technique takes the problem data subsets that have been filtered and, after an interval merging operation, forms descriptions of the problem data subsets. It relies on secure multiparty computation as the underlying technique, so throughout the execution flow the participants' data never leaves its local site and no manual inspection of the data is needed, which protects the participants' data and sensitive information.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and, together with the description, serve to explain the principles of the invention.

FIG. 1 is a schematic flow chart of a method for debugging vertical federal learning abnormal data based on a data subset in an embodiment of the present invention;

FIG. 2 is a logic block diagram of a method for debugging vertical federal learning abnormal data based on a subset of data in an embodiment of the present invention;

FIG. 3 is a schematic diagram of data subset definitions in an embodiment of the invention;

FIG. 4 is a schematic diagram of a screening process of a subset of data in an embodiment of the present invention;

FIG. 5 is a flowchart of a federal data subset privacy protection method based on mask vectors in an embodiment of the present invention;

FIG. 6 is a schematic diagram of error rate calculation for a subset of federal data in an embodiment of the present invention;

FIG. 7 is a flow chart of a method for screening a subset of questions based on machine learning in an embodiment of the invention;

FIG. 8 is a schematic diagram of a vertical federal learning abnormal data debug system based on a subset of data in an embodiment of the present invention;

FIG. 9 is a schematic diagram of an electronic device according to an embodiment of the invention.

Description of the embodiments

The invention is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. It should be further noted that, for convenience of description, only some, but not all of the structures related to the present invention are shown in the drawings.

Before discussing exemplary embodiments in more detail, it should be mentioned that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart depicts steps as a sequential process, many of the steps may be implemented in parallel, concurrently, or with other steps. Furthermore, the order of the steps may be rearranged. The process may be terminated when its operations are completed, but may have additional steps not included in the figures. The processes may correspond to methods, functions, procedures, subroutines, and the like.

The embodiment of the invention provides a method and a system for debugging longitudinal federal learning abnormal data based on a data subset, and provides the following embodiments:

Embodiment 1 illustrates a method for debugging abnormal data in vertical federated learning based on data subsets. FIG. 1 is a flowchart of the method, which specifically includes the following steps:

S1, the initiator builds a model based on vertical federated learning and trains the federated model;

S2, a problem data subset in the data set is obtained using the trained federated model, wherein the prediction accuracy of the federated model on the problem data subset is lower than on the other data subsets;

S3, screening the problem data subsets based on feature description combinations to obtain problem data subsets with abnormal descriptions;

and S4, the initiator or the participant performs data tracing and correction based on the problem data subset with the abnormal description, and retrains the federal model after correction.

In a specific implementation, FIG. 2 shows the overall framework of the debugging method of this embodiment. Without affecting the training of the existing federated learning model, the already trained model is used to debug the data by means of test data. The specific steps of the debugging flow are as follows:

Step one: participant A (the initiator of the federated learning process) and participant B (a data provider of the federated learning process) train the federated model using the data sets they each hold. The training process includes data preprocessing, federated feature engineering, federated learning training, and so on. In the federated learning debugging framework based on federated data subsets, step one is identical to the existing federated learning training process, and the existing framework and code need not be modified;

Step two: participant A and participant B put the model into actual business use, feed test data as input, obtain the prediction results of the federated learning model, and verify those results in actual business use. If the prediction results of the federated model show no anomaly, the federated debugging framework does not need to intervene; if the prediction results are found to be abnormal, the federated debugging stage is entered;

Step three: participant A initiates the federated debugging stage and participant B cooperates by following the federated debugging steps. The two parties never exchange raw data during federated debugging, and when the stage ends participant A obtains its output;

Step four: participant A communicates with participant B about the output of federated debugging to resolve the data problem. Because data anomalies can be complex, the party correcting the anomaly, that is, the holder of the affected data, still has to intervene and adjust the data with the help of the anomaly information; no other data holder (participant) needs to intervene in or observe the adjustment, so the privacy protection capability is maintained. After the abnormal data problem is resolved, the federated learning model is retrained.

The privacy-preserving automated federated learning debugging method supports multiple participants, one of which acts as the initiator of the federated learning process while the others act as collaborators. The specific technical flow of privacy-preserving automated federated learning debugging is as follows: 1) Before debugging, the federated model is trained; the training is started by the initiator and the model is built with vertical federated learning, with logistic regression taken as the example in this embodiment; 2) After the federated model is trained, the problem data subsets in the whole data set are computed. For a problem data subset, the prediction accuracy of the federated model on that subset is far lower than its prediction accuracy on the rest of the data set; that is, the trained federated model performs worse on the problem data subset than it does globally; 3) Owing to the nature of the technique, a problem subset found in 2) is not necessarily caused by truly problematic data, so a machine-learning-based problem subset filtering technique is applied next. It screens the problem data subsets that have been found, discards those that do not meet the actual requirements, keeps those that actually have a negative impact on the performance of the federated model, and forms anomaly descriptions for them; 4) After the erroneous data has been traced, participant A communicates with participant B about the output of federated debugging to resolve the data problem. Because data anomalies can be complex, the party correcting the anomaly still has to intervene and adjust the data with the help of the problem data information, but no other data holder (participant) needs to intervene in or observe the adjustment, so the privacy protection capability is maintained; after the abnormal data problem is resolved, the federated learning model is retrained. This completes the privacy-preserving automated federated learning debugging flow.

The federated data subset is defined as follows. As shown in FIG. 3, a privacy-preserving federated data subset is a data subset that spans data participants and is described jointly by features distributed across different participants, for example: sex is male and age is between 35 and 40, where the sex feature resides at participant A and the age feature at participant B. The sample entries belonging to a federated data subset are determined by the IDs of the data samples; that is, each federated data subset maintains a list of the data IDs that satisfy the federated feature description, and the data corresponding to those IDs forms the federated data subset. For a given data subset, if the performance of the trained federated model on that subset (measured by accuracy in the preferred embodiment) is lower than its performance on the rest of the data set, the subset is called a problem data subset. Because no strict limits are placed on the size of a problem data subset or on the performance gap, not every problem data subset reveals the characteristics of problematic data, so further filtering with other metrics is needed in the next step; in the embodiment this is done with a machine-learning-based problem subset filtering technique.

Before screening the problem data subsets based on feature description combination, classifying the discrete features according to the categories of the discrete data, and segmenting the data interval of the continuous features.

Specifically, as shown in FIG. 4, screening of the problem data subsets is divided into four steps: 1) Classify discrete features: the discrete features held by each participant are classified according to the categories of the discrete data; 2) Segment continuous features: the continuous features held by each participant are binned, and the resulting bins are taken as the segmentation result; 3) Screen the problem data subsets based on combinations of feature descriptions; 4) Output the problem subsets.

Regarding the handling of discrete and continuous features: discrete features, such as gender, job type, and home ownership, typically take few distinct values, so in the embodiment each category of a discrete feature can serve on its own as a description condition of a data subset. Continuous features, such as age, deposits, annual income, and social security contribution amount, generally involve far more distinct values and a much wider range than discrete features, so binning is needed to divide a single continuous feature into several data segments, and these segments are used as description conditions of a data subset.
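The two kinds of description conditions can be sketched as follows. Equal-width binning, the tuple format of a condition, and the helper names are assumptions made for this illustration; the source only states that continuous features are binned and does not fix a binning strategy.

```python
def discrete_conditions(name, values):
    """One description condition per distinct category of a discrete feature."""
    return [(name, "==", v) for v in sorted(set(values))]


def continuous_conditions(name, values, n_bins):
    """Equal-width binning (assumed): one interval condition per bin.

    Intervals are read as half-open [lo, hi), except the last, which is
    closed on the right so that the maximum value is covered.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    edges = [lo + i * width for i in range(n_bins)] + [hi]
    return [(name, "in", (edges[i], edges[i + 1])) for i in range(n_bins)]


# Each participant runs this locally on the features it holds.
gender_conditions = discrete_conditions("gender", ["male", "female", "male"])
age_conditions = continuous_conditions("age", [20, 25, 30, 35, 40, 60], n_bins=4)

print(gender_conditions)
print(age_conditions)
```

A candidate data subset is then a combination of such conditions, for example one gender condition from one participant together with one age interval from another.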

After all features have been classified, the invention provides a privacy-preserving scheme for computing data subset metrics. The vertical distribution is taken as an example to show the data leakage risk in this scenario. Assume A and B are two participants holding 10 features each, where feature 1 of A is denoted FA1, feature 10 of B is denoted FB10, and so on; that is, the features held by initiator A are FA1 to FA10 and the features held by collaborator B are FB1 to FB10, where one of B's features is gender (sex). When a data subset is described as sex = male, if initiator A learns which data IDs belong to the data subset, the sex attribute of those IDs is leaked exactly; in other words, initiator A can obtain participant B's information simply by looking at the data IDs. To avoid this data leakage risk, the invention uses secure multiparty computation to protect the data IDs in the problem data subset screening stage, and merges the discovered data IDs by interval in the problem data screening stage to form the data error description.
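One possible reading of the interval merging step is that adjacent flagged bins of the same continuous feature are fused into a single, broader description. The sketch below illustrates that reading in plaintext; the function `merge_intervals` and the example bins are hypothetical, and the protected form of this step is not shown.

```python
def merge_intervals(intervals):
    """Merge half-open intervals [lo, hi) that touch or overlap."""
    merged = []
    for lo, hi in sorted(intervals):
        if merged and lo <= merged[-1][1]:
            # Extend the previous interval instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged


# Hypothetical age bins that survived screening as problem subsets.
flagged_age_bins = [(35, 40), (30, 35), (50, 55)]

print(merge_intervals(flagged_age_bins))  # [(30, 40), (50, 55)]
```

With the bins above, two neighboring five-year bins collapse into a single description "age between 30 and 40", which is easier for a data holder to act on than two separate findings.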

Privacy protection is an important requirement for federated data subsets because, from the viewpoint of any one participant, the feature description of a federated data subset contains value descriptions of features held by other participants; if such a federated feature description is linked to the data IDs, there is a risk of data leakage. Take the data subset from the embodiment, sex is male and age is between 35 and 40, where the sex feature resides at participant A and the age feature at participant B. If the data ID list were disclosed to participants A and B, participant A would learn that the ages of the records in the list are between 35 and 40, which increases the risk of private information leaking, and participant B would learn the sex of the records in the list, which is direct data leakage. In summary, the data ID list of a federated data subset must be invisible to the participants in order to avoid direct or indirect privacy leakage.

To solve this privacy leakage problem, the invention provides a mask-vector-based privacy protection method for federated data subsets. In the vertical federated learning scenario, an ID alignment operation must be performed on the data IDs before modeling; in the embodiment, the data IDs of the data set are assumed by default to have already been aligned. A mask vector is defined as an array whose length equals that of the full ID intersection and whose elements are each 1 or 0. For a federated data subset, the element at the position of each ID belonging to the subset is 1, and all other elements are 0. The contents of the mask vector must be protected: no party other than the party holding the corresponding feature should know the specific value of any element. The correspondence between mask vectors and features is shown in fig. 5.
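As an illustration of the definition above, the following minimal sketch builds one mask vector per feature owner from a feature description. It works on plaintext values for readability; the function name `build_mask`, the column names, and the sample values are hypothetical and do not come from the embodiment, in which each mask stays with its owner and is never disclosed.

```python
from typing import Any, Callable, Sequence


def build_mask(values: Sequence[Any], predicate: Callable[[Any], bool]) -> list[int]:
    """Return a 0/1 mask over ID-aligned records: 1 where the predicate holds."""
    return [1 if predicate(v) else 0 for v in values]


# Hypothetical ID-aligned columns: row i refers to the same data ID for both parties.
sex_held_by_a = ["male", "female", "male", "male", "female"]
age_held_by_b = [36, 38, 52, 39, 35]

# Each party evaluates only the part of the description that concerns its own feature.
mask_a = build_mask(sex_held_by_a, lambda s: s == "male")
mask_b = build_mask(age_held_by_b, lambda a: 35 <= a <= 40)
```

Because the IDs are aligned beforehand, position i of every mask refers to the same record, which is what allows masks from different owners to be combined position by position.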

When the data subset involves a plurality of features, the mask vectors held by the respective feature owners are multiplied element-wise using multiparty secure computing technology to determine the true mask vector of the current data subset; the product remains in an encrypted state. All subsequent index calculations for the data subset use this encrypted mask vector, i.e. the evaluation indexes of the current data subset are calculated from the true mask vector.

Take the accuracy of a data subset over 10 samples as an example. Assume that the vector recording whether each of the 10 predictions is correct is [0,1,1,1,1,1,1,1,1,0], and that the mask vector is the encrypted form of [1,0,1,0,1,1,1,1,0,1]. The accuracy of this data subset is then 5/7×100% = 71.4286%, where 7 is the number of samples in the data subset and 5 is the number of those samples that are predicted correctly. All operations in this calculation are implemented using secure multiparty computing techniques, so there is no risk of data leakage. The same approach extends to other evaluation indexes computed from mask vectors with secure multiparty computing.
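The arithmetic of this example can be checked with the short sketch below. It operates on plaintext vectors purely to show the calculation; in the scheme described here the same multiplications and sums would be carried out under secure multiparty computation on protected values. The function names are illustrative assumptions.

```python
def combine_masks(*masks: list[int]) -> list[int]:
    """Element-wise product of 0/1 masks: a record stays in the subset only if every mask has 1."""
    length = len(masks[0])
    if any(len(m) != length for m in masks):
        raise ValueError("all masks must cover the same ID intersection")
    combined = [1] * length
    for mask in masks:
        combined = [c * v for c, v in zip(combined, mask)]
    return combined


def subset_accuracy(correct: list[int], mask: list[int]) -> float:
    """Accuracy restricted to the masked subset: sum(correct * mask) / sum(mask)."""
    subset_size = sum(mask)
    if subset_size == 0:
        raise ValueError("the data subset is empty")
    hits = sum(c * m for c, m in zip(correct, mask))
    return hits / subset_size


# Vectors taken from the 10-sample example in the text.
correct = [0, 1, 1, 1, 1, 1, 1, 1, 1, 0]
mask = [1, 0, 1, 0, 1, 1, 1, 1, 0, 1]
accuracy = subset_accuracy(correct, mask)  # 5 correct predictions out of 7 subset samples
```

Both the numerator and the denominator are sums of products, so the calculation needs only addition and multiplication, the two operations that the secret sharing scheme supports.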

After the encrypted mask vector is obtained, the error rate and other evaluation indexes of the federated data subset are calculated with the mask vector under the protection of privacy computing technology. The calculation is explained below. Regardless of how the federated model is trained, only party A holds the label Y, and the predicted values for a data set are also obtained by party A. The initiator can therefore determine whether the model's prediction for each sample is correct and form a vector consisting of 0s and 1s. Fig. 6 illustrates this with a data subset over 10 samples: the prediction correctness vector is [0,1,1,1,1,1,1,1,1,0], the mask vector is the encrypted form of [1,0,1,0,1,1,1,1,0,1], and the accuracy of the data subset is 5/7×100% = 71.4286%, where 7 is the number of samples in the data subset and 5 is the number predicted correctly. Equivalently, the error rate of the data subset is 2/7×100% = 28.5714%.

The screening based on feature description combinations further comprises training a machine learning model for screening the problem data subsets. In the training process, a plurality of public data sets are used; data subsets are selected by randomly generated rules and their labels are corrupted; the data subsets that truly contain problem data are marked with positive sample labels for model training and the rest with negative sample labels; and feature data of the different data subsets are collected through a data subset discovery technology to form the data set for the machine learning model. After training is completed, the problem data subsets are input into the machine learning model to obtain the problem data subsets with abnormal descriptions.

Specifically, as shown in fig. 7, the machine-learning-based problem subset filtering technology is motivated as follows. The number of problem data subsets found in a data set is generally huge, and not all of them are caused by real problem data. Many data subsets neither fully reflect the problem data nor help to locate it, for example because they contain too much data, contain too little data, or differ little in performance from the global data. Such data subsets need to be filtered out in the subsequent flow. The present invention therefore proposes a problem subset filtering technique based on machine learning.

The input features of the machine learning model include, but are not limited to, the evaluation indexes of the data subset, the evaluation indexes of the rest of the data set outside the data subset, the number of records in the data subset, the percentage of the data set that the data subset represents, and the ratio between the evaluation indexes of the data subset and those of the remaining data. The evaluation indexes include accuracy, precision, F-Score, effect size, and the like.
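The sketch below shows how such an input feature vector might be assembled for one candidate data subset. It is a plaintext illustration that uses accuracy as the only evaluation index; the function name and dictionary keys are assumptions, and precision, F-Score, and effect size would be added in the same way.

```python
def subset_features(correct: list[int], mask: list[int]) -> dict[str, float]:
    """Build filter-model features for one data subset from 0/1 correctness and mask vectors."""
    inside = [c for c, m in zip(correct, mask) if m == 1]
    outside = [c for c, m in zip(correct, mask) if m == 0]
    if not inside or not outside:
        raise ValueError("the subset and its complement must both be non-empty")
    acc_in = sum(inside) / len(inside)
    acc_out = sum(outside) / len(outside)
    return {
        "subset_accuracy": acc_in,                       # evaluation index of the subset
        "rest_accuracy": acc_out,                        # evaluation index of the remaining data
        "subset_size": float(len(inside)),               # number of records in the subset
        "subset_fraction": len(inside) / len(correct),   # share of the whole data set
        "accuracy_ratio": acc_in / acc_out if acc_out > 0 else float("inf"),
    }


features = subset_features(
    [0, 1, 1, 1, 1, 1, 1, 1, 1, 0],
    [1, 0, 1, 0, 1, 1, 1, 1, 0, 1],
)
```

A subset whose accuracy is clearly below that of the remaining data and whose size is neither negligible nor close to the whole data set is the kind of candidate the filter is meant to keep.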

In the model training stage, a plurality of public data sets are used. A data subset is selected according to a randomly generated rule and its labels are corrupted; data subsets that contain real problem data are marked with positive sample labels for model training, and the rest with negative sample labels. Feature data of the different data subsets are collected through a data subset discovery technology, forming the data set used to train the machine learning model for data subset screening. After screening is completed, the data subsets that the machine learning model classifies as positive are used as the input of the next technical step.
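One possible reading of the label corruption step is sketched below. The text does not specify the form of the random rule or of the corruption, so this sketch assumes a rule that is a random closed interval on one numeric feature and a corruption that flips binary labels; both function names are hypothetical.

```python
import random


def random_interval_rule(values: list[float], rng: random.Random) -> list[int]:
    """Pick a random closed interval over the observed values and return its 0/1 mask."""
    lo, hi = sorted(rng.sample(sorted(set(values)), 2))
    return [1 if lo <= v <= hi else 0 for v in values]


def corrupt_subset_labels(
    labels: list[int], mask: list[int], flip_prob: float, rng: random.Random
) -> list[int]:
    """Flip binary labels inside the masked subset with probability flip_prob."""
    return [
        1 - y if m == 1 and rng.random() < flip_prob else y
        for y, m in zip(labels, mask)
    ]


rng = random.Random(0)
ages = [23, 35, 41, 52, 37, 60]
labels = [0, 1, 1, 0, 1, 0]
mask = random_interval_rule(ages, rng)
corrupted = corrupt_subset_labels(labels, mask, flip_prob=1.0, rng=rng)
```

A subset corrupted in this way would receive a positive sample label when the filter model is trained, since it is known to contain problem data.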

The multiparty secure computing method is implemented with secret sharing as the underlying technology. The secret sharing technology comprises: randomly splitting a number into two or more numbers, distributing the resulting shares to different computing parties, and having each computing party carry out privacy-preserving arithmetic on the shares it holds.

Specifically, secure and controllable multiparty secure computing technology is used to implement the problem data subset search and problem data tracing involved in the whole federated debugging process. This ensures that the result acquirer can correctly obtain the final analysis result but cannot obtain sensitive information other than that result, including but not limited to the private data of other participants and the data IDs of other participants. A data provider, apart from supplying data for computation, cannot view sensitive information owned by other parties or infer it from intermediate execution results. The specific techniques used in the preferred embodiments are described below.

The multiparty secure computing operations are implemented with secret sharing as the underlying technology. Secret sharing means splitting a secret in a suitable way so that each resulting share is managed by a different participant; each participant holds one share, and the participants cooperate to complete computing tasks such as addition and multiplication. No single participant can recover the secret; it can be recovered only when the participants cooperate. Each participant performs addition and multiplication on the shares it holds and sends its share of the result to a result party, which aggregates the shares and reconstructs the result. Throughout the process, no participant obtains any secret information and the result party obtains only the result, so the original data is protected from leakage while the expected result is computed. In a secret sharing system, an attacker must obtain a certain number of secret shares simultaneously to obtain the secret key, which ensures the security of the system. Conversely, when some secret shares are lost or destroyed, the secret can still be recovered from the remaining shares, which ensures the reliability of the system.

A secret sharing scheme consists of a secret splitting algorithm and a secret reconstruction algorithm. Any computational problem can be expressed as an arithmetic circuit composed of addition gates and multiplication gates, so a secret sharing scheme that supports both addition and multiplication can in theory compute any function. The multiparty computing service node supports collaborative computation over data from multiple parties to obtain a model prediction result while no participant can acquire, decrypt, or derive the original data of any other party; algorithm parameters, model parameters, and final results are protected at the same time. A variety of secure multiparty computation operators are supported, including but not limited to arithmetic operations, comparison operations, logical operations, statistical operations, and function operations. Examples are as follows:

Arithmetic operations (+, −, ×, ÷)

Comparison operations (>, ≥, =, ≠, <, ≤)

Logical operations (AND, OR, NOT, etc)

Statistical operations (summation, count, mean, variance, etc.)

The key idea of secret sharing is to split a number into two or more numbers randomly, the split numbers belong to different computing parties, and each computing party can develop arithmetic computation under privacy protection according to the shared data.

Addition (in computer processing, subtraction is converted into addition, i.e., adding the value multiplied by −1):

Assume that party A holds a number x and party B holds a number y, and that x and y are split as x = x1 + x2 and y = y1 + y2.

Each party shares one of its shares with the other: party A sends x2 to party B, and party B sends y1 to party A. Party A then calculates z1 = x1 + y1, and party B calculates z2 = x2 + y2.

Party B sends z2 to party A. Since z = x + y = x1 + x2 + y1 + y2 = z1 + z2, party A can calculate z, i.e. the sum of x and y.
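The addition protocol can be simulated in a single process as in the sketch below. The text splits plain numbers; this sketch additionally assumes that all arithmetic is done modulo a fixed value (`MODULUS`, chosen here as 2^32) so that a random share reveals nothing about the secret. The function names are illustrative.

```python
import secrets

MODULUS = 2**32  # assumed ring size; the text does not fix one


def split(value: int) -> tuple[int, int]:
    """Split value into two additive shares with share1 + share2 = value (mod MODULUS)."""
    share1 = secrets.randbelow(MODULUS)
    share2 = (value - share1) % MODULUS
    return share1, share2


def secure_add(x: int, y: int) -> int:
    """Simulate both parties of the addition protocol and return the reconstructed sum."""
    x1, x2 = split(x)          # party A keeps x1 and sends x2 to party B
    y1, y2 = split(y)          # party B keeps y2 and sends y1 to party A
    z1 = (x1 + y1) % MODULUS   # computed locally by party A
    z2 = (x2 + y2) % MODULUS   # computed locally by party B, then sent to party A
    return (z1 + z2) % MODULUS # party A reconstructs z = x + y


total = secure_add(17, 25)
# Subtraction is handled as addition of the negated value, as noted in the text.
difference = secure_add(10, (-3) % MODULUS)
```

Each value a party receives from the other is a single share, which on its own is uniformly random and therefore carries no information about x or y.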

Multiplication (in computer processing, division operations are converted into multiplication operations, i.e. multiplication of the reciprocal of the denominator):

Party A and party B hold x and y, respectively, and each obtains one share of a random multiplication triple (a, b, ab), where a = [a]1 + [a]2, b = [b]1 + [b]2, and ab = [ab]1 + [ab]2.

The product of x and y is calculated by the following method:

Party A and party B each split their own data into shares and send one share to the other, exchanging [x]2 and [y]1.

Each party blinds the shares it holds locally, and the blinded values d = x − a and e = y − b are reconstructed by addition and disclosed to both parties. In this process, none of x, y, a, b is leaked.

At this point, x·y = (d + a)(e + b) = de + d·[b] + e·[a] + [ab], where the quantities in brackets are the triple values in the shared state, so the problem has been converted into an addition problem. Party B calculates its corresponding part and sends the result to party A, and party A substitutes it into the formula for x·y to obtain the multiplication result.
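The multiplication steps can be simulated in a single process as follows. The sketch assumes that the multiplication triple is produced by a trusted dealer and that arithmetic is done modulo 2^32; neither detail is stated in the text, and the function names are illustrative.

```python
import secrets

MODULUS = 2**32  # assumed ring size; the text does not fix one


def split(value: int) -> tuple[int, int]:
    """Split value into two additive shares modulo MODULUS."""
    share1 = secrets.randbelow(MODULUS)
    return share1, (value - share1) % MODULUS


def beaver_multiply(x: int, y: int) -> int:
    """Simulate both parties multiplying secret-shared x and y with a random triple."""
    # Assumed trusted dealer: random a, b and their product, all handed out as shares.
    a, b = secrets.randbelow(MODULUS), secrets.randbelow(MODULUS)
    a1, a2 = split(a)
    b1, b2 = split(b)
    ab1, ab2 = split(a * b % MODULUS)

    x1, x2 = split(x)  # party A keeps [x]1, party B receives [x]2
    y1, y2 = split(y)  # party B keeps [y]2, party A receives [y]1

    # Each party blinds its shares locally; only d = x - a and e = y - b are opened.
    d = ((x1 - a1) + (x2 - a2)) % MODULUS
    e = ((y1 - b1) + (y2 - b2)) % MODULUS

    # x*y = d*e + d*[b] + e*[a] + [ab]; the public d*e term is added by one party only.
    z1 = (d * e + d * b1 + e * a1 + ab1) % MODULUS  # party A's share of the product
    z2 = (d * b2 + e * a2 + ab2) % MODULUS          # party B's share, sent to party A
    return (z1 + z2) % MODULUS


product = beaver_multiply(6, 7)
```

Because a and b are uniformly random and used once, the opened values d and e reveal nothing about x or y.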

The tracing of the problem data subsets in the embodiment uses an erroneous-data tracing technology. This technology takes the problem data subsets that have passed filtering and performs an interval merging operation on them to form a description of the problem data. The technology is built on multiparty secure computing, so throughout the execution flow the participants' data never leaves their local environments and no manual inspection of the data is required, which ensures the security of the participants' data and sensitive information.
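The interval merging operation is not specified in detail. One plausible form, sketched below under the assumption that each retained problem data subset is described by a closed numeric interval on the same feature, merges intervals that overlap or touch into a single, shorter description. The function name is hypothetical.

```python
def merge_intervals(intervals: list[tuple[float, float]]) -> list[tuple[float, float]]:
    """Merge closed intervals that overlap or touch, returning them in ascending order."""
    merged: list[tuple[float, float]] = []
    for lo, hi in sorted(intervals):
        if merged and lo <= merged[-1][1]:
            # Extend the previous interval instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged


# Hypothetical age intervals of three retained problem data subsets.
description = merge_intervals([(35, 40), (30, 35), (50, 60)])
```

In this example the first two intervals collapse into one range from 30 to 40, so the resulting description names two ranges instead of three.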

Another embodiment is directed to a system for debugging vertical federal learning exception data based on a subset of data, see fig. 8, the system 800 comprising:

a federated model training unit 810, configured for the initiator to model based on vertical federated learning and perform federated model training;

a problem data subset obtaining unit 820, configured to obtain a problem data subset in a data set by using the trained federated model, where the prediction accuracy of the problem data subset in the federated model is lower than that of other data subsets in the federated model;

a problem data subset screening unit 830, configured to screen the problem data subset based on feature description combinations to obtain a problem data subset with abnormal descriptions;

and a problem data subset correcting unit 840, configured to perform data tracing and correction based on the problem data subset with abnormal descriptions, and to retrain the federated model after correction.

The screening based on feature description combinations in the problem data subset screening unit 830 further includes training a machine learning model for screening the problem data subsets. In the training process, a plurality of public data sets are used; data subsets are selected by randomly generated rules and their labels are corrupted; the data subsets that truly contain problem data are marked with positive sample labels for model training and the rest with negative sample labels; and feature data of the different data subsets are collected through a data subset discovery technology to form the data set for the machine learning model. After training is completed, the problem data subsets are input into the machine learning model to obtain the problem data subsets with abnormal descriptions.

In addition to the above modules, the system 800 may include other components, however, since these components are not related to the contents of the embodiments of the present disclosure, illustration and description thereof are omitted herein.

Other specific working processes of the data subset-based vertical federation learning abnormal data debug system 800 refer to the description of the data subset-based vertical federation learning abnormal data debug method embodiment described above, and are not repeated.

Another embodiment illustrates that the system of the present invention may also be implemented with the architecture of a computing device as shown in fig. 9. As shown in fig. 9, the architecture includes a computer system 910, a system bus 930, one or more CPUs 940, input/output 920, memory 950, and the like. The memory 950 may store various data or files used for computer processing and/or communication, as well as program instructions executed by the CPU, including the data subset-based vertical federated learning abnormal data debugging method of the embodiment. The architecture shown in fig. 9 is merely exemplary, and one or more of the components in fig. 9 may be adapted as needed to implement different devices. The memory 950, as a computer readable storage medium, may be used to store software programs, computer executable programs, and modules, such as the program instructions/modules corresponding to the data subset-based vertical federated learning abnormal data debugging method in the embodiment of the present invention (for example, the federated model training unit 810, the problem data subset obtaining unit 820, the problem data subset screening unit 830, and the problem data subset correcting unit 840 in the system 800). The one or more CPUs 940 execute the various functional applications and data processing of the system by running the software programs, instructions, and modules stored in the memory 950, i.e., they implement the above-described data subset-based vertical federated learning abnormal data debugging method, which includes:

Of course, the processor of the server provided by the embodiment of the present invention is not limited to executing the method operations described above, and may also execute the related operations in the longitudinal federal learning abnormal data debugging method based on the data subset provided by any embodiment of the present invention.

The memory 950 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, at least one application program required for functions; the storage data area may store data created according to the use of the terminal, etc. In addition, memory 950 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some examples, memory 950 may further include memory remotely located relative to one or more CPUs 940, which may be connected to the device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The input/output 920 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the device. Input/output 920 may also include a display device such as a display screen.

The embodiment of the invention also provides a non-transitory computer readable storage medium, on which a computer program is stored, which when executed by a processor, implements the method for debugging vertical federal learning abnormal data based on the data subset described in the above embodiment. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

The computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a storage medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

In addition, other specific working processes of a non-transitory computer readable storage medium refer to the description of the embodiment of the method for debugging abnormal data of longitudinal federal learning based on the data subset, and are not repeated.

According to the technical scheme provided by the embodiments, the method and system for debugging vertical federated learning abnormal data based on data subsets use a privacy-preserving federated data subset evaluation technology to correctly calculate the evaluation indexes of federated data subsets while guaranteeing data privacy, and thereby locate abnormal data automatically and help personnel resolve abnormal performance of the federated learning model. The advantage of the method is that, during debugging at the initiator, the debugging personnel do not need access to key private information of the federated learning participants such as their data and features, and the federated learning participants (or federated learning applications) do not need to disclose or send to one another the data, features, or other information they hold. The method comprises the following steps:

A machine-learning-based problem subset filtering technique is used to screen the many problem data subsets that have been found, cull the problem data subsets that do not meet actual requirements, and retain the problem data subsets that genuinely degrade the performance of the federated model. The machine learning model used in this technique is trained on a training data set whose features and labels are extracted from information such as the size and effect size of a plurality of existing problem data subsets.

In this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, or apparatus.

The foregoing is a further detailed description of the invention in connection with the preferred embodiments, and it is not intended that the invention be limited to the specific embodiments described. It will be apparent to those skilled in the art that several simple deductions or substitutions may be made without departing from the spirit of the invention, and these should be considered to be within the scope of the invention.

Claims

1. A method for debugging vertical federal learning abnormal data based on a subset of data, the method comprising:

the initiator or the participant performs data tracing and correction based on the problem data subset with the abnormal description, and retrains the federal model after correction;

the vertical federation learning abnormal data debugging method further comprises a federation data subset privacy protection method based on a mask vector, and specifically comprises the following steps:

2. The method for debugging vertical federal learning abnormal data based on data subsets according to claim 1, wherein before screening the problem data subsets based on feature description combinations, discrete features are classified according to the categories of the discrete data, and data interval segmentation is performed on continuous features.

3. The method for debugging vertical federated learning abnormal data based on data subsets according to claim 1, wherein the screening based on feature description combinations protects the data IDs by a multiparty secure computing method, and forms the data abnormality description by interval merging of the data IDs.

4. The method for debugging vertical federated learning abnormal data based on data subsets according to claim 1, wherein the screening based on feature description combinations further comprises training a machine learning model for screening problem data subsets, wherein in the training process of the machine learning model, a plurality of public data sets are used, data subsets are selected by randomly generated rules and their labels are corrupted, the data subsets that truly contain problem data are marked with positive sample labels for model training and the rest with negative sample labels, and feature data of different data subsets are collected through a data subset discovery technology to form a data set for the machine learning model; and the problem data subsets are input into the machine learning model after training is completed to obtain the problem data subsets with abnormal descriptions.

5. The method for debugging vertical federated learning abnormal data based on data subsets according to claim 1 or claim 3, wherein the multiparty secure computing method is implemented using a secret sharing technique as the underlying technique, the secret sharing technique comprising: randomly splitting a number into two or more numbers, distributing the resulting shares to different computing parties, and having each computing party carry out privacy-preserving arithmetic on the shares it holds.

6. A longitudinal federal learning exception data debug system based on a subset of data, the system comprising:

the problem data subset correction unit is used for enabling an initiator or a participant to trace and correct data based on the problem data subset with the abnormal description, and retraining the federation model after correction;

a mask-vector-based federated data subset privacy protection unit, configured to perform ID alignment on the data IDs before modeling based on vertical federated learning; the mask vector is an array covering the full intersection data, the ID set corresponding to a feature in each data subset corresponds to an independent mask vector, and the independent mask vector exists only in the data set that holds the corresponding feature; when the data subset contains a plurality of features, the mask vectors held by the respective feature owners are multiplied element-wise using a multiparty secure computing method to determine the true mask vector of the current data subset; and the evaluation indexes of the current data subset are calculated according to the true mask vector.

7. The data subset-based vertical federated learning abnormal data debugging system according to claim 6, wherein the screening based on feature description combinations in the problem data subset screening unit further comprises training a machine learning model for screening problem data subsets, wherein in the training process of the machine learning model, a plurality of public data sets are used, data subsets are selected by randomly generated rules and their labels are corrupted, the data subsets that truly contain problem data are marked with positive sample labels for model training and the rest with negative sample labels, and feature data of different data subsets are collected through a data subset discovery technology to form a data set for the machine learning model; and the problem data subsets are input into the machine learning model after training is completed to obtain the problem data subsets with abnormal descriptions.

8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements a longitudinal federal learning exception data debugging method based on a subset of data according to any one of claims 1 to 5 when the program is executed by the processor.

9. A non-transitory computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement a method of data subset based longitudinal federal learning exception data debugging according to any one of claims 1 to 5.