CN115982779A - Data anonymization method and device, electronic equipment and storage medium - Google Patents


Publication number: CN115982779A (application CN202310258354.5A; granted publication CN115982779B)
Authority: CN (China)
Prior art keywords: data, state, secret, sample data, dense
Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Application number: CN202310258354.5A
Original language: Chinese (zh)
Other versions: CN115982779B (en)
Inventors: 赵东, 卞阳, 尤志强
Current and original assignee: Beijing Fucun Technology Co ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Priority: CN202310258354.5A, application filed by Beijing Fucun Technology Co ltd (the priority date is an assumption and is not a legal conclusion)
Publication of CN115982779A; application granted; publication of granted patent CN115982779B
Current legal status: Active


Classification: Storage Device Security (AREA)

Abstract

The application provides a data anonymization method and apparatus, an electronic device, and a storage medium, relating to the technical fields of machine learning, federated learning, and privacy protection, and addressing the problem that the model training data used in federated learning easily leaks personal privacy. In the method, secret-state equidistant binning is performed on secret-shared (secret-state) sample data; by anonymizing the secret-shared data with secret-state equidistant binning, each data value is replaced by a weight-of-evidence (WOE) vector, yielding anonymized secret-state data and thereby reducing the risk of leaking personal privacy. The data anonymization method is mainly intended for scenarios such as federated machine learning and private set intersection.

Description

Data anonymization method and device, electronic equipment and storage medium
Technical Field
The present application relates to the technical fields of machine learning, federated learning, and privacy protection, and in particular to a data anonymization method and apparatus, an electronic device, and a storage medium.
Background
Federated machine learning (FML), also known as federated learning, joint learning, or alliance learning, is a machine learning framework that enables multiple organizations to collaboratively and efficiently train a machine learning model on their respective data while meeting the requirements of user privacy protection, data security, and laws and regulations.
In the federated learning process, raw data is homomorphically encrypted or processed by secure multi-party computation (i.e., secret sharing), and model training is performed on the encrypted or secret-shared data. However, an electronic device participating in federated learning can still decrypt the original data from the homomorphically encrypted data, or reconstruct the original data once it holds enough secret-shared shares. The device may then analyze personal privacy data from the raw data, so the model training data currently used in federated learning is prone to leaking personal privacy.
Disclosure of Invention
An aim of the embodiments of the application is to provide a data anonymization method and apparatus, an electronic device, and a storage medium, which solve the problem that the model training data used in the federated learning process easily leaks personal privacy.
An embodiment of the application provides a data anonymization method, comprising the following steps: performing secret sharing on plaintext sample data to be processed to obtain secret-state sample data; performing secret-state equidistant binning on the secret-state sample data to obtain a plurality of bin ranges; determining a secret-state one-hot matrix corresponding to the secret-state sample data according to the Boolean indicator of whether each secret-state sample falls into each of the bin ranges; and determining secret-state anonymous data corresponding to the plaintext sample data according to the secret-state one-hot matrix. In this scheme, secret-state equidistant binning is performed on the secret-state sample data obtained by secret sharing of the plaintext sample data, and the secret-state anonymous data is determined from the binned secret-state one-hot matrix; that is, the secret-shared sample data is anonymized by secret-state equidistant binning so that each value is replaced by a weight-of-evidence vector, and using the resulting secret-state anonymous data effectively reduces the risk of leaking personal privacy.
Optionally, in this embodiment of the present application, performing secret-state equidistant binning on the secret-state sample data includes: counting the intersection maximum and intersection minimum in the secret-state sample data and determining a sample interval of the secret-state sample data from them; and performing secret-state equidistant binning on the sample interval of the secret-state sample data according to a preset number of bins. In this implementation, the sample interval is equidistantly binned in the secret state according to the preset number of bins, and the secret-state anonymous data corresponding to the plaintext sample data is determined from the binned secret-state one-hot matrix; that is, the secret-shared sample data is anonymized by secret-state equidistant binning so that each value is replaced by a weight-of-evidence vector, and using the resulting secret-state anonymous data effectively reduces the risk of leaking personal privacy.
Optionally, in this embodiment of the present application, determining the secret-state one-hot matrix corresponding to the secret-state sample data according to the Boolean indicator of whether the secret-state sample data falls into each of the bin ranges includes: obtaining a Boolean matrix corresponding to the secret-state sample data from the Boolean indicators of the samples falling into the bin ranges; and converting the Boolean matrix into a secret-state one-hot matrix. In this implementation, the Boolean matrix obtained from those indicators is converted into the secret-state one-hot matrix, which is then used to determine the secret-state anonymous data corresponding to the plaintext sample data; that is, the secret-shared sample data is anonymized by secret-state equidistant binning so that each value is replaced by a weight-of-evidence vector, and using the resulting secret-state anonymous data effectively reduces the risk of leaking personal privacy.
Optionally, in this embodiment of the present application, determining the secret-state anonymous data corresponding to the plaintext sample data according to the secret-state one-hot matrix includes: calculating the weight of evidence of each of the bin ranges to obtain a weight-of-evidence vector; and determining the secret-state anonymous data corresponding to the plaintext sample data from the secret-state one-hot matrix and the weight-of-evidence vector. In this implementation, the weight of evidence of each bin range is calculated by the WOE algorithm to obtain the weight-of-evidence vector, and determining the secret-state anonymous data from the secret-state one-hot matrix and the weight-of-evidence vector safely and effectively protects the information from being leaked.
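As an illustration of the weight-of-evidence computation, the plaintext sketch below derives one WOE value per bin from positive/negative counts. The additive smoothing constant `eps` is an assumption (to avoid division by zero), and in the patent this computation would be carried out on secret-shared data rather than in plaintext.

```python
import math

def woe_vector(bin_ids, labels, n_bins, eps=0.5):
    """Weight of evidence per bin: ln((pos_i/pos_total)/(neg_i/neg_total)).

    bin_ids: bin index of each sample; labels: 1 = positive, 0 = negative.
    eps is an additive smoothing constant (an assumption, not from the patent).
    """
    pos = [eps] * n_bins
    neg = [eps] * n_bins
    for b, y in zip(bin_ids, labels):
        if y == 1:
            pos[b] += 1
        else:
            neg[b] += 1
    pos_total, neg_total = sum(pos), sum(neg)
    return [math.log((pos[i] / pos_total) / (neg[i] / neg_total))
            for i in range(n_bins)]
```

A bin with a higher share of positive samples gets a larger WOE value, which is what lets the WOE vector stand in for the raw values without exposing them.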
Optionally, in this embodiment of the present application, after determining the secret-state anonymous data corresponding to the plaintext sample data according to the secret-state one-hot matrix, the method further includes: performing federated learning of a machine learning model using the secret-state anonymous data. In this implementation, federated learning of the machine learning model on secret-state anonymous data meets the need for enabling client data circulation without exposing sensitive information, effectively improving data security during federated learning.
Optionally, in this embodiment of the present application, performing federated learning of the machine learning model using the secret-state anonymous data includes: performing secret sharing on a sample tag to be processed to obtain secret-state tag data, where the sample tag is the category tag of the sample data; and performing federated learning of the machine learning model using the secret-state anonymous data and the secret-state tag data. In this implementation, federated learning on the secret-state anonymous data and secret-state tag data meets the need for enabling client data circulation without exposing sensitive information, effectively improving data security during federated learning.
Optionally, in this embodiment of the present application, after determining the secret-state anonymous data corresponding to the plaintext sample data according to the secret-state one-hot matrix, the method further includes: recovering the secret-state anonymous data using a threshold scheme of the secret sharing cryptographic mechanism to obtain plaintext anonymous data; and performing federated learning of the machine learning model using the plaintext anonymous data. In this implementation, the secret-state anonymous data is recovered to plaintext anonymous data by the threshold scheme and the plaintext anonymous data is used for federated learning, meeting the need for enabling client data circulation without exposing sensitive information and effectively improving data security during federated learning.
An embodiment of the present application further provides a data anonymization apparatus, including: a sample data obtaining module for performing secret sharing on plaintext sample data to be processed to obtain secret-state sample data; a bin range obtaining module for performing secret-state equidistant binning on the secret-state sample data to obtain a plurality of bin ranges; a one-hot matrix determining module for determining a secret-state one-hot matrix corresponding to the secret-state sample data according to the Boolean indicator of whether the secret-state sample data falls into each of the bin ranges; and an anonymous data determining module for determining secret-state anonymous data corresponding to the plaintext sample data according to the secret-state one-hot matrix.
Optionally, in an embodiment of the present application, the bin range obtaining module includes: a sample interval determining submodule for counting the intersection maximum and intersection minimum in the secret-state sample data and determining the sample interval of the secret-state sample data from them; and a sample interval binning submodule for performing secret-state equidistant binning on the sample interval according to a preset number of bins.
Optionally, in an embodiment of the present application, the one-hot matrix determining module includes: a Boolean matrix obtaining submodule for obtaining a Boolean matrix corresponding to the secret-state sample data from the Boolean indicators of the samples falling into the bin ranges; and a Boolean matrix conversion submodule for converting the Boolean matrix into a secret-state one-hot matrix.
Optionally, in an embodiment of the present application, the anonymous data determining module includes: a weight vector obtaining submodule for calculating the weight of evidence of each of the bin ranges to obtain a weight-of-evidence vector; and a secret-state data determining submodule for determining the secret-state anonymous data corresponding to the plaintext sample data from the secret-state one-hot matrix and the weight-of-evidence vector.
Optionally, in an embodiment of the present application, the data anonymization apparatus further includes: a first federated learning module for performing federated learning of a machine learning model using the secret-state anonymous data.
Optionally, in an embodiment of the present application, the first federated learning module includes: a tag data obtaining submodule for performing secret sharing on a sample tag to be processed to obtain secret-state tag data, where the sample tag is the category tag of the sample data; and a model federated learning submodule for performing federated learning of the machine learning model using the secret-state anonymous data and the secret-state tag data.
Optionally, in an embodiment of the present application, the data anonymization apparatus further includes: an anonymous data obtaining module for recovering the secret-state anonymous data using a threshold scheme of the secret sharing cryptographic mechanism to obtain plaintext anonymous data; and a second federated learning module for performing federated learning of the machine learning model using the plaintext anonymous data.
An embodiment of the present application further provides an electronic device, including: a processor and a memory, the memory storing processor-executable machine-readable instructions, the machine-readable instructions when executed by the processor performing the method as described above.
Embodiments of the present application also provide a computer-readable storage medium having a computer program stored thereon which, when executed by a processor, performs the method described above.
Additional features and advantages of embodiments of the present application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of embodiments of the present application.
Drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings used in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the application and therefore should not be considered limiting of its scope; those skilled in the art may obtain other relevant drawings from these drawings without inventive effort.
Fig. 1 is a schematic flow chart of a data anonymization method provided in an embodiment of the present application;
FIG. 2 is a schematic diagram illustrating processing of plaintext sample data according to an embodiment of the application;
fig. 3 is a schematic structural diagram of a data anonymization apparatus provided in an embodiment of the present application;
fig. 4 is a schematic structural diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
To make the purpose, technical solutions, and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments are described below completely with reference to the drawings in the embodiments. It should be understood that the drawings are for illustrative and descriptive purposes only, are not used to limit the scope of the embodiments, and are not necessarily drawn to scale. The flowcharts used in the embodiments illustrate operations implemented according to some of the embodiments; the operations of a flowchart may be performed out of order, and steps without a logical dependency may be performed in reverse order or simultaneously. In addition, under the guidance of the embodiments, one skilled in the art may add one or more other operations to a flowchart or remove one or more operations from it.
In addition, the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. The components of the embodiments of the present application, as generally described and illustrated in the figures herein, could be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, as presented in the figures, is not intended to limit the scope of the embodiments of the application, as claimed, but is merely representative of selected embodiments of the application.
It is to be understood that "first" and "second" in the embodiments of the present application are used to distinguish similar objects and do not denote any order, quantity, or importance. In the description of the embodiments, the term "and/or" describes an association relationship between associated objects and indicates that three relationships may exist: for example, "A and/or B" may mean that A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates an "or" relationship between the preceding and following objects. The term "plurality" refers to two or more, and similarly "multiple sets" refers to two or more sets.
Before describing the data anonymization method provided in the embodiment of the present application, some concepts related to the embodiment of the present application are described:
the Privacy Set Intersection (PSI), also called security Intersection, also called privacy protection Set Intersection protocol, is a part of the operation process of vertical federal learning, and the PSI protocol allows two parties holding respective sets to jointly calculate the Intersection operation of the two sets. At the end of the protocol interaction, the two parties should get the correct intersection and not get any information in the other party's set outside the intersection.
Vertical federated learning refers to training a federated learning model when the users of the two participants overlap substantially but the user features of the two data sets overlap little. For example: two institutions in the same region, one holding users' consumption records and the other holding users' bank records, have many overlapping users (so private set intersection is needed to find the overlapping users without exposing any set information outside the intersection) but record different data features; the two institutions want to jointly train a stronger federated learning model by encrypting and aggregating the users' different features.
Secure multi-party computation (MPC or SMPC) mainly studies how to securely compute an agreed function without a trusted third party.
Secret sharing, also called secret splitting, is a method of distributing a secret among multiple parties, each of which obtains a part of the secret called a share. The secret can be reconstructed only when a sufficient number of shares are combined; an individual share is of no use on its own.
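A minimal sketch of additive secret sharing, the simplest n-of-n scheme: each share alone is uniformly random and useless, and only the sum of all shares modulo a public modulus reconstructs the secret. The modulus and share count here are illustrative assumptions, not values from the patent.

```python
import secrets

MOD = 2**61 - 1  # public modulus; this particular choice is illustrative

def share(secret_value, n):
    """Split a value into n additive shares that sum to it modulo MOD."""
    shares = [secrets.randbelow(MOD) for _ in range(n - 1)]
    shares.append((secret_value - sum(shares)) % MOD)
    return shares

def reconstruct(shares):
    """Recombine shares; requires ALL n shares (additive sharing is n-of-n)."""
    return sum(shares) % MOD
```

Additive sharing is what arithmetic MPC protocols typically use internally; threshold (t-of-n) reconstruction requires a scheme such as Shamir's instead.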
It should be noted that the data anonymization method provided in the embodiments of the present application may be executed by an electronic device, where an electronic device refers to a device terminal or a server capable of executing a computer program. The device terminal includes, for example: smart phones, personal computers, tablet computers, personal digital assistants, and mobile internet devices. A server refers to a device that provides computing services over a network, for example an x86 server or a non-x86 server, where non-x86 servers include mainframes, minicomputers, and UNIX servers.
Application scenarios to which the data anonymization method is applicable include, but are not limited to, the following: the method is used in the private set intersection step of fully trace-hiding federated learning, explained here in a two-party setting (it can in fact be extended to a multi-party setting). For example, before training data is used to train a model, the data anonymization method can anonymize the training data, reducing the risk that the training data leaks personal privacy and making the training data better comply with privacy regulations as it circulates during federated learning.
Please refer to fig. 1, a schematic flow chart of the data anonymization method provided in the embodiment of the present application. In the method, secret-state equidistant binning is performed on the secret-state sample data obtained by secret sharing of plaintext sample data; the secret-shared data is anonymized by secret-state equidistant binning so that each value is replaced by a weight-of-evidence vector, yielding anonymized secret-state data and reducing the risk of leaking personal privacy. An embodiment of the method may include the following steps:
step S110: and secret sharing is carried out on plaintext sample data to be processed, and secret sample data is obtained.
An embodiment of step S110 is, for example: performing secret sharing on the plaintext sample data to be processed using a secret sharing protocol from any of the three main categories, to obtain secret-state sample data. The three main categories of secret sharing protocols are: threshold secret sharing schemes, general secret sharing schemes for general access structures, and, between the two, secret sharing protocols for special access structures. Taking the threshold category as an example, usable protocols include: the polynomial-based Shamir protocol, the hyperplane-based Blakley protocol, the Mignotte and Asmuth-Bloom protocols based on the Chinese remainder theorem, the Brickell protocol, secret sharing protocols based on matrix projection, the Arithmetic sharing protocol, the Boolean sharing protocol, the Yao sharing protocol, and so on.
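Of the threshold protocols listed, the polynomial-based Shamir scheme is the most common. The following is a hedged sketch of (t, n) Shamir sharing over a prime field: the field modulus is an illustrative choice, and production code would use a vetted library rather than this demonstration.

```python
import random

PRIME = 2**61 - 1  # Mersenne prime used as the field modulus (illustrative)

def shamir_share(secret, t, n, rng=random):
    """Split `secret` into n points on a random degree-(t-1) polynomial."""
    coeffs = [secret] + [rng.randrange(PRIME) for _ in range(t - 1)]
    return [(x, sum(c * pow(x, k, PRIME) for k, c in enumerate(coeffs)) % PRIME)
            for x in range(1, n + 1)]

def shamir_reconstruct(points):
    """Lagrange interpolation at x = 0 recovers the secret from >= t points."""
    secret = 0
    for i, (xi, yi) in enumerate(points):
        num = den = 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (-xj) % PRIME
                den = den * (xi - xj) % PRIME
        # pow(den, PRIME - 2, PRIME) is the modular inverse (Fermat's little theorem)
        secret = (secret + yi * num * pow(den, PRIME - 2, PRIME)) % PRIME
    return secret
```

Any t of the n shares reconstruct the secret, while fewer than t reveal nothing — the property the patent's threshold-based recovery of secret-state anonymous data relies on.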
Step S120: perform secret-state equidistant binning on the secret-state sample data to obtain a plurality of bin ranges.
Step S130: determine a secret-state one-hot matrix corresponding to the secret-state sample data according to the Boolean indicator of whether each secret-state sample falls into each of the bin ranges.
Step S140: determine the secret-state anonymous data corresponding to the plaintext sample data according to the secret-state one-hot matrix.
In this implementation, secret-state equidistant binning is performed on the secret-state sample data obtained by secret sharing of the plaintext sample data, and the secret-state anonymous data corresponding to the plaintext sample data is determined from the binned secret-state one-hot matrix; that is, the secret-shared sample data is anonymized by secret-state equidistant binning so that each value is replaced by a weight-of-evidence vector, and using the resulting secret-state anonymous data effectively reduces the risk of leaking personal privacy.
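A plaintext analogue of steps S120 to S140 (ignoring the secret sharing of step S110) can be sketched as follows: equidistant split points are derived from the minimum and maximum, each value is mapped to a bin, one-hot encoded, and replaced by the dot product of its one-hot vector with the WOE vector. The bin count and WOE values in the test below are made up for illustration.

```python
def equidistant_bins(lo, hi, n_bins):
    """Interior split points dividing [lo, hi] into n_bins equal-width bins."""
    width = (hi - lo) / n_bins
    return [lo + k * width for k in range(1, n_bins)]

def bin_index(x, splits):
    """Index of the bin that x falls into (count of split points it reaches)."""
    return sum(x >= s for s in splits)

def one_hot(idx, n_bins):
    return [1 if i == idx else 0 for i in range(n_bins)]

def anonymize(values, n_bins, woe):
    """Replace each value by the WOE of its bin (dot of one-hot and WOE vector)."""
    lo, hi = min(values), max(values)
    splits = equidistant_bins(lo, hi, n_bins)
    return [sum(h * w for h, w in zip(one_hot(bin_index(v, splits), n_bins), woe))
            for v in values]
```

In the patent, every step here — min/max statistics, comparisons, one-hot conversion, and the dot product — is carried out on secret-shared values, so no party ever sees the plaintext.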
Please refer to fig. 2, a schematic diagram of processing plaintext sample data according to an embodiment of the present application. It will be appreciated that the data anonymization method can be applied in a multi-party scenario; for ease of understanding and explanation, it is set forth here for two parties. For example: the plaintext sample data of the two parties may comprise plaintext sample data of a first party and of a second party, where the first party's plaintext sample data may comprise identification data id, tag data Y, and feature data X_a1, X_a2, and X_a3; similarly, the second party's plaintext sample data may comprise identification data id and feature data X_b1 and X_b2.
as an optional implementation of the step S110, before secret sharing is performed on the plaintext sample data to be processed, an anonymization operation may be performed on Identification (ID) data in the plaintext sample data, which is described by taking a two-party scenario as an example, specifically, for example: the Identification (ID) data in the plaintext sample data is hashed to obtain a hash string, and then the hash string is converted into a numeric string (for example, 124360 in the plaintext data of the first party), so as to obtain the anonymized Identification (ID) data. After the identification data of both parties is subjected to hash calculation, hash character strings of both parties can be obtained, then secret sharing and alignment are performed on the plaintext data of the first party and the plaintext data of the second party according to the hash character strings of both parties, so as to obtain secret state sample data after secret sharing and alignment, specifically, for example: and comparing the hash character strings of the two parties to obtain intersection data (namely positive sample data) and non-intersection data (namely negative sample data) of the two parties, wherein the data records with the same stopted-id in the graph are the intersection data (namely positive sample data), and the data records with the different stopted-id in the graph are the non-intersection data (namely negative sample data). In a specific practical process, an additional sample type field can be added, and the sample type field is used for marking positive sample data and negative sample data.
[Table 1 — table image not reproduced]

TABLE 1

Please refer to Table 1 above, which shows a schematic of the anonymized tag data provided by an embodiment of the present application. Optionally, after secret sharing is performed on the plaintext sample data to be processed, an anonymization operation may also be performed on the feature data and the tag data in the plaintext sample data, specifically for example: anonymizing the feature data and the tag data using a nonlinear standardization method, a nonlinear normalization method, or random encoding, to obtain anonymized feature data. As another example, a formula (image not reproduced in the source) is used to calculate the maximum value of the anonymized feature data, where one symbol represents the feature data before or after anonymization, another represents the true value of the tag data (e.g., 1 or 0), and B2A indicates that the conversion is performed using the Boolean sharing protocol and the Arithmetic sharing protocol. Please refer to Table 2 below, which shows the result of calculating the maximum value of the anonymized feature data provided by an embodiment of the present application.

[Table 2 — table image not reproduced]

TABLE 2

It will be appreciated that a corresponding formula (image not reproduced) may also be used to calculate the minimum value of the anonymized feature data and to process the tag data to obtain anonymized tag data, where one symbol represents the feature data before or after anonymization, another represents the tag value of the tag data before or after anonymization, another represents a binary value of the tag data (e.g., 1 or 0), and B2A indicates that the conversion is performed using the Boolean sharing protocol and the Arithmetic sharing protocol.

[Table 3 — table image not reproduced]

TABLE 3
Please refer to Table 3 above, which shows the result of counting the minimum value in the secret-state sample data provided by an embodiment of the present application. As an optional implementation of step S120, when performing secret-state equidistant binning on the secret-state sample data, the sample interval may first be determined and then equidistantly binned in the secret state; this implementation may include:
step S121: and counting the maximum intersection value and the minimum intersection value in the dense-state sample data, and determining the sample interval of the dense-state sample data according to the maximum intersection value and the minimum intersection value.
Optionally, the intersection maximum and intersection minimum in the secret-state sample data may be counted over the intersection portion of the feature data only, to reduce the influence of non-intersection feature data on the result; for example, the non-intersection feature data may be replaced by a fixed value — all of it may be set to -9999 when counting the intersection maximum — and the fixed value may of course be adjusted to the specific data. In this implementation, replacing the non-intersection feature data with a fixed value means that even if the replaced data is recovered to plaintext, it cannot be traced to a specific natural person, and the size of the intersection is not leaked, effectively improving data security.
Step S122: and carrying out dense equidistant binning on the sample interval of the dense sample data according to the preset binning quantity to obtain a plurality of binning ranges.
Figure SMS_12
TABLE 4
Please refer to Table 4 above, which shows the split points and binning ranges after equidistant binning, as provided by an embodiment of the present application. An embodiment of the above steps S121 to S122 is, for example: using an executable program compiled or interpreted in a preset programming language, counting the intersection maximum value and the intersection minimum value in the dense-state sample data, and determining the sample interval of the dense-state sample data from these two values; then performing dense-state equidistant binning on the sample interval of the dense-state sample data according to a preset binning quantity, obtaining a plurality of split points and a plurality of binning ranges. Programming languages that may be used include, for example: C, C++, Java, BASIC, JavaScript, LISP, Shell, Perl, Ruby, Python, and PHP.
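Steps S121 to S122 can be sketched in the clear as follows; under secret sharing the same arithmetic would run on shares, and the bin count and interval used here are illustrative assumptions:

```python
def equidistant_split_points(interval_min, interval_max, n_bins):
    """Return the n_bins + 1 split points of equal-width bins on the interval."""
    width = (interval_max - interval_min) / n_bins
    return [interval_min + i * width for i in range(n_bins + 1)]

def binning_ranges(split_points):
    """Pair consecutive split points into (left, right) binning ranges."""
    return list(zip(split_points[:-1], split_points[1:]))

pts = equidistant_split_points(0.0, 10.0, 5)
ranges = binning_ranges(pts)  # [(0.0, 2.0), (2.0, 4.0), ...]
```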
Figure SMS_13
TABLE 5
Please refer to Table 5 above, which shows a dense-state one-hot matrix provided by an embodiment of the present application. As an optional implementation of step S130, when determining the dense-state one-hot matrix corresponding to the dense-state sample data, the Boolean matrix corresponding to the dense-state sample data may be converted into the dense-state one-hot matrix. This implementation may include:
Step S131: obtaining, from the Boolean states of whether the dense-state sample data falls into each of the plurality of binning ranges, a Boolean matrix corresponding to the dense-state sample data.
Step S132: and converting the Boolean matrix into a dense state one-hot matrix.
The above steps S131 to S132 may be implemented, for example, using the formula
(Q >= left) & (Q < right)
For each sample datum Q of the dense-state sample data: if Q falls into a binning range, the Boolean state for that range is assigned 1; if it does not, the Boolean state is assigned 0. Applying this to every sample datum of the dense-state sample data yields the Boolean matrix corresponding to the dense-state sample data. Here, Q denotes a sample datum of the dense-state sample data, left denotes the minimum (leftmost) value of the binning range, right denotes the maximum (rightmost) value of the binning range, and & denotes the logical symbol requiring both conditions to be satisfied. Then, the Boolean matrix is converted into a dense-state one-hot matrix using an executable program compiled or interpreted in a preset programming language such as C, C++, Java, BASIC, JavaScript, LISP, Shell, Perl, Ruby, Python, or PHP.
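Steps S131 to S132 can be illustrated in plaintext as below; closing the right edge of the last bin, so that the interval maximum is not dropped, is an assumption of this sketch rather than something the patent specifies:

```python
def one_hot_matrix(samples, ranges):
    """Build the 0-1 matrix: row i marks the bin that sample i falls into,
    using the condition (Q >= left) & (Q < right)."""
    matrix = []
    for q in samples:
        row = []
        for j, (left, right) in enumerate(ranges):
            is_last = (j == len(ranges) - 1)
            # The last bin also accepts Q == right.
            hit = q >= left and (q < right or (is_last and q <= right))
            row.append(1 if hit else 0)
        matrix.append(row)
    return matrix

m = one_hot_matrix([1.0, 5.0, 10.0], [(0.0, 2.0), (2.0, 6.0), (6.0, 10.0)])
# -> [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
```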
As an optional implementation of the foregoing step S140, when determining the secret-state anonymous data corresponding to the plaintext sample data according to the secret-state one-hot matrix, the secret-state anonymous data may be determined by using a Weight of Evidence (WOE) vector. This implementation may include:
step S141: and calculating the evidence weight of each box-dividing range in the plurality of box-dividing ranges to obtain an evidence weight vector.
An embodiment of the above step S141 is, for example: counting the positive sample size (which may be denoted Good_T) and the negative sample size (which may be denoted Bad_T) of the entire tag data, and calculating, through the dense-state one-hot matrix, the positive sample size (which may be denoted Good_i) and the negative sample size (which may be denoted Bad_i) of each binning range; then, using the formula
WOE_i = ln( (Good_i / Good_T) / (Bad_i / Bad_T) )
the evidence weight (WOE) of each of the plurality of binning ranges is calculated, giving the evidence weight vector. Here, WOE_i denotes the evidence weight of the i-th binning range, Good_T denotes the positive sample size of the entire tag data, Bad_T denotes the negative sample size of the entire tag data, Good_i denotes the positive sample size of the i-th binning range, and Bad_i denotes the negative sample size of the i-th binning range.
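In plaintext, the evidence weight vector of step S141 can be computed as below; the small epsilon guarding empty bins is an assumption of this sketch, not part of the patent's formula:

```python
import math

def woe_vector(good_per_bin, bad_per_bin, good_total, bad_total, eps=1e-9):
    """WOE_i = ln((Good_i / Good_T) / (Bad_i / Bad_T)) for every bin."""
    return [math.log(((g + eps) / good_total) / ((b + eps) / bad_total))
            for g, b in zip(good_per_bin, bad_per_bin)]

# Two bins; 50 positive and 50 negative samples in total.
woe = woe_vector([30, 20], [10, 40], good_total=50, bad_total=50)
# woe[0] is about ln(3) > 0 (mostly-positive bin); woe[1] is about ln(0.5) < 0
```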
Optionally, the following formula may be used
IV = sum_i ( Good_i / Good_T - Bad_i / Bad_T ) * WOE_i
to calculate the Information Value (IV), so as to screen the feature data by their contribution to the Y tag data. For example: for each feature datum in the total feature data, when the IV value of the feature datum is smaller than an information threshold, the feature datum is deleted so that it does not participate in federated learning; conversely, when the IV value of the feature datum is greater than or equal to the information threshold, the feature datum participates in federated learning.
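A plaintext sketch of the IV screening; the 0.02 threshold is a common rule of thumb assumed here, not a value fixed by the patent:

```python
import math

def information_value(good_per_bin, bad_per_bin, good_total, bad_total, woe):
    """IV = sum over bins of (Good_i / Good_T - Bad_i / Bad_T) * WOE_i."""
    return sum((g / good_total - b / bad_total) * w
               for g, b, w in zip(good_per_bin, bad_per_bin, woe))

def participates_in_federated_learning(iv, information_threshold=0.02):
    # A feature is deleted (does not participate) when its IV falls
    # below the information threshold.
    return iv >= information_threshold

woe = [math.log(3.0), math.log(0.5)]  # evidence weights from the WOE step
iv = information_value([30, 20], [10, 40], 50, 50, woe)  # about 0.717
```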
Step S142: and determining the secret state anonymous data corresponding to the plaintext sample data according to the secret state unique hot matrix and the evidence weight vector.
An embodiment of the above step S142 is, for example: multiplying each value in the secret-state one-hot matrix by the evidence weight (i.e. the WOE vector value) corresponding to its binning range, and summing by rows, obtaining the secret-state anonymous data corresponding to the sample data. This calculation can be expressed by the formula RESULT_WOE = (matrix_0-1 * WOE_i).sum(axis=1), where RESULT_WOE denotes the secret-state anonymous data corresponding to the sample data, matrix_0-1 denotes the dense-state one-hot matrix, and sum(axis=1) denotes summing by rows.
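Step S142 in plaintext is a row-wise weighted sum; because each row of the one-hot matrix contains a single 1, each sample is simply replaced by the WOE of its bin. The values below are illustrative:

```python
def woe_encode(one_hot, woe):
    """RESULT_WOE = (matrix_0-1 * WOE).sum(axis=1), written without NumPy."""
    return [sum(cell * w for cell, w in zip(row, woe)) for row in one_hot]

encoded = woe_encode([[1, 0, 0], [0, 0, 1]], [0.4, -0.1, 1.2])
# -> [0.4, 1.2]
```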
As an optional implementation of the above data anonymization method, after determining the secret anonymous data corresponding to the plaintext sample data according to the secret one-hot matrix, the secret anonymous data may be further used for federal learning, and the implementation may include:
step S150: the machine learning model is federally learned using confidential anonymous data.
As an alternative embodiment of the step S150, when performing federated learning on the machine learning model by using the confidential anonymous data, the confidential anonymous data and the confidential label data may also be used for federated learning, and this embodiment may include:
step S151: and secret sharing is carried out on the sample label to be processed to obtain secret state label data, wherein the sample label is a category label of the sample data.
The implementation principle and manner of step S151 are similar to those of step S110 and are therefore not repeated here; where anything is unclear, reference may be made to the description of step S110.
Step S152: the machine learning model is federally learned using the confidential anonymous data and the confidential label data.
An embodiment of the above step S152 is, for example: if the first party possesses the confidential tag data, then after the first party obtains the confidential anonymous data of the second party, the first party may perform federated learning on its locally stored machine learning model using the second party's confidential anonymous data. Similarly, if the second party does not possess the confidential tag data, then after the second party obtains the confidential anonymous data and the confidential tag data of the first party, the second party may perform federated learning on its locally stored machine learning model using that data.
As an optional implementation of the above data anonymization method, after determining the secret state anonymous data corresponding to the plaintext sample data according to the secret state one-hot matrix, the plaintext anonymous data may be further used for federal learning, and the implementation may include:
step S160: and recovering the secret anonymous data by using a threshold scheme in a secret shared password mechanism to obtain plaintext anonymous data.
Step S170: the machine learning model is federally learned using plaintext anonymous data.
An embodiment of the above steps S160 to S170 is, for example: the first party recovers the secret anonymous data using a threshold scheme (e.g. the Shamir scheme) in the secret-sharing cryptographic mechanism, obtaining the plaintext anonymous data. The first party may then use the plaintext anonymous data to perform federated learning on the machine learning model stored locally by the first party, obtaining a machine learning model trained through federated learning.
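Recovery under a (t, n) threshold scheme can be sketched as Lagrange interpolation at x = 0 over a prime field. The patent only names the threshold scheme; the prime and the example shares below are illustrative assumptions (an actual Shamir deployment fixes its own field and share format):

```python
PRIME = 2_147_483_647  # 2**31 - 1, a Mersenne prime chosen here for illustration

def shamir_recover(shares):
    """Recover the secret f(0) from t shares (x_i, y_i) of a degree-(t-1)
    polynomial f over GF(PRIME), by Lagrange interpolation at x = 0."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME        # product of (0 - x_j)
                den = (den * (xi - xj)) % PRIME  # product of (x_i - x_j)
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret

# Shares of f(x) = 42 + 7x (threshold t = 2) evaluated at x = 1 and x = 2.
recovered = shamir_recover([(1, 49), (2, 56)])  # -> 42
```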
Please refer to fig. 3, which is a schematic structural diagram of a data anonymization apparatus provided in the embodiment of the present application; the embodiment of the present application provides a data anonymization apparatus 200, including:
the sample data obtaining module 210 is configured to perform secret sharing on plaintext sample data to be processed to obtain secret sample data.
And a binning range obtaining module 220, configured to perform dense state equidistant binning on the dense state sample data to obtain multiple binning ranges.
The unique hot matrix determining module 230 is configured to determine a dense state unique hot matrix corresponding to the dense state sample data according to the boolean state of the dense state sample data falling into each of the multiple bin ranges.
And the anonymous data determining module 240 is configured to determine the secret state anonymous data corresponding to the plaintext sample data according to the secret state unique hot matrix.
Optionally, in an embodiment of the present application, the binning range obtaining module includes:
and the sample interval determining submodule is used for counting the maximum intersection value and the minimum intersection value in the dense-state sample data and determining the sample interval of the dense-state sample data according to the maximum intersection value and the minimum intersection value.
And the sample interval box-dividing submodule is used for carrying out dense state equidistant box-dividing on the sample intervals of the dense state sample data according to the preset box-dividing quantity.
Optionally, in an embodiment of the present application, the unique hot matrix determining module includes:
and the Boolean matrix obtaining submodule is used for obtaining the Boolean state of the dense state sample data falling into a plurality of box ranges and obtaining the Boolean matrix corresponding to the dense state sample data.
And the Boolean matrix conversion submodule is used for converting the Boolean matrix into a dense-state one-hot matrix.
Optionally, in an embodiment of the present application, the anonymous data determining module includes:
and the weight vector obtaining submodule is used for calculating the evidence weight of each box-separating range in the plurality of box-separating ranges to obtain an evidence weight vector.
And the secret state data determining submodule is used for determining the secret state anonymous data corresponding to the plaintext sample data according to the secret state unique heat matrix and the evidence weight vector.
Optionally, in an embodiment of the present application, the data anonymization apparatus further includes:
the first federated learning module is used for carrying out federated learning on the machine learning model by using the confidential anonymous data.
Optionally, in an embodiment of the present application, the first federal learning module includes:
and the label data obtaining submodule is used for carrying out secret sharing on the sample label to be processed to obtain secret label data, wherein the sample label is a type label of the sample data.
And the model federation learning submodule is used for carrying out federation learning on the machine learning model by using the secret anonymous data and the secret tag data.
Optionally, in an embodiment of the present application, the data anonymization apparatus further includes:
and the anonymous data acquisition module is used for recovering the secret anonymous data by using a threshold scheme in a secret shared password mechanism to acquire plaintext anonymous data.
And the second federated learning module is used for carrying out federated learning on the machine learning model by using plaintext anonymous data.
It should be understood that this apparatus corresponds to the data anonymization method embodiment above and can perform the steps of that method embodiment; for the specific functions of the apparatus, refer to the description above, and a detailed description is omitted here as appropriate. The apparatus includes at least one software functional module that can be stored in a memory in the form of software or firmware, or solidified in the operating system (OS) of the device.
Please refer to fig. 4, which illustrates a schematic structural diagram of an electronic device provided in an embodiment of the present application. An electronic device 300 provided in an embodiment of the present application includes: a processor 310 and a memory 320, the memory 320 storing machine readable instructions executable by the processor 310, the machine readable instructions when executed by the processor 310 performing the method as above.
Embodiments of the present application further provide a computer-readable storage medium 330, where the computer-readable storage medium 330 stores a computer program, and the computer program is executed by the processor 310 to perform the above method. The computer-readable storage medium 330 may be implemented by any type of volatile or nonvolatile storage device or combination thereof, such as a Static Random Access Memory (SRAM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), an Erasable Programmable Read-Only Memory (EPROM), a Programmable Read-Only Memory (PROM), a Read-Only Memory (ROM), a magnetic Memory, a flash Memory, a magnetic disk, or an optical disk.
It should be noted that, in the present specification, the embodiments are all described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts among the embodiments may be referred to one another. For the apparatus embodiment, since it is basically similar to the method embodiment, its description is relatively simple; for relevant points, reference may be made to the corresponding parts of the method embodiment.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative and, for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
In addition, functional modules of the embodiments in the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part. Furthermore, in the description of the present specification, reference to the description of "one embodiment," "some embodiments," "an example," "a specific example," or "some examples" or the like means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the embodiments of the present application. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Moreover, various embodiments or examples and features of various embodiments or examples described in this specification can be combined and combined by one skilled in the art without being mutually inconsistent.
The above description is only an alternative embodiment of the embodiments of the present application, but the scope of the embodiments of the present application is not limited thereto, and any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the embodiments of the present application, and all the changes or substitutions should be covered by the scope of the embodiments of the present application.

Claims (10)

1. A method of data anonymization, comprising:
secret sharing is carried out on plaintext sample data to be processed, and secret sample data is obtained;
carrying out dense state equidistant binning on the dense state sample data to obtain a plurality of binning ranges;
determining a dense-state unique hot matrix corresponding to the dense-state sample data according to the Boolean state of the dense-state sample data falling into each of the plurality of box ranges;
and determining the secret state anonymous data corresponding to the plaintext sample data according to the secret state unique hot matrix.
2. The method of claim 1, wherein the performing dense-state equidistant binning on the dense-state sample data comprises:
counting the intersection maximum value and the intersection minimum value in the dense-state sample data, and determining a sample interval of the dense-state sample data according to the intersection maximum value and the intersection minimum value;
and carrying out dense state equidistant binning on the sample interval of the dense state sample data according to the preset binning quantity.
3. The method according to claim 1, wherein the determining, according to the boolean state of the dense state sample data falling into each of the plurality of bin ranges, a dense state one-hot matrix corresponding to the dense state sample data comprises:
obtaining a Boolean matrix corresponding to the dense state sample data for the Boolean state in which the dense state sample data falls into the plurality of box separation ranges;
converting the Boolean matrix into the dense one-hot matrix.
4. The method according to claim 1, wherein the determining, according to the secret-state-one-hot matrix, secret-state anonymous data corresponding to the plaintext sample data comprises:
calculating the evidence weight of each box-dividing range in the plurality of box-dividing ranges to obtain an evidence weight vector;
and determining the secret state anonymous data corresponding to the plaintext sample data according to the secret state unique heat matrix and the evidence weight vector.
5. The method according to claim 1, further comprising, after determining the secret-state anonymous data corresponding to the plaintext sample data according to the secret-state one-hot matrix, the method further comprising:
and using the confidential anonymous data to carry out federal learning on a machine learning model.
6. The method of claim 5, wherein the using the confidential anonymous data for federated learning of a machine learning model comprises:
secret sharing is carried out on a sample tag to be processed to obtain secret state tag data, wherein the sample tag is a category tag of the sample data;
performing federated learning on a machine learning model using the dense anonymous data and the dense tag data.
7. The method according to any one of claims 1 to 6, further comprising, after determining the secret-state anonymous data corresponding to the plaintext sample data according to the secret-state one-hot matrix, the following:
recovering the secret-state anonymous data by using a threshold scheme in a secret-sharing cryptographic mechanism to obtain plaintext anonymous data;
performing federated learning on a machine learning model using the plaintext anonymous data.
8. An apparatus for anonymizing data, comprising:
the sample data acquisition module is used for carrying out secret sharing on plaintext sample data to be processed to obtain secret sample data;
a box separation range obtaining module, configured to perform dense-state equidistant box separation on the dense-state sample data to obtain multiple box separation ranges;
the unique hot matrix determining module is used for determining a dense state unique hot matrix corresponding to the dense state sample data according to the Boolean state of the dense state sample data falling into each of the plurality of box ranges;
and the anonymous data determining module is used for determining the secret state anonymous data corresponding to the plaintext sample data according to the secret state unique hot matrix.
9. An electronic device, comprising: a processor and a memory, the memory storing machine-readable instructions executable by the processor, the machine-readable instructions, when executed by the processor, performing the method of any of claims 1 to 7.
10. A computer-readable storage medium, having stored thereon a computer program which, when being executed by a processor, is adapted to carry out the method of any one of claims 1 to 7.
CN202310258354.5A 2023-03-17 2023-03-17 Data anonymization method and device, electronic equipment and storage medium Active CN115982779B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310258354.5A CN115982779B (en) 2023-03-17 2023-03-17 Data anonymization method and device, electronic equipment and storage medium


Publications (2)

Publication Number Publication Date
CN115982779A true CN115982779A (en) 2023-04-18
CN115982779B CN115982779B (en) 2023-05-23

Family

ID=85968485

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310258354.5A Active CN115982779B (en) 2023-03-17 2023-03-17 Data anonymization method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115982779B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021184347A1 (en) * 2020-03-20 2021-09-23 云图技术有限公司 Data processing method and apparatus for realizing privacy protection
CN115049070A (en) * 2022-06-06 2022-09-13 京东科技控股股份有限公司 Screening method and device of federal characteristic engineering data, equipment and storage medium
CN115392480A (en) * 2022-08-05 2022-11-25 北京富算科技有限公司 Training method, system, equipment and medium for safety traffic and federal learning model
CN115438370A (en) * 2022-08-05 2022-12-06 北京富算科技有限公司 Training method, equipment and storage medium of full-hidden Federal learning model
CN115564447A (en) * 2022-09-30 2023-01-03 中国银行股份有限公司 Credit card transaction risk detection method and device
CN115587382A (en) * 2022-12-14 2023-01-10 富算科技(上海)有限公司 Fully-secret-state data processing method, device, equipment and medium
CN115730182A (en) * 2022-11-17 2023-03-03 天翼电子商务有限公司 Approximate calculation method for inverse matrix under anonymized fragment data


Also Published As

Publication number Publication date
CN115982779B (en) 2023-05-23


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant