CN115829048A - Data inspection method for longitudinal federated learning

Data inspection method for longitudinal federated learning

Info

Publication number
CN115829048A
CN115829048A (application CN202111085459.2A)
Authority
CN
China
Prior art keywords
data
probability
batch
party
occurrence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111085459.2A
Other languages
Chinese (zh)
Inventor
杨诗友
章枝宪
李鑫超
严梦嘉
尹虹舒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Telecom Corp Ltd
Original Assignee
China Telecom Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Telecom Corp Ltd filed Critical China Telecom Corp Ltd
Priority to CN202111085459.2A priority Critical patent/CN115829048A/en
Priority to PCT/CN2022/115465 priority patent/WO2023040640A1/en
Publication of CN115829048A publication Critical patent/CN115829048A/en
Pending legal-status Critical Current

Classifications

    • G  PHYSICS
    • G06  COMPUTING; CALCULATING OR COUNTING
    • G06F  ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00  Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/60  Protecting data
    • G  PHYSICS
    • G06  COMPUTING; CALCULATING OR COUNTING
    • G06N  COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00  Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The disclosure relates to a data inspection method for longitudinal federated learning, comprising the following steps: storing probability distribution characteristics of each feature of training data, obtained based on a batch of valid training data; setting a normal occurrence probability threshold ε of the training data according to the probability distribution characteristics of all features of the training data; calculating the occurrence probability P(x) that a piece of data x appears under the same distribution as the batch of valid training data; and comparing the occurrence probability P(x) with the normal occurrence probability threshold ε, and if P(x) is smaller than ε, judging the piece of data to be abnormal data, otherwise judging it to be normal data.

Description

Data inspection method for longitudinal federated learning
Technical Field
The present invention relates to federated learning, and more particularly to longitudinal (vertical) federated learning.
Background
As federated learning and secure multi-party computation mature into mainstream technologies for secure data sharing, new problems arise. Model contamination and malicious data theft by data participants are among the most important of these. Existing data inspection and protection methods are only suitable for horizontal federated learning scenarios, cannot be applied to vertical (longitudinal) federated learning scenarios, and tend to require large amounts of computation, consuming resources and sacrificing performance.
Disclosure of Invention
The following presents a simplified summary of the disclosure in order to provide a basic understanding of some aspects of the disclosure. However, it should be understood that this summary is not an exhaustive overview of the disclosure. It is not intended to identify key or critical elements of the disclosure or to delineate the scope of the disclosure. Its sole purpose is to present some concepts of the disclosure in a simplified form as a prelude to the more detailed description that is presented later.
The invention provides a data inspection method for longitudinal federated learning, comprising the following steps:
storing probability distribution characteristics of each feature of training data, obtained based on a batch of valid training data;
setting a normal occurrence probability threshold ε of the training data according to the probability distribution characteristics of all features of the training data;
calculating the occurrence probability P(x) that a piece of data x appears under the same distribution as the batch of valid training data; and
comparing the occurrence probability P(x) with the normal occurrence probability threshold ε, and if P(x) is smaller than ε, judging the piece of data to be abnormal data, otherwise judging the piece of data to be normal data.
Other features of the present invention and its advantages will become apparent from the following detailed description of the preferred embodiments of the invention.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description, serve to explain the principles of the disclosure.
The present disclosure may be more clearly understood from the following detailed description with reference to the accompanying drawings, in which:
FIG. 1 shows a schematic block diagram of the federated learning system of the present invention.
FIG. 2 shows a specific example of a characteristic normal distribution.
FIG. 3 shows a flow chart of a longitudinal federated learning data verification method in accordance with the present invention.
FIG. 4 illustrates an exemplary configuration of a computing device capable of implementing embodiments in accordance with the present disclosure.
Detailed Description
The following detailed description is made with reference to the accompanying drawings and is provided to assist in a comprehensive understanding of various exemplary embodiments of the disclosure. The following description includes various details to aid understanding, but these details are to be regarded as examples only and are not intended to limit the disclosure, which is defined by the appended claims and their equivalents. The words and phrases used in the following description are used only to provide a clear and consistent understanding of the disclosure. In addition, descriptions of well-known structures, functions, and configurations may be omitted for clarity and conciseness. Those of ordinary skill in the art will recognize that various changes and modifications of the examples described herein can be made without departing from the spirit and scope of the disclosure.
FIG. 1 shows a schematic block diagram of the federated learning system of the present invention.
In some embodiments, the federated learning system includes two data participants, however, the data participants are not limited to two and there may be more data participants.
For clarity of description, definitions used subsequently in the description are as follows:
Data participant: a party that provides its own data to participate in the federated learning. Data participants include data providers (which only provide data) and data users (which provide data and also use the model for prediction). There may be more than one data user, but each model prediction task has only one initiator.
Coordinator: a common role in the field of federated learning, introduced mainly to prevent data or privacy leakage; it typically participates in encryption and decryption (e.g., key distribution) and in the aggregation of intermediate results. The coordinator is generally a non-data participant, i.e., a federated learning participant that takes part in the computation but does not provide data, although the industry also refers to a data participant that performs the result aggregation as concurrently acting as coordinator.
Final data occurrence probability calculator: to distinguish it from the coordinator, the final data occurrence probability calculator is defined as the federated learning participant that aggregates the intermediate results received from the other data participants and finally obtains the complete data occurrence probability and the normal occurrence probability threshold. It may be one of the data participants, or a non-data participant, i.e., a coordinator or data inspector that takes part in the computation but does not provide data.
Data inspector: the federated learning participant that compares the data occurrence probability with the normal occurrence probability threshold and performs the data inspection; it is also the party that sets the normal occurrence probability threshold. The data inspector is usually the final data occurrence probability calculator, but the two roles need not coincide (for example, after a coordinator that is a non-data participant, acting as final calculator, obtains the complete data occurrence probability, it may transmit that probability to another data participant, which then performs the inspection and is the data inspector).
In some embodiments, as shown in FIG. 1(a), the federated learning system includes only party A and party B, which provide the data for federated learning and obtain the final calculation result through the exchange and fusion of intermediate calculation results. In other embodiments, as shown in FIG. 1(b), the federated learning system may contain, in addition to the data-providing parties A and B, a third party that participates in the calculation but does not provide data: coordinator C. In the case of three or more data participants, there are several data participants similar to party A or party B.
For clarity and brevity, the following description takes as an example the case where the federated learning system includes two data participants (party A and party B).
The following describes the phase (preparation phase) of training the model (modeling) using a batch of valid training data.
In the stage of training the model (modeling), modeling is performed using a batch of valid training data. The training data used for modeling must be valid; otherwise the model cannot be built correctly.
A batch of valid training data is first prepared. From this valid training data, the distribution-related characteristics of the training data can be obtained, and the normal occurrence probability threshold can be set according to the distribution of the features. This threshold is used for subsequent data inspection.
In some embodiments, each data participant locally computes and stores the distribution-related features of its portion of the training data, and the normal occurrence probability threshold is set according to the distribution of the features for use in subsequent data inspection. Computing the complete normal occurrence probability threshold requires the transmission of intermediate calculation results, and the final data occurrence probability calculator computes the complete threshold. The data inspector then obtains the complete threshold from the final data occurrence probability calculator, stores it, and sets its normal occurrence probability threshold accordingly.
In some embodiments, the final data occurrence probability calculator also acts as the data inspector (e.g., in the two-party case, party B, which is not the model prediction initiator). In other embodiments, the final data occurrence probability calculator is coordinator C.
In some embodiments, during the stage of training the model (modeling) with the first batch of data, a data participant (e.g., party A) that is not the final data occurrence probability calculator may perform the following operations: locally compute and store the distribution-related characteristics of each of its features; compute the intermediate calculation result required for setting the normal occurrence probability threshold ε; and send that intermediate result to the next data participant that needs to compute, or to the final data occurrence probability calculator (depending on how the complete result is aggregated).
In some embodiments, during the stage of training the model (modeling) with the first batch of data, a data participant that is not the final data occurrence probability calculator may perform the following operations: locally compute and store the distribution-related characteristics of each of its features; receive from another data participant the intermediate calculation result required for setting the normal occurrence probability threshold ε; update that intermediate result with its own data to obtain a new intermediate result; and send the new intermediate result to another data participant or to the final data occurrence probability calculator (depending on how the complete result is aggregated).
In some embodiments, during the stage of training the model (modeling) with the first batch of data, a data participant that doubles as the final data occurrence probability calculator (e.g., party B) may perform the following operations: locally compute and store the distribution-related characteristics of each of its features; receive from one or more other data participants the intermediate calculation result required for setting the normal occurrence probability threshold ε; and, on the basis of that intermediate result and its own data, compute the complete normal occurrence probability threshold ε.
In some embodiments, during the stage of training the model (modeling) with the first batch of data, a final data occurrence probability calculator that is not a data participant (e.g., coordinator C, which provides no data) may perform the following operations: receive from one or more other data participants the intermediate calculation results required for setting the normal occurrence probability threshold ε, and aggregate all the intermediate results to compute the complete normal occurrence probability threshold ε.
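As a purely illustrative sketch of how these preparation-phase roles could fit together in the two-party case (the message passing, the absence of encryption, and the form of the intermediate result as a running product of per-feature reference probabilities are assumptions made for illustration, not requirements of the patent; all function names are hypothetical):

```python
def party_a_prepare(local_reference_probs):
    """Party A (not the final calculator): store its distribution characteristics
    locally, then emit the intermediate result needed for the threshold, here taken
    to be the product of its per-feature reference probabilities."""
    intermediate = 1.0
    for p_ref in local_reference_probs:
        intermediate *= p_ref
    return intermediate  # sent to the next participant or to the final calculator

def final_calculator_prepare(intermediate_from_others, local_reference_probs):
    """Final data occurrence probability calculator (e.g. party B, or coordinator C
    with no local references): fold its own references into the received intermediate
    result to obtain the complete normal occurrence probability threshold epsilon."""
    eps = intermediate_from_others
    for p_ref in local_reference_probs:
        eps *= p_ref
    return eps  # forwarded to the data inspector if that is a different party
```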
In some embodiments, the final data occurrence probability calculator doubles as the data inspector (e.g., party B), and the data inspector can directly set the computed normal occurrence probability threshold ε as its data inspection criterion.
In some embodiments, the final data occurrence probability calculator is not the data inspector; the final calculator must then transmit the computed normal occurrence probability threshold ε to the data inspector, so that the data inspector can set ε and perform data inspection in the subsequent model prediction stage.
The following specifically illustrates how the distribution-related characteristics of each feature are calculated.
The definition and estimation of the feature distributions of the first batch of training data covers the following cases.
a) Features with a known distribution or an existing distribution hypothesis (class a): some features follow common distributions known from past experience; for these, only the parameters required by the hypothesized distribution need to be stored, and the occurrence probability is computed directly from that distribution (for example, if a feature is known to take the values yes/no with equal probability, the occurrence probability of either value in the data is 0.5 and the probability of any other value is 0). Let the joint occurrence probability of all such features be P(x_a) = the product of the occurrence probabilities of all class-a features.
b) Discrete-variable features with an unknown or uncommon distribution hypothesis (class b): in most cases a discrete variable takes only a few values, and when the data volume is large, by the law of large numbers the frequency of each value in the first batch of training data can be used directly as the assumed distribution of the feature, with one probability per value, e.g. P(x=0)=0.3, P(x=1)=0.5, P(x=2)=0.2. Let the joint occurrence probability of all such features be P(x_b) = the product of the occurrence probabilities of all class-b features.
c) Continuous-variable features, and other features, with an unknown or uncommon distribution hypothesis (class c): in most cases, when the data volume is large, by the central limit theorem and the law of large numbers the random variable can be assumed to approximately follow a normal (Gaussian) distribution, so its expectation/mean and variance can be used for distribution/density estimation. Let the joint occurrence probability of all such features be P(x_c) = the product of the occurrence probabilities of all class-c features.
Normal distribution: x ~ N(μ, σ²).
Assuming the first batch of training data contains n records, the estimates of μ_j and σ_j² for each feature j are computed as:
μ_j = (1/n)·Σ_(i=1)^n x_j^(i),   σ_j² = (1/n)·Σ_(i=1)^n (x_j^(i) - μ_j)²
Assuming there are m continuous-variable features with an unknown or uncommon distribution hypothesis, the joint occurrence probability P(x_c) of these m features (assuming the features are mutually independent) is computed from the normal density as:
P(x_c) = Π_(j=1)^m [1/(√(2π)·σ_j)]·exp(-(x_j - μ_j)²/(2σ_j²))
the A and B parties calculate P (x) separately a )·P(x b )·P(x c ) Let the result obtained by party A be P (x) A ) The result obtained by the B party is P (x) B )。
The following specifically illustrates how the normal occurrence probability threshold ε is set.
For example, if a feature is normally distributed, about 99.7% of the points fall within 3 standard deviations of the mean, so the probability corresponding to plus or minus 3 standard deviations (or a slightly lower value) can be used as the reference probability for that feature. FIG. 2 shows a specific example of a normally distributed feature.
For example, if a feature is a discrete variable, the probability of its least frequent value class can be used as the reference probability for that feature.
As a specific example, suppose that among all features there are m normally distributed features, each with a minimum normal probability reference of 0.004, and 1 discrete-variable feature with three value classes distributed as P(x=0)=0.3, P(x=1)=0.5, P(x=2)=0.2, so that its lowest probability 0.2 is chosen as the reference. The normal occurrence probability threshold set for each piece of data is then ε = 0.2·(0.004)^m.
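A minimal sketch of this threshold-setting rule, assuming the per-feature references described above (a Gaussian reference of 0.004 per normally distributed feature and the lowest value-class probability per discrete feature); the function name and arguments are hypothetical:

```python
def set_normal_occurrence_threshold(m_gaussian_features, discrete_min_probs,
                                    gaussian_reference=0.004):
    """epsilon = (gaussian reference)^m times the lowest value-class probability of
    each discrete feature, matching eps = 0.2 * 0.004**m in the worked example."""
    eps = gaussian_reference ** m_gaussian_features
    for p_min in discrete_min_probs:
        eps *= p_min
    return eps

# Worked example from the text: m Gaussian features and one discrete feature
# whose value probabilities are {0.3, 0.5, 0.2}, so its reference is 0.2.
eps = set_normal_occurrence_threshold(m_gaussian_features=3, discrete_min_probs=[0.2])
```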
The present invention is not limited to the examples specifically illustrated; for instance, a feature may follow a probability distribution other than those described above, and the specific manner of setting the normal occurrence probability threshold ε may be varied within the scope of the present invention.
The complete data inspection method for longitudinal federated learning according to the present invention is described below by way of example.
FIG. 3 shows a flow chart of a longitudinal federated learning data verification method in accordance with the present invention.
In some embodiments, the complete data inspection method for longitudinal federated learning according to the present invention includes the following steps.
The probability distribution characteristics of each feature of the training data, obtained from a batch of valid training data, are stored. For example, the distribution of a feature of the training data may be a normal distribution, a discrete-variable distribution, or another type of distribution. Accordingly, data characterizing the probability distributions of the features of the training data is stored at, for example, each participant of the longitudinal federated learning.
For example, the training data contains a data portion provided by data provider A and a data portion provided by data provider B. The training data is aligned by encrypted sample alignment to form a virtual fused data set.
A normal occurrence probability threshold ε of the training data is set according to the probability distribution characteristics of all features of the training data. A specific example of setting ε is given in the preparation phase described in detail above and is not repeated here.
In some embodiments, the part of the model held by party A and the part held by party B communicate with each other for federated model training (including encryption and decryption).
During the stage of training the model with the first batch of data, party A may store the statistics needed for each feature's distribution/density estimate (e.g., mean μ and variance σ²), and party B may likewise store the statistics needed for each of its features (e.g., mean μ and variance σ²).
The step of storing the probability distribution characteristics of the features of the training data obtained from a batch of valid training data, and the step of setting the normal occurrence probability threshold ε according to those characteristics, correspond to the preparation-phase work described in detail above.
The stage of updating the model with new data, or of online inference/prediction, is described below by way of example.
At this stage, the model prediction initiator (e.g., party A) applies to update the model with new data or to perform online inference/prediction; the data inspector (e.g., party B) detects the application, the data inspection module is activated, and the data inspector requests data inspection. Each data provider aligns the new data by encrypted sample alignment (e.g., PSI secure intersection). The model prediction initiator (e.g., party A) then automatically calculates the intermediate result required for computing the occurrence probability P(x) of the new data and sends it to the data inspector, as required by the data inspection module.
The occurrence probability P(x), i.e., the probability that a piece of data x appears under the same distribution as the batch of valid training data, is then calculated. In some embodiments, the data inspector calculates the occurrence probability P(x) of the new data. The case of calculating it for every piece of data and the case of sampled calculation are explained in detail below.
For example, the new data contains a data portion provided by data provider A and a data portion provided by data provider B, and is aligned by encrypted sample alignment to form a virtual fused data set. In some embodiments, a piece of data is a d-dimensional vector x, where party A holds features x_1, x_2, ..., x_d1 and party B holds features x_(d1+1), x_(d1+2), ..., x_d.
The occurrence probability of data x under the same distribution as the batch of valid training data is P(x) = P(x_1)·P(x_2|x_1)·P(x_3|x_1,x_2)·...·P(x_(d-1)|x_1,x_2,x_3,...,x_(d-2))·P(x_d|x_1,x_2,x_3,...,x_(d-1)).
To make the calculation and transmission of intermediate parameters simpler and more efficient, the features can be assumed to be mutually independent, so the formula above reduces to P(x) = P(x_1)·P(x_2)·P(x_3)·...·P(x_(d-1))·P(x_d).
In some embodiments, P(x) is calculated as follows. Party A calculates P(x_A) = P(x_1)·P(x_2)·P(x_3)·...·P(x_d1) and, depending on the security requirements and the federated framework, transmits P(x_A), either encrypted or unencrypted, to the final data occurrence probability calculator. In some embodiments, the final data occurrence probability calculator may be party B or a separate final calculator. Party B calculates P(x_B) = P(x_(d1+1))·P(x_(d1+2))·P(x_(d1+3))·...·P(x_d). The final data occurrence probability calculator then computes the occurrence probability of the single piece of data, P(x) = P(x_A)·P(x_B). When party B doubles as the final calculator, party B receives P(x_A) from party A and computes P(x) = P(x_A)·P(x_B). When the final calculator is a separate party other than A and B, it receives P(x_A) from party A and P(x_B) from party B and computes P(x) = P(x_A)·P(x_B).
Before party A provides the new data, party B requires party A to provide the joint occurrence probability P(x_A) over all of party A's features of the new data. The new data of party A is aligned with party B by encrypted sample alignment (e.g., PSI secure intersection). For example, where party B doubles as the final data occurrence probability calculator, party A calculates P(x_A) and sends it to party B; party B calculates P(x_B) and, upon receiving P(x_A), computes P(x) = P(x_A)·P(x_B).
The data inspector (in this example, party B is both the data inspector and the final data occurrence probability calculator) compares the occurrence probability P(x) with the normal occurrence probability threshold ε: if P(x) is smaller than ε, the piece of data is judged to be abnormal data; otherwise it is judged to be normal data. If the piece of data is judged to be abnormal, it is anomalous or contaminated and may be rejected for longitudinal federated learning. If it is judged to be normal, it may be allowed to be used for longitudinal federated learning.
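The per-record inspection flow in this two-party example might look like the following sketch, where `local_occurrence_probability` is the hypothetical helper from the earlier sketch and the transport (and optional encryption) of the intermediate value is elided; it is an illustration under those assumptions, not the prescribed implementation:

```python
def party_a_intermediate(x_a, stats_a):
    # Party A: joint occurrence probability over its own features only, P(x_A)
    return local_occurrence_probability(x_a, stats_a)

def party_b_inspect(p_x_a, x_b, stats_b, eps):
    """Party B (final calculator and data inspector): fuse the intermediate result
    and compare the full occurrence probability against the threshold epsilon."""
    p_x_b = local_occurrence_probability(x_b, stats_b)  # P(x_B)
    p_x = p_x_a * p_x_b                                 # P(x) = P(x_A)·P(x_B)
    return "normal" if p_x >= eps else "abnormal"
```

With a separate final calculator, party B would also transmit P(x_B), and the comparison against ε would take place wherever the data inspector sits.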
In some embodiments, the data inspector may also perform data inspection on a batch of data.
When the input new data consists of multiple pieces but their number is small (for example, below a piece-count threshold), the occurrence probability of each piece is calculated, and then the contamination probability of the batch, P(P(x) < ε), is calculated. If the contamination probability is less than a confidence tolerance threshold α (which may, for example, be set to 1%-5%, with reference to common confidence levels), the batch is accepted for updating the model; otherwise the batch is rejected (or, after the contaminated pieces are deleted according to their individual occurrence probabilities, the uncontaminated part of the batch is accepted). If the contamination probability exceeds a threat tolerance threshold β (which may, for example, be set to 25%), the data source of that data participant may be deliberately contaminating the data or maliciously attacking the model; all subsequent data from that participant is refused, and the participant is contacted and warned. In other words, when the batch contains fewer pieces of data than the threshold, the occurrence probability of each piece in the batch is calculated and the contamination probability P(P(x) < ε) of the batch is calculated; if P(P(x) < ε) is less than the confidence tolerance threshold α, the batch is accepted for longitudinal federated learning, otherwise it is rejected; and if P(P(x) < ε) is greater than the threat tolerance threshold β, the batch is rejected for longitudinal federated learning and, further, all subsequent data provided by party A is refused. The confidence tolerance threshold α is less than the threat tolerance threshold β.
When the volume of input new data is large (for example, not less than the piece-count threshold), sampled calculation may be chosen: for example, 10%-20% of the pieces are sampled, the occurrence probability of each sampled piece is calculated, and the contamination probability P(P(x) < ε) of the batch is computed. In this case the recommended confidence tolerance threshold α may be lower than in the small-volume case (for example, 0.1%-1%) and the threat tolerance threshold β may be higher (for example, 30%-40%). That is, when the batch contains no fewer pieces of data than the threshold, the occurrence probabilities of a sampled portion of the pieces are calculated and the contamination probability P(P(x) < ε) of the batch is computed. If P(P(x) < ε) is not less than the confidence tolerance threshold α, the batch is rejected for longitudinal federated learning and party A is notified to process its data; alternatively, if every piece is evaluated individually rather than sampled, the pieces that fail the check can be filtered out directly and the pieces that pass can be accepted. If P(P(x) < ε) is greater than the threat tolerance threshold β, the batch is rejected for longitudinal federated learning, all subsequent data provided by party A is refused, party A is judged to be a data-contaminating party, and party A is warned that its data source has a problem. As before, the confidence tolerance threshold α is less than the threat tolerance threshold β.
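The batch-level decision logic described above can be sketched as follows; the thresholds, sampling rate, and return labels are illustrative assumptions rather than values fixed by the patent:

```python
import random

def inspect_batch(occurrence_probs, eps, alpha=0.05, beta=0.25, sample_rate=None):
    """Batch inspection: the contamination probability P(P(x) < eps) is estimated as
    the fraction of (optionally sampled) records whose occurrence probability is
    below the normal occurrence probability threshold eps."""
    probs = list(occurrence_probs)
    if sample_rate is not None:  # large batches: sampled calculation
        k = max(1, int(len(probs) * sample_rate))
        probs = random.sample(probs, k)
    contamination = sum(p < eps for p in probs) / len(probs)
    if contamination > beta:
        return "reject batch and refuse all subsequent data from this provider"
    if contamination >= alpha:
        return "reject batch (or keep only the records that pass individually)"
    return "accept batch"
```

For large batches the text above recommends a lower α (for example 0.1%-1%) and a higher β (for example 30%-40%) than in the small-batch case.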
When there is a separate final data occurrence probability calculator, the flow is similar to the case where party B doubles as the final calculator, except that P(x) = P(x_A)·P(x_B) is computed at the separate final calculator, and party B must also transmit P(x_B) to that final calculator.
In some embodiments, the method further comprises performing federated model updates/predictions (including encryption and decryption) using the results of the longitudinal federated learning.
In the present invention, the probability that a single piece of new data could occur under the same distribution as the first batch of training data can be used to estimate the plausibility of that data.
The assumption that the features are mutually independent allows the occurrence probability to be estimated as the product of the occurrence probabilities over all individual feature dimensions, which makes federated probability calculation possible (the transmitted intermediate results do not compromise privacy).
The product of the occurrence probabilities over all feature dimensions held locally by each party can be used as the transmitted intermediate result, so that the statistics of any single feature are not revealed.
The implementation can be simplified by taking the product of the probabilities of three feature classes: features with a known distribution or an existing distribution hypothesis, discrete-variable features with an unknown or uncommon distribution hypothesis, and continuous-variable or other features with an unknown or uncommon distribution hypothesis.
The product of the normal-occurrence probabilities under the different distributions may be used as the threshold for determining anomalies.
The anomaly rate of each batch of new data can be used as the contamination probability of that batch, and the normal probability threshold is used to judge whether the data source/data provider of the batch is contaminating the data or launching a malicious attack.
This patent provides a brand-new method for detecting data contamination or malicious theft in a longitudinal (vertical) federated learning scenario: when the model is updated or used for online inference/prediction, the probability that the data newly provided by a participant is genuine is calculated based on the characteristics of the training data, and if this probability is smaller than the normal probability threshold, the data is judged to be contaminated or malicious.
Compared with the prior art, the invention has at least the following advantages.
This patent is applicable to longitudinal (vertical) federated learning scenarios, whereas the prior art is only applicable to horizontal federated learning scenarios. The data inspector is not required to hold the complete new data (all feature values) during inspection; it may hold only some of the features, or even none of the raw data (feature values) of the new data.
With this method, only a data inspection activation module is added at each party, and the existing federated learning process is not substantially modified. The necessary data inspection is achieved through simple storage and calculation steps in the preparation stage and in the new-data application stage; unlike other prior art, no large amount of extra computation is needed to train a separate anomaly detection model for the new data, and no complicated parameter transmission or model prediction calculation is required. The method therefore has greater practicability and is easier to retrofit.
Fig. 4 illustrates an exemplary configuration of a computing device 400 capable of implementing embodiments in accordance with the present disclosure.
Computing device 400 is an example of a hardware device to which the above-described aspects of the disclosure can be applied. Computing device 400 may be any machine configured to perform processing and/or computing. Computing device 400 may be, but is not limited to, a workstation, a server, a desktop computer, a laptop computer, a tablet computer, a personal digital assistant (PDA), a smart phone, an in-vehicle computer, or a combination thereof.
As shown in fig. 4, computing device 400 may include one or more elements that may be connected to or in communication with bus 402 via one or more interfaces. Bus 402 can include, but is not limited to, an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MCA) bus, an Enhanced ISA (EISA) bus, a Video Electronics Standards Association (VESA) local bus, a Peripheral Component Interconnect (PCI) bus, and the like. Computing device 400 may include, for example, one or more processors 404, one or more input devices 406, and one or more output devices 408. The one or more processors 404 may be any kind of processor and may include, but are not limited to, one or more general purpose processors or special purpose processors (such as special purpose processing chips). The processor 404 may be configured to perform the methods of the present disclosure, for example. Input device 406 may be any type of input device capable of inputting information to a computing device and may include, but is not limited to, a mouse, a keyboard, a touch screen, a microphone, and/or a remote control. Output device 408 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, a video/audio output terminal, a vibrator, and/or a printer.
The computing device 400 may also include or be connected to a non-transitory storage device 414, which may be any storage device that is non-transitory and can implement data storage, and may include, but is not limited to, a disk drive, an optical storage device, solid-state memory, a floppy disk, a flexible disk, a hard disk, magnetic tape or any other magnetic medium, a compact disc or any other optical medium, cache memory, and/or any other memory chip or module, and/or any other medium from which a computer may read data, instructions, and/or code. Computing device 400 may also include random access memory (RAM) 410 and read-only memory (ROM) 412. The ROM 412 may store programs, utilities, or processes to be executed in a non-volatile manner. The RAM 410 may provide volatile data storage and store instructions related to the operation of the computing device 400. Computing device 400 may also include a network/bus interface 416 coupled to a data link 418. The network/bus interface 416 may be any kind of device or system capable of enabling communication with external devices and/or networks, and may include, but is not limited to, a modem, a network card, an infrared communication device, a wireless communication device, and/or a chipset (such as a Bluetooth™ device, an 802.11 device, a WiFi device, a WiMax device, a cellular communication facility, etc.).
The present disclosure may be implemented as any combination of devices, systems, integrated circuits, and computer programs on non-transitory computer readable media. One or more processors may be implemented as an Integrated Circuit (IC), an Application Specific Integrated Circuit (ASIC), or a large scale integrated circuit (LSI), a system LSI, or a super LSI, or as an ultra LSI package that performs some or all of the functions described in this disclosure.
The present disclosure includes the use of software, applications, computer programs or algorithms. Software, applications, computer programs, or algorithms may be stored on a non-transitory computer readable medium to cause a computer, such as one or more processors, to perform the steps described above and depicted in the figures. For example, one or more memories store software or algorithms in executable instructions and one or more processors may associate a set of instructions to execute the software or algorithms to provide various functionality in accordance with embodiments described in this disclosure.
Software and computer programs (which may also be referred to as programs, software applications, components, or code) include machine instructions for a programmable processor, and may be implemented in a high-level procedural, object-oriented, functional, logical, or assembly or machine language. The term "computer-readable medium" refers to any computer program product, apparatus or device, such as magnetic disks, optical disks, solid state storage devices, memories, and Programmable Logic Devices (PLDs), used to provide machine instructions or data to a programmable data processor, including a computer-readable medium that receives machine instructions as a computer-readable signal.
By way of example, computer-readable media may comprise dynamic random access memory (DRAM), random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), compact disc read-only memory (CD-ROM) or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired computer-readable program code in the form of instructions or data structures and that can be accessed by a general-purpose or special-purpose computer or a general-purpose or special-purpose processor. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk and Blu-ray disc, where disks usually reproduce data magnetically while discs reproduce data optically with lasers. Combinations of the above are also included within the scope of computer-readable media.
The subject matter of the present disclosure is provided as examples of apparatus, systems, methods, and programs for performing the features described in the present disclosure. However, other features or variations are contemplated in addition to the features described above. It is contemplated that the implementation of the components and functions of the present disclosure may be accomplished with any emerging technology that may replace the technology of any of the implementations described above.
Additionally, the above description provides examples, and does not limit the scope, applicability, or configuration set forth in the claims. Changes may be made in the function and arrangement of elements discussed without departing from the spirit and scope of the disclosure. Various embodiments may omit, substitute, or add various procedures or components as appropriate. For example, features described with respect to certain embodiments may be combined in other embodiments.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous.

Claims (14)

1. A data inspection method for longitudinal federated learning, comprising the following steps:
storing probability distribution characteristics of each feature of training data, obtained based on a batch of valid training data;
setting a normal occurrence probability threshold ε of the training data according to the probability distribution characteristics of all features of the training data;
calculating the occurrence probability P(x) that a piece of data x appears under the same distribution as the batch of valid training data; and
comparing the occurrence probability P(x) with the normal occurrence probability threshold ε, and if P(x) is smaller than ε, judging the piece of data to be abnormal data, otherwise judging the piece of data to be normal data.
2. The method of claim 1, wherein the longitudinal federated learning participants include party A and party B, party A holding features x_1, x_2, ..., x_d1 and party B holding features x_(d1+1), x_(d1+2), ..., x_d,
a piece of data x = (x_1, x_2, ..., x_d1, x_(d1+1), x_(d1+2), ..., x_d), d1 being a natural number greater than or equal to 1 and d being a natural number greater than or equal to 2, and
the occurrence probability is calculated according to the following formula: P(x) = P(x_1)·P(x_2|x_1)·P(x_3|x_1,x_2)·...·P(x_(d-1)|x_1,x_2,x_3,...,x_(d-2))·P(x_d|x_1,x_2,x_3,...,x_(d-1)).
3. The method of claim 2, wherein
the occurrence probability is calculated according to the following formula: P(x) = P(x_1)·P(x_2)·P(x_3)·...·P(x_(d-1))·P(x_d).
4. The method of claim 3, further comprising the step of:
party A calculating P(x_A) = P(x_1)·P(x_2)·P(x_3)·...·P(x_d1), and party B calculating P(x_B) = P(x_(d1+1))·P(x_(d1+2))·P(x_(d1+3))·...·P(x_d).
5. The method of claim 4, further comprising the step of:
the final data occurrence probability calculator calculating P(x) = P(x_A)·P(x_B) from P(x_A) and P(x_B).
6. The method of claim 5, wherein
party B or another third party acts as the final data occurrence probability calculator, and
party A transmits P(x_A) to the final data occurrence probability calculator.
7. The method of claim 6, wherein
party A encrypts P(x_A) and transmits it to the final data occurrence probability calculator.
8. The method of claim 1, wherein,
if the piece of data is judged to be abnormal data, the piece of data is rejected for longitudinal federated learning, and
if the piece of data is judged to be normal data, the piece of data is allowed to be used for longitudinal federated learning.
9. The method of claim 1, further comprising the step of:
judging whether a batch of data provided by party A and containing a plurality of pieces of data is abnormal.
10. The method of claim 9, wherein,
in the case where the batch of data contains a number of pieces of data less than a threshold,
the occurrence probability of each piece of data in the batch is calculated, and the probability P(P(x) < ε) that the batch is contaminated is calculated,
if the probability P(P(x) < ε) that the batch is contaminated is less than a confidence tolerance threshold α, the batch is accepted for longitudinal federated learning,
if the probability P(P(x) < ε) that the batch is contaminated is not less than the confidence tolerance threshold α, the batch is rejected for longitudinal federated learning, or the uncontaminated part of the batch is accepted after the contaminated pieces are deleted according to the occurrence probability of each piece,
if the probability P(P(x) < ε) that the batch is contaminated is greater than a threat tolerance threshold β, the batch is rejected for longitudinal federated learning and, further, all subsequent data provided by party A is refused,
wherein the confidence tolerance threshold α is less than the threat tolerance threshold β.
11. The method of claim 9, wherein,
in the case where the batch of data contains a number of pieces of data not less than the threshold,
the occurrence probabilities of a sampled portion of the data in the batch are calculated, and the probability P(P(x) < ε) that the batch is contaminated is calculated,
if the probability P(P(x) < ε) that the batch is contaminated is less than a confidence tolerance threshold α, the batch is accepted for longitudinal federated learning,
if the probability P(P(x) < ε) that the batch is contaminated is not less than the confidence tolerance threshold α, the batch is rejected for longitudinal federated learning,
if the probability P(P(x) < ε) that the batch is contaminated is greater than a threat tolerance threshold β, the batch is rejected for longitudinal federated learning and, further, all subsequent data provided by party A is refused,
wherein the confidence tolerance threshold α is less than the threat tolerance threshold β.
12. The method of claim 1, further comprising the step of:
updating the federated model using the results of the longitudinal federated learning.
13. A data inspection device for longitudinal federated learning, comprising:
a memory having instructions stored thereon; and
a processor configured to execute instructions stored on the memory to perform the method of any of claims 1 to 12.
14. A computer-readable storage medium comprising computer-executable instructions that, when executed by one or more processors, cause the one or more processors to perform the method of any one of claims 1-12.
CN202111085459.2A 2021-09-16 2021-09-16 Data inspection method for longitudinal federal learning Pending CN115829048A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111085459.2A CN115829048A (en) 2021-09-16 2021-09-16 Data inspection method for longitudinal federal learning
PCT/CN2022/115465 WO2023040640A1 (en) 2021-09-16 2022-08-29 Data validation method for vertical federated learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111085459.2A CN115829048A (en) 2021-09-16 2021-09-16 Data inspection method for longitudinal federal learning

Publications (1)

Publication Number Publication Date
CN115829048A true CN115829048A (en) 2023-03-21

Family

ID=85514998

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111085459.2A Pending CN115829048A (en) 2021-09-16 2021-09-16 Data inspection method for longitudinal federal learning

Country Status (2)

Country Link
CN (1) CN115829048A (en)
WO (1) WO2023040640A1 (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11443240B2 (en) * 2019-09-06 2022-09-13 Oracle International Corporation Privacy preserving collaborative learning with domain adaptation
CN112420187B (en) * 2020-10-15 2022-08-26 南京邮电大学 Medical disease analysis method based on migratory federal learning
CN112231570B (en) * 2020-10-26 2024-04-16 腾讯科技(深圳)有限公司 Recommendation system support attack detection method, device, equipment and storage medium
CN113360896B (en) * 2021-06-03 2022-09-20 哈尔滨工业大学 Free Rider attack detection method under horizontal federated learning architecture
CN113283185B (en) * 2021-07-23 2021-11-12 平安科技(深圳)有限公司 Federal model training and client imaging method, device, equipment and medium

Also Published As

Publication number Publication date
WO2023040640A1 (en) 2023-03-23


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination