WO2023040640A1 - Data verification method for vertical federated learning - Google Patents

Data verification method for vertical federated learning

Info

Publication number
WO2023040640A1
WO2023040640A1 (PCT/CN2022/115465, CN2022115465W)
Authority
WO
WIPO (PCT)
Prior art keywords
data
probability
batch
party
federated learning
Prior art date
Application number
PCT/CN2022/115465
Other languages
English (en)
French (fr)
Inventor
杨诗友
章枝宪
李鑫超
严梦嘉
尹虹舒
Original Assignee
中国电信股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国电信股份有限公司 filed Critical 中国电信股份有限公司
Publication of WO2023040640A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/60 Protecting data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Definitions

  • The present disclosure relates to federated learning, and in particular to vertical federated learning.
  • Model pollution or malicious theft that may be caused by data participants is one of the most important issues.
  • The present disclosure provides a data verification method for vertical federated learning, comprising steps described below.
  • Fig. 1 shows a schematic structural block diagram of the federated learning system of the present disclosure.
  • Fig. 2 shows a specific example in which a feature follows a normal distribution.
  • Fig. 3 shows a flow chart of the data verification method for vertical federated learning according to the present disclosure.
  • Fig. 4 shows an exemplary configuration of a computing device capable of implementing embodiments according to the present disclosure.
  • Fig. 1 shows a schematic structural block diagram of the federated learning system of the present disclosure.
  • The federated learning system includes two data participants; however, the number of data participants is not limited to two, and there may be more.
  • Data participant: a party that provides its own data to participate in federated learning, including data providers (which simply provide data) and data users (which provide data and also use the model for prediction; there may be multiple data users, but each model prediction task has exactly one initiator).
  • Coordinator: a common term in the field of federated learning, set up mainly to prevent data or privacy leakage; it generally takes part in encryption and decryption (e.g., key distribution) and in aggregating intermediate results.
  • The coordinator is generally a non-data participant, that is, a federated learning participant that takes part in computation but does not provide data; however, the industry also refers to a data participant that performs result aggregation as a concurrent coordinator.
  • Final calculator of the data occurrence probability: named to distinguish it from the coordinator, this is the federated learning participant that obtains the intermediate results transmitted by the other data participants, performs the aggregation computation, and finally obtains the complete data occurrence probability and the normal occurrence probability threshold.
  • The final calculator of the data occurrence probability may be one of the data participants, or it may be a non-data participant, that is, a coordinator or data verifier that takes part in computation but does not provide data.
  • Data verifier: the federated learning participant that compares the data occurrence probability with the normal occurrence probability threshold and performs the data verification; it is also where the normal occurrence probability threshold is set.
  • The data verifier is generally the final calculator of the data occurrence probability, but the two may differ (for example, when a coordinator that is not a data participant acts as the final calculator, computes the complete data occurrence probability, and then transmits it to another data participant for verification, that data participant is the data verifier).
  • In some embodiments, the federated learning system includes only participants A and B, which provide data for federated learning and obtain the final result through the exchange and fusion of intermediate calculation results.
  • In other embodiments, the federated learning system may also include a third party that takes part in computation but does not provide data: coordinator C. With three or more data participants, there are multiple data participants similar to participant A or B.
  • The preparation phase is the phase of training a model (modeling) using a batch of valid training data.
  • In this phase, modeling is performed using a batch of valid training data.
  • The training data used for modeling must be valid; otherwise the model cannot be built correctly.
  • From the batch of valid training data, the distribution-related characteristics of each feature of the training data can be obtained, and the normal occurrence probability threshold can be set according to the feature distributions.
  • The normal occurrence probability threshold is used for subsequent data verification.
  • The data participants each compute and store locally the distribution-related characteristics of each feature of the training data, and set a normal occurrence probability threshold according to the distribution of each feature for subsequent data verification.
  • Computing the complete normal occurrence probability threshold requires the transmission of intermediate calculation results; the final calculator of the data occurrence probability computes the complete threshold.
  • The data verifier obtains and stores the normal occurrence probability threshold from the final calculator of the data occurrence probability and sets the normal occurrence probability threshold accordingly.
  • In some cases, the final calculator of the data occurrence probability doubles as the data verifier (for example, the non-initiating participant B in the two-party case).
  • In some cases, coordinator C is the final calculator of the data occurrence probability.
  • In the stage of training the model with the first batch of data (modeling), a data participant may perform the following operations: compute and store locally the distribution-related characteristics of each of its features; compute the intermediate calculation results required to set the normal occurrence probability threshold ε; and send those intermediate results to the next data participant that needs to compute, or to the final calculator of the data occurrence probability (depending on the aggregation flow of the complete result).
  • A data participant that is not the final calculator of the data occurrence probability may perform the following operations: compute and store the distribution-related characteristics of its local features; obtain the intermediate calculation results required to set the normal occurrence probability threshold ε transmitted by other data participants; update them with its own data to obtain new intermediate results; and send the new intermediate results to another data participant or to the final calculator of the data occurrence probability (depending on the aggregation flow of the complete result).
  • A data participant that also serves as the final calculator of the data occurrence probability (for example, participant B) may perform the following operations: compute and store the distribution-related characteristics of its local features; obtain the intermediate calculation results required to set the normal occurrence probability threshold ε transmitted by one or more other data participants; and, based on those intermediate results, use its own data to compute the complete normal occurrence probability threshold ε.
  • A final calculator of the data occurrence probability that is a non-data participant (for example, coordinator C, which provides no data) may perform the following operations: obtain the intermediate calculation results required to set the normal occurrence probability threshold ε transmitted by one or more other data participants, and aggregate all intermediate results to compute the complete normal occurrence probability threshold ε.
  • When the final calculator of the data occurrence probability doubles as the data verifier (for example, participant B), the data verifier can directly set the computed normal occurrence probability threshold ε as the data verification criterion.
  • When the final calculator of the data occurrence probability is not the data verifier, the final calculator needs to transmit the computed normal occurrence probability threshold ε to the data verifier so that the verifier can set ε for data verification in the subsequent model prediction stage.
  • The following illustrates in detail how the distribution-related characteristics of each feature are computed.
  • The distribution definition and estimation of each feature of the first batch of training data includes, for example, the following cases.
  • Party A and party B each compute P(x_a)·P(x_b)·P(x_c); suppose the result obtained by party A is P(x_A) and the result obtained by party B is P(x_B).
  • For a normally distributed feature, the probability corresponding to plus or minus 3 standard deviations, or slightly lower, may be used as the feature's probability reference.
  • Fig. 2 shows a specific example in which a feature follows a normal distribution.
  • For a discrete-variable feature, the probability of the least frequent value class may be taken as the feature's probability reference, for example.
  • Each feature may also follow probability distributions other than those specifically illustrated above, and the specific way of setting the normal occurrence probability threshold ε may be varied within the scope of the present disclosure.
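  • For illustration only, the following minimal Python sketch shows one way to derive per-feature probability references and combine them into ε. The function names are hypothetical, and the sketch assumes the simple references described above (the tail mass beyond 3 standard deviations for a Gaussian feature, the lowest value-class probability for a discrete feature); it is not the definitive implementation of the disclosure.

```python
import numpy as np
from scipy.stats import norm

def probability_reference(column, discrete):
    """Per-feature 'normal occurrence' probability reference (hypothetical helper)."""
    if discrete:
        # Lowest value-class probability, e.g. 0.2 for classes {0.3, 0.5, 0.2}.
        _, counts = np.unique(column, return_counts=True)
        return counts.min() / counts.sum()
    # Two-sided probability mass beyond 3 standard deviations (~0.27%).
    return 2 * norm.sf(3.0)

def normal_occurrence_threshold(columns, kinds):
    """epsilon: product of the probability references of all features."""
    eps = 1.0
    for column, kind in zip(columns, kinds):
        eps *= probability_reference(column, discrete=(kind == "discrete"))
    return eps
```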
  • Fig. 3 shows a flow chart of the data verification method for vertical federated learning according to the present disclosure.
  • The complete data verification method for vertical federated learning according to the present disclosure includes the following steps.
  • The probability distribution characteristics of each feature of the training data, obtained from a batch of valid training data, are stored.
  • The distribution of each feature of the training data may be a normal distribution, a discrete-variable distribution, or another type of distribution.
  • The data characterizing the probability distribution of each feature of the training data is stored, for example, at each participant of the vertical federated learning.
  • The training data includes a data part provided by data provider A and a data part provided by data provider B. Encrypted sample alignment is performed on the training data to form a virtual fused data set.
  • The party-A model part and the party-B model part communicate with each other to perform federated model training (including encryption and decryption).
  • Party A can store the characteristics necessary for each feature's distribution/density estimation (such as mean μ and variance σ²), and party B can likewise store the characteristics necessary for each feature's distribution/density estimation (such as mean μ and variance σ²).
  • The steps of storing the probability distribution characteristics of each feature of the training data obtained from a batch of valid training data, and of setting the normal occurrence probability threshold ε according to those characteristics, correspond to the preparation work described above.
  • The model prediction initiator (e.g., party A) applies to update the model or to perform online inference/prediction with new data.
  • The data verifier (e.g., party B) detects the application, activates the data verification module, and requests data verification.
  • Each data provider performs encrypted sample alignment (for example, PSI private set intersection) on the new data.
  • At the request of the data verification module, the model prediction initiator (for example, party A) automatically computes and sends to the data verifier the intermediate calculation results required to compute the occurrence probability P(x) of the new data.
  • The data verifier computes the occurrence probability P(x) of the new data. The cases of computing every record and of sampled computation are explained in detail below.
  • The new data contains a data part provided by data provider A and a data part provided by data provider B. Encrypted sample alignment is performed on the new data to form a virtual fused data set.
  • The record is a d-dimensional vector x; party A holds features x_1, x_2, ..., x_{d1}, and party B holds features x_{d1+1}, x_{d1+2}, ..., x_d.
  • The final calculator of the data occurrence probability may be party B or a separate final calculator of the data occurrence probability.
  • In the case where party B also acts as the final calculator of the data occurrence probability, before party A provides new data, party B requires party A to first provide the joint occurrence probability P(x_A) of all of party A's features in the new data.
  • Encrypted sample alignment (e.g., PSI private set intersection) is performed on party A's new data by parties A and B.
  • The data verifier compares the occurrence probability P(x) with the normal occurrence probability threshold ε: if P(x) is less than ε, the record is judged to be abnormal data; otherwise it is judged to be normal data. If the record is judged abnormal, this indicates data anomaly or pollution, so the record can be rejected for vertical federated learning. If the record is judged normal, it may be allowed to be used for vertical federated learning.
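  • As a minimal sketch of this single-record check (illustrative only; the helper name is hypothetical, and in practice the per-party products would arrive via the framework's secure channels):

```python
def verify_record(per_party_products, epsilon):
    """Single-record check at the data verifier.

    per_party_products: the locally computed joint probabilities,
    e.g. [P(x_A), P(x_B)]; transmitting only these products keeps
    single-feature statistics private.
    """
    p_x = 1.0
    for part in per_party_products:
        p_x *= part
    return p_x >= epsilon  # True: normal data; False: abnormal data
```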
  • The data verifier may also perform data verification on a batch of data.
  • When the new input consists of multiple but few records, the occurrence probability of every record can be computed, and finally the pollution probability P(P(x)<ε) of the batch is computed. If the pollution probability is less than the trust tolerance threshold α (for example, settable to 1%-5%, with reference to reliability-coefficient settings), the batch is accepted for updating the model; otherwise it is rejected (or the polluted records are removed according to each record's occurrence probability and the unpolluted part of the batch is accepted). If the pollution probability exceeds the threat tolerance threshold β (for example, settable to 25%), the data source of that data participant may suffer data pollution or harbor the intent to maliciously pollute the model; all subsequent data from that participant is refused, and a warning is communicated to it.
  • If the number of records in the batch is below a certain threshold, compute the occurrence probability of every record in the batch and the probability P(P(x)<ε) that the batch is polluted; if P(P(x)<ε) is less than the trust tolerance threshold α, accept the batch for vertical federated learning, otherwise reject it; if P(P(x)<ε) is greater than the threat tolerance threshold β, reject the batch for vertical federated learning and further refuse all subsequent data provided by party A.
  • The trust tolerance threshold α is smaller than the threat tolerance threshold β.
  • For large batches, the recommended trust tolerance threshold α can be lower than in the small-batch case (for example, settable to 0.1%-1%), and the recommended threat tolerance threshold β can be higher (for example, settable to 30%-40%).
  • If the number of records in the batch is not below the threshold, compute by sampling the occurrence probability of part of the records in the batch, and compute the probability P(P(x)<ε) that the batch is polluted. If P(P(x)<ε) is less than the trust tolerance threshold α, accept the batch for vertical federated learning.
  • Otherwise, reject the batch for vertical federated learning and notify party A to process the data; alternatively, if in this case every record was scored individually rather than by sampling, the failing records can simply be filtered out and the passing records in the batch accepted. If P(P(x)<ε) is greater than the threat tolerance threshold β, reject the batch for vertical federated learning, further refuse all subsequent data provided by party A, judge party A to be the data polluter, and warn that party A's data source has a problem. Likewise, the trust tolerance threshold α is smaller than the threat tolerance threshold β.
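  • A compact sketch of this batch-level decision logic follows (illustrative only; the default threshold values are the example settings above, and the return strings merely name the actions):

```python
def check_batch(p_values, epsilon, alpha=0.05, beta=0.25):
    """Batch check: fraction of records with P(x) < epsilon vs alpha/beta.

    p_values holds the occurrence probabilities of the (possibly sampled)
    records in the batch; alpha < beta as required by the method.
    """
    pollution = sum(p < epsilon for p in p_values) / len(p_values)
    if pollution > beta:
        return "reject batch, refuse all further data from this party, warn"
    if pollution >= alpha:
        return "reject batch (or drop failing records and keep the rest)"
    return "accept batch"
```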
  • The method further includes using the result of the vertical federated learning to perform federated model updating/prediction (including encryption and decryption).
  • The likelihood that a single new record appears in the same distribution as the first batch of training data can be used to estimate the probability that the data is genuine.
  • Under the assumption that the features are mutually independent, the occurrence probability can be estimated as the product of the occurrence probabilities along all individual feature dimensions, making federated probability computation (transmitting intermediate results without leaking secrets) possible.
  • the product of the occurrence probabilities of all feature dimensions owned locally by each party can be used as an intermediate result for transmission, so that the statistical results of a single feature will not be leaked.
  • the product of normal occurrence probabilities of different distributions can be used as a threshold to determine anomalies.
  • the abnormal rate of each batch of new data can be used as the pollution probability of this batch of data, and the normal probability threshold can be used to determine whether there is data pollution/malicious attack on the data source/data provider of this batch of data.
  • This patent proposes a brand-new method for detecting data pollution or malicious theft in vertical federated learning scenarios.
  • When the model is updated, or online inference or model prediction is performed, the probability that the newly provided data is genuine is computed based on the characteristics of the training data; if that probability is less than the normal probability threshold, the data is polluted or malicious.
  • the present disclosure has at least the following advantages.
  • This patent applies to vertical federated learning scenarios, whereas the related technologies apply only to horizontal federated learning scenarios.
  • During data verification, the data verifier need not hold the complete data (all feature values) of the new data; it may hold only some of the features, or even none of the original new data (feature values) at all.
  • This patent merely adds a data verification activation module at each party, makes no major modification to the existing federated learning flow, and achieves the necessary data verification through simple storage and computation steps in the preparation stage and the new-data application stage alone. It does not need to spend large amounts of extra computing power training new data anomaly detection models, as other related technologies do, nor to perform complex parameter transmission and model prediction computation, making it more practical and easier to adopt.
  • FIG. 4 shows an exemplary configuration of a computing device 400 capable of implementing embodiments according to the present disclosure.
  • Computing device 400 is an example of a hardware device to which the above-described aspects of the present disclosure can be applied.
  • Computing device 400 may be any machine configured to perform processing and/or computation.
  • Computing device 400 may be, but is not limited to, a workstation, server, desktop computer, laptop computer, tablet computer, personal digital assistant (PDA), smart phone, vehicle-mounted computer, or combinations thereof.
  • computing device 400 may include one or more elements that may be connected to or communicate with bus 402 via one or more interfaces.
  • The bus 402 may include, but is not limited to, an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MCA) bus, an Enhanced ISA (EISA) bus, a Video Electronics Standards Association (VESA) local bus, a Peripheral Component Interconnect (PCI) bus, etc.
  • Computing device 400 may include, for example, one or more processors 404 , one or more input devices 406 , and one or more output devices 408 .
  • Processor(s) 404 may be any kind of processor, and may include, but is not limited to, one or more general purpose processors or special purpose processors (such as dedicated processing chips). The processor 404 may, for example, be configured to execute the method of the present disclosure.
  • Input device 406 may be any type of input device capable of entering information into a computing device, and may include, but is not limited to, a mouse, keyboard, touch screen, microphone, and/or remote control.
  • Output devices 408 may be any type of device capable of presenting information, and may include, but are not limited to, displays, speakers, video/audio output terminals, vibrators, and/or printers.
  • The computing device 400 may also include or be connected to a non-transitory storage device 414, which may be any storage device that is non-transitory and capable of storing data, and may include, but is not limited to, a disk drive, optical storage device, solid-state memory, floppy disk, flexible disk, hard disk, magnetic tape or any other magnetic medium, compact disc or any other optical medium, cache memory and/or any other memory chip or module, and/or any other medium from which a computer can read data, instructions, and/or code.
  • Computing device 400 may also include random access memory (RAM) 410 and read only memory (ROM) 412 .
  • the ROM 412 may store programs, utilities, or processes to be executed in a non-volatile manner.
  • RAM 410 may provide volatile data storage and store instructions related to the operation of computing device 400 .
  • Computing device 400 may also include a network/bus interface 416 coupled to data link 418 .
  • Network/bus interface 416 may be any kind of device or system capable of enabling communication with external devices and/or networks, and may include, but is not limited to, a modem, network card, infrared communication device, wireless communication device, and/or chipset (such as Bluetooth™ devices, 802.11 devices, WiFi devices, WiMax devices, cellular communication facilities, etc.).
  • The one or more processors may be implemented as an integrated circuit (IC), an application-specific integrated circuit (ASIC), a large-scale integrated circuit (LSI), a system LSI, a super LSI, or an ultra LSI component that performs some or all of the functions described in the present disclosure.
  • the present disclosure includes the use of software, applications, computer programs or algorithms.
  • Software, applications, computer programs, or algorithms may be stored on a non-transitory computer-readable medium to cause a computer, such as one or more processors, to perform the steps described above and in the figures.
  • For example, one or more memories store software or an algorithm as executable instructions, and one or more processors may be associated with a set of instructions for executing that software or algorithm to provide various functions according to the embodiments described in this disclosure.
  • Software and computer programs include machine instructions for a programmable processor and may be written in a high-level procedural, object-oriented, functional, or logic programming language, or in assembly or machine language.
  • The term "computer-readable medium" refers to any computer program product, apparatus, or device used to provide machine instructions or data to a programmable data processor, such as a magnetic disk, optical disc, solid-state storage device, memory, or programmable logic device (PLD), including a computer-readable medium that receives machine instructions as computer-readable signals.
  • By way of example, a computer-readable medium may include dynamic random access memory (DRAM), random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), compact disc read-only memory (CD-ROM) or other optical disc storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store the required computer-readable program code in the form of instructions or data structures and that can be accessed by a general-purpose or special-purpose computer or processor.
  • Disk and disc include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A data verification method for vertical federated learning, comprising the steps of: storing the probability distribution characteristics of each feature of training data, obtained from a batch of valid training data; setting a normal occurrence probability threshold ε for the training data according to the probability distribution characteristics of each feature of the training data; calculating the occurrence probability P(x) that a data record x appears in the same distribution as the batch of valid training data; and comparing the occurrence probability P(x) with the normal occurrence probability threshold ε, judging the record to be abnormal data if P(x) is less than ε, and judging it to be normal data otherwise.

Description

Data verification method for vertical federated learning
CROSS-REFERENCE TO RELATED APPLICATIONS
This application is based on and claims priority to CN application No. 202111085459.2, filed on September 16, 2021, the disclosure of which is incorporated herein by reference in its entirety.
TECHNICAL FIELD
The present disclosure relates to federated learning, and in particular to vertical federated learning.
BACKGROUND
As federated learning and secure multi-party computation gradually mature into mainstream technologies for secure data sharing, new problems follow. Model pollution or malicious theft that may be caused by data participants is one of the most important of these problems.
SUMMARY
A brief overview of the present disclosure is given below in order to provide a basic understanding of some of its aspects. It should be understood, however, that this overview is not exhaustive. It is not intended to identify key or critical parts of the disclosure, nor to limit its scope. Its sole purpose is to present some concepts of the disclosure in simplified form as a prelude to the more detailed description given later.
The present disclosure provides a data verification method for vertical federated learning, comprising the steps of:
storing the probability distribution characteristics of each feature of training data, obtained from a batch of valid training data;
setting a normal occurrence probability threshold ε for the training data according to the probability distribution characteristics of each feature of the training data;
calculating the occurrence probability P(x) that a data record x appears in the same distribution as the batch of valid training data; and
comparing the occurrence probability P(x) with the normal occurrence probability threshold ε, and judging the record to be abnormal data if P(x) is less than ε, otherwise judging the record to be normal data.
Other features and advantages of the present disclosure will become apparent from the following detailed description of its preferred embodiments.
BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings, which form part of the specification, illustrate embodiments of the present disclosure and, together with the description, serve to explain its principles.
The present disclosure can be understood more clearly from the following detailed description taken with reference to the drawings, in which:
Fig. 1 shows a schematic structural block diagram of the federated learning system of the present disclosure.
Fig. 2 shows a specific example in which a feature follows a normal distribution.
Fig. 3 shows a flow chart of the data verification method for vertical federated learning according to the present disclosure.
Fig. 4 shows an exemplary configuration of a computing device capable of implementing embodiments according to the present disclosure.
DETAILED DESCRIPTION
The following detailed description is made with reference to the accompanying drawings and is provided to assist in a comprehensive understanding of various example embodiments of the present disclosure. It includes various details to assist understanding, but these are to be regarded as examples only and not as limiting the present disclosure, which is defined by the appended claims and their equivalents. The words and phrases used in the following description serve only to enable a clear and consistent understanding of the disclosure. In addition, descriptions of well-known structures, functions, and configurations may be omitted for clarity and brevity. Those of ordinary skill in the art will recognize that various changes and modifications can be made to the examples described herein without departing from the spirit and scope of the present disclosure.
Previously disclosed data verification or protection methods apply only to horizontal federated learning scenarios and cannot be applied to vertical federated learning scenarios; they also suffer from the potential problem of requiring large amounts of computation, consuming resources, and sacrificing performance.
Fig. 1 shows a schematic structural block diagram of the federated learning system of the present disclosure.
In some embodiments, the federated learning system includes two data participants; however, the number of data participants is not limited to two, and more data participants may be present.
For clarity, the definitions used in the following description are as follows:
Data participant: a party that provides its own data to participate in federated learning, including data providers (which simply provide data) and data users (which provide data and also use the model for prediction; there may be multiple data users, but each model prediction task has exactly one initiator).
Coordinator: a common term in the field of federated learning, set up mainly to prevent data or privacy leakage; it generally takes part in encryption and decryption (e.g., key distribution) and in aggregating intermediate results. The coordinator is generally a non-data participant, that is, a federated learning participant that takes part in computation but does not provide data; however, the industry also refers to a data participant that performs result aggregation as a concurrent coordinator.
Final calculator of the data occurrence probability: named to distinguish it from the coordinator, this is the federated learning participant that obtains the intermediate results transmitted by the other data participants, performs the aggregation computation, and finally obtains the complete data occurrence probability and the normal occurrence probability threshold. It may be one of the data participants, or a non-data participant, that is, a coordinator or data verifier that takes part in computation but does not provide data.
Data verifier: the federated learning participant that compares the data occurrence probability with the normal occurrence probability threshold and performs the data verification; it is also where the normal occurrence probability threshold is set. The data verifier is generally the final calculator of the data occurrence probability, but the two may differ (for example, when a coordinator that is not a data participant acts as the final calculator, computes the complete data occurrence probability, and then transmits it to another data participant for verification, that data participant is the data verifier).
In some embodiments, as shown in Fig. 1(A), the federated learning system includes only participants A and B, which provide data for federated learning and obtain the final result through the exchange and fusion of intermediate calculation results. In other embodiments, as shown in Fig. 1(B), besides the data-providing participants A and B, the federated learning system may also include a third party that takes part in computation but does not provide data: coordinator C. With three or more data participants, there are multiple data participants similar to participant A or B.
For clarity and brevity, the following description uses as an example only the case in which the federated learning system includes two data participants (e.g., participant A and participant B).
The following describes the stage of training a model (modeling) with a batch of valid training data (the preparation stage).
In this stage, modeling is performed using a batch of valid training data. The training data used for modeling must be valid; otherwise the model cannot be built correctly.
First, a batch of valid training data is prepared. From this batch of valid training data, the distribution-related characteristics of each feature of the training data can be obtained, and the normal occurrence probability threshold can be set according to the feature distributions. The normal occurrence probability threshold is used for subsequent data verification.
In some embodiments, the data participants each compute and store locally the distribution-related characteristics of each feature of the training data, and set a normal occurrence probability threshold according to the distribution of each feature for subsequent data verification. Computing the complete normal occurrence probability threshold requires the transmission of intermediate calculation results, and the final calculator of the data occurrence probability computes the complete threshold. The data verifier obtains and stores the normal occurrence probability threshold from the final calculator and sets the normal occurrence probability threshold accordingly.
In some embodiments, the final calculator of the data occurrence probability also serves as the data verifier (e.g., the non-initiating participant B in the two-party case). In some embodiments, the final calculator is coordinator C.
In some embodiments, in the stage of training the model with the first batch of data (modeling), a data participant that is not the final calculator of the data occurrence probability (e.g., participant A) may perform the following operations: compute and store locally the distribution-related characteristics of each of its features; compute the intermediate calculation results required to set the normal occurrence probability threshold ε; and send those intermediate results to the next data participant that needs to compute, or to the final calculator (depending on the aggregation flow of the complete result).
In some embodiments, in the stage of training the model with the first batch of data (modeling), a data participant that is not the final calculator may perform the following operations: compute and store locally the distribution-related characteristics of each of its features; obtain the intermediate calculation results required to set the normal occurrence probability threshold ε transmitted by another data participant; update them with its own data to obtain new intermediate results; and send the new intermediate results to another data participant or to the final calculator (depending on the aggregation flow of the complete result).
In some embodiments, in the stage of training the model with the first batch of data (modeling), a data participant that also serves as the final calculator (e.g., participant B) may perform the following operations: compute and store locally the distribution-related characteristics of each of its features; obtain the intermediate calculation results required to set the normal occurrence probability threshold ε transmitted by one or more other data participants; and, based on those intermediate results, use its own data to compute the complete normal occurrence probability threshold ε.
In some embodiments, in the stage of training the model with the first batch of data (modeling), a final calculator that is a non-data participant (e.g., coordinator C, which provides no data) may perform the following operations: obtain the intermediate calculation results required to set the normal occurrence probability threshold ε transmitted by one or more other data participants, and aggregate all the intermediate results to compute the complete normal occurrence probability threshold ε.
In some embodiments, the final calculator of the data occurrence probability also serves as the data verifier (e.g., participant B), and the data verifier can directly set the computed normal occurrence probability threshold ε as the data verification criterion.
In some embodiments, the final calculator of the data occurrence probability is not the data verifier; the final calculator then needs to transmit the computed normal occurrence probability threshold ε to the data verifier so that the verifier can set ε for data verification in the subsequent model prediction stage.
The following illustrates in detail how the distribution-related characteristics of each feature are computed.
The distribution definition and estimation of each feature of the first batch of training data includes, for example, the following cases.
a) Features with known distributions or existing distribution assumptions (class a): for some features a commonly used distribution is known from past experience, so only the characteristics required to estimate that assumed distribution need to be stored, and occurrence probabilities are computed directly from the assumed distribution (for example, if a feature is known to be yes/no with equal probability, the occurrence probability of either value is 0.5 and that of any other value is 0). Let the joint occurrence probability of all such features be P(x_a) = the product of the occurrence probabilities of all class-a features.
b) Discrete-variable features with unknown distributions or no common distribution assumption (class b): in the vast majority of cases a discrete variable has few value classes; when the amount of data is large, by the law of large numbers the empirical probability of each value in the first batch of training data can be used directly as the assumed distribution of the feature, storing one probability per value, e.g., P(x=0)=0.3, P(x=1)=0.5, P(x=2)=0.2. Let the joint occurrence probability of all such features be P(x_b) = the product of the occurrence probabilities of all class-b features.
c) Continuous-variable features with unknown distributions or no common distribution assumption, and other features (class c): in the vast majority of cases, when the amount of data is large, by the central limit theorem and the law of large numbers the random variable can be assumed to be approximately normally (Gaussian) distributed, so the expectation (mean) and variance can be used for distribution/density estimation. Let the joint occurrence probability of all such features be P(x_c) = the product of the occurrence probabilities of all class-c features.
Normal distribution: X ~ N(μ, σ²).
Suppose the first batch of training data contains n records. For each feature j, the estimates of μ and σ² are computed as:

$$\mu_j = \frac{1}{n}\sum_{i=1}^{n} x_j^{(i)}, \qquad \sigma_j^2 = \frac{1}{n}\sum_{i=1}^{n}\left(x_j^{(i)} - \mu_j\right)^2$$

Suppose there are m continuous-variable features with unknown distributions or no common distribution assumption; then, assuming the features are mutually independent, the joint occurrence probability p(x_c) of these m features is computed from the normal density as:

$$p(x_c) = \prod_{j=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma_j}\exp\!\left(-\frac{(x_j-\mu_j)^2}{2\sigma_j^2}\right)$$
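For illustration only, a minimal numpy sketch of this estimation and of the class-c joint probability follows; the function names are hypothetical, and the variance is the maximum-likelihood estimate dividing by n, matching the formula above.

```python
import numpy as np

def fit_gaussian_features(X):
    """Estimate (mu_j, sigma_j^2) for each class-c feature.

    X: an n x m array holding one party's continuous features
    from the first (valid) batch of training data.
    """
    mu = X.mean(axis=0)
    var = X.var(axis=0)  # maximum-likelihood estimate (divides by n)
    return mu, var

def joint_gaussian_probability(x, mu, var):
    """p(x_c): product of per-feature normal densities, assuming the
    m features are mutually independent."""
    density = np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    return float(np.prod(density))
```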
Party A and party B each compute P(x_a)·P(x_b)·P(x_c); suppose the result obtained by party A is P(x_A) and the result obtained by party B is P(x_B).
The following illustrates in detail how the normal occurrence probability threshold ε is set.
For example, if a feature follows a normal distribution, and 99.7% of points fall within 3 standard deviations, the probability corresponding to plus or minus 3 standard deviations, or slightly lower, may be used as the feature's probability reference. Fig. 2 shows a specific example in which a feature follows a normal distribution.
For example, if a feature follows a discrete-variable distribution, the probability of the least frequent value class may be used as the feature's probability reference.
As a concrete illustration, suppose that among all the features there are m normally distributed features, for which the lowest normal probability reference is taken as 0.004, and one discrete-variable feature with three value classes distributed as P(x=0)=0.3, P(x=1)=0.5, P(x=2)=0.2, so the lowest probability 0.2 is chosen; then the normal occurrence probability threshold that should be set for each record is ε = 0.2·(0.004)^m.
The present disclosure is not limited to the examples specifically illustrated; for instance, each feature may follow probability distributions other than those illustrated above, and the specific way of setting the normal occurrence probability threshold ε may be varied within the scope of the present disclosure.
The complete data verification method for vertical federated learning according to the present disclosure is described below by way of example.
Fig. 3 shows a flow chart of the data verification method for vertical federated learning according to the present disclosure.
In some embodiments, the complete data verification method for vertical federated learning according to the present disclosure comprises the following steps.
Store the probability distribution characteristics of each feature of the training data, obtained from a batch of valid training data. For example, the distribution of each feature of the training data may be a normal distribution, a discrete-variable distribution, or another type of distribution. Accordingly, the data characterizing the probability distribution of each feature of the training data is stored, for example, at each participant of the vertical federated learning.
For example, the training data contains a data part provided by data provider A and a data part provided by data provider B. Encrypted sample alignment is performed on the training data to form a virtual fused data set; a toy sketch of such alignment is given below.
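For illustration only, the following toy Python sketch aligns two parties' sample IDs by intersecting salted hashes. This is merely a placeholder for the encrypted sample alignment step: hashing only obscures identifiers and is not a real PSI protocol, which would use cryptographic techniques such as blind signatures or oblivious transfer.

```python
import hashlib

def toy_sample_alignment(ids_a, ids_b, salt="shared-secret-salt"):
    """Toy stand-in for encrypted sample alignment (NOT real PSI)."""
    digest = lambda s: hashlib.sha256((salt + s).encode()).hexdigest()
    hashed_b = {digest(i) for i in ids_b}
    # Records present at both parties form the virtual fused data set.
    return [i for i in ids_a if digest(i) in hashed_b]
```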
Set the normal occurrence probability threshold ε of the training data according to the probability distribution characteristics of each feature of the training data. Specific examples of setting ε were described in detail in the preparation stage above and are not repeated here.
In some embodiments, the party-A model part and the party-B model part communicate with each other to perform federated model training (including encryption and decryption).
In the stage of training the model with the first batch of data, party A can store the characteristics necessary for each feature's distribution/density estimation (e.g., mean μ and variance σ²), and party B can likewise store the characteristics necessary for each feature's distribution/density estimation (e.g., mean μ and variance σ²).
The steps of storing the probability distribution characteristics of each feature of the training data obtained from a batch of valid training data, and of setting the normal occurrence probability threshold ε according to those characteristics, correspond to the preparation-stage work described above.
The stage of updating the model with new data, or of online inference/prediction, is described below by way of example.
In this stage, the model prediction initiator (e.g., party A) applies to update the model or to perform online inference/prediction with new data; the data verifier (e.g., party B) detects the application, activates the data verification module, and requests data verification. Each data provider performs encrypted sample alignment (e.g., PSI private set intersection) on the new data. At the request of the data verification module, the model prediction initiator (e.g., party A) automatically computes and sends to the data verifier the intermediate calculation results required to compute the occurrence probability P(x) of the new data.
Compute the occurrence probability P(x) that a data record x appears in the same distribution as the batch of valid training data. In some embodiments, the data verifier computes the occurrence probability P(x) of the new data. The cases of computing every record and of sampled computation are explained in detail below.
For example, the new data contains a data part provided by data provider A and a data part provided by data provider B. Encrypted sample alignment is performed on the new data to form a virtual fused data set. In some embodiments, the record is a d-dimensional vector x; party A holds features x_1, x_2, ..., x_{d1}, and party B holds features x_{d1+1}, x_{d1+2}, ..., x_d.
The occurrence probability that record x appears in the same distribution as the batch of valid training data is P(x) = P(x_1)·P(x_2|x_1)·P(x_3|x_1,x_2)·...·P(x_{d-1}|x_1,x_2,x_3,...,x_{d-2})·P(x_d|x_1,x_2,x_3,...,x_{d-1}).
To make the computation and the transmission of intermediate parameters simpler and more efficient, the features may be assumed to be mutually independent, in which case the formula above reduces to P(x) = P(x_1)·P(x_2)·P(x_3)·...·P(x_{d-1})·P(x_d).
In some embodiments, P(x) is computed as follows. Party A computes P(x_A) = P(x_1)·P(x_2)·P(x_3)·...·P(x_{d1}) and, depending on the security requirements and the federated framework, transmits P(x_A), encrypted or not, to the final calculator of the data occurrence probability. In some embodiments, the final calculator of the data occurrence probability may be party B or a separate final calculator. Party B computes P(x_B) = P(x_{d1+1})·P(x_{d1+2})·P(x_{d1+3})·...·P(x_d). The final calculator then computes the occurrence probability of the single record, P(x) = P(x_A)·P(x_B). When party B doubles as the final calculator, party B receives P(x_A) from party A and computes P(x) = P(x_A)·P(x_B). When the final calculator is a separate party other than A and B, it receives P(x_A) from party A and P(x_B) from party B and computes P(x) = P(x_A)·P(x_B).
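As a minimal sketch of this split computation (illustrative only; a real deployment would send the products over the framework's secure channels, possibly encrypted):

```python
def local_joint_probability(feature_probabilities):
    """Run at each party: multiply the occurrence probabilities of the
    features that party holds, yielding P(x_A) or P(x_B)."""
    product = 1.0
    for p in feature_probabilities:
        product *= p
    return product

# At the final calculator (party B here, which already holds P(x_B)):
# p_x = p_x_a * p_x_b, then compare p_x with epsilon.
```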
Before party A provides the new data, party B requires party A to first provide the joint occurrence probability P(x_A) of all of party A's features in the new data. Encrypted sample alignment (e.g., PSI private set intersection) is performed on party A's new data by parties A and B. For example, when party B doubles as the final calculator of the data occurrence probability, party A computes P(x_A) and sends it to party B; party B computes P(x_B) and, after receiving P(x_A), computes P(x) = P(x_A)·P(x_B).
The data verifier (in this example, party B serves as both the data verifier and the final calculator of the data occurrence probability) compares the occurrence probability P(x) with the normal occurrence probability threshold ε; if P(x) is less than ε, the record is judged to be abnormal data, otherwise it is judged to be normal data. If the record is judged to be abnormal data, this indicates data anomaly or pollution, so the record can be rejected for vertical federated learning. If the record is judged to be normal data, the record may be allowed to be used for vertical federated learning.
In some embodiments, the data verifier may also perform data verification on a batch of data.
When the new input consists of multiple but few records (e.g., fewer than a record-count threshold), the occurrence probability of every record can be computed, and finally the pollution probability P(P(x)<ε) of the batch is computed. If the pollution probability is less than the trust tolerance threshold α (e.g., settable to 1%-5%, with reference to reliability-coefficient settings), the batch is accepted for updating the model; otherwise it is rejected (or the polluted records are removed according to each record's occurrence probability and the unpolluted part of the batch is accepted). If the pollution probability exceeds the threat tolerance threshold β (e.g., settable to 25%), the data source of that data participant may suffer data pollution or harbor the intent to maliciously pollute the model; all subsequent data from that participant is refused, and a warning is communicated to it. For example, when the number of records in the batch is below a certain threshold: compute the occurrence probability of every record in the batch and the probability P(P(x)<ε) that the batch is polluted; if P(P(x)<ε) is less than the trust tolerance threshold α, accept the batch for vertical federated learning, otherwise reject the batch for vertical federated learning; if P(P(x)<ε) is greater than the threat tolerance threshold β, reject the batch for vertical federated learning and further refuse all subsequent data provided by party A. The trust tolerance threshold α is smaller than the threat tolerance threshold β.
When the amount of new input data is large (e.g., not less than the record-count threshold), sampled computation can be chosen, for example drawing 10%-20% of the records, computing the occurrence probability of each sampled record, and finally computing the pollution probability P(P(x)<ε) of the batch. In this case the recommended trust tolerance threshold α may be lower than in the small-batch case (e.g., settable to 0.1%-1%), and the recommended threat tolerance threshold β may be higher (e.g., settable to 30%-40%). For example, when the number of records in the batch is not below the threshold: compute by sampling the occurrence probability of part of the records in the batch, and compute the probability P(P(x)<ε) that the batch is polluted. If P(P(x)<ε) is less than the trust tolerance threshold α, accept the batch for vertical federated learning. If P(P(x)<ε) is not less than α, reject the batch for vertical federated learning and notify party A to process the data; alternatively, if in this case every record was scored individually rather than by sampling, the failing records in the batch can simply be filtered out and the passing records accepted. If P(P(x)<ε) is greater than the threat tolerance threshold β, reject the batch for vertical federated learning, further refuse all subsequent data provided by party A, judge party A to be the data polluter, and warn that party A's data source has a problem. Likewise, the trust tolerance threshold α is smaller than the threat tolerance threshold β.
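For illustration only, a small sketch of the sampled batch check follows; the sampling fraction of 10%-20% matches the example setting above, and the function name is hypothetical.

```python
import random

def sampled_pollution_probability(p_values, epsilon, frac=0.15, seed=0):
    """Estimate P(P(x) < epsilon) for a large batch by scoring only a
    10%-20% sample of the records instead of every record."""
    rng = random.Random(seed)
    k = max(1, int(len(p_values) * frac))
    sample = rng.sample(p_values, k)
    return sum(p < epsilon for p in sample) / k
```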
Compared with the case in which party B doubles as the final calculator of the data occurrence probability, the flow with a separate final calculator is similar; the only differences are that the computation P(x) = P(x_A)·P(x_B) is performed at the separate final calculator, and that party B needs to transmit P(x_B) to that final calculator.
In some embodiments, the method further comprises using the result of the vertical federated learning to perform federated model updating/prediction (including encryption and decryption).
In the present disclosure, the likelihood that a single new record appears in the same distribution as the first batch of training data can be used to estimate the probability that the data is genuine.
The assumption that the features are mutually independent can be used so that the occurrence probability can be estimated as the product of the occurrence probabilities along all individual feature dimensions, making federated probability computation (transmitting intermediate results without leaking secrets) possible.
The product of the occurrence probabilities along all feature dimensions held locally by each party can be used as the transmitted intermediate result, so that the statistics of any single feature are not leaked.
The implementation can be simplified by using the product of the probabilities of three classes of features: features with known distributions or existing distribution assumptions; discrete-variable features with unknown distributions or no common distribution assumption; and continuous-variable features with unknown distributions or no common distribution assumption, together with other features.
The product of the normal occurrence probabilities of the different distributions can be used as the threshold for judging anomalies.
The anomaly rate of each batch of new data can be used as the pollution probability of that batch, and the normal probability threshold can be used to judge whether the data source or data provider of the batch suffers data pollution or malicious attack.
This patent proposes a brand-new method for detecting data pollution or malicious theft in vertical federated learning scenarios: when updating the model, or performing online inference or prediction with the model, the probability that the data newly provided by a participant is genuine is computed based on the characteristics of the training data; if that probability is less than the normal probability threshold, the data is polluted or malicious.
Compared with the related art, the present disclosure has at least the following advantages.
This patent applies to vertical federated learning scenarios, whereas the related technologies apply only to horizontal federated learning scenarios; during data verification, the data verifier need not hold the complete data (all feature values) of the new data, and may hold only some of the features, or even none of the original new data (feature values) at all.
This patent merely adds a data verification activation module at each party, makes no major modification to the existing federated learning flow, and achieves the necessary data verification through simple storage and computation steps in the preparation stage and the new-data application stage alone; it does not need to spend large amounts of extra computing power training new data anomaly detection models, as other related technologies do, nor to perform complex parameter transmission and model prediction computation, making it more practical and easier to adopt.
Fig. 4 shows an exemplary configuration of a computing device 400 capable of implementing embodiments according to the present disclosure.
Computing device 400 is an example of a hardware device to which the above aspects of the present disclosure can be applied. Computing device 400 may be any machine configured to perform processing and/or computation. Computing device 400 may be, but is not limited to, a workstation, a server, a desktop computer, a laptop computer, a tablet computer, a personal digital assistant (PDA), a smart phone, a vehicle-mounted computer, or a combination thereof.
As shown in Fig. 4, computing device 400 may include one or more elements that may be connected to or communicate with a bus 402 via one or more interfaces. Bus 402 may include, but is not limited to, an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MCA) bus, an Enhanced ISA (EISA) bus, a Video Electronics Standards Association (VESA) local bus, a Peripheral Component Interconnect (PCI) bus, etc. Computing device 400 may include, for example, one or more processors 404, one or more input devices 406, and one or more output devices 408. The one or more processors 404 may be any kind of processor and may include, but are not limited to, one or more general-purpose or special-purpose processors (such as dedicated processing chips). Processor 404 may, for example, be configured to execute the method of the present disclosure. Input device 406 may be any type of device capable of inputting information to the computing device and may include, but is not limited to, a mouse, a keyboard, a touch screen, a microphone, and/or a remote controller. Output device 408 may be any type of device capable of presenting information and may include, but is not limited to, a display, a speaker, a video/audio output terminal, a vibrator, and/or a printer.
Computing device 400 may also include or be connected to a non-transitory storage device 414, which may be any storage device that is non-transitory and capable of storing data, and may include, but is not limited to, a disk drive, an optical storage device, solid-state memory, a floppy disk, a flexible disk, a hard disk, a magnetic tape or any other magnetic medium, a compact disc or any other optical medium, a cache memory and/or any other memory chip or module, and/or any other medium from which a computer can read data, instructions, and/or code. Computing device 400 may also include random access memory (RAM) 410 and read-only memory (ROM) 412. ROM 412 may store, in a non-volatile manner, programs, utilities, or processes to be executed. RAM 410 may provide volatile data storage and store instructions related to the operation of computing device 400. Computing device 400 may also include a network/bus interface 416 coupled to a data link 418. Network/bus interface 416 may be any kind of device or system capable of enabling communication with external devices and/or networks and may include, but is not limited to, a modem, a network card, an infrared communication device, a wireless communication device, and/or a chipset (such as a Bluetooth™ device, an 802.11 device, a WiFi device, a WiMax device, a cellular communication facility, etc.).
The present disclosure may be implemented as any combination of apparatus, systems, integrated circuits, and computer programs on non-transitory computer-readable media. The one or more processors may be implemented as an integrated circuit (IC), an application-specific integrated circuit (ASIC), a large-scale integrated circuit (LSI), a system LSI, a super LSI, or an ultra LSI component that performs some or all of the functions described in the present disclosure.
The present disclosure includes the use of software, applications, computer programs, or algorithms. Software, applications, computer programs, or algorithms may be stored on a non-transitory computer-readable medium to cause a computer, such as one or more processors, to perform the steps described above and depicted in the drawings. For example, one or more memories store software or an algorithm as executable instructions, and one or more processors may be associated with a set of instructions for executing that software or algorithm so as to provide various functions according to the embodiments described in the present disclosure.
Software and computer programs (which may also be called programs, software applications, applications, components, or code) include machine instructions for a programmable processor and may be implemented in a high-level procedural, object-oriented, functional, or logic programming language, or in assembly or machine language. The term "computer-readable medium" refers to any computer program product, apparatus, or device used to provide machine instructions or data to a programmable data processor, such as a magnetic disk, an optical disc, a solid-state storage device, a memory, and a programmable logic device (PLD), including a computer-readable medium that receives machine instructions as computer-readable signals.
By way of example, a computer-readable medium may include dynamic random access memory (DRAM), random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), compact disc read-only memory (CD-ROM) or other optical disc storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store the required computer-readable program code in the form of instructions or data structures and that can be accessed by a general-purpose or special-purpose computer or processor. As used herein, disk and disc include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above are also included within the scope of computer-readable media.
The subject matter of the present disclosure is provided as examples of apparatus, systems, methods, and programs for performing the features described in the present disclosure. However, other features or variations are contemplated in addition to the features described above. It is contemplated that the implementation of the components and functions of the present disclosure may be accomplished with any emerging technology that may replace any of the technologies described above.
In addition, the above description provides examples without limiting the scope, applicability, or configuration set forth in the claims. Changes may be made in the function and arrangement of the elements discussed without departing from the spirit and scope of the present disclosure. Various embodiments may omit, substitute, or add various procedures or components as appropriate. For example, features described with respect to certain embodiments may be combined in other embodiments.
Similarly, although operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous.

Claims (15)

  1. A data verification method for vertical federated learning, comprising the steps of:
    storing probability distribution characteristics of each feature of training data, obtained from a batch of valid training data;
    setting a normal occurrence probability threshold ε for the training data according to the probability distribution characteristics of each feature of the training data;
    calculating an occurrence probability P(x) that a data record x appears in the same distribution as the batch of valid training data; and
    comparing the occurrence probability P(x) with the normal occurrence probability threshold ε, and judging the record to be abnormal data if the occurrence probability P(x) is less than the normal occurrence probability threshold ε, otherwise judging the record to be normal data.
  2. The method according to claim 1, wherein
    the features of the multiple participants of the vertical federated learning form a record x = (x_1, x_2, ..., x_d), d being a natural number greater than or equal to 2, and the occurrence probability is calculated according to the formula: P(x) = P(x_1)·P(x_2|x_1)·P(x_3|x_1,x_2)·...·P(x_{d-1}|x_1,x_2,x_3,...,x_{d-2})·P(x_d|x_1,x_2,x_3,...,x_{d-1}).
  3. The method according to claim 2, wherein the participants of the vertical federated learning include a party A and a party B, party A holding features x_1, x_2, ..., x_{d1} and party B holding features x_{d1+1}, x_{d1+2}, ..., x_d, d1 being a natural number greater than or equal to 1.
  4. The method according to claim 2, wherein
    the occurrence probability is calculated according to the formula: P(x) = P(x_1)·P(x_2)·P(x_3)·...·P(x_{d-1})·P(x_d).
  5. The method according to claim 3, further comprising the step of:
    party A calculating P(x_A) = P(x_1)·P(x_2)·P(x_3)·...·P(x_{d1}), and party B calculating P(x_B) = P(x_{d1+1})·P(x_{d1+2})·P(x_{d1+3})·...·P(x_d).
  6. The method according to claim 5, further comprising the step of:
    a final calculator of the data occurrence probability calculating P(x) = P(x_A)·P(x_B) from P(x_A) and P(x_B).
  7. The method according to claim 6, wherein
    party B or a further third party acts as the final calculator of the data occurrence probability, and
    party A transmits P(x_A) to the final calculator of the data occurrence probability.
  8. The method according to claim 7, wherein
    party A transmits P(x_A) to the final calculator of the data occurrence probability in encrypted form.
  9. The method according to claim 1, wherein,
    if the record is judged to be abnormal data, the record is rejected for vertical federated learning, and
    if the record is judged to be normal data, the record is allowed to be used for vertical federated learning.
  10. The method according to claim 1, further comprising the step of:
    judging whether a batch of data comprising multiple records provided by party A is abnormal.
  11. The method according to claim 10, wherein,
    when the number of records in the batch is below a threshold,
    the occurrence probability of every record in the batch is calculated, and the probability P(P(x)<ε) that the batch is polluted is calculated,
    if the probability P(P(x)<ε) that the batch is polluted is less than a trust tolerance threshold α, the batch is accepted for vertical federated learning,
    if the probability P(P(x)<ε) that the batch is polluted is not less than the trust tolerance threshold α, the batch is rejected for vertical federated learning, or the polluted records are removed according to each record's occurrence probability and the unpolluted part of the batch is accepted,
    if the probability P(P(x)<ε) that the batch is polluted is greater than a threat tolerance threshold β, the batch is rejected for vertical federated learning, and all subsequent data provided by party A is further refused,
    wherein the trust tolerance threshold α is smaller than the threat tolerance threshold β.
  12. The method according to claim 10, wherein,
    when the number of records in the batch is not below a threshold,
    the occurrence probability of part of the records in the batch is calculated by sampling, and the probability P(P(x)<ε) that the batch is polluted is calculated,
    if the probability P(P(x)<ε) that the batch is polluted is less than a trust tolerance threshold α, the batch is accepted for vertical federated learning,
    if the probability P(P(x)<ε) that the batch is polluted is not less than the trust tolerance threshold α, the batch is rejected for vertical federated learning,
    if the probability P(P(x)<ε) that the batch is polluted is greater than a threat tolerance threshold β, the batch is rejected for vertical federated learning, and all subsequent data provided by party A is further refused,
    wherein the trust tolerance threshold α is smaller than the threat tolerance threshold β.
  13. The method according to claim 1, further comprising the step of:
    using the result of the vertical federated learning to perform federated model updating.
  14. A data verification apparatus for vertical federated learning, comprising:
    a memory having instructions stored thereon; and
    a processor configured to execute the instructions stored on the memory so as to perform the method according to any one of claims 1 to 13.
  15. A computer-readable storage medium comprising computer-executable instructions which, when executed by one or more processors, cause the one or more processors to perform the method according to any one of claims 1 to 13.
PCT/CN2022/115465 2021-09-16 2022-08-29 Data verification method for vertical federated learning WO2023040640A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111085459.2A CN115829048A (zh) 2021-09-16 Data verification method for vertical federated learning
CN202111085459.2 2021-09-16

Publications (1)

Publication Number Publication Date
WO2023040640A1 true WO2023040640A1 (zh) 2023-03-23

Family

ID=85514998

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/115465 WO2023040640A1 (zh) 2021-09-16 2022-08-29 Data verification method for vertical federated learning

Country Status (2)

Country Link
CN (1) CN115829048A (zh)
WO (1) WO2023040640A1 (zh)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112231570A * 2020-10-26 2021-01-15 腾讯科技(深圳)有限公司 Shilling attack detection method, apparatus, device, and storage medium for recommender systems
CN112420187A * 2020-10-15 2021-02-26 南京邮电大学 Medical disease analysis method based on transfer federated learning
US20210073677A1 * 2019-09-06 2021-03-11 Oracle International Corporation Privacy preserving collaborative learning with domain adaptation
CN113283185A * 2021-07-23 2021-08-20 平安科技(深圳)有限公司 Federated model training and customer profiling method, apparatus, device, and medium
CN113360896A * 2021-06-03 2021-09-07 哈尔滨工业大学 Free-rider attack detection method under a horizontal federated learning architecture

Also Published As

Publication number Publication date
CN115829048A (zh) 2023-03-21

Similar Documents

Publication Publication Date Title
TWI764037B Cross-blockchain interaction method and system, computer device, and storage medium
TWI689841B Data encryption and machine learning model training method, apparatus, and electronic device
CN110457912B Data processing method and apparatus, and electronic device
WO2021114822A1 Risk decision-making method, apparatus, system, and device based on private data protection
US20210234687A1 Multi-model training based on feature extraction
CN111008709A Federated learning and data risk assessment method, apparatus, and system
US20220245472A1 Data processing method and apparatus, and non-transitory computer readable storage medium
CN111967609B Model parameter verification method, device, and readable storage medium
US20210234848A1 Offline authorization of interactions and controlled tasks
CN112132198A Data processing method, apparatus, system, and server
TW201944306A Method and apparatus for determining high-risk users
CN113240524A Anomaly detection method and apparatus for accounts in a federated learning system, and electronic device
WO2022247620A1 Privacy-preserving method and apparatus for determining valid values of service data features
WO2020052168A1 Anti-fraud model generation and application method, apparatus, device, and storage medium
US20220092185A1 Trusted execution environment-based model training methods and apparatuses
EP4198783A1 Federated model training method and apparatus, electronic device, computer program product, and computer-readable storage medium
CN112308238A Analytical model training method and apparatus, electronic device, and storage medium
US20230093540A1 System and Method for Detecting Anomalous Activity Based on a Data Distribution
CN111383113A Suspicious customer prediction method, apparatus, device, and readable storage medium
WO2022237175A1 Graph data processing method, apparatus, device, storage medium, and program product
CN115545216A Service indicator prediction method, apparatus, device, and storage medium
CN114883005A Data classification and grading method, apparatus, electronic device, and storage medium
Li et al. Instance-wise or class-wise? a tale of neighbor shapley for concept-based explanation
CN107038377B Website authentication method and apparatus, and website credit-granting method and apparatus
WO2023040640A1 Data verification method for vertical federated learning

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22869012

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE