CN109992977A - A data anomaly point cleaning method based on secure multi-party computing technology - Google Patents

Publication number
CN109992977A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910156492.6A
Other languages
Chinese (zh)
Other versions
CN109992977B (en)
Inventor
刘雪峰
杨烨
裴庆祺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/2433 Single-class perspective, e.g. one-against-all classification; Novelty detection; Outlier detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/602 Providing cryptographic facilities or services
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes

Abstract

The invention belongs to the technical field of information security and discloses a method for cleaning abnormal data points based on secure multi-party computation. The method includes: unifying the data of two participants, A and B, into matrix format with the same dimensions, where the last dimension of each record is its AVF value; participant A and participant B encrypting their data matrices with Yao's encryption algorithm from the secure multi-party computation algorithm ABY; and server A and server B cleaning abnormal data points from the encrypted data sets uploaded by the participants. The invention combines secure multi-party computation with the AVF outlier detection algorithm and uses the existing secure multi-party computation tool ABY to detect outliers in high-dimensional data efficiently. While maintaining that efficiency, Yao's encryption algorithm gives each party's data a considerable level of privacy protection.

Description

A data anomaly point cleaning method based on secure multi-party computing technology

Technical field

The invention belongs to the technical field of information security, and in particular relates to a method for cleaning abnormal data points based on secure multi-party computation.

Background

The closest prior art at present is as follows. A joint data source refers to the situation, in machine learning training, where multiple participants hold data of the same type; fusing these data enlarges the training data set and improves the accuracy of the trained model. As machine learning has developed, the quality of a model has come to depend largely on the scale and quality of the data set, so learning from joint data sources has become a major trend. The advantages of joint training bring a new problem, however: protecting the privacy and security of data from multiple sources. In some scenarios the data held by each participant may be privacy-sensitive, such as commercial data or customers' private information (medical or financial records, for example). Such data has extremely high privacy requirements and cannot be shared freely.

As demand for data fusion has grown, algorithms for protecting data privacy have appeared one after another. One approach adds a trusted third party: the participants jointly certify a trusted third party and upload their plaintext data to it, and the third party performs tasks such as data cleaning and training. The trusted third party is usually an organization with public credibility or a cloud computing provider offering paid services. This approach protects data privacy while still achieving data fusion. It carries security risks, however. A trusted third party is often honest but curious, and an unforeseen data leak during collection and processing, or a malicious third party stealing the data, can have serious consequences.

As techniques from different fields have converged, cryptographic ideas have been applied to joint data source training: a mature encryption algorithm encrypts each participant's data, and the encrypted data are collected and sent to a trusted third party. The third party holds no sensitive plaintext, only ciphertext that appears meaningless. The encryption algorithm is usually homomorphic, meaning that an operation performed on the ciphertext is equivalent to the same operation performed on the plaintext. This makes training on ciphertext feasible and largely guarantees data privacy. Such algorithms have a practical problem as well, chiefly the trade-off between security and efficiency. Existing homomorphic encryption algorithms often need a great deal of time and computing resources to produce a result; in scenarios where privacy requirements are less strict, their efficiency is too low for wide adoption.

Prior art 1 proposes an algorithm that uses homomorphic encryption to clean outliers from the joint data of multiple sources: each party's data is encrypted homomorphically, and the AVF outlier detection algorithm then screens and cleans the outliers in the data set. Because of the efficiency limits of homomorphic encryption itself, encryption and decryption take considerable time and computing resources, so the algorithm is computationally inefficient and cannot meet large-scale data processing needs. Prior art 2 proposes a privacy-preserving data cleaning scheme based on the LOF outlier detection algorithm. Because LOF decides whether a point is an outlier from the density of the data distribution, it cannot distinguish outliers effectively by density differences when the data dimension is high, so this technique handles high-dimensional data sets inefficiently.

To sum up, the problems in the prior art are:

(1) Existing algorithms that use homomorphic encryption to clean outliers from the joint data of multiple sources are computationally inefficient and cannot meet large-scale data processing needs.

(2) Existing privacy-preserving data cleaning schemes based on the LOF outlier detection algorithm process high-dimensional data sets inefficiently.

Given these problems, a new technique is needed that balances computational efficiency and security: one that improves on the low efficiency and high energy consumption of traditional homomorphic encryption while still meeting the necessary data privacy and security requirements, and that, to suit practical deployments, also supports high-dimensional data.

Significance of solving the above technical problems:

Addressing these problems makes the algorithm better suited to real operating environments, improves its efficiency in practice, makes it easier to implement, and better protects the privacy of sensitive data.

Summary of the invention

In view of the problems in the prior art, the present invention provides a method for cleaning abnormal data points based on secure multi-party computation.

The invention is implemented as a method for cleaning abnormal data points based on secure multi-party computation, which includes:

Step 1: the data of the two participants, A and B, are unified into matrix format with the same dimensions, and the last dimension of each record is its AVF value;

Step 2: participant A and participant B encrypt their data matrices with Yao's encryption algorithm from the secure multi-party computation algorithm ABY;

Step 3: server A and server B clean abnormal data points from the encrypted data sets uploaded by the participants.

Further, in the first step, participant A and participant B unify the format of their own data sets as specified:

$$D_1 = \begin{bmatrix} a_{11} & \cdots & a_{1M} & avf_{a1} \\ \vdots & \ddots & \vdots & \vdots \\ a_{N1} & \cdots & a_{NM} & avf_{aN} \end{bmatrix}, \quad D_2 = \begin{bmatrix} b_{11} & \cdots & b_{1M} & avf_{b1} \\ \vdots & \ddots & \vdots & \vdots \\ b_{P1} & \cdots & b_{PM} & avf_{bP} \end{bmatrix}$$

where $D_1$ is participant A's $N \times (M+1)$ data set matrix, $a_{ij}$ is any element of participant A's data set, and $avf_{ai}$ is the AVF value of participant A's $i$-th record, with $i \in [1, N]$, $j \in [1, M]$, $M, N \in \mathbb{N}^+$; $D_2$ is participant B's $P \times (M+1)$ data set matrix, $b_{kj}$ is any element of participant B's data set, and $avf_{bk}$ is the AVF value of participant B's $k$-th record, with $k \in [1, P]$, $j \in [1, M]$, $M, P \in \mathbb{N}^+$. The two participants' data have the same dimension.
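The patent does not spell out how the AVF column is computed. A minimal sketch, assuming the usual Attribute Value Frequency definition (the score of a record is the mean frequency of its attribute values within the data set, so rare combinations score low), could build the $N \times (M+1)$ matrix as follows. The function names and the toy data are hypothetical and not taken from the patent.

```python
from collections import Counter

def avf_scores(rows):
    """Return one AVF score per row: the mean frequency of its attribute values."""
    if not rows:
        return []
    m = len(rows[0])
    # One frequency table per attribute (column).
    freq = [Counter(row[j] for row in rows) for j in range(m)]
    return [sum(freq[j][row[j]] for j in range(m)) / m for row in rows]

def with_avf_column(rows):
    """Append each row's AVF score as the last column, giving an N x (M+1) matrix."""
    return [list(row) + [score] for row, score in zip(rows, avf_scores(rows))]

# Hypothetical categorical data for participant A (N = 4, M = 2).
data_a = [
    ["red", "small"],
    ["red", "small"],
    ["red", "large"],
    ["blue", "huge"],
]
d1 = with_avf_column(data_a)
```

Under this definition the last row, whose values each occur only once, receives the lowest score and is the most likely outlier.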

Further, in the second step, participant A and participant B encrypt their own data sets as specified, which includes:

1) Encrypting participant A's data set $D_1$ with Yao's encryption algorithm from the secure multi-party computation tool ABY:

where the encrypted data set consists of a part handed to server A and a part handed to server B, Enc denotes Yao's encryption algorithm, and $D_1$ denotes participant A's data set;

Each element is encrypted as follows:

where each encrypted element consists of a part handed to server A and a part handed to server B, and $a_{ij}$ is any element of participant A's data; likewise, the encrypted AVF value of participant A's $i$-th record consists of a part handed to server A and a part handed to server B, and $avf_{ai}$ is the AVF value of participant A's $i$-th record;

2) Participant A's encrypted data set is expressed as follows:

where $X_{10}$ is participant A's encrypted data set held by server A and $X_{11}$ is participant A's encrypted data set held by server B, with $i \in [1, N]$, $j \in [1, M]$, $M, N \in \mathbb{N}^+$;

3) Encrypting participant B's data set $D_2$ with Yao's encryption algorithm from the secure multi-party computation tool ABY:

where the encrypted data set consists of a part handed to server A and a part handed to server B, Enc denotes Yao's encryption algorithm, and $D_2$ denotes participant B's data set;

Each element is encrypted as follows:

where each encrypted element consists of a part handed to server A and a part handed to server B, and $b_{kj}$ is any element of participant B's data; likewise, the encrypted AVF value of participant B's $k$-th record consists of a part handed to server A and a part handed to server B, and $avf_{bk}$ is the AVF value of participant B's $k$-th record;

4) Participant B's encrypted data set is expressed as follows:

where $X_{20}$ is participant B's encrypted data set held by server A and $X_{21}$ is participant B's encrypted data set held by server B, with $k \in [1, P]$, $j \in [1, M]$, $M, P \in \mathbb{N}^+$;

5) Participant A and participant B each upload their encrypted data to the corresponding servers.
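The idea that each element is split into one part for server A and one part for server B can be illustrated with a small sketch. This is not Yao's encryption algorithm as implemented in ABY: plain XOR secret sharing is swapped in as a stand-in, purely to show that neither server's part reveals the value on its own. The fixed bit width, the integer encoding of the AVF column, and all function names are assumptions made for the illustration.

```python
import secrets

BITS = 32  # hypothetical fixed width of each encoded matrix element

def split(value):
    """Split a non-negative integer below 2**BITS into two XOR shares."""
    share_a = secrets.randbits(BITS)   # uniformly random, independent of value
    share_b = share_a ^ value          # also uniformly random when seen alone
    return share_a, share_b

def split_matrix(matrix):
    """Share every element; returns (part for server A, part for server B)."""
    pairs = [[split(v) for v in row] for row in matrix]
    part_a = [[p[0] for p in row] for row in pairs]
    part_b = [[p[1] for p in row] for row in pairs]
    return part_a, part_b

def reconstruct(part_a, part_b):
    """Recombine both parts; only possible when the two servers cooperate."""
    return [[a ^ b for a, b in zip(ra, rb)] for ra, rb in zip(part_a, part_b)]

# Hypothetical integer-encoded data set; the last column is a scaled AVF value.
d1 = [[3, 7, 25], [3, 9, 20], [8, 1, 10]]
x10, x11 = split_matrix(d1)   # x10 goes to server A, x11 to server B
```

The shapes of `x10` and `x11` match the shape of `d1`, which mirrors the patent's description of $X_{10}$ and $X_{11}$ as two encrypted copies of the same $N \times (M+1)$ matrix.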

Further, in the third step, server A and server B clean abnormal data points from the encrypted data sets uploaded by the participants, which includes:

1) Server A extracts the last column, $A_{10}$, from its encrypted data set of participant A, $X_{10}$.

Server A sorts $A_{10}$ with the sorting algorithm of Yao's encryption algorithm in ABY:

$A'_{10} = \mathrm{Sort}(A_{10})$;

where $A_{10}$ is the last column of participant A's encrypted data set held by server A, $A'_{10}$ is $A_{10}$ sorted in descending order, and Sort() is the sorting algorithm of Yao's encryption algorithm;

$X_{10}$ is sorted along with $A_{10}$, that is, the rows of $X_{10}$ are arranged in descending order of $A_{10}$. After sorting:

where $X'_{10}$ is the data set submitted by participant A to server A, sorted in descending order of its last column $A_{10}$, with $i \in [1, N]$, $j \in [1, M]$, $M, N \in \mathbb{N}^+$;

A fixed value Thre is specified as the threshold for an AVF value to be within the normal range, and the elements of $A'_{10}$ are compared with Thre in order:

$\mathrm{Res}_i = \mathrm{Comp}(A'_{10i}, \mathrm{Thre})$;

where $A'_{10i}$ is an element of $A'_{10}$, $i \in [1, N]$, $N \in \mathbb{N}^+$, Comp() is the comparison algorithm of Yao's encryption algorithm, and $\mathrm{Res}_i$ is the result of comparing $A'_{10i}$ with Thre: $\mathrm{Res}_i = 1$ means $A'_{10i} \ge \mathrm{Thre}$, and $\mathrm{Res}_i = 0$ means $A'_{10i} < \mathrm{Thre}$. The elements of $A'_{10}$ are compared with Thre in order until $\mathrm{Res}_i = 0$; the comparison then stops and the first $i$ rows of $X'_{10}$ are retained:

where $I = i$ is the number of leading rows retained after sorting, $j \in [1, M]$, $M \in \mathbb{N}^+$, and $X''_{10}$ is participant A's data set held by server A once cleaning is complete;

2) Server A extracts the last column, $A_{20}$, from its encrypted data set of participant B, $X_{20}$.

Server A sorts $A_{20}$ with the sorting algorithm of Yao's encryption algorithm in ABY:

$A'_{20} = \mathrm{Sort}(A_{20})$;

where $A_{20}$ is the last column of participant B's encrypted data set held by server A, $A'_{20}$ is $A_{20}$ sorted in descending order, and Sort() is the sorting algorithm of Yao's encryption algorithm;

$X_{20}$ is sorted along with $A_{20}$, that is, the rows of $X_{20}$ are arranged in descending order of $A_{20}$. After sorting:

where $X'_{20}$ is the data set submitted by participant B to server A, sorted in descending order of its last column $A_{20}$, with $k \in [1, P]$, $j \in [1, M]$, $M, P \in \mathbb{N}^+$;

A fixed value Thre is specified as the threshold for an AVF value to be within the normal range, and the elements of $A'_{20}$ are compared with Thre in order:

$\mathrm{Res}_k = \mathrm{Comp}(A'_{20k}, \mathrm{Thre})$;

where $A'_{20k}$ is an element of $A'_{20}$, $k \in [1, P]$, $P \in \mathbb{N}^+$, Comp() is the comparison algorithm of Yao's encryption algorithm, and $\mathrm{Res}_k$ is the result of comparing $A'_{20k}$ with Thre: $\mathrm{Res}_k = 1$ means $A'_{20k} \ge \mathrm{Thre}$, and $\mathrm{Res}_k = 0$ means $A'_{20k} < \mathrm{Thre}$. The elements of $A'_{20}$ are compared with Thre in order until $\mathrm{Res}_k = 0$; the comparison then stops and the first $k$ rows of $X'_{20}$ are retained:

where $K = k$ is the number of leading rows retained after sorting, $j \in [1, M]$, $M \in \mathbb{N}^+$, and $X''_{20}$ is participant B's data set held by server A once cleaning is complete;

3) Server B extracts the last column, $A_{11}$, from its encrypted data set of participant A, $X_{11}$.

Server B sorts $A_{11}$ with the sorting algorithm of Yao's encryption algorithm in ABY:

$A'_{11} = \mathrm{Sort}(A_{11})$;

where $A_{11}$ is the last column of participant A's encrypted data set held by server B, $A'_{11}$ is $A_{11}$ sorted in descending order, and Sort() is the sorting algorithm of Yao's encryption algorithm;

$X_{11}$ is sorted along with $A_{11}$, that is, the rows of $X_{11}$ are arranged in descending order of $A_{11}$. After sorting:

where $X'_{11}$ is the data set submitted by participant A to server B, sorted in descending order of its last column $A_{11}$, with $i \in [1, N]$, $j \in [1, M]$, $M, N \in \mathbb{N}^+$;

A fixed value Thre is specified as the threshold for an AVF value to be within the normal range, and the elements of $A'_{11}$ are compared with Thre in order:

$\mathrm{Res}_i = \mathrm{Comp}(A'_{11i}, \mathrm{Thre})$;

where $A'_{11i}$ is an element of $A'_{11}$, $i \in [1, N]$, $N \in \mathbb{N}^+$, Comp() is the comparison algorithm of Yao's encryption algorithm, and $\mathrm{Res}_i$ is the result of comparing $A'_{11i}$ with Thre: $\mathrm{Res}_i = 1$ means $A'_{11i} \ge \mathrm{Thre}$, and $\mathrm{Res}_i = 0$ means $A'_{11i} < \mathrm{Thre}$. The elements of $A'_{11}$ are compared with Thre in order until $\mathrm{Res}_i = 0$; the comparison then stops and the first $i$ rows of $X'_{11}$ are retained:

where $I = i$ is the number of leading rows retained after sorting, $j \in [1, M]$, $M \in \mathbb{N}^+$, and $X''_{11}$ is participant A's data set held by server B once cleaning is complete;

4) Server B extracts the last column, $A_{21}$, from its encrypted data set of participant B, $X_{21}$.

Server B sorts $A_{21}$ with the sorting algorithm of Yao's encryption algorithm in ABY:

$A'_{21} = \mathrm{Sort}(A_{21})$;

where $A_{21}$ is the last column of participant B's encrypted data set held by server B, $A'_{21}$ is $A_{21}$ sorted in descending order, and Sort() is the sorting algorithm of Yao's encryption algorithm.

$X_{21}$ is sorted along with $A_{21}$, that is, the rows of $X_{21}$ are arranged in descending order of $A_{21}$. After sorting:

where $X'_{21}$ is the data set submitted by participant B to server B, sorted in descending order of its last column $A_{21}$, with $k \in [1, P]$, $j \in [1, M]$, $M, P \in \mathbb{N}^+$.

A fixed value Thre is specified as the threshold for an AVF value to be within the normal range, and the elements of $A'_{21}$ are compared with Thre in order:

$\mathrm{Res}_k = \mathrm{Comp}(A'_{21k}, \mathrm{Thre})$;

where $A'_{21k}$ is an element of $A'_{21}$, $k \in [1, P]$, $P \in \mathbb{N}^+$, Comp() is the comparison algorithm of Yao's encryption algorithm, and $\mathrm{Res}_k$ is the result of comparing $A'_{21k}$ with Thre: $\mathrm{Res}_k = 1$ means $A'_{21k} \ge \mathrm{Thre}$, and $\mathrm{Res}_k = 0$ means $A'_{21k} < \mathrm{Thre}$. The elements of $A'_{21}$ are compared with Thre in order until $\mathrm{Res}_k = 0$; the comparison then stops and the first $k$ rows of $X'_{21}$ are retained:

where $K = k$ is the number of leading rows retained after sorting, $j \in [1, M]$, $M \in \mathbb{N}^+$, and $X''_{21}$ is participant B's data set held by server B once cleaning is complete;

5) The resulting $X''_{10}$, $X''_{11}$, $X''_{20}$, and $X''_{21}$ are the data sets after cleaning is complete.
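The four cleaning passes above share one control flow: sort by the AVF column in descending order, compare with Thre from the top, and stop at the first element below the threshold. The following plaintext sketch shows only that logic. In the patent, Sort() and Comp() run on encrypted data inside ABY, which this sketch does not attempt to reproduce. The function name `clean` and the sample rows are hypothetical, and the sketch keeps the rows that precede the first failing one, which is one reading of "the first $i$ rows".

```python
def clean(matrix, thre):
    """Keep the leading rows, in descending AVF order, whose AVF is >= thre."""
    # Sort whole rows by their last column, mirroring "X is sorted along with A".
    ordered = sorted(matrix, key=lambda row: row[-1], reverse=True)
    kept = []
    for row in ordered:
        res = 1 if row[-1] >= thre else 0   # plays the role of Comp(A'_i, Thre)
        if res == 0:
            break                           # every later row is smaller still
        kept.append(row)
    return kept

# Hypothetical data set whose last column holds the AVF value.
x = [
    ["red", "small", 2.5],
    ["blue", "huge", 1.0],
    ["red", "large", 2.0],
    ["red", "small", 2.5],
]
cleaned = clean(x, thre=2.0)
```

Stopping early is valid only because the rows are already in descending order, which is why the method sorts before it compares.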

Another object of the present invention is to provide a machine learning system that applies the method for cleaning abnormal data points based on secure multi-party computation.

In summary, the advantages and positive effects of the present invention are as follows. The invention combines secure multi-party computation with the AVF outlier detection algorithm and uses the existing secure multi-party computation tool ABY to detect outliers in high-dimensional data efficiently. While maintaining that efficiency, Yao's encryption algorithm gives each party's data a considerable level of privacy protection.

Table 1: Technical performance comparison

Description of drawings

FIG. 1 is a flowchart of the method for cleaning abnormal data points based on secure multi-party computation provided by an embodiment of the present invention.

FIG. 2 is a schematic diagram of the scenario of an embodiment of the present invention.

Detailed description

To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to embodiments. The specific embodiments described here only explain the invention and do not limit it.

Existing algorithms that use homomorphic encryption to clean outliers from the joint data of multiple sources are computationally inefficient and cannot meet large-scale data processing needs, and existing privacy-preserving data cleaning schemes based on the LOF outlier detection algorithm process high-dimensional data sets inefficiently. The present invention addresses these problems by providing a secure outlier cleaning algorithm for joint data sources: based on secure multi-party computation, it cleans abnormal data points in joint machine learning over multiple data sources while preserving privacy.

The application principle of the present invention is described in detail below with reference to the accompanying drawings.

As shown in FIG. 1, the method for cleaning abnormal data points based on secure multi-party computation provided by an embodiment of the present invention includes the following steps:

S101: Participant A and participant B unify the format of their own data sets as specified: the data of the two participants are unified into matrix format with the same dimensions, and the last dimension of each record is its Attribute Value Frequency (AVF) value;

S102: Participant A and participant B encrypt their own data sets as specified: each encrypts its data matrix with Yao's encryption algorithm from the secure multi-party computation algorithm ABY;

S103: Server A and server B clean abnormal data points from the encrypted data sets uploaded by the participants.
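Step S103 depends on sorting inside a secure computation, where the sequence of operations generally cannot depend on the hidden values. The patent does not say which sorting algorithm its Sort() uses. One common way to meet that constraint, shown here as an assumption and on plaintext values only, is a sorting network built from fixed compare-and-swap steps. The sketch uses odd-even transposition sort, and all names in it are hypothetical.

```python
def compare_and_swap(rows, i, j):
    """Place the row with the larger AVF (last column) at index i.

    In a circuit this would be a comparison plus multiplexers, not a branch;
    the Python `if` only stands in for that behaviour.
    """
    if rows[i][-1] < rows[j][-1]:
        rows[i], rows[j] = rows[j], rows[i]

def oblivious_sort_desc(matrix):
    """Sort rows by last column, descending, with a data-independent schedule."""
    rows = [list(row) for row in matrix]
    n = len(rows)
    # n rounds of odd-even transposition sort are enough for any input,
    # and the pairs compared in each round do not depend on the data.
    for rnd in range(n):
        for i in range(rnd % 2, n - 1, 2):
            compare_and_swap(rows, i, i + 1)
    return rows

# Hypothetical rows: (record id, AVF value).
sample = [[1, 1.0], [2, 2.5], [3, 2.0], [4, 0.5], [5, 3.0]]
ordered = oblivious_sort_desc(sample)
```

This schedule uses on the order of $n^2$ comparisons; networks with fewer comparisons exist, and the choice here favours brevity over efficiency.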

The application principle of the present invention is further described below with reference to specific embodiments.

The method for cleaning abnormal data points based on secure multi-party computation provided by an embodiment of the present invention includes the following steps:

Step 1: Participant A and participant B unify the format of their own data sets as specified:

$$D_1 = \begin{bmatrix} a_{11} & \cdots & a_{1M} & avf_{a1} \\ \vdots & \ddots & \vdots & \vdots \\ a_{N1} & \cdots & a_{NM} & avf_{aN} \end{bmatrix}, \quad D_2 = \begin{bmatrix} b_{11} & \cdots & b_{1M} & avf_{b1} \\ \vdots & \ddots & \vdots & \vdots \\ b_{P1} & \cdots & b_{PM} & avf_{bP} \end{bmatrix}$$

where $D_1$ is participant A's $N \times (M+1)$ data set matrix, $a_{ij}$ is any element of participant A's data set, and $avf_{ai}$ is the Attribute Value Frequency (AVF) value of participant A's $i$-th record, with $i \in [1, N]$, $j \in [1, M]$, $M, N \in \mathbb{N}^+$; $D_2$ is participant B's $P \times (M+1)$ data set matrix, $b_{kj}$ is any element of participant B's data set, and $avf_{bk}$ is the AVF value of participant B's $k$-th record, with $k \in [1, P]$, $j \in [1, M]$, $M, P \in \mathbb{N}^+$. The two participants' data have the same dimension, that is, the same value of $M$.

Step 2: Participant A and Participant B encrypt their own data sets as specified:

2a) The data set $D_1$ of participant A is encrypted with Yao's encryption algorithm in the secure multi-party computation algorithm ABY, that is, $\mathrm{Enc}(D_1)$ is computed.

Here Enc denotes Yao's encryption algorithm and $D_1$ denotes the data set of participant A. The encryption yields two parts of the encrypted data set: one part is handed to server A and the other part is handed to server B.

Each element is encrypted in the same way:

Encrypting any element $a_{ij}$ of participant A's data yields one part that is handed to server A and one part that is handed to server B. Likewise, encrypting $avf_{ai}$, the AVF value of the $i$-th record of participant A, yields one part that is handed to server A and one part that is handed to server B.

2b) The encrypted data set of participant A is represented by $X_{10}$ and $X_{11}$:

where $X_{10}$ denotes the encrypted data set of participant A held by server A, $X_{11}$ denotes the encrypted data set of participant A held by server B, $i\in[1,N]$, $j\in[1,M]$, and $M,N\in\mathbb{N}^{+}$.

2c) The data set $D_2$ of participant B is encrypted with Yao's encryption algorithm in the secure multi-party computation algorithm ABY, that is, $\mathrm{Enc}(D_2)$ is computed.

Here Enc denotes Yao's encryption algorithm and $D_2$ denotes the data set of participant B. The encryption yields two parts of the encrypted data set: one part is handed to server A and the other part is handed to server B.

Each element is encrypted in the same way:

Encrypting any element $b_{kj}$ of participant B's data yields one part that is handed to server A and one part that is handed to server B. Likewise, encrypting $avf_{bk}$, the AVF value of the $k$-th record of participant B, yields one part that is handed to server A and one part that is handed to server B.

2d) The encrypted data set of participant B is represented by $X_{20}$ and $X_{21}$:

where $X_{20}$ denotes the encrypted data set of participant B held by server A, $X_{21}$ denotes the encrypted data set of participant B held by server B, $k\in[1,P]$, $j\in[1,M]$, and $M,P\in\mathbb{N}^{+}$.

2e) Participant A and Participant B each upload their encrypted data to the corresponding servers.
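The idea in Step 2 of splitting each data set into one part for server A and one part for server B can be sketched as follows. This is not Yao's encryption algorithm as used in ABY: plain additive secret sharing modulo $2^{32}$ is swapped in because it shows the two-part split in a few lines. The ring size, the integer encoding of the data, and the helper names `share_value`, `share_matrix`, and `reconstruct` are assumptions made for illustration.

```python
import secrets

MODULUS = 2 ** 32  # assumed ring size for this illustration

def share_value(value):
    """Split a non-negative integer into two additive shares modulo MODULUS."""
    share_a = secrets.randbelow(MODULUS)      # uniformly random part for server A
    share_b = (value - share_a) % MODULUS     # matching part for server B
    return share_a, share_b

def share_matrix(matrix):
    """Share every element; returns (part for server A, part for server B)."""
    part_a, part_b = [], []
    for row in matrix:
        pairs = [share_value(v) for v in row]
        part_a.append([p[0] for p in pairs])
        part_b.append([p[1] for p in pairs])
    return part_a, part_b

def reconstruct(part_a, part_b):
    """Recombine the two parts; only a holder of both parts can do this."""
    return [[(x + y) % MODULUS for x, y in zip(ra, rb)]
            for ra, rb in zip(part_a, part_b)]

# Hypothetical integer-encoded rows; the last column is a scaled AVF value.
d1 = [[3, 7, 25], [1, 4, 10]]
x10, x11 = share_matrix(d1)   # x10 goes to server A, x11 to server B
```

Each part on its own is uniformly random, so neither server learns the data from its part alone; `reconstruct(x10, x11)` returns the original matrix only when both parts are combined.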

Step 3: Server A and server B clean the abnormal data points from the encrypted data sets uploaded by the participants:

3a) Server A extracts the last dimension of the encrypted data set of participant A that it holds, that is, the encrypted AVF column of $X_{10}$, denoted $A_{10}$.

Server A sorts $A_{10}$ using the sorting algorithm of Yao's encryption algorithm in the secure encryption algorithm ABY:

$A'_{10}=\mathrm{Sort}(A_{10})$;

where $A_{10}$ denotes the last dimension of the encrypted data set of participant A held by server A, $A'_{10}$ denotes $A_{10}$ after sorting in descending order, and Sort() denotes the sorting algorithm in Yao's encryption algorithm.

$X_{10}$ is sorted at the same time with $A_{10}$ as the key, that is, the rows of $X_{10}$ are arranged in descending order of $A_{10}$. The sorted result is $X'_{10}$:

where $X'_{10}$ is the data set submitted by participant A to server A after its rows have been sorted in descending order of the last dimension $A_{10}$ of $X_{10}$, $i\in[1,N]$, $j\in[1,M]$, and $M,N\in\mathbb{N}^{+}$.
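The source does not describe the internals of Sort(). In garbled-circuit settings, sorting is commonly expressed as a fixed network of compare-and-swap steps whose schedule does not depend on the data, so the following plaintext sketch uses an odd-even transposition network as one assumed possibility. It sorts whole rows in descending order of their last column, which mirrors sorting $X_{10}$ with $A_{10}$ as the key. The function names are hypothetical and the values are not encrypted.

```python
def compare_and_swap(rows, i, j):
    """Put the row with the larger last-column value first (descending order).

    In a circuit this branch would be a multiplexer; the plaintext `if`
    is used here only for readability.
    """
    if rows[i][-1] < rows[j][-1]:
        rows[i], rows[j] = rows[j], rows[i]

def oblivious_sort_desc(rows):
    """Odd-even transposition sort: compared positions depend only on len(rows)."""
    rows = [list(r) for r in rows]   # work on a copy
    n = len(rows)
    for round_no in range(n):        # n rounds are enough to sort n rows
        start = round_no % 2         # alternate between even and odd pairs
        for i in range(start, n - 1, 2):
            compare_and_swap(rows, i, i + 1)
    return rows

# Hypothetical plaintext stand-in for X10: two attributes plus the AVF column.
x10_plain = [[1, 2, 2.0], [3, 4, 1.0], [5, 6, 2.5], [7, 8, 0.5]]
x10_sorted = oblivious_sort_desc(x10_plain)
```

Because each swap moves the whole row, the attribute columns stay attached to their AVF value throughout the sort.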

A fixed value Thre is specified as the threshold for AVF values in the normal range: AVF values at or above Thre are treated as normal. The elements of $A'_{10}$ are compared with Thre in order:

$Res_i=\mathrm{Comp}(A'_{10i},Thre)$;

where $A'_{10i}$ denotes the $i$-th element of $A'_{10}$, $i\in[1,N]$, $N\in\mathbb{N}^{+}$, Comp() denotes the comparison algorithm in Yao's encryption algorithm, and $Res_i$ denotes the result of comparing $A'_{10i}$ with Thre: $Res_i=1$ means $A'_{10i}\ge Thre$, and $Res_i=0$ means $A'_{10i}<Thre$. The elements of $A'_{10}$ are compared with Thre in order until $Res_i=0$; the comparison then stops and the first $i$ rows of $X'_{10}$ are retained as $X''_{10}$:

where $I=i$ is the number of rows retained after sorting, $j\in[1,M]$, $M\in\mathbb{N}^{+}$, and $X''_{10}$ is the data set of participant A held by server A after the data cleaning is complete.
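The threshold scan can be sketched on plaintext values as follows. The function `comp` is a plaintext stand-in for Comp(), not the secure comparison itself, and `keep_normal_rows` is a hypothetical name. The sketch reads "the first $i$ rows" as the rows that passed the comparison before the first $Res=0$; that reading of the boundary row is an assumption.

```python
def comp(value, thre):
    """Plaintext stand-in for Comp(): 1 if value >= thre, else 0."""
    return 1 if value >= thre else 0

def keep_normal_rows(sorted_rows, thre):
    """Scan rows sorted by descending AVF and keep those before the first Res = 0."""
    kept = 0
    for row in sorted_rows:
        if comp(row[-1], thre) == 0:
            break                    # every later row has an even smaller AVF
        kept += 1
    return sorted_rows[:kept]

# Hypothetical rows already sorted in descending order of the AVF column.
x_sorted = [[5, 6, 2.5], [1, 2, 2.0], [3, 4, 1.0], [7, 8, 0.5]]
x_clean = keep_normal_rows(x_sorted, thre=1.5)
```

Sorting first is what allows the scan to stop at the first failed comparison: once one AVF value falls below Thre, all remaining values do as well.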

3b) Server A extracts the last dimension of the encrypted data set of participant B that it holds, that is, the encrypted AVF column of $X_{20}$, denoted $A_{20}$.

Server A sorts $A_{20}$ using the sorting algorithm of Yao's encryption algorithm in the secure encryption algorithm ABY:

$A'_{20}=\mathrm{Sort}(A_{20})$;

where $A_{20}$ denotes the last dimension of the encrypted data set of participant B held by server A, $A'_{20}$ denotes $A_{20}$ after sorting in descending order, and Sort() denotes the sorting algorithm in Yao's encryption algorithm.

$X_{20}$ is sorted at the same time with $A_{20}$ as the key, that is, the rows of $X_{20}$ are arranged in descending order of $A_{20}$. The sorted result is $X'_{20}$:

where $X'_{20}$ is the data set submitted by participant B to server A after its rows have been sorted in descending order of the last dimension $A_{20}$ of $X_{20}$, $k\in[1,P]$, $j\in[1,M]$, and $M,P\in\mathbb{N}^{+}$.

The same fixed value Thre as above is used as the threshold for AVF values in the normal range. The elements of $A'_{20}$ are compared with Thre in order:

$Res_k=\mathrm{Comp}(A'_{20k},Thre)$;

where $A'_{20k}$ denotes the $k$-th element of $A'_{20}$, $k\in[1,P]$, $P\in\mathbb{N}^{+}$, Comp() denotes the comparison algorithm in Yao's encryption algorithm, and $Res_k$ denotes the result of comparing $A'_{20k}$ with Thre: $Res_k=1$ means $A'_{20k}\ge Thre$, and $Res_k=0$ means $A'_{20k}<Thre$. The elements of $A'_{20}$ are compared with Thre in order until $Res_k=0$; the comparison then stops and the first $k$ rows of $X'_{20}$ are retained as $X''_{20}$:

where $K=k$ is the number of rows retained after sorting, $j\in[1,M]$, $M\in\mathbb{N}^{+}$, and $X''_{20}$ is the data set of participant B held by server A after the data cleaning is complete.

3c) Server B extracts the last dimension of the encrypted data set of participant A that it holds, that is, the encrypted AVF column of $X_{11}$, denoted $A_{11}$.

Server B sorts $A_{11}$ using the sorting algorithm of Yao's encryption algorithm in the secure encryption algorithm ABY:

$A'_{11}=\mathrm{Sort}(A_{11})$;

where $A_{11}$ denotes the last dimension of the encrypted data set of participant A held by server B, $A'_{11}$ denotes $A_{11}$ after sorting in descending order, and Sort() denotes the sorting algorithm in Yao's encryption algorithm.

$X_{11}$ is sorted at the same time with $A_{11}$ as the key, that is, the rows of $X_{11}$ are arranged in descending order of $A_{11}$. The sorted result is $X'_{11}$:

where $X'_{11}$ is the data set submitted by participant A to server B after its rows have been sorted in descending order of the last dimension $A_{11}$ of $X_{11}$, $i\in[1,N]$, $j\in[1,M]$, and $M,N\in\mathbb{N}^{+}$.

The same fixed value Thre as above is used as the threshold for AVF values in the normal range. The elements of $A'_{11}$ are compared with Thre in order:

$Res_i=\mathrm{Comp}(A'_{11i},Thre)$;

where $A'_{11i}$ denotes the $i$-th element of $A'_{11}$, $i\in[1,N]$, $N\in\mathbb{N}^{+}$, Comp() denotes the comparison algorithm in Yao's encryption algorithm, and $Res_i$ denotes the result of comparing $A'_{11i}$ with Thre: $Res_i=1$ means $A'_{11i}\ge Thre$, and $Res_i=0$ means $A'_{11i}<Thre$. The elements of $A'_{11}$ are compared with Thre in order until $Res_i=0$; the comparison then stops and the first $i$ rows of $X'_{11}$ are retained as $X''_{11}$:

where $I=i$ is the number of rows retained after sorting, $j\in[1,M]$, $M\in\mathbb{N}^{+}$, and $X''_{11}$ is the data set of participant A held by server B after the data cleaning is complete.

3d) Server B extracts the last dimension of the encrypted data set of participant B that it holds, that is, the encrypted AVF column of $X_{21}$, denoted $A_{21}$.

Server B sorts $A_{21}$ using the sorting algorithm of Yao's encryption algorithm in the secure encryption algorithm ABY:

$A'_{21}=\mathrm{Sort}(A_{21})$;

where $A_{21}$ denotes the last dimension of the encrypted data set of participant B held by server B, $A'_{21}$ denotes $A_{21}$ after sorting in descending order, and Sort() denotes the sorting algorithm in Yao's encryption algorithm.

$X_{21}$ is sorted at the same time with $A_{21}$ as the key, that is, the rows of $X_{21}$ are arranged in descending order of $A_{21}$. The sorted result is $X'_{21}$:

where $X'_{21}$ is the data set submitted by participant B to server B after its rows have been sorted in descending order of the last dimension $A_{21}$ of $X_{21}$, $k\in[1,P]$, $j\in[1,M]$, and $M,P\in\mathbb{N}^{+}$.

The same fixed value Thre as above is used as the threshold for AVF values in the normal range. The elements of $A'_{21}$ are compared with Thre in order:

$Res_k=\mathrm{Comp}(A'_{21k},Thre)$;

where $A'_{21k}$ denotes the $k$-th element of $A'_{21}$, $k\in[1,P]$, $P\in\mathbb{N}^{+}$, Comp() denotes the comparison algorithm in Yao's encryption algorithm, and $Res_k$ denotes the result of comparing $A'_{21k}$ with Thre: $Res_k=1$ means $A'_{21k}\ge Thre$, and $Res_k=0$ means $A'_{21k}<Thre$. The elements of $A'_{21}$ are compared with Thre in order until $Res_k=0$; the comparison then stops and the first $k$ rows of $X'_{21}$ are retained as $X''_{21}$:

where $K=k$ is the number of rows retained after sorting, $j\in[1,M]$, $M\in\mathbb{N}^{+}$, and $X''_{21}$ is the data set of participant B held by server B after the data cleaning is complete.

3e) The resulting $X''_{10}$, $X''_{11}$, $X''_{20}$, and $X''_{21}$ are the data sets after the data cleaning is complete.
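For the cleaned parts to remain usable together, the two parts of one participant's data must keep the same rows in the same order. The following end-to-end simulation illustrates that property under stated assumptions: additive sharing modulo $2^{16}$ again stands in for Yao's encryption algorithm, and the row order and cut-off are derived from the recombined AVF column as a plaintext stand-in for the secure Sort() and Comp() computations. The helper names, the ring size, and the sample data are hypothetical.

```python
import random

MODULUS = 2 ** 16  # assumed ring size for the illustration

def split(matrix, rng):
    """Additively share a matrix; a real system needs a cryptographic RNG."""
    part_a = [[rng.randrange(MODULUS) for _ in row] for row in matrix]
    part_b = [[(v - a) % MODULUS for v, a in zip(row, ra)]
              for row, ra in zip(matrix, part_a)]
    return part_a, part_b

def clean_parts(part_a, part_b, thre):
    """Simulate the joint computation: derive the row order and cut-off from
    the recombined AVF column, then apply them identically to both parts."""
    avf = [(ra[-1] + rb[-1]) % MODULUS for ra, rb in zip(part_a, part_b)]
    order = sorted(range(len(avf)), key=lambda i: -avf[i])   # descending AVF
    keep = [i for i in order if avf[i] >= thre]
    return [part_a[i] for i in keep], [part_b[i] for i in keep]

def combine(part_a, part_b):
    """Recombine two parts into plaintext rows."""
    return [[(x + y) % MODULUS for x, y in zip(ra, rb)]
            for ra, rb in zip(part_a, part_b)]

rng = random.Random(0)                      # fixed seed so the run is repeatable
d1 = [[1, 2, 20], [3, 4, 10], [5, 6, 25], [7, 8, 5]]   # last column: scaled AVF
x10, x11 = split(d1, rng)
x10_clean, x11_clean = clean_parts(x10, x11, thre=15)
```

Recombining `x10_clean` and `x11_clean` gives the two records whose AVF value is at least 15, in descending order of AVF, which is the cleaned form of the hypothetical `d1`.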

The above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (5)

1. A data anomaly point cleaning method based on secure multi-party computing technology, characterized by comprising the following steps:
step one, unifying the data of two participants A and B into a matrix format with the same dimension, the last dimension being the AVF value of each record;
step two, participant A and participant B encrypting their data matrices using Yao's encryption algorithm in the secure multi-party computation algorithm ABY;
step three, server A and server B performing data anomaly point cleaning on the encrypted data sets uploaded by the participants.
2. The data anomaly point cleaning method based on secure multi-party computing technology according to claim 1, wherein in step one, participant A and participant B unify the format of their own data sets as specified:
wherein $D_1$ denotes the $N\times(M+1)$ data set matrix of participant A, $a_{ij}$ denotes any element of participant A's data set, $avf_{ai}$ denotes the AVF value of the $i$-th record of participant A, $i\in[1,N]$, $j\in[1,M]$, $M,N\in\mathbb{N}^{+}$; $D_2$ denotes the $P\times(M+1)$ data set matrix of participant B, $b_{kj}$ denotes any element of participant B's data set, $avf_{bk}$ denotes the AVF value of the $k$-th record of participant B, $k\in[1,P]$, $j\in[1,M]$, $M,P\in\mathbb{N}^{+}$; and the data dimensions of the two participants are the same.
3. The data anomaly point cleaning method based on secure multi-party computing technology according to claim 1, wherein in step two, participant A and participant B encrypting their own data sets specifically comprises:
1) encrypting the data set $D_1$ of participant A with Yao's encryption algorithm in the secure multi-party computation algorithm ABY:
wherein Enc denotes Yao's encryption algorithm and $D_1$ denotes the data set of participant A; one part of the encrypted data set is handed to server A and the other part is handed to server B;
specifically, each element is encrypted in the same way:
wherein $a_{ij}$ denotes any element of participant A's data, one part of the encrypted element being handed to server A and the other part to server B; $avf_{ai}$ denotes the AVF value of the $i$-th record of participant A, one part of the encrypted AVF value being handed to server A and the other part to server B;
2) the encrypted data set of participant A is represented by $X_{10}$ and $X_{11}$:
wherein $X_{10}$ denotes the encrypted data set of participant A held by server A, $X_{11}$ denotes the encrypted data set of participant A held by server B, $i\in[1,N]$, $j\in[1,M]$, $M,N\in\mathbb{N}^{+}$;
3) encrypting the data set $D_2$ of participant B with Yao's encryption algorithm in the secure multi-party computation algorithm ABY:
wherein Enc denotes Yao's encryption algorithm and $D_2$ denotes the data set of participant B; one part of the encrypted data set is handed to server A and the other part is handed to server B;
specifically, each element is encrypted in the same way:
wherein $b_{kj}$ denotes any element of participant B's data, one part of the encrypted element being handed to server A and the other part to server B; $avf_{bk}$ denotes the AVF value of the $k$-th record of participant B, one part of the encrypted AVF value being handed to server A and the other part to server B;
4) the encrypted data set of participant B is represented by $X_{20}$ and $X_{21}$:
wherein $X_{20}$ denotes the encrypted data set of participant B held by server A, $X_{21}$ denotes the encrypted data set of participant B held by server B, $k\in[1,P]$, $j\in[1,M]$, $M,P\in\mathbb{N}^{+}$;
5) participant A and participant B each upload their encrypted data to the corresponding servers.
4. The data anomaly point cleaning method based on secure multi-party computing technology according to claim 1, wherein in step three, server A and server B performing data anomaly point cleaning on the encrypted data sets uploaded by the participants specifically comprises:
1) server A extracts the last dimension of the encrypted data set of participant A that it holds, denoted $A_{10}$:
server A sorts $A_{10}$ using the sorting algorithm of Yao's encryption algorithm in the secure encryption algorithm ABY:
$A'_{10}=\mathrm{Sort}(A_{10})$;
wherein $A_{10}$ denotes the last dimension of the encrypted data set of participant A held by server A, $A'_{10}$ denotes $A_{10}$ after sorting in descending order, and Sort() denotes the sorting algorithm in Yao's encryption algorithm;
$X_{10}$ is sorted at the same time with $A_{10}$ as the key, that is, $X_{10}$ is arranged in descending order of $A_{10}$, and after sorting:
wherein $X'_{10}$ is the data set submitted by participant A to server A after sorting in descending order of the last dimension $A_{10}$ of $X_{10}$, $i\in[1,N]$, $j\in[1,M]$, $M,N\in\mathbb{N}^{+}$;
a fixed value Thre is specified as the threshold for AVF values in the normal range, and the elements of $A'_{10}$ are compared with Thre in order:
$Res_i=\mathrm{Comp}(A'_{10i},Thre)$;
wherein $A'_{10i}$ denotes the $i$-th element of $A'_{10}$, $i\in[1,N]$, $N\in\mathbb{N}^{+}$, Comp() denotes the comparison algorithm in Yao's encryption algorithm, and $Res_i$ denotes the result of comparing $A'_{10i}$ with Thre; $Res_i=1$ means $A'_{10i}\ge Thre$ and $Res_i=0$ means $A'_{10i}<Thre$; the elements of $A'_{10}$ are compared with Thre in order until $Res_i=0$, the comparison then stops, and the first $i$ rows of $X'_{10}$ are retained:
wherein $I=i$ is the number of rows retained after sorting, $j\in[1,M]$, $M\in\mathbb{N}^{+}$, and $X''_{10}$ is the data set of participant A held by server A after the data cleaning is complete;
2) server A extracts the last dimension of the encrypted data set of participant B that it holds, denoted $A_{20}$:
server A sorts $A_{20}$ using the sorting algorithm of Yao's encryption algorithm in the secure encryption algorithm ABY:
$A'_{20}=\mathrm{Sort}(A_{20})$;
wherein $A_{20}$ denotes the last dimension of the encrypted data set of participant B held by server A, $A'_{20}$ denotes $A_{20}$ after sorting in descending order, and Sort() denotes the sorting algorithm in Yao's encryption algorithm;
$X_{20}$ is sorted at the same time with $A_{20}$ as the key, that is, $X_{20}$ is arranged in descending order of $A_{20}$, and after sorting:
wherein $X'_{20}$ is the data set submitted by participant B to server A after sorting in descending order of the last dimension $A_{20}$ of $X_{20}$, $k\in[1,P]$, $j\in[1,M]$, $M,P\in\mathbb{N}^{+}$;
a fixed value Thre is specified as the threshold for AVF values in the normal range, and the elements of $A'_{20}$ are compared with Thre in order:
$Res_k=\mathrm{Comp}(A'_{20k},Thre)$;
wherein $A'_{20k}$ denotes the $k$-th element of $A'_{20}$, $k\in[1,P]$, $P\in\mathbb{N}^{+}$, Comp() denotes the comparison algorithm in Yao's encryption algorithm, and $Res_k$ denotes the result of comparing $A'_{20k}$ with Thre; $Res_k=1$ means $A'_{20k}\ge Thre$ and $Res_k=0$ means $A'_{20k}<Thre$; the elements of $A'_{20}$ are compared with Thre in order until $Res_k=0$, the comparison then stops, and the first $k$ rows of $X'_{20}$ are retained:
wherein $K=k$ is the number of rows retained after sorting, $j\in[1,M]$, $M\in\mathbb{N}^{+}$, and $X''_{20}$ is the data set of participant B held by server A after the data cleaning is complete;
3) server B extracts the last dimension of the encrypted data set of participant A that it holds, denoted $A_{11}$:
server B sorts $A_{11}$ using the sorting algorithm of Yao's encryption algorithm in the secure encryption algorithm ABY:
$A'_{11}=\mathrm{Sort}(A_{11})$;
wherein $A_{11}$ denotes the last dimension of the encrypted data set of participant A held by server B, $A'_{11}$ denotes $A_{11}$ after sorting in descending order, and Sort() denotes the sorting algorithm in Yao's encryption algorithm;
$X_{11}$ is sorted at the same time with $A_{11}$ as the key, that is, $X_{11}$ is arranged in descending order of $A_{11}$, and after sorting:
wherein $X'_{11}$ is the data set submitted by participant A to server B after sorting in descending order of the last dimension $A_{11}$ of $X_{11}$, $i\in[1,N]$, $j\in[1,M]$, $M,N\in\mathbb{N}^{+}$;
a fixed value Thre is specified as the threshold for AVF values in the normal range, and the elements of $A'_{11}$ are compared with Thre in order:
$Res_i=\mathrm{Comp}(A'_{11i},Thre)$;
wherein $A'_{11i}$ denotes the $i$-th element of $A'_{11}$, $i\in[1,N]$, $N\in\mathbb{N}^{+}$, Comp() denotes the comparison algorithm in Yao's encryption algorithm, and $Res_i$ denotes the result of comparing $A'_{11i}$ with Thre; $Res_i=1$ means $A'_{11i}\ge Thre$ and $Res_i=0$ means $A'_{11i}<Thre$; the elements of $A'_{11}$ are compared with Thre in order until $Res_i=0$, the comparison then stops, and the first $i$ rows of $X'_{11}$ are retained:
wherein $I=i$ is the number of rows retained after sorting, $j\in[1,M]$, $M\in\mathbb{N}^{+}$, and $X''_{11}$ is the data set of participant A held by server B after the data cleaning is complete;
4) server B extracts the last dimension of the encrypted data set of participant B that it holds, denoted $A_{21}$:
server B sorts $A_{21}$ using the sorting algorithm of Yao's encryption algorithm in the secure encryption algorithm ABY:
$A'_{21}=\mathrm{Sort}(A_{21})$;
wherein $A_{21}$ denotes the last dimension of the encrypted data set of participant B held by server B, $A'_{21}$ denotes $A_{21}$ after sorting in descending order, and Sort() denotes the sorting algorithm in Yao's encryption algorithm;
$X_{21}$ is sorted at the same time with $A_{21}$ as the key, that is, $X_{21}$ is arranged in descending order of $A_{21}$, and after sorting:
wherein $X'_{21}$ is the data set submitted by participant B to server B after sorting in descending order of the last dimension $A_{21}$ of $X_{21}$, $k\in[1,P]$, $j\in[1,M]$, $M,P\in\mathbb{N}^{+}$;
a fixed value Thre is specified as the threshold for AVF values in the normal range, and the elements of $A'_{21}$ are compared with Thre in order:
$Res_k=\mathrm{Comp}(A'_{21k},Thre)$;
wherein $A'_{21k}$ denotes the $k$-th element of $A'_{21}$, $k\in[1,P]$, $P\in\mathbb{N}^{+}$, Comp() denotes the comparison algorithm in Yao's encryption algorithm, and $Res_k$ denotes the result of comparing $A'_{21k}$ with Thre; $Res_k=1$ means $A'_{21k}\ge Thre$ and $Res_k=0$ means $A'_{21k}<Thre$; the elements of $A'_{21}$ are compared with Thre in order until $Res_k=0$, the comparison then stops, and the first $k$ rows of $X'_{21}$ are retained:
wherein $K=k$ is the number of rows retained after sorting, $j\in[1,M]$, $M\in\mathbb{N}^{+}$, and $X''_{21}$ is the data set of participant B held by server B after the data cleaning is complete;
5) the resulting $X''_{10}$, $X''_{11}$, $X''_{20}$, and $X''_{21}$ are the data sets after the data cleaning is complete.
5. A machine learning system applying the data anomaly point cleaning method based on the secure multi-party computing technology according to any one of claims 1 to 4.
CN201910156492.6A 2019-03-01 2019-03-01 A data anomaly point cleaning method based on secure multi-party computing technology Active CN109992977B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910156492.6A CN109992977B (en) 2019-03-01 2019-03-01 A data anomaly point cleaning method based on secure multi-party computing technology

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910156492.6A CN109992977B (en) 2019-03-01 2019-03-01 A data anomaly point cleaning method based on secure multi-party computing technology

Publications (2)

Publication Number Publication Date
CN109992977A true CN109992977A (en) 2019-07-09
CN109992977B CN109992977B (en) 2022-12-16

Family

ID=67130167

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910156492.6A Active CN109992977B (en) 2019-03-01 2019-03-01 A data anomaly point cleaning method based on secure multi-party computing technology

Country Status (1)

Country Link
CN (1) CN109992977B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111046409A (en) * 2019-12-16 2020-04-21 支付宝(杭州)信息技术有限公司 Private data multi-party security calculation method and system
CN111125735A (en) * 2019-12-20 2020-05-08 支付宝(杭州)信息技术有限公司 Method and system for model training based on private data

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013102488A1 (en) * 2012-01-02 2013-07-11 Telecom Italia S.P.A. Method and system for comparing images
CN108712260A (en) * 2018-05-09 2018-10-26 曲阜师范大学 The multi-party deep learning of privacy is protected to calculate Proxy Method under cloud environment
CN108809628A (en) * 2018-06-13 2018-11-13 哈尔滨工业大学深圳研究生院 Based on the time series method for detecting abnormality and system under Secure

Also Published As

Publication number Publication date
CN109992977B (en) 2022-12-16

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant