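The solutions provided in this specification are described below with reference to the drawings.
Fig. 1A shows the data held by data party A as disclosed in an embodiment of this specification. Fig. 1B shows the data held by data party B as disclosed in an embodiment of this specification. Each ID (Identity Document, identity identification number) in Fig. 1A and Fig. 1B may be a numeric code that uniquely identifies a user, for example a mobile phone number. As shown in Fig. 1A and Fig. 1B, ID1, ID2, and ID3 are IDs common to data party A and data party B. Each ID in Fig. 1A has a label and a feature value of feature Fa. For example, as shown in Fig. 1A, labels may be of two kinds, positive and negative. Each ID in Fig. 1B has a feature value of feature Fb.
In one exemplary scenario, data party A may be an electronic payment platform (for example Alipay), and the label may mark a merchant as fraudulent or non-fraudulent. Feature Fa may be transaction flow data. Data party B may be a banking institution, and feature Fb may be loan data. The feature value of the transaction flow data or of the loan data for each ID can be computed through feature engineering; for details, refer to the prior art, which is not repeated here.
In another exemplary scenario, data party A may be an e-commerce platform (for example Taobao), the label may mark a buyer as normal or abnormal, and feature Fa may be sales data. Data party B may be a banking institution, and feature Fb may be loan data.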
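When multiple parties jointly train a machine learning model, the features of the users common to data party A and data party B are used. To train the model effectively, the correlation between each feature and the label needs to be evaluated.
Feature screening can be performed with the scheme shown in Fig. 2. The IDs held by data party A (an ID set) are called set_A, and the IDs held by data party B (an ID set) are called set_B. For the joint computation, data party A sends set_A and the labels of set_A to data party B. Data party B can then determine the IDs common to set_A and set_B and compute the information value of feature Fb over the common IDs, in order to evaluate the correlation between feature Fb and the label. Data party B sends set_B to data party A. Data party A can then determine the IDs common to set_A and set_B and compute the information value of feature Fa over the common IDs, in order to evaluate the correlation between feature Fa and the label. In this scheme, the two parties have to exchange plaintext IDs.
Another scheme for evaluating the correlation between features and labels is to build a trusted execution environment (for example with Intel SGX technology). The data of data party A (set_A, the labels of set_A, and feature Fa of set_A) and the data of data party B (set_B and feature Fb of set_B) are each encrypted with a public key and passed into the trusted execution environment. Inside the trusted execution environment the data is decrypted with the private key, the information value of the features is computed, and the result is passed out of the trusted environment.
Yet another scheme for evaluating the correlation between features and labels is to send the data of data party A (set_A, the labels of set_A, and feature Fa of set_A) and the data of data party B (set_B and feature Fb of set_B) to a third-party institution, which computes the information value of the features.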
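To further strengthen the security of private data, the embodiments of this specification provide a method for multi-party joint feature evaluation. The method computes the information value of the features of the users common to both parties while neither party learns the other party's users and while label data and feature data remain isolated. In one embodiment, the method may include the steps shown in Fig. 3. It should be noted that although Fig. 3 shows steps 300a to 310a and steps 300b to 310b in sequence, this does not limit the order in which these steps are executed. In some examples, steps 300a to 310a and steps 300b to 310b are executed in the order shown in Fig. 3. In some examples, they are executed in an order different from that shown in Fig. 3. In some examples, two or more of these steps are executed concurrently.
Next, the privacy-preserving method for multi-party joint feature evaluation provided in this specification is illustrated with reference to Fig. 3.
Data party A and data party B may each be an apparatus, a device, a platform, or a device cluster with computing and processing capabilities, and they can cooperate with each other to execute the method shown in Fig. 3.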
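In step 300a and step 300b, data party A and data party B cooperate to perform an initialization operation. Specifically, data party A and data party B each determine the upper limit of the IDs they hold. Taking a mobile phone number as the ID, it is an integer made up of 11 digits, so every ID is an integer. The upper limit of either party's IDs is the numerically largest ID held by that party.
In one example, data party A determines an integer C1 that is greater than or equal to the numerically largest ID of data party A. For example, when the ID is an 11-digit mobile phone number, the integer C1 may be a 12-digit integer. Data party A sends its integer C1 to data party B. Data party B determines a prime P that is greater than the numerically largest ID of data party B and greater than the integer C1, and sends the prime P to data party A.
In another example, data party B determines an integer C2 that is greater than or equal to the numerically largest ID of data party B. For example, when the ID is an 11-digit mobile phone number, the integer C2 may be a 12-digit integer. Data party B sends its integer C2 to data party A. Data party A determines a prime P that is greater than the numerically largest ID of data party A and greater than the integer C2, and sends the prime P to data party B.
Data party A randomly generates a positive integer keyA that is coprime to the prime P. keyA is also called the first key. Data party B randomly generates a positive integer keyB that is coprime to the prime P. keyB is also called the second key.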
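The initialization in steps 300a and 300b can be sketched as follows. This is only an illustration under stated assumptions: the helper names `is_prime`, `next_prime_above`, and `generate_key`, the sample bounds, the use of a Miller-Rabin test, and the choice to exclude the trivial key 1 are not taken from the specification.

```python
import math
import secrets

def is_prime(n: int) -> bool:
    """Miller-Rabin with fixed bases; deterministic for n < 2**64."""
    if n < 2:
        return False
    small = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37)
    for q in small:
        if n % q == 0:
            return n == q
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for a in small:
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False
    return True

def next_prime_above(n: int) -> int:
    """Smallest prime strictly greater than n."""
    candidate = n + 1
    while not is_prime(candidate):
        candidate += 1
    return candidate

def generate_key(p: int) -> int:
    """Random key in [2, p-1]; every such integer is coprime to the prime p."""
    return 2 + secrets.randbelow(p - 2)

# Hypothetical values: C1 is the 12-digit bound sent by A, max_id_b is B's largest ID.
C1 = 10**11
max_id_b = 13_999_999_999
P = next_prime_above(max(C1, max_id_b))   # chosen by B and sent to A
key_a = generate_key(P)                   # first key, kept by A
key_b = generate_key(P)                   # second key, kept by B
```

Only C1 (or C2) and P cross the wire in this sketch; each key stays with the party that generated it.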
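In this way, data party A and data party B complete the initialization and obtain their respective keys. Next, each party encrypts its own IDs once with its own key to obtain its first-encrypted IDs. Each party then sends its first-encrypted IDs to the other party, which encrypts them a second time with its own key. IDs with the same value still have the same value after the two encryptions, so data party A and data party B can each obtain the IDs common to both parties without disclosing unencrypted IDs (also called initial IDs) to the other party. The details are as follows.
For ease of description, the ID set held by data party A, that is, the set of IDs of the samples in data party A's sample set, is called set_A. The ID set held by data party B, that is, the set of IDs of the samples in data party B's sample set, is called set_B. Samples and IDs are in one-to-one correspondence. Before the encryption described below, each ID in set_A and set_B is called the initial ID of a sample.
In step 302a, data party A uses keyA to encrypt each ID (initial ID) of set_A for the first time, obtaining a first-encrypted ID. For example, for each ID of set_A, the first encryption computes the product of the ID and keyA and uses the remainder of dividing the product by the prime P as the first-encrypted ID corresponding to that ID. The first-encrypted ID is written Encry(ID,keyA).
Specifically, as shown in Fig. 4, the ID to be encrypted may be any ID in set_A. The initialized p is the prime p described above, and max(ID) is the numerically largest ID of data party A. The ID to be encrypted is multiplied by the key to obtain TMP. Then the remainder E of TMP modulo the prime p (that is, the remainder of dividing TMP by p) is used as the encryption result of the ID.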
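A minimal sketch of the first encryption, assuming a toy prime and key that are far smaller than real values would be (the function name `encry` mirrors the Encry notation; the concrete numbers are illustrative only):

```python
def encry(value: int, key: int, p: int) -> int:
    """Encry(ID, key): remainder of value * key divided by the prime p."""
    return (value * key) % p

p = 101                   # toy prime, larger than every ID below
key_a = 37                # toy first key, coprime to p
set_a = [5, 17, 42, 99]   # toy initial IDs of data party A

first_encrypted = [encry(i, key_a, p) for i in set_a]
print(first_encrypted)    # [84, 23, 39, 27]
```

The same function serves for the second encryption, because the input and output both lie in 1 to p-1.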
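Data party A performs feature binning on set_A according to the feature values of feature Fa, so that the first-encrypted IDs of set_A are assigned to multiple bins. Referring to Fig. 3, feature Fa may be a feature set that includes several features such as feature Fa1 and feature Fa2, which are collectively called Fai, where i may be 1, 2, and so on. Each sample has a feature value of feature Fai (also called the value of feature Fai). For a feature Fai, data party A performs feature binning according to the feature values of Fai for the IDs in set_A, so that the first-encrypted IDs of the IDs in set_A are assigned to multiple bins corresponding to feature Fai. Every bin has a bin identifier. For feature Fa1 the bin identifier is written Fa1_bin, and for feature Fa2 it is written Fa2_bin. Each first-encrypted ID is associated with Fa1_bin, Fa2_bin, and so on, written (Encry(ID,keyA),Fa1_bin,Fa2_bin,…). Fa1_bin, Fa2_bin, and so on are collectively called Fai_bin, which indicates that the ID was assigned to bin Fai_bin according to its feature value of feature Fai.
In one example, an equal-frequency binning algorithm is used for feature binning. In another example, an equal-width binning algorithm is used. In yet another example, a chi-square binning algorithm is used.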
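Two of the binning options named above can be sketched in plain Python. The function names and the tie handling are assumptions made for illustration; chi-square binning is omitted because it needs the labels and a merge criterion that the text does not detail.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0   # avoid a zero width when all values are equal
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def equal_frequency_bins(values, n_bins):
    """Assign values to n_bins bins holding roughly the same number of samples.

    Ties are broken by position, so equal values may land in adjacent bins.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

fa1_values = [1, 2, 3, 4, 5, 6, 7, 100]         # hypothetical Fa1 feature values
print(equal_width_bins(fa1_values, 4))          # [0, 0, 0, 0, 0, 0, 0, 3]
print(equal_frequency_bins(fa1_values, 4))      # [0, 0, 1, 1, 2, 2, 3, 3]
```

The outlier shows the practical difference: equal-width binning leaves most samples in one bin, while equal-frequency binning keeps every bin populated.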
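For each sample of set_A, the first-encrypted ID, the label, and the identifier of the bin into which the sample falls after binning by the feature value of feature Fai are associated with one another. This yields the association information of the first-encrypted ID of each sample of set_A, written (Encry(ID,keyA),label,Fa1_bin,Fa2_bin,…). The association information of all first-encrypted IDs of set_A constitutes the first exchange information. Data party A sends the first exchange information to data party B.
Each bin may contain multiple IDs, for example K IDs. The binning information about A's features that B obtains is therefore K-anonymized: for any one ID, at least K IDs share the same feature binning information. As a result, it is difficult for data party B to infer the correspondence between an individual ID and its feature information.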
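The first exchange information and the size of its smallest bin can be sketched as follows. The record layout (a tuple with a single bin identifier) and the helper names are assumptions for illustration; a real record would carry one bin identifier per feature Fai.

```python
from collections import Counter

def encry(value, key, p):
    return (value * key) % p

def build_first_exchange(ids, labels, bin_ids, key, p):
    """One record per sample: (Encry(ID, keyA), label, Fa1_bin)."""
    return [(encry(i, key, p), y, b) for i, y, b in zip(ids, labels, bin_ids)]

def smallest_bin_size(records):
    """The K of the K-anonymity argument: size of the least populated bin."""
    return min(Counter(b for _, _, b in records).values())

# Toy data: IDs, labels, and Fa1 bin identifiers of data party A.
records = build_first_exchange([5, 17, 42, 99], [1, 0, 0, 1], [0, 0, 1, 1], key=37, p=101)
print(records)                     # [(84, 1, 0), (23, 0, 0), (39, 0, 1), (27, 1, 1)]
print(smallest_bin_size(records))  # 2
```

A party could refuse to send the records, or merge bins, when the smallest bin falls below a threshold it considers acceptable.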
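In step 302b, data party B uses keyB to encrypt each ID (initial ID) of set_B for the first time, obtaining a first-encrypted ID. For example, for each ID of set_B, the first encryption computes the product of the ID and keyB and uses the remainder of dividing the product by the prime P as the first-encrypted ID corresponding to that ID. The first-encrypted ID is written Encry(ID,keyB).
Data party B performs feature binning on set_B according to the feature values of feature Fb, so that the first-encrypted IDs of set_B are assigned to multiple bins. Referring to Fig. 3, feature Fb may be a feature set that includes several features such as feature Fb1 and feature Fb2, which are collectively called Fbi, where i may be 1, 2, and so on. Each sample has a feature value of feature Fbi, and set_B is binned according to the feature values of Fbi. For details, refer to the description of step 302a above, which is not repeated here.
For each sample of set_B, the first-encrypted ID and the identifier of the bin into which the sample falls after binning by the feature value of Fbi are associated with each other. This yields the association information of the first-encrypted ID of each sample of set_B, written (Encry(ID,keyB),Fb1_bin,Fb2_bin,…). The association information of all first-encrypted IDs of set_B constitutes the third exchange information. Data party B sends the third exchange information to data party A.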
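In step 304a, after receiving the third exchange information, data party A uses keyA to encrypt each first-encrypted ID of set_B in the third exchange information a second time, obtaining the second-encrypted ID of each first-encrypted ID of set_B. Specifically, the product of the first-encrypted ID and keyA is computed, and the remainder of dividing the product by the prime P is used as the second-encrypted ID corresponding to that first-encrypted ID, written Encry(Encry(ID,keyB),keyA). Together with the bin identifiers this is written (Encry(Encry(ID,keyB),keyA),Fb1_bin,Fb2_bin,…), and this information constitutes the first encrypted set.
In step 306a, the relative order of the second-encrypted IDs of set_B is shuffled (scrambled), and the shuffled second-encrypted IDs of set_B are sent to data party B as the fourth exchange information.
It should be understood that the first-encrypted IDs of set_B in the third exchange information have a relative order, and that the second-encrypted IDs of set_B obtained by encrypting them with the first key have the same relative order. If the second-encrypted IDs of set_B were sent to data party B without shuffling, data party B could use their relative order to determine the one-to-one correspondence between the second-encrypted IDs and the first-encrypted IDs of set_B. From this it could obtain the first key and then determine the IDs in set_A, which would leak data party A's IDs and its blacklist and whitelist.
Moreover, the fourth exchange information does not carry the identifiers of the bins in which the IDs of set_B are located. This prevents data party B from using the bin identifiers attached to the second-encrypted IDs of set_B to infer the correspondence between each sample's second-encrypted ID and its initial ID (or first-encrypted ID), and thereby obtaining the first key and determining the IDs in set_A, which would leak data party A's IDs and its blacklist and whitelist.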
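The leak that shuffling guards against can be made concrete. The sketch below assumes data party B knows which second-encrypted ID belongs to which of its own first-encrypted IDs; a single such pair then suffices to solve for keyA with a modular inverse. The toy numbers and the name `recover_key` are illustrative, and `pow(x, -1, p)` requires Python 3.8 or later.

```python
def encry(value, key, p):
    return (value * key) % p

def recover_key(first_enc, second_enc, p):
    """Solve second_enc = first_enc * key (mod p) for key."""
    return (second_enc * pow(first_enc, -1, p)) % p

p, key_a, key_b = 101, 37, 53

# B's own ID, encrypted by B and then by A, returned in a known position.
first_enc = encry(42, key_b, p)
second_enc = encry(first_enc, key_a, p)
leaked_key_a = recover_key(first_enc, second_enc, p)

# With keyA, B can invert any first-encrypted ID it received from A.
a_cipher = encry(17, key_a, p)                          # from the first exchange information
recovered_id = encry(a_cipher, pow(leaked_key_a, -1, p), p)
print(leaked_key_a, recovered_id)                       # 37 17
```

This is why both the order and any side information that could re-link the two encryptions are withheld from the fourth exchange information.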
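In step 304b, after receiving the first exchange information, data party B uses keyB to encrypt each first-encrypted ID of set_A in the first exchange information a second time, obtaining the second-encrypted ID corresponding to each first-encrypted ID of set_A. Specifically, the product of the first-encrypted ID and keyB is computed, and the remainder of dividing the product by the prime P is used as the second-encrypted ID corresponding to that first-encrypted ID, written Encry(Encry(ID,keyA),keyB). Together with the label and the bin identifiers this is written (Encry(Encry(ID,keyA),keyB),label,Fa1_bin,Fa2_bin,…), and this information constitutes the second encrypted set.
In step 306b, the relative order of the second-encrypted IDs of set_A is shuffled (scrambled), and the shuffled second-encrypted IDs of set_A, together with their respective labels, are sent to data party A as the second exchange information. In step 306b, the relative order of the second-encrypted IDs of set_A is shuffled and the identifiers of the bins in which the IDs of set_A are located are not sent to data party A, so that data party A cannot infer the second key.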
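Steps 304b and 306b can be sketched together. The function name, the record layout, and the use of `random.SystemRandom` for shuffling are assumptions; the point illustrated is that the bin identifiers stay in B's second encrypted set while only shuffled (second-encrypted ID, label) pairs go back to A.

```python
import random

def second_encrypt_and_shuffle(first_exchange, key_b, p, rng=None):
    """Return (second encrypted set kept by B, second exchange information for A)."""
    rng = rng or random.SystemRandom()
    # Keep label and bin identifier with each second-encrypted ID for B's own use.
    second_set = [((c * key_b) % p, label, bin_id) for c, label, bin_id in first_exchange]
    # Drop the bin identifier and shuffle before sending back to A.
    second_exchange = [(c2, label) for c2, label, _ in second_set]
    rng.shuffle(second_exchange)
    return second_set, second_exchange

first_exchange = [(84, 1, 0), (23, 0, 0), (39, 0, 1), (27, 1, 1)]   # toy records from A
second_set, second_exchange = second_encrypt_and_shuffle(first_exchange, key_b=53, p=101)
```

Passing a seeded `random.Random` as `rng` makes the shuffle reproducible for testing; an unpredictable source is the safer default otherwise.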
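Through the above steps, every initial ID in set_A and set_B is encrypted twice. An initial ID in set_A is first encrypted by data party A with the first key and then encrypted a second time by data party B with the second key. An initial ID in set_B is first encrypted by data party B with the second key and then encrypted a second time by data party A with the first key. Data parties A and B exchange the results of their second encryptions, so that both hold the second-encrypted IDs corresponding to the initial IDs in set_A and set_B. The first key and the second key are both coprime to the prime p, and both the first and the second encryption take the remainder of dividing the product of the key and the ID by the prime p as the encrypted ID. By the properties of residue systems, this encryption has the following properties:
Composability: an ID has the same value range before and after encryption, so encryption can be applied multiple times.
Commutativity: encryption obeys the commutative law. When the same ID is encrypted twice with two different keys, swapping the order of encryption yields the same ciphertext, that is, Encry(Encry(ID,keyA),keyB)=Encry(Encry(ID,keyB),keyA).
Hardness of decryption: when the encryption key is unknown, decryption is extremely hard.
Uniqueness: the encryption results of two IDs (integers) are equal if and only if the IDs are equal.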
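Next, the properties of the encryption described in the embodiments of this specification are proved using the properties of residue systems.
In the embodiments of this specification, x mod(y), read x modulo y, denotes the remainder of dividing x by y. Residue systems have the following properties.
In a complete residue system modulo n, any two numbers have different remainders modulo n, and every positive integer has the same remainder modulo n as some number in the complete residue system modulo n. Within a complete residue system modulo n, the set of representatives that are coprime to n is called the reduced residue system modulo n.
For a prime p and any positive integer a coprime to p, multiply every element of the least reduced residue system modulo p, S={1,2,3,…,(p-1)}, by a to obtain the new set a*S={a,2a,3a,…,(p-1)a}. Then a*S mod(p)=S. The proof is as follows.
If x belongs to S, then by the properties of remainders a*x mod(p) either belongs to S or equals 0. Suppose a*x mod(p)=0; then a*x is an integer multiple of p. Because p is prime and x is not divisible by p, a must be divisible by p, which contradicts the condition that a and p are coprime. The supposition therefore fails, a*x mod(p) is not 0, and a*x mod(p) belongs to S.
If x1 and x2 both belong to S and x1>x2, suppose a*x1 and a*x2 are congruent modulo p, that is, a*x1 mod(p)=a*x2 mod(p). Then a*x1-k1*p=a*x2-k2*p for some integers k1 and k2, which gives a*(x1-x2)=(k1-k2)*p. Because 0<x1-x2<p, p does not divide x1-x2. Since p is prime, the equation a*(x1-x2)=(k1-k2)*p can hold only if a is an integer multiple of p, which contradicts the condition that a and p are coprime. Hence a*x1 and a*x2 are not congruent modulo p. It follows that the p-1 elements of a*S have remainders modulo p that are elements of S and are pairwise distinct, so every element of S is the remainder modulo p of some element of a*S. That is, the set a*S mod(p) is the same as the set S.
In the embodiments of this specification, max(ID)<p, so every ID belongs to the set S={1,2,3,…,(p-1)}. This proves composability: an element of S still belongs to S after being encrypted with the encryption provided in the embodiments of this specification, so it can be encrypted again.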
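The permutation property a*S mod(p)=S can be checked exhaustively for a small prime. This is a sanity check on a toy modulus, not a substitute for the proof; the function name is an assumption.

```python
def is_permutation_of_S(a: int, p: int) -> bool:
    """True if multiplying S = {1, ..., p-1} by a and reducing mod p gives S again."""
    S = set(range(1, p))
    return {(a * x) % p for x in S} == S

p = 13
coprime_keys = [a for a in range(1, 40) if a % p != 0]
print(all(is_permutation_of_S(a, p) for a in coprime_keys))  # True
print(is_permutation_of_S(13, p))                            # False: 13 is a multiple of p
```

The failing case shows why the key must be coprime to p: a multiple of p maps every ID to 0.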
For a prime p and any positive integers a and b coprime to p, the commutative law b*(a*x mod(p)) mod(p)=a*(b*x mod(p)) mod(p) holds. The proof is as follows.
It is easy to show that x*y mod(z)=((x mod(z))*(y mod(z))) mod(z). Therefore b*(a*x mod(p)) mod(p)=((b mod(p))*(a*x mod(p))) mod(p)=(b*a*x) mod(p). In the same way, a*(b*x mod(p)) mod(p)=(a*b*x) mod(p). Since b*a*x=a*b*x, it follows that b*(a*x mod(p)) mod(p)=a*(b*x mod(p)) mod(p).
In the embodiments of this specification, when the same ID is encrypted twice with two different keys, swapping the order of encryption yields the same ciphertext, that is, Encry(Encry(ID,keyA),keyB)=Encry(Encry(ID,keyB),keyA). This proves commutativity.
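Commutativity can also be spot-checked numerically. The sketch assumes the Mersenne prime 2**61 - 1 as the modulus, which exceeds any 11-digit ID; the seed, trial count, and function names are arbitrary choices for illustration.

```python
import random

def encry(value, key, p):
    return (value * key) % p

def commutes(p, trials=1000, seed=2024):
    """Check Encry(Encry(x, ka), kb) == Encry(Encry(x, kb), ka) on random inputs."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = rng.randrange(1, 10**11)     # 11-digit-range ID
        ka = rng.randrange(2, p)
        kb = rng.randrange(2, p)
        if encry(encry(x, ka, p), kb, p) != encry(encry(x, kb, p), ka, p):
            return False
    return True

print(commutes(2**61 - 1))  # True
```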
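Given the prime p and the value v of a*x mod(p), where x is known to belong to the set {1,2,3,…,(p-1)} and a is a positive integer coprime to p, recovering x is very hard. Argument: there are two unknowns, a and x. The value of a ranges from 1 to positive infinity and the value of x ranges over 1 to (p-1), so infinitely many pairs (a, x) are consistent with v, and x cannot be determined from v alone. Put differently, by the permutation property proved above, for every candidate x in S some key a satisfies a*x mod(p)=v, so v by itself does not narrow down x. That is, when the encryption key is unknown, decryption is extremely hard. This establishes the hardness of decryption.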
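The counting argument can be illustrated on a toy prime: for a fixed ciphertext v, every candidate plaintext x is explained by exactly one key residue. The function name is an assumption. Note that the argument concerns a ciphertext seen in isolation; once a plaintext and its ciphertext are linked, the key follows from a modular inverse, which is why the protocol shuffles and withholds linking information.

```python
def keys_explaining(v: int, x: int, p: int):
    """All key residues a in 1..p-1 with a * x mod p == v."""
    return [a for a in range(1, p) if (a * x) % p == v]

p, v = 11, 7
for x in range(1, p):
    print(x, keys_explaining(v, x, p))   # each x has exactly one matching key residue
```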
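For a prime p and any positive integer a coprime to p, if m and n are two different elements of the set S={1,2,3,…,(p-1)}, then a*m mod(p) is not equal to a*n mod(p). The proof is as follows.
Suppose a*m mod(p)=a*n mod(p). Then a*m-k1*p=a*n-k2*p, where k1 and k2 are integers, which gives a*(m-n)=(k1-k2)*p. Because a and p are coprime, m-n must be divisible by p. Because m and n both belong to S, the only possibility is m-n=0, so m equals n, which contradicts the condition that they are different. Hence a*m mod(p) is not equal to a*n mod(p).
Therefore, with the encryption provided in this specification, the encryption results of two IDs are equal if and only if the IDs are equal; when the IDs differ, their encryption results necessarily differ.
The above argument shows that when set_A and set_B contain the same ID, the result of encrypting that ID of set_A in the manner described above equals the result of encrypting that ID of set_B in the manner described above.
Thus, in step 308a, data party A can determine the IDs common to set_A and set_B. The second exchange information carries the label of each ID, and the third exchange information provides the identifier of the bin into which each common ID falls when binned by the feature values of feature Fbi (Fb1, Fb2, and so on).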
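Step 308a amounts to a join on the second-encrypted ID. The sketch below assumes the second exchange information is a list of (second-encrypted ID, label) pairs and the first encrypted set is a list of (second-encrypted ID, bin identifiers) pairs; the names and toy data are illustrative.

```python
def encry(value, key, p):
    return (value * key) % p

def common_samples(second_exchange, first_encrypted_set):
    """Join A's labelled double-encrypted IDs with B's binned double-encrypted IDs."""
    label_of = dict(second_exchange)
    return [(c2, label_of[c2], bins) for c2, bins in first_encrypted_set if c2 in label_of]

p, key_a, key_b = 101, 37, 53
labels_a = {5: 1, 17: 0, 42: 0}                       # toy set_A with labels
bins_b = {17: ("b0",), 42: ("b1",), 8: ("b0",)}       # toy set_B with Fb1 bin identifiers

second_exchange = [(encry(encry(i, key_a, p), key_b, p), y) for i, y in labels_a.items()]
first_encrypted_set = [(encry(encry(i, key_b, p), key_a, p), b) for i, b in bins_b.items()]

common = common_samples(second_exchange, first_encrypted_set)
print(len(common))  # 2, corresponding to the shared IDs 17 and 42
```

Data party A sees only double-encrypted values in `common`, so it learns how many samples are shared and their labels and bins, but not which initial IDs they are.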
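In step 310a, based on the information obtained in step 308a, the information value of each feature Fbi is computed with the formula shown in Fig. 3, where label=1 denotes a positive label and label=0 denotes a negative label. For any feature Fbi, Precall_k denotes the ratio of the number of positively labeled IDs in bin k to the total number of positively labeled samples among the common samples, Nrecall_k denotes the ratio of the number of negatively labeled IDs in bin k to the total number of negatively labeled samples among the common samples, and IV denotes the information value.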
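The formula of Fig. 3 is not reproduced in the text, so the sketch below assumes the commonly used definition of information value, IV = sum over bins k of (Precall_k - Nrecall_k) * ln(Precall_k / Nrecall_k). The `eps` floor for empty cells is an added assumption, and the function expects both classes to be present among the common samples.

```python
import math
from collections import Counter

def information_value(labels, bin_ids, eps=1e-9):
    """IV of one feature from per-sample labels (1 or 0) and bin identifiers."""
    pos = Counter(b for y, b in zip(labels, bin_ids) if y == 1)
    neg = Counter(b for y, b in zip(labels, bin_ids) if y == 0)
    n_pos, n_neg = sum(pos.values()), sum(neg.values())
    iv = 0.0
    for b in set(bin_ids):
        precall = max(pos[b] / n_pos, eps)   # share of positives falling in bin b
        nrecall = max(neg[b] / n_neg, eps)   # share of negatives falling in bin b
        iv += (precall - nrecall) * math.log(precall / nrecall)
    return iv

labels = [1, 1, 1, 0, 0, 0]
bin_ids = [0, 0, 1, 0, 1, 1]
print(information_value(labels, bin_ids))   # (2/3) * ln 2, about 0.462
```

A feature whose bins contain the same share of positives and negatives gets an IV of 0, which is what makes IV usable as a screening score.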
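In step 308b, data party B can determine the IDs common to set_A and set_B. The first exchange information carries the label of each ID and the identifiers of the bins in which it is located, so the information value of each feature Fai can be computed in step 310b.
The method provided in the embodiments of this specification completes the secure computation of the information value of features while the data of each party remains isolated, without leaking either party's data. The details are as follows.
During the information value computation, data party A obtains data party B's IDs encrypted with keyB together with the corresponding Fb feature bins, but this data is sufficiently concealed from data party A, because: 1) the IDs that data party A obtains are encrypted with keyB, so data party A cannot learn the original IDs behind them and therefore cannot link the Fb binning results to real IDs; 2) the bin information used to compute the information value does not depend on the order of the bins, so the bin identifiers that data party B sends to data party A can be in shuffled order (which can be done when the order of the second-encrypted IDs is shuffled), or the bin identifiers can be mere code names, so that data party A cannot learn the ordering of feature magnitudes across bins; 3) each bin of a feature contains K IDs, so the information that data party A obtains about data party B's features is K-anonymized: for any one ID, at least K IDs carry the same information. Data party A also obtains the second-encrypted results of its own IDs. Because these encrypted IDs have been shuffled by B and carry no other identifying information, data party A knows only that they are the encrypted results of its own IDs, in one-to-one correspondence, but not which corresponds to which. After obtaining the two data sets, data party A performs matching, intersection, and computation. These operations take place in an encrypted ID space whose mapping to the original space is unknown (the mapping can be known only with both keyA and keyB), so the computation is secure. Similarly, the data available to data party B is not sufficient for data party B to derive data party A's data.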
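Referring to Fig. 5, an embodiment of this specification provides a privacy-preserving method for multi-party joint feature evaluation. The multiple parties include at least a first device and a second device. The first device stores a first sample set and the labels of its samples, and the second device stores a second sample set. The method is applied to the first device. Referring to Fig. 5, the method includes the following steps.
Step 501: encrypt the initial ID of each sample in the first sample set with a first key to obtain the first-encrypted ID of each sample in the first sample set. For details, refer to the description of step 302a in Fig. 3 above, which is not repeated here.
It should be understood that step 302a was described using the remainder-based encryption algorithm. This algorithm requires little computation and offers high security, which makes it a preferred encryption algorithm. It is not the only possible one, however: any encryption algorithm that satisfies composability, commutativity, and uniqueness can be used to encrypt sample IDs in step 302a and step 302b. In the embodiments of this specification, data party A and data party B may agree on another encryption algorithm in advance. The encryption algorithm may be any algorithm for which, when the same set of keys is used to encrypt target data, the order in which the keys are used does not affect the encryption result. Besides the remainder-based encryption algorithm described in the embodiment of Fig. 3, the encryption algorithm may be any one of the exclusive-or (XOR) algorithm, the DH algorithm, the ECC-DH algorithm, and the like.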
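The order-independence that the alternatives must share can be sketched for two of them. The code assumes a toy DH-style cipher that raises the ID itself to the key modulo a prime, and a plain XOR cipher; real deployments of DH-style schemes typically hash the ID into a suitable group first, and the sketch does not assess the security of either option. All names and parameters are illustrative.

```python
import math
import secrets

def xor_encrypt(value: int, key: int) -> int:
    return value ^ key

def pow_encrypt(value: int, key: int, p: int) -> int:
    """DH-style commutative encryption: value ** key mod p."""
    return pow(value, key, p)

def generate_exponent(p: int) -> int:
    """Key in [2, p-2] coprime to p-1, which keeps x -> x**key mod p one-to-one on 1..p-1."""
    while True:
        k = 2 + secrets.randbelow(p - 3)
        if math.gcd(k, p - 1) == 1:
            return k

p = 2**61 - 1                 # toy modulus (a Mersenne prime)
x = 13_912_345_678            # hypothetical 11-digit ID
ka, kb = generate_exponent(p), generate_exponent(p)

same_pow = pow_encrypt(pow_encrypt(x, ka, p), kb, p) == pow_encrypt(pow_encrypt(x, kb, p), ka, p)
same_xor = xor_encrypt(xor_encrypt(x, ka), kb) == xor_encrypt(xor_encrypt(x, kb), ka)
print(same_pow, same_xor)     # True True
```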
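Step 503: send first exchange information to the second device, the first exchange information including at least the first-encrypted ID and the label of each sample in the first sample set. For details, refer to the description of step 302a in Fig. 3 above, which is not repeated here.
Step 505: receive second exchange information and third exchange information from the second device. The second exchange information includes the second-encrypted IDs obtained by the second device by encrypting, with a second key, the first-encrypted ID of each sample in the first sample set, together with the corresponding labels, and the relative order of the samples in the second exchange information has been scrambled by the second device. The third exchange information includes, for each sample in the second sample set, the first-encrypted ID obtained by the second device by encrypting the sample's initial ID with the second key, and the identifier of the first bin in which the sample is located, where the identifiers of the first bins are obtained by the second device by binning according to the feature values of a first feature of the samples in the second sample set.
For details, refer to the description of steps 302b, 304b, and 306b in Fig. 3 above, which is not repeated here.
Step 507: use the first key to encrypt the first-encrypted ID of each sample in the third exchange information a second time, obtaining a first encrypted set. For details, refer to the description of step 304a in Fig. 3 above, which is not repeated here.
Step 509: determine the samples common to the first sample set and the second sample set based on the second-encrypted IDs in the second exchange information and the second-encrypted IDs in the first encrypted set. For details, refer to the description of step 308a in Fig. 3 above, which is not repeated here.
Step 511: determine the information value of the first feature based on the labels of the common samples and the identifiers of the first bins in which they are located, for use in feature selection for a machine learning model. For details, refer to the description of step 310a in Fig. 3 above, which is not repeated here.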
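Steps 501 to 511, together with the second device's matching steps, can be simulated in a single process. The sketch makes several assumptions: toy prime and keys, one feature with two bins, only the label in the first exchange information, the common IV definition used as a stand-in for the formula of Fig. 3, and no handling of bins that lack one of the classes.

```python
import math
import random
from collections import Counter

def encry(value, key, p):
    return (value * key) % p

def information_value(pairs):
    """pairs: (label, bin_id) per common sample; every bin must hold both classes."""
    pos = Counter(b for y, b in pairs if y == 1)
    neg = Counter(b for y, b in pairs if y == 0)
    n_pos, n_neg = sum(pos.values()), sum(neg.values())
    return sum(
        (pos[b] / n_pos - neg[b] / n_neg) * math.log((pos[b] / n_pos) / (neg[b] / n_neg))
        for b in set(pos) | set(neg)
    )

P, KEY_A, KEY_B = 101, 37, 53
labels_a = {5: 1, 17: 0, 42: 0, 99: 1, 63: 1, 71: 0, 88: 1}     # first device: ID -> label
bins_b = {5: "low", 17: "low", 42: "high", 99: "high",
          63: "high", 71: "low", 8: "low"}                      # second device: ID -> Fb1 bin
rng = random.Random(7)

# Steps 501 and 503 (first device): first encryption, send with labels.
first_exchange = [(encry(i, KEY_A, P), y) for i, y in labels_a.items()]
# Second device: second encryption, scramble, send back as second exchange information.
second_exchange = [(encry(c, KEY_B, P), y) for c, y in first_exchange]
rng.shuffle(second_exchange)
# Second device: first encryption of its own IDs with bin identifiers (third exchange information).
third_exchange = [(encry(i, KEY_B, P), b) for i, b in bins_b.items()]
# Step 507 (first device): second encryption gives the first encrypted set.
first_encrypted_set = [(encry(c, KEY_A, P), b) for c, b in third_exchange]
# Step 509: match on second-encrypted IDs.
label_of = dict(second_exchange)
common = [(label_of[c2], b) for c2, b in first_encrypted_set if c2 in label_of]
# Step 511: information value of the first feature over the common samples.
iv_fb1 = information_value(common)
print(len(common), round(iv_fb1, 3))   # 6 0.462
```

Neither dictionary of initial IDs is read by the other party's steps in this simulation; everything exchanged is encrypted at least once.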
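In some embodiments, the method further includes: before sending the first exchange information to the second device, dividing the first sample set into multiple second bins based on the feature values of a second feature of the samples in the first sample set, and including in the first exchange information the identifier of the second bin in which each sample of the first sample set is located; after obtaining the first encrypted set, scrambling the relative order of the samples of the second sample set to obtain fourth exchange information; and sending the fourth exchange information to the second device, so that the second device determines the common samples based on the second-encrypted IDs in the fourth exchange information and the second-encrypted IDs in a second encrypted set, and determines the information value of the second feature based on the labels of the common samples and the identifiers of the second bins in which they are located, where the second encrypted set is obtained by encrypting the first-encrypted IDs in the first exchange information a second time with the second key. For details, refer to the description of steps 302a, 306a, 308b, and 310b in Fig. 3 above, which is not repeated here.
In one example of this embodiment, dividing the first sample set into multiple second bins based on the feature values of the second feature of the samples in the first sample set includes: dividing the first sample set into the multiple second bins according to any one of equal-frequency binning, equal-width binning, and chi-square binning.
In some embodiments, the initial IDs of the samples in the first sample set and the initial IDs of the samples in the second sample set are all positive integers. Before encrypting the initial IDs of the samples in the first sample set with the first key, the method further includes: determining a first prime that is greater than the largest initial ID among the samples of the first sample set and greater than the largest initial ID among the samples of the second sample set; and determining a first positive integer coprime to the first prime as the first key. For details, refer to the description of step 300a and step 300b in Fig. 3 above, which is not repeated here.
In some embodiments, encrypting the initial ID of each sample in the first sample set with the first key to obtain the first-encrypted ID of each sample in the first sample set includes: for each sample in the first sample set, determining the remainder of dividing the product of the sample's initial ID and the first key by the first prime as the sample's first-encrypted ID. For details, refer to the description of step 302a in Fig. 3 above, which is not repeated here.
In some embodiments, the first sample set includes multiple samples with positive labels and multiple samples with negative labels. Determining the information value of the first feature based on the labels of the common samples and the identifiers of the first bins in which they are located includes: determining a first ratio of the number of common samples that fall into the first bin with a first identifier and have a positive label to the total number of common samples with a positive label; determining a second ratio of the number of common samples that fall into the first bin with the first identifier and have a negative label to the total number of common samples with a negative label; and determining the information value of the first feature of the common samples based on the first ratio and the second ratio corresponding to the first bin of each identifier. For details, refer to the description of step 310a in Fig. 3 above, which is not repeated here.
In some embodiments, the samples in the first sample set include user samples and the machine learning model is a user classification model; or the samples in the first sample set include business samples and the machine learning model is a business processing model.
The method provided in the embodiments of this specification computes the information value of the features of the users common to both parties while neither party learns the other party's users and while label data and feature data remain isolated, and therefore offers high security.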
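Referring to Fig. 6, an embodiment of this specification provides a privacy-preserving method for multi-party joint feature evaluation. The multiple parties include at least a first device and a second device. The first device stores a first sample set and the labels of its samples, and the second device stores a second sample set. The method is applied to the second device. As shown in Fig. 6, the method includes the following steps.
Step 601: receive first exchange information from the first device, the first exchange information including at least the first-encrypted IDs obtained by the first device by encrypting the initial ID of each sample in the first sample set with a first key, together with the corresponding labels. For details, refer to the description of step 302a in Fig. 3 above, which is not repeated here.
Step 603: use a second key to encrypt the first-encrypted ID of each sample in the first exchange information a second time, obtaining a second encrypted set, and then scramble the relative order of the samples in the second encrypted set. For details, refer to the description of steps 304b and 306b in Fig. 3 above, which is not repeated here.
Step 605: send second exchange information to the first device, the second exchange information including the second-encrypted IDs and labels of the samples of the first sample set in scrambled relative order. For details, refer to the description of step 306b in Fig. 3 above, which is not repeated here.
Step 607: encrypt the initial ID of each sample in the second sample set with the second key to obtain the first-encrypted IDs of the second sample set. For details, refer to the description of step 302b in Fig. 3 above, which is not repeated here.
Step 609: divide the second sample set into multiple first bins based on the feature values of a first feature of the samples in the second sample set. For details, refer to the description of step 302b in Fig. 3 above, which is not repeated here.
Step 611: send third exchange information to the first device, the third exchange information including the first-encrypted ID of each sample in the second sample set and the identifier of the first bin in which it is located, so that the first device encrypts the first-encrypted IDs in the third exchange information a second time with the first key to obtain a first encrypted set, determines the samples common to the first sample set and the second sample set based on the second-encrypted IDs in the first encrypted set and the second-encrypted IDs in the second exchange information, and determines the information value of the first feature based on the labels of the common samples and the identifiers of the first bins in which they are located, for use in feature selection for a machine learning model.
For details, refer to the description of step 302b in Fig. 3 above, which is not repeated here.
In some embodiments, the first exchange information further includes the identifier of the second bin in which each sample of the first sample set is located, where the identifiers of the second bins are obtained by the first device by binning according to the feature values of a second feature of the samples in the first sample set. The method further includes: receiving fourth exchange information from the first device, the fourth exchange information including the second-encrypted IDs of the samples in the second sample set, with the relative order of the samples in the fourth exchange information scrambled by the first device; determining the samples common to the first sample set and the second sample set based on the second-encrypted IDs of the second encrypted set and the second-encrypted IDs in the fourth exchange information; and determining the information value of the second feature based on the labels of the common samples and the identifiers of the second bins in which they are located, for use in feature selection for a machine learning model. For details, refer to the description of steps 302a, 304a, 306a, 308b, and 310b in Fig. 3 above, which is not repeated here.
The method provided in the embodiments of this specification computes the information value of the features of the users common to both parties while neither party learns the other party's users and while label data and feature data remain isolated, and therefore offers high security.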
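Referring to Fig. 7, an embodiment of this specification provides a privacy-preserving apparatus 700 for multi-party joint feature evaluation. The multiple parties include at least a first device and a second device. The first device stores a first sample set and the labels of its samples, and the second device stores a second sample set. The apparatus is configured in the first device. As shown in Fig. 7, the apparatus 700 includes:
a first encryption unit 710, configured to encrypt the initial ID of each sample in the first sample set with a first key to obtain the first-encrypted ID of each sample in the first sample set;
a first sending unit 720, configured to send first exchange information to the second device, the first exchange information including at least the first-encrypted ID and the label of each sample in the first sample set;
a first receiving unit 730, configured to receive second exchange information and third exchange information from the second device, where the second exchange information includes the second-encrypted IDs obtained by the second device by encrypting, with a second key, the first-encrypted ID of each sample in the first sample set, together with the corresponding labels, and the relative order of the samples in the second exchange information has been scrambled by the second device; and the third exchange information includes, for each sample in the second sample set, the first-encrypted ID obtained by the second device by encrypting the sample's initial ID with the second key, and the identifier of the first bin in which the sample is located, the identifiers of the first bins being obtained by the second device by binning according to the feature values of a first feature of the samples in the second sample set;
a second encryption unit 740, configured to encrypt, with the first key, the first-encrypted ID of each sample in the third exchange information a second time to obtain the second-encrypted ID of each sample in the second sample set;
a first determining unit 750, configured to determine the samples common to the first sample set and the second sample set based on the second-encrypted IDs of the samples in the first sample set and the second-encrypted IDs of the samples in the second sample set;
a second determining unit 760, configured to determine the information value of the first feature based on the labels of the common samples and the identifiers of the first bins in which they are located, for use in feature selection for a machine learning model.
The functions of the functional units of the apparatus 700 can be implemented with reference to the method embodiment shown in Fig. 5 and are not repeated here.
The apparatus provided in the embodiments of this specification computes the information value of the features of the users common to both parties while neither party learns the other party's users and while label data and feature data remain isolated, and therefore offers high security.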
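Referring to Fig. 8, an embodiment of this specification provides a privacy-preserving apparatus 800 for multi-party joint feature evaluation. The multiple parties include at least a first device and a second device. The first device stores a first sample set and the labels of its samples, and the second device stores a second sample set. The apparatus is configured in the second device. The apparatus 800 includes:
a second receiving unit 810, configured to receive first exchange information from the first device, the first exchange information including at least the first-encrypted IDs obtained by the first device by encrypting the initial ID of each sample in the first sample set with a first key, together with the corresponding labels;
a third encryption unit 820, configured to encrypt, with a second key, the first-encrypted ID of each sample in the first exchange information a second time to obtain a second encrypted set, and then scramble the relative order of the samples in the second encrypted set;
a second sending unit 830, configured to send second exchange information to the first device, the second exchange information including the second-encrypted IDs and labels of the samples of the first sample set in scrambled relative order;
a fourth encryption unit 840, configured to encrypt the initial ID of each sample in the second sample set with the second key to obtain the first-encrypted IDs of the second sample set;
a second binning unit 850, configured to divide the second sample set into multiple first bins based on the feature values of a first feature of the samples in the second sample set;
the second sending unit 830 being further configured to send third exchange information to the first device, the third exchange information including the first-encrypted ID of each sample in the second sample set and the identifier of the first bin in which it is located, so that the first device encrypts the first-encrypted IDs in the third exchange information a second time with the first key to obtain a first encrypted set, determines the samples common to the first sample set and the second sample set based on the second-encrypted IDs in the first encrypted set and the second-encrypted IDs of the samples in the second exchange information, and determines the information value of the first feature based on the labels of the common samples and the identifiers of the first bins in which they are located, for use in feature selection for a machine learning model.
The functions of the functional units of the apparatus 800 can be implemented with reference to the method embodiment shown in Fig. 6 and are not repeated here.
The apparatus provided in the embodiments of this specification computes the information value of the features of the users common to both parties while neither party learns the other party's users and while label data and feature data remain isolated, and therefore offers high security.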
In another aspect, an embodiment of this specification provides a computer-readable storage medium storing a computer program which, when executed in a computer, causes the computer to perform the method shown in Fig. 5 or the method shown in Fig. 6.
In another aspect, an embodiment of this specification provides a computing terminal including a memory and a processor, the memory storing executable code, where the processor, when executing the executable code, implements the method shown in Fig. 5 or the method shown in Fig. 6.
Those skilled in the art will appreciate that, in one or more of the above examples, the functions described in this specification can be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions can be stored in a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium.
The specific embodiments described above further explain the objectives, technical solutions, and beneficial effects of the present invention in detail. It should be understood that the above are merely specific embodiments of the present invention and are not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made on the basis of the technical solutions of the present invention shall fall within the scope of protection of the present invention.
In order to effectively train a machine learning model, it is necessary to evaluate the correlation between features and labels. The feature screening can be performed through the scheme shown in Figure 2. Among them, multiple IDs (ID set) in data party A can be called set_A. Multiple IDs (ID set) in B can be called set_B. When performing joint calculations, data party A can send the tags of set_A and set_A to data party B. From this, the data party B can determine the shared ID of set_A and set_B, and then calculate the information value of the feature Fb of the shared ID to evaluate the correlation between the feature Fb and the label. Data party B can send set_B to data party A. From this, the data party A can determine the shared ID of set_A and set_B, and then calculate the information value of the feature Fa of the shared ID to evaluate the correlation between the feature Fa and the label. In this solution, both parties need to exchange plaintext IDs. Another solution for evaluating the correlation between features and tags is to build a trusted execution environment (for example, using Intel’s sgx technology to build a trusted execution environment), and the data of data party A (set_A, set_A tags, set_A features) Fa) and the data of data party B (characteristic Fb of set_B and set_B) can be transmitted to the trusted execution environment after being encrypted with a public key. The private key is decrypted in the trusted execution environment, and the information value calculation of the feature is completed, and the information value calculation result of the feature is transmitted to the trusted environment. Another solution for evaluating the correlation between features and tags is to send data from data party A (set_A, set_A tags, set_A feature Fa) and data party B (set_B, set_B feature Fb) to a third party For institutions, a third party completes the calculation of the information value of the characteristics. 
In order to further enhance the security of private data, the embodiment of this specification provides a method for multi-party joint feature evaluation, which can calculate the information value of the features of the users shared by both parties when the other party is unknown to the user and the tag and feature data are isolated. . In one embodiment, the method may include the steps shown in FIG. 3. It should be noted that although FIG. 3 shows step 301a-step 310a and step 301b-step 310b in sequence, it does not limit the execution order of these steps 300-310. In some examples, step 301a to step 310a and step 301b to step 310b can be performed in sequence as shown in FIG. 3. In some examples, step 301a to step 310a and step 301b to step 310b may be performed in a different order from that shown in FIG. 3. In some examples, two or more steps of step 301a to step 310a and step 301b to step 310b may be performed concurrently. Next, in conjunction with Fig. 3, the method of multi-party joint feature evaluation for protecting privacy and security provided in this specification will be illustrated as an example. The data party A and the data party B can be devices, equipment, platforms, and equipment clusters with computing and processing capabilities, and can cooperate with each other to execute the method shown in FIG. 3. In step 300a and step 300b, the data party A and the data party B can cooperate with each other to perform the initialization operation. Specifically, the data party A and the data party B can determine the upper limit of the value of the ID they own. Taking the ID as a mobile phone number as an example, it is an integer composed of 11 digits, that is, each ID is an integer. The upper limit of the ID of either party is the ID with the largest value among the IDs owned by that party. In one example, the data party A may determine that the data party A is greater than or equal to the integer C1 of the largest ID of the data party A. 
Exemplarily, taking the ID of 11 digits forming a mobile phone number as an example, the integer C1 may be an integer consisting of 12 digits. Data party A can send data party A's integer C1 to data party B. The data party B can determine the prime number P which is greater than the data party B's numerical maximum ID and is greater than the integer C1, and send the prime number P to the data party A. In an example, the data party B can determine that the data party B is greater than or equal to the integer C2 of the largest ID of the data party B. Exemplarily, taking the ID of 11 digits forming a mobile phone number as an example, the integer C2 may be an integer consisting of 12 digits. Data party B can send data party A's integer C2 to data party A. The data party A can determine the prime number P which is greater than the data party A's numerical maximum ID and is greater than the integer C2, and send the prime number P to the data party B. The data party A can randomly generate a positive integer keyA that is relatively prime to the prime number P. keyA can also be called the first key. The data party B can randomly generate a positive integer keyB that is relatively prime to the prime number P. keyB can also be called the second key. The data party A and data party B complete the initialization through the above-mentioned method, and obtain their respective keys. Next, the data party A and the data party B respectively use their own keys to encrypt their IDs for the first time to obtain their first encrypted IDs. Then respectively send their first encrypted ID to the other party, and the other party uses its key to perform the second encryption. For IDs with the same value, the value is still the same after two encryptions. This allows the data party A and the data party B to disclose the unencrypted ID (also known as the initial ID) to the other party. Obtain the IDs shared by both parties. details as follows. 
For the convenience of presentation, the ID set owned by data party A, that is, the set of IDs of each sample in the sample set of data party A, can be called set_A. The ID set owned by data party B, that is, the set of IDs of each sample in the sample set of data party B, can be called set_B. Understandably, there is a one-to-one correspondence between samples and IDs. Before the encryption described below, each ID in set_A and set_B can be referred to as the initial ID of the sample. In step 302a, the data party A uses keyA to encrypt each ID (initial ID) of set_A for the first time to obtain the first encrypted ID. Exemplarily, for each ID of set_A, the first encryption method is to calculate the product of the ID and keyA, and divide the product by the prime number P to obtain the remainder as the first corresponding to the ID Encrypted ID. The first encryption ID can be recorded as Encry(ID, keyA). Specifically, as shown in Figure 4, the ID to be encrypted can be each ID in set_A. Initialization p is the above prime number p. max(ID) is the ID with the largest value in data party A. You can multiply the ID to be encrypted by the ID to be encrypted to get the TMP. Then, the remainder E of the TMP modulus prime number p (that is, the remainder obtained by dividing TMP by the prime number p) E is used as the encryption result of the ID to be encrypted. Data party A can perform feature binning on set_A according to the feature value of feature Fa, so as to split the first encrypted ID in set_A into multiple bins. Referring to FIG. 3, the feature Fa can be a feature set including multiple features such as feature Fa1, feature Fa2, etc. Feature Fa1, feature Fa2 can be collectively referred to as Fai, that is, in Fai, i can be 1, or 2, and so on. Among them, each sample has the feature value of the feature Fai (the feature value of the feature Fai may also be referred to as the value of the feature Fai). 
In terms of feature Fai, data party A can perform feature binning according to the feature value of feature Fai corresponding to each ID in set_A, so as to divide the first encrypted ID of ID in set_A into multiple bins corresponding to feature Fai . Each bin has a bin identification. Taking feature Fa1 as an example, its bin identification can be recorded as Fa1_bin. Taking feature Fa2 as an example, its bin identification can be recorded as Fa2_bin. You can associate each first encrypted ID, Fa1_bin, Fa2_bin, etc., which can be recorded as (Encry(ID, keyA), Fa1_bin, Fa2_bin,...). Among them, Fa1_bin, Fa2_bin, etc. can be collectively referred to as Fai_bin, which means that the ID is sorted into the Fai_bin bin according to the feature value of the feature Fai. In one example, an equal frequency binning algorithm can be used to perform feature binning. In another example, the equidistant binning algorithm can be used for feature binning. In another example, the chi-square binning algorithm can be used to perform feature binning. The first encrypted ID and label of each sample of set_A can be associated with the identification of the bin after being binned according to the feature value of the feature Fai, and the associated information of the first encrypted ID of each sample of set_A can be obtained. It is (Encry(ID, keyA), label, Fa1_bin, Fa2_bin,...). All related information of the first encrypted ID of set_A constitutes the first exchange information. Data party A can send the first exchange information to data party B. It is understandable that each sub-box may include multiple IDs, for example, K IDs. This is equivalent to that the feature binning information of A obtained by B is anonymized by K, that is, corresponding to any ID, at least K each ID and its feature binning information are the same. Therefore, it is difficult for data party B to correspond to the characteristics of the ID. 
Information to infer the correspondence between ID and feature information. In step 302b, the data party B uses keyB to encrypt each ID (initial ID) of set_B for the first time to obtain the first encrypted ID. Exemplarily, for each ID of set_B, the first encryption method is to calculate the product of the ID and keyB, and divide the product by the prime number P to obtain the remainder as the first encryption corresponding to the ID ID. The first encryption ID can be recorded as Encry(ID, keyB). Data party B can perform feature binning on set_B according to the feature value of feature Fb, so as to split the first encrypted ID in set_B into multiple bins. Referring to FIG. 3, the feature Fb may be a feature set including multiple features such as feature Fb1 and feature Fb2. Feature Fb1 and Feature Fb2 can be collectively referred to as Fbi, that is, i in Fai can be 1, or 2, and so on. Among them, each sample has the characteristic value of the characteristic Fbi. The set_B can be binned according to the feature value of the feature Fbi. For details, reference may be made to the above description of the embodiment shown in step 302a, which will not be repeated here. The first encrypted ID of each sample in set_B can be correlated with the identification of the bin after binning according to the characteristic value of Fbi, and the associated information of the first encrypted ID of each sample in set_B can be obtained, which can be recorded as ( Encry(ID, keyB), Fb1_bin, Fb2_bin,...). All related information of the first encrypted ID of set_B constitutes the third exchange information. Data party B can send the third exchange information to data party A. In step 304a, after the data party A receives the third exchange information, it can use keyA to encrypt each first encrypted ID of set_B in the third exchange information respectively to obtain each first encrypted ID of set_B. The second encrypted ID. 
Specifically, the product of the first encrypted ID and keyA is calculated, and the remainder obtained by dividing the product by the prime number P is used as the second encrypted ID corresponding to the first encrypted ID, which can be recorded as Encry(Encry(ID, keyB), keyA). Together with the bin identification, it can be recorded as (Encry (Encry (ID, keyB), keyA), Fb1_bin, Fb2_bin,...), and this information constitutes the first encrypted set. In step 306a, the relative order between the second encrypted IDs of set_B is disturbed (disturbed), and the disturbed second encrypted IDs of set_B are sent to the data party B as the fourth exchange information. It should be understood that the first encrypted IDs of set_B in the third exchange information have a relative order. After the first key is used to encrypt the first encrypted IDs of set_B, each second encrypted ID of set_B is obtained. The relative order between the secondary encryption IDs is the same as the relative order between the first encryption IDs of set_B. If the relative order between the second encrypted IDs of set_B is not disturbed, then the second encrypted IDs of set_B will be sent to the data party B, and then the data party B can follow the relative order between the second encrypted IDs of set_B , Determine the one-to-one correspondence between each second encrypted ID of set_B and each first encrypted ID of set_B, from which the first key can be obtained, and then the ID in set_A can be determined, resulting in the ID of data party A and the black and white list Give way. 
In addition, the third exchange information does not carry the identification of the bin where each ID of set_B is located, so as to prevent the data party B from inferring the second time of each sample based on the identification of the bin where each second encryption ID of set_B is located The corresponding relationship between the encrypted ID and the initial ID (or the first encrypted ID) of each sample, thereby obtaining the first key, and then the ID in set_A can be determined, which leads to the disclosure of the ID of the data party A and the black and white list. In step 304b, after receiving the first exchange information, the data party B can use keyB to perform secondary encryption on each first encrypted ID of set_A in the first exchange information to obtain each first encrypted ID of set_A. The corresponding ID for the second encryption. Specifically, the product of the first encrypted ID and keyB is calculated, and the remainder obtained by dividing the product by the prime number P is used as the second encrypted ID corresponding to the first encrypted ID, which can be recorded as Encry (Encry(ID, keyA), keyB). Together with the bin identification, it can be recorded as (Encry(Encry(ID, keyA), keyB), label, Fa1_bin, Fa2_bin,...), and this information constitutes the second encrypted set. In step 306b, the relative order between the second encrypted IDs of set_A is disrupted (disrupted), and the respective second encrypted IDs of set_A after the scrambled are sent to the data as the second exchange information along with their respective tags. Party A. In step 306b, the relative sequence between the second encrypted IDs of set_A is disturbed, and the identification of the bin where the ID in set_A is located is not sent to the data party, so as to prevent the data party A from inferring the second key. Through the above steps, each initial ID in set_A and set_B has been encrypted twice. 
Among them, the initial ID in set_A is first encrypted by the data party A using the first key, and then encrypted by the data party B using the second key for the second time. The initial ID in set_B is first encrypted by the data party B using the first key, and then encrypted by the data party A using the second key for the second time. The data parties A and B exchange the results of their respective secondary encryptions with each other, so that both the data party A and the data party B have the second encryption IDs corresponding to the initial IDs in set_A and set_B. Both the first key and the second key are relatively prime to the prime number p, and the first and second encryption methods both use the remainder of the product of the key and ID divided by the prime number p as the encryption ID. Due to the nature of the remainder system, the above encryption method has the following properties: Superimposability, ID encryption has the same value range before and after encryption, and multiple encryption operations can be performed; Exchangeability, encryption conforms to the commutative law, and the same ID passes through two Different keys are encrypted twice, and the encryption sequence is exchanged, and the obtained ciphertext is the same, that is, Encry(Encry(ID, keyA), keyB)=Encry(Encry(ID, keyB), keyA). It is difficult to decrypt. When the encryption key is unknown, decryption is extremely difficult. Uniqueness, if and only if the ID (integer) is equal, the encryption result of the ID is the same. Next, the nature of the encryption method described in the embodiment of this specification is proved in conjunction with the nature of the remainder system. In the embodiment of this specification, x mod(y) can be called x mod y, which represents the remainder obtained by dividing x by y. The remainder system has the following properties. 
Any two numbers in a complete residue system modulo n have different remainders modulo n, and every positive integer has the same remainder modulo n as some number in the complete residue system modulo n. Within a complete residue system modulo n, the set of representatives that are relatively prime to n is called the reduced residue system modulo n.
For a prime number p and any positive integer a that is relatively prime to p, multiplying each element of the least reduced residue system S={1,2,3,...,(p-1)} modulo p by a gives a new set a*S={a,2a,3a,...,(p-1)a}, which satisfies a*S mod(p)=S. The proof is as follows.
If x belongs to S, then by the properties of remainders a*x mod(p) either belongs to S or equals 0. Suppose a*x mod(p)=0; then a*x is an integer multiple of p. Since p is prime and x is not divisible by p, a must be divisible by p, which contradicts the condition that a and p are relatively prime. Therefore the assumption does not hold, a*x mod(p) is not equal to 0, and a*x mod(p) belongs to S.
If x1 and x2 belong to S and x1>x2, suppose a*x1 and a*x2 are congruent modulo p, that is, a*x1 mod(p)=a*x2 mod(p). Then a*x1-k1*p=a*x2-k2*p for some integers k1 and k2, so a*(x1-x2)=(k1-k2)*p. Because 0<x1-x2<p, x1-x2 is not divisible by p; since p is prime, the equation can hold only if a is an integer multiple of p, which again contradicts the condition that a and p are relatively prime. Therefore a*x1 and a*x2 are not congruent modulo p.
It follows that the p-1 elements of a*S have remainders modulo p that all lie in S and are pairwise distinct, so every element of S is the remainder modulo p of some element of a*S. That is, the set a*S mod(p) is the same as the set S.
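The statement a*S mod(p)=S can be checked by brute force for a small prime. The following Python sketch is only a numerical illustration of the property proved above; the function names scaled_residues and is_permutation_of_S are hypothetical.

```python
def scaled_residues(a, p):
    """Return [a*x mod p for x in S], where S = {1, 2, ..., p-1}."""
    return [(a * x) % p for x in range(1, p)]


def is_permutation_of_S(a, p):
    """True if a*S mod(p) equals S as a set, with no repeated remainders."""
    return sorted(scaled_residues(a, p)) == list(range(1, p))


if __name__ == "__main__":
    p = 13
    # Every a that is not a multiple of the prime p is relatively prime to p.
    for a in range(1, 60):
        if a % p != 0:
            assert is_permutation_of_S(a, p)
    # A multiple of p maps every element to 0, so the property fails.
    assert not is_permutation_of_S(2 * p, p)
```

For example, with p = 7 and a = 5 the remainders are 5, 3, 1, 6, 4, 2, which is a reordering of S = {1, ..., 6}.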
In the embodiments of this specification, max(ID)<p, so every ID belongs to the set S={1,2,3,...,(p-1)}, which proves superimposability. That is, an element of S still belongs to S after being encrypted by the encryption method provided in the embodiments of this specification, so it can be encrypted again.
For a prime number p and any positive integers a and b that are relatively prime to p, the commutative law b*(a*x mod(p)) mod(p)=a*(b*x mod(p)) mod(p) holds. The proof is as follows. It is easy to show that x*y mod(z)=[(x mod(z))*(y mod(z))] mod(z). Hence b*(a*x mod(p)) mod(p)={[b mod(p)]*[(a*x mod(p)) mod(p)]} mod(p)={[b mod(p)]*[a*x mod(p)]} mod(p)={[b mod(p)]*[a mod(p)]*[x mod(p)]} mod(p). In the same way, a*(b*x mod(p)) mod(p)={[a mod(p)]*[b mod(p)]*[x mod(p)]} mod(p). Therefore b*(a*x mod(p)) mod(p)=a*(b*x mod(p)) mod(p). In the embodiments of this specification, when the same ID is encrypted twice with two different keys, swapping the order of encryption yields the same ciphertext, that is, Encry(Encry(ID, keyA), keyB)=Encry(Encry(ID, keyB), keyA). This proves exchangeability.
Given a prime number p and the value v of a*x mod(p), where x is known to belong to the set {1,2,3,...,(p-1)} and a is a positive integer relatively prime to p, solving for x is difficult. Proof: there are two unknowns, a and x. The value of a ranges from 1 to positive infinity and the value of x ranges from 1 to (p-1), so there are infinitely many candidate solutions and the value of x cannot be determined. That is, when the encryption key is unknown, decryption is extremely difficult, which establishes the difficulty of decryption.
For a prime number p and any positive integer a that is relatively prime to p, if m and n are two different elements of the set S={1,2,3,...,(p-1)}, then a*m mod(p) is not equal to a*n mod(p). The proof is as follows. Suppose a*m mod(p)=a*n mod(p); then a*m-k1*p=a*n-k2*p, where k1 and k2 are integers.
It follows that a*(m-n)=(k1-k2)*p. Since a and p are relatively prime, m-n must be divisible by p. Because both m and n belong to the set S, this is possible only if m-n=0, that is, m equals n, which contradicts the condition that m and n are different. Therefore a*m mod(p) is not equal to a*n mod(p). Consequently, with the encryption method provided in this specification, the encryption results of two IDs are the same if and only if the IDs are equal; when the IDs are not equal, their encryption results must differ.
By the above argument, when set_A and set_B contain the same ID, the result of encrypting that ID in set_A with the above encryption method is equal to the result of encrypting that ID in set_B with the above encryption method.
Therefore, in step 308a, data party A can determine the IDs shared by set_A and set_B. In addition, the second exchange information carries the label of each ID, and the third exchange information provides the identifier of the bin into which each shared ID falls when binned by the feature value of each feature Fbi (feature Fb1, feature Fb2, and so on).
In step 310a, based on the information obtained in step 308a, the information value of each feature Fbi can be calculated using the formula shown in FIG. 3. Here, label=1 indicates a positive label and label=0 indicates a negative label. For any feature Fbi, Precall k denotes the ratio of the number of positively labeled IDs in bin k to the total number of positively labeled samples among the shared samples, and Nrecall k denotes the ratio of the number of negatively labeled IDs in bin k to the total number of negatively labeled samples among the shared samples. IV denotes the information value.
In step 308b, data party B can likewise determine the IDs shared by set_A and set_B.
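The matching in step 308a and the calculation in step 310a can be sketched in Python as follows. The formula of FIG. 3 is not reproduced in the text, so this sketch assumes the commonly used definition IV = sum over bins k of (Precall k - Nrecall k) * ln(Precall k / Nrecall k). The function names, the record layouts, and the choice to skip bins that contain no positive or no negative samples are likewise assumptions made for illustration.

```python
import math
from collections import Counter


def shared_samples(second_exchange, first_encrypted_set):
    """Step 308a (sketch): match records on the second encrypted ID.

    second_exchange: (second_encrypted_id, label) pairs for set_A.
    first_encrypted_set: (second_encrypted_id, bin_id) pairs for set_B.
    Returns (label, bin_id) pairs for the shared IDs.
    """
    bin_by_id = dict(first_encrypted_set)
    return [(label, bin_by_id[eid]) for eid, label in second_exchange if eid in bin_by_id]


def information_value(pairs):
    """Step 310a (sketch): IV of one feature from (label, bin_id) pairs, label in {0, 1}."""
    pairs = list(pairs)
    pos = Counter(b for label, b in pairs if label == 1)
    neg = Counter(b for label, b in pairs if label == 0)
    pos_total = sum(pos.values())
    neg_total = sum(neg.values())
    iv = 0.0
    for b in set(pos) | set(neg):
        if pos[b] == 0 or neg[b] == 0:
            continue  # assumption: skip bins where the logarithm is undefined
        p_recall = pos[b] / pos_total
        n_recall = neg[b] / neg_total
        iv += (p_recall - n_recall) * math.log(p_recall / n_recall)
    return iv
```

Because the matching uses only second encrypted IDs, neither the initial IDs nor the keys appear anywhere in this computation.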
The first exchange information carries the label of each ID and the identifier of the bin in which it is located, so in step 310b data party B can calculate the information value of each feature Fai.
The method provided by the embodiments of this specification can securely calculate the information value of features without leaking either party's data while the parties' data remain isolated. The details are as follows.
During the information value calculation, data party A obtains the IDs of data party B encrypted with keyB, together with the corresponding Fb bin identifiers, but this data remains sufficiently confidential with respect to data party A because: 1) the IDs obtained by data party A are encrypted with keyB, so data party A cannot learn the original IDs behind them and therefore cannot match the Fb binning results to real IDs; 2) the binning information used in calculating the information value is independent of the order of the bins, so the bin identifiers that data party B transmits to data party A can be shuffled (this can be done when the order of the second encrypted IDs is shuffled), or a bin identifier can simply be a code, so that data party A cannot learn the ordering of feature magnitudes across the bins; 3) each bin of a feature contains K IDs, so the information that data party A obtains about the features of data party B is K-anonymized: for any ID, at least K IDs carry the same information.
Data party A also obtains the results of the secondary encryption of its own IDs. These encrypted IDs have been shuffled by data party B and carry no additional identifying information. Data party A therefore knows only that these IDs are the encrypted results of its own IDs, in one-to-one correspondence, but does not know which encrypted ID corresponds to which original ID.
Data party A performs matching, intersection, and calculation after obtaining these two pieces of information. These operations are effectively carried out in an ID-encrypted space, and the correspondence between this encrypted space and the original space is unknown (recovering the mapping requires knowing both keyA and keyB), so the calculation is secure. Similarly, the data available to data party B is not sufficient for data party B to derive the data of data party A.
Referring to FIG. 5, an embodiment of this specification provides a privacy-preserving method for multi-party joint feature evaluation. The multiple parties include at least a first device and a second device. The first device stores a first sample set and the label of each sample therein, the second device stores a second sample set, and the method is applied to the first device. As shown in FIG. 5, the method includes the following steps.
Step 501: Use the first key to encrypt the initial ID of each sample in the first sample set to obtain the first encrypted ID of each sample in the first sample set.
For details, refer to the above description of step 302a in FIG. 3, which is not repeated here.
It should be understood that step 302a was described in terms of the remainder encryption algorithm. The remainder encryption algorithm requires little computation and offers high security, which makes it a preferred encryption algorithm, but it is not the only possible one. Any encryption algorithm that satisfies superimposability, exchangeability, and uniqueness can be used to encrypt the sample IDs in step 302a and step 302b. In the embodiments of this specification, data party A and data party B may negotiate another encryption algorithm in advance.
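Whether a candidate algorithm satisfies superimposability, exchangeability, and uniqueness can be checked by brute force over a small value range. The following Python sketch is a hypothetical test harness, not part of the embodiment. It checks the remainder encryption algorithm and, purely as a second illustrative candidate, modular exponentiation, which is a bijection on {1,...,p-1} only when the exponent is relatively prime to p-1.

```python
def mul_cipher(x, key, p):
    """Remainder encryption from the text: (x * key) mod p."""
    return (x * key) % p


def exp_cipher(x, key, p):
    """Hypothetical alternative candidate: x ** key mod p."""
    return pow(x, key, p)


def satisfies_properties(cipher, key_a, key_b, p):
    """Brute-force check of the three properties on S = {1, ..., p-1}."""
    S = range(1, p)
    for key in (key_a, key_b):
        encrypted = [cipher(x, key, p) for x in S]
        # Superimposability: every ciphertext stays inside S.
        if not all(1 <= c < p for c in encrypted):
            return False
        # Uniqueness: different inputs give different ciphertexts.
        if len(set(encrypted)) != p - 1:
            return False
    # Exchangeability: the order in which the two keys are applied does not matter.
    return all(
        cipher(cipher(x, key_a, p), key_b, p) == cipher(cipher(x, key_b, p), key_a, p)
        for x in S
    )


if __name__ == "__main__":
    p = 23
    print(satisfies_properties(mul_cipher, 5, 7, p))   # keys relatively prime to p
    print(satisfies_properties(exp_cipher, 3, 5, p))   # exponents relatively prime to p - 1
    print(satisfies_properties(exp_cipher, 2, 5, p))   # exponent 2 shares a factor with p - 1
```

Such a check does not say anything about how hard the algorithm is to decrypt; it only tests the three structural properties that the matching step relies on.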
The encryption algorithm here can be any algorithm that encrypts the target data based on the same set of keys such that the order in which the keys are used does not affect the encryption result. In addition to the remainder encryption algorithm described in the embodiment shown in FIG. 3, the encryption algorithm can also be any one of an exclusive OR (XOR) algorithm, a DH algorithm, an ECC-DH algorithm, and the like.
Step 503: Send the first exchange information to the second device, where the first exchange information includes at least the first encrypted ID and the label of each sample in the first sample set.
For details, refer to the above description of step 302a in FIG. 3, which is not repeated here.
Step 505: Receive the second exchange information and the third exchange information from the second device. The second exchange information includes the second encrypted ID, and the corresponding label, obtained by the second device using a second key to perform secondary encryption on the first encrypted ID of each sample in the first sample set, and the relative order of the samples in the second exchange information has been shuffled by the second device. The third exchange information includes, for each sample in the second sample set, the first encrypted ID obtained by the second device encrypting the sample's initial ID with the second key, and the identifier of the first bin in which the sample is located. The identifiers of the first bins are obtained by the second device binning the samples based on the feature value of the first feature of each sample in the second sample set.
For details, refer to the above description of steps 302b, 304b, and 306b in FIG. 3, which is not repeated here.
Step 507: Use the first key to perform secondary encryption on the first encrypted ID of each sample in the third exchange information to obtain a first encrypted set.
For details, refer to the above description of step 304a in FIG. 3, which is not repeated here.
Step 509: Based on the second encrypted IDs in the second exchange information and the second encrypted IDs in the first encrypted set, determine the samples shared by the first sample set and the second sample set.
For details, refer to the above description of step 308a in FIG. 3, which is not repeated here.
Step 511: Determine the information value of the first feature based on the label of each shared sample and the identifier of the first bin in which it is located, for use in feature selection for the machine learning model.
For details, refer to the above description of step 310a in FIG. 3, which is not repeated here.
In some embodiments, the method further includes: before sending the first exchange information to the second device, dividing the first sample set into a plurality of second bins based on the feature value of the second feature of each sample in the first sample set, and including in the first exchange information the identifier of the second bin in which each sample in the first sample set is located; after obtaining the first encrypted set, shuffling the relative order of the samples of the second sample set to obtain fourth exchange information; and sending the fourth exchange information to the second device, so that the second device determines the shared samples based on the second encrypted IDs in the fourth exchange information and the second encrypted IDs in a second encrypted set, and determines the information value of the second feature based on the label of each shared sample and the identifier of the second bin in which it is located, where the second encrypted set is obtained by the second device using the second key to perform secondary encryption on the first encrypted IDs in the first exchange information.
For details, refer to the above description of steps 302a, 306a, 308b, and 310b in FIG. 3, which is not repeated here.
In an example of this embodiment, dividing the first sample set into a plurality of second bins based on the feature value of the second feature of each sample in the first sample set includes: dividing the first sample set into the plurality of second bins according to any one of equal-frequency binning, equal-width binning, and chi-square binning.
In some embodiments, the initial ID of each sample in the first sample set and the initial ID of each sample in the second sample set are positive integers. Before the first key is used to encrypt the initial ID of each sample in the first sample set, the method further includes: determining a first prime number that is greater than the largest initial ID among the samples in the first sample set and greater than the largest initial ID among the samples in the second sample set; and determining a first positive integer that is relatively prime to the first prime number as the first key.
For details, refer to the above description of step 300a and step 300b in FIG. 3, which is not repeated here.
In some embodiments, using the first key to encrypt the initial ID of each sample in the first sample set to obtain the first encrypted ID of each sample in the first sample set includes: for each sample in the first sample set, determining the remainder of the product of the sample's initial ID and the first key divided by the first prime number as the first encrypted ID of the sample.
For details, refer to the above description of step 302 in FIG. 3, which is not repeated here.
In some embodiments, the first sample set includes a plurality of samples with positive labels and a plurality of samples with negative labels. Determining the information value of the first feature based on the label of each shared sample and the identifier of the first bin in which it is located includes: determining a first ratio of the number of positively labeled shared samples that fall into the first bin with a first identifier to the total number of positively labeled shared samples; determining a second ratio of the number of negatively labeled shared samples that fall into the first bin with the first identifier to the total number of negatively labeled shared samples; and determining the information value of the first feature of the shared samples based on the first ratio and the second ratio corresponding to the first bin of each identifier.
For details, refer to the above description of step 310a in FIG. 3, which is not repeated here.
In some embodiments, the samples in the first sample set include user samples and the machine learning model is a user classification model; or the samples in the first sample set include business samples and the machine learning model is a business processing model.
The method provided by the embodiments of this specification can calculate the information value of the features of the users shared by both parties when neither party knows the other party's users and the label data and feature data are isolated, and it offers high security.
Referring to FIG. 6, an embodiment of this specification provides a privacy-preserving method for multi-party joint feature evaluation. The multiple parties include at least a first device and a second device. The first device stores a first sample set and the label of each sample therein, the second device stores a second sample set, and the method is applied to the second device. As shown in FIG. 6, the method includes the following steps.
Step 601: Receive first exchange information from the first device, where the first exchange information includes at least the first encrypted ID, and the corresponding label, obtained by the first device using the first key to encrypt the initial ID of each sample in the first sample set.
For details, refer to the above description of step 302a in FIG. 3, which is not repeated here.
Step 603: Use the second key to perform secondary encryption on the first encrypted ID of each sample in the first exchange information to obtain a second encrypted set, and then shuffle the relative order of the samples in the second encrypted set.
For details, refer to the above description of steps 304b and 306b in FIG. 3, which is not repeated here.
Step 605: Send second exchange information to the first device, where the second exchange information includes the second encrypted ID and the label of each sample in the first sample set, with the relative order shuffled.
For details, refer to the above description of step 306b in FIG. 3, which is not repeated here.
Step 607: Use the second key to encrypt the initial ID of each sample in the second sample set to obtain the first encrypted ID of each sample in the second sample set.
For details, refer to the above description of step 302b in FIG. 3, which is not repeated here.
Step 609: Based on the feature value of the first feature of each sample in the second sample set, divide the second sample set into a plurality of first bins.
For details, refer to the above description of step 302b in FIG. 3, which is not repeated here.
Step 611: Send third exchange information to the first device. The third exchange information includes the first encrypted ID of each sample in the second sample set and the identifier of the first bin in which it is located, so that the first device uses the first key to perform secondary encryption on the first encrypted IDs in the third exchange information to obtain the first encrypted set; determines the samples shared by the first sample set and the second sample set based on the second encrypted IDs in the first encrypted set and the second encrypted IDs in the second exchange information; and determines the information value of the first feature based on the label of each shared sample and the identifier of the first bin in which it is located, for use in feature selection for the machine learning model.
For details, refer to the above description of step 302b in FIG. 3, which is not repeated here.
In some embodiments, the first exchange information further includes the identifier of the second bin in which each sample in the first sample set is located, the identifiers of the second bins being obtained by the first device binning the samples based on the feature value of the second feature of each sample in the first sample set. The method further includes: receiving fourth exchange information from the first device, where the fourth exchange information includes the second encrypted ID of each sample in the second sample set, and the relative order of the samples in the fourth exchange information has been shuffled by the first device; determining the samples shared by the first sample set and the second sample set based on the second encrypted IDs in the second encrypted set and the second encrypted IDs in the fourth exchange information; and determining the information value of the second feature based on the label of each shared sample and the identifier of the second bin in which it is located, for use in feature selection for the machine learning model.
For details, refer to the above description of steps 302a, 304a, 306a, 308b, and 310b in FIG. 3, which is not repeated here.
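The two device-side flows of FIG. 5 and FIG. 6 can be combined into one end-to-end Python sketch. It simulates both devices in a single process purely for illustration; in practice each device would hold only its own key and data. The function names, the dictionary-based data layout, the seeded random generator, and the use of a single bin identifier per sample are all assumptions, and the remainder encryption algorithm is used as described for FIG. 3.

```python
import random


def encry(value, key, p):
    """Remainder encryption: (value * key) mod p."""
    return (value * key) % p


def make_key(p, rng):
    """Any integer in [2, p-1] is relatively prime to the prime p."""
    return rng.randrange(2, p)


def run_protocol(set_a, set_b, p, seed=0):
    """set_a: {id: (label, fa_bin)} on the first device; set_b: {id: fb_bin} on the second.

    Returns the (label, bin) pairs of the shared samples as seen by each device.
    """
    rng = random.Random(seed)
    key_a, key_b = make_key(p, rng), make_key(p, rng)

    # Steps 501 and 503: the first device encrypts its IDs and sends the first exchange information.
    first_exchange = [(encry(i, key_a, p), label, fa_bin) for i, (label, fa_bin) in set_a.items()]
    # Steps 607 to 611: the second device encrypts its IDs and sends the third exchange information.
    third_exchange = [(encry(i, key_b, p), fb_bin) for i, fb_bin in set_b.items()]

    # Steps 603 and 605: secondary encryption with the second key, then shuffle;
    # the second exchange information carries labels but no bin identifiers.
    second_set = {encry(e, key_b, p): (label, fa_bin) for e, label, fa_bin in first_exchange}
    second_exchange = [(e2, label) for e2, (label, _) in second_set.items()]
    rng.shuffle(second_exchange)

    # Step 507: secondary encryption with the first key gives the first encrypted set.
    first_set = {encry(e, key_a, p): fb_bin for e, fb_bin in third_exchange}
    # Fourth exchange information: shuffled second encrypted IDs only.
    fourth_exchange = list(first_set)
    rng.shuffle(fourth_exchange)

    # Step 509: the first device matches on second encrypted IDs.
    shared_a = sorted((label, first_set[e2]) for e2, label in second_exchange if e2 in first_set)
    # Optional embodiment: the second device matches using the fourth exchange information.
    shared_b = sorted(second_set[e2] for e2 in fourth_exchange if e2 in second_set)
    return shared_a, shared_b
```

The returned pairs are exactly the inputs needed for the information value calculation in step 511 on the first device and, in the optional embodiment, for the second feature on the second device.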
The method provided by the embodiments of this specification can calculate the information value of the features of the users shared by both parties when neither party knows the other party's users and the label data and feature data are isolated, and it offers high security.
Referring to FIG. 7, an embodiment of this specification provides a privacy-preserving apparatus 700 for multi-party joint feature evaluation. The multiple parties include at least a first device and a second device. The first device stores a first sample set and the label of each sample therein, the second device stores a second sample set, and the apparatus is configured in the first device. As shown in FIG. 7, the apparatus 700 includes:
a first encryption unit 710, configured to use a first key to encrypt the initial ID of each sample in the first sample set to obtain the first encrypted ID of each sample in the first sample set;
a first sending unit 720, configured to send first exchange information to the second device, where the first exchange information includes at least the first encrypted ID and the label of each sample in the first sample set;
a first receiving unit 730, configured to receive second exchange information and third exchange information from the second device, where the second exchange information includes the second encrypted ID, and the corresponding label, obtained by the second device using a second key to perform secondary encryption on the first encrypted ID of each sample in the first sample set, and the relative order of the samples in the second exchange information has been shuffled by the second device; and the third exchange information includes, for each sample in the second sample set, the first encrypted ID obtained by the second device encrypting the sample's initial ID with the second key, and the identifier of the first bin in which the sample is located, the identifiers of the first bins being obtained by the second device binning the samples based on the feature value of the first feature of each sample in the second sample set;
a second encryption unit 740, configured to use the first key to perform secondary encryption on the first encrypted ID of each sample in the third exchange information to obtain the second encrypted ID of each sample in the second sample set;
a first determining unit 750, configured to determine the samples shared by the first sample set and the second sample set based on the second encrypted ID of each sample in the first sample set and the second encrypted ID of each sample in the second sample set;
a second determining unit 760, configured to determine the information value of the first feature based on the label of each shared sample and the identifier of the first bin in which it is located, for use in feature selection for the machine learning model.
The functions of the functional units of the apparatus 700 can be implemented with reference to the method embodiment shown in FIG. 5 and are not described again here.
The apparatus provided by the embodiments of this specification can calculate the information value of the features of the users shared by both parties when neither party knows the other party's users and the label data and feature data are isolated, and it offers high security.
Referring to FIG. 8, an embodiment of this specification provides a privacy-preserving apparatus for multi-party joint feature evaluation. The multiple parties include at least a first device and a second device.
The first device stores a first sample set and the label of each sample therein, the second device stores a second sample set, and the apparatus is configured in the second device. The apparatus includes:
a second receiving unit 810, configured to receive first exchange information from the first device, where the first exchange information includes at least the first encrypted ID, and the corresponding label, obtained by the first device using the first key to encrypt the initial ID of each sample in the first sample set;
a third encryption unit 820, configured to use the second key to perform secondary encryption on the first encrypted ID of each sample in the first exchange information to obtain a second encrypted set, and then shuffle the relative order of the samples in the second encrypted set;
a second sending unit 830, configured to send second exchange information to the first device, where the second exchange information includes the second encrypted ID and the label of each sample in the first sample set, with the relative order shuffled;
a fourth encryption unit 840, configured to use the second key to encrypt the initial ID of each sample in the second sample set to obtain the first encrypted ID of each sample in the second sample set;
a second binning unit 850, configured to divide the second sample set into a plurality of first bins based on the feature value of the first feature of each sample in the second sample set;
where the second sending unit 830 is further configured to send third exchange information to the first device, the third exchange information including the first encrypted ID of each sample in the second sample set and the identifier of the first bin in which the sample is located, so that the first device uses the first key to perform secondary encryption on the first encrypted IDs in the third exchange information to obtain the first encrypted set; determines the samples shared by the first sample set and the second sample set based on the second encrypted IDs in the first encrypted set and the second encrypted ID of each sample in the second exchange information; and determines the information value of the first feature based on the label of each shared sample and the identifier of the first bin in which it is located, for use in feature selection for the machine learning model.
The functions of the functional units of the apparatus 800 can be implemented with reference to the method embodiment shown in FIG. 6 and are not described again here.
The apparatus provided by the embodiments of this specification can calculate the information value of the features of the users shared by both parties when neither party knows the other party's users and the label data and feature data are isolated, and it offers high security.
In another aspect, the embodiments of this specification provide a computer-readable storage medium on which a computer program is stored. When the computer program is executed in a computer, the computer is caused to execute the method shown in FIG. 5 or the method shown in FIG. 6.
In another aspect, the embodiments of this specification provide a computing terminal, including a memory and a processor. The memory stores executable code, and when the processor executes the executable code, the method shown in FIG. 5 or the method shown in FIG. 6 is implemented.
Those skilled in the art should be aware that, in one or more of the above examples, the functions described in this specification can be implemented by hardware, software, firmware, or any combination thereof. When implemented by software, these functions can be stored in a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium.
The specific embodiments described above further describe the purpose, technical solutions, and beneficial effects of the present invention in detail.
It should be understood that the above are only specific embodiments of the present invention and are not intended to limit its scope of protection. Any modification, equivalent replacement, improvement, and the like made on the basis of the technical solution of the present invention shall fall within the scope of protection of the present invention.