One or more embodiments of this specification describe a method and device for multi-party joint risk identification that can prevent leakage of users' private information.
A first aspect provides a method for multi-party joint risk identification. The multiple parties include a first site and a second site; the first site stores feature information in a first feature set of a user, and the second site stores feature information in a second feature set of the user. The feature information relates to the user's private information. The method is applied to the first site and includes:
Obtaining a first sub-model of the security tree model jointly trained with the second site; the security tree model also has a second sub-model deployed at the second site;
Obtaining a third sub-model obtained according to the tree structure corresponding to the preset risk identification strategy; the tree structure also has a fourth sub-model deployed at the second site;
When it is determined that a preset risk identification condition is satisfied, obtaining first feature data of each feature in the first feature set of a target user;
Inputting the first feature data into the first sub-model to obtain a first prediction score, and into the third sub-model to obtain a third prediction score;
Providing the first prediction score and the third prediction score by way of secure multi-party computation (multi-party computing, MPC), so that they are combined with a second prediction score and a fourth prediction score to comprehensively determine whether the target user has a first risk; wherein the second prediction score is obtained by the second site using second feature data of each feature in the second feature set of the target user and the second sub-model, and the fourth prediction score is obtained by the second site using the second feature data and the fourth sub-model.
In a possible implementation manner, the acquiring the first sub-model of the security tree model jointly trained with the second site includes:
Jointly training the security tree model with the second site by way of MPC to obtain the first sub-model of the security tree model.
In a possible implementation manner, the acquiring the first sub-model of the security tree model jointly trained with the second site includes:
Receiving a first model file corresponding to the first sub-model, where the first model file is a file split from the total model file of the security tree model obtained through joint training.
In a possible implementation manner, the determining that the preset risk identification condition is satisfied includes:
An evaluation request is received, where the evaluation request includes an identifier of the target user.
In a possible implementation manner, the determining that the preset risk identification condition is satisfied includes:
A batch processing request is received, and the target user is any user in the user set defined by the batch processing request.
In a possible implementation manner, the MPC includes:
One of homomorphic encryption and secret sharing.
In a possible implementation manner, before acquiring the first sub-model of the security tree model jointly trained with the second site, the method further includes:
determining the authority to exchange data with the second site; and/or,
determining feature information in the first feature set and feature information in the second feature set; and/or,
It is determined that an algorithm consensus has been reached with the second site.
In a possible implementation manner, the method also includes:
When training jointly with the second site, recording the data exchanged with the second site.
In a possible implementation manner, the first risk includes a supervised risk, that is, a risk for which, after the user performs a first behavior, a label indicating whether the first behavior has the first risk can be obtained; the feature information also relates to the user's behavior information.
In a possible implementation manner, the first risk includes an unsupervised risk, that is, a risk for which, after the user performs a second behavior, a label indicating whether the second behavior has the first risk cannot be obtained;
Jointly training the security tree model with the second site includes:
Obtaining a first sample set for the first risk, where the label of each sample in the first sample set is manually defined, or is determined based on the feature distribution of each feature in a high-risk feature set of the samples;
Using the first sample set to perform preliminary joint training of the security tree model with the second site, and re-determining the features included in the high-risk feature set;
Updating the label of each sample in the first sample set using the feature distribution of each feature in the re-determined high-risk feature set;
Jointly training the security tree model with the second site again based on the updated labels.
A second aspect provides a device for multi-party joint risk identification. The multiple parties include a first site and a second site; the first site stores feature information in a first feature set of a user, and the second site stores feature information in a second feature set of the user. The feature information relates to the user's private information. The device is applied to the first site and includes:
A first acquisition unit, configured to acquire a first sub-model of the security tree model jointly trained with the second site; the security tree model also has a second sub-model deployed at the second site;
The second acquisition unit is configured to acquire a third sub-model obtained according to a tree structure corresponding to a preset risk identification strategy; the tree structure also has a fourth sub-model deployed at the second site;
The third acquisition unit is configured to acquire the first feature data of each feature in the first feature set of the target user when it is determined that the preset risk identification condition is satisfied;
A prediction unit, configured to input the first feature data acquired by the third acquisition unit into the first sub-model acquired by the first acquisition unit to obtain a first prediction score, and into the third sub-model acquired by the second acquisition unit to obtain a third prediction score;
A joint unit, configured to provide the first prediction score and the third prediction score obtained by the prediction unit by way of secure multi-party computation (MPC), so that they are combined with a second prediction score and a fourth prediction score to comprehensively determine whether the target user has a first risk; wherein the second prediction score is obtained by the second site using second feature data of each feature in the second feature set of the target user and the second sub-model, and the fourth prediction score is obtained by the second site using the second feature data and the fourth sub-model.
A third aspect provides a computer-readable storage medium on which a computer program is stored, and when the computer program is executed in a computer, the computer is instructed to execute the method of the first aspect.
A fourth aspect provides a computing device, including a memory and a processor, wherein executable codes are stored in the memory, and when the processor executes the executable codes, the method of the first aspect is implemented.
With the method and device provided by the embodiments of this specification, risk identification is performed jointly by multiple parties. The first site among the multiple parties first obtains the first sub-model of the security tree model jointly trained with the second site, the security tree model also having a second sub-model deployed at the second site; it then obtains the third sub-model obtained according to the tree structure corresponding to the preset risk identification strategy, the tree structure also having a fourth sub-model deployed at the second site; next, when it is determined that the preset risk identification condition is satisfied, it obtains the first feature data of each feature in the first feature set of the target user; it then inputs the first feature data into the first sub-model to obtain the first prediction score, and into the third sub-model to obtain the third prediction score; finally, it provides the first prediction score and the third prediction score by way of MPC, so that they are combined with the second prediction score and the fourth prediction score to comprehensively determine whether the target user has the first risk. The second prediction score is obtained by the second site using the second feature data of each feature in the second feature set of the target user and the second sub-model, and the fourth prediction score is obtained by the second site using the second feature data and the fourth sub-model. As can be seen from the above, in the embodiments of this specification the overall model is split into multiple sub-models, and the sub-models are deployed at the sites of the respective parties, so that the prediction results of the sub-models can be combined to obtain the final risk identification result. This ensures that the sites do not need to exchange users' private information and can prevent leakage of that information. In addition, not only is the model obtained through training split and deployed, but the preset risk identification strategy is split and deployed in the same way, which further prevents leakage of users' private information and improves the accuracy of risk identification.
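The solutions provided in this specification are described below with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of an implementation scenario of an embodiment disclosed in this specification. The scenario involves multiple parties jointly performing risk identification. Referring to Fig. 1, the multiple parties include a first site 11 and a second site 12; the first site 11 stores feature information in a first feature set of a user, and the second site 12 stores feature information in a second feature set of the user. It can be understood that the first feature set and the second feature set contain different feature information; for example, the first feature set contains feature 1 and feature 2, and the second feature set contains feature 3, feature 4 and feature 5. The feature information relates to the user's private information, of which personally identifiable information (PII), such as address, email address, name and identity ID, is particularly important.
In the embodiments of this specification, risk identification is performed jointly by multiple parties on the basis of secure multi-party computation (multi-party computing, MPC). The multi-party deployment of strategies and models relies on a tree-structure-based deployment form, which can prevent leakage of users' private information.
It should be noted that the embodiments of this specification take joint risk identification by two parties only as an example. In practice the multiple parties are not limited to two; for example, three, four or more parties may jointly perform risk identification.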
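Fig. 2 shows a flowchart of a method for multi-party joint risk identification according to an embodiment. The method may be based on the implementation scenario shown in Fig. 1. The multiple parties include a first site and a second site; the first site stores feature information in a first feature set of a user, the second site stores feature information in a second feature set of the user, and the feature information relates to the user's private information. The method is applied to the first site. It can be understood that, when multiple parties jointly perform risk identification, the first site may be any one of the parties; the processing at the second site is similar to that at the first site and is not repeated here. As shown in Fig. 2, the method in this embodiment includes the following steps:
Step 21, obtaining a first sub-model of a security tree model jointly trained with the second site; the security tree model also has a second sub-model deployed at the second site;
Step 22, obtaining a third sub-model obtained according to a tree structure corresponding to a preset risk identification strategy; the tree structure also has a fourth sub-model deployed at the second site;
Step 23, when it is determined that a preset risk identification condition is satisfied, obtaining first feature data of each feature in the first feature set of a target user;
Step 24, inputting the first feature data into the first sub-model to obtain a first prediction score, and into the third sub-model to obtain a third prediction score;
Step 25, providing the first prediction score and the third prediction score by way of MPC, so that they are combined with a second prediction score and a fourth prediction score to comprehensively determine whether the target user has a first risk; wherein the second prediction score is obtained by the second site using second feature data of each feature in the second feature set of the target user and the second sub-model, and the fourth prediction score is obtained by the second site using the second feature data and the fourth sub-model.
The specific execution of each of the above steps is described below.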
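First, in step 21, the first sub-model of the security tree model jointly trained with the second site is obtained; the security tree model also has a second sub-model deployed at the second site. It can be understood that the security tree model is an overall model that can be split into the first sub-model and the second sub-model, which are deployed at the first site and the second site respectively.
In one example, the security tree model is jointly trained with the second site by way of MPC to obtain the first sub-model of the security tree model. It can be understood that, with MPC, the computations of joint training are completed by exchanging process parameters and random numbers, while data privacy is protected and the data does not leave its own domain.
In another example, a first model file corresponding to the first sub-model is received, where the first model file is a file split from the total model file of the security tree model obtained through joint training.
In one example, before step 21, the method further includes:
Determining the data exchange permission with the second site; and/or,
Determining the feature information in the first feature set and the feature information in the second feature set; and/or,
Determining that an algorithm consensus has been reached with the second site.
In one example, the method further includes:
When training jointly with the second site, recording the data exchanged with the second site.
Then, in step 22, the third sub-model obtained according to the tree structure corresponding to the preset risk identification strategy is obtained; the tree structure also has a fourth sub-model deployed at the second site. It can be understood that the preset risk identification strategy may be manually defined. For example, the strategy (x1>a or x2>b) and y3>c can be converted into two trees, x1>a and y3>c, and x2>b and y3>c, each tree corresponding to one sub-model.
In the embodiments of this specification, the preset risk identification strategy is also split into multiple sub-models that are deployed at multiple sites, which can prevent leakage of users' private information.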
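The conversion of a strategy into trees can be pictured as distributing "and" over "or". The following is a minimal sketch, assuming a strategy is given as a list of "or" groups joined by "and", with conditions kept as plain strings; the function name policy_to_trees and this representation are illustrative assumptions, not part of the specification.

```python
from itertools import product

def policy_to_trees(policy):
    """Distribute AND over OR.

    `policy` is a list of OR-groups that are ANDed together (an assumed
    representation). Each returned list is one pure conjunction, i.e. one tree.
    """
    return [list(combo) for combo in product(*policy)]

# (x1>a or x2>b) and y3>c, written as two OR-groups joined by AND
policy = [["x1>a", "x2>b"], ["y3>c"]]
trees = policy_to_trees(policy)
for tree in trees:
    print(" and ".join(tree))
```

For the example strategy this yields the two conjunctions named in the text, x1>a and y3>c, and x2>b and y3>c.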
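Next, in step 23, when it is determined that the preset risk identification condition is satisfied, the first feature data of each feature in the first feature set of the target user is obtained. It can be understood that the preset risk identification condition is a trigger condition; it may be triggered on receipt of a request or on a schedule.
In one example, the determining that the preset risk identification condition is satisfied includes:
Receiving an evaluation request, where the evaluation request includes an identifier of the target user.
In another example, the determining that the preset risk identification condition is satisfied includes:
Receiving a batch processing request, where the target user is any user in the user set defined by the batch processing request.
Then, in step 24, the first feature data is input into the first sub-model to obtain the first prediction score, and into the third sub-model to obtain the third prediction score. It can be understood that the first feature data is stored at the first site, and the first sub-model and the third sub-model are also deployed at the first site, so the first feature data does not need to be transmitted outward, which can prevent leakage of users' private information.
Finally, in step 25, the first prediction score and the third prediction score are provided by way of MPC, so that they are combined with the second prediction score and the fourth prediction score to comprehensively determine whether the target user has the first risk; wherein the second prediction score is obtained by the second site using the second feature data of each feature in the second feature set of the target user and the second sub-model, and the fourth prediction score is obtained by the second site using the second feature data and the fourth sub-model. It can be understood that each party determines its prediction scores using the feature data of the target user that it stores itself, and the prediction scores of the parties are then combined to determine whether the target user is at risk, which can prevent leakage of users' private information.
In one example, the MPC includes:
One of homomorphic encryption and secret sharing.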
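As an illustration of the secret sharing option, the sketch below combines the sites' prediction scores with additive secret sharing. It rests on several assumptions that the specification does not state: the scores are combined by summation, real-valued scores are encoded in fixed point, and PRIME, SCALE and all function names are hypothetical. It runs in one process for readability; in a real deployment each share would be sent to a different party.

```python
import secrets

PRIME = 2**61 - 1   # public modulus, an assumed parameter
SCALE = 10**6       # fixed-point scale for real-valued scores, an assumed parameter

def encode(score):
    """Map a real score to a field element."""
    return round(score * SCALE) % PRIME

def decode(value):
    """Map a field element back to a real score (upper half is negative)."""
    if value > PRIME // 2:
        value -= PRIME
    return value / SCALE

def share(value, n_parties=2):
    """Split `value` into n random shares that sum to it modulo PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

def joint_total(scores_per_site):
    """Each site shares the sum of its own scores; only partial sums are revealed."""
    n = len(scores_per_site)
    all_shares = [share(encode(sum(scores)), n) for scores in scores_per_site]
    # Party j adds up the j-th share received from every site.
    partial = [sum(s[j] for s in all_shares) % PRIME for j in range(n)]
    return decode(sum(partial) % PRIME)

# Site 1 holds the first and third scores, site 2 the second and fourth (made-up values).
total = joint_total([[0.30, 1.0], [-0.12, 0.0]])
```

No party sees another party's individual scores here, only random-looking shares and partial sums; how the combined total is turned into a risk decision is left open, as in the text.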
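In one example, the first risk includes a supervised risk, that is, a risk for which, after the user performs a first behavior, a label indicating whether the first behavior has the first risk can be obtained; the feature information also relates to the user's behavior information. It can be understood that the first behavior may be a transaction and the first risk may be a misappropriation risk. Such risks are usually reported by users after the transaction occurs, so a label can be obtained.
In another example, the first risk includes an unsupervised risk, that is, a risk for which, after the user performs a second behavior, a label indicating whether the second behavior has the first risk cannot be obtained;
Jointly training the security tree model with the second site includes:
Obtaining a first sample set for the first risk, where the label of each sample in the first sample set is manually defined, or is determined based on the feature distribution of each feature in a high-risk feature set of the samples;
Using the first sample set to perform preliminary joint training of the security tree model with the second site, and re-determining the features included in the high-risk feature set;
Updating the label of each sample in the first sample set using the feature distribution of each feature in the re-determined high-risk feature set;
Jointly training the security tree model with the second site again based on the updated labels.
It can be understood that the second behavior may be a transaction and the first risk may be a marketing cheating risk or a fake transaction risk. Such risks are usually not reported by users after the transaction occurs, so no label can be obtained. The corresponding labels can instead be determined through manual annotation or feature identification.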
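The label-refinement loop described above can be sketched as follows. This is only an outline under strong assumptions: the joint training step is replaced by a single-process stand-in that ranks features by the gap between their means over positive and negative samples, labels are derived from a quantile cut on each high-risk feature, and all names (label_by_distribution, train_stand_in, refine) and parameters (q, top_k) are hypothetical.

```python
from statistics import mean

def label_by_distribution(samples, high_risk, q=0.9):
    """Label a sample 1 if any high-risk feature reaches its q-quantile cut."""
    cut = {}
    for f in high_risk:
        values = sorted(s[f] for s in samples)
        cut[f] = values[min(len(values) - 1, int(q * len(values)))]
    return [int(any(s[f] >= cut[f] for f in high_risk)) for s in samples]

def train_stand_in(samples, labels):
    """Stand-in for joint training: returns a per-feature importance score."""
    pos = [s for s, y in zip(samples, labels) if y == 1]
    neg = [s for s, y in zip(samples, labels) if y == 0]
    if not pos or not neg:
        return {f: 0.0 for f in samples[0]}
    return {f: abs(mean(s[f] for s in pos) - mean(s[f] for s in neg))
            for f in samples[0]}

def refine(samples, high_risk, rounds=1, top_k=2):
    labels = label_by_distribution(samples, high_risk)        # initial labels
    for _ in range(rounds):
        importance = train_stand_in(samples, labels)          # preliminary training
        high_risk = sorted(importance, key=importance.get, reverse=True)[:top_k]
        labels = label_by_distribution(samples, high_risk)    # update labels
    model = train_stand_in(samples, labels)                   # train again
    return high_risk, labels, model

# Made-up data: f1 grows with the index, f2 is a permutation of 0..9, f3 is constant.
samples = [{"f1": i, "f2": (i * 7) % 10, "f3": 5} for i in range(10)]
high_risk, labels, model = refine(samples, ["f1"])
```

With rounds=1 the loop follows the four steps in the text once; a larger value repeats the re-determination and relabeling before the final training.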
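In the method provided by the embodiments of this specification, the overall model is split into multiple sub-models, and the sub-models are deployed at the sites of the respective parties, so that the prediction results of the sub-models can be combined to obtain the final risk identification result. This ensures that the sites do not need to exchange users' private information and can prevent leakage of that information. In addition, not only is the model obtained through training split and deployed, but the preset risk identification strategy is split and deployed in the same way, which further prevents leakage of users' private information and improves the accuracy of risk identification.
The MPC in the embodiments of this specification may also be referred to as federated learning; specifically, the secure tree (secureboost) federated learning scheme may be adopted.
Fig. 3 shows a schematic diagram of an architecture for multi-party joint risk identification according to an embodiment. Referring to Fig. 3, the architecture includes a configuration layer, a definition layer and a deployment layer.
The configuration layer mainly consists of three parts. Tenant management provides management functions for data providers and data users, records tenants' operations on data, and synchronizes the records across the whole network. Variable management provides the source of each basic variable (which tenant it comes from) and its basic definition; online data connects to the real-time data interface at the site, and the offline part connects to the database at the site. Algorithm authorization provides the algorithm consensus part of federated learning. An algorithm based on the federated learning scheme has three steps: the first is offline training, in which model training is completed through the exchange of random numbers and intermediate parameters; the second is splitting the resulting model file and deploying the parts to the end nodes; the third is real-time or offline batch prediction on the end nodes. The algorithm scheme that is run (such as secureboost) must not only meet the security requirements but also obtain the consensus of every site (each site confirms its understanding that the algorithm does not transmit internal information outward). An algorithm on which consensus has been reached must be given a signature, and site data can only be run on algorithm components whose signature matches.
The definition layer produces algorithm files, including algorithm files obtained through model training and algorithm files defined by strategies.
The deployment layer deploys the algorithm files at the multiple parties to provide prediction services, and includes online deployment and offline deployment. A strategy is a set of logical operators joined by "and" and "or". By splitting on "and" and "or", the strategy can be converted into an ensemble-tree structure, so that the online and offline deployment links of the model can be reused. For example, the strategy (x1>a or x2>b) and y3>c can be converted into two trees, x1>a and y3>c, and x2>b and y3>c. In each tree, a condition that holds leads to the right (if a further "and" condition remains, the node splits again; otherwise it is recorded as leaf node 1), and a condition that does not hold leads to the left and is recorded as leaf node 0. The results of the two trees are summed; if the final result is greater than 0 the strategy is hit, otherwise it is not hit. After the conversion into a tree structure, the deployment link of the model can be used for multi-party scoring and prediction.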
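The leaf-and-sum rule described for the deployment layer can be sketched as below. For readability the sketch evaluates both trees in one place with all features available, which is an assumption; the further split of each tree into per-site sub-models is not shown. The thresholds and the names eval_tree and strategy_hit are hypothetical.

```python
def eval_tree(conditions, features):
    """Walk one conjunction tree.

    A condition that holds goes right to the next split; a condition that
    fails goes left to leaf 0. Passing the last condition reaches leaf 1.
    """
    for name, threshold in conditions:
        if not features[name] > threshold:
            return 0
    return 1

def strategy_hit(trees, features):
    """Sum the leaf values of all trees; a sum above 0 means the strategy is hit."""
    return sum(eval_tree(tree, features) for tree in trees) > 0

# (x1>a or x2>b) and y3>c with assumed thresholds a=10, b=5, c=1
trees = [
    [("x1", 10), ("y3", 1)],   # x1>a and y3>c
    [("x2", 5), ("y3", 1)],    # x2>b and y3>c
]
hit = strategy_hit(trees, {"x1": 12, "x2": 0, "y3": 2})
miss = strategy_hit(trees, {"x1": 12, "x2": 9, "y3": 0})
```

Because each tree contributes 0 or 1, the sum exceeds 0 exactly when at least one conjunction holds, which matches the original "or" of the strategy.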
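Fig. 4 shows a schematic diagram of an online deployment link according to an embodiment. Fig. 4 illustrates the federated learning process and the online scoring process of the multi-party model. A tree model is obtained through the exchange of random numbers and parameters and, after being split, is deployed on the prediction nodes of data domain A and data domain B. On the real-time risk control link, a real-time scoring request is sent to the prediction nodes on both sides, and each prediction node reads the corresponding features from the real-time feature interface. Each prediction node obtains sub-results on all the sub-models it holds, and the sub-results are aggregated at a prediction node to obtain the final score. The prediction node returns the final score to the requesting party.
Fig. 5 shows a schematic diagram of an offline deployment link according to an embodiment. Fig. 5 illustrates the link for offline batch runs and scheduled jobs after the trained model has been deployed on the end nodes. This link must be connected to the database at each site, and scores in batches the data that the database produces on a schedule. This function also provides a one-off scoring service for evaluating the performance of strategies and models.
Fig. 6 shows a schematic diagram of a strategy conversion process according to an embodiment. Referring to Fig. 6, after a strategy has been converted into trees, a splitting service splits the trees into sub-models, and the sub-models are deployed at the sites for prediction or offline scheduled scoring.
Fig. 7 shows a schematic diagram of a closed loop for multi-party model evolution according to an embodiment. Referring to Fig. 7, a closed-loop model evolution function is further proposed on the basis of federated-learning multi-party modeling. With it, the multi-party model system can not only identify supervised risk targets that have labels, but also identify unsupervised risks such as marketing cheating and fake transactions, so that the identification of supervised and unsupervised risks is covered in an integrated way. First, manually defined high-risk labels, together with the unsupervised risks identified by manually defined high-risk features, are used as labels to train a supervised model. The high-risk features are then further optimized according to the supervised model, and at this point manual experience can also be used to adjust the feature distributions of the high-risk features. The optimized high-risk features can in turn improve the precision of unsupervised risk identification. Through this closed loop, the security tree model can be iteratively optimized during offline training or modeling.
In summary, the risk control system based on federated learning can handle risks for which labels are returned, such as multi-party misappropriation risk and fraud risk, and can also prevent and control risks for which no labels are returned, such as marketing cheating and fake transactions. It supports models and is also compatible with the deployment of strategies. It provides both real-time prediction and offline scoring. On the model side, there is a complete model optimization process. Moreover, because the system is decentralized, the center has only management functions and stores no data. These functions can be opened to all institutions that join the data sharing, to manage each institution's variables and the algorithm functions each institution may use, and to provide different risk control services to different institutions.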
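According to an embodiment of another aspect, a device for multi-party joint risk identification is also provided. The multiple parties include a first site and a second site; the first site stores feature information in a first feature set of a user, the second site stores feature information in a second feature set of the user, and the feature information relates to the user's private information. The device is applied to the first site and is used to execute the method for multi-party joint risk identification provided by the embodiments of this specification. Fig. 8 shows a schematic block diagram of a device for multi-party joint risk identification according to an embodiment. As shown in Fig. 8, the device 800 includes:
A first acquisition unit 81, configured to acquire a first sub-model of a security tree model jointly trained with the second site; the security tree model also has a second sub-model deployed at the second site;
A second acquisition unit 82, configured to acquire a third sub-model obtained according to a tree structure corresponding to a preset risk identification strategy; the tree structure also has a fourth sub-model deployed at the second site;
A third acquisition unit 83, configured to acquire first feature data of each feature in the first feature set of a target user when it is determined that a preset risk identification condition is satisfied;
A prediction unit 84, configured to input the first feature data acquired by the third acquisition unit 83 into the first sub-model acquired by the first acquisition unit 81 to obtain a first prediction score, and into the third sub-model acquired by the second acquisition unit 82 to obtain a third prediction score;
A joint unit 85, configured to provide the first prediction score and the third prediction score obtained by the prediction unit 84 by way of MPC, so that they are combined with a second prediction score and a fourth prediction score to comprehensively determine whether the target user has a first risk; wherein the second prediction score is obtained by the second site using second feature data of each feature in the second feature set of the target user and the second sub-model, and the fourth prediction score is obtained by the second site using the second feature data and the fourth sub-model.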
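Optionally, as an embodiment, the first acquisition unit 81 is specifically configured to jointly train the security tree model with the second site by way of MPC to obtain the first sub-model of the security tree model.
Optionally, as an embodiment, the first acquisition unit 81 is specifically configured to receive a first model file corresponding to the first sub-model, where the first model file is a file split from the total model file of the security tree model obtained through joint training.
Optionally, as an embodiment, the determining that the preset risk identification condition is satisfied includes:
Receiving an evaluation request, where the evaluation request includes an identifier of the target user.
Optionally, as an embodiment, the determining that the preset risk identification condition is satisfied includes:
Receiving a batch processing request, where the target user is any user in the user set defined by the batch processing request.
Optionally, as an embodiment, the MPC includes:
One of homomorphic encryption and secret sharing.
Optionally, as an embodiment, the device further includes:
A determination unit, configured to, before the first acquisition unit 81 acquires the first sub-model of the security tree model jointly trained with the second site, determine the data exchange permission with the second site; and/or determine the feature information in the first feature set and the feature information in the second feature set; and/or determine that an algorithm consensus has been reached with the second site.
Optionally, as an embodiment, the device further includes:
A recording unit, configured to record the data exchanged with the second site when training jointly with the second site.
Optionally, as an embodiment, the first risk includes a supervised risk, that is, a risk for which, after the user performs a first behavior, a label indicating whether the first behavior has the first risk can be obtained; the feature information also relates to the user's behavior information.
Optionally, as an embodiment, the first risk includes an unsupervised risk, that is, a risk for which, after the user performs a second behavior, a label indicating whether the second behavior has the first risk cannot be obtained;
Jointly training the security tree model with the second site includes:
Obtaining a first sample set for the first risk, where the label of each sample in the first sample set is manually defined, or is determined based on the feature distribution of each feature in a high-risk feature set of the samples;
Using the first sample set to perform preliminary joint training of the security tree model with the second site, and re-determining the features included in the high-risk feature set;
Updating the label of each sample in the first sample set using the feature distribution of each feature in the re-determined high-risk feature set;
Jointly training the security tree model with the second site again based on the updated labels.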
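According to an embodiment of another aspect, a computer-readable storage medium is also provided, on which a computer program is stored; when the computer program is executed in a computer, the computer is caused to execute the method described with reference to Fig. 2.
According to an embodiment of yet another aspect, a computing device is also provided, including a memory and a processor, where executable code is stored in the memory, and the processor, when executing the executable code, implements the method described with reference to Fig. 2.
Those skilled in the art should appreciate that, in one or more of the above examples, the functions described in the present invention may be implemented in hardware, software, firmware or any combination thereof. When implemented in software, the functions may be stored in a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium.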
以上所述的具體實施方式,對本發明的目的、技術方案和有益效果進行了進一步詳細說明,所應理解的是,以上所述僅為本發明的具體實施方式而已,並不用於限定本發明的保護範圍,凡在本發明的技術方案的基礎之上,所做的任何修改、等同替換、改進等,均應包括在本發明的保護範圍之內。 The solutions provided in this specification will be described below in conjunction with the accompanying drawings. Fig. 1 is a schematic diagram of an implementation scene of an embodiment disclosed in this specification. This implementation scenario involves joint risk identification by multiple parties. With reference to Fig. 1, described multi-party comprises first site 11 and second site 12, and described first site 11 stores the feature information of user's first feature set, and described second site 12 stores user's second Feature information in the feature set. It can be understood that the feature information contained in the first feature set is different from that contained in the second feature set. For example, the first feature set includes feature 1 and feature 2, and the second feature set includes feature 3, feature 4 and feature 5, wherein the feature The information involves the user's private information, and it is particularly important to locate the user's personally identifiable information (PII) information, such as address, email, name, ID, etc. In this embodiment of the specification, based on a multi-party computing (MPC) approach, multiple parties jointly perform risk identification. Among them, the multi-party deployment involving strategies and models relies on the tree-based deployment form, which can prevent the leakage of users' private information. It should be noted that the embodiment of this specification only takes two parties to jointly carry out risk identification as an example, but in fact, multiple parties are not limited to two parties. For example, three parties, four parties or more parties may jointly carry out risk identification. Fig. 
2 shows a flow chart of a method for multi-party joint risk identification according to an embodiment, the method may be based on the implementation scenario shown in Fig. 1, the multi-party includes a first site and a second site, and the first site Store the feature information in the user's first feature set, the second site stores the feature information in the user's second feature set, the feature information relates to the user's privacy information, and the method is applied to the first site , it can be understood that when multiple parties jointly perform risk identification, the first site is any one of the multiple parties, and the processing process of the second site is similar to that of the first site, which will not be repeated here. As shown in Figure 2, in this embodiment, the method for multi-party joint risk identification includes the following steps: Step 21, obtaining the first sub-model of the safety tree model jointly trained with the second site; the safety tree model also Having a second sub-model deployed at the second site; step 22, obtaining a third sub-model obtained according to the tree structure corresponding to the preset risk identification strategy; the tree structure also has a second sub-model deployed at the second site the fourth sub-model; step 23, when it is determined that the preset risk identification condition is satisfied, obtain the first feature data of each feature in the first feature set of the target user; step 24, input the first feature data into the first A sub-model to obtain a first prediction score, and input the third sub-model to obtain a third prediction score; step 25, providing the first prediction score and the third prediction score through multi-party secure calculation MPC, Therefore, in combination with the second prediction score and the fourth prediction score, it is comprehensively determined whether the target user has the first risk; wherein, the second prediction score is the second 
feature of the target user used by the second site. The second feature data of each feature and the second sub-model are collected, and the fourth prediction score is obtained by the second site using the second feature data and the fourth sub-model. The specific execution manner of each of the above steps is described below. First at step 21, the first sub-model of the security tree model jointly trained with the second site is obtained; the security tree model also has a second sub-model deployed at the second site. It can be understood that the security tree model is a general model, which can be split into a first sub-model and a second sub-model, and the first sub-model and the second sub-model are deployed on the first site and the second site respectively point. In an example, the security tree model is jointly trained with the second site through MPC to obtain a first sub-model of the security tree model. It can be understood that the MPC method is to complete the relevant calculations of joint training by exchanging process parameters and random numbers while protecting data privacy and security and leaving the data out of the domain. In another example, a first model file corresponding to the first sub-model is received, and the first model file is a file split from a total model file of the security tree model obtained through joint training. In one example, before step 21, the method further includes: determining the data exchange authority with the second site; and/or, determining the feature information in the first feature set and the second site feature information in the feature set; and/or, determine that an algorithmic consensus has been reached with the second site. In an example, the method further includes: during joint training with the second site, recording interaction materials with the second site. 
Then in step 22, the third sub-model obtained according to the tree structure corresponding to the preset risk identification strategy is obtained; the tree structure also has a fourth sub-model deployed on the second site. It can be understood that the default risk identification strategy can be manually defined, for example, the default risk identification strategy is (x1>a or x2>b) and y3>c, which can be transformed into x1>a and y3>c and x2>b and y3>c two trees, each tree corresponds to a sub-model. In the embodiment of this specification, the preset risk identification strategy is also split into multiple sub-models, which are respectively deployed on multiple sites, which can prevent leakage of user's private information. Then in step 23, when it is determined that the preset risk identification condition is met, the first feature information of each feature in the first feature set of the target user is acquired. It can be understood that the preset risk identification condition, that is, the trigger condition, may be triggered after receiving a request, or may be triggered periodically. In an example, the determining that a preset risk identification condition is satisfied includes: receiving an evaluation request, where the evaluation request includes an identifier of the target user. In another example, the determining that a preset risk identification condition is met includes: receiving a batch processing request, and the target user is any user in a user set defined by the batch processing request. Then in step 24, input the first feature data into the first sub-model to obtain a first prediction score, and input it into the third sub-model to obtain a third prediction score. It can be understood that the first feature data is stored at the first site, and the first sub-model and the third sub-model are also deployed at the first site. The first feature data does not need to be shared, which can prevent leakage of user's private information. 
Finally, in step 25, the first prediction score and the third prediction score are provided through multi-party secure calculation MPC, so as to combine with the second prediction score and the fourth prediction score to comprehensively determine whether the target user has the first prediction score A risk; wherein, the second prediction score is obtained by the second site using the second feature data of each feature in the second feature set of the target user and the second sub-model, and the fourth prediction A score is obtained for the second site using the second feature data and the fourth sub-model. It is understandable that each party uses its own stored characteristic data of the target user to determine the corresponding prediction score, and then combines the prediction scores of multiple parties to determine whether the target user is at risk, which can prevent the leakage of the user's private information. In an example, the MPC includes: one of homomorphic encryption and secret sharing. In an example, the first risk includes a supervised risk, and the supervised risk is that after the user performs the first behavior, a label corresponding to the first behavior can be obtained whether the user has the first risk; the feature information It also involves user behavior information. It is understandable that the first behavior can be a transaction behavior, and the first risk can be a risk of misappropriation. Usually, such risks will be reported by users after the transaction behavior occurs, thereby obtaining a label. 
In another example, the first risk includes an unsupervised risk; the unsupervised risk is that the user cannot obtain the label corresponding to the second behavior whether he has the first risk after performing the second behavior; The joint training of the safety tree model at the second site includes: obtaining a first sample set for the first risk, the labels of each sample in the first sample set are manually defined, or are based on the high risk of each sample The feature distribution of each feature in the feature set is determined; using the first sample set, initially jointly training the safety tree model with the second station, and re-determining each feature included in the high-risk feature set feature; using the re-determined feature distribution of each feature in the high-risk feature set, update the label of each sample in the first sample set; based on the updated label, jointly train with the second station again The security tree model. It is understandable that the second behavior can be transaction behavior, and the first risk can be marketing cheating risk or false transaction risk. Usually, such risks will not be reported by users after the transaction behavior occurs, so that the label cannot be obtained. The corresponding label can be determined through manual labeling or feature recognition. 
The method provided in the embodiment of this specification divides the overall model into multiple sub-models, and deploys each sub-model on multiple sites, so that the final risk identification result can be obtained by combining the prediction results of each sub-model, ensuring It prevents sites from interacting with each other’s private information, which can prevent the leakage of user’s private information; in addition, not only will the model obtained through training be split and deployed, but also the default risk identification strategy will be split and deployed to further prevent leakage User's privacy information, and enhance the accuracy of risk identification. The MPC in the embodiment of this specification may also be called federated learning, specifically, a federated learning solution of a secure tree (secureboost) may be used. Fig. 3 shows a schematic diagram of an architecture of multi-party joint risk identification according to an embodiment. Referring to Figure 3, the architecture includes a configuration layer, a definition layer and a deployment layer. 
The configuration layer mainly consists of three parts: tenant management, which is used to provide the management function of the data provider and the user, and records the operation of the tenant on the data and synchronizes the whole network; variable management, which is used to provide the source of each basic variable (source Which tenant) and the basic definition, the online data is connected to the real-time interface of the data on the end, and the offline part is connected to the database on the end; the algorithm authorization is used to provide the algorithm consensus part of the federated learning, based on the federated learning program The algorithm is divided into three steps, the first is offline training, and the model training is completed through the interaction of random numbers and intermediate parameters; the second step is to split the obtained model files and deploy them to each end node; the third step is to Real-time or offline batch prediction on the end node. The running algorithm scheme (such as secureboost) not only needs to meet the security requirements, but also needs to obtain the consensus of all ends (make sure to understand the algorithm and not transmit internal information). The algorithm after the consensus needs to input a signature, and the terminal data intelligence runs on the algorithm components under the signature matching. The definition layer is used to generate algorithm files, including algorithm files obtained from model training and algorithm files defined by strategies. The deployment layer is used to deploy algorithm files in multiple parties to provide prediction services. Including online deployment and offline deployment. For strategies, it is some logical operators connected with and and or. By splitting and and or, the strategy can be transformed into an integrated tree structure to reuse the online and offline deployment links of the model. 
For example, the strategy (x1>a or x2>b) and y3>c can be transformed into two trees: x1>a and y3>c, and x2>b and y3>c. In each tree, if a condition is true, the path goes to the right (if a further and condition follows, the node continues to split; otherwise the node is recorded as leaf node 1); if the condition is not true, the path goes to the left and is recorded as leaf node 0. The results of the different trees are added together. If the final result is greater than 0, the strategy calls for an audit; otherwise, it does not. After conversion into a tree structure, the deployment links of the model can be used for multi-party scoring and prediction.
Fig. 4 shows a schematic diagram of an online deployment link according to an embodiment. Referring to Fig. 4, the federated learning process and the online scoring process of the multi-party model are shown. Through the exchange of random numbers and parameters, a tree model is obtained; after splitting, it is deployed on the prediction nodes of data domain A and data domain B. On the real-time risk control link, a real-time scoring request is sent to the prediction nodes on both sides, and each prediction node reads the corresponding features from the real-time feature interface. Each prediction node computes sub-results on all sub-models it holds; the sub-results are aggregated at a prediction node to obtain the final score, which is then returned to the requesting party.
Fig. 5 shows a schematic diagram of an offline deployment link according to an embodiment. Referring to Fig. 5, the link for offline batch runs and scheduled execution after the trained model is deployed on the end nodes is shown. This link needs to be connected to the database on the same end, and the data periodically output from the database is scored in batches. This function also provides a one-off scoring service for evaluating the effectiveness of strategies and models.
Fig. 6 shows a schematic diagram of a strategy conversion process according to an embodiment. Referring to Fig. 6, after the strategy is converted into trees, it is split into sub-models by the splitting service, and the sub-models are deployed on each end for prediction or for offline scheduling and scoring.
Fig. 7 shows a schematic diagram of a closed-loop evolution of a multi-party model according to an embodiment. Referring to Fig. 7, on the basis of multi-party modeling with federated learning, a closed loop for model evolution is further proposed. On this basis, the multi-party model system can not only identify supervised risk targets that have labels, but also identify unsupervised risks such as marketing cheating and false transactions, thereby integrating the identification of supervised risks and unsupervised risks. First, some manually defined high-risk labels, together with unsupervised risks identified by means of manually defined high-risk features, are used as labels to train a supervised model, and the high-risk features are further optimized according to the supervised model. Manual experience can also be incorporated at this stage to adjust the feature distributions of the high-risk features. The optimized high-risk features in turn improve the accuracy of unsupervised risk identification. Through this closed loop, the secure tree model can be iteratively optimized during the offline training or modeling phase.
In summary, the risk control system based on federated learning can not only address account misappropriation and fraud risks across multiple parties, but also prevent and control risks for which no labels flow back, such as marketing cheating and false transactions. It supports models and is also compatible with the deployment of strategies. It provides both real-time prediction and offline scoring. On the model side, it offers a complete model optimization process.
In addition, because the system is decentralized, the center provides only management functions and stores no data. These functions can be opened to all institutions that join the data sharing, in order to manage each institution's variables and the algorithm functions available to each institution, thereby providing risk control services to different institutions.
According to an embodiment of another aspect, an apparatus for multi-party joint risk identification is also provided. The multiple parties include a first site and a second site; the first site stores feature information in a first feature set of users, the second site stores feature information in a second feature set of users, and the feature information involves users' private information. The apparatus is applied to the first site and is used to implement the method for multi-party joint risk identification provided by the embodiments of this specification. Fig. 8 shows a schematic block diagram of an apparatus for multi-party joint risk identification according to an embodiment. As shown in Fig. 8, the apparatus 800 includes:
a first acquiring unit 81, configured to acquire a first sub-model of a secure tree model jointly trained with the second site, the secure tree model further having a second sub-model deployed at the second site;
a second acquiring unit 82, configured to acquire a third sub-model obtained from a tree structure corresponding to a preset risk identification strategy, the tree structure further having a fourth sub-model deployed at the second site;
a third acquiring unit 83, configured to acquire, when it is determined that a preset risk identification condition is met, first feature data of each feature in the first feature set of a target user;
a prediction unit 84, configured to input the first feature data acquired by the third acquiring unit 83 into the first sub-model acquired by the first acquiring unit 81 to obtain a first prediction score, and into the third sub-model acquired by the second acquiring unit 82 to obtain a third prediction score;
a joint unit 85, configured to provide, by means of secure multi-party computation (MPC), the first prediction score and the third prediction score obtained by the prediction unit 84, so that they are combined with a second prediction score and a fourth prediction score to comprehensively determine whether the target user has a first risk; wherein the second prediction score is obtained by the second site using second feature data of each feature in the second feature set of the target user and the second sub-model, and the fourth prediction score is obtained by the second site using the second feature data and the fourth sub-model.
Optionally, as an embodiment, the first acquiring unit 81 is specifically configured to jointly train the secure tree model with the second site by means of MPC to obtain the first sub-model of the secure tree model.
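How prediction scores might be provided by means of MPC without either site sending its raw scores can be sketched with additive secret sharing. This is only an illustration under stated assumptions: the text does not specify how the four scores are combined, so a plain sum is assumed, and the modulus, the fixed-point scale, the example scores and the decision threshold are all hypothetical values.

```python
import secrets

MODULUS = 2**61 - 1  # assumed modulus for the shares
SCALE = 10**6        # assumed fixed-point scale for real-valued scores

def encode(score):
    """Map a real-valued score to an integer modulo MODULUS."""
    return round(score * SCALE) % MODULUS

def decode(value):
    """Map an integer modulo MODULUS back to a (possibly negative) real value."""
    if value > MODULUS // 2:
        value -= MODULUS
    return value / SCALE

def share(value):
    """Split a value into two random-looking shares that add up to it modulo MODULUS."""
    mask = secrets.randbelow(MODULUS)
    return mask, (value - mask) % MODULUS

def combine(site_a_scores, site_b_scores):
    """Sum the scores of both sites while each site only ever sees shares of the other's input."""
    a_keep, a_send = share(encode(sum(site_a_scores)))  # site A: first and third scores
    b_keep, b_send = share(encode(sum(site_b_scores)))  # site B: second and fourth scores
    a_partial = (a_keep + b_send) % MODULUS             # computed locally at site A
    b_partial = (b_keep + a_send) % MODULUS             # computed locally at site B
    return decode((a_partial + b_partial) % MODULUS)    # only the total is reconstructed

total = combine([0.30, 1.0], [0.45, 0.0])
has_first_risk = total > 1.0  # hypothetical decision threshold
```

In this simplified form the reconstructed total is visible to both sites, and with only two sites the total together with a site's own input reveals the other site's aggregate score; a fuller protocol would also perform the threshold comparison on shares.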
Optionally, as an embodiment, the first acquiring unit 81 is specifically configured to receive a first model file corresponding to the first sub-model, the first model file being a file split from the overall model file of the secure tree model obtained through joint training.
Optionally, as an embodiment, determining that the preset risk identification condition is met includes: receiving an evaluation request, where the evaluation request includes an identifier of the target user.
Optionally, as an embodiment, determining that the preset risk identification condition is met includes: receiving a batch processing request, where the target user is any user in the user set defined by the batch processing request.
Optionally, as an embodiment, the MPC includes one of homomorphic encryption and secret sharing.
Optionally, as an embodiment, the apparatus further includes a determining unit, configured to, before the first acquiring unit 81 acquires the first sub-model of the secure tree model jointly trained with the second site: determine the data interaction permission with the second site; and/or determine the feature information in the first feature set and the feature information in the second feature set; and/or determine that algorithm consensus has been reached with the second site.
Optionally, as an embodiment, the apparatus further includes a recording unit, configured to record the data exchanged with the second site during joint training with the second site.
Optionally, as an embodiment, the first risk includes a supervised risk, the supervised risk being one for which, after the user performs a first behavior, a label indicating whether the first behavior has the first risk can be obtained; the feature information also involves the user's behavior information.
Optionally, as an embodiment, the first risk includes an unsupervised risk, the unsupervised risk being one for which, after the user performs a second behavior, a label indicating whether the second behavior has the first risk cannot be obtained. Jointly training the secure tree model with the second site includes: acquiring a first sample set for the first risk, where the label of each sample in the first sample set is manually defined, or is determined based on the feature distribution of each feature in a high-risk feature set of the samples; using the first sample set to perform initial joint training of the secure tree model with the second site, and re-determining the features included in the high-risk feature set; using the feature distribution of each feature in the re-determined high-risk feature set to update the label of each sample in the first sample set; and, based on the updated labels, jointly training the secure tree model with the second site again.
According to an embodiment of another aspect, a computer-readable storage medium is also provided, on which a computer program is stored; when the computer program is executed in a computer, it causes the computer to execute the method described in conjunction with Fig. 2.
According to an embodiment of yet another aspect, a computing device is also provided, including a memory and a processor, where executable code is stored in the memory, and when the processor executes the executable code, the method described in conjunction with Fig. 2 is implemented.
Those skilled in the art should be aware that, in the one or more examples above, the functions described in the present invention may be implemented by hardware, software, firmware or any combination thereof. When implemented in software, the functions may be stored on a computer-readable medium, or transmitted as one or more instructions or code on a computer-readable medium.
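As a software illustration of the relabeling loop described above for unsupervised risk, the following minimal single-machine sketch may help. It makes several assumptions that are not in the text: labels are derived with a 0.9 quantile cutoff on each high-risk feature, the joint training of the secure tree model is replaced by a simple stand-in that ranks features by how well they separate the two label classes, and the feature names and data are synthetic.

```python
from statistics import mean, pstdev

def label_samples(samples, high_risk, quantile=0.9):
    """Label a sample 1 if any high-risk feature lies in the upper tail of its distribution."""
    cutoffs = {}
    for feature in high_risk:
        values = sorted(s[feature] for s in samples)
        cutoffs[feature] = values[int(quantile * (len(values) - 1))]
    return [int(any(s[f] > cutoffs[f] for f in high_risk)) for s in samples]

def importance(samples, labels, feature):
    """Stand-in for model training: class separation of one feature, in standard deviations."""
    pos = [s[feature] for s, y in zip(samples, labels) if y == 1]
    neg = [s[feature] for s, y in zip(samples, labels) if y == 0]
    if not pos or not neg:
        return 0.0
    spread = pstdev([s[feature] for s in samples]) or 1.0
    return abs(mean(pos) - mean(neg)) / spread

def closed_loop(samples, high_risk, rounds=2, keep=2):
    """Label, train, re-determine the high-risk features, relabel, and train again."""
    labels = label_samples(samples, high_risk)
    for _ in range(rounds):
        ranked = sorted(samples[0], key=lambda f: importance(samples, labels, f), reverse=True)
        high_risk = ranked[:keep]                      # re-determined high-risk feature set
        labels = label_samples(samples, high_risk)     # labels updated from the new set
    return high_risk, labels

# Synthetic samples: night_ratio is partly aligned with amount, noise is unrelated.
samples = [
    {"amount": i, "night_ratio": ((i + 3) % 100) / 100, "noise": (i * 37) % 100}
    for i in range(100)
]
initial_labels = label_samples(samples, ["amount"])
high_risk, labels = closed_loop(samples, ["amount"])
```

In this toy run the loop adds `night_ratio` to the high-risk feature set and the updated labels cover a few samples that the initial `amount`-only labeling missed; in the described system the ranking step would instead come from the jointly trained secure tree model.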
The specific embodiments described above further explain the purpose, technical solutions and beneficial effects of the present invention in detail. It should be understood that the above descriptions are only specific embodiments of the present invention and are not intended to limit the protection scope of the present invention. Any modification, equivalent replacement, improvement, etc. made on the basis of the technical solutions of the present invention shall be included in the protection scope of the present invention.