TWI726341B - Sample attribute evaluation model training method, device, server and storage medium - Google Patents

Sample attribute evaluation model training method, device, server and storage medium Download PDF

Info

Publication number
TWI726341B
TWI726341B (application number TW108122547A)
Authority
TW
Taiwan
Prior art keywords
sample
black
community
samples
unknown
Prior art date
Application number
TW108122547A
Other languages
Chinese (zh)
Other versions
TW202011285A (en)
Inventor
王修坤
趙婷婷
劉斌
Original Assignee
開曼群島商創新先進技術有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 開曼群島商創新先進技術有限公司 filed Critical 開曼群島商創新先進技術有限公司
Publication of TW202011285A publication Critical patent/TW202011285A/en
Application granted granted Critical
Publication of TWI726341B publication Critical patent/TWI726341B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Financial Or Insurance-Related Operations Such As Payment And Settlement (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

本說明書實施例提供了一種樣本屬性評估方法,首先確定訓練樣本,該訓練樣本中僅包括少量已確認屬性的黑樣本,還有大部分未確認屬性的未知樣本。基於訓練樣本對應的關係圖,確定每個社區的黑樣本濃度,結合社區黑樣本濃度以及半監督機器學習演算法,即使黑樣本數量較少,本實施例中的方法也可以從未知樣本中挖掘潛在黑樣本,進而確定模型訓練所需要的白樣本,達到模型訓練要求,使得訓練出的模型能夠準確地對樣本是否屬於黑樣本的屬性進行評估。 The embodiments of this specification provide a sample attribute evaluation method. First, training samples are determined; they include only a small number of black samples with confirmed attributes, while most are unknown samples with unconfirmed attributes. Based on the relationship graph corresponding to the training samples, the black sample concentration of each community is determined. By combining the community black sample concentration with a semi-supervised machine learning algorithm, even when the number of black samples is small, the method can mine potential black samples from the unknown samples and then determine the white samples required for model training, meeting the training requirements so that the trained model can accurately evaluate whether a sample is a black sample.

Description

樣本屬性評估模型訓練方法、裝置、伺服器及儲存媒體 Sample attribute evaluation model training method, device, server and storage medium

本說明書實施例涉及網際網路技術領域,尤其涉及一種樣本屬性評估模型訓練方法、裝置及伺服器。 The embodiments of this specification relate to the field of Internet technology, and in particular to a training method, device, and server for a sample attribute evaluation model.

隨著網際網路的快速發展,越來越多的業務可以通過網路實現,如線上支付、線上購物、線上保險理賠等網際網路業務。網際網路在給人們生活提供便利的同時,也帶來了風險。不法人員可能會進行電子業務詐欺,給其它使用者造成損失。對於龐大的業務樣本集而言,明確屬性為黑的風險黑樣本數量較少,大部分是未知屬性的樣本,由於業務詐欺資料樣本具有隱藏性,所以,為了能夠提升整體風控能力,亟需設計一種能夠基於少量已知黑樣本訓練得到能夠準確對未知樣本進行屬性評估的方案。 With the rapid development of the Internet, more and more services can be delivered online, such as online payment, online shopping, and online insurance claims. While the Internet brings convenience to people's lives, it also brings risks. Unscrupulous actors may commit electronic business fraud and cause losses to other users. In a large business sample set, the number of risk samples whose attribute is explicitly black is small, and most samples have unknown attributes. Because business fraud data samples are well hidden, in order to improve overall risk control capability there is an urgent need to design a scheme that can be trained on a small number of known black samples and still accurately evaluate the attributes of unknown samples.

本說明書實施例提供一種樣本屬性評估方法、裝置及伺服器。 The embodiments of this specification provide a sample attribute evaluation method, device, and server.

第一態樣,本說明書實施例提供一種樣本屬性評估方法,包括:確定與訓練樣本對應的關係圖中每個社區的黑樣本濃度,其中,所述訓練樣本包括黑樣本和未知樣本;基於所述每個社區的黑樣本濃度,確定每個所述未知樣本的白樣本抽樣概率,以每個所述未知樣本的白樣本抽樣概率進行抽樣,獲得白樣本;基於半監督機器學習演算法對所述黑樣本與所述白樣本進行訓練,獲得目標樣本屬性評估模型。 In the first aspect, an embodiment of this specification provides a sample attribute evaluation method, including: determining the black sample concentration of each community in the relationship graph corresponding to training samples, where the training samples include black samples and unknown samples; determining the white sample sampling probability of each unknown sample based on the black sample concentration of each community, and sampling with the white sample sampling probability of each unknown sample to obtain white samples; and training the black samples and the white samples based on a semi-supervised machine learning algorithm to obtain a target sample attribute evaluation model.

第二態樣,本說明書實施例提供一種樣本屬性評估模型訓練裝置,包括:第一確定單元,用於確定與訓練樣本對應的關係圖中每個社區的黑樣本濃度,其中,所述訓練樣本包括黑樣本和未知樣本;第二確定單元,用於基於所述每個社區的黑樣本濃度,確定每個所述未知樣本的白樣本抽樣概率,以每個所述未知樣本的白樣本抽樣概率進行抽樣,獲得白樣本;訓練單元,用於基於半監督機器學習演算法對所述黑樣本與所述白樣本進行訓練,獲得目標樣本屬性評估模型。 In the second aspect, an embodiment of this specification provides a sample attribute evaluation model training device, including: a first determining unit for determining the black sample concentration of each community in the relationship graph corresponding to training samples, where the training samples include black samples and unknown samples; a second determining unit for determining the white sample sampling probability of each unknown sample based on the black sample concentration of each community, and sampling with the white sample sampling probability of each unknown sample to obtain white samples; and a training unit for training the black samples and the white samples based on a semi-supervised machine learning algorithm to obtain a target sample attribute evaluation model.

第三態樣,本說明書實施例提供一種伺服器,包括記憶體、處理器及儲存在記憶體上並可在處理器上運行的電腦程式,所述處理器執行所述程式時實現上述任一項所述樣本屬性評估方法的步驟。 In the third aspect, an embodiment of this specification provides a server, including a memory, a processor, and a computer program stored in the memory and runnable on the processor; when the processor executes the program, the steps of any one of the above sample attribute evaluation methods are implemented.

第四態樣,本說明書實施例提供一種電腦可讀儲存媒體,其上儲存有電腦程式,該程式被處理器執行時實現上述任一項所述樣本屬性評估方法的步驟。 In a fourth aspect, an embodiment of this specification provides a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, the steps of any one of the above-mentioned sample attribute evaluation methods are implemented.

本說明書實施例有益效果如下:本說明書實施例中,通過確定訓練樣本,該訓練樣本中僅包括少量已確認屬性的黑樣本,還有大部分未確認屬性的未知樣本。基於訓練樣本對應的關係圖,確定每個社區的黑樣本濃度,結合社區黑樣本濃度以及半監督機器學習演算法,即使已知黑樣本數量較少,本實施例中的方法也可以從未知樣本中挖掘潛在黑樣本,進而確定模型訓練所需要的白樣本,達到模型訓練要求,使得訓練出的模型能夠準確地對樣本是否屬於黑樣本的屬性進行評估。 The beneficial effects of the embodiments of this specification are as follows: in the embodiments of this specification, training samples are determined that include only a small number of black samples with confirmed attributes, while most are unknown samples with unconfirmed attributes. Based on the relationship graph corresponding to the training samples, the black sample concentration of each community is determined. By combining the community black sample concentration with a semi-supervised machine learning algorithm, even when the number of known black samples is small, the method in this embodiment can mine potential black samples from the unknown samples and then determine the white samples required for model training, meeting the training requirements so that the trained model can accurately evaluate whether a sample is a black sample.

為了更好的理解上述技術方案,下面通過圖式以及具體實施例對本說明書實施例的技術方案做詳細的說明,應當理解本說明書實施例以及實施例中的具體特徵是對本說明書實施例技術方案的詳細的說明,而不是對本說明書技術方案的限定,在不衝突的情況下,本說明書實施例以及實施例中的技術特徵可以相互組合。 請參見圖1,為本說明書實施例的樣本屬性評估應用場景示意圖。終端100位於使用者側,與網路側的伺服器200通信。使用者可通過終端100中的APP或網站產生即時事件,一些業務資料。伺服器200收集各個終端產生的即時事件,即可挑選出訓練樣本。本說明書實施例可應用於風險樣本識別或保險理賠中騙保樣本識別等風控場景,也可以應用於二分類的分類場景。 第一態樣,本說明書實施例提供一種樣本屬性評估方法,請參考圖2,包括步驟S201-S203。 S201:確定與訓練樣本對應的關係圖中每個社區的黑樣本濃度,其中,所述訓練樣本包括黑樣本和未知樣本; S202:基於所述每個社區的黑樣本濃度,確定每個所述未知樣本的白樣本抽樣概率,以每個所述未知樣本的白樣本抽樣概率進行抽樣,獲得白樣本; S203:基於半監督機器學習演算法對所述黑樣本與所述白樣本進行訓練,獲得目標樣本屬性評估模型。 具體的,在本實施例中,首先通過步驟S201確定訓練樣本,訓練樣本如前述所示,可以是各個終端側產生的業務資料,訓練樣本中包括已經標記好屬性的黑樣本,還包括未知屬性的未知樣本。例如:在保險理賠場景中,訓練樣本為申請理賠的使用者的相關資料,其中,確定騙保使用者對應的樣本為黑樣本,保險理賠場景中已定騙保事實的黑樣本較少,缺乏大量黑樣本標記,從而導致樣本屬性評估模型精准度大大折扣,如何解決這種場景下的模型訓練問題是非常重要工作。本實施例中的方法,可以結合樣本的社區屬性與半監督機器學習演算法,來從大量未知樣本中挖掘潛在的黑樣本,達到模型訓練所需要的黑樣本數量,過濾得到信任度較高的白樣本,訓練時確保了黑樣本和白樣本的純度,從而完成模型訓練,得到精度較高的樣本屬性評估模型。 進一步,再通過步驟S201確定與訓練樣本對應的關係圖中每個社區的黑樣本濃度。 具體的,在本實施例中,需要預先構建包括訓練樣本的關係圖。具體的,每個訓練樣本對應一個節點,構建的關係圖中可以僅包括訓練樣本對應的節點,還可以是全網節點對應的關係圖。 圖的構建過程可以是獲取各節點在預定時間段內的歷史事件,基於歷史事件,按預設構圖方法確定關係圖,採用預設社區發現演算法對關係圖中的節點進行社區劃分,其中,每個節點對應有該節點所屬的社區標籤。其中,預設時間段可以預先指定,預設構圖方法需要定義以下各個內容:節點的定義,邊的定義以及邊的權重值的定義。 本實施例也不限制具體的構圖規則。不同的場景、不同實現中可以採用不同的構圖規則。舉例而言,在保險理賠場景中,預設構圖方法可以是:以使用者為點,若在半年內兩個使用者有過金融交易(如:轉帳),則將兩個使用者連接起來,邊的權重可以是兩個使用者轉帳的次數。 具體的,在本實施例中,針對上述構建的關係圖上給運行一個或多個預設社區發現演算法,這樣,每一個點得到一個該節點所屬社區的社區標籤。預設社區探索方法可以是標籤傳播演算法(LPA,Label Propagation Algorithm),也可以是快速折疊演算法(FU,Fast Unfolding)等等,在此,本申請不做限制。 其中,標籤傳播演算法流程簡述如下: Step1:圖上的每一個點都以自己點id作為自己的標籤; Step2:每一個點都從自己的鄰居那獲取各鄰居標籤; Step3:每一個點收到來自所有鄰居的標籤之後,將收到標籤中出現最多的作為自己的標籤(如果有權圖則是權重和最高的那個)。如果出現標籤數相同多的標籤,則在這些出現最多的標籤中任選一個作為自己的標籤; Step4:將每個點上的標籤作為自己的社區標籤輸出。 Step3:重複Step2直到所有點都不發生變化; Step4:將Step3得到的每一個社區當成點,重複Step2直到所有社區不發生變化; Step5:將每個點上的標籤作為自己的社區標籤輸出。 在對關係圖劃分好社區後,即可計算得到每個社區的黑樣本濃度,每個社區的黑樣本濃度的確定方式包括但不限於以下三種: 第一種:確定每個社區中所有黑樣本對應節點在該社區總節點中的第一占比,將所述第一占比作為該社區的黑樣本濃度。 第二種:確定每個社區中所有黑樣本對應節點在所述關係圖中總節點中的第二占比,將所述第一占比作為該社區的黑樣本濃度。 第三種:確定每個社區中所有黑樣本對應節點在該社區總節點中的第三占比,以及該社區總節點在所述關係圖中的總節點中的第四占比,獲得所述第三占比與所述第四占比的加權平均值,將所述加權平均值作為該社區的黑樣本濃度。 具體的,在本實施例中,採用第一種方式,黑樣本濃度可以定義為社區內的黑樣本個數除以社區節點總數。例如:社區A內總共包括5個節點,其中,有一個節點是黑樣本對應的節點,這樣,可計算得到社區A的黑樣本濃度為1/5,該社區內所有節點的黑樣本濃度均為1/5。 當然,還可以通過整個關係圖規模,採用上述第二種方式定義。例如:社區A內總共包括5個節點,其中,有一個節點是黑樣本對應的節點,關係圖中包括10個節點,這樣,可計算得到社區A的黑樣本濃度為1/10,該社區內所有節點的黑樣本濃度均為1/10。 當然,可以結合社區規模以及黑樣本在社區中的占比兩個維度來設定,採用上述第三種方式定義。例如:社區A內總共包括5個節點,其中,有一個節點是黑樣本對應的節點,關係圖中包括10個節點,這樣,可計算得到社區A的黑樣本濃度為K1*1/5+K2*5/10,其中,K1與K2表示加權係數,可根據實際需要繼續進行設定,則該社區內所有節點的黑樣本濃度均為0.2K1+0.5k2。在具體實施過程中,黑樣本濃度的定義方式可根據實際需要進行設定,在此,本申請不做限制。 在確定好各個樣本的黑樣本濃度後,通過步驟S202,基於每個社區的黑樣本濃度,確定每個未知樣本的白樣本抽樣概率,以每個所述未知樣本的白樣本抽樣概率進行抽樣,獲得白樣本。 本實施例的方法,在從未知樣本中挖掘潛在黑樣本時,結合社區屬性,即使小部分未知樣本在當前時刻沒有體現出來黑樣本真實特徵,結合社區特性,進行深度挖掘,可以真實的擴大黑樣本比例,達到模型訓練要求。具體的,在本實施例中,針對訓練樣本中的每個未知樣本,可以根據該樣本的黑樣本濃度確定該樣本的白樣本抽樣概率。比如:如果未知樣本1位於社區A,該社區A的黑樣本濃度為1/5,所以,未知樣本1的黑樣本濃度為1/5,未知樣本1的白樣本抽樣概率P1=1-1/5=4/5。進一步,在初始的第一次訓練時,可以將每個未知樣本的白樣本抽樣概率設定為固定值。比如:有100個未知樣本,在第一次訓練時可將每個未知樣本的白樣本抽樣概率設定為1/100。在後續的多輪訓練中再結合未知樣本的黑樣本濃度確定該未知樣本的白樣本抽樣概率。 這樣,在確定好各個未知樣本的白樣本抽樣概率後,可以對未知樣本按各自的白樣本抽樣概率進行白樣本抽樣,確定抽取到的白樣本,然後,通過步驟S203結合已經標記屬性的黑樣本,基於半監督機器學習演算法對黑樣本與白樣本進行訓練,獲得樣本屬性評估模型。具體實現可包括如下步驟: 基於半監督機器學習演算法對黑樣本與白樣本進行訓練,獲得樣本屬性評估模型; 判斷樣本屬性評估模型是否滿足預設收斂條件; 如果否,更新每個社區的黑樣本濃度,基於更新後的每個社區的黑樣本濃度與半監督機器學習演算法繼續訓練,直至訓練得到的樣本屬性評估模型滿足預設收斂條件,將滿足預設收斂條件的樣本屬性評估模型作為目標樣本屬性評估模型。 本實施例中,採用半監督機器學習演算法包括半監督(Positive and Unlabeled Learning,正樣本和無標記學習)機器學習演算法,它是一種半監督學習的機器學習演算法,是指用於訓練機器學習模型的訓練樣本中,僅部分訓練樣本是有標記樣本,而其餘的訓練樣本為無標記樣本,利用無標記樣本來輔助有標記樣本的學習過程。應用於建模一方收集到的訓練樣本中只有少量有標記的黑樣本,其餘的樣本均為無標記的未知樣本,針對有標記的正樣本和無標記樣本的機器學習過程。 在構建好黑樣本和白樣本後,可以基於半監督機器學習演算法對這些訓練樣本進行訓練,來構建樣本屬性評估模型。對於半監督機器學習演算法而言,通常可以包含多種機器學習策略。例如:半監督機器學習演算法包含典型的機器學習策略,包括兩階段法(two-stage strategy)和代價敏感法(cost-sensitive 
strategy)兩類。所謂兩階段法,演算法首先基於已知的正樣本和無標記樣本,在無標記樣本中挖掘發現潛在的可靠負樣本,然後基於已知的正樣本和挖掘出來的可靠負樣本,將問題轉化為傳統的有監督的機器學習的過程,來訓練分類模型。 而對於代價敏感的策略而言,演算法假設無標記樣本中正樣本的比例極低,通過直接將無標記樣本看作負樣本對待,為正樣本設置一個相對於負樣本更高的代價敏感權重。例如,通常會在基於代價敏感的半監督機器學習演算法的目標方程中,為與正樣本對應的損失函數,設置一個更高的代價敏感權重。通過給正樣本設置更高的代價敏感權重,使得最終訓練出的分類模型分錯一個正樣本的代價遠遠大於分錯一個負樣本的代價,如此一來,可以直接通過利用正樣本和無標記樣本(當作負樣本)學習一個代價敏感的分類器,來對未知的樣本進行分類。在本實施例中,既可以基於代價敏感的半監督機器學習演算法對上述訓練樣本進行訓練,也可以採用兩階段法對上述訓練樣本進行訓練。在具體實施過程中,可根據需要進行設定,在此,本申請不做限制。 在本實施例中主要以兩階段的半監督機器學習演算法進行詳細介紹。以保險理賠場景為例,假設上述訓練樣本集中的黑樣本被標記為1,表示該樣本為已知的騙保的保險資料,白樣本標記為0,表示該訓練樣本對應保險資料是正常的。在對黑樣本和基於白樣本抽樣概率抽樣出的白樣本進行二分類模型訓練後,得到樣本屬性評估模型,然後再採用該樣本屬性評估模型對未知樣本進行評估,得到每個未知樣本被標記為黑樣本的黑樣本評分,該黑樣本評分為一個範圍在0~1的數值,表明未知樣本屬於黑樣本的概率。當然,還可以以其他方式定義黑樣本白樣本以及對應的黑樣本評分,在此,本申請不做限制。 按照這樣的方式對訓練樣本進行多輪訓練,每輪訓練後得到對應的樣本屬性評估模型,需要判斷該樣本屬性評估模型是否滿足預設收斂條件,如果模型收斂,則將該輪訓練得到的樣本屬性評估模型作為最終的目標樣本屬性評估模型。如果模型還沒有收斂,則更新每個未知樣本的黑樣本濃度後繼續按照前述方式進行訓練,直至訓練得到的模型達到收斂條件。 在本實施例中,判斷模型是否收斂可以通過如下步驟實現: 基於樣本屬性評估模型對每個未知樣本進行評估,獲得每個未知樣本的本輪屬性評估結果,共計獲得M個本輪屬性評估結果,M為未知樣本的個數; 基於M個本輪屬性評估結果與M個上一輪屬性評估結果,判斷樣本屬性評估模型是否滿足預設收斂條件。 其中,基於樣本屬性評估模型對每個未知樣本進行評估,獲得每個未知樣本的本輪屬性評估結果,包括: 基於樣本屬性評估模型對每個未知樣本進行評估,獲得每個未知樣本的黑樣本評分,如果黑樣本評分值大於預設分值,將該未知樣本的屬性資訊標記為黑樣本,其中,每個未知樣本的本輪屬性評估結果中包括該未知樣本的屬性資訊。 具體的,在本實施例中,在每輪訓練得到該輪訓練對應的樣本屬性評估模型,利用該模型對每個未知樣本的黑樣本評分,可以根據評分對該未知樣本進行標記。具體的,如果黑樣本評分值大於預設分值,將該未知樣本的屬性資訊標記為黑樣本。舉例來說,預設分值設定為0.8,未知樣本1的黑樣本評分為0.9,將該未知樣本1的屬性資訊標記為黑樣本。未知樣本2的黑樣本評分為0.4,將該未知樣本2的屬性資訊保持不變,還是未知屬性樣本。通過這樣的方式,可以確定出每個未知樣本在該輪訓練中的屬性評估結果,該評估結果中可包括該未知樣本在本輪訓練中的黑樣本評分和屬性資訊。未知樣本個數為M,則得到M個本輪屬性評估結果。 進而,還可獲得未知樣本在上一輪訓練對應的屬性評估結果,即M個上一輪屬性評估結果。通過M個本輪屬性評估結果與M個上一輪屬性評估結果,即可判斷樣本屬性評估模型是否滿足預設收斂條件,具體可通過如下步驟實現: 判斷每個未知樣本的本輪屬性評估結果中的屬性資訊與該未知樣本的上一輪屬性評估結果中的屬性資訊是否一致,如果是,表明本輪樣本屬性評估模型滿足預設收斂條件。 具體的,在本實施例中,通過M個上一輪屬性評估結果中每個評估結果中的屬性資訊,確定在上一輪訓練中被標記為黑樣本包括哪些未知樣本,以及通過M個本輪屬性評估結果中每個評估結果中的屬性資訊,確定在本輪訓練中被標記為黑樣本包括哪些未知樣本。如果上一輪被標記為黑樣本的未知樣本與本輪被標記為黑樣本的未知樣本一致,表明每個未知樣本的標記已經沒有變化,模型達到收斂。舉例而言,上一輪中訓練中未知樣本中的黑樣本包括未知樣本1、未知樣本2、未知樣本5、未知樣本10。本輪訓練中的黑樣本也包括未知樣本1、未知樣本2、未知樣本5、未知樣本10,表明未知樣本沒有變化,本輪訓練出的樣本屬性評估模型已達到收斂。將該輪訓練得到的樣本屬性評估模型作為目標樣本屬性評估模型。 進一步,判定模型是否達到收斂的預設收斂條件可以根據實際需要進行設定,上述示例只是具體實現的一種示例,並不對本申請構成限定。例如:還可以設定為如果上一輪被標記為黑樣本的未知樣本與本輪被標記為黑樣本的未知樣本一致的未知樣本數量占比達到預設占比,表明每個未知樣本的標記已經沒有變化,模型達到收斂。 進一步,如果確定本輪訓練得到的樣本屬性評估模型不滿足預設收斂條件,則表明模型還沒有收斂,需要進行下一輪訓練。在進行下一輪訓練之前,由於標記的黑樣本相對於上一輪訓練發生了變化,所以,需要根據標記的黑樣本對社區的黑樣本濃度進行更新,進而對每個未知樣本的黑樣本濃度進行更新,具體實現可包括如下步驟: 基於M個本輪屬性評估結果與M個上一輪屬性評估結果,確定屬性資訊發生變化的未知樣本;重新計算與屬性資訊發生變化的未知樣本對應的社區的黑樣本濃度。 具體的,在本實施例中,基於本輪訓練對應的每個未知樣本的屬性評估結果與上一輪訓練對應的每個未知樣本的屬性評估結果,可以定位出哪些未知樣本的屬性資訊發生變化,該變化可以是由黑樣本屬性變更為未知屬性,還可以是由未知屬性變更為黑樣本屬性。進而定位到產生屬性變化的未知樣本所在社區,重新計算該社區的黑樣本濃度,根據該社區更新後的黑樣本濃度,更新該社區對應節點的白樣本抽樣概率。舉例來說,社區A包括未知樣本1、未知樣本2、黑樣本1對應的節點。上一輪訓練中社區A中的所有節點屬性均為發生改變,社區A中每個節點對應的黑樣本濃度為1/3。此輪訓練中將未知樣本1標記為黑樣本,其餘節點屬性未發生變化,將社區A中每個節點對應的黑樣本濃度更新為2/3。這樣,未知樣本1、未知樣本2對應的白樣本抽樣概率均為1/3。 按照這樣的方式,可以更新未知節點的白樣本抽樣概率,然後重複執行前述按各個節點的白樣本抽樣概率的白樣本抽樣得到白樣本,結合抽樣得到的白樣本與已知黑樣本,基於半監督機器學習演算法進行下一輪的樣本屬性評估模型的訓練,直至訓練得到的樣本屬性評估模型達到上述預設收斂條件。 進而,通過前述方式訓練得到目標樣本屬性評估模型,可用該模型對新進樣本進行樣本屬性評估,確定新進樣本的評估結果,其中,評估結果中包括該新進樣的黑樣本評分和/或屬性資訊。具體的,本實施例中採用已知黑樣本和篩選出潛在黑樣本後的剩餘的信任度較高的白樣本進行模型訓練,得到的目標樣本屬性評估模型的評估精度較高,可以對新進樣本進行樣本屬性評估,評估結果可以包括前述的黑樣本評分,表明該新進樣本屬於黑樣本的概率。評估結果也可以包括該新進樣本的屬性資訊,例如:該新進樣本的黑樣本評分為0.9,大於預設分值0.8,確定該新進樣本的屬性資訊為黑樣本。通過該評估結果,相關人員即可及時獲知該新進樣本的屬性,及時進行風險調控。 進一步,在本實施例中,由於節點間的關係會隨著時間發生變化,所以,可以按照預設時間間隔對前述實施例中的關係圖進行更新,對更新後的關係圖重新進行社區劃分,得到對應社區的黑樣本濃度,重新進行樣本屬性評估模型的訓練,以使得模型能夠按預設時間間隔更新。 本實施例中的方法可以應用於保險理賠場景,訓練樣本為申請理賠人員的保險資料,黑樣本為已知的騙保人員的保險資料,通過前述方式獲得騙保評估模型,新進的申請理賠人員的相關保險資料輸入該騙保評估模型後即可得到該新進的申請理賠人員屬於騙保人員的評估得分或是否為騙保的屬性。這樣,保險公司相關人員就可以根據這樣的評估結果對疑似騙保人員進行後續相關審查,避免了不必要的財產損失。 第二態樣,基於同一發明構思,本說明書實施例提供一種樣本屬性評估模型訓練裝置,請參考圖3,包括: 第一確定單元301,用於確定與訓練樣本對應的關係圖中每個社區的黑樣本濃度,其中,所述訓練樣本包括黑樣本和未知樣本; 
第二確定單元302,用於基於所述每個社區的黑樣本濃度,確定每個所述未知樣本的白樣本抽樣概率,以每個所述未知樣本的白樣本抽樣概率進行抽樣,獲得白樣本; 訓練單元303,用於基半監督機器學習演算法對所述黑樣本與所述白樣本進行訓練,獲得目標樣本屬性評估模型。 在一種可選實現方式中,所述訓練單元303具體用於: 基於半監督機器學習演算法對所述黑樣本與所述白樣本進行訓練,獲得樣本屬性評估模型; 判斷所述樣本屬性評估模型是否滿足預設收斂條件; 如果否,更新所述每個社區的黑樣本濃度,基於更新後的每個社區的黑樣本濃度與所述半監督機器學習演算法繼續訓練,直至訓練得到的樣本屬性評估模型滿足所述預設收斂條件,將滿足所述預設收斂條件的樣本屬性評估模型作為目標樣本屬性評估模型。 在一種可選實現方式中,所述訓練單元303具體用於: 基於所述樣本屬性評估模型對每個所述未知樣本進行評估,獲得每個所述未知樣本的本輪屬性評估結果,共計獲得M個本輪屬性評估結果,M為未知樣本的個數; 基於所述M個本輪屬性評估結果與M個上一輪屬性評估結果,判斷所述樣本屬性評估模型是否滿足預設收斂條件。 在一種可選實現方式中,所述訓練單元303具體用於: 基於所述樣本屬性評估模型對每個所述未知樣本進行評估,獲得每個所述未知樣本的黑樣本評分,如果黑樣本評分值大於預設分值,將該未知樣本的屬性資訊標記為黑樣本,其中,每個所述未知樣本的本輪屬性評估結果中包括該未知樣本的屬性資訊。 在一種可選實現方式中,所述訓練單元303具體用於: 判斷每個未知樣本的本輪屬性評估結果中的屬性資訊與該未知樣本的上一輪屬性評估結果中的屬性資訊是否一致,如果是,表明所述本輪樣本屬性評估模型滿足所述預設收斂條件。 在一種可選實現方式中,所述訓練單元303具體用於: 基於所述M個本輪屬性評估結果與M個上一輪屬性評估結果,確定屬性資訊發生變化的未知樣本; 重新計算與所述屬性資訊發生變化的未知樣本對應的社區的黑樣本濃度。 在一種可選實現方式中,所述裝置還包括評估單元,所述評估單元具體用於: 在所述將滿足所述預設收斂條件的樣本屬性評估模型作為目標樣本屬性評估模型之後,根據目標樣本屬性評估模型,對新進樣本進行評估,確定所述新進樣本的評估結果,其中,所述評估結果中包括該新進樣的黑樣本評分和/或屬性資訊。 在一種可選實現方式中,所述訓練樣本為申請理賠人員對應的保險資料,所述黑樣本為騙保人員對應保險資料。 在一種可選實現方式中,所述第一確定單元具體用於: 確定每個社區中所有黑樣本對應節點在該社區總節點中的第一占比,將所述第一占比作為該社區的黑樣本濃度;或 確定每個社區中所有黑樣本對應節點在所述關係圖中總節點中的第二占比,將所述第一占比作為該社區的黑樣本濃度;或 確定每個社區中所有黑樣本對應節點在該社區總節點中的第三占比,以及該社區總節點在所述關係圖中的總節點中的第四占比,獲得所述第三占比與所述第四占比的加權平均值,將所述加權平均值作為該社區的黑樣本濃度。 第三態樣,基於與前述實施例中樣本屬性評估方法同樣的發明構思,本發明還提供一種伺服器,如圖4所示,包括記憶體404、處理器402及儲存在記憶體404上並可在處理器402上運行的電腦程式,所述處理器402執行所述程式時實現前文所述樣本屬性評估方法的任一方法的步驟。 其中,在圖4中,匯流排架構(用匯流排400來代表),匯流排400可以包括任意數量的互聯的匯流排和橋,匯流排400將包括由處理器402代表的一個或多個處理器和記憶體404代表的記憶體的各種電路鏈接在一起。匯流排400還可以將諸如週邊設備、穩壓器和功率管理電路等之類的各種其他電路鏈接在一起,這些都是本領域所公知的,因此,本文不再對其進行進一步描述。匯流排界面406在匯流排400和接收器401和發送器403之間提供介面。接收器401和發送器403可以是同一個元件,即收發機,提供用於在傳輸媒體上與各種其他裝置通信的單元。處理器402負責管理匯流排400和通常的處理,而記憶體404可以被用於儲存處理器402在執行操作時所使用的資料。 第四態樣,基於與前述實施例中樣本屬性評估的發明構思,本發明還提供一種電腦可讀儲存媒體,其上儲存有電腦程式,該程式被處理器執行時實現前文所述樣本屬性評估的方法的任一方法的步驟。 本說明書是參照根據本說明書實施例的方法、設備(系統)、和電腦程式產品的流程圖和/或方塊圖來描述的。應理解可由電腦程式指令實現流程圖和/或方塊圖中的每一流程和/或方塊、以及流程圖和/或方塊圖中的流程和/或方塊的結合。可提供這些電腦程式指令到通用電腦、專用電腦、嵌入式處理機或其他可編程資料處理設備的處理器以產生一個機器,使得通過電腦或其他可編程資料處理設備的處理器執行的指令產生用於實現在流程圖一個流程或多個流程和/或方塊圖一個方塊或多個方塊中指定的功能的設備。 這些電腦程式指令也可儲存在能引導電腦或其他可編程資料處理設備以特定方式工作的電腦可讀記憶體中,使得儲存在該電腦可讀記憶體中的指令產生包括指令設備的製造品,該指令設備實現在流程圖一個流程或多個流程和/或方塊圖一個方塊或多個方塊中指定的功能。 這些電腦程式指令也可裝載到電腦或其他可編程資料處理設備上,使得在電腦或其他可編程設備上執行一系列操作步驟以產生電腦實現的處理,從而在電腦或其他可編程設備上執行的指令提供用於實現在流程圖一個流程或多個流程和/或方塊圖一個方塊或多個方塊中指定的功能的步驟。 儘管已描述了本說明書的較佳實施例,但本領域內的技術人員一旦得知了基本進步性概念,則可對這些實施例作出另外的變更和修改。所以,所附申請專利範圍意欲解釋為包括較佳實施例以及落入本說明書範圍的所有變更和修改。 顯然,本領域的技術人員可以對本說明書進行各種改動和變型而不脫離本說明書的精神和範圍。這樣,倘若本說明書的這些修改和變型屬於本說明書申請專利範圍及其等同技術的範圍之內,則本說明書也意圖包含這些改動和變型在內。In order to better understand the above technical solutions, the technical solutions of the embodiments of this specification are described in detail below through the drawings and specific embodiments. It should be understood that the embodiments of this specification and the specific features in the embodiments are of the technical solutions of the embodiments of this specification. The detailed description is not a limitation on the technical solution of this specification. The embodiments of this specification and the technical features in the embodiments can be combined with each other if there is no conflict. Please refer to FIG. 1, which is a schematic diagram of an application scenario of sample attribute evaluation according to an embodiment of this specification. The terminal 100 is located on the user side and communicates with the server 200 on the network side. The user can generate real-time events and some business data through the APP or website in the terminal 100. 
The server 200 collects real-time events generated by each terminal, and then selects training samples. The embodiments of this specification can be applied to risk control scenarios such as identifying risk samples or identifying fraud samples in insurance claims, and can also be applied to two-class classification scenarios. In the first aspect, the embodiment of this specification provides a sample attribute evaluation method. Please refer to FIG. 2, which includes steps S201-S203. S201: Determine the black sample concentration of each community in the relationship graph corresponding to the training sample, where the training sample includes the black sample and the unknown sample; S202: Determine the white sample sampling probability of each unknown sample based on the black sample concentration of each community, and perform sampling with the white sample sampling probability of each unknown sample to obtain a white sample; S203: Training the black sample and the white sample based on a semi-supervised machine learning algorithm to obtain an attribute evaluation model of the target sample. Specifically, in this embodiment, the training samples are first determined in step S201. The training samples may be business data generated by each terminal as shown above. The training samples include black samples with marked attributes and unknown attributes. Of unknown samples. For example: in the insurance claim scenario, the training samples are the relevant data of the user applying for the claim. Among them, the sample corresponding to the user who is determined to be a fraudulent user is a black sample. In the insurance claim scenario, there are fewer black samples that have deceived insurance facts. A large number of black samples are marked, and the accuracy of the sample attribute evaluation model is greatly reduced. How to solve the model training problem in this scenario is a very important work. The method in this embodiment can combine the community attributes of the samples with the semi-supervised machine learning algorithm to mine potential black samples from a large number of unknown samples to reach the number of black samples required for model training, and filter to obtain high trustworthiness For white samples, the purity of black and white samples is ensured during training, thereby completing model training and obtaining a sample attribute evaluation model with higher accuracy. Further, step S201 is used to determine the black sample concentration of each community in the relationship graph corresponding to the training sample. Specifically, in this embodiment, it is necessary to construct a relationship graph including training samples in advance. Specifically, each training sample corresponds to a node, and the constructed relationship graph may only include nodes corresponding to the training samples, or it may be a relationship graph corresponding to nodes in the entire network. The construction process of the graph can be to obtain the historical events of each node in a predetermined time period, based on the historical events, determine the relationship graph according to the preset composition method, and use the preset community discovery algorithm to divide the nodes in the relationship graph into communities, where, Each node corresponds to the community label to which the node belongs. 
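By way of illustration only (the patent does not prescribe any particular library or data format), a minimal Python sketch of the example composition rule above might look as follows. The `transfer_events` list is a hypothetical input of (user, user, transfer count) tuples aggregated over the preset time window, and networkx's built-in label-propagation helper stands in for the community discovery step:

```python
# Hedged sketch: build the relationship graph from pairwise transfer counts
# (users as nodes, transfer count as edge weight), then attach a community label
# to every node. The input data and the choice of networkx are assumptions.
import networkx as nx

transfer_events = [
    ("user_1", "user_2", 3),   # user_1 transferred to user_2 three times in the window
    ("user_2", "user_5", 1),
    ("user_3", "user_4", 7),
    ("user_4", "user_5", 2),
]

graph = nx.Graph()
for a, b, count in transfer_events:
    graph.add_edge(a, b, weight=count)

# One possible community-discovery step; the patent mentions LPA or Fast Unfolding.
# Note this particular helper ignores edge weights, which is a simplification here.
communities = nx.algorithms.community.label_propagation_communities(graph)
community_of = {node: cid for cid, nodes in enumerate(communities) for node in nodes}
print(community_of)
```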
Among them, the preset time period can be specified in advance, and the preset composition method needs to define the following contents: the definition of the node, the definition of the edge, and the definition of the weight value of the edge. This embodiment also does not limit specific composition rules. Different composition rules can be used in different scenarios and different implementations. For example, in an insurance claims scenario, the default composition method can be: take the user as the point, if two users have had financial transactions (such as transfers) within half a year, then connect the two users. The weight of the edge can be the number of transfers between two users. Specifically, in this embodiment, one or more preset community discovery algorithms are run on the relationship graph constructed above, so that each point gets a community label of the community to which the node belongs. The preset community exploration method can be Label Propagation Algorithm (LPA), Fast Unfolding Algorithm (FU), etc., and this application does not limit it here. Among them, the process of tag propagation algorithm is briefly described as follows: Step1: Each point on the graph uses its own point id as its own label; Step2: Each point gets its neighbor label from its neighbor; Step3: After each point receives the labels from all neighbors, it will receive the label with the most occurrences as its own label (if there is a right map, it will be the one with the highest weight and the highest). If there are tags with the same number of tags, choose one of the tags with the most occurrences as your own tag; Step4: Output the label on each point as its own community label. Step3: Repeat Step2 until all points do not change; Step4: Regard each community obtained in Step3 as a point, repeat Step2 until all communities do not change; Step5: Output the label on each point as its own community label. After dividing the communities in the relationship diagram, the black sample concentration of each community can be calculated. The black sample concentration of each community can be determined in the following three ways: The first method is to determine the first proportion of the nodes corresponding to all black samples in each community in the total nodes of the community, and use the first proportion as the black sample concentration of the community. The second method is to determine the second proportion of the nodes corresponding to all black samples in each community in the total nodes in the relationship graph, and use the first proportion as the black sample concentration of the community. The third type: determine the third proportion of the nodes corresponding to all black samples in each community in the total nodes of the community, and the fourth proportion of the total nodes in the community in the total nodes in the relationship graph, and obtain the A weighted average of the third proportion and the fourth proportion, and the weighted average is used as the black sample concentration of the community. Specifically, in this embodiment, using the first method, the black sample concentration can be defined as the number of black samples in the community divided by the total number of community nodes. For example: Community A includes a total of 5 nodes, of which one node is the node corresponding to the black sample. In this way, the black sample concentration of community A can be calculated to be 1/5, and the black sample concentration of all nodes in the community is 1/5. 
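The label propagation steps listed above can also be written out directly. The following rough sketch is illustrative rather than part of the specification; it follows Step 1 to Step 4 for a weighted graph and, for simplicity, breaks ties deterministically instead of at random:

```python
# Illustrative weighted label propagation, following the Step 1 - Step 4 outline above.
# Each node starts with its own id as its label and repeatedly adopts the label with
# the largest total neighbour edge weight, until no label changes.
from collections import defaultdict

def label_propagation(adjacency, max_rounds=100):
    """adjacency: dict node -> dict(neighbour -> edge weight)."""
    labels = {node: node for node in adjacency}               # Step 1
    for _ in range(max_rounds):
        changed = False
        for node, neighbours in adjacency.items():
            if not neighbours:
                continue
            weight_per_label = defaultdict(float)             # Step 2: collect neighbour labels
            for nbr, w in neighbours.items():
                weight_per_label[labels[nbr]] += w
            best = max(weight_per_label, key=weight_per_label.get)  # Step 3 (ties: deterministic)
            if best != labels[node]:
                labels[node] = best
                changed = True
        if not changed:                                        # stop when labels are stable
            break
    return labels                                              # Step 4: community label per node

# Example: two small, loosely connected groups of users.
adj = {
    "u1": {"u2": 3, "u3": 1}, "u2": {"u1": 3, "u3": 2}, "u3": {"u1": 1, "u2": 2, "u4": 1},
    "u4": {"u3": 1, "u5": 4}, "u5": {"u4": 4},
}
print(label_propagation(adj))
```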
Of course, you can also use the second method above to define the size of the entire relationship graph. For example: Community A includes a total of 5 nodes, of which one node is the node corresponding to the black sample, and the relationship graph includes 10 nodes. In this way, the black sample concentration of community A can be calculated to be 1/10. The black sample concentration of all nodes is 1/10. Of course, it can be set in combination with the size of the community and the proportion of black samples in the community, using the third method described above. For example: Community A includes a total of 5 nodes, of which one node is the node corresponding to the black sample, and the relationship graph includes 10 nodes. In this way, the black sample concentration of Community A can be calculated as K1*1/5+K2 *5/10, where K1 and K2 represent weighting coefficients, which can be set according to actual needs, and the black sample concentration of all nodes in the community is 0.2K1+0.5k2. In the specific implementation process, the definition of the black sample concentration can be set according to actual needs, and this application is not limited here. After the black sample concentration of each sample is determined, through step S202, based on the black sample concentration of each community, the white sample sampling probability of each unknown sample is determined, and sampling is performed based on the white sample sampling probability of each unknown sample. Obtain a white sample. In the method of this embodiment, when mining potential black samples from unknown samples, combined with community attributes, even if a small part of the unknown samples do not reflect the true characteristics of the black samples at the current moment, in-depth mining can be carried out in combination with community characteristics, which can truly expand the black samples. The sample ratio meets the requirements of model training. Specifically, in this embodiment, for each unknown sample in the training sample, the white sample sampling probability of the sample can be determined according to the black sample concentration of the sample. For example: if unknown sample 1 is located in community A, the black sample concentration of this community A is 1/5, so the black sample concentration of unknown sample 1 is 1/5, and the white sample sampling probability of unknown sample 1 is P1=1-1/ 5=4/5. Further, in the initial first training, the white sample sampling probability of each unknown sample can be set to a fixed value. For example, if there are 100 unknown samples, the white sample probability of each unknown sample can be set to 1/100 during the first training. In the subsequent rounds of training, the black sample concentration of the unknown sample is combined to determine the white sample sampling probability of the unknown sample. In this way, after the white sample sampling probabilities of each unknown sample are determined, the unknown samples can be sampled with white samples according to their respective white sample sampling probabilities, the white samples drawn are determined, and then the black samples with marked attributes are combined through step S203 , Based on the semi-supervised machine learning algorithm, the black samples and white samples are trained to obtain the sample attribute evaluation model. 
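For illustration, steps S201 and S202 as just described could be sketched in isolation as follows, using the first concentration definition. The sample and community identifiers are placeholders; the 1/5 concentration and 4/5 sampling probability reproduce the worked example above:

```python
# Hedged sketch of S201/S202: community black-sample concentration (blacks / community size)
# and white-sample sampling probability = 1 - concentration, followed by the sampling step.
import random

community_members = {"A": ["s1", "s2", "s3", "s4", "s5"], "B": ["s6", "s7", "s8"]}
black_samples = {"s3"}                        # one confirmed black sample in community A

black_concentration = {
    cid: sum(1 for s in members if s in black_samples) / len(members)
    for cid, members in community_members.items()
}                                             # community A: 1/5, community B: 0

white_prob = {
    s: 1.0 - black_concentration[cid]
    for cid, members in community_members.items()
    for s in members if s not in black_samples
}                                             # e.g. unknown sample s1: 1 - 1/5 = 4/5

random.seed(0)
white_samples = [s for s, p in white_prob.items() if random.random() < p]
print(white_samples)
```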
The specific implementation may include the following steps: Train black samples and white samples based on semi-supervised machine learning algorithms to obtain sample attribute evaluation models; Determine whether the sample attribute evaluation model meets the preset convergence conditions; If not, update the black sample concentration of each community, and continue training based on the updated black sample concentration of each community and the semi-supervised machine learning algorithm until the trained sample attribute evaluation model meets the preset convergence conditions, which will meet the pre-set Set the sample attribute evaluation model of the convergence condition as the target sample attribute evaluation model. In this embodiment, the semi-supervised machine learning algorithm includes semi-supervised (Positive and Unlabeled Learning, positive sample and unlabeled learning) machine learning algorithm, which is a kind of semi-supervised learning machine learning algorithm, which refers to training Among the training samples of the machine learning model, only part of the training samples are labeled samples, while the rest of the training samples are unlabeled samples. Unlabeled samples are used to assist the learning process of labeled samples. Among the training samples collected by the modeling party, there are only a small number of labeled black samples, and the remaining samples are unlabeled unknown samples. The machine learning process is aimed at labeled positive samples and unlabeled samples. After constructing the black samples and white samples, these training samples can be trained based on a semi-supervised machine learning algorithm to build a sample attribute evaluation model. For semi-supervised machine learning algorithms, multiple machine learning strategies can usually be included. For example, semi-supervised machine learning algorithms include typical machine learning strategies, including two-stage strategy and cost-sensitive strategy. The so-called two-stage method, the algorithm is based on the known positive samples and unlabeled samples, and discovers the potential reliable negative samples in the unlabeled samples, and then based on the known positive samples and the mined reliable negative samples, the problem is transformed Train the classification model for the traditional supervised machine learning process. For cost-sensitive strategies, the algorithm assumes that the proportion of positive samples in unlabeled samples is extremely low, and by directly treating unlabeled samples as negative samples, a higher cost-sensitive weight is set for positive samples relative to negative samples. For example, usually in the target equation based on cost-sensitive semi-supervised machine learning algorithms, a higher cost-sensitive weight is set for the loss function corresponding to the positive sample. By setting a higher cost-sensitive weight for the positive samples, the cost of the final trained classification model for dividing a positive sample is far greater than the cost of dividing a negative sample. In this way, you can directly use positive samples and unlabeled The samples (as negative samples) learn a cost-sensitive classifier to classify unknown samples. In this embodiment, the above training samples can be trained based on a cost-sensitive semi-supervised machine learning algorithm, or a two-stage method can be used to train the above training samples. 
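As an aside, the cost-sensitive strategy described above might be approximated as in the sketch below, assuming scikit-learn is available. The synthetic features, the weight of 20 for positive (black) samples, and the label encoding are illustrative assumptions rather than values fixed by the specification:

```python
# Hedged illustration of the cost-sensitive PU strategy: unlabeled samples are treated as
# negatives, and the few known positives (black samples) receive a larger loss weight.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                # placeholder features
y = np.zeros(200, dtype=int)
y[:10] = 1                                   # only a few confirmed black samples

# Mis-classifying a positive costs much more than mis-classifying an "assumed negative".
sample_weight = np.where(y == 1, 20.0, 1.0)

clf = LogisticRegression(max_iter=1000)
clf.fit(X, y, sample_weight=sample_weight)
black_score = clf.predict_proba(X)[:, 1]     # probability of belonging to the black class
```

The higher weight on positives is what makes the model reluctant to misclassify a known black sample even though the unlabeled pool is treated as negative; the exact weight would be tuned per scenario.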
In the specific implementation process, it can be set according to needs, and this application does not limit it here. In this embodiment, a two-stage semi-supervised machine learning algorithm is mainly used for detailed introduction. Taking an insurance claim scenario as an example, suppose that the black sample in the above training sample set is marked as 1, which means that the sample is known fraudulent insurance data, and the white sample is marked as 0, which means that the insurance data corresponding to the training sample is normal. After training the black sample and the white sample sampled based on the sampling probability of the white sample, the binary classification model is obtained to obtain the sample attribute evaluation model, and then the sample attribute evaluation model is used to evaluate the unknown samples, and each unknown sample is marked as The black sample score of the black sample. The black sample score is a value ranging from 0 to 1, indicating the probability that the unknown sample belongs to the black sample. Of course, the black sample, white sample and the corresponding black sample score can also be defined in other ways, and this application is not limited here. In this way, the training samples are trained in multiple rounds. After each round of training, the corresponding sample attribute evaluation model is obtained. It is necessary to determine whether the sample attribute evaluation model meets the preset convergence conditions. If the model converges, the sample obtained from this round of training The attribute evaluation model is used as the final target sample attribute evaluation model. If the model has not converged, update the black sample concentration of each unknown sample and continue training in the aforementioned manner until the trained model reaches the convergence condition. In this embodiment, judging whether the model converges can be achieved through the following steps: Evaluate each unknown sample based on the sample attribute evaluation model, obtain the current round of attribute evaluation results for each unknown sample, and obtain a total of M current round of attribute evaluation results, where M is the number of unknown samples; Based on M current-round attribute evaluation results and M last-round attribute evaluation results, it is judged whether the sample attribute evaluation model meets the preset convergence condition. Among them, each unknown sample is evaluated based on the sample attribute evaluation model, and the current round of attribute evaluation results for each unknown sample are obtained, including: Each unknown sample is evaluated based on the sample attribute evaluation model, and the black sample score of each unknown sample is obtained. If the black sample score value is greater than the preset score, the attribute information of the unknown sample is marked as a black sample. The attribute information of the unknown sample is included in the current round of attribute evaluation results of the unknown sample. Specifically, in this embodiment, the sample attribute evaluation model corresponding to the round of training is obtained in each round of training, and the black sample of each unknown sample is scored by using the model, and the unknown sample can be marked according to the score. Specifically, if the score value of the black sample is greater than the preset score, the attribute information of the unknown sample is marked as a black sample. 
For example, the default score is set to 0.8, the black sample score of the unknown sample 1 is 0.9, and the attribute information of the unknown sample 1 is marked as a black sample. The black sample score of the unknown sample 2 is 0.4, and the attribute information of the unknown sample 2 remains unchanged, and it is still an unknown attribute sample. In this way, the attribute evaluation result of each unknown sample in this round of training can be determined, and the evaluation result can include the black sample score and attribute information of the unknown sample in this round of training. If the number of unknown samples is M, then M results of this round of attribute evaluation are obtained. Furthermore, the attribute evaluation results corresponding to the unknown sample in the previous round of training can also be obtained, that is, M attribute evaluation results of the previous round. According to the M results of this round of attribute evaluation and the M results of the previous round of attribute evaluation, it can be determined whether the sample attribute evaluation model meets the preset convergence conditions, which can be achieved through the following steps: Determine whether the attribute information in the current round of attribute evaluation results of each unknown sample is consistent with the attribute information in the previous round of attribute evaluation results of the unknown sample. If so, it indicates that the current round of sample attribute evaluation model meets the preset convergence condition. Specifically, in this embodiment, through the attribute information in each evaluation result of the M last round of attribute evaluation results, it is determined which unknown samples are included in the black samples marked as black samples in the previous round of training, and the M current round attributes are passed. The attribute information in each evaluation result in the evaluation result determines which unknown samples are included as black samples in this round of training. If the unknown samples marked as black samples in the previous round are consistent with the unknown samples marked as black samples in this round, it indicates that the label of each unknown sample has not changed, and the model has reached convergence. For example, the black samples in the unknown samples in the previous round of training include unknown sample 1, unknown sample 2, unknown sample 5, and unknown sample 10. The black samples in this round of training also include unknown sample 1, unknown sample 2, unknown sample 5, and unknown sample 10, indicating that the unknown sample has not changed, and the sample attribute evaluation model trained in this round has reached convergence. The sample attribute evaluation model obtained in this round of training is used as the target sample attribute evaluation model. Further, the preset convergence condition for determining whether the model reaches convergence can be set according to actual needs. The foregoing example is only an example of specific implementation, and does not constitute a limitation to the application. For example: it can also be set as if the unknown samples marked as black samples in the previous round are consistent with the unknown samples marked as black samples in this round, the proportion of unknown samples reached the preset proportion, indicating that each unknown sample has no mark Change, the model reaches convergence. 
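The convergence test described above amounts to comparing the set of unknown samples marked black in the current round with the set from the previous round. A small illustrative helper, with the relaxed proportion-based variant expressed here as an assumed overlap ratio, might look like this:

```python
# Hedged sketch of the convergence check: converged when the black-marked sets of two
# consecutive rounds match, or (relaxed variant, an assumption) when their overlap ratio
# reaches a preset proportion.
def has_converged(prev_black_ids, curr_black_ids, min_overlap=1.0):
    """prev_black_ids / curr_black_ids: sets of unknown-sample ids marked black."""
    if not prev_black_ids and not curr_black_ids:
        return True
    overlap = len(prev_black_ids & curr_black_ids)
    union = len(prev_black_ids | curr_black_ids)
    return overlap / union >= min_overlap

print(has_converged({"s1", "s2", "s5"}, {"s1", "s2", "s5"}))              # True: marks unchanged
print(has_converged({"s1", "s2"}, {"s1", "s2", "s7"}, min_overlap=0.8))   # relaxed check: False
```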
Further, if it is determined that the sample attribute evaluation model obtained in the current round of training does not meet the preset convergence condition, it indicates that the model has not yet converged, and the next round of training is required. Before the next round of training, since the marked black samples have changed from the previous round of training, it is necessary to update the black sample concentration of the community based on the marked black samples, and then update the black sample concentration of each unknown sample , The specific implementation can include the following steps: Based on the M results of this round of attribute evaluation and the M results of the last round of attribute evaluation, determine the unknown sample whose attribute information has changed; recalculate the black sample concentration of the community corresponding to the unknown sample whose attribute information has changed. Specifically, in this embodiment, based on the attribute evaluation result of each unknown sample corresponding to the current round of training and the attribute evaluation result of each unknown sample corresponding to the previous round of training, it is possible to locate which unknown samples’ attribute information has changed. The change can be from a black sample attribute to an unknown attribute, or from an unknown attribute to a black sample attribute. It then locates the community where the unknown sample that caused the attribute change is located, recalculates the black sample concentration of the community, and updates the white sample probability of the node corresponding to the community based on the updated black sample concentration of the community. For example, community A includes nodes corresponding to unknown sample 1, unknown sample 2, and black sample 1. In the last round of training, the attributes of all nodes in community A have changed, and the black sample concentration corresponding to each node in community A is 1/3. In this round of training, the unknown sample 1 is marked as a black sample, and the attributes of the remaining nodes have not changed, and the black sample concentration corresponding to each node in the community A is updated to 2/3. In this way, the sampling probability of the white samples corresponding to the unknown sample 1 and the unknown sample 2 are both 1/3. In this way, the white sample sampling probability of the unknown node can be updated, and then the white sample sampling according to the white sample sampling probability of each node can be repeated to obtain the white sample. The white sample obtained by the sampling and the known black sample are combined, based on semi-supervised The machine learning algorithm performs the next round of training of the sample attribute evaluation model until the sample attribute evaluation model obtained by training reaches the above-mentioned preset convergence condition. Furthermore, the target sample attribute evaluation model is obtained by training in the foregoing manner, and the model can be used to evaluate the sample attribute of the newly-injected sample to determine the evaluation result of the newly-injected sample, wherein the evaluation result includes the black sample score and/or attribute information of the newly-injected sample . 
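Putting the pieces together, the following self-contained toy sketch mirrors the described control flow on synthetic data: recompute community black-sample concentrations, sample white samples with probability one minus the concentration, train a binary classifier, re-mark unknown samples whose black score exceeds the preset value, and stop when the marks no longer change. The classifier choice, the feature construction, and the decision to let newly marked samples join the black side of the next round are illustrative assumptions, not requirements of the specification:

```python
# Hedged end-to-end sketch of the iterative procedure on toy data; only the control flow
# mirrors the described steps, all concrete values are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
n = 60
features = rng.normal(size=(n, 4))
features[:15] += 2.0                                   # samples 0-14 behave "riskier"
known_black = set(range(5))                            # a few confirmed black samples
unknown = set(range(5, n))
community_of = {i: i // 10 for i in range(n)}          # toy communities of 10 samples each

marked_black, model = set(), None
for _ in range(20):
    # 1) black-sample concentration per community (method 1: blacks / community size)
    conc = {}
    for cid in set(community_of.values()):
        members = [i for i, c in community_of.items() if c == cid]
        blacks = sum(1 for i in members if i in known_black or i in marked_black)
        conc[cid] = blacks / len(members)
    # 2) white-sample sampling probability = 1 - community concentration, then sample
    candidates = [i for i in unknown if i not in marked_black]
    white = [i for i in candidates if rng.random() < 1.0 - conc[community_of[i]]]
    if not white:                                      # unlikely here; skip this round if so
        continue
    # 3) train a binary classifier on black (1) vs. sampled white (0)
    train_ids = sorted(known_black | marked_black) + white
    labels = np.array([1] * len(known_black | marked_black) + [0] * len(white))
    model = LogisticRegression(max_iter=1000).fit(features[train_ids], labels)
    # 4) score unknown samples; mark those above the preset score as black
    scores = model.predict_proba(features[sorted(unknown)])[:, 1]
    new_marked = {i for i, s in zip(sorted(unknown), scores) if s > 0.8}
    if new_marked == marked_black:                     # convergence: marks unchanged
        break
    marked_black = new_marked

print("potential black samples mined from unknowns:", sorted(marked_black))
```

In a real deployment the classifier and features would come from the business data described earlier, and the relaxed proportion-based convergence criterion mentioned above could replace the strict set equality used here.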
Specifically, in this embodiment, the known black samples and the remaining white samples with higher trustworthiness after screening out potential black samples are used for model training, and the obtained target sample attribute evaluation model has a higher evaluation accuracy, which can be used for new samples. Perform sample attribute evaluation, and the evaluation result may include the aforementioned black sample score, indicating the probability that the new sample belongs to the black sample. The evaluation result may also include the attribute information of the new sample. For example, the black sample score of the new sample is 0.9, which is greater than a preset score of 0.8, and it is determined that the attribute information of the new sample is a black sample. Through the evaluation results, relevant personnel can learn the attributes of the new sample in time and conduct risk control in a timely manner. Further, in this embodiment, since the relationship between nodes will change over time, the relationship graph in the foregoing embodiment can be updated at a preset time interval, and the updated relationship graph can be re-divided into communities. Obtain the black sample concentration of the corresponding community, and retrain the sample attribute evaluation model so that the model can be updated at a preset time interval. The method in this embodiment can be applied to insurance claims scenarios. The training sample is the insurance information of claimants, and the black sample is the insurance information of known fraudsters. The fraud evaluation model is obtained by the aforementioned method, and the new claims applicants After inputting the relevant insurance data into the insurance fraud evaluation model, the newly recruited claim adjuster can obtain the evaluation score of the fraudulent insurance personnel or the attribute of whether it is fraudulent insurance. In this way, the relevant personnel of the insurance company can conduct follow-up related examinations on the suspected fraudulent insurance personnel based on such assessment results, avoiding unnecessary property losses. In the second aspect, based on the same inventive concept, an embodiment of this specification provides a sample attribute evaluation model training device. Please refer to FIG. 3, which includes: The first determining unit 301 is configured to determine the black sample concentration of each community in the relationship graph corresponding to the training sample, where the training sample includes the black sample and the unknown sample; The second determining unit 302 is configured to determine the white sample sampling probability of each unknown sample based on the black sample concentration of each community, and perform sampling with the white sample sampling probability of each unknown sample to obtain a white sample ; The training unit 303 is configured to train the black sample and the white sample based on a semi-supervised machine learning algorithm to obtain an attribute evaluation model of the target sample. 
In an optional implementation manner, the training unit 303 is specifically configured to: Training the black sample and the white sample based on a semi-supervised machine learning algorithm to obtain a sample attribute evaluation model; Judging whether the sample attribute evaluation model satisfies a preset convergence condition; If not, update the black sample concentration of each community, and continue training based on the updated black sample concentration of each community and the semi-supervised machine learning algorithm until the sample attribute evaluation model obtained by training satisfies the preset The convergence condition is to use the sample attribute evaluation model that satisfies the preset convergence condition as the target sample attribute evaluation model. In an optional implementation manner, the training unit 303 is specifically configured to: Evaluate each of the unknown samples based on the sample attribute evaluation model, obtain the current round of attribute evaluation results of each of the unknown samples, and obtain a total of M current round of attribute evaluation results, where M is the number of unknown samples; Based on the M current-round attribute evaluation results and the M last-round attribute evaluation results, it is determined whether the sample attribute evaluation model satisfies a preset convergence condition. In an optional implementation manner, the training unit 303 is specifically configured to: Evaluate each of the unknown samples based on the sample attribute evaluation model to obtain the black sample score of each unknown sample. If the black sample score value is greater than the preset score value, mark the attribute information of the unknown sample as black Samples, wherein the current round of attribute evaluation results of each of the unknown samples include attribute information of the unknown sample. In an optional implementation manner, the training unit 303 is specifically configured to: Determine whether the attribute information in the current round of attribute evaluation results of each unknown sample is consistent with the attribute information in the previous round of attribute evaluation results of the unknown sample. If so, it indicates that the current round of sample attribute evaluation model satisfies the preset convergence condition. In an optional implementation manner, the training unit 303 is specifically configured to: Based on the M current-round attribute evaluation results and the M last-round attribute evaluation results, determine unknown samples whose attribute information has changed; Recalculate the black sample concentration of the community corresponding to the unknown sample whose attribute information has changed. In an optional implementation manner, the device further includes an evaluation unit, and the evaluation unit is specifically configured to: After the sample attribute evaluation model that satisfies the preset convergence condition is used as the target sample attribute evaluation model, the new sample is evaluated according to the target sample attribute evaluation model, and the evaluation result of the new sample is determined, wherein the The evaluation result includes the score and/or attribute information of the newly injected black sample. In an optional implementation manner, the training samples are insurance data corresponding to claimants, and the black samples are insurance data corresponding to insurance fraudsters. 
In an optional implementation manner, the first determining unit is specifically configured to: Determine the first proportion of nodes corresponding to all black samples in each community in the total nodes of the community, and use the first proportion as the black sample concentration of the community; or Determine the second proportion of the nodes corresponding to all black samples in each community in the total nodes in the relationship graph, and use the first proportion as the black sample concentration of the community; or Determine the third proportion of the nodes corresponding to all black samples in each community among the total nodes of the community, and the fourth proportion of the total nodes of the community among the total nodes in the relationship graph, to obtain the third proportion And the weighted average of the fourth proportion, and the weighted average is used as the black sample concentration of the community. In the third aspect, based on the same inventive concept as the sample attribute evaluation method in the previous embodiment, the present invention also provides a server, as shown in FIG. 4, which includes a memory 404, a processor 402, and a device stored on the memory 404. A computer program that can run on the processor 402, and when the processor 402 executes the program, the steps of any one of the aforementioned sample attribute evaluation methods are implemented. Among them, in FIG. 4, the bus bar architecture (represented by the bus bar 400), the bus bar 400 may include any number of interconnected bus bars and bridges, and the bus bar 400 will include one or more processes represented by the processor 402 The memory and various circuits of the memory represented by the memory 404 are linked together. The bus bar 400 can also link various other circuits such as peripheral devices, voltage regulators, and power management circuits, which are all known in the art, and therefore, no further description will be given herein. The bus interface 406 provides an interface between the bus 400 and the receiver 401 and transmitter 403. The receiver 401 and the transmitter 403 may be the same element, that is, a transceiver, which provides a unit for communicating with various other devices on the transmission medium. The processor 402 is responsible for managing the bus 400 and general processing, and the memory 404 can be used to store data used by the processor 402 when performing operations. In a fourth aspect, based on the inventive concept of the sample attribute evaluation in the foregoing embodiment, the present invention also provides a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, the above-mentioned sample attribute evaluation is realized The steps of any method of the method. This specification is described with reference to flowcharts and/or block diagrams of methods, equipment (systems), and computer program products according to the embodiments of this specification. It should be understood that each process and/or block in the flowchart and/or block diagram, and the combination of processes and/or blocks in the flowchart and/or block diagram can be realized by computer program instructions. 
These computer program instructions can be provided to the processors of general-purpose computers, dedicated computers, embedded processors, or other programmable data processing equipment to generate a machine, so that the instructions executed by the processor of the computer or other programmable data processing equipment can be used to generate It is a device that realizes the functions specified in one process or multiple processes in the flowchart and/or one block or multiple blocks in the block diagram. These computer program instructions can also be stored in a computer-readable memory that can guide a computer or other programmable data processing equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce a manufactured product that includes the instruction device, The instruction device implements the functions specified in one process or multiple processes in the flowchart and/or one block or multiple blocks in the block diagram. These computer program instructions can also be loaded on a computer or other programmable data processing equipment, so that a series of operation steps are executed on the computer or other programmable equipment to generate computer-implemented processing, which can be executed on the computer or other programmable equipment. The instructions provide steps for implementing functions specified in one flow or multiple flows in the flowchart and/or one block or multiple blocks in the block diagram. Although the preferred embodiments of this specification have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn the basic progressive concepts. Therefore, the scope of the attached patent application is intended to be interpreted as including the preferred embodiments and all changes and modifications falling within the scope of this specification. Obviously, those skilled in the art can make various changes and modifications to this specification without departing from the spirit and scope of the specification. In this way, if these modifications and variations of this specification fall within the scope of the patent application for this specification and its equivalent technology, this specification is also intended to include these modifications and variations.

301:第一確定單元 302:第二確定單元 303:訓練單元 400:匯流排 401:接收器 402:處理器 403:發送器 404:記憶體 406:匯流排界面
301: first determining unit; 302: second determining unit; 303: training unit; 400: bus; 401: receiver; 402: processor; 403: transmitter; 404: memory; 406: bus interface

圖1為本說明書實施例樣本屬性評估應用場景示意圖; 圖2為本說明書實施例第一態樣樣本屬性評估方法流程圖; 圖3為本說明書實施例第二態樣樣本屬性評估模型訓練裝置結構示意圖; 圖4為本說明書實施例第三態樣樣本屬性評估伺服器結構示意圖。Fig. 1 is a schematic diagram of an application scenario of sample attribute evaluation in an embodiment of the specification; FIG. 2 is a flowchart of the method for evaluating the attributes of a sample in the first aspect of the embodiment of this specification; 3 is a schematic diagram of the structure of the training device for the sample attribute evaluation model of the second aspect of the embodiment of this specification; 4 is a schematic diagram of the structure of the sample attribute evaluation server in the third aspect of the embodiment of this specification.

Claims (15)

一種樣本屬性模型訓練方法,包括:確定與訓練樣本對應的關係圖中每個社區的黑樣本濃度,其中,所述訓練樣本包括黑樣本和未知樣本;基於所述每個社區的黑樣本濃度,確定每個所述未知樣本的白樣本抽樣概率,以每個所述未知樣本的所述白樣本抽樣概率進行抽樣,獲得白樣本;基於半監督機器學習演算法對所述黑樣本與所述白樣本進行訓練,獲得樣本屬性評估模型;判斷所述樣本屬性評估模型是否滿足預設收斂條件;如果否,更新所述每個社區的黑樣本濃度,基於更新後的每個社區的黑樣本濃度與所述半監督機器學習演算法繼續訓練,直至訓練得到的所述樣本屬性評估模型滿足所述預設收斂條件,將滿足所述預設收斂條件的樣本屬性評估模型作為目標樣本屬性評估模型,其中所述判斷所述樣本屬性評估模型是否滿足預設收斂條件,包括:基於所述樣本屬性評估模型對每個所述未知樣本進行評估,獲得每個所述未知樣本的本輪屬性評估結果,共計獲得M個本輪屬性評估結果,M為所述未知樣本的個數;基於所述M個本輪屬性評估結果與M個上一輪屬性評估結果,判斷所述樣本屬性評估模型是否滿足預設收斂條件。 A sample attribute model training method includes: determining the black sample concentration of each community in a relationship graph corresponding to the training sample, wherein the training sample includes a black sample and an unknown sample; based on the black sample concentration of each community, The white sample sampling probability of each unknown sample is determined, and the white sample sampling probability of each unknown sample is sampled to obtain a white sample; based on a semi-supervised machine learning algorithm, the black sample and the white sample The sample is trained to obtain a sample attribute evaluation model; it is judged whether the sample attribute evaluation model meets the preset convergence condition; if not, the black sample concentration of each community is updated based on the updated black sample concentration of each community and The semi-supervised machine learning algorithm continues training until the sample attribute evaluation model obtained by training satisfies the preset convergence condition, and the sample attribute assessment model that satisfies the preset convergence condition is used as the target sample attribute assessment model, wherein The judging whether the sample attribute evaluation model satisfies a preset convergence condition includes: evaluating each of the unknown samples based on the sample attribute evaluation model, and obtaining the current round of attribute evaluation results of each of the unknown samples, in total Obtain M current-round attribute evaluation results, where M is the number of the unknown samples; based on the M current-round attribute evaluation results and M last-round attribute evaluation results, determine whether the sample attribute evaluation model meets preset convergence condition. 
2. The method according to claim 1, wherein determining the black sample concentration of each community in the relationship graph corresponding to the training samples comprises: determining a first proportion of the nodes corresponding to all black samples in each community among the total nodes of that community, and taking the first proportion as the black sample concentration of that community; or determining a second proportion of the nodes corresponding to all black samples in each community among the total nodes of the relationship graph, and taking the second proportion as the black sample concentration of that community; or determining a third proportion of the nodes corresponding to all black samples in each community among the total nodes of that community and a fourth proportion of the total nodes of that community among the total nodes of the relationship graph, obtaining a weighted average of the third proportion and the fourth proportion, and taking the weighted average as the black sample concentration of that community.

3. The method according to claim 1, wherein evaluating each unknown sample based on the sample attribute evaluation model to obtain the current-round attribute evaluation result of each unknown sample comprises: evaluating each unknown sample based on the sample attribute evaluation model to obtain a black sample score of each unknown sample, and if the black sample score is greater than a preset score, marking the attribute information of that unknown sample as a black sample, wherein the current-round attribute evaluation result of each unknown sample includes the attribute information of that unknown sample.

4. The method according to claim 3, wherein determining, based on the M current-round attribute evaluation results and the M previous-round attribute evaluation results, whether the sample attribute evaluation model satisfies the preset convergence condition comprises: determining whether the attribute information in the current-round attribute evaluation result of each unknown sample is consistent with the attribute information in the previous-round attribute evaluation result of that unknown sample, and if so, indicating that the sample attribute evaluation model of the current round satisfies the preset convergence condition.

5. The method according to claim 3, wherein updating the black sample concentration of each community comprises: determining, based on the M current-round attribute evaluation results and the M previous-round attribute evaluation results, the unknown samples whose attribute information has changed; and recalculating the black sample concentration of the communities corresponding to the unknown samples whose attribute information has changed.

6. The method according to any one of claims 1 to 5, wherein the training samples are insurance data corresponding to persons filing insurance claims, and the black samples are insurance data corresponding to persons committing insurance fraud.

7. A sample attribute evaluation method, comprising: evaluating a new sample with a target sample attribute evaluation model trained by the method according to any one of claims 1 to 5, and determining an evaluation result of the new sample, wherein the evaluation result includes a black sample score and/or attribute information of the new sample.

8. A sample attribute evaluation model training device, comprising: a first determining unit configured to determine a black sample concentration of each community in a relationship graph corresponding to training samples, wherein the training samples comprise black samples and unknown samples; a second determining unit configured to determine, based on the black sample concentration of each community, a white sample sampling probability of each unknown sample, and to sample according to the white sample sampling probability of each unknown sample to obtain white samples; and a training unit configured to train the black samples and the white samples based on a semi-supervised machine learning algorithm to obtain a sample attribute evaluation model, determine whether the sample attribute evaluation model satisfies a preset convergence condition, and if not, update the black sample concentration of each community and continue training based on the updated black sample concentration of each community and the semi-supervised machine learning algorithm until the sample attribute evaluation model obtained by training satisfies the preset convergence condition, and take the sample attribute evaluation model that satisfies the preset convergence condition as a target sample attribute evaluation model, wherein the training unit is specifically configured to: evaluate each unknown sample based on the sample attribute evaluation model to obtain a current-round attribute evaluation result of each unknown sample, obtaining M current-round attribute evaluation results in total, where M is the number of the unknown samples; and determine, based on the M current-round attribute evaluation results and M previous-round attribute evaluation results, whether the sample attribute evaluation model satisfies the preset convergence condition.
9. The device according to claim 8, wherein the first determining unit is specifically configured to: determine a first proportion of the nodes corresponding to all black samples in each community among the total nodes of that community, and take the first proportion as the black sample concentration of that community; or determine a second proportion of the nodes corresponding to all black samples in each community among the total nodes of the relationship graph, and take the second proportion as the black sample concentration of that community; or determine a third proportion of the nodes corresponding to all black samples in each community among the total nodes of that community and a fourth proportion of the total nodes of that community among the total nodes of the relationship graph, obtain a weighted average of the third proportion and the fourth proportion, and take the weighted average as the black sample concentration of that community.

10. The device according to claim 8, wherein the training unit is specifically configured to: evaluate each unknown sample based on the sample attribute evaluation model to obtain a black sample score of each unknown sample, and if the black sample score is greater than a preset score, mark the attribute information of that unknown sample as a black sample, wherein the current-round attribute evaluation result of each unknown sample includes the attribute information of that unknown sample.

11. The device according to claim 10, wherein the training unit is specifically configured to: determine whether the attribute information in the current-round attribute evaluation result of each unknown sample is consistent with the attribute information in the previous-round attribute evaluation result of that unknown sample, and if so, indicate that the sample attribute evaluation model of the current round satisfies the preset convergence condition.

12. The device according to claim 10, wherein the training unit is specifically configured to: determine, based on the M current-round attribute evaluation results and the M previous-round attribute evaluation results, the unknown samples whose attribute information has changed; and recalculate the black sample concentration of the communities corresponding to the unknown samples whose attribute information has changed.
13. A sample attribute evaluation device, comprising: an evaluation unit configured to evaluate a new sample with a target sample attribute evaluation model trained by the device according to any one of claims 8 to 12, and determine an evaluation result of the new sample, wherein the evaluation result includes a black sample score and/or attribute information of the new sample.

14. A server, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the method according to any one of claims 1 to 7.

15. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
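To make the community black-sample concentration of claims 2 and 9 concrete, the following is a minimal Python sketch, not the patented implementation. It assumes communities are given as a mapping from community id to a set of node ids, that confirmed black samples are identified by node id, and that the weight used by the weighted-average variant (alpha below) is a free parameter the claims leave unspecified; all names are hypothetical.

```python
def black_concentration(communities, black_nodes, alpha=0.5, mode="in_community"):
    """Per-community black-sample concentration.

    communities: dict mapping community id -> set of node ids.
    black_nodes: set of node ids confirmed to be black samples.
    mode selects one of the three alternatives recited in the claims;
    alpha is the (unspecified) weight used by the weighted-average variant.
    """
    total_nodes = sum(len(nodes) for nodes in communities.values())
    concentrations = {}
    for cid, nodes in communities.items():
        black_in_c = len(nodes & black_nodes)
        if mode == "in_community":
            # Option 1: black nodes of the community / total nodes of the community.
            concentrations[cid] = black_in_c / len(nodes)
        elif mode == "in_graph":
            # Option 2: black nodes of the community / total nodes of the whole graph.
            concentrations[cid] = black_in_c / total_nodes
        else:
            # Option 3: weighted average of the in-community ratio and the
            # community's share of all graph nodes.
            concentrations[cid] = (alpha * (black_in_c / len(nodes))
                                   + (1 - alpha) * (len(nodes) / total_nodes))
    return concentrations

# Toy usage: community 0 holds two of the three known black samples.
print(black_concentration({0: {1, 2, 3, 4}, 1: {5, 6, 7}}, {2, 3, 7}))
# {0: 0.5, 1: 0.3333333333333333}
```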
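Claims 1 and 8 recite deriving a white-sample sampling probability for each unknown sample from the black-sample concentration of its community, without fixing the functional form. The sketch below assumes the simplest decreasing relationship, probability = 1 - concentration; that linear form, and the helper names, are illustrative assumptions only.

```python
import random

def sample_white(unknown_nodes, node_community, concentrations, seed=0):
    """Illustrative white-sample selection: an unknown sample in a community
    with a low black-sample concentration is more likely to be drawn as white.
    The form 1 - concentration is an assumption, not stated in the claims."""
    rng = random.Random(seed)
    white = []
    for node in unknown_nodes:
        p_white = 1.0 - concentrations[node_community[node]]
        if rng.random() < p_white:
            white.append(node)
    return white
```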
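The iterative procedure of claims 1 and 3 to 5 (sample white samples, train, re-score the unknown samples, stop when every unknown sample keeps the label it received in the previous round, otherwise recompute the concentration of the affected communities and continue) could be sketched as follows. This is a hedged illustration, not the patented algorithm: `black_concentration` and `sample_white` are the hypothetical helpers above, `X` is assumed to be a NumPy feature matrix indexed by node id, the score threshold is a placeholder for the "preset score", and a plain logistic regression stands in for whatever base learner the unspecified semi-supervised algorithm wraps.

```python
from sklearn.linear_model import LogisticRegression  # stand-in base learner

def train_until_converged(X, black_nodes, unknown_nodes, node_community,
                          communities, threshold=0.5, max_rounds=20):
    # black_nodes: set of confirmed black node ids; unknown_nodes: list of node ids.
    concentrations = black_concentration(communities, black_nodes)
    prev_labels, model = None, None
    for _ in range(max_rounds):
        # Assumes at least one white sample is drawn each round.
        white_nodes = sample_white(unknown_nodes, node_community, concentrations)
        train_ids = list(black_nodes) + white_nodes
        y = [1] * len(black_nodes) + [0] * len(white_nodes)
        model = LogisticRegression(max_iter=1000).fit(X[train_ids], y)

        # Score every unknown sample; mark it black when the score exceeds the threshold.
        scores = model.predict_proba(X[unknown_nodes])[:, 1]
        labels = {n: int(s > threshold) for n, s in zip(unknown_nodes, scores)}

        # Converged when no unknown sample changed its label since the previous round.
        if prev_labels is not None and labels == prev_labels:
            return model

        current_black = black_nodes | {n for n, lab in labels.items() if lab == 1}
        if prev_labels is None:
            # First round: no previous result to compare with, refresh every community.
            concentrations = black_concentration(communities, current_black)
        else:
            # Recompute only the communities whose unknown samples flipped label.
            changed = {node_community[n] for n in unknown_nodes
                       if labels[n] != prev_labels[n]}
            concentrations.update(black_concentration(
                {c: communities[c] for c in changed}, current_black))
        prev_labels = labels
    return model
```

The loop mirrors the claim structure directly: sampling, training, scoring, a label-stability convergence test, and a concentration refresh limited to the communities whose samples changed.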
TW108122547A 2018-08-31 2019-06-27 Sample attribute evaluation model training method, device, server and storage medium TWI726341B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811015607.1 2018-08-31
CN201811015607.1A CN109325525A (en) 2018-08-31 2018-08-31 Sample attribute assessment models training method, device and server

Publications (2)

Publication Number Publication Date
TW202011285A TW202011285A (en) 2020-03-16
TWI726341B true TWI726341B (en) 2021-05-01

Family

ID=65263715

Family Applications (1)

Application Number Title Priority Date Filing Date
TW108122547A TWI726341B (en) 2018-08-31 2019-06-27 Sample attribute evaluation model training method, device, server and storage medium

Country Status (3)

Country Link
CN (1) CN109325525A (en)
TW (1) TWI726341B (en)
WO (1) WO2020042795A1 (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109325525A (en) * 2018-08-31 2019-02-12 阿里巴巴集团控股有限公司 Sample attribute assessment models training method, device and server
CN110020670B (en) * 2019-03-07 2023-07-18 创新先进技术有限公司 Model iteration method, device and equipment
CN110311902B (en) * 2019-06-21 2022-04-22 北京奇艺世纪科技有限公司 Abnormal behavior identification method and device and electronic equipment
CN110335140B (en) * 2019-06-27 2021-09-24 上海淇馥信息技术有限公司 Method and device for predicting loan black intermediary based on social relationship, and electronic equipment
CN110807643A (en) * 2019-10-11 2020-02-18 支付宝(杭州)信息技术有限公司 User trust evaluation method, device and equipment
US11775822B2 (en) 2020-05-28 2023-10-03 Macronix International Co., Ltd. Classification model training using diverse training source and inference engine using same
CN111881289B (en) * 2020-06-10 2023-09-08 北京启明星辰信息安全技术有限公司 Training method of classification model, and detection method and device of data risk class
CN111709833B (en) * 2020-06-16 2023-10-31 中国银行股份有限公司 User credit assessment method and device
CN111931912A (en) * 2020-08-07 2020-11-13 北京推想科技有限公司 Network model training method and device, electronic equipment and storage medium
CN112231929B (en) * 2020-11-02 2024-04-02 北京空间飞行器总体设计部 Evaluation scene large sample generation method based on track parameters
JP7062747B1 (en) * 2020-12-25 2022-05-06 楽天グループ株式会社 Information processing equipment, information processing methods and programs
CN113343051B (en) * 2021-06-04 2024-04-16 全球能源互联网研究院有限公司 Abnormal SQL detection model construction method and detection method
TWI771098B (en) * 2021-07-08 2022-07-11 國立陽明交通大學 Fault diagnosis system and method for state of radar system of roadside units
CN113779150A (en) * 2021-09-14 2021-12-10 杭州数梦工场科技有限公司 Data quality evaluation method and device
CN116579651B (en) * 2023-05-11 2023-11-10 中国矿业报社 Mining project evaluation method based on semi-supervised learning

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7983490B1 (en) * 2007-12-20 2011-07-19 Thomas Cecil Minter Adaptive Bayes pattern recognition
US9053391B2 (en) * 2011-04-12 2015-06-09 Sharp Laboratories Of America, Inc. Supervised and semi-supervised online boosting algorithm in machine learning framework
CN106960154A (en) * 2017-03-30 2017-07-18 兴华永恒(北京)科技有限责任公司 A kind of rogue program dynamic identifying method based on decision-tree model
CN107273454B (en) * 2017-05-31 2020-11-03 北京京东尚科信息技术有限公司 User data classification method, device, server and computer readable storage medium
CN107368892B (en) * 2017-06-07 2020-06-16 无锡小天鹅电器有限公司 Model training method and device based on machine learning
CN107798390B (en) * 2017-11-22 2023-03-21 创新先进技术有限公司 Training method and device of machine learning model and electronic equipment
CN108334647A (en) * 2018-04-12 2018-07-27 阿里巴巴集团控股有限公司 Data processing method, device, equipment and the server of Insurance Fraud identification
CN109325525A (en) * 2018-08-31 2019-02-12 阿里巴巴集团控股有限公司 Sample attribute assessment models training method, device and server

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120311030A1 (en) * 2011-05-31 2012-12-06 International Business Machines Corporation Inferring User Interests Using Social Network Correlation and Attribute Correlation
US20140172756A1 (en) * 2012-12-17 2014-06-19 International Business Machines Corporation Question classification and feature mapping in a deep question answering system
CN105468742A (en) * 2015-11-25 2016-04-06 小米科技有限责任公司 Malicious order recognition method and device
CN107730262A (en) * 2017-10-23 2018-02-23 阿里巴巴集团控股有限公司 One kind fraud recognition methods and device

Also Published As

Publication number Publication date
TW202011285A (en) 2020-03-16
WO2020042795A1 (en) 2020-03-05
CN109325525A (en) 2019-02-12

Similar Documents

Publication Publication Date Title
TWI726341B (en) Sample attribute evaluation model training method, device, server and storage medium
CN110009174B (en) Risk recognition model training method and device and server
WO2021169115A1 (en) Risk control method, apparatus, electronic device, and computer-readable storage medium
TW201947510A (en) Insurance service risk prediction processing method, device and processing equipment
WO2020168851A1 (en) Behavior recognition
CN112580733B (en) Classification model training method, device, equipment and storage medium
Xu et al. A hybrid autoregressive fractionally integrated moving average and nonlinear autoregressive neural network model for short-term traffic flow prediction
WO2020052168A1 (en) Anti-fraud model generation and application method, device and equipment, and storage medium
CN110310114A (en) Object classification method, device, server and storage medium
Herasymovych et al. Using reinforcement learning to optimize the acceptance threshold of a credit scoring model
WO2023045691A1 (en) Object recognition method and apparatus, and electronic device and storage medium
CN114202223A (en) Enterprise credit risk scoring method, device, equipment and storage medium
WO2019019346A1 (en) Asset allocation strategy acquisition method and apparatus, computer device, and storage medium
CN113240177B (en) Method for training prediction model, prediction method, device, electronic equipment and medium
WO2023143570A1 (en) Connection relationship prediction method and related device
CN115982654A (en) Node classification method and device based on self-supervision graph neural network
CN115618065A (en) Data processing method and related equipment
CN115099875A (en) Data classification method based on decision tree model and related equipment
CN112364258B (en) Recommendation method and system based on map, storage medium and electronic equipment
US11551317B2 (en) Property valuation model and visualization
Zhao et al. Detecting fake reviews via dynamic multimode network
CN114912623A (en) Method and device for model interpretation
CN113807391A (en) Task model training method and device, electronic equipment and storage medium
CN114565030B (en) Feature screening method and device, electronic equipment and storage medium
CN114547448B (en) Data processing method, model training method, device, equipment, storage medium and program