TW202011285A - Sample attribute evaluation model training method and apparatus, and server - Google Patents

Sample attribute evaluation model training method and apparatus, and server

Info

Publication number
TW202011285A
Authority
TW
Taiwan
Prior art keywords
sample
black
samples
community
attribute evaluation
Prior art date
Application number
TW108122547A
Other languages
Chinese (zh)
Other versions
TWI726341B (en)
Inventor
王修坤
趙婷婷
劉斌
Original Assignee
香港商阿里巴巴集團服務有限公司
Priority date
Filing date
Publication date
Application filed by 香港商阿里巴巴集團服務有限公司
Publication of TW202011285A
Application granted
Publication of TWI726341B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Financial Or Insurance-Related Operations Such As Payment And Settlement (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the present description provide a sample attribute evaluation model training method. Training samples are first determined; they comprise only a small number of black samples with confirmed attributes together with a large number of unknown samples with unconfirmed attributes. Based on a relationship graph corresponding to the training samples, the black sample concentration of each community is determined. By combining the black sample concentration of each community with a semi-supervised machine learning algorithm, potential black samples can be mined from the unknown samples even when the number of known black samples is small, and the white samples required for model training can then be determined. The model training requirements are thereby met, so that the trained model can accurately evaluate whether a sample has the attribute of a black sample.

Description

Sample attribute evaluation model training method, device and server

The embodiments of this specification relate to the field of Internet technology, and in particular to a sample attribute evaluation model training method, device and server.

With the rapid development of the Internet, more and more business is conducted online, for example online payment, online shopping and online insurance claims. While the Internet brings convenience to people's lives, it also brings risk: offenders may commit fraud in electronic business and cause losses to other users. In a huge set of business samples, the number of risk samples whose attribute is clearly black is small, and most samples have unknown attributes. Because business fraud data are well hidden, a scheme is urgently needed that can be trained on a small number of known black samples and still accurately evaluate the attributes of unknown samples, so as to improve the overall risk-control capability.

The embodiments of this specification provide a sample attribute evaluation model training method, device and server.

In a first aspect, an embodiment of this specification provides a sample attribute evaluation method, including: determining the black sample concentration of each community in a relationship graph corresponding to training samples, where the training samples include black samples and unknown samples; determining, based on the black sample concentration of each community, a white sample sampling probability for each unknown sample, and sampling with the white sample sampling probability of each unknown sample to obtain white samples; and training the black samples and the white samples based on a semi-supervised machine learning algorithm to obtain a target sample attribute evaluation model.

In a second aspect, an embodiment of this specification provides a sample attribute evaluation model training device, including: a first determining unit, configured to determine the black sample concentration of each community in a relationship graph corresponding to training samples, where the training samples include black samples and unknown samples; a second determining unit, configured to determine, based on the black sample concentration of each community, a white sample sampling probability for each unknown sample, and to sample with the white sample sampling probability of each unknown sample to obtain white samples; and a training unit, configured to train the black samples and the white samples based on a semi-supervised machine learning algorithm to obtain a target sample attribute evaluation model.

In a third aspect, an embodiment of this specification provides a server including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the program, implements the steps of any one of the above sample attribute evaluation methods.

In a fourth aspect, an embodiment of this specification provides a computer-readable storage medium on which a computer program is stored, where the program, when executed by a processor, implements the steps of any one of the above sample attribute evaluation methods.

The beneficial effects of the embodiments of this specification are as follows. Training samples are determined that contain only a small number of black samples with confirmed attributes and a large number of unknown samples with unconfirmed attributes. The black sample concentration of each community is determined from the relationship graph corresponding to the training samples. By combining the community black sample concentrations with a semi-supervised machine learning algorithm, potential black samples can be mined from the unknown samples even when few black samples are known, and the white samples required for model training can then be determined. This satisfies the model training requirements and enables the trained model to accurately evaluate whether a sample has the attribute of a black sample.

In order to better understand the above technical solutions, the technical solutions of the embodiments of this specification are described in detail below with reference to the drawings and specific embodiments. It should be understood that the embodiments of this specification and the specific features in the embodiments are detailed explanations of the technical solutions of this specification rather than limitations on them, and the embodiments and the technical features in the embodiments may be combined with each other where no conflict arises.

Please refer to FIG. 1, a schematic diagram of a sample attribute evaluation application scenario according to an embodiment of this specification. The terminal 100 is located on the user side and communicates with the server 200 on the network side. A user can generate real-time events and business data through an app or website on the terminal 100.
The server 200 collects the real-time events generated by the terminals, from which the training samples can be selected. The embodiments of this specification can be applied to risk-control scenarios such as risk sample identification or identification of fraudulent claims in insurance settlement, and more generally to binary classification scenarios.

In a first aspect, an embodiment of this specification provides a sample attribute evaluation method. Referring to FIG. 2, the method includes steps S201-S203.

S201: determine the black sample concentration of each community in a relationship graph corresponding to training samples, where the training samples include black samples and unknown samples.

S202: based on the black sample concentration of each community, determine a white sample sampling probability for each unknown sample, and sample with the white sample sampling probability of each unknown sample to obtain white samples.

S203: train the black samples and the white samples based on a semi-supervised machine learning algorithm to obtain a target sample attribute evaluation model.

Specifically, in this embodiment the training samples are first determined in step S201. As described above, the training samples may be business data generated on the terminal side; they include black samples whose attributes have already been marked as well as unknown samples with unknown attributes. For example, in an insurance claim scenario, the training samples are the data of users who apply for claims, and the samples corresponding to users confirmed to have committed insurance fraud are the black samples. In this scenario there are few black samples with confirmed fraud, so the lack of a large number of black labels greatly reduces the accuracy of a sample attribute evaluation model, and solving the model training problem in this setting is important. The method of this embodiment combines the community attributes of the samples with a semi-supervised machine learning algorithm to mine potential black samples from a large number of unknown samples, so that the number of black samples needed for model training is reached and white samples with high trustworthiness are obtained by filtering. The purity of the black samples and white samples during training is thereby ensured, model training can be completed, and a sample attribute evaluation model with high accuracy is obtained.

Further, in step S201 the black sample concentration of each community in the relationship graph corresponding to the training samples is determined.

Specifically, in this embodiment a relationship graph containing the training samples needs to be constructed in advance. Each training sample corresponds to a node; the constructed graph may contain only the nodes corresponding to the training samples, or it may be a graph over the nodes of the whole network. The graph may be constructed by obtaining the historical events of each node within a predetermined time period, determining the relationship graph from the historical events according to a preset composition method, and dividing the nodes of the graph into communities with a preset community discovery algorithm, where each node is assigned the community label of the community it belongs to. The preset time period can be specified in advance, and the preset composition method needs to define the nodes, the edges, and the edge weights.
This embodiment does not limit the specific composition rules; different rules can be used in different scenarios and implementations. For example, in the insurance claim scenario, the preset composition method may be: take each user as a node, connect two users with an edge if they have had a financial transaction (for example, a transfer) within the last six months, and use the number of transfers between the two users as the edge weight.

Specifically, in this embodiment one or more preset community discovery algorithms are run on the relationship graph constructed above, so that each node obtains the community label of the community it belongs to. The preset community discovery method may be a label propagation algorithm (LPA) or a fast unfolding algorithm (FU), among others; this application does not impose a limitation here.

The label propagation algorithm is briefly described as follows:

Step 1: every node in the graph takes its own node id as its initial label.

Step 2: every node obtains the labels of its neighbours.

Step 3: after receiving the labels of all of its neighbours, each node takes the label that appears most often as its own label (in a weighted graph, the label with the highest total weight); if several labels are tied, one of them is chosen at random.

Step 4: Steps 2 and 3 are repeated until no node's label changes; in a hierarchical variant, each resulting community may then be treated as a node and the process repeated until the communities no longer change.

Step 5: the label on each node is output as its community label.
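To make the graph construction and label propagation concrete, the following is a minimal Python sketch, not part of the patent disclosure. The record fields ("from", "to", "time"), the use of networkx, the 182-day window and the random tie-breaking are illustrative assumptions chosen to match the transfer example above.

```python
# Illustrative sketch: build the user relationship graph from transfer records
# and assign community labels with a simple label-propagation pass.
# Assumes records like {"from": "u1", "to": "u2", "time": datetime(...)}.
import random
from collections import Counter
from datetime import timedelta

import networkx as nx


def build_graph(transfer_records, now, window=timedelta(days=182)):
    """Nodes are users; an edge links two users who transferred money within
    the window, weighted by the number of transfers between them."""
    g = nx.Graph()
    for rec in transfer_records:
        if now - rec["time"] > window:
            continue
        u, v = rec["from"], rec["to"]
        if g.has_edge(u, v):
            g[u][v]["weight"] += 1
        else:
            g.add_edge(u, v, weight=1)
    return g


def label_propagation(g, max_rounds=100, seed=0):
    """Step 1: every node starts with its own id as label.
    Steps 2-4: each node adopts the label with the largest total edge weight
    among its neighbours (ties broken at random) until labels stop changing.
    Step 5: the final label of a node is its community label."""
    rng = random.Random(seed)
    labels = {n: n for n in g.nodes}
    for _ in range(max_rounds):
        changed = False
        for n in g.nodes:
            votes = Counter()
            for nb in g.neighbors(n):
                votes[labels[nb]] += g[n][nb].get("weight", 1)
            if not votes:
                continue
            best = max(votes.values())
            winner = rng.choice([lab for lab, w in votes.items() if w == best])
            if winner != labels[n]:
                labels[n] = winner
                changed = True
        if not changed:
            break
    return labels  # node -> community label
```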
After the relationship graph has been divided into communities, the black sample concentration of each community can be calculated. The ways of determining the black sample concentration of a community include, but are not limited to, the following three.

First: determine the first proportion of the nodes corresponding to all black samples in the community among the total nodes of that community, and use this first proportion as the black sample concentration of the community.

Second: determine the second proportion of the nodes corresponding to all black samples in the community among the total nodes of the relationship graph, and use this second proportion as the black sample concentration of the community.

Third: determine the third proportion of the nodes corresponding to all black samples in the community among the total nodes of that community, and the fourth proportion of the total nodes of that community among the total nodes of the relationship graph, take the weighted average of the third proportion and the fourth proportion, and use this weighted average as the black sample concentration of the community.

Specifically, under the first way, the black sample concentration can be defined as the number of black samples in the community divided by the total number of nodes in the community. For example, if community A contains 5 nodes in total and one of them corresponds to a black sample, the black sample concentration of community A is 1/5, and the black sample concentration of every node in the community is 1/5.

The concentration can also be defined relative to the size of the whole relationship graph, using the second way. For example, if community A contains 5 nodes, one of which corresponds to a black sample, and the relationship graph contains 10 nodes, the black sample concentration of community A is 1/10 and the black sample concentration of every node in the community is 1/10.

The concentration can also be set by combining the two dimensions of community size and the proportion of black samples in the community, using the third way. For example, if community A contains 5 nodes, one of which corresponds to a black sample, and the relationship graph contains 10 nodes, the black sample concentration of community A is K1*1/5 + K2*5/10, where K1 and K2 are weighting coefficients that can be set according to actual needs; the black sample concentration of every node in the community is then 0.2*K1 + 0.5*K2. In a specific implementation, the definition of the black sample concentration can be set according to actual needs, and this application does not impose a limitation here.

After the black sample concentration of each sample has been determined, in step S202 the white sample sampling probability of each unknown sample is determined based on the black sample concentration of each community, and sampling is performed with the white sample sampling probability of each unknown sample to obtain white samples.

When mining potential black samples from the unknown samples, the method of this embodiment uses the community attributes, so that even if a small portion of the unknown samples do not yet exhibit the true characteristics of black samples, deep mining based on the community characteristics can genuinely increase the proportion of black samples and meet the model training requirements. Specifically, for each unknown sample in the training samples, the white sample sampling probability of that sample can be determined from its black sample concentration. For example, if unknown sample 1 is located in community A and the black sample concentration of community A is 1/5, then the black sample concentration of unknown sample 1 is 1/5 and its white sample sampling probability is P1 = 1 - 1/5 = 4/5. Further, in the initial round of training, the white sample sampling probability of every unknown sample can be set to a fixed value; for example, with 100 unknown samples, the white sample sampling probability of each unknown sample can be set to 1/100 in the first round. In the subsequent rounds of training, the white sample sampling probability of an unknown sample is determined from its black sample concentration as described above.

In this way, after the white sample sampling probability of each unknown sample has been determined, the unknown samples can be sampled according to their respective white sample sampling probabilities to determine the white samples.
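As an illustration of the first concentration definition and of step S202, the following sketch computes per-community black sample concentrations and then draws white samples with a per-node Bernoulli trial at probability one minus the concentration. The helper names are hypothetical, and the fixed first-round sampling probability mentioned above is omitted for brevity.

```python
# Illustrative sketch of Step S202 under the first concentration definition:
# black concentration = (# black nodes in community) / (# nodes in community),
# white-sampling probability of an unknown node = 1 - its community's concentration.
import random
from collections import defaultdict


def black_concentration(labels, black_nodes):
    """labels: node -> community label; black_nodes: nodes currently treated as black."""
    size = defaultdict(int)
    black = defaultdict(int)
    for node, comm in labels.items():
        size[comm] += 1
        if node in black_nodes:
            black[comm] += 1
    return {comm: black[comm] / size[comm] for comm in size}


def sample_white(labels, black_nodes, concentration, seed=0):
    """Draw each unknown node as a white sample with probability 1 - concentration
    of its community (a per-node Bernoulli draw; other sampling schemes work too)."""
    rng = random.Random(seed)
    whites = []
    for node, comm in labels.items():
        if node in black_nodes:
            continue
        p_white = 1.0 - concentration[comm]
        if rng.random() < p_white:
            whites.append(node)
    return whites
```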
Then, in step S203, the black samples with marked attributes are combined with the sampled white samples, and the black samples and white samples are trained based on a semi-supervised machine learning algorithm to obtain the sample attribute evaluation model. A specific implementation may include the following steps: train the black samples and the white samples based on the semi-supervised machine learning algorithm to obtain a sample attribute evaluation model; determine whether the sample attribute evaluation model satisfies a preset convergence condition; if not, update the black sample concentration of each community and continue training based on the updated black sample concentration of each community and the semi-supervised machine learning algorithm, until the trained sample attribute evaluation model satisfies the preset convergence condition; the sample attribute evaluation model that satisfies the preset convergence condition is taken as the target sample attribute evaluation model.

In this embodiment, the semi-supervised machine learning algorithm includes PU learning (Positive and Unlabeled Learning), a semi-supervised machine learning algorithm in which only part of the training samples used to train the model are labelled while the remaining training samples are unlabelled, and the unlabelled samples assist the learning process over the labelled samples. This matches the situation of the modelling party here, where the collected training samples contain only a small number of labelled black samples and the remaining samples are unlabelled unknown samples, that is, a machine learning process over labelled positive samples and unlabelled samples.

After the black samples and white samples have been constructed, these training samples can be trained based on the semi-supervised machine learning algorithm to build the sample attribute evaluation model. A semi-supervised machine learning algorithm can usually follow several machine learning strategies; typical strategies include the two-stage strategy and the cost-sensitive strategy. In the two-stage strategy, the algorithm first mines potential reliable negative samples from the unlabelled samples based on the known positive samples and the unlabelled samples, and then, based on the known positive samples and the mined reliable negative samples, turns the problem into a traditional supervised machine learning process to train the classification model. In the cost-sensitive strategy, the algorithm assumes that the proportion of positive samples among the unlabelled samples is extremely low; it treats the unlabelled samples directly as negative samples and assigns the positive samples a higher cost-sensitive weight than the negative samples. For example, in the objective function of a cost-sensitive semi-supervised machine learning algorithm, a higher cost-sensitive weight is usually set for the loss term corresponding to the positive samples. By giving the positive samples a higher cost-sensitive weight, the cost of the finally trained classification model misclassifying a positive sample is much greater than the cost of misclassifying a negative sample, so a cost-sensitive classifier can be learned directly from the positive samples and the unlabelled samples (treated as negative samples) to classify the unknown samples. In this embodiment, the training samples may be trained based on a cost-sensitive semi-supervised machine learning algorithm or with the two-stage approach; this can be chosen as needed in a specific implementation, and this application does not impose a limitation here.
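The cost-sensitive strategy described above can be approximated with ordinary class weights in a standard classifier. The sketch below is one possible reading, using scikit-learn and an illustrative 50:1 weight ratio; neither the library nor the ratio is prescribed by the specification.

```python
# Illustrative sketch of the cost-sensitive strategy: treat unlabelled samples as
# negatives but give positives (black samples) a much larger misclassification cost.
# scikit-learn and the 50:1 weight are illustrative choices, not part of the spec.
import numpy as np
from sklearn.linear_model import LogisticRegression


def train_cost_sensitive(features, is_black):
    """features: (n, d) array; is_black: boolean numpy array, True for confirmed
    black samples. All remaining (unknown) samples are treated as negatives."""
    y = is_black.astype(int)
    model = LogisticRegression(max_iter=1000, class_weight={1: 50.0, 0: 1.0})
    model.fit(features, y)
    return model  # model.predict_proba(x)[:, 1] gives a black-sample score in [0, 1]
```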
This embodiment is mainly described with a two-stage semi-supervised machine learning algorithm. Taking the insurance claim scenario as an example, suppose that a black sample in the training sample set is labelled 1, indicating that the sample is insurance data known to involve fraud, and that a white sample is labelled 0, indicating that the insurance data corresponding to the training sample is normal. After a binary classification model is trained on the black samples and on the white samples drawn according to the white sample sampling probabilities, a sample attribute evaluation model is obtained. This model is then used to evaluate the unknown samples, giving each unknown sample a black sample score, a value in the range 0 to 1 that indicates the probability that the unknown sample is a black sample. The black samples, white samples and the corresponding black sample score can of course also be defined in other ways; this application does not impose a limitation here.

The training samples are trained for multiple rounds in this way, and the sample attribute evaluation model obtained after each round is checked against a preset convergence condition. If the model has converged, the sample attribute evaluation model obtained in that round is taken as the final target sample attribute evaluation model. If the model has not converged, the black sample concentration of each unknown sample is updated and training continues in the manner described above until the trained model satisfies the convergence condition.

In this embodiment, determining whether the model has converged can be implemented with the following steps: evaluate each unknown sample based on the sample attribute evaluation model to obtain the current-round attribute evaluation result of each unknown sample, giving M current-round attribute evaluation results in total, where M is the number of unknown samples; then, based on the M current-round attribute evaluation results and the M previous-round attribute evaluation results, determine whether the sample attribute evaluation model satisfies the preset convergence condition.

Evaluating each unknown sample based on the sample attribute evaluation model to obtain its current-round attribute evaluation result includes: evaluating each unknown sample based on the sample attribute evaluation model to obtain its black sample score, and, if the black sample score is greater than a preset score, marking the attribute information of the unknown sample as a black sample, where the current-round attribute evaluation result of each unknown sample includes the attribute information of that unknown sample.

Specifically, in each round of training the sample attribute evaluation model corresponding to that round is obtained, the model is used to give each unknown sample a black sample score, and the unknown sample can be marked according to the score: if the black sample score is greater than the preset score, the attribute information of the unknown sample is marked as a black sample. For example, with the preset score set to 0.8, the black sample score of unknown sample 1 is 0.9, so the attribute information of unknown sample 1 is marked as a black sample.
The black sample score of unknown sample 2 is 0.4, so the attribute information of unknown sample 2 remains unchanged and it is still an unknown-attribute sample. In this way, the attribute evaluation result of each unknown sample in this round of training can be determined, and the evaluation result can include the black sample score and the attribute information of the unknown sample in this round of training. With M unknown samples, M current-round attribute evaluation results are obtained.

Furthermore, the attribute evaluation results of the unknown samples in the previous round of training, that is, the M previous-round attribute evaluation results, can also be obtained. With the M current-round attribute evaluation results and the M previous-round attribute evaluation results, it can be determined whether the sample attribute evaluation model satisfies the preset convergence condition, specifically as follows: determine whether the attribute information in the current-round attribute evaluation result of each unknown sample is consistent with the attribute information in the previous-round attribute evaluation result of that unknown sample; if so, the current-round sample attribute evaluation model satisfies the preset convergence condition.

Specifically, in this embodiment, the attribute information in each of the M previous-round attribute evaluation results is used to determine which unknown samples were marked as black samples in the previous round of training, and the attribute information in each of the M current-round attribute evaluation results is used to determine which unknown samples are marked as black samples in this round of training. If the unknown samples marked as black samples in the previous round are identical to the unknown samples marked as black samples in this round, the mark of each unknown sample no longer changes and the model has converged. For example, the black samples among the unknown samples in the previous round of training are unknown sample 1, unknown sample 2, unknown sample 5 and unknown sample 10; the black samples in this round of training are also unknown sample 1, unknown sample 2, unknown sample 5 and unknown sample 10, showing that the marks of the unknown samples have not changed and the sample attribute evaluation model trained in this round has converged. The sample attribute evaluation model obtained in this round of training is taken as the target sample attribute evaluation model.

Further, the preset convergence condition for deciding whether the model has converged can be set according to actual needs; the above example is only one specific implementation and does not limit this application. For example, the condition may also be that the proportion of unknown samples whose black-sample mark is the same in the previous round and in this round reaches a preset proportion, indicating that the marks of the unknown samples essentially no longer change and the model has converged.

Further, if the sample attribute evaluation model obtained in this round of training does not satisfy the preset convergence condition, the model has not yet converged and the next round of training is required.
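The convergence test can be read as a simple set comparison between rounds. The sketch below shows both the strict form (identical black marks in two consecutive rounds) and the relaxed, proportion-based variant; the 0.8 threshold comes from the example above, while the function names and the stable-ratio parameter are assumptions of this illustration.

```python
# Illustrative sketch of the convergence test: training stops when the set of
# unknown samples marked as black no longer changes between rounds (or, in the
# relaxed variant, when the fraction of unchanged marks reaches a preset ratio).
def marked_black(scores, threshold=0.8):
    """scores: unknown-sample id -> black-sample score from this round's model."""
    return {sid for sid, s in scores.items() if s > threshold}


def has_converged(current_black, previous_black, all_unknown, min_stable_ratio=None):
    """Strict check when min_stable_ratio is None, otherwise the proportion variant."""
    if min_stable_ratio is None:
        return current_black == previous_black
    changed = current_black.symmetric_difference(previous_black)
    return 1.0 - len(changed) / max(len(all_unknown), 1) >= min_stable_ratio
```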
Before the next round of training, because the marked black samples have changed relative to the previous round, the black sample concentration of the affected communities needs to be updated according to the marked black samples, and the black sample concentration of each unknown sample is then updated. A specific implementation may include the following steps: based on the M current-round attribute evaluation results and the M previous-round attribute evaluation results, determine the unknown samples whose attribute information has changed; and recalculate the black sample concentration of the communities corresponding to the unknown samples whose attribute information has changed.

Specifically, in this embodiment, by comparing the attribute evaluation result of each unknown sample in this round of training with its attribute evaluation result in the previous round of training, the unknown samples whose attribute information has changed can be located; the change may be from the black sample attribute to the unknown attribute, or from the unknown attribute to the black sample attribute. The community containing an unknown sample whose attribute has changed is then located, the black sample concentration of that community is recalculated, and the white sample sampling probability of the nodes of that community is updated according to the updated black sample concentration. For example, community A contains the nodes corresponding to unknown sample 1, unknown sample 2 and black sample 1. In the previous round of training none of the node attributes in community A had changed, and the black sample concentration corresponding to each node in community A was 1/3. In this round of training unknown sample 1 is marked as a black sample while the attributes of the remaining nodes are unchanged, so the black sample concentration corresponding to each node in community A is updated to 2/3, and the white sample sampling probabilities of unknown sample 1 and unknown sample 2 both become 1/3.

In this way, the white sample sampling probabilities of the unknown nodes are updated, white samples are again drawn according to the white sample sampling probability of each node, and the sampled white samples are combined with the known black samples for the next round of training of the sample attribute evaluation model based on the semi-supervised machine learning algorithm, until the trained sample attribute evaluation model satisfies the above preset convergence condition.
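Putting the pieces together, the following sketch is one possible end-to-end training loop for the two-stage approach. It reuses the hypothetical helpers from the earlier sketches (black_concentration, sample_white, marked_black, has_converged), uses a generic gradient-boosting classifier as a stand-in for the unspecified binary model, omits the fixed first-round sampling probability, and assumes at least one white sample is drawn in every round.

```python
# Illustrative end-to-end loop tying the earlier sketches together; the helper
# names and the choice of classifier are assumptions of this illustration.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier


def train_attribute_model(features, labels, black_nodes, node_ids,
                          threshold=0.8, max_rounds=20):
    """features: (n, d) numpy array aligned with node_ids;
    labels: node -> community label from the community discovery step."""
    known_black = set(black_nodes)   # confirmed black samples, never relabelled
    marked = set()                   # unknown samples currently marked as black
    prev_marked = None
    idx = {nid: i for i, nid in enumerate(node_ids)}
    model = None
    for _ in range(max_rounds):
        blacks = known_black | marked
        conc = black_concentration(labels, blacks)
        whites = sample_white(labels, blacks, conc)
        train_ids = list(blacks) + whites
        X = features[[idx[n] for n in train_ids]]
        y = np.array([1 if n in blacks else 0 for n in train_ids])
        model = GradientBoostingClassifier().fit(X, y)
        # Score every unknown sample and re-mark the ones above the threshold.
        unknown = [n for n in node_ids if n not in known_black]
        probs = model.predict_proba(features[[idx[n] for n in unknown]])[:, 1]
        new_marked = marked_black(dict(zip(unknown, probs)), threshold)
        if prev_marked is not None and has_converged(new_marked, prev_marked, unknown):
            break
        prev_marked, marked = new_marked, new_marked
    return model
```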
The target sample attribute evaluation model obtained by training in the above manner can be used to perform sample attribute evaluation on newly arrived samples and to determine their evaluation results, where an evaluation result includes the black sample score and/or the attribute information of the new sample. Because model training in this embodiment uses the known black samples together with the remaining highly trusted white samples left after the potential black samples have been filtered out, the obtained target sample attribute evaluation model has high evaluation accuracy. The evaluation result of a new sample may include the black sample score described above, indicating the probability that the new sample is a black sample, and may also include the attribute information of the new sample; for example, if the black sample score of the new sample is 0.9, which is greater than the preset score of 0.8, the attribute information of the new sample is determined to be a black sample. From this evaluation result, the relevant personnel can learn the attributes of the new sample in time and carry out risk control promptly.

Further, in this embodiment, since the relationships between nodes change over time, the relationship graph of the foregoing embodiments can be updated at preset time intervals, communities can be re-divided on the updated graph to obtain the black sample concentrations of the corresponding communities, and the sample attribute evaluation model can be retrained, so that the model is updated at the preset time intervals.

The method of this embodiment can be applied to the insurance claim scenario, where the training samples are the insurance data of claim applicants and the black samples are the insurance data of known fraudsters. A fraud evaluation model is obtained in the manner described above; after the insurance data of a new claim applicant is input into the fraud evaluation model, an evaluation score indicating whether the new applicant is a fraudster, or a fraud attribute, is obtained. The relevant personnel of the insurance company can then carry out follow-up review of suspected fraudsters based on such evaluation results, avoiding unnecessary property losses.

In a second aspect, based on the same inventive concept, an embodiment of this specification provides a sample attribute evaluation model training device. Referring to FIG. 3, the device includes: a first determining unit 301, configured to determine the black sample concentration of each community in a relationship graph corresponding to training samples, where the training samples include black samples and unknown samples; a second determining unit 302, configured to determine, based on the black sample concentration of each community, a white sample sampling probability for each unknown sample, and to sample with the white sample sampling probability of each unknown sample to obtain white samples; and a training unit 303, configured to train the black samples and the white samples based on a semi-supervised machine learning algorithm to obtain a target sample attribute evaluation model.

In an optional implementation, the training unit 303 is specifically configured to: train the black samples and the white samples based on the semi-supervised machine learning algorithm to obtain a sample attribute evaluation model; determine whether the sample attribute evaluation model satisfies a preset convergence condition; and if not, update the black sample concentration of each community and continue training based on the updated black sample concentration of each community and the semi-supervised machine learning algorithm until the trained sample attribute evaluation model satisfies the preset convergence condition, the sample attribute evaluation model satisfying the preset convergence condition being taken as the target sample attribute evaluation model.
In an optional implementation, the training unit 303 is specifically configured to: evaluate each unknown sample based on the sample attribute evaluation model to obtain the current-round attribute evaluation result of each unknown sample, giving M current-round attribute evaluation results in total, where M is the number of unknown samples; and, based on the M current-round attribute evaluation results and the M previous-round attribute evaluation results, determine whether the sample attribute evaluation model satisfies the preset convergence condition.

In an optional implementation, the training unit 303 is specifically configured to: evaluate each unknown sample based on the sample attribute evaluation model to obtain a black sample score for each unknown sample, and, if the black sample score is greater than a preset score, mark the attribute information of the unknown sample as a black sample, where the current-round attribute evaluation result of each unknown sample includes the attribute information of that unknown sample.

In an optional implementation, the training unit 303 is specifically configured to: determine whether the attribute information in the current-round attribute evaluation result of each unknown sample is consistent with the attribute information in the previous-round attribute evaluation result of that unknown sample, and, if so, conclude that the current-round sample attribute evaluation model satisfies the preset convergence condition.

In an optional implementation, the training unit 303 is specifically configured to: based on the M current-round attribute evaluation results and the M previous-round attribute evaluation results, determine the unknown samples whose attribute information has changed; and recalculate the black sample concentration of the communities corresponding to the unknown samples whose attribute information has changed.

In an optional implementation, the device further includes an evaluation unit, which is specifically configured to: after the sample attribute evaluation model satisfying the preset convergence condition is taken as the target sample attribute evaluation model, evaluate a new sample according to the target sample attribute evaluation model and determine the evaluation result of the new sample, where the evaluation result includes the black sample score and/or attribute information of the new sample.

In an optional implementation, the training samples are the insurance data of claim applicants, and the black samples are the insurance data of fraudsters.
In an optional implementation, the first determining unit is specifically configured to: determine the first proportion of the nodes corresponding to all black samples in each community among the total nodes of that community, and use the first proportion as the black sample concentration of the community; or determine the second proportion of the nodes corresponding to all black samples in each community among the total nodes of the relationship graph, and use the second proportion as the black sample concentration of the community; or determine the third proportion of the nodes corresponding to all black samples in each community among the total nodes of that community and the fourth proportion of the total nodes of that community among the total nodes of the relationship graph, obtain the weighted average of the third proportion and the fourth proportion, and use the weighted average as the black sample concentration of the community.

In a third aspect, based on the same inventive concept as the sample attribute evaluation method of the foregoing embodiments, this specification further provides a server. As shown in FIG. 4, the server includes a memory 404, a processor 402, and a computer program stored in the memory 404 and executable on the processor 402, where the processor 402, when executing the program, implements the steps of any one of the sample attribute evaluation methods described above.

In FIG. 4, a bus architecture is represented by the bus 400. The bus 400 may include any number of interconnected buses and bridges and links together various circuits, including one or more processors represented by the processor 402 and the memory represented by the memory 404. The bus 400 may also link together various other circuits such as peripheral devices, voltage regulators and power management circuits, which are well known in the art and therefore are not described further here. The bus interface 406 provides an interface between the bus 400 and the receiver 401 and transmitter 403. The receiver 401 and the transmitter 403 may be the same element, namely a transceiver, providing a unit for communicating with various other devices over a transmission medium. The processor 402 is responsible for managing the bus 400 and general processing, and the memory 404 may be used to store data used by the processor 402 when performing operations.

In a fourth aspect, based on the inventive concept of the sample attribute evaluation of the foregoing embodiments, this specification further provides a computer-readable storage medium on which a computer program is stored, where the program, when executed by a processor, implements the steps of any one of the sample attribute evaluation methods described above.

This specification is described with reference to flowcharts and/or block diagrams of methods, devices (systems) and computer program products according to the embodiments of this specification. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions.
These computer program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.

These computer program instructions can also be stored in a computer-readable memory that can direct a computer or another programmable data processing device to work in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.

These computer program instructions can also be loaded onto a computer or another programmable data processing device, so that a series of operating steps is performed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.

Although preferred embodiments of this specification have been described, those skilled in the art can make additional changes and modifications to these embodiments once they have learned the basic inventive concept. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments and all changes and modifications falling within the scope of this specification. Obviously, those skilled in the art can make various changes and variations to this specification without departing from its spirit and scope. If such modifications and variations fall within the scope of the claims of this specification and their technical equivalents, this specification is also intended to include them.

301: first determining unit
302: second determining unit
303: training unit
400: bus
401: receiver
402: processor
403: transmitter
404: memory
406: bus interface

FIG. 1 is a schematic diagram of a sample attribute evaluation application scenario according to an embodiment of this specification;
FIG. 2 is a flowchart of the sample attribute evaluation method of the first aspect according to an embodiment of this specification;
FIG. 3 is a schematic structural diagram of the sample attribute evaluation model training device of the second aspect according to an embodiment of this specification;
FIG. 4 is a schematic structural diagram of the sample attribute evaluation server of the third aspect according to an embodiment of this specification.

Claims (19)

A sample attribute model training method, comprising: determining the black sample concentration of each community in a relationship graph corresponding to training samples, wherein the training samples comprise black samples and unknown samples; determining, based on the black sample concentration of each community, a white sample sampling probability for each of the unknown samples, and sampling with the white sample sampling probability of each of the unknown samples to obtain white samples; and training the black samples and the white samples based on a semi-supervised machine learning algorithm to obtain a target sample attribute evaluation model.

The method according to claim 1, wherein determining the black sample concentration of each community in the relationship graph corresponding to the training samples comprises: determining a first proportion of the nodes corresponding to all black samples in each community among the total nodes of that community, and using the first proportion as the black sample concentration of the community; or determining a second proportion of the nodes corresponding to all black samples in each community among the total nodes of the relationship graph, and using the second proportion as the black sample concentration of the community; or determining a third proportion of the nodes corresponding to all black samples in each community among the total nodes of that community and a fourth proportion of the total nodes of that community among the total nodes of the relationship graph, obtaining a weighted average of the third proportion and the fourth proportion, and using the weighted average as the black sample concentration of the community.

The method according to claim 1, wherein training the black samples and the white samples based on the semi-supervised machine learning algorithm to obtain the target sample attribute evaluation model comprises: training the black samples and the white samples based on the semi-supervised machine learning algorithm to obtain a sample attribute evaluation model; determining whether the sample attribute evaluation model satisfies a preset convergence condition; and if not, updating the black sample concentration of each community and continuing training based on the updated black sample concentration of each community and the semi-supervised machine learning algorithm until the trained sample attribute evaluation model satisfies the preset convergence condition, and taking the sample attribute evaluation model satisfying the preset convergence condition as the target sample attribute evaluation model.
4. The method according to claim 3, wherein determining whether the sample attribute evaluation model satisfies the preset convergence condition comprises:
evaluating each unknown sample based on the sample attribute evaluation model to obtain a current-round attribute evaluation result for each unknown sample, obtaining M current-round attribute evaluation results in total, where M is the number of unknown samples; and
determining, based on the M current-round attribute evaluation results and M previous-round attribute evaluation results, whether the sample attribute evaluation model satisfies the preset convergence condition.

5. The method according to claim 4, wherein evaluating each unknown sample based on the sample attribute evaluation model to obtain the current-round attribute evaluation result for each unknown sample comprises:
evaluating each unknown sample based on the sample attribute evaluation model to obtain a black sample score for each unknown sample, and if the black sample score is greater than a preset score, marking the attribute information of that unknown sample as a black sample, wherein the current-round attribute evaluation result of each unknown sample comprises the attribute information of that unknown sample.

6. The method according to claim 5, wherein determining, based on the M current-round attribute evaluation results and the M previous-round attribute evaluation results, whether the sample attribute evaluation model satisfies the preset convergence condition comprises:
determining whether the attribute information in the current-round attribute evaluation result of each unknown sample is consistent with the attribute information in the previous-round attribute evaluation result of that unknown sample, and if so, determining that the sample attribute evaluation model of the current round satisfies the preset convergence condition.

7. The method according to claim 5, wherein updating the black sample concentration of each community comprises:
determining, based on the M current-round attribute evaluation results and the M previous-round attribute evaluation results, the unknown samples whose attribute information has changed; and
recalculating the black sample concentration of the communities corresponding to the unknown samples whose attribute information has changed.

8. The method according to any one of claims 1 to 7, wherein the training samples are insurance data corresponding to persons applying for claims, and the black samples are insurance data corresponding to insurance fraudsters.
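Claims 3 to 7 describe an iterate-until-convergence loop around the learner. The sketch below shows that control flow only, under assumptions: train_fn and score_fn stand in for whatever semi-supervised learner and black sample scoring function are actually used, the 0.5 threshold plays the role of the "preset score" of claim 5, the concentration update uses the first-proportion variant, and sample_white is reused from the previous sketch.

```python
def train_until_converged(black_ids, unknown_ids, communities, node_community,
                          concentration, train_fn, score_fn,
                          threshold=0.5, max_rounds=20):
    """Iterative training loop in the spirit of claims 3-7 (illustrative only).

    train_fn(black_ids, white_ids) -> model
    score_fn(model, node_id)       -> black sample score in [0, 1]
    """
    prev_labels = {}
    model = None
    for _ in range(max_rounds):
        white_ids = sample_white(unknown_ids, node_community, concentration)
        model = train_fn(black_ids, white_ids)

        # Claim 5: mark an unknown sample as black when its score exceeds the preset value.
        labels = {n: ("black" if score_fn(model, n) > threshold else "white")
                  for n in unknown_ids}

        # Claim 6: converged when no unknown sample changed its attribute information.
        changed = [n for n in unknown_ids if labels[n] != prev_labels.get(n)]
        if not changed:
            break

        # Claim 7: recompute concentration only for communities holding changed samples.
        current_blacks = set(black_ids) | {n for n, lab in labels.items() if lab == "black"}
        for cid in {node_community[n] for n in changed}:
            nodes = communities[cid]
            concentration[cid] = sum(1 for n in nodes if n in current_blacks) / len(nodes)

        prev_labels = labels
    return model
```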
9. A sample attribute evaluation method, comprising:
evaluating a new sample by using a target sample attribute evaluation model trained by the method according to any one of claims 1 to 7, and determining an evaluation result of the new sample, wherein the evaluation result comprises a black sample score and/or attribute information of the new sample.

10. A sample attribute evaluation model training apparatus, comprising:
a first determining unit, configured to determine the black sample concentration of each community in a relationship graph corresponding to training samples, wherein the training samples comprise black samples and unknown samples;
a second determining unit, configured to determine, based on the black sample concentration of each community, a white sample sampling probability for each unknown sample, and to sample the unknown samples according to their white sample sampling probabilities to obtain white samples; and
a training unit, configured to train the black samples and the white samples based on a semi-supervised machine learning algorithm to obtain a target sample attribute evaluation model.

11. The apparatus according to claim 10, wherein the first determining unit is specifically configured to:
determine a first proportion of the nodes corresponding to all black samples in each community among the total nodes of that community, and take the first proportion as the black sample concentration of that community; or
determine a second proportion of the nodes corresponding to all black samples in each community among the total nodes of the relationship graph, and take the second proportion as the black sample concentration of that community; or
determine a third proportion of the nodes corresponding to all black samples in each community among the total nodes of that community and a fourth proportion of the total nodes of that community among the total nodes of the relationship graph, obtain a weighted average of the third proportion and the fourth proportion, and take the weighted average as the black sample concentration of that community.
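Claim 9 applies the trained target model to a newly arriving sample. A minimal usage sketch, again with the assumed score_fn and threshold from the previous sketch:

```python
def evaluate_new_sample(model, sample_id, score_fn, threshold=0.5):
    """Claim 9: return the black sample score and/or attribute information of a new sample."""
    score = score_fn(model, sample_id)
    return {"black_sample_score": score,
            "attribute_info": "black" if score > threshold else "white"}
```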
12. The apparatus according to claim 10, wherein the training unit is specifically configured to:
train the black samples and the white samples based on a semi-supervised machine learning algorithm to obtain a sample attribute evaluation model;
determine whether the sample attribute evaluation model satisfies a preset convergence condition; and
if not, update the black sample concentration of each community, and continue training based on the updated black sample concentration of each community and the semi-supervised machine learning algorithm until a trained sample attribute evaluation model satisfies the preset convergence condition, and take the sample attribute evaluation model that satisfies the preset convergence condition as the target sample attribute evaluation model.

13. The apparatus according to claim 12, wherein the training unit is specifically configured to:
evaluate each unknown sample based on the sample attribute evaluation model to obtain a current-round attribute evaluation result for each unknown sample, obtaining M current-round attribute evaluation results in total, where M is the number of unknown samples; and
determine, based on the M current-round attribute evaluation results and M previous-round attribute evaluation results, whether the sample attribute evaluation model satisfies the preset convergence condition.

14. The apparatus according to claim 13, wherein the training unit is specifically configured to:
evaluate each unknown sample based on the sample attribute evaluation model to obtain a black sample score for each unknown sample, and if the black sample score is greater than a preset score, mark the attribute information of that unknown sample as a black sample, wherein the current-round attribute evaluation result of each unknown sample comprises the attribute information of that unknown sample.

15. The apparatus according to claim 14, wherein the training unit is specifically configured to:
determine whether the attribute information in the current-round attribute evaluation result of each unknown sample is consistent with the attribute information in the previous-round attribute evaluation result of that unknown sample, and if so, determine that the sample attribute evaluation model of the current round satisfies the preset convergence condition.
16. The apparatus according to claim 14, wherein the training unit is specifically configured to:
determine, based on the M current-round attribute evaluation results and the M previous-round attribute evaluation results, the unknown samples whose attribute information has changed; and
recalculate the black sample concentration of the communities corresponding to the unknown samples whose attribute information has changed.

17. A sample attribute evaluation apparatus, comprising:
an evaluation unit, configured to evaluate a new sample by using a target sample attribute evaluation model trained by the apparatus according to any one of claims 10 to 16, and to determine an evaluation result of the new sample, wherein the evaluation result comprises a black sample score and/or attribute information of the new sample.

18. A server, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the method according to any one of claims 1 to 9.

19. A computer-readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 9.

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811015607.1 2018-08-31
CN201811015607.1A CN109325525A (en) 2018-08-31 2018-08-31 Sample attribute assessment models training method, device and server

Publications (2)

Publication Number Publication Date
TW202011285A true TW202011285A (en) 2020-03-16
TWI726341B (en) 2021-05-01

Family

ID=65263715

Family Applications (1)

Application Number Title Priority Date Filing Date
TW108122547A TWI726341B (en) 2018-08-31 2019-06-27 Sample attribute evaluation model training method, device, server and storage medium

Country Status (3)

Country Link
CN (1) CN109325525A (en)
TW (1) TWI726341B (en)
WO (1) WO2020042795A1 (en)


Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109325525A (en) * 2018-08-31 2019-02-12 阿里巴巴集团控股有限公司 Sample attribute assessment models training method, device and server
CN110020670B (en) * 2019-03-07 2023-07-18 创新先进技术有限公司 Model iteration method, device and equipment
CN110311902B (en) * 2019-06-21 2022-04-22 北京奇艺世纪科技有限公司 Abnormal behavior identification method and device and electronic equipment
CN110335140B (en) * 2019-06-27 2021-09-24 上海淇馥信息技术有限公司 Method and device for predicting loan black intermediary based on social relationship, and electronic equipment
CN110807643A (en) * 2019-10-11 2020-02-18 支付宝(杭州)信息技术有限公司 User trust evaluation method, device and equipment
CN111881289B (en) * 2020-06-10 2023-09-08 北京启明星辰信息安全技术有限公司 Training method of classification model, and detection method and device of data risk class
CN111709833B (en) * 2020-06-16 2023-10-31 中国银行股份有限公司 User credit assessment method and device
CN111931912A (en) * 2020-08-07 2020-11-13 北京推想科技有限公司 Network model training method and device, electronic equipment and storage medium
CN112231929B (en) * 2020-11-02 2024-04-02 北京空间飞行器总体设计部 Evaluation scene large sample generation method based on track parameters
CN113343051B (en) * 2021-06-04 2024-04-16 全球能源互联网研究院有限公司 Abnormal SQL detection model construction method and detection method
CN113779150A (en) * 2021-09-14 2021-12-10 杭州数梦工场科技有限公司 Data quality evaluation method and device
CN116579651B (en) * 2023-05-11 2023-11-10 中国矿业报社 Mining project evaluation method based on semi-supervised learning

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7983490B1 (en) * 2007-12-20 2011-07-19 Thomas Cecil Minter Adaptive Bayes pattern recognition
US9053391B2 (en) * 2011-04-12 2015-06-09 Sharp Laboratories Of America, Inc. Supervised and semi-supervised online boosting algorithm in machine learning framework
US8838688B2 (en) * 2011-05-31 2014-09-16 International Business Machines Corporation Inferring user interests using social network correlation and attribute correlation
US9754215B2 (en) * 2012-12-17 2017-09-05 Sinoeast Concept Limited Question classification and feature mapping in a deep question answering system
CN105468742B (en) * 2015-11-25 2018-11-20 小米科技有限责任公司 The recognition methods of malice order and device
CN106960154A (en) * 2017-03-30 2017-07-18 兴华永恒(北京)科技有限责任公司 A kind of rogue program dynamic identifying method based on decision-tree model
CN107273454B (en) * 2017-05-31 2020-11-03 北京京东尚科信息技术有限公司 User data classification method, device, server and computer readable storage medium
CN107368892B (en) * 2017-06-07 2020-06-16 无锡小天鹅电器有限公司 Model training method and device based on machine learning
CN107730262B (en) * 2017-10-23 2021-09-24 创新先进技术有限公司 Fraud identification method and device
CN107798390B (en) * 2017-11-22 2023-03-21 创新先进技术有限公司 Training method and device of machine learning model and electronic equipment
CN108334647A (en) * 2018-04-12 2018-07-27 阿里巴巴集团控股有限公司 Data processing method, device, equipment and the server of Insurance Fraud identification
CN109325525A (en) * 2018-08-31 2019-02-12 阿里巴巴集团控股有限公司 Sample attribute assessment models training method, device and server

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI746095B (en) * 2020-05-28 2021-11-11 旺宏電子股份有限公司 Classification model training using diverse training source and inference engine using same
US11775822B2 (en) 2020-05-28 2023-10-03 Macronix International Co., Ltd. Classification model training using diverse training source and inference engine using same
TWI792560B (en) * 2020-12-25 2023-02-11 日商樂天集團股份有限公司 Information processing device and information processing method
TWI771098B (en) * 2021-07-08 2022-07-11 國立陽明交通大學 Fault diagnosis system and method for state of radar system of roadside units

Also Published As

Publication number Publication date
WO2020042795A1 (en) 2020-03-05
CN109325525A (en) 2019-02-12
TWI726341B (en) 2021-05-01

Similar Documents

Publication Publication Date Title
TWI726341B (en) Sample attribute evaluation model training method, device, server and storage medium
AU2021200434B2 (en) Optimizing Neural Networks For Risk Assessment
TWI712981B (en) Risk identification model training method, device and server
WO2023065545A1 (en) Risk prediction method and apparatus, and device and storage medium
TWI707281B (en) Data processing method, device, equipment and server for insurance fraud identification
US10346782B2 (en) Adaptive augmented decision engine
WO2019218748A1 (en) Insurance service risk prediction processing method, device and processing equipment
CN106021377A (en) Information processing method and device implemented by computer
US11900382B2 (en) Method and system for detecting fraudulent transactions
CN112580733B (en) Classification model training method, device, equipment and storage medium
KR102144126B1 (en) Apparatus and method for providing information for enterprise
WO2020253038A1 (en) Model construction method and apparatus
CN110363636A (en) Risk of fraud recognition methods and device based on relational network
CN113360580A (en) Abnormal event detection method, device, equipment and medium based on knowledge graph
CN109003091A (en) A kind of risk prevention system processing method, device and equipment
CN111340612A (en) Account risk identification method and device and electronic equipment
WO2019019346A1 (en) Asset allocation strategy acquisition method and apparatus, computer device, and storage medium
CN113240177B (en) Method for training prediction model, prediction method, device, electronic equipment and medium
WO2023143570A1 (en) Connection relationship prediction method and related device
CN115099875A (en) Data classification method based on decision tree model and related equipment
CN114844889B (en) Video processing model updating method and device, electronic equipment and storage medium
CN110348190A (en) User equipment ownership judgment method and device based on user's operation behavior
CN114625899B (en) Information processing method, information processing device, electronic equipment and storage medium
CN114565030B (en) Feature screening method and device, electronic equipment and storage medium
Kuznietsova et al. Adaptive Approach to Building Risk Models of Financial Systems.