TWI766626B - Grouping system and method thereof - Google Patents

Grouping system and method thereof

Info

Publication number
TWI766626B
TWI766626B
Authority
TW
Taiwan
Prior art keywords
similarity
sample
matrix
customer
grouping
Prior art date
Application number
TW110110606A
Other languages
Chinese (zh)
Other versions
TW202143117A (en)
Inventor
李懷松
陳永環
Original Assignee
大陸商支付寶(杭州)信息技術有限公司
Priority date
Filing date
Publication date
Application filed by 大陸商支付寶(杭州)信息技術有限公司
Publication of TW202143117A
Application granted
Publication of TWI766626B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G06F18/23 Clustering techniques
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

This invention discloses a grouping system and a method thereof. The grouping system includes: a sample feature layer for obtaining the feature vectors of samples; an encoding layer, implemented with a neural network model containing an attention mechanism, for encoding the feature vectors of the samples; a similarity calculation layer for computing the pairwise similarity between samples from the encoded feature vectors output by the encoding layer; and a grouping layer for dividing the samples into groups according to the pairwise similarities, wherein two samples whose similarity is greater than a predetermined threshold are assigned to one class.

Description

Grouping system and method thereof

The present invention relates to the technical field of computer data processing, and in particular to grouping (clustering) technology.

In recent years, money-laundering patterns in the anti-money-laundering field have usually been identified by grouping. Current clustering algorithms, such as K-means, density-based clustering, spectral clustering, and hierarchical clustering, generally depend on the choice of parameters such as the distance function, the number of classes, the radius, and the minimum class size, and different parameter choices lead to very different clustering results.

This specification provides a grouping system and method that require no parameter selection and achieve higher classification efficiency and accuracy. The present invention discloses a grouping system, comprising: a sample feature layer for obtaining the feature vectors of samples; an encoding layer, implemented with a neural network model containing an attention mechanism, for encoding the feature vectors of the samples; a similarity calculation layer for computing the pairwise similarity between samples from the encoded feature vectors output by the encoding layer; and a grouping layer for dividing the samples into groups according to the pairwise similarities, two samples whose similarity exceeds a predetermined threshold being assigned to one class. In a preferred embodiment, the neural network model containing the attention mechanism is a Transformer. In a preferred embodiment, the encoding layer includes a plurality of Transformers stacked in sequence. In a preferred embodiment, the samples are training samples, and the grouping system further includes a loss function calculation layer for computing a loss function from the pairwise similarities between the training samples output by the similarity calculation layer and the known true similarities between the training samples; the loss function is used to update the parameters of the neural network model. In a preferred embodiment, the sample feature layer is further configured to filter out the training samples that satisfy given conditions according to cleaning rules appropriate to the application scenario.
In a preferred embodiment, the features are one of the following or any combination thereof: customer age, customer gender, whether the customer is married, the customer's monthly inflow amount, the customer's monthly outflow amount, customer address, customer occupation, customer location, and business type. In a preferred embodiment, the sample feature layer is further used to obtain the true class label corresponding to each training sample, and the loss function calculation layer is further used to obtain, from those true class labels, the true similarity between each training sample and the other training samples. The present invention further discloses a grouping method, comprising: obtaining the feature vectors of samples; encoding the feature vectors of the samples with a neural network model containing an attention mechanism; computing the pairwise similarity between samples from the encoded feature vectors; and dividing the samples into groups according to the pairwise similarities, two samples whose similarity exceeds a predetermined threshold being assigned to one class. In a preferred embodiment, the neural network model containing the attention mechanism is a Transformer. In a preferred embodiment, the model includes a plurality of Transformers stacked in sequence. In a preferred embodiment, the samples are training samples, and the grouping method further includes: computing a loss function from the pairwise similarities between the training samples output by the similarity calculation layer and the known true similarities between the training samples, and updating the parameters of the neural network model with the loss function. In a preferred embodiment, before the feature vectors of the samples are obtained, the method further includes filtering out the training samples that satisfy given conditions according to cleaning rules appropriate to the application scenario. In a preferred embodiment, the features are one of the following or any combination thereof: customer age, customer gender, whether the customer is married, the customer's monthly inflow amount, the customer's monthly outflow amount, customer address, customer occupation, customer location, and business type. In a preferred embodiment, obtaining the feature vectors of the samples further includes obtaining the true class label corresponding to each training sample, and computing the loss function further includes obtaining, from those true class labels, the true similarity between each training sample and the other training samples. The embodiments of this specification require no parameter selection and achieve higher efficiency and accuracy of customer grouping. A large number of technical features are recorded in this specification and distributed across the various technical solutions; listing every possible combination of the technical features of the present invention (i.e., every technical solution) would make the description excessively long.
To avoid this problem, the technical features disclosed in the summary above, in the embodiments and examples below, and in the drawings may be freely combined with one another to form new technical solutions (all of which should be regarded as having been described in this specification), unless such a combination is technically infeasible. For example, if one example discloses features A+B+C and another discloses features A+B+D+E, where features C and D are equivalent technical means serving the same purpose so that only one of them can be used at a time, while feature E can technically be combined with feature C, then the solution A+B+C+D should not be regarded as having been described, because it is technically infeasible, whereas the solution A+B+C+E should be regarded as having been described.

In the following description, numerous technical details are set forth to help the reader better understand the present invention. Those skilled in the art will appreciate, however, that the claimed technical solutions can be implemented without these details and with various changes and modifications to the embodiments below.

Explanation of terms:

First matrix X: the original matrix of the training samples. For example, with 500 training samples of 10 features each, the first matrix is a 500×10 matrix.

Second matrix X_2: the matrix of first similarities between each training sample and the other training samples, obtained from the first matrix through the attention mechanism.

Third matrix X_3: the "new matrix" derived from the second matrix X_2, namely the matrix obtained by multiplying X_2 with the corresponding value matrix. With 500 training samples, for example, the third matrix X_3 is a 500×M matrix, where the size of M is set by a parameter; X_3 may be a 500×20 matrix, a 500×100 matrix, and so on. The feature vectors in X_3 therefore carry the similarity relationships between each training sample and the other training samples.

Fourth matrix X_4: the matrix obtained by multiplying the third query matrix corresponding to X_3 with the transpose of the third key matrix and applying a sigmoid transformation. X_4 represents the second similarity between each training sample and the other training samples; with 500 training samples, X_4 is a 500×500 matrix.

Fifth matrix X_5: a matrix constructed from the true classes of the training samples. Each row of X_5 gives the similarity between the true class of the current training sample and the true classes of the other training samples, i.e., the true similarity: the value is 1 if the two classes are the same and 0 otherwise.

Features: information describing customer attributes, for example the customer's age, gender, marital status, monthly inflow amount, monthly outflow amount, and so on. In one embodiment of this specification, the features may further include the customer's address, occupation (employer, position, length of employment, income, etc.), region (for example, whether the customer comes from a country or region where money laundering, fraud, and similar activities are rampant), business type (for example, transaction type or flow of funds), and various other indicators.

Class: a division of a batch of customers into different categories in a particular way. For example, 500 customers may be divided into 10 classes, where the first class is related to cashing out, the second class is related to fraud, and so on.
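For orientation only, the shapes of the matrices defined above can be summarized in a short sketch. It is a non-limiting illustration in Python/NumPy; the array contents are placeholders, and the sizes (500 samples, 10 raw features, M = 20) simply follow the examples in the preceding paragraphs.

    import numpy as np

    N, D, M = 500, 10, 20        # samples (customers), raw features, encoded width M

    X   = np.random.rand(N, D)   # first matrix X: one row per training sample
    X_2 = np.random.rand(N, N)   # second matrix X_2: "first" similarity of every sample pair
    X_3 = np.random.rand(N, M)   # third matrix X_3: re-encoded features carrying pairwise relations
    X_4 = np.random.rand(N, N)   # fourth matrix X_4: sigmoid-scaled "second" similarity of every pair
    labels = np.random.randint(0, 10, size=N)                  # placeholder true class label per sample
    X_5 = (labels[:, None] == labels[None, :]).astype(float)   # fifth matrix X_5: 1 if same true class, else 0

    print(X.shape, X_2.shape, X_3.shape, X_4.shape, X_5.shape)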
The embodiments of this specification are described in further detail below with reference to the drawings. The first embodiment of this specification relates to a grouping system which, as shown in FIG. 1, includes: a sample feature layer for obtaining the feature vectors of samples (in some embodiments, the sample feature layer is further used to filter out the training samples that satisfy given conditions according to cleaning rules appropriate to the application scenario); an encoding layer, implemented with a neural network model containing an attention mechanism, for encoding the feature vectors of the samples (in some embodiments the model is a Transformer model, and in some embodiments the encoding layer includes a plurality of Transformer models stacked in sequence); a similarity calculation layer for computing the pairwise similarity between samples from the encoded feature vectors output by the encoding layer; a grouping layer for dividing the samples into groups according to the pairwise similarities, two samples whose similarity exceeds a predetermined threshold being assigned to one class; and a loss function calculation layer for computing, from the pairwise similarities between the training samples output by the similarity calculation layer and the known true similarities between those samples, a loss function used to update the parameters of the neural network model.

Each of these layers is described in detail below.

The sample feature layer obtains the feature vectors of the training samples and forms the first matrix X from them. Specifically, the sample feature layer prepares the training samples and their features, where the features of each customer may be cleaned according to the specific application scenario; this yields the first matrix X, whose vertical axis is the number of customers N and whose horizontal axis is the feature dimension, together with an N-dimensional class label vector Y. More specifically, in one embodiment the sample feature layer forms the first matrix X from a plurality of training samples, each row of X corresponding to the feature vector of one training sample. Further, the sample feature layer first obtains a plurality of training samples together with the true class label corresponding to each training sample, and then forms the first matrix X from the feature vectors of the training samples. It should be noted that the "features" can be chosen according to the application scenario. One embodiment is applied to an anti-money-laundering scenario, i.e., identifying customers who are or are not suspected of money laundering, or identifying the type of money laundering. The features used may be, for example: customer age, customer gender, whether the customer is married, the customer's monthly inflow amount, the customer's monthly outflow amount, customer address, customer occupation, customer location, business type, and so on.
Customer occupation may further include the customer's employer, position, length of employment, income, and so on; customer location may further include whether the customer comes from a country or region where money laundering, fraud, and similar activities are rampant; business type may further include transaction type, flow of funds, and so on. Further, in the embodiments of this specification, the sample feature layer may clean these features according to the specific application scenario; data cleaning can be understood as filtering out the training samples that satisfy given conditions according to cleaning rules appropriate to the application scenario. For example, to select the training samples whose customer income exceeds 10 million, the sample feature layer sets the corresponding feature to 1 if the income exceeds 10 million and to 0 otherwise, and then filters the samples on that feature; likewise, to select the training samples whose place of operation is in the Golden Triangle, it sets the corresponding feature to 1 for those samples and to 0 otherwise, and filters on it. The sample feature layer then forms the first matrix X from the features of the training samples, the vertical axis of the matrix being the number of training samples (the number of customers N) and the horizontal axis being the feature dimension. For example, with 500 training samples (500 customers) of 10 features each, the first matrix X has 500 rows of 10 features each, i.e., it is a 500×10 matrix. Further, as described above, the sample feature layer also obtains the true class label corresponding to each training sample, i.e., the label of the class to which the corresponding customer actually belongs, forming an N-dimensional class label vector Y. For example, if samples related to cashing out form the first class, samples related to fraud form the second class, and so on, then in the class label vector Y the entries of the first class (cashing out) take the value 1, the entries of the second class (fraud) take the value 2, and so on.
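As an illustration of the sample feature layer just described, the following sketch builds the first matrix X and the class label vector Y from a few hypothetical customer records and applies a cleaning rule in the spirit of the income example above; the field names, the 10-million threshold, and the labels are assumptions made only for this sketch.

    import numpy as np

    # Hypothetical raw customer records; fields and values are illustrative only.
    customers = [
        {"age": 35, "married": 1, "monthly_in": 12_000_000, "monthly_out": 9_500_000, "high_risk_region": 1, "label": 1},
        {"age": 52, "married": 0, "monthly_in": 800_000, "monthly_out": 200_000, "high_risk_region": 0, "label": 2},
    ]

    def clean(record):
        # Cleaning rule in the spirit of the text: binarize the screening condition.
        record = dict(record)
        record["income_gt_10m"] = 1 if record["monthly_in"] > 10_000_000 else 0
        return record

    cleaned = [clean(c) for c in customers]
    kept = [c for c in cleaned if c["income_gt_10m"] == 1]    # keep only samples that satisfy the rule

    feature_keys = ["age", "married", "monthly_in", "monthly_out", "high_risk_region", "income_gt_10m"]
    X = np.array([[c[k] for k in feature_keys] for c in kept], dtype=float)  # first matrix X, one row per sample
    Y = np.array([c["label"] for c in kept])                                 # N-dimensional class label vector Y
    print(X.shape, Y)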
The encoding layer is implemented with a neural network model containing an attention mechanism and encodes the feature vectors of the samples. Specifically, the feature vectors of the training samples form the first matrix, and the similarity relationships between each training sample and the other training samples form the third matrix. More specifically, as shown in FIG. 2, in one embodiment the encoding layer uses a Transformer module as the neural network model containing the attention mechanism, also called the Transformer model (hereinafter simply "Transformer"): it feeds the first matrix into the Transformer model, obtains through the attention mechanism a second matrix representing the first similarity between each training sample and the other training samples, and then multiplies the second matrix by the corresponding value matrix to obtain a third matrix that carries the similarity relationships between each training sample and the others. In other words, the feature vectors in the third matrix X_3 contain the similarity relationships between each training sample and the other training samples.

It should be pointed out that the Transformer module encodes the vectors of the first matrix through the Transformer network structure: the Transformer uses attention to compute the similarity between each training sample in the first matrix and the other training samples, i.e., between each customer and the other customers. This similarity, the product of the query matrix Q (hereinafter "Q") and the key matrix K (hereinafter "K") in the attention computation, can be regarded as a non-linear way of computing the similarity between one customer and the other customers, which is the distance computation between customers that matters most in grouping. The Transformer module then multiplies this similarity by the value matrix V (hereinafter "V") to obtain new vectors, which form the third matrix; the third matrix obtained in this way therefore contains the similarity relationships between each training sample and the other training samples. The structure of the Transformer module is shown in FIG. 3 and mainly comprises a multi-head attention module, a residual-connection and normalization (Add & Norm) module, and a feed-forward module. The multi-head attention module processes the first matrix X with several (K, Q, V) triples, so that different heads capture different degrees of importance and a sample itself is prevented from having an excessive influence on its vector. The Add & Norm module counteracts exploding or vanishing gradients and network degradation; normalization speeds up learning and convergence while reducing overfitting. The feed-forward module is an ordinary fully connected network that applies a further non-linear transformation to the attention vectors. In the embodiments of this specification the Transformer may be a multi-layer stack; this embodiment uses a three-layer stack, but layers may be added or removed in other embodiments according to the training results.

The Transformer module is further explained through a specific embodiment. The Transformer module feeds the first matrix X into the neural network model containing the attention mechanism, obtains through the attention mechanism the second matrix X_2 corresponding to the first similarity between each training sample and the other training samples, and then multiplies X_2 by the corresponding value matrix to obtain the third matrix X_3. Specifically, the Transformer module first uses the attention mechanism to compute the first similarity between each training sample in the first matrix and the other training samples, obtaining the second matrix X_2 corresponding to that first similarity. Under the attention mechanism, the first similarity in the second matrix X_2 is the product of the first query matrix Q corresponding to the first matrix X and the first key matrix K, computed as

second matrix X_2 = first query matrix Q × first key matrix K,

where

first query matrix Q = first matrix X × first query weight matrix WQ,
first key matrix K = first matrix X × first key weight matrix WK.

The attention mechanism computes the first similarity between each training sample and the other training samples in a non-linear way; in other words, the first similarity can be regarded as analogous to the distance between each customer and the other customers in a grouping. The matrix of first similarities, i.e., the second matrix X_2, is then multiplied by the corresponding second value matrix V_2 to obtain the third matrix X_3:

third matrix X_3 = second matrix X_2 × second value matrix V_2,

where

second value matrix V_2 = second matrix X_2 × second value weight matrix WV_2.

This yields the new feature vectors in the third matrix X_3. The number of rows of X_3 equals the number of training samples, and the number of columns can be adjusted by setting a parameter; X_3 may be, for example, a 500×20 or a 500×100 matrix. Through the Transformer module, the attention mechanism encodes the feature vectors of the first matrix X, producing the second matrix X_2 of first similarities between each training sample and the others and, by multiplying X_2 with the second value matrix V_2, the third matrix X_3, i.e., new feature vectors. The benefit is that the new feature vectors in X_3 are determined jointly by each training sample and the other training samples: they contain the relationships between each training sample and the others, which helps the subsequent similarity computation. Moreover, because learning the values of the weight matrix WQ amounts to selecting the features within each sample, this approach also handles the feature selection of the clustering algorithm automatically. According to the embodiments of this specification, the encoding layer may be implemented with any neural network model containing an attention mechanism, such as the Transformer module above or another attention-based model. When the Transformer module is used, the model may adopt either single-head or multi-head attention. With multi-head attention, several (K, Q, V) matrix triples process the sample matrix X so that different heads yield different similarities, which prevents the training sample itself from having an excessive influence on the new vectors obtained; this is the multi-head attention mechanism, which is well known to those skilled in the art and is not described further here.
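The encoding computation just described might be sketched as follows for a single head, using NumPy with random placeholder weights. The text writes X_2 = Q × K; the sketch uses Q·Kᵀ so that the shapes are consistent with an N×N similarity matrix, and, following the text, it derives V_2 from X_2 rather than from X and omits the softmax and scaling of a standard Transformer. This is one reading of the description, not a definitive implementation.

    import numpy as np

    rng = np.random.default_rng(0)
    N, D, M = 500, 10, 20                  # samples, raw features, encoded width

    X    = rng.normal(size=(N, D))         # first matrix X
    W_Q  = rng.normal(size=(D, D)) * 0.1   # first query weight matrix WQ
    W_K  = rng.normal(size=(D, D)) * 0.1   # first key weight matrix WK
    W_V2 = rng.normal(size=(N, M)) * 0.1   # second value weight matrix WV_2 (applied to X_2, as in the text)

    Q   = X @ W_Q                          # first query matrix Q
    K   = X @ W_K                          # first key matrix K
    X_2 = Q @ K.T                          # second matrix X_2: pairwise "first" similarity (N x N)
    V_2 = X_2 @ W_V2                       # second value matrix V_2 = X_2 x WV_2
    X_3 = X_2 @ V_2                        # third matrix X_3: new feature vectors (N x M)

    print(X_3.shape)                       # (500, 20)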
The similarity calculation layer determines, from the similarity relationship between each training sample and the other training samples, the second similarity between each training sample and the others, and is further used to divide the training samples into groups according to that second similarity. Here the similarity relationship between each training sample and the others is the third matrix X_3, and the second similarity between each training sample and the others is the fourth matrix X_4. Specifically, the similarity calculation layer multiplies the third query matrix Q_3 corresponding to the third matrix X_3 by the transpose of the third key matrix K_3 and applies a sigmoid transformation to obtain the fourth matrix X_4, which represents the second similarity between each training sample and the other training samples. In other words, by borrowing the attention computation, the similarity calculation layer computes the second similarity between each training sample and the other training samples; each training sample and the two training samples having the largest second-similarity values with it are assigned to one class, which completes the grouping. The process is as follows. First, the similarity relationships output by the Transformer module, i.e., the training-sample vectors of the third matrix X_3, are multiplied by the query weight matrix WQ_3 and the key weight matrix WK_3 to obtain the query matrix Q_3 (hereinafter "Q_3") and the key matrix K_3 (hereinafter "K_3"): Q_3 = X_3 × WQ_3 and K_3 = X_3 × WK_3. Then Q_3 is multiplied by the transpose of K_3 and passed through a sigmoid transformation, giving an N×N matrix R, where N is the number of samples; each row gives the similarity between the current sample Xi and the other samples Xj (j = 1, 2, 3, ..., N). The sigmoid converts the similarity into a value between 0 and 1: the larger the value, the greater the similarity and the more likely the two samples belong to the same class. Then, for each row, the columns j whose similarity exceeds a certain threshold (for example 0.5) are found, of which there may be several, and Xi and Xj are assigned to one class; performing the same operation from the first row to the Nth row completes the division of the N samples into classes.

The similarity calculation layer is further explained through an embodiment. In this embodiment the similarity calculation layer multiplies the third query matrix Q_3 corresponding to the third matrix X_3 by the transpose of the third key matrix K_3 and applies a sigmoid transformation to obtain the fourth matrix X_4, which represents the second similarity between each training sample and the other training samples; thus, from the third matrix X_3 and in a manner similar to the attention mechanism, the fourth matrix X_4 representing the second similarity is obtained. Further, the similarity calculation layer assigns to one class the two training samples corresponding to the largest second similarity in the fourth matrix X_4, or to a second similarity that exceeds a preset threshold, thereby dividing the training samples. For example, if the value at row i, column j of X_4, i.e., the second similarity between training sample Xi and training sample Xj, exceeds the threshold, then Xi and Xj are assigned to one class. The specific procedure is as follows. First, the similarity calculation layer multiplies the third matrix X_3 output by the Transformer module by the third query weight matrix WQ_3 and the third key weight matrix WK_3, respectively, obtaining the third query matrix Q_3 and the third key matrix K_3:

third query matrix Q_3 = third matrix X_3 × third query weight matrix WQ_3,
third key matrix K_3 = third matrix X_3 × third key weight matrix WK_3.

Then Q_3 is multiplied by the transpose of K_3 and passed through a sigmoid transformation to obtain an N×N fourth matrix X_4, where N is the number of training samples; with 500 training samples, for example, X_4 is a 500×500 matrix. In the fourth matrix X_4, each row gives the second similarity between the current training sample Xi and the other training samples Xj (j = 1, 2, 3, ..., N). The second similarity is converted by the sigmoid function into a value between 0 and 1: the larger the value, the greater the second similarity and the more likely the two corresponding training samples belong to the same class. The sigmoid function, also called the logistic function or S-shaped growth curve, is monotonically increasing and has a monotonically increasing inverse; it is commonly used as an activation function of neural networks, for example for hidden-layer outputs, and maps a real number into the interval (0, 1). It is a well-known technique in the art and is not described further here. The grouping layer then classifies the training samples according to the fourth matrix X_4. Specifically, a threshold for the second similarity is set in advance, and the two training vectors corresponding to every second similarity in X_4 that exceeds this threshold are assigned to one class. For example, suppose each row i of X_4 corresponds to a training sample Xi and column j holds the second similarity between training sample Xj and Xi; whenever the value in column j exceeds the threshold, the similarity between Xi and Xj meets the preset requirement, and in that case Xi and Xj are assigned to one class. Performing the same operation from the first row of X_4 to the Nth row (N being the number of training samples) completes the division of the N training samples into classes and yields the grouping result. For example, 500 training samples may be divided into 10 classes, class 1 to class 10, according to the second similarity between them. Class 1 might, for example, be a "cash-out" class, class 2 a "gambling" class, and so on.
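One possible reading of the similarity calculation layer and the grouping layer described above is sketched below; the weight matrices are random placeholders, and merging every above-threshold pair (a simple union-find) is one way to realize "performing the same operation from the first row to the Nth row".

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def group_samples(X_3, W_Q3, W_K3, threshold=0.5):
        # Similarity layer: R = sigmoid(Q_3 · K_3ᵀ), values in (0, 1).
        Q_3 = X_3 @ W_Q3                   # third query matrix Q_3
        K_3 = X_3 @ W_K3                   # third key matrix K_3
        R = sigmoid(Q_3 @ K_3.T)           # fourth matrix X_4 / similarity matrix R

        # Grouping layer: samples linked by an above-threshold similarity end up in one class.
        n = R.shape[0]
        parent = list(range(n))
        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i
        for i in range(n):
            for j in range(n):
                if i != j and R[i, j] > threshold:
                    parent[find(i)] = find(j)
        return np.array([find(i) for i in range(n)]), R

    rng = np.random.default_rng(1)
    X_3 = rng.normal(size=(500, 20))       # e.g. the output of the encoding sketch above
    groups, R = group_samples(X_3, rng.normal(size=(20, 20)) * 0.1, rng.normal(size=(20, 20)) * 0.1)
    print(len(set(groups.tolist())), "classes")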
The loss function calculation layer computes a loss function from the second similarity between each training sample and the other training samples and the true similarity between those samples; the loss function is used to update the neural network model. Further, the loss function calculation layer obtains, from the true class labels corresponding to the training samples, the true similarity between each training sample and the other training samples. Specifically, for the fourth matrix X_4 obtained by the similarity calculation layer, i.e., the N×N second-similarity matrix R, each entry in a row is the similarity Rij between sample Xi and another sample Xj. On the other hand, an N×N matrix T is constructed from the true class label of each training sample: each row represents one training sample and each column within that row represents another training sample; if the two samples belong to the same class, Tij (the similarity) is set to 1, and if they are not in the same class it is set to 0.
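The construction of the true-similarity matrix T and the squared-error loss restated in the next paragraph can be written compactly as below; the labels are hypothetical and the sketch only mirrors the definitions given above.

    import numpy as np

    def true_similarity(labels):
        # Fifth matrix X_5 / matrix T: T[i, j] = 1 if samples i and j share a true class, else 0.
        labels = np.asarray(labels)
        return (labels[:, None] == labels[None, :]).astype(float)

    def squared_error_loss(R, T):
        # Loss = sum over i, j of (R_ij - T_ij)^2
        return float(np.sum((R - T) ** 2))

    labels = np.array([0, 0, 1, 2, 1])     # hypothetical true class labels for five samples
    T = true_similarity(labels)
    R = np.random.rand(5, 5)               # stand-in for the second-similarity matrix from the layer above
    print(squared_error_loss(R, T))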
In this case, the loss function is the squared error between the fourth matrix X_4 and the fifth matrix X_5, that is, between the second-similarity matrix R and the true-similarity matrix T:

Loss = ΣΣ(Rij − Tij)²,

where Rij is the second similarity between training sample Xi and training sample Xj in the fourth matrix X_4 (the second-similarity matrix R), and Tij is the true similarity between them in the fifth matrix X_5 (the true-similarity matrix T). The neural network model is then updated according to this loss: the loss function is differentiated with respect to each parameter, and the resulting derivatives are used to update the parameter values. Updating model parameters with a loss function is common knowledge to those skilled in the art and is not described further here. The implementation of the loss function calculation layer may also be described through an embodiment: from the true class labels corresponding to the training samples, the layer obtains a fifth matrix X_5 representing the true similarity between each training sample and the other training samples, determines the loss function from X_5 and X_4, and updates the neural network model according to the loss, so that the model can be used to process the customer feature data to be analyzed. Specifically, the layer has the N×N fourth matrix X_4 (the second-similarity matrix R), whose entries Rij give the similarity between the current training sample Xi and the other training samples Xj; from the true class labels it constructs the N×N fifth matrix X_5 (the true-similarity matrix T), in which each row corresponds to the current training sample Ti and each column indicates whether the true class of Ti is the same as the true class of another training sample Tj, a training sample possibly sharing its class with several others. For example, if the true class label of the sample in row i of T equals the true class label of the sample represented by column j, the two belong to the same class and Tij (the true similarity) is set to 1; otherwise Tij is set to 0.

The technical effects of the above embodiment include the following. First, there is no need to set parameters such as the initial cluster centers and the number of clusters required by the k-means algorithm, or the minimum number of class members required by the DBSCAN algorithm, so the large differences in grouping results caused by different parameter choices in existing grouping methods are avoided. Further, learning the network parameters through a supervised loss function continuously improves the accuracy of the grouping model. In addition, clustering algorithms such as k-means and DBSCAN each suit particular data scenarios (for example scattered or manifold-shaped data), whereas the present algorithm is not restricted to such scenarios and thus effectively avoids limitations on the application scenario.

The second embodiment of this specification relates to a grouping method, shown in FIG. 4, which includes the following steps. At step 402, the feature vector of a sample is obtained. At step 404, a neural network model containing an attention mechanism is used to encode the feature vector of the sample; in some embodiments the model is a Transformer model, and in some embodiments the encoding layer includes a plurality of Transformer models stacked in sequence. At step 406, the pairwise similarity between samples is computed from the encoded feature vectors. At step 408, the samples are divided into groups according to the pairwise similarities, two samples whose similarity exceeds a predetermined threshold being assigned to one class. Optionally, in some embodiments, the method further includes the following training steps for the neural network; in that case the samples obtained in step 402 are pre-labelled training samples, and the true class label corresponding to each training sample (pre-labelled) is also obtained in step 402. At step 410, a loss function is computed from the pairwise similarities between the training samples output by the similarity calculation layer and the known true similarities between the training samples; for example, the true similarity between each training sample and the other training samples is obtained from the true class labels corresponding to the training samples.
At step 412, the parameters of the neural network model are updated using the loss function. In some embodiments, the training steps 410 and 412 are optional: if the neural network model is already trained, it can be used directly without further training. Optionally, in one embodiment, training samples that satisfy given conditions are filtered out according to cleaning rules appropriate to the application scenario. The grouping method of this embodiment can be applied in anti-money-laundering scenarios to identify whether a customer is suspected of money laundering and the type of money laundering involved. The features used may include, for example: customer age, customer gender, whether the customer is married, the customer's monthly inflow amount, the customer's monthly outflow amount, customer address, customer occupation, customer location, business type, and so on; customer occupation may further include employer, position, length of employment, income, and so on; customer location may further include whether the customer comes from a country or region where money laundering, fraud, and similar activities are rampant; business type may further include transaction type, flow of funds, and so on. The first embodiment is the system embodiment corresponding to this method embodiment; the technical details of the first embodiment apply to this embodiment, and the technical details of this embodiment also apply to the first embodiment. It should be noted that those skilled in the art will understand the steps of the grouping method above with reference to the description of the grouping system. The functions of the modules of the grouping system can be realized by programs (executable instructions) running on a processor, or by specific logic circuits. If the grouping system of the embodiments of this specification is implemented as software function modules and sold or used as an independent product, it may also be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of this specification, or the part contributing to the prior art, may be embodied as a software product stored in a storage medium and containing instructions for causing a computer device (a personal computer, a server, a network device, etc.) to execute all or part of the methods described in the embodiments of this specification. The storage medium includes media that can store program code, such as a flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disc. Thus, the embodiments of this specification are not limited to any particular combination of hardware and software. Correspondingly, the embodiments of this specification further provide a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the method embodiments of this specification. Computer-readable storage media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology; the information may be computer-readable instructions, data structures, program modules, or other data.
Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random-access memory (SRAM), dynamic random-access memory (DRAM), other types of random-access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can store information accessible by a computing device. As defined herein, computer-readable storage media do not include transitory media such as modulated data signals and carrier waves. In addition, the embodiments of this specification further provide a training device for a customer grouping model, which includes a memory for storing computer-executable instructions and a processor; the processor implements the steps of the above method embodiments when executing the computer-executable instructions in the memory. The processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), or the like. The memory may be a read-only memory (ROM), a random-access memory (RAM), flash memory, a hard disk, a solid-state drive, or the like. The steps of the methods disclosed in the embodiments of the present invention may be executed directly by a hardware processor, or by a combination of the hardware and software modules in the processor. It should be noted that, in the application documents of this patent, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between those entities or operations. Moreover, the terms "comprise", "include", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device comprising a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element qualified by the phrase "comprising a" does not exclude the presence of additional identical elements in the process, method, article, or device that comprises it. In the application documents of this patent, stating that an action is performed according to an element means that the action is performed at least according to that element; this covers two cases: performing the action only according to that element, and performing it according to that element together with other elements. Expressions such as "a plurality of", "multiple times", and "multiple kinds" include two and more than two instances, occurrences, or kinds. All documents mentioned in this specification are considered to be incorporated in their entirety into the disclosure of this specification, so that they can serve as a basis for amendment where necessary.
In addition, it should be understood that the above descriptions are only preferred embodiments of the present specification and are not intended to limit its protection scope. Any modification, equivalent replacement, or improvement made within the spirit and principles of one or more embodiments of this specification shall fall within the protection scope of one or more embodiments of this specification. The foregoing describes specific embodiments of the present specification; other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. Additionally, the processes depicted in the figures do not necessarily require the particular order shown, or a sequential order, to achieve desirable results. Multitasking and parallel processing are also possible or may be advantageous in certain embodiments.

402: Step
404: Step
406: Step
408: Step
410: Step
412: Step

[Fig. 1] is a schematic structural diagram of the grouping system according to the first embodiment of the present specification;
[Fig. 2] is a schematic diagram of the detailed structure of the grouping system according to the first embodiment of the present specification;
[Fig. 3] is a schematic diagram of the encoding layer in the grouping system according to the first embodiment of the present specification;
[Fig. 4] is a schematic flowchart of the grouping method according to the second embodiment of the present specification.

Claims (14)

1. A grouping system, comprising:
a sample feature layer, configured to obtain the feature vector of a sample;
an encoding layer, implemented using a neural network model including an attention mechanism and configured to encode the feature vector of the sample;
a similarity calculation layer, configured to calculate the pairwise similarity between samples according to the encoding of the feature vectors of the samples output by the encoding layer; and
a grouping layer, configured to divide the samples into groups according to the pairwise similarity between the samples, wherein two samples whose similarity is greater than a predetermined threshold are classified into one group.

2. The grouping system of claim 1, wherein the neural network model including the attention mechanism is a Transformer.

3. The grouping system of claim 2, wherein the encoding layer includes a plurality of Transformers stacked in sequence.

4. The grouping system of any one of claims 1 to 3, wherein the samples are training samples; and the grouping system further includes a loss function calculation layer, configured to calculate a loss function based on the pairwise similarity between the training samples output by the similarity calculation layer and the known true similarity between the training samples, the loss function being used to update the parameters of the neural network model.

5. The grouping system of claim 4, wherein the sample feature layer is further configured to screen out training samples that meet the required conditions according to the cleaning rules corresponding to the application scenario.

6. The grouping system of claim 4, wherein the features are one of the following or any combination thereof: customer age, customer gender, whether the customer is married, the customer's inflow amount for a month, the customer's outflow amount for a month, customer address, customer occupation, customer location, and business type.

7. The grouping system of claim 4, wherein the sample feature layer is further configured to obtain the label of the true class corresponding to each of the training samples; and the loss function calculation layer is further configured to obtain the true similarity between each training sample and the other training samples according to the labels of the true classes corresponding to the training samples.
8. A grouping method, comprising:
obtaining the feature vector of a sample;
encoding the feature vector of the sample using a neural network model including an attention mechanism;
calculating the pairwise similarity between samples according to the encoded feature vectors of the samples; and
dividing the samples into groups according to the pairwise similarity between the samples, wherein two samples whose similarity is greater than a predetermined threshold are classified into one group.

9. The grouping method of claim 8, wherein the neural network model including the attention mechanism is a Transformer.

10. The grouping method of claim 9, wherein the neural network model including the attention mechanism includes a plurality of Transformers stacked in sequence.

11. The grouping method of any one of claims 8 to 10, wherein the samples are training samples; and the grouping method further includes: calculating a loss function based on the pairwise similarity between the training samples output by the similarity calculation layer and the known true similarity between the training samples; and updating the parameters of the neural network model using the loss function.

12. The grouping method of claim 11, wherein before obtaining the feature vector of the sample, the method further includes: screening out training samples that meet the required conditions according to the cleaning rules corresponding to the application scenario.

13. The grouping method of claim 11, wherein the features are one of the following or any combination thereof: customer age, customer gender, whether the customer is married, the customer's inflow amount for a month, the customer's outflow amount for a month, customer address, customer occupation, customer location, and business type.

14. The grouping method of claim 11, wherein obtaining the feature vector of the sample further includes: obtaining the label of the true class corresponding to each of the training samples; and calculating the loss function further includes: obtaining the true similarity between each training sample and the other training samples according to the labels of the true classes corresponding to the training samples.
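As an illustration of the pipeline recited in claims 1 and 8 (encode feature vectors with an attention-based model, compute pairwise similarities, and group samples whose similarity exceeds a predetermined threshold), the following is a minimal PyTorch sketch. Treating each scalar feature as a token, pooling the token outputs into a sample embedding, and using a rescaled cosine similarity are illustrative assumptions, as are all class, function, and variable names.

```python
# A minimal sketch of the claimed grouping pipeline, not the patented
# implementation. The tokenization of features and all names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionEncoder(nn.Module):
    """Encoding layer: stacked Transformer encoder layers over feature tokens."""
    def __init__(self, num_features: int, d_model: int = 32, num_layers: int = 2):
        super().__init__()
        # One learned token per scalar feature, scaled by the feature value.
        self.feature_embed = nn.Parameter(torch.randn(num_features, d_model))
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_features) -> tokens: (batch, num_features, d_model)
        tokens = x.unsqueeze(-1) * self.feature_embed
        encoded = self.encoder(tokens)
        return encoded.mean(dim=1)          # pooled per-sample embedding

def group_by_threshold(embeddings: torch.Tensor, threshold: float = 0.9) -> list:
    """Grouping layer: samples with similarity above the threshold share a group."""
    normed = F.normalize(embeddings, dim=-1)
    sim = (normed @ normed.T + 1.0) / 2.0   # pairwise similarity in [0, 1]
    n = sim.shape[0]
    parent = list(range(n))                 # union-find over samples

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if sim[i, j] > threshold:
                parent[find(i)] = find(j)
    return [find(i) for i in range(n)]      # group id per sample

# Hypothetical usage on six samples with nine numeric features each:
encoder = AttentionEncoder(num_features=9)
features = torch.randn(6, 9)
groups = group_by_threshold(encoder(features))
```

Union-find is used here so that the rule "two samples whose similarity is greater than a predetermined threshold are classified into one group" propagates transitively across overlapping pairs; other ways of realizing the grouping step are equally compatible with the claims.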
TW110110606A 2020-04-01 2021-03-24 Grouping system and method thereof TWI766626B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010252223.2 2020-04-01
CN202010252223.2A CN111461225B (en) 2020-04-01 2020-04-01 Customer clustering system and method thereof

Publications (2)

Publication Number Publication Date
TW202143117A TW202143117A (en) 2021-11-16
TWI766626B true TWI766626B (en) 2022-06-01

Family

ID=71681571

Family Applications (1)

Application Number Title Priority Date Filing Date
TW110110606A TWI766626B (en) 2020-04-01 2021-03-24 Grouping system and method thereof

Country Status (3)

Country Link
CN (1) CN111461225B (en)
TW (1) TWI766626B (en)
WO (1) WO2021197032A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111461225B (en) * 2020-04-01 2022-04-01 支付宝(杭州)信息技术有限公司 Customer clustering system and method thereof
CN113645107B (en) * 2021-07-27 2022-12-02 广州市威士丹利智能科技有限公司 Gateway conflict resolution method and system based on smart home
CN114112984B (en) * 2021-10-25 2022-09-20 上海布眼人工智能科技有限公司 Fabric fiber component qualitative method based on self-attention
CN116402615B (en) * 2023-06-08 2023-08-29 北京芯盾时代科技有限公司 Account type identification method and device, electronic equipment and storage medium
CN117912712B (en) * 2024-03-20 2024-05-28 徕兄健康科技(威海)有限责任公司 Thyroid disease data intelligent management method and system based on big data

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106271881A (en) * 2016-08-04 2017-01-04 华中科技大学 A kind of Condition Monitoring of Tool Breakage method based on SAEs and K means
CN109978013A (en) * 2019-03-06 2019-07-05 华南理工大学 A kind of depth clustering method for figure action identification
TW202011258A (en) * 2018-08-31 2020-03-16 香港商阿里巴巴集團服務有限公司 System and method for optimizing damage detection results
US20200097771A1 (en) * 2018-09-25 2020-03-26 Nec Laboratories America, Inc. Deep group disentangled embedding and network weight generation for visual inspection

Family Cites Families (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2010086273A (en) * 2008-09-30 2010-04-15 Kddi Corp Apparatus, method, and program for searching for music
CN104933438A (en) * 2015-06-01 2015-09-23 武艳娇 Image clustering method based on self-coding neural network
CN106326913A (en) * 2016-08-09 2017-01-11 中国银联股份有限公司 Money laundering account determination method and device
KR20180077691A (en) * 2016-12-29 2018-07-09 주식회사 엔씨소프트 Apparatus and method for sentence abstraction
CN108228687A (en) * 2017-06-20 2018-06-29 上海吉贝克信息技术有限公司 Big data knowledge excavation and accurate tracking and system
CN109165696A (en) * 2018-09-29 2019-01-08 联想(北京)有限公司 A kind of clustering method and electronic equipment
CN110162799B (en) * 2018-11-28 2023-08-04 腾讯科技(深圳)有限公司 Model training method, machine translation method, and related devices and equipment
CN109753608B (en) * 2019-01-11 2023-08-04 腾讯科技(深圳)有限公司 Method for determining user label, training method and device for self-coding network
CN110032606B (en) * 2019-03-29 2021-05-14 创新先进技术有限公司 Sample clustering method and device
CN110263227B (en) * 2019-05-15 2023-07-18 创新先进技术有限公司 Group partner discovery method and system based on graph neural network
CN110349676B (en) * 2019-06-14 2021-10-29 华南师范大学 Time-series physiological data classification method and device, storage medium and processor
CN110415022B (en) * 2019-07-05 2023-08-18 创新先进技术有限公司 Method and device for processing user behavior sequence
CN110472050A (en) * 2019-07-24 2019-11-19 阿里巴巴集团控股有限公司 A kind of clique's clustering method and device
CN110502704B (en) * 2019-08-12 2022-04-15 山东师范大学 Group recommendation method and system based on attention mechanism
CN110852881B (en) * 2019-10-14 2021-04-27 支付宝(杭州)信息技术有限公司 Risk account identification method and device, electronic equipment and medium
CN111461225B (en) * 2020-04-01 2022-04-01 支付宝(杭州)信息技术有限公司 Customer clustering system and method thereof

Also Published As

Publication number Publication date
CN111461225A (en) 2020-07-28
TW202143117A (en) 2021-11-16
CN111461225B (en) 2022-04-01
WO2021197032A1 (en) 2021-10-07

Similar Documents

Publication Publication Date Title
TWI766626B (en) Grouping system and method thereof
Du Jardin Forecasting bankruptcy using biclustering and neural network-based ensembles
Lahmiri A two‐step system for direct bank telemarketing outcome classification
Wei [Retracted] A Method of Enterprise Financial Risk Analysis and Early Warning Based on Decision Tree Model
US20140006106A1 (en) Adaptive in-memory customer and customer account classification
Sina Mirabdolbaghi et al. Model optimization analysis of customer churn prediction using machine learning algorithms with focus on feature reductions
Sun et al. Multi-class imbalanced enterprise credit evaluation based on asymmetric bagging combined with light gradient boosting machine
Xie et al. A novel ensemble learning approach for corporate financial distress forecasting in fashion and textiles supply chains
Lin et al. Dimensionality and data reduction in telecom churn prediction
Cheng et al. A Seasonal Time‐Series Model Based on Gene Expression Programming for Predicting Financial Distress
Liu et al. Analysis of internet financial risk control model based on machine learning algorithms
Song et al. Enhancing enterprise credit risk assessment with cascaded multi-level graph representation learning
Davalos et al. Designing an if–then rules‐based ensemble of heterogeneous bankruptcy classifiers: a genetic algorithm approach
Khademolqorani et al. A hybrid analysis approach to improve financial distress forecasting: Empirical evidence from Iran
Zhang et al. Multimodel integrated enterprise credit evaluation method based on attention mechanism
Mallidi et al. Analysis of Credit Card Fraud detection using Machine Learning models on balanced and imbalanced datasets
Mukherjee et al. Detection of defaulters in P2P lending platforms using unsupervised learning
Chen et al. Default prediction of automobile credit based on support vector machine
Saladi et al. An Enhanced Bankruptcy Prediction Model Using Fuzzy Clustering Model and Random Forest Algorithm.
Alamri et al. A Machine Learning-Based Framework for Detecting Credit Card Anomalies and Fraud
Alls et al. Data mining for database marketing at Garanti Bank
Yu et al. [Retracted] Complexity Analysis of Consumer Finance following Computer LightGBM Algorithm under Industrial Economy
Almaqtari et al. Earning management estimation and prediction using machine learning: A systematic review of processing methods and synthesis for future research
Wu et al. Applying a Probabilistic Network Method to Solve Business‐Related Few‐Shot Classification Problems
Li et al. A Loan risk assessment model with consumption features for online finance