TWI662421B

TWI662421B - Community division method and device based on feature matching network

Info

Publication number: TWI662421B
Application number: TW106142677A
Authority: TW
Inventors: 李旭瑞; 邱雪濤; 趙金濤; 鐘毅; 胡奕
Original assignee: 大陸商中國銀聯股份有限公司
Priority date: 2016-12-06
Filing date: 2017-12-06
Publication date: 2019-06-11
Also published as: TW201822022A; CN106709800B; WO2018103456A1; CN106709800A

Abstract

The invention belongs to the field of data processing and relates in particular to a community division method and device based on a feature matching network. In an embodiment of the invention, a K-bit hash vector is determined for each piece of account information according to K preset hash functions; the hash vector of each piece of account information is divided, in order, into m = K/k classes of sub-hash vectors; for each class, account information with the same sub-hash vector is placed in the same group; the similarity between the pieces of account information within each group is calculated; if the similarity between two pieces of account information is greater than a threshold, an interconnecting edge is established between them, forming a feature matching network; and community division is performed on the account information according to the feature matching network, so that the resulting communities can be analyzed to find abnormal communities.

Description

Community division method and device based on feature matching network

The invention belongs to the field of data processing and relates in particular to a community division method and device based on a feature matching network.

At present, the domestic credit card market faces an increasingly severe risk situation, and cases of credit card cash-out, counterfeit card fraud, and stolen card fraud are increasing. Specifically, credit card cash-out refers to a cardholder obtaining cash through fake purchase transactions or by swiping the card in collusion with a merchant, through a subsequent refund, or by buying easily liquidated goods and reselling them for cash. Counterfeit card fraud refers to forging a genuine, valid bank card by writing the magnetic stripe in the bank card's magnetic stripe information format and embossing or flat-printing the card, and then using it for transactions. Stolen card fraud refers to a fraudster obtaining some or all of a genuine cardholder's information and impersonating the cardholder to change the account information for fraudulent purposes. Credit card crime is becoming more high-tech, organized, and professional; cases are carried out more covertly, and the methods are constantly changing. This threatens the security of the funds of banks and cardholders and has become an important factor restricting the long-term healthy development of the credit card industry.

In the prior art, clustering methods are usually used to counter these various fraud methods. However, this approach has several drawbacks. On the one hand, if data is later added to the anti-fraud model, updating the model is difficult. On the other hand, although clustering divides the nodes into several classes, the structure within each group and the relationships between structures remain difficult to describe.

In summary, the prior art has the following problems: updating the anti-fraud model is difficult when data is subsequently added to it, and after clustering, the structure within each group and the relationships between structures remain difficult to describe. Effective measures are therefore needed to solve these problems.

Embodiments of the present invention provide a community division method and device based on a feature matching network, to solve the problems in the prior art that updating the anti-fraud model is difficult when data is subsequently added to it, and that after clustering the structure within each group and the relationships between structures remain difficult to describe.

An embodiment of the present invention provides a community division method based on a feature matching network, including: determining a K-bit hash vector for each piece of account information according to K preset hash functions; dividing the hash vector of each piece of account information, in order, into m = K/k classes of sub-hash vectors; for each class, placing account information with the same sub-hash vector in the same group; calculating the similarity between the pieces of account information in the same group; if the similarity between two pieces of account information is greater than a threshold, establishing an interconnecting edge between them, thereby forming a feature matching network; and performing community division on the account information according to the feature matching network.

Optionally, calculating the similarity between the pieces of account information in the same group includes: if the i-th account information and the j-th account information are in the same group in n of the classes, taking n/m as the similarity between the i-th account information and the j-th account information, where the i-th and j-th account information are any of the pieces of account information.

Optionally, calculating the similarity between the pieces of account information in the same group includes: if the i-th account information and the j-th account information are in the same group, counting the number h of bit positions at which the hash vector of the i-th account information and the hash vector of the j-th account information have the same value, where the i-th and j-th account information are any of the pieces of account information; the similarity between the i-th account information and the j-th account information is s = h/K.

Optionally, determining the K-bit hash vector for each piece of account information according to the K preset hash functions includes: determining the K-bit hash vector H(·) of each piece of account information according to formula (1) [the formula and several of its symbols are not reproduced in this text]. In formula (1), the prefix 2'b indicates that H(·) is a binary number; h_i(·) is one of the K preset hash functions; the argument of H(·) and h_i(·) is the feature vector of the account information, whose components c_1, c_2, …, c_d are the characteristic attributes of the account information; and the remaining symbol denotes a randomly selected non-zero vector.
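As an illustration only, the following Python sketch shows one way K single-bit hash functions could produce a K-bit hash vector from a feature vector (c_1, …, c_d). Because formula (1) is not legible in this text, the sketch assumes a sign-of-random-projection scheme: each hash function emits 1 when the dot product of the feature vector with its own randomly selected non-zero vector is non-negative, and 0 otherwise. That scheme and the names `make_random_vectors` and `hash_vector` are assumptions, not the definition given in the patent.

```python
import random

def make_random_vectors(K, d, seed=0):
    """Draw K random vectors of dimension d, one per hash function (assumed scheme)."""
    rng = random.Random(seed)
    # Gaussian components are all zero with probability 0, so each vector is non-zero.
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(K)]

def hash_vector(features, random_vectors):
    """Return the K-bit hash vector of one account as a string of '0'/'1'."""
    bits = []
    for r in random_vectors:
        dot = sum(c * x for c, x in zip(features, r))
        bits.append("1" if dot >= 0 else "0")  # each hash function contributes one bit
    return "".join(bits)

vectors = make_random_vectors(K=4, d=3)
print(hash_vector([0.2, 1.5, -0.7], vectors))  # prints a 4-character bit string
```

Under this assumed scheme, feature vectors pointing in similar directions tend to agree on most bits, which is what makes the later grouping by sub-hash vector meaningful.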

Optionally, performing community division on the account information according to the feature matching network includes: (1) placing each piece of account information in a different community of the feature matching network; (2) calculating the similarity strength of each piece of account information according to the similarities between the pieces of account information, thereby generating a node similarity strength matrix; (3) for each piece of account information, taking the row of the node similarity strength matrix in which that account information is located and attempting, in descending order of similarity strength, to move the account information into other communities; if the modularity difference after moving the account information from the p-th community to the q-th community is positive, moving the account information into the q-th community and ending the attempt; (4) repeating the above until the community structure no longer changes.
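Steps (1) to (4) can be sketched in Python as follows. This is a minimal, unoptimized illustration, not the patent's implementation: it recomputes the standard weighted (Newman) modularity from scratch for every trial move, and it takes the similarity strength values as an input because the formula defining them is not reproduced in this text. The names `modularity` and `divide_communities` and the data layout are hypothetical.

```python
from collections import defaultdict

def modularity(edges, community):
    """Weighted modularity; edges is {(u, v): weight}, each undirected edge listed once."""
    total = sum(edges.values())
    if total == 0:
        return 0.0
    inside = defaultdict(float)  # edge weight inside each community
    degree = defaultdict(float)  # summed weighted degree of each community
    for (u, v), w in edges.items():
        degree[community[u]] += w
        degree[community[v]] += w
        if community[u] == community[v]:
            inside[community[u]] += w
    return sum(inside[c] / total - (degree[c] / (2 * total)) ** 2 for c in degree)

def divide_communities(edges, strength):
    """strength[n] maps each neighbor of n to its similarity strength (one matrix row)."""
    nodes = sorted({n for edge in edges for n in edge})
    community = {n: n for n in nodes}  # step (1): every node starts in its own community
    changed = True
    while changed:  # step (4): repeat until no node moves
        changed = False
        for n in nodes:
            base = modularity(edges, community)
            # step (3): try the communities of other nodes, strongest first
            for other in sorted(strength[n], key=strength[n].get, reverse=True):
                target = community[other]
                if target == community[n]:
                    continue
                previous = community[n]
                community[n] = target
                if modularity(edges, community) - base > 1e-12:
                    changed = True  # positive modularity difference: keep the move
                    break
                community[n] = previous  # otherwise undo the trial move
    return community

# Two triangles joined by one weak edge; edge weights stand in for similarity strength.
edges = {("a", "b"): 1.0, ("a", "c"): 1.0, ("b", "c"): 1.0,
         ("d", "e"): 1.0, ("d", "f"): 1.0, ("e", "f"): 1.0, ("c", "d"): 0.1}
strength = defaultdict(dict)
for (u, v), w in edges.items():
    strength[u][v] = w
    strength[v][u] = w
result = divide_communities(edges, strength)
```

Every accepted move strictly increases modularity, so the loop terminates. A production version would compute the modularity difference incrementally instead of recomputing it.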

Optionally, calculating the similarity strength of each piece of account information according to the similarities between the accounts includes: calculating the similarity strength s_{i,j} between the i-th account information and the j-th account information according to formula (2) [the formula is not reproduced in this text], where Γ(i) denotes the neighbor set of the i-th account information, Γ(i)∩Γ(j) denotes the set of common neighbors of the i-th account information and the j-th account information, and w_{ai,z} is the sum of the weights of the edges between any account information ai and the z-th account information.

An embodiment of the present invention further provides a community division device based on a feature matching network, including: a determining unit, configured to determine a K-bit hash vector for each piece of account information according to K preset hash functions; a first dividing unit, configured to divide the hash vector of each piece of account information, in order, into m = K/k classes of sub-hash vectors; a second dividing unit, configured to place, for each class, account information with the same sub-hash vector in the same group; a calculation unit, configured to calculate the similarity between the pieces of account information in the same group; a network forming unit, configured to establish an interconnecting edge between two pieces of account information if the similarity between them is greater than a threshold, thereby forming a feature matching network; and a third dividing unit, configured to perform community division on the account information according to the feature matching network.

Optionally, the calculation unit is specifically configured to: if the i-th account information and the j-th account information are in the same group in n of the classes, take n/m as the similarity between the i-th account information and the j-th account information, where the i-th and j-th account information are any of the pieces of account information.

Optionally, the calculation unit is further specifically configured to: if the i-th account information and the j-th account information are in the same group, count the number h of bit positions at which the hash vector of the i-th account information and the hash vector of the j-th account information have the same value, where the i-th and j-th account information are any of the pieces of account information; the similarity between the i-th account information and the j-th account information is s = h/K.

Optionally, the determining unit is configured to determine the K-bit hash vector H(·) of each piece of account information according to formula (3) [the formula and several of its symbols are not reproduced in this text]. In formula (3), the prefix 2'b indicates that H(·) is a binary number; h_i(·) is one of the K preset hash functions; the argument of H(·) and h_i(·) is the feature vector of the account information, whose components c_1, c_2, …, c_d are the characteristic attributes of the account information; and the remaining symbol denotes a randomly selected non-zero vector.

Optionally, the third dividing unit is specifically configured to: (1) place each piece of account information in a different community of the feature matching network; (2) calculate the similarity strength of each piece of account information according to the similarities between the pieces of account information, thereby generating a node similarity strength matrix; (3) for each piece of account information, take the row of the node similarity strength matrix in which that account information is located and attempt, in descending order of similarity strength, to move the account information into other communities; if the modularity difference after moving the account information from the p-th community to the q-th community is positive, move the account information into the q-th community and end the attempt; (4) repeat the above until the community structure no longer changes.

Optionally, the calculation unit is further specifically configured to calculate the similarity strength s_{i,j} between the i-th account information and the j-th account information according to formula (4) [the formula is not reproduced in this text], where Γ(i) denotes the neighbor set of the i-th account information, Γ(i)∩Γ(j) denotes the set of common neighbors of the i-th account information and the j-th account information, and w_{ai,z} is the sum of the weights of the edges between any account information ai and the z-th account information.

The embodiments of the present invention provide a community division method and device based on a feature matching network: a K-bit hash vector is determined for each piece of account information according to K preset hash functions; the hash vector of each piece of account information is divided, in order, into m = K/k classes of sub-hash vectors; for each class, account information with the same sub-hash vector is placed in the same group; the similarity between the pieces of account information in the same group is calculated; if the similarity between two pieces of account information is greater than a threshold, an interconnecting edge is established between them, forming a feature matching network; and community division is performed on the account information according to the feature matching network. In the embodiments, a K-bit hash vector is first determined for each piece of account information according to the K preset hash functions. For the very large amount of account information in a network, a single hash function, which yields only two possible hash values, is insufficient; a K-bit hash vector per piece of account information can cope with complex network account information. Then, for each class, account information with the same sub-hash vector is placed in one group and similarity is calculated only between account information in the same group. This avoids the very large amount of computation required to calculate the similarity between every pair of accounts in the whole network, so the technical solution effectively reduces the cost of the similarity calculation. Finally, where the similarity between two pieces of account information is greater than the threshold, an interconnecting edge is established between them, forming a feature matching network, and community division is performed on the account information according to that network. The account information can thus be divided into communities more accurately. This not only makes the relationships between communities clear, but also allows the resulting communities to be analyzed to find abnormal communities, whose accounts can then be checked for anomalies, so that fraudulent accounts are found in a more targeted way and handled more efficiently. In addition, if account information needs to be added to the divided communities, it is only necessary to repeat the above simple steps for the added account information and update it to the corresponding position, so no update difficulty arises.

S101-S106, S201-S206‧‧‧Steps

301‧‧‧Determining unit

302‧‧‧First dividing unit

303‧‧‧Second dividing unit

304‧‧‧Calculation unit

305‧‧‧Network forming unit

306‧‧‧Third dividing unit

To explain the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below.

FIG. 1 is a schematic flowchart of a community division method based on a feature matching network according to an embodiment of the present invention; FIG. 2 is a flowchart of the overall idea of the present invention according to an embodiment of the present invention; FIG. 3 is a schematic structural diagram of a community division device based on a feature matching network according to an embodiment of the present invention.

To make the objectives, technical solutions, and beneficial effects of the present invention clearer, the present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are intended only to explain the present invention, not to limit it.

It should be understood that the technical solutions of the embodiments of the present invention can be applied to the various online fraud scenarios that banks encounter, such as credit card product fraud, bank card product fraud, stolen card fraud, counterfeit card fraud, and cash-out fraud. The application scenarios may also include discovering communities of abnormal account information, discovering the common characteristics of a specific type of fraud, discovering other fraudulent account information from samples of fraudulent account information, and helping to discover unknown types of fraud.

FIG. 1 shows, by way of example, a schematic flowchart of a community division method based on a feature matching network according to an embodiment of the present invention. As shown in FIG. 1, the method includes the following steps. Step S101: determine a K-bit hash vector for each piece of account information according to K preset hash functions. Step S102: divide the hash vector of each piece of account information, in order, into m = K/k classes of sub-hash vectors. Step S103: for each class, place account information with the same sub-hash vector in the same group. Step S104: calculate the similarity between the pieces of account information in the same group. Step S105: if the similarity between two pieces of account information is greater than a threshold, establish an interconnecting edge between them, forming a feature matching network. Step S106: perform community division on the account information according to the feature matching network.

In step S101, a K-bit hash vector is determined for each piece of account information according to the K preset hash functions. Specifically, each preset hash function yields one bit of the hash vector, so the K preset hash functions together produce a K-bit hash vector, and each piece of account information corresponds to one K-bit hash vector. In a specific implementation, each piece of account information contains multiple characteristic attributes. If, as in the prior art, one piece of account information were represented by only a single hash function, the multiple characteristic attributes could not be adequately expressed; this step avoids that drawback. The value of K can be set according to the account information in the specific implementation. For example, if K is set to 4, each piece of account information is represented as a 4-bit hash vector.

Step S102: the hash vector of each piece of account information is divided, in order, into m = K/k classes of sub-hash vectors. This can also be understood as dividing the K hash functions into m classes. Each class of sub-hash vector contains several hash functions, and every class receives the same number of hash functions, namely k; that is, k = K/m. For example, with K = 4 and k = 2, the 4-bit hash vector of each piece of account information is divided into 2 classes of sub-hash vectors. The benefit of this division is that it reduces the computation needed later to calculate the similarity between accounts. It avoids the drawback of the prior art, which does not divide the hash vectors and instead calculates the similarity directly for every pair of accounts, at a very high computational cost.
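As a small illustration of the division in step S102, the following Python sketch splits a K-bit hash vector, in order, into m = K/k sub-hash vectors of k bits each. Representing the hash vector as a string of '0'/'1' characters and the name `split_hash_vector` are assumptions made for the example.

```python
def split_hash_vector(bits, k):
    """Split a K-bit hash vector into m = K/k consecutive k-bit sub-hash vectors."""
    if k <= 0 or len(bits) % k != 0:
        raise ValueError("k must be positive and divide the hash vector length K")
    return [bits[start:start + k] for start in range(0, len(bits), k)]

# K = 4, k = 2 gives m = 2 classes: the first two bits and the last two bits.
print(split_hash_vector("0010", 2))  # ['00', '10']
```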

Step S103: for each class, account information with the same sub-hash vector is placed in the same group. Specifically, after the hash vector of each piece of account information has been divided into classes, account information whose sub-hash vector is identical within a given class is placed in the same group. For example, with K = 4 and k = 2, in the first class all account information whose 4-bit hash vectors share the same first two bits forms one group; likewise, in the second class all account information whose 4-bit hash vectors share the same last two bits forms one group. The purpose of this grouping is again to reduce the later similarity computation: similarity is calculated only between account information that has the same sub-hash vector in some class.
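The grouping in step S103 can be sketched in Python as a dictionary keyed by (class index, sub-hash vector). The account identifiers, the string representation of the hash vectors, and the name `group_by_sub_hash` are assumptions made for the example.

```python
from collections import defaultdict

def group_by_sub_hash(hashes, k):
    """Map (class index, sub-hash vector) to the list of accounts sharing it."""
    groups = defaultdict(list)
    for account, bits in hashes.items():
        for class_index in range(len(bits) // k):
            sub = bits[class_index * k:(class_index + 1) * k]
            groups[(class_index, sub)].append(account)
    return dict(groups)

hashes = {"acct1": "0010", "acct2": "0011", "acct3": "1110"}
groups = group_by_sub_hash(hashes, k=2)
print(groups[(0, "00")])  # ['acct1', 'acct2']: same first two bits
print(groups[(1, "10")])  # ['acct1', 'acct3']: same last two bits
```

Only accounts that share at least one key become candidates for the similarity calculation in the next step.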

Step S104: the similarity between the pieces of account information in the same group is calculated. In a specific implementation, the similarity can be taken as the ratio of the number of bit positions at which the hash vectors of two pieces of account information in the same group agree to the total number of bits. For example, suppose the hash vector of account information 1 is 0010 and the hash vector of account information 2 is 0011. With K = 4 and k = 2, the two pieces of account information are in the same group in the first class, so their similarity is calculated: their hash vectors agree in 3 bit positions out of 4, so the similarity is 3/4. Alternatively, the similarity between any two pieces of account information in the same group can be calculated with a similarity formula such as the Euclidean distance, cosine distance, or Jaccard distance. On the one hand, compared with calculating the similarity of every pair across all account information, calculating only the similarity between pairs in the same group greatly reduces the amount of computation. For example, take N account information samples. Within one class, the N samples are distributed over 2^k groups, with N/2^k samples per group if the groups are of equal size. The number of pairwise similarity calculations within each group is then on the order of (N/2^k)^2; over the 2^k groups of one class it is on the order of N^2/2^k; and over all classes it is on the order of m·N^2/2^k, where m is the number of classes, a constant that can be controlled according to the actual situation. The conventional method, which calculates the similarity of every pair among all accounts, requires on the order of N^2 calculations. The computation required by the present invention is therefore smaller than that of the conventional method by a factor on the order of 2^k. On the other hand, account information within the same group tends to be highly similar, so calculating similarity within groups also improves the efficiency and accuracy of network construction.
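To make the size of the saving concrete, the following Python sketch counts pairwise comparisons under the simplifying assumption that the N samples fall evenly into the 2^k groups of each class. The function names and the example values of N, k, and m are hypothetical.

```python
def grouped_comparisons(N, k, m):
    """Pairwise comparisons when only accounts in the same group are compared."""
    per_group = N // 2 ** k  # assumes an even split over the 2^k groups
    pairs_per_group = per_group * (per_group - 1) // 2
    return m * (2 ** k) * pairs_per_group  # 2^k groups in each of the m classes

def all_pairs_comparisons(N):
    """Pairwise comparisons when every pair of accounts is compared."""
    return N * (N - 1) // 2

N, k, m = 1024, 4, 2
print(grouped_comparisons(N, k, m))  # 64512
print(all_pairs_comparisons(N))      # 523776
```

With these values the grouped approach needs roughly one eighth of the comparisons, consistent with a reduction on the order of 2^k divided by the constant m. Pairs that share a group in more than one class are counted once per class here, so this is an upper bound for the grouped count.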

Step S105: if the similarity between two pieces of account information is greater than a threshold, an interconnecting edge is established between them, forming a feature matching network. Specifically, whenever the similarity between any two pieces of account information is greater than the threshold, an interconnecting edge is established between those two pieces of account information, with the edge weight equal to the similarity between them; the result is the feature matching network. In a specific implementation, a relatively high threshold can be chosen so that the resulting feature matching network is relatively sparse, which simplifies the subsequent computation. The threshold can also be adjusted according to the actual situation.
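A minimal Python sketch of step S105 follows. It assumes the groups are held as lists of account identifiers and that similarity is the fraction of agreeing bits between two hash vectors; the names `bit_similarity` and `build_network` and the sample data are hypothetical.

```python
from itertools import combinations

def bit_similarity(bits_i, bits_j):
    """Fraction of bit positions at which two equal-length hash vectors agree."""
    return sum(a == b for a, b in zip(bits_i, bits_j)) / len(bits_i)

def build_network(groups, hashes, threshold):
    """Return {(account_a, account_b): weight} for pairs whose similarity exceeds threshold."""
    edges = {}
    for members in groups.values():
        for a, b in combinations(sorted(members), 2):
            if (a, b) in edges:
                continue  # pair already linked through another group
            s = bit_similarity(hashes[a], hashes[b])
            if s > threshold:
                edges[(a, b)] = s  # edge weight is the similarity itself
    return edges

hashes = {"acct1": "0010", "acct2": "0011", "acct3": "1110"}
groups = {(0, "00"): ["acct1", "acct2"], (1, "10"): ["acct1", "acct3"]}
print(build_network(groups, hashes, threshold=0.6))  # {('acct1', 'acct2'): 0.75}
```

Here acct1 and acct3 share a group but agree on only 2 of 4 bits, so no edge is created between them at a threshold of 0.6.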

Step S106: community division is performed on the account information according to the feature matching network. Specifically, based on the calculated similarity values between the pieces of account information, account information with closer similarity values is more likely to be placed in the same community. Once the communities have been divided, fraudulent accounts in the network are easier to investigate. The proportion of known fraudulent account samples in each community can be calculated; the larger the proportion, the more likely the community is an abnormal community, and relevant investigations can be carried out according to business needs. The accounts in an abnormal community can then be scored on certain indicators to find representative accounts, and those representative accounts can be checked against related cases. The indicators may include the degree centrality, closeness centrality, and eigenvector centrality of the account information within the community. Alternatively, the features of the account information in the community can be re-analyzed to discover behavioral characteristics common to the community, so that fraud can be prevented in a targeted way. In addition, if newly added account information forms a new community, that community can be compared with the abnormal communities identified earlier, which is of great benefit for detecting and preventing unknown fraud.
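The per-community proportion of known fraudulent samples mentioned above can be computed as in the following Python sketch. The data layout (a mapping from account to community label and a set of known fraudulent accounts) and the name `fraud_ratio_by_community` are assumptions made for the example.

```python
from collections import Counter

def fraud_ratio_by_community(community, known_fraud):
    """Return {community label: share of its accounts that are known fraud samples}."""
    sizes = Counter(community.values())
    fraud_counts = Counter(label for account, label in community.items()
                           if account in known_fraud)
    return {label: fraud_counts[label] / size for label, size in sizes.items()}

community = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 2}
known_fraud = {"a", "b", "c"}
ratios = fraud_ratio_by_community(community, known_fraud)
print(ratios)  # {0: 1.0, 1: 0.5, 2: 0.0}
```

Communities with the highest ratios would be the first candidates for the further investigation described in the text; any cutoff for "abnormal" is a business decision not specified here.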

The similarity between the pieces of account information in the same group can be calculated in either of the following two ways.

Method 1: Optionally, calculating the similarity between the pieces of account information in the same group includes: if the i-th account information and the j-th account information are in the same group in n of the classes, n/m is taken as the similarity between the i-th account information and the j-th account information, where the i-th account information and the j-th account information are any of the pieces of account information. Specifically, take any two pieces of account information, called account information 1 and account information 2, and let m be 3, so that account information 1 and account information 2 are both partitioned under three classes, called class 1, class 2, and class 3. If the two pieces of account information are in the same group in class 1 and in class 3, their similarity over the three classes is 2/3.

Method 2: Optionally, calculating the similarity between the pieces of account information in the same group includes: if the i-th account information and the j-th account information are in the same group, counting the number h of bit positions at which the hash vector of the i-th account information and the hash vector of the j-th account information hold the same value, where the i-th account information and the j-th account information are any of the pieces of account information; the similarity between the i-th account information and the j-th account information is then s = h/K. Specifically, suppose any two pieces of account information, account information 1 and account information 2, are in the same group and both have 4-bit hash vectors, that is, K is 4. If the first three bits of the two 4-bit hash vectors are identical and the fourth bit differs, the similarity s between account information 1 and account information 2 is 3/4.
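The two similarity measures can be sketched in a few lines. The sketch assumes, for illustration only, that each hash vector is held as a string of '0' and '1' characters; the function names are hypothetical.

```python
def band_similarity(h1, h2, k):
    """Method 1: n/m, the fraction of k-bit classes in which both vectors match."""
    assert len(h1) == len(h2) and len(h1) % k == 0
    m = len(h1) // k
    n = sum(h1[b * k:(b + 1) * k] == h2[b * k:(b + 1) * k] for b in range(m))
    return n / m

def bit_similarity(h1, h2):
    """Method 2: h/K, the fraction of bit positions holding the same value."""
    assert len(h1) == len(h2)
    h = sum(a == b for a, b in zip(h1, h2))
    return h / len(h1)

# K = 6, k = 2, m = 3: the vectors match in class 1 and class 3 only.
s1 = band_similarity("010110", "011110", 2)
# K = 4: the first three bits match and the fourth differs.
s2 = bit_similarity("0101", "0100")
```

In the method described above, Method 2 is only evaluated for pairs that already share a group in some class; the sketch leaves that precondition to the caller.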

Comparing the two methods above for calculating the similarity between pieces of account information in the same group: the first method calculates the similarity of two pieces of account information across the classes, whereas the second method calculates the similarity between two pieces of account information that have been placed in the same group of a class. The first method is therefore the coarser of the two, since it measures similarity at the level of the classes to which the two pieces of account information belong, while the second method, which compares two pieces of account information within the same group, is more precise. Both methods, however, require markedly less computation than the prior-art approach of using, for example, the Euclidean distance formula to calculate the similarity between every pair of account information in the network, which further speeds up construction of the network.

Optionally, determining the K-bit hash vector corresponding to each piece of account information according to the K preset hash functions includes: determining the K-bit hash vector $H(\vec{v})$ corresponding to each piece of account information according to formula (1):

$$H(\vec{v}) = 2'b\,h_1(\vec{v})\,h_2(\vec{v})\cdots h_K(\vec{v}) \tag{1}$$

where 2'b indicates that $H(\vec{v})$ is a binary number, $h_i(\vec{v})$ is one of the K preset hash functions,

$\vec{v}$ denotes the feature vector of the account information, where $\vec{v} = (c_1, c_2, \ldots, c_d)$ and $c_1, c_2, \ldots, c_d$ denote the feature attributes of the account information, and $\vec{r}$ denotes a randomly selected non-zero vector. Specifically, the preset hash function is $h_i(\vec{v})$, which is any one of the K preset hash functions, and its value is either 0 or 1. A single hash function of this kind can therefore produce only two hash values, which is clearly insufficient for a very large volume of account information, so these hash functions are used together to determine a K-bit hash vector $H(\vec{v})$ for each account. $H(\vec{v})$ is a K-bit binary number; for example, it may be the 6-bit binary number 010110, in which case $h_1(\vec{v}) = 0$, $h_2(\vec{v}) = 1$, $h_3(\vec{v}) = 0$, $h_4(\vec{v}) = 1$, $h_5(\vec{v}) = 1$, and $h_6(\vec{v}) = 0$. The specific feature attributes of the account information may be the transaction amount, transaction time, transaction location, number of transaction locations, transfer location, transfer amount, number of transfers, and so on.

In a specific implementation, the feature vectors of the account information can be screened to obtain a set of features that is theoretically the most effective. Specifically, fraudulent account information samples and normal account information samples are extracted over a certain period and combined into one overall account information sample; after data preprocessing, feature screening, attribute correlation analysis, and similar steps have been carried out on the overall sample on the basis of business experience, a set of theoretically most effective features is selected. Determining the K-bit hash vector corresponding to each piece of account information according to the K preset hash functions makes it possible to extract the feature attributes of each piece of account information fully, to represent them as a feature vector, and to cope with the very large volume of account information found in a complex network.

Two further points should be noted. First, the K-bit hash vector $H(\vec{v})$ of each piece of account information is obtained through a random hash mapping, in which the values $h_i(\vec{v})$ are mapped to $H(\vec{v})$. The main purpose of the random hash mapping is to map the feature vector of the account information to a uniform representation in 0s and 1s for subsequent processing, rather than simply to reduce dimensionality. Second, once the original feature vectors have been mapped into the new hash space, data whose original feature vectors were similar are very likely to be similar in the new hash space as well; this probability p follows a monotonically increasing mapping from the similarity s to the probability p.
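A minimal sketch of building a K-bit hash vector from K one-bit hash functions follows. The exact form of each hash function is not given in this text; the sketch assumes the common random-projection form, in which bit i is 1 when the dot product of a random vector with the feature vector is non-negative and 0 otherwise. That choice, the Gaussian sampling, and all names are assumptions for illustration.

```python
import random

def make_hash_functions(K, d, seed=0):
    """Draw K random d-dimensional vectors, one per hash function (assumed form).

    Gaussian components are used so that each vector is non-zero with
    probability 1.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(K)]

def hash_vector(v, random_vectors):
    """Concatenate the K one-bit hash values into a K-character bit string."""
    bits = []
    for r in random_vectors:
        dot = sum(ri * vi for ri, vi in zip(r, v))
        bits.append("1" if dot >= 0 else "0")  # assumed sign-based hash
    return "".join(bits)

# Hypothetical 3-attribute feature vector (e.g. amount, count, locations).
random_vectors = make_hash_functions(K=6, d=3)
v = [1.0, 2.0, 3.0]
H = hash_vector(v, random_vectors)
H_scaled = hash_vector([2.0 * x for x in v], random_vectors)
```

Under this assumed form the hash depends only on the direction of the feature vector, so `H` and `H_scaled` are the same bit string.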

In the above embodiment, the relationship between the K-bit hash vector corresponding to each piece of account information and the sequential division of that hash vector into m = K/k classes of sub-hash vectors is presented below in tabular form. Table 1 shows, by way of example, the relationship between account information samples and classes:

In Table 1, the relationship between account information samples and classes can be expressed as a matrix with K rows and N columns, where N is the number of account information samples taken and $c_1$ to $c_N$ denote the N account information samples. The N samples are partitioned under m = K/k classes: each row of the table below the first row represents one class, and within a class the N samples are divided among $2^k$ groups, one for each possible value of a k-bit sub-hash vector.
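The division into classes and groups can be sketched as below, again assuming hash vectors stored as bit strings. The helper names and the three sample hash vectors are hypothetical.

```python
from collections import defaultdict

def split_into_classes(hash_bits, k):
    """Cut a K-bit string into m = K/k consecutive k-bit sub-hash vectors."""
    assert len(hash_bits) % k == 0
    return [hash_bits[i:i + k] for i in range(0, len(hash_bits), k)]

def group_by_class(hashes, k):
    """For each class, group accounts whose sub-hash vectors are identical.

    hashes -- hypothetical mapping {account_id: K-bit string}
    Returns a list of m dicts, each mapping sub-hash value -> [account ids].
    """
    K = len(next(iter(hashes.values())))
    classes = [defaultdict(list) for _ in range(K // k)]
    for account, bits in hashes.items():
        for c, sub in enumerate(split_into_classes(bits, k)):
            classes[c][sub].append(account)
    return classes

hashes = {"c1": "010110", "c2": "011110", "c3": "110111"}
classes = group_by_class(hashes, k=2)  # K = 6, k = 2, so m = 3 classes
```

With these sample values, `c1` and `c2` share a group in the first and third classes, while `c1` and `c3` share a group only in the second.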

Optionally, dividing the account information into communities according to the feature matching network includes: (1) placing each piece of account information in a different cell of the feature matching network; (2) calculating the similarity strength of each piece of account information according to the similarities between the pieces of account information, thereby generating a node similarity strength matrix; (3) for each piece of account information, taking the row of the node similarity strength matrix in which that account information lies and attempting, in descending order of similarity strength, to move the account information into other cells; if the modularity difference after moving the account information from the p-th cell to the q-th cell is positive, the account information is assigned to the q-th cell and the attempt ends; (4) repeating this process until the cell structure no longer changes.

Optionally, calculating the similarity strength of each piece of account information according to the similarities between the accounts includes: calculating the similarity strength $s_{i,j}$ between the i-th account information and the j-th account information according to formula (2), where $\Gamma(i)$ denotes the set of neighbors of the i-th account information, $\Gamma(i) \cap \Gamma(j)$ denotes the set of common neighbors of the i-th account information and the j-th account information, and $w_{ai,z}$ is the sum of the weights of the edges between any account information ai and the z-th account information.
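Formula (2) itself is not reproduced in this text, so the following is only a sketch of one reading of it that agrees with the description's worked example: each common neighbor z contributes the reciprocal of the total weight of the edges incident to z. That reading, the adjacency-dict representation, and the individual edge weights 2 and 3 are assumptions.

```python
def similarity_strength(graph, i, j):
    """Assumed form: sum over common neighbors z of 1 / (total edge weight at z).

    graph -- undirected weighted adjacency, {node: {neighbor: weight}}
    """
    common = set(graph[i]) & set(graph[j])
    return sum(1.0 / sum(graph[z].values()) for z in common)

# Accounts 1 and 2 share the single neighbor 3; the two edges into 3
# have a combined weight of 5 (hypothetical split: 2 + 3).
graph = {
    1: {3: 2.0},
    2: {3: 3.0},
    3: {1: 2.0, 2: 3.0},
}
s12 = similarity_strength(graph, 1, 2)
```

Here `s12` is 1/5, matching the value the description gives for a shared neighbor whose edges sum to 5.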

In a specific implementation, step (1) initializes the feature matching network and places each piece of account information in a different cell; the assignment in this step may be random. In step (2), the similarity strength of each piece of account information is calculated according to formula (2). Specifically, if the common neighbor of account information 1 and account information 2 is account information 3, and the weights of the edges joining account information 1 and account information 2 to account information 3 sum to 5, then the weight of the edges connecting any account information ai to account information 3 is 5, and the similarity strength between account information 1 and account information 2 is therefore 1/5. The similarity strengths between other pieces of account information are calculated in the same way. Suppose four account information samples are taken and, after calculation, form a 4×4 matrix. From this matrix it can be seen that the similarity between account information 1 and account information 2 is 0.25, the similarity between account information 1 and account information 3 is 0.7, and the similarity between account information 2 and account information 3 is 0.4.

In step (3), taking the row of this similarity strength matrix in which a piece of account information lies, attempts are made to move the account information into other cells in descending order of similarity strength. For example, the first row of the matrix shows that when account information 1 is to be moved into another community, the cell containing account information 3, which has the largest similarity (0.6 is the largest value in the first row), is tried first. If ΔQ < 0, an attempt is made to move account information 1 into the community containing account information 4 (0.4 is the second largest value in the first row). If ΔQ < 0 again, an attempt is made to move account information 1 into the community containing account information 2. If ΔQ < 0 still holds, account information 1 is kept as an independent community, the matrix is not updated, and the calculation proceeds to the second row. If ΔQ > 0 is found at any point during these attempts, for example when the first attempt, moving account 1 into the cell containing account information 3, gives ΔQ > 0, the attempt has succeeded and the calculation for the first row ends. Because the state of account 1 has now changed, all data in the first row and first column of the matrix are deleted, which means that subsequent account information is no longer compared with account information 1. A new round of attempts then begins in the same way, that is, community division is performed for account information 2.

The formula for the modularity difference ΔQ is used to verify whether the attempted assignment of account information to a cell is correct. In that formula, n denotes the total of all weights in the network, $k_i$ denotes the weight of the edges connected to vertex i, $k_{i,in}$ denotes the sum of the weights of account information i inside the cell, $\Sigma_{in}$ denotes the sum of the weights of the edges inside the cell, and $\Sigma_{tot}$ denotes the sum of the weights of the edges connected to the account information inside the cell, including both the edges inside the cell and the edges leading outside it. If ΔQ is positive, the assignment is accepted; if it is not positive, the assignment is abandoned. By calculating the similarity strength matrix, each piece of account information is preferentially assigned to the community of its most similar neighbor, which greatly reduces the number of assignment attempts and further speeds up the algorithm. In addition, verifying each attempted assignment with the modularity difference formula more effectively ensures that the assignment is reasonable and accurate.
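The accept-or-reject step can be sketched as below. Because the ΔQ formula is not reproduced in this text, the sketch does not use a closed-form gain; it assumes standard Newman modularity for weighted graphs and computes ΔQ directly as the modularity after the trial move minus the modularity before it. The function names, graph representation, and edge weights are illustrative.

```python
def modularity(graph, cell_of):
    """Newman modularity of a weighted undirected graph {u: {v: w}} (assumed)."""
    two_m = sum(sum(nbrs.values()) for nbrs in graph.values())  # edges counted twice
    internal = {}  # cell -> weight summed over ordered node pairs inside the cell
    total = {}     # cell -> summed edge weight incident to the cell's nodes
    for u, nbrs in graph.items():
        cu = cell_of[u]
        total[cu] = total.get(cu, 0.0) + sum(nbrs.values())
        for v, w in nbrs.items():
            if cell_of[v] == cu:
                internal[cu] = internal.get(cu, 0.0) + w
    return sum(internal.get(c, 0.0) / two_m - (total[c] / two_m) ** 2 for c in total)

def try_move(graph, cell_of, node, candidates):
    """Try the cells of `candidates` in the given order; keep the first move
    whose modularity difference is positive. Returns True if the node moved."""
    base = modularity(graph, cell_of)
    for other in candidates:
        target = cell_of[other]
        if target == cell_of[node]:
            continue
        trial = dict(cell_of)
        trial[node] = target
        if modularity(graph, trial) - base > 0:
            cell_of[node] = target
            return True
    return False

# Four accounts, each starting in its own cell (step (1)).
graph = {
    1: {2: 0.25, 3: 0.7, 4: 0.4},
    2: {1: 0.25, 3: 0.4},
    3: {1: 0.7, 2: 0.4},
    4: {1: 0.4},
}
cell_of = {1: 1, 2: 2, 3: 3, 4: 4}
# Candidates for account 1, ordered by descending similarity strength.
moved = try_move(graph, cell_of, 1, candidates=[3, 4, 2])
```

Recomputing modularity for every trial is slower than an incremental gain formula, but it keeps the sketch independent of the exact expression used in the description.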

To aid understanding of the technical solution of the present invention, FIG. 2 shows, by way of example, a flowchart of the overall approach of the present invention. As shown in FIG. 2: Step S201: map the feature attributes of each piece of account information to a multi-bit hash mapping vector by hash mapping; Step S202: divide the hash mapping vector of each piece of account information into classes; Step S203: for each class, place the account information with identical hash mapping vectors in one group; Step S204: calculate the similarity between any two pieces of account information in each group; Step S205: if the similarity between any two pieces of account information in a group is greater than the threshold, establish an interconnecting edge between those two pieces of account information, with the similarity as the edge weight, thereby forming the feature matching network, where the network so formed is a sparse feature matching network; Step S206: divide the feature matching network into communities according to the similarity strength matrix of the account information in the feature matching network.
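Steps S202 to S205 can be sketched end to end as follows. The sketch assumes bit-string hash vectors and uses the bitwise similarity s = h/K; the function name, the sample hash vectors, and the threshold value 0.6 are illustrative assumptions.

```python
from itertools import combinations

def build_feature_matching_network(hashes, k, threshold):
    """Return {(a, b): similarity} for pairs that share a group and exceed the threshold.

    hashes -- hypothetical mapping {account_id: K-bit string}
    """
    K = len(next(iter(hashes.values())))
    candidate_pairs = set()
    for start in range(0, K, k):                      # one pass per class
        groups = {}
        for account, bits in hashes.items():
            groups.setdefault(bits[start:start + k], []).append(account)
        for members in groups.values():               # pairs inside each group
            candidate_pairs.update(combinations(sorted(members), 2))
    edges = {}
    for a, b in candidate_pairs:
        s = sum(x == y for x, y in zip(hashes[a], hashes[b])) / K
        if s > threshold:
            edges[(a, b)] = s                         # similarity is the edge weight
    return edges

hashes = {"c1": "010110", "c2": "011110", "c3": "110111", "c4": "101001"}
edges = build_feature_matching_network(hashes, k=2, threshold=0.6)
```

Only pairs that share at least one group are ever compared, which is what keeps the resulting network sparse: in this sample `c4` shares no group with any other account and therefore gets no edges.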

Compared with the prior art, the embodiments of the present invention offer the following advantages. First, random hash mapping maps the feature attributes of each piece of account information into a new hash space to form a hash mapping vector for each piece of account information, and the hash mapping vectors are then divided into classes. Edges can thus be established between highly similar pieces of account information, which avoids a large number of similarity calculations between arbitrary pairs of account information, efficiently gives every edge a credible weight, and improves the accuracy and speed of the subsequent community division. Second, a feature matching network is built from the similarities between the pieces of account information and is then divided into communities according to the similarity strength matrix of the account information in the network. This not only allows abnormal communities to be found effectively and targeted measures to be taken, but also allows unknown types of fraud to be detected. Moreover, because the division uses the similarity strength matrix, each piece of account information is preferentially assigned to the community of its most similar neighbor, which greatly reduces the number of assignment attempts and further speeds up the algorithm. Third, once the feature matching network has been formed, the similarities between related pieces of account information are stored permanently as edge weights. Even if a large amount of new account information arrives, the existing interconnecting edges in the network are unaffected; the new account information only needs to be inserted into the existing feature matching network. When new account information is added to the existing feature matching network graph, it is first hashed by the random hash mapping method and divided into classes as before, its similarity to the account information in the same class is then calculated, and a new edge is added if that similarity is greater than the threshold. After that, only the community division algorithm, which requires less computation but is more precise, needs to be run. In addition, the structure of the feature matching network shows the associations within and between communities more clearly, which traditional clustering methods cannot achieve.

Based on the same concept, an embodiment of the present invention provides a community division device based on a feature matching network. As shown in FIG. 3, the device includes a determining unit 301, a first dividing unit 302, a second dividing unit 303, a calculation unit 304, a network forming unit 305, and a third dividing unit 306, where: the determining unit 301 is configured to determine the K-bit hash vector corresponding to each piece of account information according to K preset hash functions; the first dividing unit 302 is configured to sequentially divide the hash vector corresponding to each piece of account information into m = K/k classes of sub-hash vectors; the second dividing unit 303 is configured to place, for each class, the account information with identical sub-hash vectors in the same group; the calculation unit 304 is configured to calculate the similarity between the pieces of account information in the same group; the network forming unit 305 is configured to establish an interconnecting edge between pieces of account information if the similarity between them is greater than a threshold, thereby forming a feature matching network; and the third dividing unit 306 is configured to divide the account information into communities according to the feature matching network.

Optionally, the calculation unit 304 is specifically configured to: if the i-th account information and the j-th account information are in the same group in n of the classes, take n/m as the similarity between the i-th account information and the j-th account information, where the i-th account information and the j-th account information are any of the pieces of account information.

Optionally, the calculation unit 304 is further specifically configured to: if the i-th account information and the j-th account information are in the same group, count the number h of bit positions at which the hash vector of the i-th account information and the hash vector of the j-th account information hold the same value, where the i-th account information and the j-th account information are any of the pieces of account information; the similarity between the i-th account information and the j-th account information is s = h/K.

Optionally, the determining unit 301 is configured to determine the K-bit hash vector $H(\vec{v})$ corresponding to each piece of account information according to formula (3):

$$H(\vec{v}) = 2'b\,h_1(\vec{v})\,h_2(\vec{v})\cdots h_K(\vec{v}) \tag{3}$$

where $\vec{v}$ denotes the feature vector of the account information, $\vec{v} = (c_1, c_2, \ldots, c_d)$, $c_1, c_2, \ldots, c_d$ denote the feature attributes of the account information, and $\vec{r}$ denotes a randomly selected non-zero vector.

Optionally, the third dividing unit 306 is specifically configured to: (1) place each piece of account information in a different cell of the feature matching network; (2) calculate the similarity strength of each piece of account information according to the similarities between the pieces of account information, thereby generating a node similarity strength matrix; (3) for each piece of account information, take the row of the node similarity strength matrix in which that account information lies and attempt, in descending order of similarity strength, to move the account information into other cells; if the modularity difference after moving the account information from the p-th cell to the q-th cell is positive, assign the account information to the q-th cell and end the attempt; (4) repeat this process until the cell structure no longer changes.

Optionally, the calculation unit 304 is further specifically configured to calculate the similarity strength $s_{i,j}$ between the i-th account information and the j-th account information according to formula (4).

As can be seen from the above, the embodiments of the present invention provide a community division device based on a feature matching network, which determines the K-bit hash vector corresponding to each piece of account information according to K preset hash functions; sequentially divides the hash vector corresponding to each piece of account information into classes of sub-hash vectors; for each class, places the account information with identical sub-hash vectors in the same group; calculates the similarity between the pieces of account information in the same group; establishes an interconnecting edge between pieces of account information whose similarity is greater than a threshold, forming a feature matching network; and divides the account information into communities according to the feature matching network. In the embodiments of the present invention, the K-bit hash vector corresponding to each piece of account information is first determined according to the K preset hash functions. For the very large volume of account information in a network, a hash function that produces only two hash values is insufficient, so determining a K-bit hash vector for each piece of account information makes it possible to handle complex network account information. Then, for each class, the account information with identical sub-hash vectors is placed in one group and the similarity is calculated only between the pieces of account information within the same group. This avoids the very large computational cost of calculating the similarity between every pair of account information in the whole network, so the technical solution of the present invention effectively reduces the amount of similarity computation. Finally, an interconnecting edge is established between pieces of account information whose similarity is determined to be greater than the threshold, forming the feature matching network, and the account information is divided into communities according to that network. The division is thereby more precise, which not only makes the associations between communities clear but also allows the resulting communities to be analyzed in order to find abnormal communities and then to screen the accounts within them for abnormal accounts, so that fraudulent accounts are identified in a more targeted way and handled more efficiently. In addition, if account information needs to be added to the divided communities, it is only necessary to repeat the few simple steps above for the added account information and update it to the corresponding position, so updating poses no difficulty.

Those skilled in the art should understand that the embodiments of the present invention may be provided as a method or as a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk memory, CD-ROM, optical memory, and the like) containing computer-usable program code.

The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, such that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, so that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Although preferred embodiments of the present invention have been described, those skilled in the art can make further changes and modifications to these embodiments once they learn of the basic inventive concept. The appended claims are therefore intended to be construed as covering the preferred embodiments and all changes and modifications that fall within the scope of the present invention.

Obviously, those skilled in the art can make various changes and variations to the present invention without departing from its spirit and scope. Thus, if such modifications and variations fall within the scope of the claims of the present invention and their technical equivalents, the present invention is intended to include them as well.

Claims

A community division method based on a feature matching network, comprising: determining, according to K preset hash functions, a K-bit hash vector corresponding to each piece of account information, K being greater than 1; sequentially dividing the hash vector corresponding to each piece of account information into m = K/k classes of sub-hash vectors, k being not greater than K; for each class, placing the account information having the same sub-hash vector in the same group; calculating the similarity between the pieces of account information in the same group; if the similarity between pieces of account information is greater than a threshold, establishing an interconnecting edge between those pieces of account information to form a feature matching network; and dividing the account information into communities according to the feature matching network.

The community division method based on a feature matching network according to claim 1, wherein calculating the similarity between the pieces of account information in the same group comprises: if the i-th account information and the j-th account information are in the same group in n of the classes, taking n/m as the similarity between the i-th account information and the j-th account information, the i-th account information and the j-th account information being any of the pieces of account information.

The community division method based on a feature matching network according to claim 1, wherein calculating the similarity between the pieces of account information in the same group comprises: if the i-th account information and the j-th account information are in the same group, counting the number h of positions at which the hash vector of the i-th account information and the hash vector of the j-th account information hold the same value; the i-th account information and the j-th account information each being any one of the pieces of account information; and taking s = h/K as the similarity between the i-th account information and the j-th account information.
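The two similarity measures recited in the two preceding claims can be compared on a small example. The sketch below assumes that "n" counts the classes in which the two sub-hash vectors coincide and that "h" counts the bit positions that coincide; the function names and sample vectors are hypothetical.

```python
def class_similarity(u, v, k):
    """n / m: share of the m = K/k classes whose sub-hash vectors coincide."""
    if len(u) != len(v) or len(u) % k != 0:
        raise ValueError("vectors must have equal length K, divisible by k")
    m = len(u) // k
    n = sum(u[i:i + k] == v[i:i + k] for i in range(0, len(u), k))
    return n / m


def bit_similarity(u, v):
    """h / K: share of the K positions at which the hash bits coincide."""
    if len(u) != len(v):
        raise ValueError("vectors must have equal length K")
    h = sum(x == y for x, y in zip(u, v))
    return h / len(u)


u = [1, 0, 1, 1, 0, 0]
v = [1, 0, 1, 0, 0, 0]
coarse = class_similarity(u, v, k=2)   # 2 of 3 classes match
fine = bit_similarity(u, v)            # 5 of 6 bits match
```

The class-based measure reuses information already produced by the grouping step, while the bit-based measure is finer-grained at the cost of one pass over all K positions per pair.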

The community division method based on a feature matching network according to claim 1, wherein determining the K-bit hash vector corresponding to each piece of account information according to the K preset hash functions comprises: determining the K-bit hash vector corresponding to each piece of account information according to formula (1): [formula (1) not reproduced]; where 2'b indicates that [...] is a binary number, [...] is one of the K preset hash functions, [...] denotes the feature vector of the account information, where [...], c_1, c_2, …, c_d denote the characteristic attributes of the account information, and [...] denotes a randomly selected non-zero vector, [...].
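Formula (1) is not preserved in the text above, so the following Python sketch is an assumption rather than the claimed formula. It shows one well-known hash family that matches the ingredients the claim names (a feature vector, K hash functions, a randomly selected non-zero vector per function, and a binary output): the sign of a random projection. The function names, the Gaussian sampling, and the seed are all hypothetical.

```python
import random


def make_random_vectors(K, d, seed=0):
    """Draw K random non-zero vectors of dimension d (one per hash function)."""
    rng = random.Random(seed)
    vectors = []
    while len(vectors) < K:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        if any(x != 0.0 for x in v):   # reject the (unlikely) zero vector
            vectors.append(v)
    return vectors


def hash_vector(features, random_vectors):
    """Assumed hash: bit is 1 if the projection onto v is non-negative, else 0."""
    bits = []
    for v in random_vectors:
        projection = sum(v_i * c_i for v_i, c_i in zip(v, features))
        bits.append(1 if projection >= 0 else 0)
    return bits


random_vectors = make_random_vectors(K=8, d=3, seed=42)
account_features = [0.5, 1.0, -2.0]          # hypothetical attributes c_1..c_3
account_hash = hash_vector(account_features, random_vectors)
```

Under this assumed scheme, feature vectors pointing in similar directions tend to agree on most bits, which is what makes the later grouping by sub-hash vector meaningful.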

The community division method based on a feature matching network according to any one of claims 1 to 4, wherein dividing the account information into communities according to the feature matching network comprises: (1) assigning each piece of account information to a different community in the feature matching network; (2) calculating the similarity strength of each piece of account information according to the similarity between the pieces of account information, to generate a node similarity strength matrix; (3) for each piece of account information, starting from the row of the node similarity strength matrix in which that account information is located, attempting to move the account information into other communities in descending order of similarity strength; if the modularity difference after moving the account information from the p-th community to the q-th community is positive, assigning the account information to the q-th community and ending; (4) repeating until the community structure no longer changes.
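Steps (1) to (4) of this claim can be sketched as a greedy local-moving loop. The sketch below makes several assumptions that the claim does not spell out: modularity is taken to be the standard weighted Newman modularity, candidate communities are those of a node's neighbours, and the edge weight stands in for the similarity strength. All names are hypothetical, and the modularity is recomputed from scratch for clarity rather than speed.

```python
def make_graph(weighted_edges):
    """Build a symmetric adjacency dict from (a, b, weight) triples."""
    adj = {}
    for a, b, w in weighted_edges:
        adj.setdefault(a, {})[b] = w
        adj.setdefault(b, {})[a] = w
    return adj


def modularity(adj, community):
    """Weighted Newman modularity of a partition (assumed definition)."""
    two_w = sum(w for nbrs in adj.values() for w in nbrs.values())
    degree = {n: sum(nbrs.values()) for n, nbrs in adj.items()}
    q = 0.0
    for i in adj:
        for j in adj:
            if community[i] == community[j]:
                q += adj[i].get(j, 0.0) - degree[i] * degree[j] / two_w
    return q / two_w


def divide_communities(adj, strength, tol=1e-12):
    """Greedy moves ordered by similarity strength, until nothing changes."""
    community = {node: node for node in adj}        # step (1)
    changed = True
    while changed:                                   # step (4)
        changed = False
        for node in adj:                             # step (3)
            row = strength[node]
            ranked = sorted(row, key=row.get, reverse=True)
            current_q = modularity(adj, community)
            for other in ranked:
                target = community[other]
                if target == community[node]:
                    continue
                trial = dict(community)
                trial[node] = target
                if modularity(adj, trial) - current_q > tol:
                    community = trial                # accept the first gain
                    changed = True
                    break
    return community


# Two triangles joined by one weak edge.
graph = make_graph([
    (0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
    (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0),
    (2, 3, 0.1),
])
# Assumption: edge weights stand in for the similarity strength matrix of step (2).
result = divide_communities(graph, strength=graph)
```

Because every accepted move strictly increases modularity and the number of partitions is finite, the loop terminates; on the toy graph the two triangles end up as two separate communities.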

The community division method based on a feature matching network according to claim 5, wherein calculating the similarity strength of each piece of account information according to the similarity between the pieces of account information comprises: calculating the similarity strength s_{i,j} between the i-th account information and the j-th account information according to formula (2): [formula (2) not reproduced], where [...]; where Γ(i) denotes the neighbor set of the i-th account information, Γ(i) ∩ Γ(j) denotes the set of common neighbors of the i-th account information and the j-th account information, and w_{ai,z} is the sum of the weights of the edges between any account information ai and the z-th account information.

A community division device based on a feature matching network, comprising: a determining unit configured to determine, according to K preset hash functions, a K-bit hash vector corresponding to each piece of account information, K being greater than 1; a first dividing unit configured to divide the hash vector corresponding to each piece of account information into m = K/k classes of sub-hash vectors, k being not greater than K; a second dividing unit configured to place, for each class, the account information having the same sub-hash vector into the same group; a calculation unit configured to calculate the similarity between the pieces of account information in the same group; a network forming unit configured to establish, if the similarity between pieces of account information is greater than a threshold, an interconnecting edge between them so as to form a feature matching network; and a third dividing unit configured to divide the account information into communities according to the feature matching network.

The community division device based on a feature matching network according to claim 7, wherein the calculation unit is specifically configured to: if the i-th account information and the j-th account information are in the same group in n classes, take n/m as the similarity between the i-th account information and the j-th account information; the i-th account information and the j-th account information each being any one of the pieces of account information.

The community division device based on a feature matching network according to claim 7, wherein the calculation unit is further specifically configured to: if the i-th account information and the j-th account information are in the same group, count the number h of positions at which the hash vector of the i-th account information and the hash vector of the j-th account information hold the same value; the i-th account information and the j-th account information each being any one of the pieces of account information; the similarity between the i-th account information and the j-th account information being s = h/K.

The community division device based on a feature matching network according to claim 7, wherein the determining unit is configured to determine the K-bit hash vector corresponding to each piece of account information according to formula (3): [formula (3) not reproduced]; where 2'b indicates that [...] is a binary number, [...] is one of the K preset hash functions, [...] denotes the feature vector of the account information, where [...], c_1, c_2, …, c_d denote the characteristic attributes of the account information, and [...] denotes a randomly selected non-zero vector, [...].

The community division device based on a feature matching network according to any one of claims 7 to 10, wherein the third dividing unit is specifically configured to: (1) assign each piece of account information to a different community in the feature matching network; (2) calculate the similarity strength of each piece of account information according to the similarity between the pieces of account information, to generate a node similarity strength matrix; (3) for each piece of account information, starting from the row of the node similarity strength matrix in which that account information is located, attempt to move the account information into other communities in descending order of similarity strength; if the modularity difference after moving the account information from the p-th community to the q-th community is positive, assign the account information to the q-th community and end; (4) repeat until the community structure no longer changes.

The community division device based on a feature matching network according to claim 11, wherein the calculation unit is further configured to calculate the similarity strength s_{i,j} between the i-th account information and the j-th account information according to formula (4): [formula (4) not reproduced], where w(z) = w_{ai,z}; where Γ(i) denotes the neighbor set of the i-th account information, Γ(i) ∩ Γ(j) denotes the set of common neighbors of the i-th account information and the j-th account information, and w_{ai,z} is the sum of the weights of the edges between any account information ai and the z-th account information.