CN116304252A - Communication network fraud prevention method based on graph structure clustering

Info

Publication number: CN116304252A
Application number: CN202310006675.6A
Authority: CN
Inventors: 周泽宇; 袁龙; 陈紫; 俞唯仁
Original assignee: Nanjing University of Science and Technology
Current assignee: Nanjing University of Science and Technology
Priority date: 2023-01-04
Filing date: 2023-01-04
Publication date: 2023-06-23

Abstract

The invention discloses a communication network anti-fraud method and system based on graph structure clustering. The method comprises the following steps: constructing a network structure from the users and their call relationships, converting the network structure into a graph model, and storing it in a CSR structure; generating, from the CSR structure of the graph data, a degree-oriented non-repeating edge table and an edge table index; and transmitting the data of the graph model to a computing system, which executes the structural clustering method on the graph and outputs the structural clustering results. The invention can help law enforcement agencies quickly locate the areas of a huge communication network where criminals may be active, detect possible fraud gangs and defrauded persons, narrow the scope of observation, and improve law enforcement efficiency.

Description

A communication network anti-fraud method based on graph structure clustering

Technical field

The invention belongs to the technical field of big data and data mining, and specifically relates to a communication network anti-fraud method based on graph structure clustering.

Background

In today's big data era, the graph formed by a communication network is very large: the communication data of a city with a population of several million may form a graph with hundreds of millions of edges. Because combating fraud is time-sensitive, how to mine the important hierarchical structure information from such a graph in a short time has long been both a focus and a difficulty of technical development. Recent studies have proposed several classic community search methods defined by subgraph cohesion metrics, such as k-core, k-truss, k-clique, and k-ECC. These methods find cohesive, connected subgraphs and treat them as communities. However, they either impose conditions that are too weak, so that the communities found are poorly cohesive (k-core, for example), or impose conditions that are too strict, so that few such communities exist in the graph and searching for them takes too long (k-truss and k-clique, for example). These methods are therefore unsuitable for cluster search on most communication network graphs. Graph structural clustering, which likewise discovers cohesive subgraphs in a network, handles this problem well: it produces clusters of good quality while maintaining high performance, and is therefore better suited to very large communication networks.

Traditional graph structural clustering usually adopts methods designed for serial execution on a CPU. Such methods have a serious problem: they perform well only on small and medium-sized graphs with no more than about a million relationships, and perform poorly when clustering must be run frequently on large graph data. For example, in large social network graphs such as today's Twitter and Weibo networks, which contain more than a billion relationships, traditional methods can hardly find the cohesive subgraphs, because computing the clusters takes several hours. This does not meet the needs of practical anti-fraud applications in the big data era.

Summary of the invention

In order to solve the above technical defects in the prior art, the invention proposes a communication network anti-fraud method based on graph structure clustering.

The technical solution for realizing the object of the invention is a communication network anti-fraud method based on graph structure clustering, comprising the following steps:

The communication operator platform periodically collects all users who communicated over the communication network within a set time period, together with the call records of all those users;

A network structure is built from the users and their call relationships, converted into a graph model, and stored in a CSR structure;

Based on the CSR structure of the graph data, a degree-oriented non-repeating edge table and an edge table index are generated;

The data of the graph model is transmitted to the computing system, which executes the structural clustering method on the graph and outputs the structural clustering results;

The structural clustering results are analyzed, the analysis results are displayed, and suspicious objects are listed.

Preferably, the specific method for converting the network structure into a graph model is:

Each user is a point in the graph, and the user information data is modeled as attributes of the point;

Each communication between users is an edge of the graph, and the communication data is modeled as attributes of the edge.

Preferably, the specific method for storing the graph model in a CSR structure is:

The degree of each point (the number of edges connected to the point) is recorded in point Id order to form a degree array, and the neighbors of each point, themselves sorted by Id, are stored in point Id order to form the adjacency array Adj. The prefix sum of the degree array then gives the array Rptr, which holds the starting position in Adj of the neighbor set of each point. Rptr and Adj together constitute the CSR structure of the graph.
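As a non-limiting illustration, the CSR construction described above can be sketched in Python as follows. The function name `build_csr` and the convention of appending the total edge-slot count as a final entry of `rptr` are assumptions made for this sketch; the text itself only requires the starting positions.

```python
from itertools import accumulate


def build_csr(num_points, edges):
    """Build the (Rptr, Adj) arrays of an undirected graph.

    num_points: number of points, with Ids 0 .. num_points - 1.
    edges: iterable of (u, v) pairs; duplicates and self-loops are ignored.
    """
    neighbours = [set() for _ in range(num_points)]
    for u, v in edges:
        if u != v:
            neighbours[u].add(v)
            neighbours[v].add(u)

    # Degree array in point Id order.
    degree = [len(s) for s in neighbours]
    # Prefix sum of the degrees: rptr[u] is where the neighbours of u start
    # in adj, and rptr[u + 1] is where they end (extra final entry assumed).
    rptr = [0] + list(accumulate(degree))
    # Neighbours of each point, sorted by Id, concatenated in point Id order.
    adj = [v for s in neighbours for v in sorted(s)]
    return rptr, adj
```

For a four-point graph with edges (0,1), (0,2), (0,3) and (1,2), the neighbors of point 1 are then `adj[rptr[1]:rptr[2]]`.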

Preferably, the specific method for generating the degree-oriented non-repeating edge table and the edge table index from the CSR structure of the graph data is:

For each point u, let N₊(u) be the set of neighbors of u whose degree is greater than that of u, or whose degree is equal and whose Id is greater than that of u; create auxiliary arrays upptr and his, which record, respectively, the position of each neighbor v of u within N₊(u) (when v belongs to N₊(u)) and the size of N₊(u);

Compute the exclusive prefix sum over his to obtain the write position, in the Degree-oriented Edge List, of the edges that have u as their source point;

Traverse each element of the Adj array in a two-level loop, denoting the position of v∈N(u) as O_uv. When the point being processed satisfies v∈N₊(u), create the mapping eid(O_uv) by adding elptr(u) to the relative offset upptr(u) of O_uv from the starting position, and assign e(u,v) to the edge table; otherwise, call a binary search on N(v) to locate O_vu, and create the mapping eid(O_uv) from the edge that has v as its source point.
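A sequential Python sketch of the three steps above is given below for illustration only. It assumes that `rptr` carries a final end-of-array entry, that each point's neighbors in `adj` are sorted by Id, and that `upptr` is indexed by position in `adj` with -1 marking neighbors outside N₊(u); these conventions and the function name are assumptions of the sketch.

```python
from bisect import bisect_left
from itertools import accumulate


def build_edge_list(rptr, adj):
    """Return (edges, eid): the non-repeating edge table and the adj -> edge index."""
    n = len(rptr) - 1
    deg = [rptr[u + 1] - rptr[u] for u in range(n)]

    # Step 1: upptr[pos] is the rank of adj[pos] inside N+(u), his[u] = |N+(u)|.
    upptr = [-1] * len(adj)
    his = [0] * n
    for u in range(n):
        for pos in range(rptr[u], rptr[u + 1]):
            v = adj[pos]
            if (deg[v], v) > (deg[u], u):  # higher degree, or equal degree and larger Id
                upptr[pos] = his[u]
                his[u] += 1

    # Step 2: exclusive prefix sum gives the first write position of each source point.
    elptr = list(accumulate([0] + his[:-1]))

    # Step 3: fill the edge table and map every adj position to its edge.
    edges = [None] * sum(his)
    eid = [-1] * len(adj)
    for u in range(n):
        for pos in range(rptr[u], rptr[u + 1]):
            v = adj[pos]
            if upptr[pos] >= 0:
                e = elptr[u] + upptr[pos]
                edges[e] = (u, v)
            else:
                # The edge is stored with v as source: locate u inside N(v).
                back = bisect_left(adj, u, rptr[v], rptr[v + 1])
                e = elptr[v] + upptr[back]
            eid[pos] = e
    return edges, eid
```

Each undirected edge is stored once, directed from its lower-ranked endpoint to its higher-ranked endpoint, while both of its positions in `adj` map to the same edge index.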

Preferably, the specific process by which the computing system executes the structural clustering method on the graph and outputs the structural clustering results is:

Input the clustering parameters, and initialize the cluster Id of each point as well as the certainty and validity of all points;

Use the pruning strategy and the input parameters to determine the similarity of some edges and the clustering role of some points, thereby eliminating redundant calculations;

Calculate the similarity of each edge, and determine whether each point is a core point from the similarity of the edges incident to it;

Use a union-find (disjoint-set) structure to cluster the core points preliminarily, then expand outward from the preliminary clusters to form the final clusters;

Classify each point outside every cluster as a hub point or an outlier according to whether it is connected to different clusters;

Obtain the clustering of all points of the graph, and return the corresponding division of users into groups to the platform.
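The hub/outlier step in the list above can be illustrated with the following Python sketch. It assumes that unclustered points carry the cluster Id -1 and that a point is a hub when its neighbors belong to at least two different clusters; the sentinel value and the function name are assumptions of the sketch.

```python
UNCLUSTERED = -1  # assumed sentinel for points that belong to no cluster


def classify_unclustered(rptr, adj, cluster_id):
    """Label every unclustered point as 'hub' or 'outlier'."""
    roles = {}
    for u in range(len(rptr) - 1):
        if cluster_id[u] != UNCLUSTERED:
            continue
        # Distinct clusters reached through the neighbours of u.
        touched = {
            cluster_id[v]
            for v in adj[rptr[u]:rptr[u + 1]]
            if cluster_id[v] != UNCLUSTERED
        }
        roles[u] = "hub" if len(touched) >= 2 else "outlier"
    return roles
```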

Preferably, GPU threads initialize the certainty of every point in parallel, setting it to 0; GPU threads also initialize the validity of every point in parallel, setting it to the number of neighbors of that point. If the validity of a point is less than μ-1, the point is classified in advance as a non-core point.

Preferably, the specific process of using the pruning strategy and the input parameters to determine the similarity of some edges and the clustering role of some points, thereby eliminating redundant calculations, is:

Using the generated degree-oriented non-repeating edge table, each warp of the GPU takes one edge (u,v) from the edge table in order. If the edge satisfies |N[u]|<ε²·|N[v]|, (u,v) can be judged dissimilar directly, and the validity of point u and of point v is each reduced by one; if this causes the validity of either point to fall below μ-1, that point is classified in advance as a non-core point. Here N[u] denotes the set consisting of point u itself and its neighbors, and |N[u]| denotes the number of elements in this set, which equals the degree of u plus one.

Preferably, the specific method for calculating the similarity of each edge and determining whether each point is a core point from the similarity of its incident edges is:

Calculate the similarity of each edge (u,v) according to the following formula:

$$\sigma(u,v)=\frac{|N[u]\cap N[v]|}{\sqrt{|N[u]|\cdot|N[v]|}}$$

Update the similarity according to how the attributes of the two points u and v match. Specifically: the address attribute of each point is taken into account, and if the two points have the same address attribute, the similarity is increased by a certain weight; the communication time attribute of edge (u,v) is taken into account, and if the communication time represented by the edge is greater than a set value, the similarity of u and v is increased by a certain weight;

If σ(u,v)<ε, the two points are dissimilar, and the validity of u and of v is each reduced by one; if σ(u,v)≥ε, they are similar, and the certainty of u and of v is each increased by one. If the validity of a point thereby falls below μ-1, the point is classified as a non-core point; if the certainty of a point thereby exceeds μ-1, the point is classified as a core point.
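The attribute-based adjustment described above leaves the weights and the time threshold open. The following Python sketch shows one possible reading; the parameter names, the default values of 0.1 and 300 seconds, and the cap at 1.0 are all hypothetical and are not taken from the text.

```python
def adjust_similarity(sigma, same_address, call_seconds,
                      address_bonus=0.1,       # hypothetical weight
                      long_call_seconds=300,   # hypothetical "set value"
                      long_call_bonus=0.1):    # hypothetical weight
    """Raise a structural similarity using point and edge attributes."""
    if same_address:
        # Both users share the same address attribute.
        sigma += address_bonus
    if call_seconds > long_call_seconds:
        # The communication time on the edge exceeds the set value.
        sigma += long_call_bonus
    # Capping at 1.0 is an assumption of this sketch.
    return min(sigma, 1.0)
```

An edge whose structural similarity lies just below ε may thus become a similar edge once its attributes are considered.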

Preferably, the specific method for preliminarily clustering the core points with a union-find structure and expanding outward from the preliminary clusters to form the final clusters is:

Each core point is initialized as a single-element tree set with the point as the root of the tree; the Id of the set is the Id of its root;

The core points are clustered first. In the first step, each warp of the GPU processes one core point u in parallel, and the threads of the warp traverse the neighbors of u in parallel. If a neighbor v of u is also a core point, the generated edge table index is used to find the position of edge (u,v) in the edge table, and the similarity recorded for edge (u,v) is read from that position. If the similarity of (u,v) is unknown, the edge is skipped. If (u,v) is similar, the Find operation of the union-find structure searches upward for the root nodes R_u and R_v of the sets containing u and v; if R_u and R_v differ, that is, u and v are not in the same set, the Union operation merges the two sets. The second step is similar to the first: the similarity between every core point u and each neighboring core point v is again checked in parallel. If the similarity is unknown, it is calculated with the similarity calculation method described above; if the two points are similar, the two sets containing u and v are merged as in the first step;

The non-core points are clustered next. Each warp of the GPU checks one core point u in parallel, and each thread of the warp looks for a non-core neighbor v of u. The similarity of (u,v) is likewise looked up through the edge table index and the edge table. If (u,v) is similar, the non-core point is added to the cluster, that is, the cluster Id of v is set to the cluster Id of the core point u.
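A sequential Python sketch of the two clustering phases is given below as an illustration, not as the GPU implementation. It assumes that `sim[e]` holds True, False or None (unknown) for edge `e`, that `eid` maps each position of `adj` to its edge, and that `compute_similar` is a caller-supplied function for the second step; in the non-core phase an unknown similarity is treated as not similar, which is an assumption of the sketch.

```python
def find(parent, x):
    """Return the root of the set containing x (with path halving)."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def union(parent, a, b):
    """Merge the sets containing a and b; the smaller root Id becomes the root."""
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[max(ra, rb)] = min(ra, rb)


def cluster(rptr, adj, eid, sim, is_core, compute_similar):
    n = len(rptr) - 1
    parent = list(range(n))  # every point starts as its own root

    # Core points: step 1 uses known similarities only, step 2 fills the unknowns.
    for step in (1, 2):
        for u in range(n):
            if not is_core[u]:
                continue
            for pos in range(rptr[u], rptr[u + 1]):
                v = adj[pos]
                if not is_core[v]:
                    continue
                e = eid[pos]
                if sim[e] is None:
                    if step == 1:
                        continue
                    sim[e] = compute_similar(u, v)
                if sim[e]:
                    union(parent, u, v)

    cluster_id = [find(parent, u) if is_core[u] else -1 for u in range(n)]

    # Non-core points: attach each one to the cluster of a similar core neighbour.
    for u in range(n):
        if not is_core[u]:
            continue
        for pos in range(rptr[u], rptr[u + 1]):
            v = adj[pos]
            if not is_core[v] and sim[eid[pos]]:
                cluster_id[v] = cluster_id[u]
    return cluster_id
```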

Compared with the prior art, the invention has the following significant advantages. On the one hand, it can help law enforcement agencies quickly locate the areas of a huge communication network where criminals may be active, detect possible fraud gangs and defrauded persons, narrow the scope of observation, and improve law enforcement efficiency. On the other hand, it accelerates the structural clustering of graphs with CUDA, the highly concurrent computing framework of Nvidia GPUs, and can complete the structural clustering of large graph data with tens of millions of edges in a time on the order of a hundred milliseconds. It also proposes a similarity calculation strategy and a union-find clustering strategy for GPU parallel execution that are currently optimal, which greatly increases calculation speed and gives the invention the advantages of low latency, fast response, and good robustness.

Additional features and advantages of the invention will be set forth in the description that follows, and in part will be apparent from the description or may be learned by practice of the invention. The objectives and other advantages of the invention may be realized and attained by the structure particularly pointed out in the written description, the claims, and the appended drawings.

Brief description of the drawings

The drawings are for the purpose of illustrating specific embodiments only and are not to be considered as limiting the invention; like reference numerals refer to like parts throughout the drawings.

Fig. 1 is a system architecture diagram of an embodiment of the invention.

Fig. 2 is a flowchart of an embodiment of the invention.

Fig. 3 is a data structure diagram of the graph structure clustering method of the invention.

Fig. 4 is a GPU parallel computing diagram of the graph structure clustering method of the invention.

Fig. 5 is a diagram showing the result of applying the invention to a communication network.

Detailed description

It is easy to understand that, according to the technical solution of the invention, those of ordinary skill in the art can conceive of various implementations of the invention without changing its essential spirit. Therefore, the following specific embodiments and drawings are only exemplary descriptions of the technical solution of the invention, and should not be regarded as the whole of the invention or as limiting its technical solution. Rather, these embodiments are provided so that those skilled in the art can understand the invention more thoroughly. Preferred embodiments of the invention are described in detail below with reference to the accompanying drawings, which form part of this application and, together with the embodiments, serve to explain the inventive concept.

In order to make the purpose and technical solution of the invention clearer, the technical solution is described in detail below with reference to the accompanying drawings and embodiments.

The system structure of the invention is shown in Fig. 1. Fig. 1 is a system architecture diagram of an embodiment of the invention. The architecture is distributed and comprises a communication network, a front-end platform, a data analysis layer, and a computing module. The communication network may be a telephone network or a mobile Internet network operated and maintained by a major communication operator. The front-end platform is a server on which the front-end interface is deployed; all operations, including setting the search range, scheduling timed tasks, entering parameters, and displaying results, are performed on this platform. The data analysis layer is a back-end server cluster that receives parameters from the front-end platform and collects communication network data. It is responsible for structuring the collected communication data into graph data and passing it to the computing module for processing, and for receiving the results of the computing module, analyzing them, and sending them back to the front-end platform for display. The computing module is a server cluster equipped with graphics processors. Its calculation method is the structural clustering method based on graph data, whose efficiency is greatly improved by the powerful parallel computing capability of the graphics processors.

In order to realize the functions of the system, the communication network must determine a unique id for each user (such as a telephone number) and obtain the feature information of each user during use. The communication network must also be divided by area, so that the communication ids used by the users within a target area can be identified. The data analysis layer maintains an index table of users, which records the users identified during the current time period within the area for which the server is responsible. Each user in the index table corresponds to a communication object table, which stores the persons who have communicated with that user. Each time a communication occurs, the communication network obtains the information of both parties and passes it to the data analysis layer, which checks whether the two users already exist in the index table; a user who is absent is added, and a user who is already present is not added again. The communication object tables of the two users are then checked, and each party is added to the other's table if not already recorded there.
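For illustration, the index table and the per-user communication object tables described above can be modeled as a dictionary of sets. The function name `record_call` and the use of Python sets are assumptions of this sketch; the embodiment leaves the concrete table layout open.

```python
def record_call(index_table, caller_id, callee_id):
    """Register one communication between two users.

    index_table maps a user id to that user's communication object table,
    modeled here as a set so that repeated calls are recorded only once.
    """
    for user, peer in ((caller_id, callee_id), (callee_id, caller_id)):
        # Add the user to the index table if absent, then record the peer.
        index_table.setdefault(user, set()).add(peer)
    return index_table
```

Calling `record_call` again for the same pair of users leaves the tables unchanged, which matches the "add only if not already recorded" rule.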

When communication information is collected in the above manner, an attribute mapping table for the users and an attribute mapping table for the users' communication objects can also be generated. When the user id is obtained, the operator's communication network may, if it is able, also obtain attribute information such as the user's physical address (from the base station or IP address) and the call time. Once determined, the user information can be added to the attribute mapping table that corresponds one-to-one with the users in the index table, and the communication information can be added to the attribute mapping table that corresponds one-to-one with the communication object table. These attributes can later be modeled as point attributes and edge attributes of the graph, providing additional input to the similarity calculation and improving accuracy.

The structure of the index table, the communication object table, and the attribute table can be as shown in Table 1:

Table 1

In Table 1, a numeric identifier is used as the user's identity, and the physical address is used as the user attribute. The call duration is used as the attribute of a communication object. Each example user in the table has two communication objects, but in practice the number is not limited, and the form of the index table, communication object table, and attribute table is not limited to the one shown in Table 1.

The data tables collected by the data analysis layer can be stored in an in-memory database to improve write speed. When a task request is received from the front-end platform, these data are modeled as a graph. Each user is a point in the graph, and the user information data is modeled as attributes of the point. Each communication between users is an edge of the graph, and the communication object attributes are modeled as attributes of the edge, forming a logical network graph structure. This structure is then converted into the CSR structure of the graph. Referring to Fig. 3, the left part of the figure is a schematic diagram of the storage structure of an embodiment of the invention; the network in the figure has four points, v₁, v₂, v₃, and v₄, and the edges between them. The degree of each point (the number of edges connected to the point) is recorded in point Id order to form a degree array, and the neighbors of each point, themselves sorted by Id, are stored in point Id order to form the adjacency array Adj. The prefix sum of the degree array then gives the array Rptr, which holds the starting position in Adj of the neighbor set of each point. Rptr and Adj together constitute the CSR structure of the graph.

After structuring is complete, the CSR structure is transmitted to the computing module. For the subsequent similarity calculation and for distributing work among GPU threads, the computing module must additionally generate the Degree-oriented Edge List, an edge table that records every edge of the graph exactly once, and the index Eid, which maps Adj to the edge table. Each edge of the graph, that is, the pair formed by its two endpoints, is stored in the edge table in order. Referring to Fig. 3, the Edge List in the figure shows the Degree-oriented Edge List of the graph of this embodiment.

Referring to the right part of Fig. 3, the Degree-oriented Edge List and the index Eid are generated as follows. First, for each point u, let N₊(u) be the set of neighbors of u whose degree is greater than that of u, or whose degree is equal and whose Id is greater than that of u; auxiliary arrays upptr and his are created to record, respectively, the position of each neighbor v of u within N₊(u) (when v belongs to N₊(u)) and the size of N₊(u). Second, the exclusive prefix sum over his is computed to obtain the write position (denoted elptr) in the Degree-oriented Edge List of the edges that have u as their source point. Third, each element of the Adj array is traversed in a two-level loop, and the position of v∈N(u) is denoted O_uv. When the point being processed satisfies v∈N₊(u), the mapping eid(O_uv) is created by adding elptr(u) to the relative offset upptr(u) of O_uv from the starting position, and e(u,v) is assigned to the edge table. Otherwise, a binary search on N(v) is called to locate O_vu, and the mapping eid(O_uv) is created from the edge that has v as its source point.

The generation of the auxiliary arrays upptr and his from the CSR structure can be accelerated with CPU multithreading, and the construction of the Degree-oriented Edge List and Eid from elptr can also be parallelized effectively. Parallelization therefore speeds up this process considerably: even for large graph data with hundreds of millions of edges, it completes in only a few seconds.

After the data structures have been constructed, the clustering parameters ε and μ are input, and the parameters and the structures described above are transferred from host memory to GPU memory. GPU threads initialize the certainty and validity of every point in parallel: the certainty is initialized to 0, and the validity is initialized to the number of neighbors of the point. The certainty and validity are used to decide whether a point is a core point while the similarities of the edges of the graph are being calculated. If, during initialization, the degree of a point is found to be less than μ-1, the point can be determined in advance to be a non-core point.

To avoid redundant calculations, a pruning strategy is applied before the similarity calculation step. It follows from the similarity formula that if |N[u]|<ε²·|N[v]| (assuming, without loss of generality, that |N[u]|<|N[v]|), then (u,v) must be dissimilar, where N[u] denotes the set consisting of point u itself and its neighbors, and |N[u]| denotes the number of elements in this set, which equals the degree of u plus one. The two points can thus be judged dissimilar without counting their common neighbors. Because, empirically, most edges of real-world graphs are dissimilar, this pruning strategy is very effective.
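The pruning rule can be checked with a short derivation. It assumes that the similarity is the usual cosine-style structural similarity, σ(u,v)=|N[u]∩N[v]|/√(|N[u]|·|N[v]|), which is consistent with the quantities named in this description. Since the intersection is no larger than the smaller set N[u]:

$$\sigma(u,v)=\frac{|N[u]\cap N[v]|}{\sqrt{|N[u]|\cdot|N[v]|}}\le\frac{|N[u]|}{\sqrt{|N[u]|\cdot|N[v]|}}=\sqrt{\frac{|N[u]|}{|N[v]|}}<\sqrt{\varepsilon^{2}}=\varepsilon$$

For example, with ε=0.8, a point with |N[u]|=3 and a point with |N[v]|=10 give an upper bound of √0.3≈0.55, so the edge is dissimilar regardless of how many neighbors the two points share.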

Therefore, using the generated degree-oriented non-repeating edge table, each warp of the GPU takes one edge (u,v) from the edge table in order. If the edge satisfies |N[u]|<ε²·|N[v]|, (u,v) can be judged dissimilar directly, and the validity of point u and of point v is each reduced by one; if this causes the validity of either point to fall below μ-1, that point is classified in advance as a non-core point.

After this initial preparation, similarity calculation and clustering begin. The similarity of (u,v) is calculated as follows:

$$\sigma(u,v)=\frac{|N[u]\cap N[v]|}{\sqrt{|N[u]|\cdot|N[v]|}}$$

where N[u] is the structural neighborhood of point u, that is, the set consisting of the neighbors of u together with u itself. When σ(u,v)≥ε, (u,v) is considered similar.

The similarity calculation proceeds as follows. Using the generated degree-oriented non-repeating edge table, each warp of the GPU takes one edge (u,v) from the edge table in order. It first checks whether points u and v have both already been determined to be core or non-core points. If both have been determined, the similarity calculation for edge (u,v) is skipped and its similarity remains unknown. If at least one of u and v is undetermined, the 32 threads of the warp jointly count the common neighbors of u and v; this count plus two (for u and v themselves, which are adjacent) is the number of shared structural neighbors, |N[u]∩N[v]|. The degrees of u and v are read directly from the CSR structure: the degree of u plus one is |N[u]|, and the degree of v plus one is |N[v]|. The similarity σ(u,v) is then calculated from the formula. If σ(u,v)<ε, the two points are dissimilar, and the validity of u and of v is each reduced by one; if σ(u,v)≥ε, they are similar, and the certainty of u and of v is each increased by one. If the validity of a point thereby falls below μ-1, the point is classified as a non-core point; if the certainty of a point thereby exceeds μ-1, the point is classified as a core point.
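The per-edge procedure above, together with the initialization and pruning steps, can be sketched sequentially in Python as follows. The thresholds are copied literally from the text; the similarity is assumed to be |N[u]∩N[v]|/√(|N[u]|·|N[v]|), and the role constants, the use of Python sets for the intersection, and the function name are assumptions of the sketch. Points still marked UNKNOWN at the end would need a final rule that the text does not spell out.

```python
from math import sqrt

UNKNOWN, CORE, NON_CORE = 0, 1, 2  # assumed role encoding


def similarity_pass(rptr, adj, edges, eps, mu):
    """Return (sim, role) after one pass over the degree-oriented edge table."""
    n = len(rptr) - 1
    deg = [rptr[u + 1] - rptr[u] for u in range(n)]
    certainty = [0] * n   # similar neighbours confirmed so far
    validity = deg[:]     # neighbours that may still turn out to be similar
    role = [NON_CORE if validity[u] < mu - 1 else UNKNOWN for u in range(n)]
    sim = [None] * len(edges)  # None means the similarity is unknown

    for e, (u, v) in enumerate(edges):
        if role[u] != UNKNOWN and role[v] != UNKNOWN:
            continue  # both roles already determined: skip the edge
        small, large = sorted((deg[u] + 1, deg[v] + 1))
        if small < eps * eps * large:
            sim[e] = False  # pruned without counting common neighbours
        else:
            nu = set(adj[rptr[u]:rptr[u + 1]])
            nv = set(adj[rptr[v]:rptr[v + 1]])
            shared = len(nu & nv) + 2  # common neighbours plus u and v themselves
            sim[e] = shared / sqrt(small * large) >= eps
        for p in (u, v):
            if sim[e]:
                certainty[p] += 1
                if role[p] == UNKNOWN and certainty[p] > mu - 1:
                    role[p] = CORE
            else:
                validity[p] -= 1
                if role[p] == UNKNOWN and validity[p] < mu - 1:
                    role[p] = NON_CORE
    return sim, role
```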

The computation mainly uses a binary search strategy under GPU parallelism; see Figure 3, which illustrates this step on a sample graph. Figure 3 shows how work is assigned to the warps and how the common neighbors of (v₁, v₀) are found by binary search. Parts a and b of the figure show that the edges (v₁, v₀), (v₂, v₀) and (v₂, v₁), which occupy adjacent positions in the edge table, are processed simultaneously by the consecutive warps Wᵢ, Wᵢ₊₁ and Wᵢ₊₂. Take the edge (v₁, v₀), processed by Wᵢ, as an example. To compute its similarity, the adjacent points of v₁ and v₀ are first obtained from the CSR structure. Then, because v₁ has the smaller degree, threads t₀, t₁, t₂ and t₃ of the warp are made responsible for matching the neighbor points v₀, v₂, v₄ and v₇ of v₁, respectively. Parts c and d of Figure 3 show these steps.

During this computation, the threads look up their assigned points in the neighbor array of v₀ by binary search, as shown in part e of the figure. Following the binary search tree, in the first iteration all four threads compare their target points with v₄; thread 2 hits and finishes its task. By the rule of binary search, threads 0 and 1, whose point ids are smaller than v₄, continue in the left branch of the tree, while thread 3, whose point id is larger than v₄, continues in the right branch. The following iterations proceed in the same way, and in the end three common neighbors are matched, in agreement with what the sample graph shows. The number of common neighbors found is used to compute the similarity. If the result is dissimilar, the similarity can be further refined using the point attributes and edge attributes derived from the communication data. Specifically: taking the address attribute of the points into account, if the two points are in the same region, the similarity is increased by a certain weight; taking the communication-time attribute of the edge into account, if the communication time represented by the edge is long enough, the similarity of the two endpoints of the edge is increased by a certain weight. Taking the attributes into account may raise the similarity of an edge above the threshold ε so that it becomes a similar edge, which helps the method reflect the actual situation. Once the similarity is determined, the validity and certainty of the two points are updated according to the result, and it is then decided whether they are core points.
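The binary-search matching described above can be illustrated by the following sequential C++ sketch, in which each loop iteration plays the role of one thread of the warp: the shorter sorted neighbor list supplies the targets and the longer sorted list is searched. The function name `count_common_neighbors` is hypothetical, and the sketch assumes that both neighbor lists are sorted by point Id, as in the CSR structure.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Count the points that appear in both sorted neighbor lists.
// On the GPU each target would be handled by a separate thread of the warp;
// here the loop body stands in for one such thread.
std::size_t count_common_neighbors(const std::vector<int>& adj_a,
                                   const std::vector<int>& adj_b) {
    assert(std::is_sorted(adj_a.begin(), adj_a.end()));
    assert(std::is_sorted(adj_b.begin(), adj_b.end()));
    // Probe with the shorter list, search in the longer one.
    const bool a_is_shorter = adj_a.size() <= adj_b.size();
    const std::vector<int>& targets = a_is_shorter ? adj_a : adj_b;
    const std::vector<int>& table = a_is_shorter ? adj_b : adj_a;
    std::size_t hits = 0;
    for (int target : targets) {
        if (std::binary_search(table.begin(), table.end(), target)) {
            ++hits;
        }
    }
    return hits;
}
```

As a usage example with hypothetical data chosen to match the narrative of Figure 3, taking the neighbors of v₁ as {0, 2, 4, 7} and the neighbors of v₀ as {1, 2, 3, 4, 5, 6, 7}, the first comparison of every search is against the middle element 4, and the function returns 3.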

The clustering process runs after all core points have been determined. It groups the core points together and then extends these groups to form the final clusters. The first step places core points that are joined by an edge and are similar into the same cluster. The second step adds the non-core similar neighbors of the core points to the clusters formed by the core points.

In detail, each core point is initialized as a single-element tree set with the point itself as the root of the tree; the Id of a set is the Id of its root. The core points are clustered first. In the first step, each warp of the GPU processes one core point u in parallel, and the threads of the warp traverse the neighbors of u in parallel. If a neighbor v of u is also a core point, the position of edge (u, v) in the edge table is found through the edge table index generated earlier, and the similarity recorded for (u, v) is read from that position. If the similarity of (u, v) is unknown, the edge is skipped. If the edge is similar, the Find operation of the union-find (disjoint-set) structure walks upward to the roots Rᵤ and Rᵥ of the sets containing u and v; if Rᵤ and Rᵥ differ, that is, u and v are not in the same set, the two sets are merged with the Union operation. The second step is similar to the first: the similarity between every core point u and each of its neighboring core points v is again checked in parallel; if the similarity is unknown, it is computed with the similarity calculation method described above, and if the edge is similar, the two sets containing u and v are merged as in the first step. The non-core points are clustered next. Each warp of the GPU checks one core point u in parallel, and each thread of the warp looks for a non-core neighbor v of u. The similarity of (u, v) is again looked up through the edge table index and the edge table, and if (u, v) is similar, the non-core point is added to the cluster, that is, the cluster Id of v is set to the Id of the cluster containing the core point u. In this way all structural clusters grown from the core points of the graph can be found.
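The two clustering phases can be sketched sequentially as below. This is a simplified host-side illustration, not the GPU procedure itself: it assumes the list of similar edges and the core flags have already been computed, it links the larger root Id to the smaller one (an arbitrary choice made for the sketch), and all names (`DisjointSet`, `cluster_points`) are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Minimal union-find: every point starts as the root of its own set.
struct DisjointSet {
    std::vector<int> parent;
    explicit DisjointSet(std::size_t n) : parent(n) {
        std::iota(parent.begin(), parent.end(), 0);
    }
    int find(int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];  // path halving
            x = parent[x];
        }
        return x;
    }
    void unite(int a, int b) {
        const int ra = find(a), rb = find(b);
        if (ra != rb) parent[std::max(ra, rb)] = std::min(ra, rb);
    }
};

// Returns the cluster Id of every point, or -1 for points outside all clusters.
std::vector<int> cluster_points(std::size_t n, const std::vector<bool>& is_core,
                                const std::vector<std::pair<int, int>>& similar_edges) {
    assert(is_core.size() == n);
    DisjointSet sets(n);
    // Phase 1: merge core points joined by a similar edge.
    for (const auto& e : similar_edges) {
        if (is_core[e.first] && is_core[e.second]) sets.unite(e.first, e.second);
    }
    std::vector<int> cluster_id(n, -1);
    for (std::size_t u = 0; u < n; ++u) {
        if (is_core[u]) cluster_id[u] = sets.find(static_cast<int>(u));
    }
    // Phase 2: attach non-core points to the cluster of a similar core neighbor.
    for (const auto& e : similar_edges) {
        if (is_core[e.first] && !is_core[e.second]) cluster_id[e.second] = cluster_id[e.first];
        if (is_core[e.second] && !is_core[e.first]) cluster_id[e.first] = cluster_id[e.second];
    }
    return cluster_id;
}
```

For instance, with core points 0, 1, 2 and 4, non-core points 3 and 5, and similar edges (0,1), (1,2), (2,3), the points 0 to 3 end up in one cluster, point 4 forms its own cluster, and point 5 remains outside every cluster.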

The clustering process is likewise accelerated in parallel on the GPU, using a parallelized union-find structure. While one thread performs a Union on the clusters to which two points belong (that is, points the root node of one cluster at the other cluster), another thread operating on the same node would cause an access conflict and make the operation thread-unsafe. Mutually exclusive access to these critical resources must therefore be guaranteed. The invention uses the CUDA atomic operation atomicCAS to ensure mutually exclusive access when multiple threads operate on the data of the same node.
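The idea of protecting the Union step with a compare-and-swap can be illustrated on the CPU, using `std::atomic<int>::compare_exchange_strong` as a stand-in for CUDA's atomicCAS. This is a sketch under stated assumptions, not the kernel of the invention: the class name `ConcurrentDisjointSet` is hypothetical, the larger root Id is always linked to the smaller one so that no cycle can form, and path compression is omitted for brevity.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

class ConcurrentDisjointSet {
public:
    explicit ConcurrentDisjointSet(std::size_t n) : parent_(n) {
        for (std::size_t i = 0; i < n; ++i) parent_[i].store(static_cast<int>(i));
    }

    // Walk upward until a node that is its own parent (a root) is reached.
    int find(int x) const {
        assert(x >= 0 && static_cast<std::size_t>(x) < parent_.size());
        int p = parent_[x].load();
        while (p != x) {
            x = p;
            p = parent_[x].load();
        }
        return x;
    }

    // Merge the sets of a and b. The link is written only if the chosen
    // root is still a root at the moment of the compare-and-swap;
    // otherwise another thread changed it and the operation is retried.
    void unite(int a, int b) {
        for (;;) {
            int ra = find(a);
            int rb = find(b);
            if (ra == rb) return;
            if (ra < rb) std::swap(ra, rb);  // ra is now the larger root Id
            int expected = ra;
            if (parent_[ra].compare_exchange_strong(expected, rb)) return;
        }
    }

private:
    std::vector<std::atomic<int>> parent_;
};
```

Because a failed compare-and-swap simply triggers a fresh Find and a retry, two threads that try to re-parent the same root cannot both succeed, which is the mutual exclusion property the paragraph above requires.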

After obtaining the structural clusters of the graph as the discovered communities, the method of the invention also classifies the points that do not belong to any cluster, in order to extract further structural information from the graph. Points that are connected, directly or indirectly, to only one cluster are labeled outliers; their characteristic is that all of their neighbors either belong to the same cluster or are themselves outliers. The remaining points, which connect different clusters, are labeled hub points. Specifically, each warp of the GPU classifies one point outside the clusters, and such points default to outlier. The warp first checks whether the point is connected to a point inside a cluster; if so, the 32 threads of the warp simultaneously look for a neighbor that is not in the same cluster. If one is found, the point is classified as a hub point; otherwise its default label is left unchanged.
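A sequential sketch of this classification over the CSR arrays Rptr and Adj is given below. It interprets "a neighbor that is not in the same cluster" as a neighbor belonging to a different cluster than the first clustered neighbor found, which is an assumption of the sketch; the names `Label` and `label_points` and the use of -1 for "no cluster" are likewise hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Label { Member, Hub, Outlier };

// rptr / adj: CSR arrays (rptr has one entry per point plus a final end offset).
// cluster_id[v] is the cluster of v, or -1 if v is outside every cluster.
std::vector<Label> label_points(const std::vector<int>& rptr,
                                const std::vector<int>& adj,
                                const std::vector<int>& cluster_id) {
    assert(rptr.size() == cluster_id.size() + 1);
    std::vector<Label> label(cluster_id.size(), Label::Member);
    for (std::size_t u = 0; u < cluster_id.size(); ++u) {
        if (cluster_id[u] != -1) continue;  // already inside a cluster
        label[u] = Label::Outlier;          // default for unclustered points
        int first_cluster = -1;
        for (int i = rptr[u]; i < rptr[u + 1]; ++i) {
            const int c = cluster_id[adj[i]];
            if (c == -1) continue;          // neighbor is unclustered too
            if (first_cluster == -1) {
                first_cluster = c;          // first clustered neighbor seen
            } else if (c != first_cluster) {
                label[u] = Label::Hub;      // touches two different clusters
                break;
            }
        }
    }
    return label;
}
```

In a small graph with edges 0-1, 1-3, 2-3 and 0-4, where points 0 and 1 form one cluster and point 2 another, point 3 is labeled a hub because it touches both clusters, and point 4 is labeled an outlier because it touches only one.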

The results of the computing module are returned to the data analysis layer, which analyzes them to identify suspected criminal gangs. The point set of a cluster corresponds to a group of users who communicate closely among themselves, and criminal gangs are more likely to appear in such groups. Outliers that are joined by an edge to a cluster are more likely to be fraud victims. Hub points that connect different clusters are the most likely to be superiors in contact with several gangs. To make the analysis more precise, deep learning methods such as multilayer perceptrons or generative adversarial networks can also be introduced. One feasible approach is to take as input the feature information of the users in the group corresponding to a cluster (whether the user is in a cluster, the cluster size, the number of edges to other users, the attributes of those edges, and so on) and to train against identity data that has been verified by the public security authorities. Introducing artificial intelligence can further improve accuracy and assist law enforcement officers in their judgment.

The results of the data analysis layer are passed to the front-end platform for display. Law enforcement officers can use them to quickly locate suspected criminal gangs, narrow the scope of investigation, and effectively combat telecommunications fraud.

The effect of the invention is further illustrated by the following simulation experiments:

Simulation conditions

To verify the effect of the proposed invention, the simulation experiments use eight real-world large-scale graphs that are commonly used to test and evaluate the performance of graph algorithms: soc-LiveJournal, Enwiki-2022, com-orkut, Hollywood-2011, tech-p2p, UK-2002, soc-Twitter and Yahoo songs. The datasets were obtained from the well-known international graph dataset websites SNAPNets, Network Repository and Laboratory for Web Algorithmics. The smallest graph, soc-LiveJournal, has tens of millions of relationships, and the largest, Yahoo songs, has 700 million. The simulation experiments and the related comparative experiments were all run under the Ubuntu 16.04 operating system, with C++11 and CUDA 10.1 as the programming environment and an Nvidia Tesla V100 compute card as the hardware. Each reported figure is the average execution time over three runs.

The evaluation metric used is the time consumed by the cluster search (unit: s). To demonstrate the effectiveness of the invention, several widely used methods were also implemented for comparison:

(1) SCAN: the basic graph structure clustering method. It uses serial CPU computation, its similarity computation uses merge sort, and its clustering uses breadth-first search.

(2) GPUSCAN: this method accelerates SCAN on the GPU. Its similarity computation uses parallel merge sort, and its clustering uses a parallel connected-subgraph generation strategy.

Analysis of simulation results

Table 2 shows the results of the simulation experiments comparing the method of the invention with the other methods on the eight datasets under the commonly used parameters ε = 0.6, μ = 6. The method of the invention achieves the best result on all eight datasets, which reflects its generality. GPUSCAN was the first method to use parallel acceleration, and it shows a clear performance improvement over the plain structural clustering method SCAN. However, the similarity computation used by GPUSCAN has poor parallelism and a large workload, and the method used in its clustering step is also not very efficient when there are many similar relationships. Its performance therefore remains far below that of the method of the invention, whose average performance is more than ten times higher.

Table 3 shows the results of the simulation experiments for the method of the invention and the other methods under the parameters ε = 0.6, μ = 16. When the parameter μ changes, the performance of the invention and of the other methods remains essentially unchanged, and the method of the invention still maintains a lead of more than ten times.

Table 4 shows the results of the simulation experiments for the method of the invention and the other methods under the parameters ε = 0.2, μ = 6. When the parameter ε is small, the number of similar relationships in the graph increases. The parallel connected-subgraph generation strategy used by GPUSCAN then needs many more iterations, so its performance drops. By comparison, the speed of the invention remains essentially unchanged, which reflects the robustness of the method.

To further illustrate the effect of the method, the result of graph structure clustering on a communication network is visualized in Figure 5. The data are the communication records of ten thousand people in a certain region within one day. In this dataset, each vertex represents a telephone user, and an edge from u to v represents communication between the user represented by u and the user represented by v. The dataset is clustered with the method of the invention. The parameter values and the results are shown in the figure. The regions enclosed by gray lines are all the clusters found by the structural clustering algorithm (the clusters with Ids C₁ and C₂ are labeled as examples); users in the same cluster are drawn in the same color, and users outside the communities are drawn in light gray. Three of the clusters are large and tightly cohesive and are suspected gangs. The other two, although they are clusters, have only 2 and 5 members respectively; being small, they are less likely to be gangs. Outliers outside the clusters are labeled Outliers; taking u and v as examples, they have communicated with users inside a cluster and are therefore quite likely to be fraud targets. The users marked in dark gray are classified as hubs because they communicate with different suspected gangs; they are labeled Hubs and are suspected superiors of criminal gangs.

Table 2

Table 3

Table 4

The above is only a preferred embodiment of the invention, but the scope of protection of the invention is not limited to it. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the invention shall fall within the scope of protection of the invention.

It should be understood that, in order to streamline the disclosure and to help those skilled in the art understand its various aspects, in the above description of exemplary embodiments the various features of the invention are sometimes described in a single embodiment or with reference to a single figure. However, the invention should not be interpreted as meaning that all features included in the exemplary embodiments are essential technical features of the claims of this patent.

It should be understood that the modules, units, components and the like included in the device of one embodiment of the invention can be adaptively changed so as to be arranged in a device different from that of the embodiment. Different modules, units or components included in the device of the embodiment can be combined into one module, unit or component, or divided into multiple sub-modules, sub-units or sub-components.

Claims

1. A communication network anti-fraud method based on graph structure clustering, characterized by comprising the following steps:

The communication operator platform periodically collects all communication network users and the call records of all users within a set time period;

Build a network structure from the users and their call relationships, convert the network structure into a graph model and store it in a CSR structure;

Based on the CSR structure of the graph data, generate a non-repeating degree-oriented edge table and an edge table index;

Transmit the data of the graph model to the computing system, which executes the structural clustering method on the graph and outputs the structural clustering result;

Analyze the structural clustering result, display the analysis result, and list suspicious objects.

2. The communication network anti-fraud method based on graph structure clustering according to claim 1, characterized in that, the specific method for converting the network structure into a graph model is:

The user is a point in the graph, and the user information data is modeled as an attribute of the point;

The communication between users is regarded as the edge of the graph, and the communication data is modeled as the attribute of the edge.

3. The communication network anti-fraud method based on graph structure clustering according to claim 1, wherein the specific method for storing the graph model with a CSR structure is:

Record the degrees of the points in order of point Id in the graph to form a degree array; store the neighbor points of each point, sorted by Id, in order of point Id to form an adjacency array Adj; take the prefix sum of the degree array to form an array Rptr that gives the starting position of each point's adjacency set in the adjacency array Adj; Rptr and Adj constitute the CSR structure of the graph.

4. The communication network anti-fraud method based on graph structure clustering according to claim 1, characterized in that the specific method for generating the non-repeating degree-oriented edge table and the edge table index based on the CSR structure of the graph data is:

For each point u, let N₊(u) be the set of neighbors of u whose degree is greater than that of u, or whose degree is equal to that of u and whose Id is greater than that of u; create auxiliary arrays upptr and his to record, for each point u, from which element of its neighbor list the neighbors v belong to N₊(u), and the size of N₊(u);

Compute the exclusive prefix sum of his to obtain the write position, in the degree-oriented edge table (Degree-oriented EdgeList), of the edges that have u as their source point;

Traverse every element of the Adj array in a two-level loop, denoting the position of v∈N(u) as O_uv. When processing a point v∈N₊(u), create the mapping eid(O_uv) by adding to elptr(u) the offset of O_uv relative to the starting position upptr(u), and write e(u,v) to the edge table; otherwise, call binary search on N(v) to locate O_vu, and create the mapping eid(O_uv) from the edge that has v as its source point.

5. The communication network anti-fraud method based on graph structure clustering according to claim 1, wherein the computing system executes the graph structure clustering method, and the specific process of outputting the structure clustering results is:

Input the clustering parameters, initialize the cluster Id to which each point belongs and the degree of certainty and validity of all points;

Use the pruning strategy and input parameters to determine the similarity of some edges and the clustering role of some objects to eliminate redundant calculations;

Calculate the similarity of each edge, and determine whether it is a core point through the similarity of the edges related to each point;

Preliminarily cluster the core points using a union-find (disjoint-set) structure, and form the final clusters by expanding outward from the preliminary clusters;

Classify the points outside the clusters as hub points or outliers according to whether they are connected to different clusters;

Get the clustering status of all points in the entire graph, and return the user group division status corresponding to these clustering statuses to the platform.

6. The communication network anti-fraud method based on graph structure clustering according to claim 5, wherein the certainty of each point is initialized in parallel by GPU multithreading to 0, and the validity of each point is initialized in parallel by GPU multithreading to the number of neighbor points of that point; if the validity is less than μ-1, the point is classified as a non-core point in advance.

7. The communication network anti-fraud method based on graph structure clustering according to claim 5, wherein the specific process of using the pruning strategy and the input parameters to determine the similarity of some edges and the clustering roles of some objects, so as to eliminate redundant calculation, is:

According to the non-repeating degree-oriented edge table that has been generated, each warp of the GPU selects an edge (u, v) from the edge table in order to process. If the edge satisfies |N[u]|<ε²·|N[v]|, it can be determined in advance that (u, v) are not similar, and the validity of point u and of point v is reduced by one; if the validity of either point thereby becomes less than μ-1, that point is classified as a non-core point. Here N[u] denotes the set consisting of point u itself and its neighbors, and |N[u]| denotes the number of elements in this set, which equals the degree of u plus one.

8. The communication network anti-fraud method based on graph structure clustering according to claim 5, wherein the specific method for calculating the similarity of each edge, and for determining whether each point is a core point from the similarity of the edges associated with it, is:

According to the non-repeating degree-oriented edge table that has been generated, each warp of the GPU selects an edge (u, v) from the edge table in order to process. First check whether point u and point v have both been determined to be core points or non-core points; if both have been determined, the similarity calculation for edge (u, v) is skipped and its similarity remains unknown; if at least one of u and v is not determined, the 32 threads of the warp work together to count the common neighbors of the two points u and v of the selected edge (u, v). The number of common neighbors of u and v plus two is the number of common structural neighbors of u and v, expressed as |N[u]∩N[v]|. Obtain the degrees of u and v directly from the CSR structure; the degree of point u plus one is |N[u]|, and the degree of point v plus one is |N[v]|;

Calculate the similarity of each edge (u, v) according to the following formula:

$$\sigma(u,v) = \frac{|N[u] \cap N[v]|}{\sqrt{|N[u]| \cdot |N[v]|}}$$

Update the similarity according to the attribute matching of the two points u and v, specifically: taking into account the address attribute of the points, if the two points have the same address attribute, the similarity is increased by a certain weight; taking into account the communication-time attribute of the edge (u, v), if the communication time represented by the edge is greater than a set value, the similarity between u and v is increased by a certain weight;

If σ(u,v)<ε, the points are dissimilar, and the validity of u and of v is reduced by one; if σ(u,v)≥ε, the points are similar, and the certainty of u and of v is increased by one. If the validity of a point thereby becomes less than μ-1, the point is classified as a non-core point; if the certainty of a point thereby becomes greater than μ-1, the point is classified as a core point.

9. The communication network anti-fraud method based on graph structure clustering according to claim 5, characterized in that the specific method for preliminarily clustering the core points using a union-find structure, and forming the final clusters by expanding outward from the preliminary clusters, is:

Each core point is initialized as a single-element tree set, with the point itself as the root of the tree; the Id of the set is the Id of the root;

First, cluster the core points. In the first step, each warp of the GPU processes one core point u in parallel, and the threads of the warp traverse the neighbors of u in parallel. If a neighbor v of point u is also a core point, the position of edge (u, v) in the edge table is found through the generated edge table index, and the similarity recorded for edge (u, v) is read from that position. If the similarity is unknown, the edge is skipped; if the edge is similar, the Find operation of the union-find structure searches upward for the root nodes R_u and R_v of the sets containing point u and point v. If R_u and R_v are different, that is, the points are not in the same set, the Union operation of the union-find structure merges the two sets. The second step is similar to the first: the similarity between every core point u and each of its neighboring core points v is again checked in parallel; if the similarity is unknown, it is calculated with the similarity calculation method described in claim 8, and if the edge is similar, the two sets containing u and v are merged by the method of the first step;

Then, cluster the non-core points. Each warp of the GPU checks one core point u in parallel, and each thread of the warp looks for a non-core neighbor v of u. The similarity of (u, v) is again looked up through the edge table index and the edge table; if the edge is similar, the non-core point is added to the cluster, that is, the cluster Id of v is set to the cluster Id of the core point u; if the similarity is unknown, it is calculated first, and if the edge is similar, the non-core point is added to the cluster.