CN116304252A - Communication network fraud prevention method based on graph structure clustering - Google Patents
- Publication number
- CN116304252A (application CN202310006675.6A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W12/00—Security arrangements; Authentication; Protecting privacy or anonymity
- H04W12/12—Detection or prevention of fraud
- H04W12/121—Wireless intrusion detection systems [WIDS]; Wireless intrusion prevention systems [WIPS]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/901—Indexing; Data structures therefor; Storage structures
- G06F16/9024—Graphs; Linked lists
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/906—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/40—Business processes related to the transportation industry
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M3/00—Automatic or semi-automatic exchanges
- H04M3/22—Arrangements for supervision, monitoring or testing
- H04M3/2281—Call monitoring, e.g. for law enforcement purposes; Call tracing; Detection or prevention of malicious calls
Abstract
The invention discloses a communication network fraud prevention method and system based on graph structure clustering. A network structure is constructed from users and their call relationships, converted into a graph model, and stored in a CSR structure; a duplicate-free, degree-oriented edge list and an edge list index are generated from the CSR structure of the graph data; the data of the graph model is transmitted to a computing system, which executes a structure clustering method on the graph and outputs a structure clustering result. The invention can help law enforcement agencies quickly locate, within a huge communication network, the areas where criminals are likely to be active, detect possible fraud groups and their targets, narrow the scope of observation, and improve law enforcement efficiency.
Description
Technical Field
The invention belongs to the technical field of big data and data mining, and particularly relates to a communication network fraud prevention method based on graph structure clustering.
Background
In the era of big data, the graph formed by a communication network is huge: the graph built from the communication data of a city with millions of inhabitants can have hundreds of millions of edges. Because fraud is time-sensitive, mining important hierarchical structure information within a short time has long been both a focus and a difficulty of technical development. Recent studies have proposed several classical community search methods defined by subgraph cohesion metrics, such as k-core, k-truss, k-clique and k-ECC. These methods can find cohesive and connected subgraphs as communities. However, their conditions are either too loose, so that the communities found have poor cohesion (k-core), or too strict, so that the communities defined this way are very small and too time-consuming to search (k-truss and k-clique). These methods are therefore not well suited to cluster search on most communication network graphs. Graph structure clustering, a method for finding cohesive subgraphs in a network, addresses this problem well: it delivers high performance while ensuring a good clustering effect, and can therefore be better applied to large-scale communication networks.
Traditional graph structure clustering usually adopts methods designed for serial execution on a CPU. Such methods have a serious problem: they perform well on small and medium-sized graphs with no more than about a million relationships, but poorly on graph data that is large in scale and updated at high frequency. For example, on large social network graphs with billions of relationships, such as the Twitter network and the microblog network, computing the clusters takes several hours, so the traditional methods are not suitable for practical anti-fraud applications in the current era of big data.
Disclosure of Invention
In order to overcome the technical defects of the prior art, the invention provides a communication network fraud prevention method based on graph structure clustering.
The technical scheme for realizing the purpose of the invention is as follows: a communication network fraud prevention method based on graph structure clustering, comprising the following steps:
the communication operator platform periodically collects call records of all the users communicated with the communication network in a set time period;
constructing a network structure according to the user and the call relationship, converting the network structure into a graph model and storing the graph model in a CSR structure;
generating, from the CSR structure of the graph data, a duplicate-free degree-oriented edge list and an edge list index;
transmitting the data of the graph model to a computing system, executing a structure clustering method of the graph by the computing system, and outputting a structure clustering result;
analyzing the structure clustering result, displaying the analysis result, and listing suspicious objects.
Preferably, the specific method for converting the network structure into the graph model is as follows:
the user is used as a point in the graph, and the user information data is modeled as the attribute of the point;
communication between users is used as the edge of the graph, and communication data is modeled as the attribute of the edge.
Preferably, the specific method for storing the graph model in the CSR structure is as follows:
the degree of each point (the number of edges connected to the point) is recorded in the order of the point Ids in the graph to form a degree array, and the neighbor points of each point are likewise stored in the order of the point Ids to form an adjacency array Adj. The prefix sum of the degree array gives an array Rpt that records the start position, in the adjacency array Adj, of the neighbor set of each point; Rpt and Adj constitute the CSR structure of the graph.
Preferably, the specific method for generating the duplicate-free degree-oriented edge list and the edge list index from the CSR structure of the graph data is as follows:
for each point u, the set of neighbor points whose degree is greater than that of u, or whose degree is equal and whose Id is greater than that of u, is denoted $N^+(u)$; auxiliary arrays upptr and his are created to record, respectively, which element of $N^+(u)$ each neighbor point v of u is, and the size of $N^+(u)$;
an exclusive prefix sum is calculated on his to obtain the write position, denoted elptr, in the Degree-oriented Edge List of the edges that have u as their source point;
each element of the Adj array is traversed in a two-level loop, and the position of $v \in N(u)$ is recorded as $O_{uv}$; when the processed point satisfies $v \in N^+(u)$, the mapping $eid(O_{uv})$ is created by adding the start position elptr(u) to the relative offset that upptr records for $O_{uv}$, and e(u, v) is written to the edge list; otherwise, a binary search on N(v) is invoked to locate $O_{vu}$, and the mapping $eid(O_{uv})$ is created from the edge that has v as its source point.
Preferably, the computing system executes a structure clustering method of the graph, and the specific process of outputting the structure clustering result is as follows:
inputting clustering parameters, and initializing clustering Id of each point, and certainty and validity of all points;
using pruning strategies and the input parameters to determine in advance the similarity of some edges and the clustering roles of some points, so as to eliminate redundant calculation;
calculating the similarity of each edge, and determining whether each point is a core point according to the similarity of the edges incident to it;
using a union-find (disjoint-set) structure to preliminarily cluster the core points, and expanding outwards from the preliminary clusters to form the final clusters;
classifying each point that belongs to no cluster as a hub (pivot point) or an outlier according to whether it is connected to different clusters;
and obtaining the clustering conditions of all points of the whole graph, and returning the user group division conditions corresponding to the clustering conditions to the platform.
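The last classification step in the list above can be sketched as follows. This is a minimal sequential Python illustration, not the GPU implementation of the invention; the function name `classify_unclustered`, the use of -1 as the cluster Id of an unclustered point, and 0-based point Ids are assumptions made for the example.

```python
def classify_unclustered(rpt, adj, cluster_id):
    """Label every unclustered point as a hub or an outlier.

    rpt, adj   : CSR arrays of the graph (rpt has one entry per point plus one).
    cluster_id : cluster Id per point, with -1 (assumed marker) for "no cluster".
    """
    roles = {}
    for u in range(len(rpt) - 1):
        if cluster_id[u] != -1:
            continue  # already belongs to a cluster
        # Distinct clusters that u is connected to through its neighbors.
        touched = {cluster_id[v] for v in adj[rpt[u]:rpt[u + 1]]
                   if cluster_id[v] != -1}
        # Connected to different clusters -> hub; otherwise -> outlier.
        roles[u] = "hub" if len(touched) >= 2 else "outlier"
    return roles
```

In a fraud-prevention reading, a hub is a user who bridges two otherwise separate call groups, which is why the two roles are reported separately.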
Preferably, GPU threads initialize the certainty of each point in parallel, the certainty being initialized to 0; GPU threads likewise initialize the validity of each point in parallel, the validity being the number of neighbor points of the point; if the validity of a point is smaller than μ-1, the point is classified as a non-core point in advance.
Preferably, the specific process of using the pruning strategy and the input parameters to determine the similarity of some edges and the clustering roles of some points, so as to eliminate redundant calculation, is as follows:
based on the generated duplicate-free degree-oriented edge list, each warp of the GPU takes edges (u, v) from the edge list in order; if an edge satisfies $|N[u]| < \varepsilon^2 \cdot |N[v]|$, then (u, v) can be determined in advance to be dissimilar, and the validity of point u and of point v is decreased by one; if the validity of either point thereby falls below μ-1, that point is classified as a non-core point in advance. Here N[u] denotes the set consisting of point u itself and its neighbor points, and |N[u]| is the number of elements of this set, which equals the degree of u plus one.
Preferably, the specific method for calculating the similarity of each edge and determining whether each point is a core point according to the similarity of its incident edges is as follows:
the similarity of each edge (u, v) is calculated according to the following formula:
$$\sigma(u, v) = \frac{|N[u] \cap N[v]|}{\sqrt{|N[u]| \cdot |N[v]|}}$$
the similarity is then updated according to how the attributes of the two points u and v match, specifically: for the address attribute of the points, if the address attributes of the two points are the same, the similarity is increased by a certain weight; for the communication time attribute of the edge (u, v), if the communication time represented by the edge is greater than a set value, the similarity of u and v is increased by a certain weight;
if $\sigma(u, v) < \varepsilon$, the two points are dissimilar and the validity of u and of v is decreased by one; if $\sigma(u, v) \ge \varepsilon$, the two points are similar and the certainty of u and of v is increased by one; a point is classified as a non-core point if its validity thereby becomes less than μ-1, and as a core point if its certainty thereby becomes greater than μ-1.
Preferably, the specific method for preliminarily clustering the core points with the union-find structure and expanding the preliminary clusters outwards to form the final clusters is as follows:
each core point is initialized as a single-element tree-shaped set, with the point as the root of the tree and the root Id as the Id of the set;
the core points are clustered first, in two steps. In the first step, each warp of the GPU processes one core point u in parallel, and the threads in the warp traverse the neighbors of u in parallel; if a neighbor v of u is also a core point, the position of the edge (u, v) in the edge list is found through the already generated edge list index, and the similarity recorded for (u, v) is read from that position; if the similarity is still unknown, the edge is skipped; if the edge is recorded as similar, the Find operation of the union-find structure searches upwards for the root nodes $R_u$ and $R_v$ of the sets containing u and v, and if $R_u$ and $R_v$ differ, that is, u and v are not in the same set, the two sets are merged with the Union operation. The second step is similar to the first: the similarity between every core point u and each neighboring core point v is checked in parallel; if it is unknown, it is calculated with the similarity calculation method described above, and if the edge is similar, the two sets containing u and v are merged as in the first step;
the non-core points are then clustered: each warp of the GPU checks one core point u in parallel, each thread in the warp looks up a non-core neighbor v of u and checks the similarity of (u, v) through the edge list index and the edge list; if (u, v) is similar, the non-core point is added to the cluster, that is, the cluster Id of v is set to the cluster Id of the core point u.
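The union-find clustering described above can be illustrated with a short sequential sketch. It is written in Python for readability and makes several assumptions that are not part of the original text: the names `find`, `union` and `cluster`, path halving inside `find`, the smaller root Id winning a merge, -1 as the marker for an unclustered point, and a non-core point joining the first similar core neighbor encountered. The GPU version performs the same Find and Union operations concurrently, one warp per core point.

```python
def find(parent, x):
    """Walk upwards to the root of x's set (with path halving)."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(parent, a, b):
    """Merge the sets containing a and b if their roots differ."""
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[max(ra, rb)] = min(ra, rb)  # assumed rule: smaller Id becomes root

def cluster(num_points, is_core, similar_edges):
    """Return a cluster Id per point (-1 for points left unclustered)."""
    parent = list(range(num_points))  # every point starts as its own tree root
    # Step 1: merge core points that are joined by a similar edge.
    for u, v in similar_edges:
        if is_core[u] and is_core[v]:
            union(parent, u, v)
    cluster_id = [find(parent, u) if is_core[u] else -1
                  for u in range(num_points)]
    # Step 2: attach each non-core point to a similar core neighbor's cluster.
    for u, v in similar_edges:
        for core, other in ((u, v), (v, u)):
            if is_core[core] and not is_core[other] and cluster_id[other] == -1:
                cluster_id[other] = cluster_id[core]
    return cluster_id
```

Using the root Id as the cluster Id mirrors the statement that the Id of a set is the Id of its tree root.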
Compared with the prior art, the invention has the following notable advantages. First, it can help law enforcement authorities quickly locate, within a huge communication network, the areas where criminals are likely to be active, detect possible fraud groups and their targets, narrow the scope of observation, and improve law enforcement efficiency. Second, the structure clustering of the graph is accelerated with CUDA, the highly concurrent computing framework of Nvidia GPUs, so that structure clustering of large graph data with tens of millions of edges can be completed within hundreds of milliseconds; the invention also provides a similarity calculation strategy and a concurrent union-find strategy that are currently the best available on the GPU, which greatly increases the computing speed and gives the method low latency, fast response and good robustness.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.
Drawings
The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention, like reference numerals being used to refer to like parts throughout the several views.
FIG. 1 is a diagram of a system architecture according to an embodiment of the present invention.
Fig. 2 is a flow chart of an embodiment of the present invention.
FIG. 3 is a data structure diagram of the graph structure clustering method of the present invention.
FIG. 4 is a graph of GPU parallel computations for the graph structure clustering method of the present invention.
Fig. 5 is a diagram of the results of the implementation of the present invention in a communication network.
Detailed Description
It is easy to understand that, in light of the present teachings, those of ordinary skill in the art can envision various embodiments of the present invention without departing from its true spirit. Accordingly, the following detailed description and drawings are merely illustrative of the invention and are not intended to be exhaustive or to limit or restrict the invention. Rather, these embodiments are provided so that this disclosure will be thorough and complete for those skilled in the art. Preferred embodiments of the present invention are described in detail below with reference to the attached drawings, which form a part of the present application and are used together with the embodiments to illustrate the innovative concepts of the present invention.
In order to make the objects and technical solutions of the present invention more clear, the technical solutions of the present invention will be described in detail below by referring to the accompanying drawings and examples.
The system architecture of the present invention is shown in FIG. 1. It is a distributed architecture comprising a communication network, a front-end platform, a data analysis layer, and a computing module. The communication network may be a telephone network or a mobile internet network operated and maintained by any of the major communication operators. The front-end platform is a server on which the front-end interface is deployed; all operations, including setting the search scope, setting timed tasks, entering parameters, and presenting results, are performed on this platform. The data analysis layer is a back-end server cluster that receives the parameters passed from the front-end platform and collects the communication network data; it is responsible for structuring the collected communication data into graph data and transmitting it to the computing module for processing, and for receiving and analyzing the result of the computing module and returning it to the front-end platform for display. The computing module is a server cluster equipped with graphics processors; the computation it performs is the structure clustering method based on graph data, whose efficiency is greatly improved by the powerful parallel computing capability of the graphics processors.
To implement the functions of the system, a unique id (such as a telephone number) must be determined for each user in the communication network, so that the characteristic information of each user can be obtained during use. The communication network must also be divided by area, that is, the communication ids used by the users within the target area must be identified. An index table of users is maintained in the data analysis layer; it records the users identified during the current time period in the area for which the server is responsible. Each user in the index table has a corresponding communication object table, which stores the people who have communicated with that user. Each time a communication occurs, the communication network acquires the information of both parties and transmits it to the data analysis layer, which checks whether the two users exist in the index table and adds any user that does not. It then checks whether each user's communication object table already records the other party, and adds it if not.
When communication information is collected in this way, an attribute mapping table for each user and an attribute mapping table for each user's communication objects can also be generated. If the operator's communication network is able to, attribute information can be obtained together with the user id; for example, the user's physical address, the call duration and the like can be obtained from the base station or the IP address. Once determined, the user information is added item by item to the attribute mapping table of the corresponding user in the index table, and the communication information is added item by item to the attribute mapping table of the corresponding communication object table. These attributes can later be modeled as attributes of the points and edges of the graph, providing more reference for the similarity calculation and improving accuracy.
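The bookkeeping described above can be pictured with a small in-memory sketch. Everything in it is illustrative: the dictionary layout, the function names `record_call` and `set_user_attr`, the sample ids, and the choice to accumulate call duration per communication object are assumptions, since the text does not prescribe a concrete table format.

```python
def _ensure_user(index, user_id):
    """Add the user to the index table if it is not there yet."""
    return index.setdefault(user_id, {"attrs": {}, "peers": {}})

def record_call(index, caller, callee, duration_s):
    """Register one communication between two users.

    index maps user id -> {"attrs": user attributes,
                           "peers": communication object id -> total call seconds}.
    """
    for user in (caller, callee):
        _ensure_user(index, user)
    # Each party is recorded in the other's communication object table.
    for a, b in ((caller, callee), (callee, caller)):
        peers = index[a]["peers"]
        peers[b] = peers.get(b, 0) + duration_s  # assumed: durations accumulate

def set_user_attr(index, user_id, key, value):
    """Store a user attribute such as a physical address."""
    _ensure_user(index, user_id)["attrs"][key] = value

index = {}
record_call(index, "1001", "1002", 60)
record_call(index, "1002", "1001", 30)
set_user_attr(index, "1001", "address", "District A")
```

The user entries later become points with attributes, and the entries of each communication object table become edges with attributes.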
The structures of the index table, the communication object table, and the attribute table may be as shown in table 1:
TABLE 1
In Table 1, the identity of a user is a numeric identifier and the attribute of a user is a physical address. The attribute of a communication object is the call duration. The example user in the table has two communication objects, but in practice the number is not limited, and the index table, the communication object table, and the attribute table are not limited to this one form of table.
The data tables collected by the data analysis layer can be stored in an in-memory database to improve writing speed. Upon receipt of a task request from the front-end platform, the data is modeled as a graph. Each user is modeled as a point in the graph, and the user information data is modeled as attributes of the point. Communication between users is modeled as an edge of the graph, and the communication object attributes are modeled as attributes of the edge, forming a logical network graph structure. This structure is then converted into the CSR structure of the graph. The left part of FIG. 3 is a schematic diagram of the storage structure of an embodiment of the present invention, in which the network has 4 points, $v_1$, $v_2$, $v_3$, $v_4$, and the edges between them. The degree of each point (the number of edges connected to the point) is recorded in the order of the point Ids in the graph to form a degree array, and the neighbor points of each point are likewise stored in the order of the point Ids to form an adjacency array Adj. The prefix sum of the degree array gives an array Rpt that records the start position, in the adjacency array Adj, of the neighbor set of each point; Rpt and Adj constitute the CSR structure of the graph.
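As a concrete illustration of the Rpt and Adj arrays, the following Python sketch builds a CSR structure from an edge list. The function name `build_csr`, 0-based point Ids, the extra trailing entry of Rpt (the total length of Adj), and the sample edges are assumptions for the example; the sample graph is not the one drawn in FIG. 3.

```python
from itertools import accumulate

def build_csr(num_points, edges):
    """Return (rpt, adj) for an undirected graph with point Ids 0..num_points-1."""
    neighbors = [set() for _ in range(num_points)]
    for u, v in edges:
        if u != v:                 # a user calling itself adds no edge
            neighbors[u].add(v)    # sets drop repeated calls between the same pair
            neighbors[v].add(u)
    degree = [len(n) for n in neighbors]
    # Prefix sum of the degree array: rpt[u] is where N(u) starts in adj,
    # and rpt[u + 1] is where it ends.
    rpt = [0] + list(accumulate(degree))
    # Neighbor lists are stored point by point, each sorted by Id.
    adj = [v for u in range(num_points) for v in sorted(neighbors[u])]
    return rpt, adj

rpt, adj = build_csr(4, [(0, 1), (0, 2), (1, 2), (2, 3)])
print(rpt)  # [0, 2, 4, 7, 8]
print(adj)  # [1, 2, 0, 2, 0, 1, 3, 2]
```

Keeping each neighbor list sorted is what later allows a binary search inside `adj[rpt[v]:rpt[v + 1]]`.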
After the structuring is completed, the CSR structure is transmitted to the computing module. For the subsequent similarity calculation and for distributing work among GPU threads, the computing module must additionally generate a Degree-oriented Edge List, an edge list that records every edge of the graph exactly once, and an index Eid that maps positions in Adj to positions in the edge list. Each edge of the graph, that is, the pair formed by its two end points, is stored in the edge list in order. The Edge List in FIG. 3 illustrates the Degree-oriented Edge List of the embodiment of the present invention.
Referring to the right part of FIG. 3, the process of generating the Degree-oriented Edge List and the index Eid is as follows. First, for each point u, the set of neighbor points whose degree is greater than that of u, or whose degree is equal and whose Id is greater than that of u, is denoted $N^+(u)$; auxiliary arrays upptr and his are created to record, respectively, which element of $N^+(u)$ each neighbor point v of u is, and the size of $N^+(u)$. Second, the exclusive prefix sum over his is calculated to obtain the write position (denoted elptr) in the Degree-oriented Edge List of the edges that have u as their source point. Third, each element of the Adj array is traversed in a two-level loop, and the position of $v \in N(u)$ is recorded as $O_{uv}$. When the processed point satisfies $v \in N^+(u)$, the mapping $eid(O_{uv})$ is created by adding the start position elptr(u) to the relative offset that upptr records for $O_{uv}$, and e(u, v) is written to the edge list. Otherwise, a binary search on N(v) is invoked to locate $O_{vu}$, and the mapping $eid(O_{uv})$ is created from the edge that has v as its source point.
The process of generating the auxiliary arrays upptr and his from the CSR structure can be accelerated with CPU multithreading, and the process of constructing the Degree-oriented Edge List and Eid from elptr can also be parallelized effectively. Parallelization accelerates the process greatly: on large graph data with hundreds of millions of edges it takes only a few seconds.
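The three steps for building the Degree-oriented Edge List and Eid can be traced in a sequential sketch. The array names upptr, his, elptr and eid follow the description; the function name `build_edge_list`, storing upptr per Adj position, 0-based Ids, and the length n + 1 of elptr are assumptions, and the loops that the text parallelizes are written here as plain Python loops.

```python
from bisect import bisect_left
from itertools import accumulate

def build_edge_list(rpt, adj):
    """Return (edge_list, eid) for a CSR graph whose neighbor lists are sorted."""
    n = len(rpt) - 1
    deg = [rpt[u + 1] - rpt[u] for u in range(n)]

    def in_n_plus(u, v):
        # v is in N+(u): larger degree, or equal degree and larger Id.
        return (deg[v], v) > (deg[u], u)

    # Step 1: upptr[pos] = rank of adj[pos] inside N+(u); his[u] = |N+(u)|.
    upptr = [-1] * len(adj)
    his = [0] * n
    for u in range(n):
        for pos in range(rpt[u], rpt[u + 1]):
            if in_n_plus(u, adj[pos]):
                upptr[pos] = his[u]
                his[u] += 1

    # Step 2: exclusive prefix sum of his gives the first write position of
    # the edges whose source point is u; elptr[n] is the number of edges.
    elptr = list(accumulate([0] + his))

    # Step 3: fill the edge list and the Adj-position -> edge-position index.
    edge_list = [None] * elptr[n]
    eid = [0] * len(adj)
    for u in range(n):
        for pos in range(rpt[u], rpt[u + 1]):        # pos plays the role of O_uv
            v = adj[pos]
            if in_n_plus(u, v):
                eid[pos] = elptr[u] + upptr[pos]
                edge_list[eid[pos]] = (u, v)
            else:
                # Binary search for u inside N(v) to locate O_vu.
                pos_vu = bisect_left(adj, u, rpt[v], rpt[v + 1])
                eid[pos] = elptr[v] + upptr[pos_vu]
    return edge_list, eid

edges, eid = build_edge_list([0, 2, 4, 7, 8], [1, 2, 0, 2, 0, 1, 3, 2])
print(edges)  # [(0, 1), (0, 2), (1, 2), (3, 2)]
print(eid)    # [0, 1, 0, 2, 1, 2, 3, 3]
```

Both Adj positions of an undirected edge map to the same edge list entry, so a similarity value stored once per edge can be read from either end point.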
After the data structures are constructed, the clustering parameters ε and μ are input, and the parameters and the structures are transferred from the host memory of the device to the GPU video memory. The certainty and validity of each point are initialized with GPU multithreading: the certainty is initialized to 0, and the validity is initialized to the number of neighbor points of the point. While the similarity of the edges of the graph is being computed, the certainty and validity are used to determine whether a point is a core point. If the degree of a point is found to be less than μ-1 during initialization, the point can be determined in advance to be a non-core point.
To avoid redundant computation, a pruning strategy is applied before the similarity calculation step. It can be derived from the algorithm that if $|N[u]| < \varepsilon^2 \cdot |N[v]|$ (without loss of generality, assume $|N[u]| < |N[v]|$), then (u, v) is necessarily dissimilar. Here N[u] denotes the set consisting of point u itself and its neighbor points, and |N[u]| is the number of elements of this set, which equals the degree of u plus one, so (u, v) can be determined directly to be dissimilar without computing the common neighbors of u and v. Since experience shows that most edges of real-world graphs are dissimilar, this pruning strategy works well.
Thus, based on the generated duplicate-free degree-oriented edge list, each warp of the GPU takes edges (u, v) from the edge list in order; if an edge satisfies $|N[u]| < \varepsilon^2 \cdot |N[v]|$, then (u, v) can be determined in advance to be dissimilar, and the validity of point u and of point v is decreased by one; if the validity of either point thereby falls below μ-1, that point is classified as a non-core point in advance.
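The initialization and pruning steps can be combined in a short sequential sketch. It is only an illustration under stated assumptions: the function name `init_and_prune`, the role strings, and the use of None for an unknown similarity are invented for the example, and the per-warp GPU work is replaced by a plain loop over the edge list.

```python
def init_and_prune(rpt, edge_list, eps, mu):
    """Initialize certainty/validity and mark edges that pruning proves dissimilar."""
    n = len(rpt) - 1
    deg = [rpt[u + 1] - rpt[u] for u in range(n)]
    certainty = [0] * n          # similar neighbors confirmed so far
    validity = deg[:]            # neighbors that may still turn out similar
    role = ["non-core" if validity[u] < mu - 1 else "unknown" for u in range(n)]
    similar = [None] * len(edge_list)   # None means "not yet known"

    for e, (u, v) in enumerate(edge_list):
        # |N[x]| is the degree of x plus one; compare the smaller against the larger.
        small, large = sorted((deg[u] + 1, deg[v] + 1))
        if small < eps * eps * large:
            similar[e] = False           # dissimilar without counting common neighbors
            for p in (u, v):
                validity[p] -= 1
                if validity[p] < mu - 1:
                    role[p] = "non-core"
    return certainty, validity, role, similar

state = init_and_prune([0, 2, 4, 7, 8], [(0, 1), (0, 2), (1, 2), (3, 2)], 0.8, 3)
print(state[3])  # [None, None, None, False]
```

In this small example only the edge (3, 2) is pruned, because $|N[3]| = 2$ is below $0.8^2 \cdot |N[2]| = 2.56$.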
After the initial preparation is completed, similarity calculation and clustering begin. The similarity of (u, v) is calculated as follows:
$$\sigma(u, v) = \frac{|N[u] \cap N[v]|}{\sqrt{|N[u]| \cdot |N[v]|}}$$
When $\sigma(u, v) \ge \varepsilon$, u and v are considered similar.
The similarity calculation process is as follows. Based on the generated duplicate-free degree-oriented edge list, each warp of the GPU takes edges (u, v) from the edge list in order and checks whether point u and point v have both already been determined to be core or non-core points. If both have been determined, the similarity calculation for the edge (u, v) is skipped and its similarity remains unknown. If at least one of u and v is undetermined, the 32 threads of the warp count the common neighbors of the two end points u and v of the selected edge; adding two to the number of common neighbors gives the number of shared structural neighbors, that is, $|N[u] \cap N[v]|$. The degrees of u and v are read directly from the CSR structure: the degree of u plus one is $|N[u]|$, and the degree of v plus one is $|N[v]|$. The similarity $\sigma(u, v)$ is then calculated according to the formula. If $\sigma(u, v) < \varepsilon$, the two points are dissimilar and the validity of u and of v is decreased by one; if $\sigma(u, v) \ge \varepsilon$, the two points are similar and the certainty of u and of v is increased by one. A point is classified as a non-core point if its validity thereby becomes less than μ-1, and as a core point if its certainty thereby becomes greater than μ-1.
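The per-edge procedure can be sketched sequentially as follows. The similarity is assumed to be the structural similarity $|N[u] \cap N[v]| / \sqrt{|N[u]| \cdot |N[v]|}$, which matches the quantities named in the paragraph above and the pruning condition; the function names, the role strings, and the set-based intersection (in place of the warp-level binary search) are assumptions for this illustration.

```python
from math import sqrt

UNKNOWN, CORE, NON_CORE = "unknown", "core", "non-core"

def closed_neighbors(rpt, adj, u):
    """N[u]: point u itself plus its neighbors, read from the CSR arrays."""
    return set(adj[rpt[u]:rpt[u + 1]]) | {u}

def process_edge(rpt, adj, u, v, eps, mu, certainty, validity, role):
    """Return True/False for similar/dissimilar, or None if the edge is skipped."""
    if role[u] != UNKNOWN and role[v] != UNKNOWN:
        return None                      # both roles decided: similarity stays unknown
    nu, nv = closed_neighbors(rpt, adj, u), closed_neighbors(rpt, adj, v)
    sigma = len(nu & nv) / sqrt(len(nu) * len(nv))
    similar = sigma >= eps
    for p in (u, v):
        if similar:
            certainty[p] += 1
            if certainty[p] > mu - 1:    # threshold as stated in the text
                role[p] = CORE
        else:
            validity[p] -= 1
            if validity[p] < mu - 1:
                role[p] = NON_CORE
    return similar
```

For an edge (u, v) of the graph, `len(nu & nv)` equals the number of common neighbors plus two, because u and v each belong to both closed neighborhoods.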
The computation mainly uses a binary search strategy executed in parallel on the GPU, illustrated by the example in FIG. 3. FIG. 3 shows how work is distributed among warps and how the common neighbors of the edge (v_1, v_0) are found by binary search. Parts a and b of the figure show that the adjacently stored edges (v_1, v_0), (v_2, v_0) and (v_2, v_1) are processed simultaneously by the consecutive warps W_i, W_{i+1} and W_{i+2}. Take the edge (v_1, v_0) processed by W_i as an example. To calculate its similarity, the neighbors of v_1 and v_0 are first obtained from the CSR structure. Then, because the degree of v_1 is smaller, threads t_0, t_1, t_2 and t_3 of W_i are made responsible for matching the neighbor points v_0, v_2, v_4 and v_7, respectively. These steps are shown in parts c and d of FIG. 3.
During the calculation, each thread matches its assigned point against the neighbor list of v_0 by binary search, as shown in part e of the figure. Following the binary search tree, in the first iteration the four threads compare their target points with v_4, and thread 2 finishes its task after a successful hit. By the rule of binary search, the point Ids of thread 0 and thread 1 are less than v_4, so they continue the search in the left branch of the binary tree, while the point Id of thread 3 is greater than v_4, so it continues in the right branch. The following iterations proceed in the same way, and the final count is three matched common neighbors, which agrees with the example graph. The number of common neighbors found is used to calculate the similarity. If the result is dissimilar, the similarity calculation can be further refined using the point attributes and edge attributes derived from the communication data. Specifically: taking into account the address attribute of the points, if the two points are in the same region, the similarity is increased by a certain weight; taking into account the communication time attribute of the edge, if the communication time represented by the edge is long enough, the similarity of the two endpoints of the edge is increased by a certain weight. After these attributes are taken into account, the similarity of some edges may rise above the threshold ε, turning them into similar edges; this helps the method fit the actual situation. After the similarity is determined, the validity and certainty of the two points are updated according to the result in order to judge whether they are core points.
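The binary-search matching can be sketched serially as below, with each loop iteration standing in for one thread of the warp. The neighbor lists in the usage note are made up to be consistent with the description of FIG. 3 (the figure itself is not reproduced here), and the function names are hypothetical.

```python
def binary_search(sorted_ids, target):
    """Return True if target occurs in the ascending list sorted_ids."""
    lo, hi = 0, len(sorted_ids) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_ids[mid] == target:
            return True          # successful hit: this "thread" is done
        if target < sorted_ids[mid]:
            hi = mid - 1         # continue in the left branch
        else:
            lo = mid + 1         # continue in the right branch
    return False


def count_common_neighbors(nbrs_a, nbrs_b):
    """Count common neighbors of two points given their sorted neighbor lists."""
    # Search the elements of the shorter list inside the longer one.
    if len(nbrs_a) <= len(nbrs_b):
        short, long_ = nbrs_a, nbrs_b
    else:
        short, long_ = nbrs_b, nbrs_a
    # On the GPU each target would be handled by one thread of the warp.
    return sum(1 for target in short if binary_search(long_, target))
```

With the assumed lists `[0, 2, 4, 7]` for v_1 and `[1, 2, 3, 4, 5, 6, 7]` for v_0, the first comparison of every search is against 4, and the function counts three common neighbors (2, 4 and 7).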
The clustering process runs after all core points have been determined: the core points are first clustered together, and the final clusters are then formed by expanding outward from the core points. The first step places core points that are connected by an edge and similar to each other in the same cluster. The second step adds the similar non-core neighbors of each core point to the clusters formed by the core points.
The clustering process is specifically as follows. Each core point is initialized as a single-element tree-shaped set, with the point as the root of the tree and the set Id equal to the root Id. Core points are clustered first. In the first step, each warp of the GPU processes one core point u in parallel, and the threads in the warp traverse the neighbors of u in parallel. If a neighbor v of u is also a core point, the position of the edge (u, v) in the edge table is found through the previously generated edge table index, and the similarity recorded for the edge (u, v) is read at that position. If the similarity is unknown, the edge is skipped. If the edge is similar, the Find operation of the union-find structure is used to search upward for the root nodes R_u and R_v of the sets containing u and v; if R_u and R_v are different, that is, u and v are not in the same set, the two sets are merged with the Union operation of the union-find structure. The second step is similar to the first: the similarity between every core point u and its neighboring core points v is again checked in parallel; if the similarity is unknown, it is calculated with the similarity calculation method described above, and if the edge is similar, the two sets containing u and v are merged as in the first step. Non-core points are then clustered: each warp of the GPU checks one core point u in parallel, each thread in the warp looks for a non-core neighbor v of u and checks the similarity of (u, v) through the edge table index and the edge table, and if (u, v) is similar the non-core point is added to the cluster, that is, the cluster Id of v is set to the Id of the cluster containing the core point u.
In this way, all structural clusters expanded from the core points in the graph are found.
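A serial sketch of the two clustering stages is given below. It assumes that all edge similarities are already known and passed in as a set of `(min Id, max Id)` pairs, so the step that computes unknown similarities is omitted; the function names and the choice of the smaller Id as root are illustrative assumptions.

```python
def find(parent, x):
    """Find the root of x, shortening the path along the way."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def union(parent, a, b):
    """Merge the sets containing a and b; the smaller root Id becomes the root."""
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[max(ra, rb)] = min(ra, rb)


def cluster_points(adj, cores, similar_edges):
    """Return a dict point -> cluster Id (None for unclustered points)."""
    def is_similar(u, v):
        return (min(u, v), max(u, v)) in similar_edges

    # Stage 1: each core point starts as its own set; merge similar core pairs.
    parent = {u: u for u in cores}
    for u in cores:
        for v in adj[u]:
            if v in cores and is_similar(u, v):
                union(parent, u, v)

    cluster_id = {p: None for p in adj}
    for u in cores:
        cluster_id[u] = find(parent, u)

    # Stage 2: attach similar non-core neighbors to the cluster of the core point.
    for u in sorted(cores):
        for v in adj[u]:
            if v not in cores and cluster_id[v] is None and is_similar(u, v):
                cluster_id[v] = cluster_id[u]
    return cluster_id
```

In this sketch a non-core point that is similar to core points of several clusters keeps the first cluster Id it receives; a parallel version would need to settle such ties in some other way.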
The clustering process is also accelerated by GPU parallelism, using a parallel union-find structure. While one thread performs a Union on the clusters to which two points belong (that is, points the root node of one cluster to the other cluster), another thread operating on the same node at the same time would cause an access conflict and make the operation thread-unsafe. Therefore, the present invention uses the atomic operation atomicCAS in CUDA to guarantee mutually exclusive access when multiple threads operate on the data of the same node.
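The role of the compare-and-swap in the Union step can be illustrated on the CPU. The sketch below emulates the semantics of CUDA's `atomicCAS` (return the old value, write only if it equals the expected value) with a lock, and links the larger root under the smaller one in a retry loop. The class `AtomicArray` and the linking rule are assumptions made for illustration, not the kernel of the invention.

```python
import threading


class AtomicArray:
    """Integer array with a lock-based stand-in for CUDA atomicCAS."""

    def __init__(self, values):
        self._data = list(values)
        self._lock = threading.Lock()

    def load(self, i):
        return self._data[i]

    def compare_and_swap(self, i, expected, new):
        """Write new at index i only if the current value equals expected.

        Returns the value that was stored before the call, like atomicCAS.
        """
        with self._lock:
            old = self._data[i]
            if old == expected:
                self._data[i] = new
            return old


def find_root(parent, x):
    """Follow parent links until a self-pointing root is reached."""
    while True:
        p = parent.load(x)
        if p == x:
            return x
        x = p


def union(parent, a, b):
    """Merge the sets of a and b, retrying if another thread interferes."""
    while True:
        ra, rb = find_root(parent, a), find_root(parent, b)
        if ra == rb:
            return
        lo, hi = min(ra, rb), max(ra, rb)
        # Link hi under lo only if hi is still a root at this instant.
        if parent.compare_and_swap(hi, hi, lo) == hi:
            return


def parallel_union(n, edges, workers=4):
    """Union all edges using several threads; returns the parent array."""
    parent = AtomicArray(range(n))

    def work(chunk):
        for a, b in chunk:
            union(parent, a, b)

    threads = [threading.Thread(target=work, args=(edges[i::workers],))
               for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return parent
```

Because links always point from a larger Id to a smaller one, no cycle can form, and a failed compare-and-swap simply means the roots must be looked up again.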
After the structural clusters of the graph have been obtained as the detected communities, the method of the invention classifies the points that were not clustered, in order to mine further structural information from the graph. Points that are connected, directly or indirectly, to only one cluster are labeled outliers; their characteristic is that their neighbors all belong to the same cluster or are themselves outliers. The remaining points, which connect different clusters, are labeled pivot points (hubs), since they act as pivots between clusters. The specific method is as follows: each GPU warp classifies one unclustered point, which is an outlier by default. The warp first checks whether the point is connected to a point in a cluster; if so, the 32 threads of the warp simultaneously look for a neighbor that belongs to a different cluster; if one is found, the point is classified as a pivot point, otherwise the default value is left unchanged.
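The classification of unclustered points can be sketched serially as follows. The function name `classify_unclustered` and the dictionary-based inputs are hypothetical; the rule implemented is the one stated above, namely that a point whose direct neighbors fall into at least two different clusters is a pivot point and every other unclustered point stays an outlier.

```python
def classify_unclustered(adj, cluster_id):
    """Label every unclustered point as 'pivot' or 'outlier'.

    adj        : dict mapping point -> iterable of neighbor points
    cluster_id : dict mapping point -> cluster Id, or None if unclustered
    """
    labels = {}
    for u, nbrs in adj.items():
        if cluster_id[u] is not None:
            continue                      # clustered points are not classified
        labels[u] = "outlier"             # default value
        seen = set()
        for v in nbrs:
            if cluster_id[v] is not None:
                seen.add(cluster_id[v])
            if len(seen) >= 2:            # neighbors in two different clusters
                labels[u] = "pivot"
                break
    return labels
```

For example, a point adjacent to members of clusters 0 and 5 is labeled a pivot point, while a point adjacent only to members of cluster 0, or only to other outliers, remains an outlier.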
The results of the calculation module are transmitted back to the data analysis layer, which then analyzes suspected criminal groups based on them. The point set corresponding to a cluster is in fact a group of users whose internal communication is dense, so criminal groups are more likely to appear in these clusters. Outliers connected to the edge of a cluster are more likely to be fraud victims, and pivot points connecting different clusters are most likely to be superiors who are in contact with several groups. To make the analysis more accurate, deep learning methods such as multi-layer perceptrons or generative adversarial networks may also be introduced. One possible method is to input the feature information of the users in the group corresponding to a cluster (whether the user is in a cluster, the cluster size, the number of edges connecting to other users, the attributes of the edges, and so on) and to train against identity data that has been verified by the public security authorities. Introducing artificial intelligence in this way can further improve accuracy and assist law enforcement personnel in their judgment.
The results of the data analysis layer are transmitted to the front-end platform for display. Law enforcement personnel can use them to locate suspicious criminal groups quickly, which narrows the scope of investigation and helps combat telecommunication fraud effectively.
The effect of the invention can be further illustrated by the following simulation experiment:
Simulation conditions
To verify the effect of the proposed invention, the simulation experiments use 8 large-scale real-world graphs that are commonly used to test and evaluate the performance of graph algorithms: soc-LiveJournal, enwiki-2022, com-orkut, hollywood-2011, tech-p2p, UK-2002, soc-Twitter and Yahoo songs. These data sets were obtained from the well-known graph data set websites SNAP, Network Repository and the Laboratory for Web Algorithmics. The smallest graph, soc-LiveJournal, has tens of millions of relationships, and the largest graph, Yahoo songs, has seven hundred million relationships. The simulation experiments and the related comparison experiments were carried out under the Ubuntu 16.04 operating system, with C++11 and CUDA 10.1 as the programming environment and an NVIDIA Tesla V100 computing card as the hardware. The experimental data are the average of three executions.
The evaluation index adopted by the invention is the time (unit: s) consumed by the cluster search. To demonstrate the effectiveness of the invention, several widely used methods were also implemented for comparison:
(1) SCAN: the basic method of graph structure clustering. It runs serially on the CPU, computes similarity by merging sorted neighbor lists, and clusters by breadth-first search.
(2) GPUSCAN: this method uses the GPU to accelerate SCAN. Its similarity calculation is a parallel merge of sorted neighbor lists, and its clustering uses a parallel connected-subgraph generation strategy.
Simulation experiment result analysis
Table 2 shows the results of simulation experiments on the eight data sets for the method of the invention and the comparison methods under the common parameters ε = 0.6, μ = 6. The method of the invention obtains the best result on all eight data sets, which reflects its generality. GPUSCAN was the first method to use parallel acceleration and shows a clear performance improvement over SCAN, the ordinary structure clustering method. However, the similarity calculation adopted by GPUSCAN has poor parallelism and a large workload, and its clustering step is inefficient when there are many similarity relations. Its performance is therefore still far weaker than that of the method of the invention, whose average performance is more than ten times that of GPUSCAN.
Table 3 shows the results of simulation experiments for the method of the invention and the comparison methods under the parameters ε = 0.6, μ = 16. When the parameter μ is varied, the performance of the invention and of the comparison methods remains essentially unchanged, and the method of the invention remains more than ten times faster.
Table 4 shows the results of simulation experiments for the method of the invention and the comparison methods under the parameters ε = 0.2, μ = 6. When the parameter ε is smaller, the number of similarity relations in the graph increases; with more similarity relations, the parallel connected-subgraph generation strategy adopted by GPUSCAN needs far more iterations, so its performance drops. By comparison, the speed of the invention remains essentially unchanged, which demonstrates the robustness of the method.
To further demonstrate the effect of the method, the result of graph structure clustering on a communication network is visualized in FIG. 5. The data set records the communications of tens of thousands of people in a certain area during one day. In this data set, each vertex represents a telephone user, and an edge from u to v indicates that the user represented by u and the user represented by v have communicated. The data set is clustered using the method of the present invention. The parameter values and results are shown in the figure, where the parts circled in grey are the clusters found by the structure clustering algorithm (for example, the clusters with Ids C_1 and C_2 in the figure); users of the same cluster are shown in the same color, and users outside the communities are shown in light grey. Three clusters are large and tightly connected, and are therefore suspicious. The other two, although clustered, have only 2 and 5 members respectively; they are small and less likely to be criminal groups. Points outside the clusters that are labeled Outliers, for example u and v, have communicated with users inside a cluster and are therefore likely to be fraud targets. The users marked in dark grey communicate with several different suspected groups; they are classified as Hubs and are suspected to be superiors of the criminal groups.
TABLE 2
TABLE 3
TABLE 4
The present invention is not limited to the above-mentioned embodiments, and any changes or substitutions that can be easily understood by those skilled in the art within the technical scope of the present invention are intended to be included in the scope of the present invention.
It should be appreciated that in the above description of exemplary embodiments of the invention, various features of the invention are sometimes described in the context of a single embodiment or with reference to a single figure in order to streamline the invention and aid those skilled in the art in understanding the various aspects of the invention. The present invention should not, however, be construed as including features that are essential to the patent claims in the exemplary embodiments.
It should be understood that modules, units, components, etc. included in the apparatus of one embodiment of the present invention may be adaptively changed to arrange them in an apparatus different from the embodiment. The different modules, units or components comprised by the apparatus of the embodiments may be combined into one module, unit or component or they may be divided into a plurality of sub-modules, sub-units or sub-components.
Claims (9)
1. A communication phishing prevention method based on graph structure clustering, which is characterized by comprising the following steps:
the communication operator platform periodically collects call records of all the users communicated with the communication network in a set time period;
constructing a network structure according to the user and the call relationship, converting the network structure into a graph model and storing the graph model in a CSR structure;
generating a non-repeated side table and side table index based on the degree pointing based on the CSR structure of the graph data;
transmitting the data of the graph model to a computing system, executing a structure clustering method of the graph by the computing system, and outputting a structure clustering result;
analyzing the structure clustering result, displaying the analysis result, and listing suspicious objects.
2. The communication phishing preventing method based on graph structure clustering according to claim 1, wherein the specific method for converting the network structure into the graph model is as follows:
the user is used as a point in the graph, and the user information data is modeled as the attribute of the point;
communication between users is used as the edge of the graph, and communication data is modeled as the attribute of the edge.
3. The communication phishing preventing method based on graph structure clustering of claim 1, wherein the specific method for storing the graph model in the CSR structure is as follows:
recording the degrees of the points in the order of their Ids in the graph to form a degree array, and storing the neighbor points of each point in the order of the point Ids to form an adjacency array Adj; obtaining, from the prefix sum of the degree array, an array Rpt giving the start position of the neighbor set of each point in the adjacency array Adj; Rpt and Adj constitute the CSR structure of the graph.
4. The communication phishing preventing method based on graph structure clustering of claim 1, wherein the specific method for generating non-repeated side table and side table index based on degree pointing based on the CSR structure of the graph data is as follows:
the set of neighbor points of each point u whose degree is greater than that of u, or whose degree is equal to that of u and whose Id is greater than that of u, is denoted N⁺(u); auxiliary arrays upptr and his are created to record, respectively, which element of N⁺(u) each neighbor point v of u is, and the size of N⁺(u);
calculating an exclusive prefix sum on his to obtain, for each point u, the write position in the degree-oriented edge list (Degree-oriented Edge List) of the edges having u as their source point;
traversing each element in the Adj array in a two-level loop and recording the position of v ∈ N(u) as O_uv; when processing a point v ∈ N⁺(u), the mapping eid(O_uv) is created by adding the start position elptr(u) to the relative offset recorded in upptr, and the edge e(u, v) is written to the edge table; otherwise, a binary search on N(v) is invoked to locate O_vu, and the mapping eid(O_uv) is created from the edge that has v as its source point.
5. The communication phishing prevention method based on graph structure clustering of claim 1, wherein the computing system executes the graph structure clustering method, and the specific process of outputting the structure clustering result is as follows:
inputting clustering parameters, and initializing clustering Id of each point, and certainty and validity of all points;
determining the similarity of some edges and the clustering roles of some objects by using pruning strategies and input parameters to eliminate redundant calculation;
calculating the similarity of each edge, and determining whether each point is a core point according to the similarity of the edges related to that point;
using the union-find set to preliminarily cluster the core points, and expanding outwards from the formed preliminary clusters to form final clusters;
classifying points outside the class into pivot points or outliers according to whether the points are connected with different clusters or not;
and obtaining the clustering conditions of all points of the whole graph, and returning the user group division conditions corresponding to the clustering conditions to the platform.
6. The graph structure clustering-based communication phishing prevention method of claim 5, wherein the GPU multi-threads initialize the certainty of each point in parallel, the certainty being initialized to 0; the GPU multi-thread initializes the validity of each point in parallel, wherein the validity is the number of neighbor points of each point, and if the validity is smaller than mu-1, the points are classified as non-core points in advance.
7. The communication phishing prevention method based on graph structure clustering as claimed in claim 5, wherein the specific process of determining the similarity of some edges and the clustering role of some objects by using pruning strategy and input parameters to eliminate redundant calculation is as follows:
according to the generated non-repeated edge table based on degree pointing, each warp of the GPU sequentially selects an edge (u, v) from the edge table to process; if the edge satisfies |N[u]| < ε²·|N[v]|, (u, v) can be directly determined in advance to be dissimilar, and the validity of point u and of point v is reduced by one; if the validity of either point thereby becomes less than μ − 1, that point is classified as a non-core point in advance, where N[u] represents the set consisting of point u itself and its neighbor points, and |N[u]| is the number of elements of this set, which is equal to the degree of u plus one.
8. The communication phishing preventing method based on graph structure clustering as claimed in claim 5, wherein the specific method for calculating the similarity of each edge and determining whether the edge is a core point according to the similarity of the edges related to each point is as follows:
according to the generated non-repeated edge table based on degree pointing, each warp of the GPU sequentially selects one edge (u, v) from the edge table to process, and checks whether the u point and the v point have both been determined to be core points or non-core points; if both have been determined, the similarity calculation of the edge (u, v) is skipped, and the similarity is kept in an unknown state; if at least one of u and v has not been determined, the 32 threads in the thread bundle (warp) together calculate the number of common neighbors of the two points u and v of the selected edge (u, v), and the number of common neighbors of u and v plus two is the number of structural neighbors of u and v, expressed as |N[u] ∩ N[v]|; the degrees of u and v are obtained directly from the CSR structure, wherein the degree of point u plus one is |N[u]|, and the degree of point v plus one is |N[v]|;
the similarity of each edge (u, v) is calculated according to the following formula:
$$\sigma(u, v) = \frac{|N[u] \cap N[v]|}{\sqrt{|N[u]| \cdot |N[v]|}}$$
according to the attribute matching condition of the two points u and v, updating the similarity can be specifically divided into: taking address attributes corresponding to the points into consideration, and if the address attributes of the two points are the same, increasing the similarity according to a certain weight; taking the communication time attribute corresponding to the edge (u, v) into consideration, and if the communication time represented by the edge is larger than a set value, increasing the similarity of the u and the v according to a certain weight;
if σ(u, v) < ε, the edge is dissimilar and the validity of u and of v is updated by decreasing each by one; if σ(u, v) ≥ ε, the edge is similar and the certainty of u and of v is updated by increasing each by one; a point is classified as a non-core point if its validity thereby becomes less than μ − 1, and as a core point if its certainty thereby becomes greater than μ − 1.
9. The communication phishing prevention method based on graph structure clustering as claimed in claim 5, wherein the specific method of preliminarily clustering the core points by using the union-find set and expanding the formed preliminary clusters outwards to form final clusters is as follows:
each core point is initialized to form a single element tree-shaped set, the point is used as a tree root of the tree-shaped set, and the number Id of the set is the tree root Id;
first, core points are clustered: in the first step, each warp of the GPU processes one core point u in parallel, and the threads in the warp traverse the neighbors of u in parallel; if a neighbor v of point u is also a core point, the position of the edge (u, v) in the edge table is found through the already generated edge table index, and the similarity recorded for the edge (u, v) is found according to the position; if the similarity is unknown, the edge is skipped, and if it is similar, the root nodes R_u and R_v of the sets where the u point and the v point are located are searched upward through the Find operation of the union-find set; if R_u and R_v are different, i.e. they are not in one set, the two sets are merged with the Union operation of the union-find set; the second step is similar to the first step: the similarity between every core point u and its neighboring core points v is checked in parallel, if the similarity is unknown, it is calculated by the similarity calculation method of claim 8, and if it is similar, the two sets where u and v are located are combined by the method of the first step;
then non-core points are clustered: each warp of the GPU checks one core point u in parallel, each thread in the warp searches for a non-core neighbor v of u and checks the similarity of (u, v) according to the edge table index and the edge table; if (u, v) is similar, the non-core point is added to the cluster, namely the cluster Id of v is assigned the Id of the cluster where the core point u is located; if the similarity is unknown, the similarity is calculated, and the non-core point is added to the cluster if (u, v) is similar.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310006675.6A CN116304252A (en) | 2023-01-04 | 2023-01-04 | Communication network fraud prevention method based on graph structure clustering |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116304252A (en) | 2023-06-23 |
Family
ID=86819326
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310006675.6A Pending CN116304252A (en) | 2023-01-04 | 2023-01-04 | Communication network fraud prevention method based on graph structure clustering |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116304252A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117251380A (en) * | 2023-11-10 | 2023-12-19 | 中国人民解放军国防科技大学 | Priority asynchronous scheduling method and system for monotone flow chart |
CN117251380B (en) * | 2023-11-10 | 2024-03-19 | 中国人民解放军国防科技大学 | Priority asynchronous scheduling method and system for monotone flow chart |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||