CN114387005A

CN114387005A - Arbitrage group identification method based on graph classification

Info

Publication number: CN114387005A
Application number: CN202111464780.1A
Authority: CN
Inventors: 余杰潮; 徐德华; 汤敏伟; 李�真
Original assignee: Tianyi Electronic Commerce Co Ltd
Current assignee: Tianyi Electronic Commerce Co Ltd
Priority date: 2021-12-02
Filing date: 2021-12-02
Publication date: 2022-04-22

Abstract

The invention discloses an arbitrage group identification method based on graph classification, comprising the following steps: S1: acquire and preprocess data, extract the entities and relationships needed to build the graph, and construct a knowledge graph; S2: divide the constructed graph into groups using a connected subgraph algorithm; S3: calculate the risk index information of each group to form a business risk score; S4: build and train a deep graph convolutional neural network to predict the structural risk score of each group. Dividing the constructed graph into groups with a connected subgraph algorithm makes arbitrage groups easier to find; classifying the whole graph with a deep graph convolutional neural network predicts the risk of an entire group directly. The invention judges the structural risk of a group through the deep graph neural network and also analyzes the group's business index information with a frequent item set mining algorithm to calculate a business risk score; combining the model output with business data increases the interpretability of the model result.

Description

Arbitrage group identification method based on graph classification

Technical Field

The invention relates to the technical field of electronic information, and in particular to an arbitrage group identification method based on graph classification.

Background

The rapid development of internet technology has brought new opportunities and challenges to industries such as finance and e-commerce. Platforms and merchants can publish all kinds of promotional campaigns online to attract users and increase traffic, but some people harvest the promotional benefits of marketing campaigns across multiple platforms by various means, and a complete industry chain has even formed around this; such people are called the "wool party" or arbitrage groups. Arbitrage groups cause huge losses to merchants and platforms: black-market arbitrage causes economic losses in the billions every year. Existing methods for identifying arbitrage groups rely mainly on expert rules and traditional machine learning models. Expert rules are highly interpretable, but they require summarizing historical risk events, and the resulting rules vary from person to person, are poorly consistent and respond slowly. Traditional machine learning models mainly work at the level of a single user or a single business event and have difficulty identifying group-level anomalies: an arbitrage user may look unremarkable when viewed alone, whereas the anomaly is obvious once the users of one arbitrage group are put together. A graph shows the associations inside a group more directly and naturally, and therefore has an inherent advantage for group identification.

Disclosure of Invention

The technical problem to be solved by the invention is to overcome the defects of the prior art by providing an arbitrage group identification method based on graph classification, an effective detection means for the existing arbitrage group identification problem with higher accuracy and better robustness.

The invention provides the following technical scheme:

The invention provides an arbitrage group identification method based on graph classification, which comprises the following steps:

S1: acquire and preprocess data, extract the entities and relationships needed to build the graph, and construct a knowledge graph;

S2: divide the constructed graph into groups using a connected subgraph algorithm;

S3: calculate the risk index information of each group to form a business risk score;

S4: build and train a deep graph convolutional neural network to predict the structural risk score of each group;

S5: combine the business risk score and the structural risk score into a comprehensive risk score for each group, and screen out the risk groups;

the step S1 includes:

S1.1: acquire raw transaction data and operation data from the business data tables, perform preprocessing such as data cleaning, and extract the entity and relationship information required to build the graph. The entities include accounts, merchants, devices, IPs, etc.; the relationships include consumption, login, transfer, cash withdrawal, etc.;

S1.2: according to the entities and relationships extracted in step S1.1, import the data into the graph database Neo4j to construct the graph, or build the graph with a graph tool such as Networkx; when the data volume is large, Neo4j performs significantly better than Networkx;

the step S2 includes:

S2.1: based on the graph constructed in step S1.2, divide the whole graph into subgraphs using a connected subgraph algorithm, forming groups that are disconnected from one another and whose members are closely connected internally;

the step S3 includes:

S3.1: for the user groups obtained in step S2.1, calculate the aggregation index of the users in each group using the FP-growth frequent item set mining algorithm to obtain an aggregation score;

S3.2: calculate a risk score for the group based on the aggregation index of step S3.1 and existing user blacklists, device blacklists, etc.;

S3.3: calculate the total business risk score of the group from the aggregation score and the risk score obtained in steps S3.1 and S3.2;

the step S4 includes:

S4.1: build a deep graph convolutional neural network; the model accepts any graph as input, without restricting the graph's structure or number of nodes, and is trained on a subset of labeled groups;

S4.2: using the neural network trained in step S4.1, take the user groups obtained in step S2.1 as input to obtain the probability that each group is an arbitrage group, and process that probability to serve as the group's structural risk score;

the step S5 includes:

S5.1: calculate a comprehensive risk score for each group from the business risk score and the structural risk score obtained in steps S3 and S4, and output as risk groups those whose comprehensive risk score exceeds a given threshold, or the top N groups by score.

Compared with the prior art, the invention has the following beneficial effects:

In keeping with the characteristics of arbitrage groups, the invention builds a knowledge graph, which shows the associations, fund flows and other links among the accounts in a group directly and naturally. Members of the same arbitrage group are necessarily linked by some relationship. Dividing the constructed graph into groups with a connected subgraph algorithm therefore makes arbitrage groups easier to find;

In addition, the invention classifies the whole graph with a deep graph convolutional neural network and so predicts the risk of an entire group directly. Many arbitrage groups have typical structural characteristics, such as a snowflake structure or a chain structure. A traditional graph neural network predicts the risk of individual entities within a group and does not fully use the information in the graph structure, so it cannot identify arbitrage groups effectively;

Finally, the invention judges the structural risk of a group through the deep graph neural network and also analyzes the group's business index information with a frequent item set mining algorithm to calculate a business risk score; combining the model output with business data increases the interpretability of the model result.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:

FIG. 1 is a general schematic of the system of the present invention;

FIG. 2 is a diagram of the graph classification network architecture of the present invention;

FIG. 3 is a schematic diagram of the graph convolution process of the present invention.

Detailed Description

The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation. Wherein like reference numerals refer to like parts throughout.

Example 1

Referring to figs. 1-3, the arbitrage group identification method based on graph classification according to this embodiment first obtains users' transaction and operation data and preprocesses it, extracts the entities and relationships needed to build the graph, constructs a knowledge graph, and divides the constructed graph into groups using a connected subgraph algorithm. It then calculates the risk index information of each group to form a business risk score, and builds and trains a deep graph convolutional neural network to predict each group's structural risk score. Finally, it combines the business risk score and the structural risk score into a comprehensive risk score for each group and screens out the risk groups.

Fig. 1 is a flow chart of an arbitrage group identification method based on graph classification according to an example implementation. Referring to fig. 1, the method comprises the following steps:

S1.1: acquire raw transaction data and operation data from the business data tables, perform preprocessing such as data cleaning, and extract the entity and relationship information required to build the graph.

Specifically, users' consumption information is obtained from the transaction data table and users' login information from the operation data table. In this embodiment, the entities required to build the graph are accounts, merchants, bank cards, devices and IPs, and each entity carries its own attribute information. The relationship types are: the transfer relationship between accounts, the consumption relationship between an account and a merchant, the login relationship between an account and a device, the login relationship between an account and an IP, and the cash withdrawal relationship between an account and a bank card, for a total of 5 entity types and 5 relationship types.

S1.2: according to the entities and relationships extracted in step S1.1, import the data into the graph database Neo4j to construct the graph, or build the graph with a graph tool such as Networkx; when the data volume is large, Neo4j performs significantly better than Networkx.

Specifically, the graph is constructed from the entities and relationships extracted in step S1.1. In this embodiment the number of entities is large, exceeding 20 million, so the Neo4j graph database is used for graph construction and the subsequent connected subgraph division. At this data scale, Neo4j performs significantly better than Networkx. If the data volume is small, Networkx can be used instead, which is more convenient and flexible.
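As a small illustration of the graph construction step, the sketch below builds an in-memory graph from relation records using only the Python standard library. It is not the Neo4j or Networkx workflow of the embodiment; the record layout, the `build_graph` name and the sample identifiers are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical relation records: (source entity, relation type, target entity).
# Each entity is a (type, id) pair so that an account id and a device id
# can never be mistaken for the same node.
RELATIONS = [
    (("account", "a1"), "login", ("device", "d1")),
    (("account", "a2"), "login", ("device", "d1")),
    (("account", "a2"), "login", ("ip", "10.0.0.7")),
    (("account", "a3"), "consume", ("merchant", "m9")),
    (("account", "a1"), "transfer", ("account", "a2")),
    (("account", "a4"), "withdraw", ("bank_card", "c5")),
]


def build_graph(relations):
    """Return an undirected adjacency map: node -> {neighbour: set of relation types}."""
    graph = defaultdict(dict)
    for src, rel, dst in relations:
        graph[src].setdefault(dst, set()).add(rel)
        graph[dst].setdefault(src, set()).add(rel)
    return graph


graph = build_graph(RELATIONS)
```

Storing the graph as undirected is a simplification chosen here because the later group division only needs to know which entities are linked, not the direction of the link.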

S2.1: based on the graph constructed in step S1.2, divide the whole graph into subgraphs using a connected subgraph algorithm, forming groups that are disconnected from one another and whose members are closely connected internally.

Specifically, the full graph is divided into groups by the connected subgraph algorithm in Neo4j. Each user is assigned a group number, and users in the same group share the same group number.
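The group numbering can be illustrated with a union-find pass over the edge list. This is a minimal sketch of connected component labeling in plain Python, standing in for the Neo4j algorithm the embodiment uses; the function name and the sample edges are assumptions.

```python
def assign_group_numbers(edges):
    """Label every node that appears in an edge with its connected component number."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    # Merge the components of the two endpoints of every edge.
    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_v] = root_u

    # Give each component root a consecutive group number.
    number_of_root = {}
    groups = {}
    for node in parent:
        root = find(node)
        groups[node] = number_of_root.setdefault(root, len(number_of_root))
    return groups


# Hypothetical edges: accounts a1 and a2 share device d1; a4 and a5 share card c5.
EDGES = [("a1", "d1"), ("a2", "d1"), ("a2", "ip1"),
         ("a3", "m9"), ("a4", "c5"), ("a5", "c5")]
groups = assign_group_numbers(EDGES)
```

Nodes that have no edges never appear in the edge list and so receive no group number in this sketch; a real pipeline would have to decide whether to treat them as single-member groups.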

S3.1: for the user groups obtained in step S2.1, calculate the aggregation index of the users in each group using the FP-growth frequent item set mining algorithm to obtain an aggregation score.

Specifically, step S2 yields the group information of all users: each user belongs to exactly one group, and a group may contain multiple users. Arbitrage groups generally show obvious aggregation, so the aggregation index of each group needs to be calculated. In this embodiment, the FP-growth frequent item set mining algorithm is used to calculate the aggregation index of the users in a group, and an aggregation score is formed from it.

The FP-growth algorithm is applied to each group as follows:

1) scan the user feature information in the group once, find the frequent 1-item sets, record them as L, and sort them in descending order of support count; the minimum support count in this embodiment is 3;

2) based on L from step 1), scan the group's user information again and construct an FP tree representing the associations among the group's item sets;

3) recursively find all frequent item sets on the FP tree;

4) finally, generate strong association rules, i.e. the association information among user features, from all the frequent item sets.
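The mining steps above can be sketched compactly. This is a minimal illustration rather than the patent's implementation: it follows the same pattern-growth recursion as FP-growth but keeps each conditional database as a plain list of transactions instead of compressing it into an FP tree, and it stops at the frequent item sets without generating association rules. The function name, the `feature=value` item encoding and the sample users are assumptions; the minimum support count of 3 is the value given in this embodiment.

```python
from collections import Counter


def frequent_itemsets(transactions, min_support=3):
    """Return {frozenset(items): support count} for every frequent item set."""
    results = {}

    def grow(database, suffix):
        counts = Counter(item for t in database for item in t)
        # Frequent items in descending order of support (ties broken by name).
        frequent = sorted((i for i, c in counts.items() if c >= min_support),
                          key=lambda i: (-counts[i], i))
        rank = {item: r for r, item in enumerate(frequent)}
        for item in frequent:
            results[frozenset(suffix | {item})] = counts[item]
            # Conditional database: transactions containing `item`, restricted
            # to frequent items that rank before it, so each set is found once.
            conditional = [[j for j in t if j in rank and rank[j] < rank[item]]
                           for t in database if item in t]
            grow(conditional, suffix | {item})

    grow([list(set(t)) for t in transactions], frozenset())
    return results


# Hypothetical feature records for the users of one group.
GROUP_USERS = [
    {"ip": "10.0.0.7", "device": "d1", "model": "X9"},
    {"ip": "10.0.0.7", "device": "d1", "model": "X9"},
    {"ip": "10.0.0.7", "device": "d2", "model": "X9"},
    {"ip": "10.0.0.8", "device": "d3", "model": "X9"},
]
transactions = [[f"{k}={v}" for k, v in user.items()] for user in GROUP_USERS]
itemsets = frequent_itemsets(transactions, min_support=3)
# Largest frequent item set, preferring higher support among equal sizes.
largest = max(itemsets, key=lambda s: (len(s), itemsets[s]))
```

In this sample, three of the four users share both an IP address and a device model, which is the kind of aggregation the step is meant to surface.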

The frequent item sets of each group and their support counts are obtained with the frequent item set mining algorithm. The aggregation score Score1_1 of each group is calculated from the group's maximum frequent item set L and the support S of that item set; the calculation formula is as follows:

where size(G) denotes the size of group G and size(features) denotes the number of features. α controls the relative influence of the support count S and the frequent item set size L on the aggregation score; in this embodiment α is 0.5;

S3.2: calculate a risk score for the group based on the aggregation index of step S3.1 and existing user blacklists, device blacklists, etc.

Specifically, the features in the group that meet the aggregation threshold are compared with the business's existing blacklists, and the proportion of features that hit a blacklist is calculated. The blacklists used in this embodiment are account blacklists, IP blacklists and device blacklists. The risk score Score1_2 is calculated as follows:

where PhoneNum, DeviceNum and ModelNum denote the number of mobile phone numbers, devices and device models in the group, and BlackPhoneNum, BlackDeviceNum and BlackModelNum denote the number of mobile phone numbers, devices and device models that hit a blacklist.
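The blacklist comparison can be sketched as a per-feature hit ratio. The text here does not show how the individual ratios are combined into Score1_2, so this sketch deliberately stops at the ratios. The function name, the dictionary layout and the sample values are assumptions, and the input is assumed to contain only features that already met the aggregation threshold.

```python
def blacklist_hit_ratios(group_values, blacklists):
    """For each feature type, return the share of distinct group values on the blacklist."""
    ratios = {}
    for feature, values in group_values.items():
        distinct = set(values)
        hits = distinct & blacklists.get(feature, set())
        ratios[feature] = len(hits) / len(distinct) if distinct else 0.0
    return ratios


# Hypothetical group features and blacklists.
group_values = {
    "phone": ["p1", "p2", "p3", "p4"],
    "device": ["d1", "d1", "d2"],
    "model": ["m1"],
}
blacklists = {
    "phone": {"p1"},
    "device": {"d1", "d9"},
    "model": set(),
}
ratios = blacklist_hit_ratios(group_values, blacklists)
```

Counting distinct values means that a device shared by several accounts is counted once, which matches the "number of devices in the group" reading of the symbols defined above.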

S3.3: calculate the total business risk score of the group from the aggregation score and the risk score obtained in steps S3.1 and S3.2.

Specifically, the total business risk score Score1 of the group is calculated from the aggregation score Score1_1 obtained in step S3.1 and the risk score Score1_2 obtained in step S3.2; the calculation formula is as follows:

Score1 = (Score1_1 + Score1_2) * 50

where Score1_1 and Score1_2 are the aggregation score and the risk score of the group, respectively, and Score1 ranges from 0 to 100.
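A short worked example of the business risk score follows, reading the two terms of the formula as the aggregation score Score1_1 and the risk score Score1_2, as the explanation of the symbols indicates. The stated range of 0 to 100 implies that each component lies between 0 and 1; the sketch treats that as an assumption and checks it. The function name is hypothetical.

```python
def business_risk_score(aggregation_score, risk_score):
    """Combine Score1_1 and Score1_2 into Score1 on a 0-100 scale."""
    for name, value in (("aggregation_score", aggregation_score),
                        ("risk_score", risk_score)):
        # Assumption: both components are normalised to [0, 1].
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    return (aggregation_score + risk_score) * 50


# Aggregation score 0.5 and risk score 0.25 give (0.5 + 0.25) * 50 = 37.5.
score1 = business_risk_score(0.5, 0.25)
```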

S4.1: build a deep graph convolutional neural network; the model accepts any graph as input, without restricting the graph's structure or number of nodes, and is trained on a subset of labeled groups.

Specifically, a deep graph neural network for graph classification is built; the network structure is shown in fig. 2. In this embodiment the network consists mainly of four parts: graph convolutional layers, a pooling layer, a convolutional layer and a fully connected layer. The graph convolutional layers fuse the information of neighboring nodes with the information of the network structure. Three graph convolutional layers are used; the last one outputs a feature of dimension 1 for each node, and all nodes in the graph are sorted by this value. The graph convolution process is shown in fig. 3. The sorted node sequence then enters the pooling layer, which standardizes the output of the graph convolutional layers. The pooling layer has a preset value K that fixes the number of nodes passed on: if the graph has more than K nodes, the top K nodes are selected in descending order of the value produced by the last graph convolutional layer; if it has fewer than K nodes, the sequence is padded with zeros. Because of the pooling layer, the network can process graph inputs with any structure and any number of nodes. In this embodiment K is 20. The features of the K sorted nodes are flattened into a one-dimensional long vector, a one-dimensional convolutional layer extracts feature information from the node sequence, and a fully connected layer completes the classification task. The network is trained with the prepared labeled groups as input so that it performs stably on the test set.
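The pooling step described above (sort the nodes, keep the top K, pad with zeros) can be sketched in plain Python. This covers only the pooling layer, not the graph convolutions or the trained network. It assumes each node is represented by a feature row whose last entry is the one-dimensional output of the final graph convolutional layer; the function name and sample values are hypothetical, and K defaults to the embodiment's value of 20.

```python
def sort_pool(node_features, k=20):
    """Sort nodes by their last feature (descending), keep the top k rows,
    zero-pad if there are fewer than k nodes, and flatten to one vector."""
    if not node_features:
        raise ValueError("the graph must contain at least one node")
    width = len(node_features[0])
    ordered = sorted(node_features, key=lambda row: row[-1], reverse=True)
    kept = ordered[:k]
    # Pad with all-zero rows so every graph yields exactly k * width values.
    kept += [[0.0] * width for _ in range(k - len(kept))]
    return [value for row in kept for value in row]


# Three nodes with two features each; the second feature is the sort key.
features = [[0.1, 0.2], [0.5, 0.9], [0.3, 0.4]]
truncated = sort_pool(features, k=2)
padded = sort_pool(features, k=4)
```

Because the output length depends only on K and the feature width, graphs of any size map to vectors of the same length, which is what lets the following one-dimensional convolution and fully connected layer use fixed shapes.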

S4.2: using the neural network trained in step S4.1, take the user groups obtained in step S2.1 as input to obtain the probability that each group is an arbitrage group, and process that probability to serve as the group's structural risk score.

Specifically, the trained deep graph neural network predicts each user group obtained in step S2.1, and the predicted probability of each group is multiplied by 100 to give the group's structural risk score Score2; the calculation formula is as follows:

Score2 = prob * 100

where prob is the probability output by the network for the group, and Score2 ranges from 0 to 100.

S5: combine the business risk score and the structural risk score into a comprehensive risk score for each group, and screen out the risk groups.

Specifically, the weighted average of the group's business risk score Score1 and structural risk score Score2, obtained in steps S3 and S4, is taken as the group's comprehensive risk score; the calculation formula is as follows:

Score = α * Score1 + (1 - α) * Score2

In this embodiment α is 0.5; it can be adjusted according to business requirements in actual use.

Finally, the groups whose comprehensive risk score exceeds a given threshold, or the top N groups by score, are output as risk groups. In this embodiment, groups with an arbitrage risk score greater than 80 are selected as high-risk groups and their users are added to the marketing blacklist; the remaining groups that carry risk are reviewed manually.
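The final scoring and screening can be sketched as follows, taking the weighted average described in the text, that is α times Score1 plus (1 − α) times Score2, with Score2 being the predicted probability multiplied by 100. The function names, group identifiers and sample scores are assumptions; α = 0.5 and the threshold of 80 are the values of this embodiment.

```python
def composite_score(score1, score2, alpha=0.5):
    """Weighted average of the business (score1) and structural (score2) risk scores."""
    return alpha * score1 + (1 - alpha) * score2


def screen_groups(scores, threshold=None, top_n=None):
    """Return group ids ranked by score, filtered by threshold and/or limited to top_n."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if threshold is not None:
        ranked = [(g, s) for g, s in ranked if s > threshold]
    if top_n is not None:
        ranked = ranked[:top_n]
    return [g for g, _ in ranked]


# Hypothetical inputs: group id -> (Score1, predicted probability).
inputs = {"g1": (90.0, 0.75), "g2": (40.0, 0.5), "g3": (100.0, 1.0)}
scores = {g: composite_score(s1, prob * 100) for g, (s1, prob) in inputs.items()}
high_risk = screen_groups(scores, threshold=80)
top_one = screen_groups(scores, top_n=1)
```

With these numbers, g1 scores 0.5 × 90 + 0.5 × 75 = 82.5 and passes the threshold of 80, g3 scores 100, and g2 scores 45 and would be left for manual review if it is considered to carry risk.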

Finally, it should be noted that: although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that changes may be made in the embodiments and/or equivalents thereof without departing from the spirit and scope of the invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. An arbitrage group identification method based on graph classification, characterized by comprising the following steps:

the step S1 includes:

the step S2 includes:

the step S3 includes:

the step S4 includes:

the step S5 includes: