CN114722920A

CN114722920A - Deep graph convolution model phishing account identification method based on graph classification

Info

Publication number: CN114722920A
Application number: CN202210276108.8A
Authority: CN
Inventors: 宣琦; 徐欣瑶; 李盼盼; 王金焕
Original assignee: Zhejiang University of Technology ZJUT
Current assignee: Zhejiang University of Technology ZJUT
Priority date: 2022-03-21
Filing date: 2022-03-21
Publication date: 2022-07-08

Abstract

The application discloses a deep graph convolution model phishing account identification method based on graph classification, which comprises the following steps: S1: constructing a lightweight data set from published Ethereum transaction records; S2: sampling the transaction subgraph, taking the network topology into account, to obtain a small-scale subgraph; S3: learning the latent transaction behavior patterns of accounts through a Chebyshev graph convolutional deep neural network, thereby classifying Ethereum accounts and detecting phishing accounts. The invention reasonably reduces the scale of the data to be processed, improves computational efficiency, accurately distinguishes phishing accounts from non-phishing accounts, and helps digital currency platforms and users avoid fraud risks.

Description

Deep graph convolution model phishing account identification method based on graph classification

Technical Field

The invention relates to the technical field of blockchain, and in particular to a method for detecting Ethereum phishing accounts.

Background

With the development of computer technology and the spread of internet applications, electronic money has risen to become a major component of electronic finance. Blockchain-based electronic money began with the decentralized encrypted electronic money system over a P2P network proposed by Satoshi Nakamoto, which also marked the formal launch of the Bitcoin system. Blockchain is a distributed ledger technology that can guarantee trustworthy transactions between nodes in real time in an environment without mutual trust. Blockchain technology is applied in many fields, of which cryptocurrency is one of the most widespread. Its outstanding advantages include decentralization and openness. Through cryptocurrency, accounts can freely exchange currency and information without relying on a traditional third party; each transaction between two addresses is permanently recorded in a public block and broadcast to the whole network, which guarantees openness, transparency, and security. In recent years, however, the anonymity of the blockchain and the absence of a supervisory body have led to a proliferation of cybercrime in the cryptocurrency market.

Ethereum is the second-largest cryptocurrency platform after Bitcoin and the largest blockchain-based platform supporting smart contracts. A smart contract is a piece of code that is tamper-proof, transparent in its process, and uninterrupted in execution. Ethereum lets users program in a Turing-complete language in the form of smart contracts, which greatly enriches the levels and scenarios of cryptocurrency trading and has given rise to many applications of blockchain technology in economics and finance. Although the hashing mechanism of the blockchain prevents transactions from being tampered with, no built-in tool has so far been available to detect illegal accounts and suspicious transactions on the network. Phishing fraud has therefore become a key problem for Ethereum that deserves long-term attention, research, and effective countermeasures.

Because phishing fraud on Ethereum differs from traditional phishing, the common approaches based on email detection and website detection are not suitable in this context. Algorithms from the field of network data mining are therefore considered for extracting and learning useful information from the topology of the transaction network, distinguishing the transaction behavior of phishing accounts from that of normal accounts, and thereby detecting phishing.

Some methods already identify phishing accounts using network data mining techniques. Chinese patent application publication No. CN 112600810 A provides a graph-classification-based method for detecting Ethereum phishing fraud, which extracts a target node and its preset first-order and second-order transaction neighbor nodes from the Ethereum network, learns a graph representation vector with a graph embedding algorithm, and then learns and classifies through a classifier. Chinese patent application publication No. CN 112734425 A proposes extracting transaction features from the transaction topology network and smart contracts and then feeding those features into a classifier for identification. After feature extraction, both methods require a classifier to be trained separately to detect phishing accounts, so they cannot achieve end-to-end speed. The method of Chinese patent application publication No. CN 113111930 A works from the perspective of the transaction subgraph: it selects the 20 neighbors of the target node with the largest transaction amounts, constructs a second-order transaction network, trains a graph neural network, and predicts whether the target node is a phishing account.

Disclosure of Invention

The invention provides a deep graph convolution model phishing account identification method based on graph classification. To overcome the shortcomings of the prior art described above, it uses graph convolutional neural network technology to mine the latent information of the transaction network and identify phishing accounts, improves the computational efficiency of network analysis, and ensures end-to-end speed.

The invention provides a deep model phishing account identification method based on graph classification, which comprises the following steps:

S1: Construct a lightweight data set. Sample from the public Ethereum transaction records, reduce the large-scale data to a lightweight form, construct a second-order transaction subgraph network, and extract the features of the accounts in the network. The target accounts comprise labeled phishing nodes and non-phishing nodes; the transaction counterparties comprise the first-order and second-order neighbor nodes of the target node; the features comprise the designated features of the phishing and non-phishing accounts in the lightweight data set;

S2: Sample the transaction subgraph. Taking the network topology into account, construct a formula for the number of neighbors of the target node from the average degree of the network, the network density, the number of nodes, and the number of edges, and obtain a uniform and reasonably sized subgraph scale k. When the number of neighbors is less than k, all neighbor nodes are retained; when it is greater than k, the neighbor nodes of the target node are sorted by transaction amount and number of transactions and k neighbors are retained, yielding the sampled small-scale subgraph;

S3: Learn the latent transaction behavior patterns of accounts through a Chebyshev graph convolutional deep neural network, achieving end-to-end identification of phishing accounts.

Further, the step S1 specifically includes:

S1.1: Taking the target account address as the starting point, extract small-scale transaction data with a second-order breadth-first search (BFS);

S1.2: On the basis of the lightweight data from step S1.1, reduce the data set again with a random walk sampling algorithm. The walk first selects an account at random as the starting node and samples forward from it to obtain a walk sequence of fixed length. If, before the sequence reaches the preset length, the walk arrives at an account that has no transaction counterparty, an account already visited in the sequence is selected at random and the walk restarts from it;

S1.3: Extract features separately for the accounts in the second-order transaction networks of phishing and non-phishing accounts.
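The random walk sampling of step S1.2 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `random_walk_sample`, the dictionary-based adjacency structure, and the choice to restart only from visited accounts that still have a counterparty are all assumptions made for the example.

```python
import random


def random_walk_sample(adjacency, walk_length, seed=None):
    """Return a walk of up to walk_length accounts.

    adjacency maps each account to a list of its transaction counterparties
    (hypothetical structure). At a dead end the walk restarts from a randomly
    chosen account that was already visited.
    """
    if not adjacency:
        return []
    rng = random.Random(seed)
    current = rng.choice(sorted(adjacency))  # random starting account
    walk = [current]
    while len(walk) < walk_length:
        neighbors = adjacency.get(current, [])
        if neighbors:
            # Normal step: move to a random counterparty.
            current = rng.choice(neighbors)
        else:
            # Dead end: restart from a visited account (assumption: one that
            # still has at least one counterparty, so the walk can continue).
            restart_points = [a for a in walk if adjacency.get(a)]
            if not restart_points:
                break  # nothing reachable, return the shorter walk
            current = rng.choice(restart_points)
        walk.append(current)
    return walk
```

The set of distinct accounts in the returned walk forms the reduced data set; repeating the walk with different target lengths would give networks of different sizes.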

Further, the step S2 specifically includes:

S2.1: When a transaction network is constructed, an excessively large number of transaction samples leads to high time complexity and reduces computational efficiency, so the number of neighbors and the neighborhood order must be constrained. The invention provides a formula for the number of neighbors: the neighbors of order h are sorted and k neighbor nodes are retained, where the number k of neighbor nodes is calculated as follows:

where the formula uses the average degree of the network; Density denotes the network density; $\lceil \cdot \rceil$ denotes rounding up; and $|V|$ and $|E|$ denote the number of nodes and edges of the network, respectively.
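The quantities named above can be computed directly from $|V|$ and $|E|$. The sketch below assumes the common definitions for a directed network without self-loops (average degree $2|E|/|V|$, density $|E| / (|V|(|V|-1))$); the source does not state which definitions it uses, and the formula that combines them into k is not reproduced here. The function name `network_statistics` is hypothetical.

```python
def network_statistics(num_nodes, num_edges):
    """Return (average_degree, density) for a directed network without self-loops."""
    if num_nodes < 2:
        return 0.0, 0.0
    # Assumption: every edge contributes to the degree of both of its endpoints.
    average_degree = 2.0 * num_edges / num_nodes
    # Assumption: density is the fraction of ordered node pairs joined by an edge.
    density = num_edges / (num_nodes * (num_nodes - 1))
    return average_degree, density
```

For instance, `network_statistics(20000, 131189)` would give these two statistics for a network with the node and edge counts of data set G1.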

Further, the step S3 specifically includes:

S3.1: The second-order transaction network of each account is represented as a set of vectors. The second-order transaction network of each target account can be written as $G = (V, E, A, X, y)$, where $V$ is the set of all nodes in the transaction network and $E$ is the set of directed edges in the transaction network. $A$ is the adjacency matrix of the transaction network, $A \in \mathbb{R}^{n \times n}$. $X$ is the node feature matrix, $X \in \mathbb{R}^{n \times d}$, where $d$ is the feature dimension and $n$ is the total number of nodes. $y$ indicates whether the target node is a phishing account: $y = 1$ means the target node is a phishing account, and $y = 0$ means it is not.
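As an illustration of this representation, the sketch below assembles $A$, $X$, and $y$ for one target account with NumPy. The function name `build_graph_sample`, the dictionary layout, and the unweighted 0/1 adjacency entries are assumptions for the example; the source does not say whether edges carry weights.

```python
import numpy as np


def build_graph_sample(nodes, edges, features, label):
    """Pack one second-order transaction network into G = (V, E, A, X, y).

    nodes: list of account identifiers; edges: list of (sender, receiver) pairs;
    features: dict mapping each account to a length-d feature list;
    label: 1 for a phishing target account, 0 otherwise.
    """
    index = {node: i for i, node in enumerate(nodes)}
    n = len(nodes)
    A = np.zeros((n, n), dtype=np.float64)
    for sender, receiver in edges:
        # Directed edge: row is the sender, column is the receiver.
        A[index[sender], index[receiver]] = 1.0
    X = np.asarray([features[node] for node in nodes], dtype=np.float64)
    return {"V": list(nodes), "E": list(edges), "A": A, "X": X, "y": int(label)}
```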

S3.2: The graph convolution layers of the Chebyshev GCN automatically aggregate the neighborhood information of each node. The convolution layer of the Chebyshev GCN is defined as:

where $\beta_k$ are the coefficients of the Chebyshev polynomial, which are updated iteratively during training, and $X$ is the node feature matrix of the second-order transaction network.

$T_k(\cdot)$ is the Chebyshev polynomial of order $k$. Because $T_k(x) = \cos(k \cdot \arccos(x))$, the diagonal matrix of eigenvalues must be rescaled to lie within $[-1, 1]$, expressed as:

where $\lambda_{max}$ is obtained by the power iteration method, and $L$ is the Laplacian matrix.

The advantage of this transformation is that the computation no longer requires an eigendecomposition. Because the extracted second-order transaction subgraph is a directed network, the Laplacian matrix is transformed to:

where $A$ is the adjacency matrix of the transaction subgraph, the transformed adjacency matrix is the sum of the adjacency matrix and its transpose, $A + A^{T}$, and the degree matrix of the transformed adjacency matrix is a diagonal matrix. $\sigma(\cdot)$ is the activation function; $\mathrm{ReLU}(\cdot) = \max(0, \cdot)$ is chosen here.
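The Laplacian construction described above can be sketched with NumPy. The sketch assumes the symmetrically normalized form $L = I - D^{-1/2}(A + A^{T})D^{-1/2}$ and the usual rescaling $2L/\lambda_{max} - I$; the source names the ingredients (symmetrized adjacency matrix, its degree matrix, $\lambda_{max}$ from power iteration), but the exact equations are not available here, so both forms are assumptions. The function name `rescaled_laplacian` is hypothetical.

```python
import numpy as np


def rescaled_laplacian(A, num_iters=200, seed=0):
    """Return (L_tilde, lambda_max) for a directed adjacency matrix A."""
    A = np.asarray(A, dtype=np.float64)
    n = A.shape[0]
    A_sym = A + A.T                              # adjacency matrix plus its transpose
    degree = A_sym.sum(axis=1)                   # diagonal of the degree matrix
    inv_sqrt = np.zeros(n)
    nonzero = degree > 0
    inv_sqrt[nonzero] = degree[nonzero] ** -0.5  # isolated nodes keep a zero entry
    # Assumed form: L = I - D^{-1/2} (A + A^T) D^{-1/2}
    L = np.eye(n) - inv_sqrt[:, None] * A_sym * inv_sqrt[None, :]

    # Power iteration. L is symmetric positive semidefinite, so its dominant
    # eigenvalue is also its largest eigenvalue.
    v = np.random.default_rng(seed).standard_normal(n)
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        w = L @ v
        norm = np.linalg.norm(w)
        if norm < 1e-12:
            break
        v = w / norm
    lambda_max = float(v @ L @ v)                # Rayleigh quotient estimate
    if lambda_max < 1e-12:
        lambda_max = 2.0                         # upper bound for a normalized Laplacian
    # Rescale so that the eigenvalues fall within [-1, 1].
    return 2.0 * L / lambda_max - np.eye(n), lambda_max
```

Because the power iteration only approximates $\lambda_{max}$, the rescaled eigenvalues can exceed the interval by a small numerical margin when too few iterations are used.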

In practice, the recurrence property of Chebyshev polynomials can be used:

$$T_k(x) = 2x\,T_{k-1}(x) - T_{k-2}(x), \qquad T_0(x) = 1, \quad T_1(x) = x$$

The scheme uses two Chebyshev GCN layers to aggregate the neighborhood information of the target node, and the transaction subgraph feature extracted with the target account u at its center is denoted o_u＝gs.
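The recurrence makes a Chebyshev convolution layer cheap to evaluate, since each term needs only one extra matrix product. The sketch below assumes the standard layer form $\sigma\big(\sum_k T_k(\tilde{L})\,X\,\beta_k\big)$ with each $\beta_k$ taken as a weight matrix of shape (input dimension, output dimension) and $\tilde{L}$ a Laplacian already rescaled to $[-1, 1]$; the source describes $\beta_k$ only as coefficients, so the matrix-valued form and the function name `chebyshev_conv` are assumptions.

```python
import numpy as np


def chebyshev_conv(L_tilde, X, betas):
    """Apply one Chebyshev graph convolution layer followed by ReLU.

    L_tilde: (n, n) rescaled Laplacian; X: (n, d_in) node features;
    betas: list of (d_in, d_out) weight matrices, one per polynomial order.
    """
    T_prev = X                                   # T_0(L~) X = X
    out = T_prev @ betas[0]
    if len(betas) > 1:
        T_curr = L_tilde @ X                     # T_1(L~) X = L~ X
        out = out + T_curr @ betas[1]
        for beta in betas[2:]:
            # Recurrence: T_k = 2 L~ T_{k-1} - T_{k-2}
            T_next = 2.0 * (L_tilde @ T_curr) - T_prev
            out = out + T_next @ beta
            T_prev, T_curr = T_curr, T_next
    return np.maximum(out, 0.0)                  # ReLU activation
```

Two layers can be stacked by feeding the output of one call into the next, for example `chebyshev_conv(L_tilde, chebyshev_conv(L_tilde, X, betas_1), betas_2)`, where `betas_1` and `betas_2` are hypothetical per-layer weight lists.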

S3.3: A pooling function extracts the feature information produced by the two Chebyshev GCN convolution layers of step S3.2. The pooling function here is average pooling; the average pooling layer pools the node features into a graph feature, defined as:

$y_{pooling} = \mathrm{AvgPooling}(o_u)$ (6)

S3.4: A fully connected layer is then trained to distinguish phishing accounts from non-phishing accounts using these features:

where $W$ and $b$ are the trainable weight matrix and bias matrix, respectively, and the output is the probability matrix of the final prediction result.

All of the trainable parameters above are optimized by minimizing a cross-entropy loss function with gradient descent.
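Steps S3.3 and S3.4 can be sketched together: average pooling over the node embeddings, a fully connected layer, and a cross-entropy loss on the result. The sketch assumes a softmax output over two classes (index 0 for non-phishing, index 1 for phishing); the source names only $W$, $b$, and a probability output, so the softmax and the function names `predict_proba` and `cross_entropy` are assumptions.

```python
import numpy as np


def predict_proba(node_embeddings, W, b):
    """Average-pool node embeddings into a graph feature and classify it.

    node_embeddings: (n, d) output of the convolution layers;
    W: (d, 2) weight matrix; b: (2,) bias vector.
    """
    y_pooling = node_embeddings.mean(axis=0)     # average pooling over nodes
    logits = y_pooling @ W + b                   # fully connected layer
    logits = logits - logits.max()               # shift for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()                       # assumed softmax output


def cross_entropy(probs, label):
    """Cross-entropy loss for one graph with an integer class label."""
    return -float(np.log(probs[label] + 1e-12))
```

In training, the loss would be averaged over the labeled subgraphs and its gradient used to update $W$, $b$, and the convolution weights.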

The invention has the following advantages:

1. By extracting the second-order transaction network of each account and dynamically selecting the number of neighbors, the method avoids the large storage and computation costs of using the complete network data;

2. The deep graph neural network reduces the need for specialized domain knowledge;

3. Phishing accounts are distinguished by the graph neural network, so phishing behavior in the virtual currency field is effectively predicted;

4. The accuracy of the proposed phishing account detection method exceeds that of existing detection methods such as random-walk and graph-embedding approaches.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments will be briefly described below.

FIG. 1 is a flow chart of the present invention.

Fig. 2 is a schematic diagram of the sampling method of the present invention.

Fig. 3 is a schematic diagram of the second order transaction sub-graph sampling process of the present invention.

Detailed Description

In order that those skilled in the art will better understand the disclosure, the following detailed description will be given of the embodiments of the present invention, which are described as only a part of the embodiments of the present invention, but not as all embodiments. This description is not to be taken in a limiting sense, but is intended to be a more detailed description of certain aspects and embodiments of the invention. All other embodiments, which can be obtained by a person skilled in the art without inventive effort based on the embodiments of the present invention, are within the scope of the present invention.

Example 1

The technical scheme provides a method for identifying phishing accounts in a transaction network based on a deep network model, applied to Ethereum transaction information, and specifically comprises the following steps:

S1: Construct a lightweight data set. Sample from the public Ethereum transaction records, reduce the large-scale data to a lightweight form, construct a second-order transaction subgraph network, and extract the features of the accounts in the network.

S1.1: 1165 phishing accounts are collected from the label cloud of the Ethereum blockchain explorer Etherscan. The lightweight data set has 1686003 accounts and 4380616 transaction records. The extracted transaction data set has 167 weakly connected components; only the largest weakly connected component is used, which contains 1684164 accounts and 4378716 transaction records;

S1.2: On the basis of this lightweight data, the data set is reduced again with a random walk sampling algorithm. Starting from one node, random walks are performed to obtain five networks of different sizes, as shown in Table 1:

Data set	Number of nodes	Number of edges	Number of phishing nodes
G1	20000	131189	242
G2	30000	172011	363
G3	40000	202595	462
G4	50000	227854	556
G5	60000	250402	604

Table 1. Information on the five data sets

S1.3: Extract features separately for the accounts in the transaction networks of phishing and non-phishing accounts.
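The second-order transaction subgraph built in step S1 can be extracted with a depth-limited breadth-first search. The sketch below is an illustration under stated assumptions: the function name `second_order_subgraph` is hypothetical, and both senders and receivers are treated as neighbors when expanding the search, which the source does not specify.

```python
from collections import deque


def second_order_subgraph(edges, target, max_hops=2):
    """Return (depth, sub_edges) for the subgraph within max_hops of target.

    edges: list of (sender, receiver) pairs; depth maps each retained account
    to its hop distance from the target account.
    """
    neighbors = {}
    for sender, receiver in edges:
        # Assumption: a transaction in either direction makes two accounts neighbors.
        neighbors.setdefault(sender, set()).add(receiver)
        neighbors.setdefault(receiver, set()).add(sender)
    depth = {target: 0}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        if depth[node] == max_hops:
            continue  # do not expand beyond second-order neighbors
        for nxt in neighbors.get(node, ()):
            if nxt not in depth:
                depth[nxt] = depth[node] + 1
                queue.append(nxt)
    # Keep the original directed edges whose endpoints were both retained.
    sub_edges = [(s, r) for s, r in edges if s in depth and r in depth]
    return depth, sub_edges
```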

S2: Sample the transaction subgraph. Taking the network topology into account, construct a formula for the number of neighbors of the target node from the average degree of the network, the network density, the number of nodes, and the number of edges, and obtain a uniform and reasonably sized subgraph scale k. When the number of neighbors is less than k, all neighbor nodes are retained; when it is greater than k, the neighbor nodes of the target node are sorted by transaction amount and number of transactions and k neighbors are retained, yielding the sampled small-scale subgraph;

S2.1: When a transaction network is constructed, an excessively large number of transaction samples leads to high time complexity and reduces computational efficiency, so the number of neighbors and the neighborhood order must be constrained. The neighbors of order h are sorted and k neighbor nodes are retained, where the number k of neighbor nodes is calculated as follows:

where $\lceil \cdot \rceil$ denotes rounding up, and $|V|$ and $|E|$ denote the number of nodes and edges of the network, respectively.
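Once k is known, the neighbor selection rule of step S2 reduces to a sort and a cut. The sketch below assumes that neighbors are ranked by total transaction amount first, with the number of transactions as a tie-breaker; the source states only that both attributes are used for sorting. The function name `sample_neighbors` and the input layout are hypothetical.

```python
def sample_neighbors(neighbor_stats, k):
    """Keep at most k neighbors of a target account.

    neighbor_stats maps each neighbor to a (total_amount, transaction_count) pair.
    """
    if len(neighbor_stats) <= k:
        return list(neighbor_stats)  # few neighbors: keep them all
    # Assumed ordering: larger amount first, then larger transaction count.
    ranked = sorted(neighbor_stats, key=lambda n: neighbor_stats[n], reverse=True)
    return ranked[:k]
```

Applying the same rule to the first-order and then the second-order neighbors gives a subgraph whose size is bounded regardless of how active the target account is.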

S3: Learn the latent transaction behavior patterns of accounts through a Chebyshev graph convolutional deep neural network, achieving end-to-end identification of phishing accounts.

The diagonal matrix of eigenvalues must be rescaled to lie within $[-1, 1]$, expressed as:

The advantage of this transformation is that the computation no longer requires an eigendecomposition. Because the extracted second-order transaction subgraph is a directed network, the Laplacian matrix is transformed to:

where $A$ is the adjacency matrix of the transaction subgraph, the transformed adjacency matrix is the sum of the adjacency matrix and its transpose, $A + A^{T}$, and the degree matrix of the transformed adjacency matrix is a diagonal matrix. $\sigma(\cdot)$ is the activation function; $\mathrm{ReLU}(\cdot) = \max(0, \cdot)$ is chosen here.

$y_{pooling} = \mathrm{AvgPooling}(o_u)$ (6)

where $W$ and $b$ are the trainable weight matrix and bias matrix, respectively, and the output is the probability matrix of the final prediction result.

The proposed model PDGNN (Phishing Scams Detection in Ethereum using Graph Neural Network) is compared experimentally with seven baseline algorithms: Features (FE), LINE, DeepWalk (DW), Node2Vec (N2V), T-EDGE, Graph2Vec (G2V), and I²BGNN. The training and test sets are split 8:2, the phishing account detection experiment is repeated five times for each algorithm and the results are averaged, and F1-score is used as the evaluation metric. The experimental results are shown in Table 2.

Table 2. Results of the phishing account detection comparison experiments

The experimental results show that the simple feature extraction method FE performs worst. Among the random-walk algorithms, N2V performs better than DW because it incorporates network structure information, while LINE performs relatively poorly because it can only aggregate second-order neighbor information. G2V, I²BGNN, and PDGNN are all graph classification algorithms and outperform the node classification algorithms. Our method, PDGNN, performs best among the graph classification algorithms.
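The evaluation protocol described above (8:2 split, five repetitions, averaged F1-score) can be sketched in plain Python. The function names `f1_score`, `split_80_20`, and `mean_score` are hypothetical, and the seeded shuffle is an assumption; the source does not say how the splits were drawn.

```python
import random


def f1_score(y_true, y_pred):
    """F1-score for the positive (phishing) class, with labels 0 and 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def split_80_20(samples, seed):
    """Shuffle the samples and split them into training and test sets at 8:2."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 8 // 10
    return shuffled[:cut], shuffled[cut:]


def mean_score(scores):
    """Average the scores of repeated runs."""
    return sum(scores) / len(scores)
```

One run would call `split_80_20` with a fresh seed, train on the first part, score the second part with `f1_score`, and pass the five resulting scores to `mean_score`.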

Claims

1. A deep model phishing account identification method based on graph classification, characterized by comprising the following steps:

S1: constructing a lightweight data set; sampling from the public Ethereum transaction records, reducing the large-scale data to a lightweight form, constructing a second-order transaction subgraph network, and extracting the features of the accounts in the network; wherein the target accounts comprise labeled phishing nodes and non-phishing nodes; the transaction counterparties comprise the first-order and second-order neighbor nodes of the target node; the features comprise the designated features of the phishing and non-phishing accounts in the lightweight data set;

S2: sampling the transaction subgraph; taking the network topology into account, constructing a formula for the number of neighbors of the target node from the average degree of the network, the network density, the number of nodes, and the number of edges, and obtaining a uniform and reasonably sized subgraph scale k; when the number of neighbors is less than k, retaining all neighbor nodes; when the number of neighbors is greater than k, sorting the neighbor nodes of the target node by transaction amount and number of transactions and retaining k neighbors, to obtain the sampled small-scale subgraph;

2. The deep model phishing account identification method based on graph classification as claimed in claim 1, wherein step S1 specifically comprises:

S1.1: taking the target account address as the starting point, extracting small-scale transaction data with a second-order breadth-first search (BFS);

S1.2: on the basis of the lightweight data from step S1.1, reducing the data set again with a random walk sampling algorithm; the walk first selects an account at random as the starting node and samples forward from it to obtain a walk sequence of fixed length; if, before the sequence reaches the preset length, the walk arrives at an account that has no transaction counterparty, an account already visited in the sequence is selected at random and the walk restarts from it;

3. The deep model phishing account identification method based on graph classification as claimed in claim 2, wherein step S2 specifically comprises:

S2.1: to constrain the number of neighbors and the neighborhood order, a formula for the number of neighbors is provided; the neighbors of order h are sorted and k neighbor nodes are retained, where the number k of neighbor nodes is calculated as follows:

wherein,

4. The deep model phishing account identification method based on graph classification as claimed in claim 3, wherein step S3 specifically comprises:

S3.1: representing the second-order transaction network of each account as a set of vectors; the second-order transaction network of each target account can be written as $G = (V, E, A, X, y)$; wherein $V$ is the set of all nodes in the transaction network; $E$ is the set of directed edges in the transaction network; $A$ is the adjacency matrix of the transaction network, $A \in \mathbb{R}^{n \times n}$; $X$ is the node feature matrix, $X \in \mathbb{R}^{n \times d}$, where $d$ is the feature dimension and $n$ is the total number of nodes; $y$ indicates whether the target node is a phishing account, $y = 1$ meaning that the target node is a phishing account and $y = 0$ meaning that it is not;

wherein $\beta_k$ are the coefficients of the Chebyshev polynomial, which are updated iteratively during training, and $X$ is the node feature matrix of the second-order transaction network;

the diagonal matrix of eigenvalues must be rescaled to lie within $[-1, 1]$, expressed as:

the advantage of this transformation is that the computation no longer requires an eigendecomposition; because the extracted second-order transaction subgraph is a directed network, the Laplacian matrix is transformed to:

where $A$ is the adjacency matrix of the transaction subgraph, the transformed adjacency matrix is the sum of the adjacency matrix and its transpose, $A + A^{T}$, and the degree matrix of the transformed adjacency matrix is a diagonal matrix; $\sigma(\cdot)$ is the activation function, and $\mathrm{ReLU}(\cdot) = \max(0, \cdot)$ is chosen here;

two Chebyshev GCN layers are used to aggregate the neighborhood information of the target node, and the transaction subgraph feature extracted with the target account u at its center is denoted o_u＝gs；

S3.3: extracting, with a pooling function, the feature information produced by the two Chebyshev GCN convolution layers of step S3.2; the pooling function here is average pooling, and the average pooling layer pools the node features into a graph feature, defined as:

$y_{pooling} = \mathrm{AvgPooling}(o_u)$ (6)

where $W$ and $b$ are the trainable weight matrix and bias matrix, respectively, and the output is the probability matrix of the final prediction result;