CN116032670A

CN116032670A - Ethereum phishing fraud detection method based on self-supervised deep graph learning

Info

Publication number: CN116032670A
Application number: CN202310328325.1A
Authority: CN
Inventors: 许封元; 吴昊; 李书城; 王润川
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2023-03-30
Filing date: 2023-03-30
Publication date: 2023-04-28
Anticipated expiration: 2043-03-30
Also published as: CN116032670B

Abstract

The invention relates to a method for detecting Ethereum phishing fraud based on self-supervised deep graph learning, and belongs to the technical field of self-supervised deep graph learning. Graph construction: information is automatically extracted from the acquired Ethereum data and merged into nodes that originally have no usable attributes, yielding a transaction graph with node features. Model preparation: a spatial self-supervised pretext task is set up, and the model and training task are built to mine and represent the node attribute information and topology information in the graph. Model training: the training scale and convergence conditions are set to obtain an optimized model for detection. The model can run detection on new Ethereum transaction graphs and copes with the constantly changing scale of Ethereum, the continuous evolution of the transaction graph, and the shortage of node labels.

Description

Ethereum phishing fraud detection method based on self-supervised deep graph learning

Technical Field

The present invention relates to a method for detecting Ethereum phishing fraud based on self-supervised deep graph learning, and belongs to the technical field of self-supervised deep graph learning.

Background Art

As one of the most popular scalable blockchains in the world today, Ethereum has fully tapped the potential of smart contracts and has given rise to a large number of decentralized finance (DeFi) applications built on them, attracting widespread attention and funding. Figure 1 shows how a smart contract is created. Smart contracts can automatically manage and approve transactions without any centralized entity, ensuring trust and transparency while eliminating the delays and fees of the traditional transaction process. Ethereum's total market value is estimated to have reached 200 billion US dollars, but this rapid growth also exposes millions of users to malicious attacks such as phishing scams and extortion of user funds. These attacks are difficult to defend against with traditional approaches such as software security or smart contract analysis alone, so intelligent analysis at the level of Ethereum transaction behavior is particularly important.

Existing phishing fraud detection methods based on intelligent analysis fall into two main categories. The first mainly uses shallow learning models, such as traditional machine learning methods that rely on feature engineering, or random-walk-based network embedding methods such as DeepWalk and Node2Vec. The second mainly uses graph-based deep learning methods, such as graph convolutional neural networks. Thanks to its powerful representation learning ability, deep learning has achieved great success in computer vision, speech recognition, and natural language processing. In recent years, applying deep learning to non-Euclidean data such as graphs has attracted growing attention, for example in social networks, protein interface prediction, and knowledge graph embedding; it also helps many tasks in computer vision and natural language processing, such as object detection, action recognition, machine translation, and semantic parsing. Given this experience in other fields, and given that all Ethereum transaction activity can be viewed as one large-scale transaction graph, making full use of deep graph learning on Ethereum can greatly improve the effectiveness of security analysis.

The basic workflow of deep graph learning is shown in Figure 2. The steps are: 1) data collection: obtain a data set for deep graph learning; 2) graph construction: build the graph used by the deep learning method from the data set; 3) model preparation: design the training model and create the training method; 4) model training: feed the training subgraphs to the model until the training results converge; 5) model application: deploy the trained model and feed it the evaluation subgraphs to obtain the application results.

At present, applying deep graph learning to Ethereum phishing fraud detection faces two main problems. First, the constructed Ethereum transaction data is very large. More than 1.8 billion transactions are already on the chain, and thousands of new transactions are added every second, so the constructed Ethereum transaction graph is an evolving graph that changes dynamically. Second, the constructed transaction graph has very few node labels; in particular it lacks phishing node labels for training phishing fraud detection, and transaction graphs generated from new transaction data carry even fewer labels, which further aggravates label imbalance. Existing deep graph learning methods do not solve these problems well. Most of them are transductive: they can only run detection on a single fixed graph and must be retrained whenever new nodes and subgraphs appear, which does not suit Ethereum, where new subgraphs arrive continuously. In addition, to cope with the small number of labels, these methods use biased sampling when generating the transaction graph: they pick the labeled nodes, expand the sample around them with algorithms such as random walks, and finally keep the labeled nodes and their neighbors. This changes the node distribution of the original Ethereum graph to some extent and is not suitable for real-world scenarios.

To summarize the analysis above, existing deep graph learning methods merely restrict the scale of the Ethereum transaction graph under study and alter the original structure of the graph through transductive training, biased sampling, and similar techniques; they cannot fully solve the problems of the ever-changing scale of Ethereum data, the continuously evolving transaction graph, and the shortage of node labels.

Summary of the Invention

Purpose of the invention: in view of the problems and shortcomings described above, the purpose of the present invention is to provide a method for detecting Ethereum phishing fraud based on self-supervised deep graph learning. It proposes an effective self-supervised pretext task, trains a self-supervised model on existing Ethereum transaction data, and applies the converged model to new subgraphs that need to be checked in order to find the nodes involved in phishing fraud.

In terms of workflow, a user of this method only needs to train a converged phishing fraud detection model on existing Ethereum transaction data and then build the latest transaction data to be checked into a transaction graph as input. The method returns the normal nodes and the nodes that may be involved in phishing fraud, thereby detecting Ethereum phishing fraud.

In terms of characteristics, this method can check new data subgraphs with a model trained on existing Ethereum data, coping with the ever-changing scale of Ethereum, the continuously evolving transaction graph, and the shortage of node labels.

Technical solution: to achieve the above purpose, the present invention adopts the following technical solution:

A method for detecting Ethereum phishing fraud based on self-supervised deep graph learning, comprising the following steps:

Step 1: Graph construction: based on the acquired Ethereum data, automatically extract information and merge it into nodes that originally have no usable attributes, obtaining a transaction graph with node features;

Step 2: Model preparation: set up a spatial self-supervised pretext task and build the model and training task, which are used to mine and represent the node attribute information and topological structure information in the graph;

Step 3: Model training: set the training scale and convergence conditions to obtain an optimized trained model for detection.

Furthermore, the specific steps of step 1 are:

Step 1.1: Collect Ethereum transaction data for training, and filter out failed transactions and transactions whose value is 0;

Step 1.2: Divide the transaction data obtained in step 1.1 into S parts by block number or timestamp, and further divide each part into S slices with similar numbers of transactions;

Step 1.3: According to the transaction graph node attributes, compute the node feature vectors in each slice of transaction data and construct a one-to-one Ethereum transaction graph.
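
Steps 1.1 and 1.2 can be sketched as follows. This is a minimal illustration, not the patented implementation: the record fields `block`, `value` and `success` are hypothetical names, and both levels of the split are done by equal transaction counts, whereas the text only requires the first-level split to follow block number or timestamp order.

```python
from typing import Any, Dict, List

# Hypothetical record layout (an assumption for illustration):
# {"block": int, "value": number, "success": bool}
Tx = Dict[str, Any]


def filter_transactions(txs: List[Tx]) -> List[Tx]:
    """Step 1.1: drop failed transactions and transactions whose value is 0."""
    return [tx for tx in txs if tx["success"] and tx["value"] > 0]


def split_evenly(items: List[Tx], parts: int) -> List[List[Tx]]:
    """Cut a list into `parts` contiguous chunks whose sizes differ by at most 1."""
    base, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for p in range(parts):
        size = base + (1 if p < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


def partition(txs: List[Tx], s: int = 5) -> List[List[List[Tx]]]:
    """Step 1.2: order by block number, cut into s parts, then each part into s slices."""
    ordered = sorted(txs, key=lambda tx: tx["block"])
    return [split_evenly(part, s) for part in split_evenly(ordered, s)]
```

With `s = 5` the result is a 5 x 5 nested list; `parts[p][q]` holds the transactions from which the graph for slice `q` of part `p` would be built in step 1.3.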

Furthermore, the specific steps of step 2 are:

Step 2.1: Set the model parameters and choose transductive or inductive learning;

Step 2.2: If transductive learning is chosen, feed the transaction graphs formed from the first S-2 slices of each part into the model in sequence for training, and evaluate the trained model on the transaction graph of slice S-1; if inductive learning is chosen, feed the transaction graphs of the first S-1 slices in sequence for training, and evaluate the trained model on the transaction graph of slice S.

Furthermore, the specific step of step 3 is: repeat step 2.2 on each of the S parts of transaction data to obtain the final converged model.

Furthermore, S is 5.
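
The slice selection in step 2.2 reduces to a small index calculation. The sketch below uses 1-based slice numbers to match the text; the function name and the mode strings are illustrative choices, not taken from the patent.

```python
from typing import List, Tuple


def select_slices(s: int, mode: str) -> Tuple[List[int], int]:
    """Return (training slice numbers, evaluation slice number), 1-based, per step 2.2."""
    if mode == "transductive":
        # Train on the first S-2 slices, evaluate on slice S-1.
        return list(range(1, s - 1)), s - 1
    if mode == "inductive":
        # Train on the first S-1 slices, evaluate on slice S.
        return list(range(1, s)), s
    raise ValueError(f"unknown mode: {mode!r}")
```

For S = 5 this gives slices 1 to 3 for training and slice 4 for evaluation in the transductive setting, and slices 1 to 4 for training and slice 5 for evaluation in the inductive setting.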

Furthermore, the specific steps of step 2 are as follows. The transaction graph obtained in step 1 is denoted $G$ and is divided, in order of block number or timestamp, into $N$ subgraphs $\{G_1, G_2, \dots, G_N\}$. Each subgraph is $G_i = (V_i, E_i, X_i)$, $i \in \{1, \dots, N\}$, where $V_i$ is the set of the $|V_i|$ nodes in the subgraph, $E_i$ is the set of edges in the subgraph, and $X_i \in \mathbb{R}^{|V_i| \times 17}$ is the feature matrix built from the 17 features of the subgraph's nodes. Let $A_i$ denote the adjacency matrix of the subgraph: if an edge exists between any two nodes $m$ and $n$ of the subgraph, then $A_i[m, n] = 1$; otherwise $A_i[m, n] = 0$. The self-supervised training method is then applied to this unlabeled transaction subgraph with node features, and the loss value $\mathcal{L}_{\mathrm{pre}}$ is computed by the following formula:

$$\mathcal{L}_{\mathrm{pre}}(g, G_i) = \sum_{v_j \in D_i} \ell\big(g(G_i)_{v_j},\; y_j\big)$$

where $g$ is the encoder of the graph neural network, i.e., the feature extractor; $D_i$ is the set of transaction graph nodes used in the pretext task; $y_j$ is the ground-truth feature value of transaction graph node $v_j$ for pretext training; and $\ell$ is the criterion that measures the error between the feature embedding vector of node $v_j$ and its ground-truth value $y_j$. After the current round of pretext training, the encoder $g$ is retained for the next round of training, and the subgraph node representation $Z_i$ is generated; this representation is fed to the final classifier for the downstream phishing node detection task. The feature extractor of the graph neural network first computes the feature embedding vectors $Z_i$ of all nodes in the given subgraph $G_i$. Then, for nodes $v_a$ and $v_b$ in subgraph $G_i$, if $v_b$ lies in the set of all nodes reachable from $v_a$ within $k$ hops (that is, by a path of $k$ hops or fewer, ignoring edge direction), i.e., $v_b$ is a $k$-hop neighbor of $v_a$, then after training the similarity between the two nodes' feature embedding vectors should be as high as possible; conversely, for nodes that are not $k$-hop neighbors it should be as low as possible. The general expression of this task is given by the following formula:

[formula not reproduced in the source text]

where $P_i^k$ and $N_i^k$ denote, respectively, the set of node pairs among all nodes $V_i$ of subgraph $G_i$ that are $k$-hop neighbors of each other and the set of node pairs that are not; $\mathrm{sim}(\cdot,\cdot)$ is the similarity function, which this method sets to [expression not reproduced in the source text]; $W$ denotes a linear transformation layer; and $z_j$ is the feature embedding vector of node $v_j$ in $G_i$.

Beneficial effects: compared with the prior art, the present invention has the following advantages. The problems it addresses come from three aspects:

The raw Ethereum transaction data contains no usable node attributes, and two nodes may have multiple transactions between them, which produces a multigraph. This calls for a well-designed Ethereum transaction graph structure and representative node and edge attributes;

The Ethereum transaction graph has very little label data, and transaction graphs generated from new transactions have even less, which further aggravates label imbalance;

Ethereum data is very large, so training on the entire generated transaction graph is costly in both time and resources.

To address these problems, this patent proposes a method for detecting Ethereum phishing fraud based on self-supervised deep graph learning. The three problems map onto three modules of the method: Ethereum graph construction, preparation of the phishing fraud detection model, and training of the phishing fraud detection model. For Ethereum graph construction: unlike Bitcoin transactions, which can have multiple inputs and outputs, Ethereum transactions are one-to-one, but the raw transaction graph has nodes without usable attributes and may have multiple edges between two nodes. Rather than directly merging the parallel edges of the transaction graph into one edge, this method extracts relatively useful information from the raw transaction data and merges it into the node attributes, obtaining a single-edge directed graph (one without parallel edges) with node attributes. It then selects the largest weakly connected component (WCC) of the graph as the final input graph for model training.
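
The graph construction idea described above (fold the information of parallel transactions into node attributes, keep one directed edge per sender and receiver pair, then keep the largest weakly connected component) can be sketched as follows. This is only an illustration under stated assumptions: the `(sender, receiver, value)` tuple layout is hypothetical, and the four attributes computed here stand in for the 17 features of Table 1, which are not listed in this text.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Transfer = Tuple[str, str, float]  # (sender, receiver, value): a hypothetical layout


def build_graph(txs: List[Transfer]):
    """Collapse parallel transactions into single directed edges and keep
    their information as per-node attributes."""
    edges: Set[Tuple[str, str]] = set()
    # Illustrative attributes only; the patent uses 17 node features.
    feats: Dict[str, Dict[str, float]] = defaultdict(
        lambda: {"out_count": 0, "out_value": 0.0, "in_count": 0, "in_value": 0.0}
    )
    for src, dst, value in txs:
        edges.add((src, dst))
        feats[src]["out_count"] += 1
        feats[src]["out_value"] += value
        feats[dst]["in_count"] += 1
        feats[dst]["in_value"] += value
    return edges, dict(feats)


def largest_wcc(edges: Set[Tuple[str, str]]) -> Set[str]:
    """Node set of the largest weakly connected component (edge direction ignored)."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)  # union the two endpoints

    groups: Dict[str, Set[str]] = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return max(groups.values(), key=len) if groups else set()
```

A union-find structure is used here because it finds weakly connected components in near-linear time without materializing an undirected copy of the graph; a graph library would serve equally well.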

For preparation of the phishing fraud detection model: to make full use of large-scale Ethereum transaction data when node labels are scarce, this patent adopts self-supervised learning, which does not depend on labels or annotations and mines more information directly from the data itself through supervisory signals. The effectiveness of self-supervised learning has been demonstrated by many successes in natural language processing, computer vision, and other fields. However, the topology of the Ethereum transaction graph means that the nodes of the graph are not independent, which makes it almost impossible to apply existing frameworks directly to the graph. The technique described in this patent therefore focuses on the design of the pretext task and proposes an effective spatial self-supervised pretext task.

For training of the phishing fraud detection model: the huge volume of Ethereum data puts great resource pressure on model training. This method divides the transaction data into five parts by Ethereum block number (or timestamp) and runs multiple experiments to reduce variability and make the results more stable. Each part is further divided into five slices, and under limited resources the detection model is trained through the five slices of transaction data in turn, which addresses the strong scalability and large scale of Ethereum data.

Based on the above analysis, this patent designs, in three modules, a method for detecting Ethereum phishing fraud based on self-supervised deep graph learning. The graph construction module automatically extracts relatively useful information from the raw Ethereum transaction data and merges it into the node attributes, obtaining a single-edge directed graph with node attributes; the model preparation module designs an effective spatial pretext task based on self-supervised learning to mine and represent the rich node attribute and topological structure information in the graph; the model training module divides the transaction data and transaction graph by Ethereum block number (or timestamp), sets convergence conditions to keep optimizing the feature extractor, and runs multiple experiments to ensure stable results.

Brief Description of the Drawings

FIG. 1 is a flow chart of smart contract creation in the present invention;

FIG. 2 is the basic workflow diagram of deep graph learning in the present invention;

FIG. 3 shows the spatial pretext task of the present invention;

FIG. 4 is a visualization of the raw features, from the comparison in an embodiment of the present invention between the method of this patent and other methods for detecting Ethereum phishing fraud;

FIG. 5 is a visualization for an untrained graph convolutional neural network, from the comparison in an embodiment of the present invention between the method of this patent and other methods for detecting Ethereum phishing fraud;

FIG. 6 is a visualization for the method of this patent, from the comparison in an embodiment of the present invention between the method of this patent and other methods for detecting Ethereum phishing fraud;

FIG. 7 is a visualization of Ethereum phishing fraud detection in an embodiment of the present invention using the method of this patent after 1 training iteration;

FIG. 8 is a visualization of Ethereum phishing fraud detection in an embodiment of the present invention using the method of this patent after 10 training iterations;

FIG. 9 is a visualization of Ethereum phishing fraud detection in an embodiment of the present invention using the method of this patent after 50 training iterations.

Detailed Description

The present invention is further explained below with reference to the accompanying drawings and specific embodiments. It should be understood that these embodiments only illustrate the present invention and do not limit its scope; after reading the present invention, modifications of its various equivalent forms by those skilled in the art all fall within the scope defined by the claims appended to this application.

Figures 4 to 9 show visualization results of the Ethereum phishing fraud detection method based on self-supervised deep graph learning compared with other methods.

In the graph construction part, this method automatically extracts information from the raw Ethereum transaction data and merges it into nodes that originally have no usable attributes, obtaining a transaction graph with node features.

In the model preparation part, this method proposes an effective spatial self-supervised pretext task to address label scarcity, and obtains a converged model usable for detection by minimizing the cumulative error.

In the model training part, this method trains on the transaction graph slice by slice, partitioned by block number (or timestamp), and keeps optimizing during training on each slice, which limits resource use while adapting to the latest changes in the transaction graph.

Example

The Ethereum genesis block dates from July 2015. Without loss of generality, this method collected all transaction data from January 2018 to May 2020 (including external transactions initiated from user addresses and internal transactions initiated from smart contract addresses) for model training. The raw data has a complex structure and contains much noise that is meaningless for phishing fraud detection, such as failed transactions and transactions whose value is 0. After filtering, a total of 75,382,756 transactions remained.

In the graph construction part, 17 general features of the raw Ethereum transaction data are automatically extracted as transaction graph node attributes, as listed in Table 1 below (the 17 basic features used as transaction graph node attributes); the current feature extraction works well.

Table 1

In the model preparation part, before self-supervised training, this method assumes that the original Ethereum transaction graph $G$ has been divided, in order of block number (or timestamp), into $N$ subgraphs $\{G_1, G_2, \dots, G_N\}$. Each subgraph is $G_i = (V_i, E_i, X_i)$, $i \in \{1, \dots, N\}$, where $V_i$ is the set of the $|V_i|$ nodes in the subgraph, $E_i$ is the set of edges in the subgraph, and $X_i \in \mathbb{R}^{|V_i| \times 17}$ is the feature matrix built from the 17 features of the subgraph's nodes. Let $A_i$ denote the adjacency matrix of the subgraph: if an edge exists between any two nodes $m$ and $n$ of the subgraph, then $A_i[m, n] = 1$; otherwise $A_i[m, n] = 0$. The self-supervised training method is then applied to this unlabeled transaction subgraph with node features. The training objective of self-supervised learning can be summarized as minimizing the pretext task loss $\mathcal{L}_{\mathrm{pre}}$, as shown in Formula 1:

Formula 1:

$$\mathcal{L}_{\mathrm{pre}}(g, G_i) = \sum_{v_j \in D_i} \ell\big(g(G_i)_{v_j},\; y_j\big)$$

where $g$ is the encoder of the graph neural network, i.e., the feature extractor; $y_j$ is the ground-truth feature value of transaction graph node $v_j$ for pretext training; $G_i$ is one of the subgraphs into which the original transaction graph has been divided by block number (or timestamp); $D_i$ is the set of transaction graph nodes used in the pretext task; and $\ell$ is the criterion that measures the error between the feature embedding vector of node $v_j$ and its ground-truth value $y_j$. After the current round of pretext training, the encoder $g$ is retained for the next round of training, and the subgraph node representation $Z_i$ is generated; this representation is fed to the final classifier for the downstream phishing node detection task.
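
To make the two objects in this passage concrete, the sketch below builds a 0/1 adjacency matrix from an edge list and accumulates a pretext loss as a plain sum of per-node errors, as Formula 1 is described. It rests on assumptions: the adjacency matrix is filled in the edge's stored direction, embeddings and targets are passed as dictionaries, and the error function is supplied by the caller because the text does not specify it.

```python
from typing import Callable, Dict, Hashable, Iterable, List, Sequence, Tuple


def adjacency_matrix(
    nodes: Sequence[Hashable], edges: Iterable[Tuple[Hashable, Hashable]]
) -> List[List[int]]:
    """A[m][n] = 1 if there is an edge from nodes[m] to nodes[n], else 0."""
    index = {v: i for i, v in enumerate(nodes)}
    a = [[0] * len(nodes) for _ in nodes]
    for u, v in edges:
        a[index[u]][index[v]] = 1
    return a


def pretext_loss(
    embeddings: Dict[Hashable, float],
    targets: Dict[Hashable, float],
    pretext_nodes: Iterable[Hashable],
    error: Callable[[float, float], float],
) -> float:
    """Sum, over the pretext node set, of the error between a node's
    embedding and its ground-truth value."""
    return sum(error(embeddings[v], targets[v]) for v in pretext_nodes)
```

For example, with a squared-error criterion, `pretext_loss(emb, tgt, nodes, lambda z, y: (z - y) ** 2)` returns the total squared deviation over the pretext nodes; scalar embeddings are used here only to keep the example short.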

Intuitively, for phishing fraud detection, a well-designed pretext task should satisfy the following prerequisites: 1) the extracted data labels should reflect the characteristics of the data itself; 2) because Ethereum transaction data is large, the extracted data labels should be obtainable with low time complexity, otherwise the time cost of training is unacceptable; 3) the pretext task may include some domain knowledge, but not in too much detail, otherwise its applicability is limited and it becomes vulnerable to adversarial attacks. For nodes on the Ethereum transaction graph, this method assumes that neighboring nodes that transact frequently should have similar node representations. To capture the spatial relationships between neighboring nodes in the Ethereum transaction graph effectively, the method described in this patent designs the spatial pretext task shown in Figure 3.

First, the feature extractor of the graph neural network computes the feature embedding vectors $Z_i$ of all nodes in the given subgraph $G_i$. Second, for nodes $v_a$ and $v_b$ in subgraph $G_i$, if $v_b$ lies in the set of all nodes reachable from $v_a$ within $k$ hops (that is, by a path of $k$ hops or fewer, ignoring edge direction), i.e., $v_b$ is a $k$-hop neighbor of $v_a$, then after training the similarity between the two nodes' feature embedding vectors should be as high as possible; conversely, for nodes that are not $k$-hop neighbors it should be as low as possible. The general expression of this task is shown in Formula 2:

Formula 2: [formula not reproduced in the source text]

where $P_i^k$ and $N_i^k$ denote, respectively, the set of node pairs among all nodes $V_i$ of subgraph $G_i$ that are $k$-hop neighbors of each other and the set of node pairs that are not; $\mathrm{sim}(\cdot,\cdot)$ is the similarity function, which this method sets to [expression not reproduced in the source text]; $W$ denotes a linear transformation layer; and $z_j$ is the feature embedding vector of node $v_j$ in $G_i$.
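
The two pair sets used by the spatial pretext task can be enumerated with a breadth-first search that ignores edge direction. The sketch below is illustrative only: it lists every unordered node pair, which is quadratic in the number of nodes, so a real pipeline on Ethereum-scale subgraphs would presumably sample pairs instead; the function name is a hypothetical choice.

```python
from collections import deque
from itertools import combinations
from typing import Dict, Hashable, Iterable, Sequence, Set, Tuple

Pair = Tuple[Hashable, Hashable]


def k_hop_pairs(
    nodes: Sequence[Hashable], edges: Iterable[Pair], k: int = 2
) -> Tuple[Set[Pair], Set[Pair]]:
    """Split all unordered node pairs into k-hop neighbours and non-neighbours."""
    # Undirected adjacency lists: edge direction is ignored, as in the text.
    adj: Dict[Hashable, Set[Hashable]] = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    def within_k(src: Hashable) -> Set[Hashable]:
        """Nodes reachable from src by a path of at most k hops."""
        seen = {src}
        frontier = deque([(src, 0)])
        while frontier:
            node, dist = frontier.popleft()
            if dist == k:
                continue  # do not expand beyond k hops
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, dist + 1))
        seen.discard(src)
        return seen

    reach = {v: within_k(v) for v in nodes}
    positive: Set[Pair] = set()
    negative: Set[Pair] = set()
    for u, v in combinations(nodes, 2):
        (positive if v in reach[u] else negative).add((u, v))
    return positive, negative
```

The embodiment sets k to 2, which is the default used here. Training would then push the similarity of embeddings up for pairs in the first set and down for pairs in the second.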

The Ethereum market changes far more sharply than traditional trading platforms. Its transaction data is huge and keeps growing, and the transaction graph keeps evolving, so it is impractical to train on all acquired data at once. As described above, this method therefore partitions the transaction data by block number (or timestamp), feeds only some of the transaction subgraphs in each training run, and continues training the already trained model on new subgraphs. This both eases the problem of ever-growing Ethereum data and helps the model adapt to the distribution of new Ethereum transaction data.

The final algorithm of the model is as follows:

(1) The original Ethereum transaction graph $G$ has been divided, in order of block number (or timestamp), into $N$ subgraphs $\{G_1, G_2, \dots, G_N\}$, where $G_i = (A_i, X_i)$.

(2) The current graph neural network encoder $g_i$ produces the node feature embedding vectors of subgraph $G_i$: $Z_i = g_i(A_i, X_i)$.

(3) The current encoder $g_i$ and the node feature embedding vectors $Z_i$ of subgraph $G_i$ are fed into the spatial pretext task; the task loss $\mathcal{L}_{\mathrm{pre}}$ is minimized according to Formula 1, and training to convergence yields the updated encoder $g_i'$.

(4) The updated encoder $g_i'$ is used as the encoder $g_{i+1}$ for the next subgraph $G_{i+1}$; steps (2) to (4) are repeated until all training-set subgraphs have been used.
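
Steps (2) to (4) amount to a loop that carries the encoder from one subgraph to the next. The sketch below shows only that control flow. The per-subgraph training routine is passed in as a function, and `toy_trainer` is a deliberately simple stand-in (a single scalar weight fitted by gradient descent) rather than a graph neural network; all names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

E = TypeVar("E")  # encoder state
S = TypeVar("S")  # subgraph data


def train_incrementally(
    encoder: E,
    subgraphs: Sequence[S],
    train_until_converged: Callable[[E, S], Tuple[E, float]],
) -> Tuple[E, List[Tuple[int, float]]]:
    """Train on each subgraph in order, reusing the updated encoder g_i'
    as the starting encoder g_{i+1} for the next subgraph."""
    history: List[Tuple[int, float]] = []
    for i, subgraph in enumerate(subgraphs, start=1):
        encoder, loss = train_until_converged(encoder, subgraph)
        history.append((i, loss))
    return encoder, history


def toy_trainer(weight, data, lr=0.05, tol=1e-8, max_steps=10_000):
    """Stand-in for pretext training: fit y = weight * x by gradient descent
    and stop when the loss changes by less than tol."""
    def loss(w):
        return sum((w * x - y) ** 2 for x, y in data) / len(data)

    for _ in range(max_steps):
        grad = sum(2 * (weight * x - y) * x for x, y in data) / len(data)
        new_weight = weight - lr * grad
        converged = abs(loss(weight) - loss(new_weight)) < tol
        weight = new_weight
        if converged:
            break
    return weight, loss(weight)
```

Calling `train_incrementally(0.0, subgraphs, toy_trainer)` on two toy data sets whose best weights differ slightly shows the intended behavior: the second run starts from the first run's result and adapts to the newer data instead of starting from scratch.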

模型训练部分，相比现有的大部分深度图学习方法是直推式的，本专利所述方法可以实现直推式和归纳式两种训练方法。直推式学习方法指在训练模型时已经同时用到了训练和测试数据，而归纳式学习方法在训练时只使用了训练数据并没有使用测试数据。对于直推式学习训练模型时，本方法将根据以太坊的区块号（或者时间戳）划分的每份交易数据（总共五份）前三片（总共五片）用于模型训练，并在第四片交易数据中进行模型评估。对于归纳式学习训练模型时，本方法将每份交易数据前四片用于模型训练，并在第五片交易数据中进行模型评估。以其中一份交易数据的前三片为例，是本发明的用于训练的某份交易数据前三片的数据统计信息如下表2所示：In terms of model training, compared to most existing deep graph learning methods which are direct-push, the method described in this patent can implement both direct-push and inductive training methods. The direct-push learning method refers to the use of both training and test data when training the model, while the inductive learning method only uses training data and does not use test data during training. When training the model using direct learning, this method uses the first three pieces (a total of five pieces) of each transaction data (a total of five pieces) divided according to the block number (or timestamp) of Ethereum for model training, and performs model evaluation in the fourth piece of transaction data. When training the model using inductive learning, this method uses the first four pieces of each transaction data for model training, and performs model evaluation in the fifth piece of transaction data. Taking the first three pieces of one of the transaction data as an example, the data statistical information of the first three pieces of a certain transaction data used for training of the present invention is shown in Table 2 below:

Table 2

The method described in this patent uses a 2-layer GraphSAGE with a mean-pooling aggregator as the backbone of the node classification model and a logistic regression model as the binary classifier. The Ethereum phishing node labels used for node classification come mainly from Etherscan.io and from blacklists published by several companies, 6588 in total. The normal node labels are obtained by randomly sampling non-phishing nodes from each slice's transaction graph, about 3 times as many as the phishing nodes in that graph. The labeled nodes are assigned to the training set, evaluation set and test set in a ratio of 5:2:3. Adam is used as the optimizer; the learning rate is searched over {0.01, 0.001, 0.0001}, dropout is set to 0.5, the number of hidden units is searched over {32, 64, 128, 256}, and the batch size is set to 512. In addition, k is set to 2 in the spatial pre-task of the model training part.
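The search space and label preparation described above can be written down compactly. The Python sketch below is an illustration under stated assumptions, not the patent's implementation: the function names, the dictionary keys and the fixed random seed are hypothetical, and the 5:2:3 split is applied to a shuffled list without stratification because the text does not say how the split is drawn.

```python
import itertools
import random
from typing import Dict, Hashable, List, Sequence, Tuple

LEARNING_RATES = (0.01, 0.001, 0.0001)
HIDDEN_SIZES = (32, 64, 128, 256)
# Values stated in the text: dropout 0.5, batch size 512, 2 layers, k = 2.
FIXED = {"dropout": 0.5, "batch_size": 512, "num_layers": 2, "k_hops": 2}

Labelled = List[Tuple[Hashable, int]]


def hyperparameter_grid() -> List[Dict[str, float]]:
    """Enumerate every (learning rate, hidden size) pair with the fixed settings."""
    return [
        {"learning_rate": lr, "hidden_size": hidden, **FIXED}
        for lr, hidden in itertools.product(LEARNING_RATES, HIDDEN_SIZES)
    ]


def build_labelled_split(phishing: Sequence[Hashable], candidates: Sequence[Hashable],
                         normal_ratio: int = 3, seed: int = 0
                         ) -> Tuple[Labelled, Labelled, Labelled]:
    """Sample normal nodes (about 3x the phishing nodes) and split 5:2:3."""
    rng = random.Random(seed)
    phishing_set = set(phishing)
    pool = [node for node in candidates if node not in phishing_set]
    normal = rng.sample(pool, min(len(pool), normal_ratio * len(phishing)))
    labelled = [(node, 1) for node in phishing] + [(node, 0) for node in normal]
    rng.shuffle(labelled)
    n_train = len(labelled) * 5 // 10
    n_eval = len(labelled) * 2 // 10
    return (labelled[:n_train],
            labelled[n_train:n_train + n_eval],
            labelled[n_train + n_eval:])
```

The grid contains 3 x 4 = 12 configurations; each would be trained and compared on the evaluation set before the test set is touched.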

This completes the description of the three main parts of the method: data mapping, model preparation and model training. Its main innovation is that, while preserving the original structure of the Ethereum transaction graph, the self-supervised deep graph learning approach addresses the continually changing scale of Ethereum data and the scarcity of data labels. The beneficial results achieved by the method are described next.

In short:

The method described in this patent has been implemented and comprehensively evaluated. Extensive experiments on large-scale Ethereum transaction graphs show that it clearly outperforms the baseline models, with an F1-score about 4% to 16% higher than the baselines.

Specifically:

The transductive variant of the method described in this patent is compared with six baseline methods: 1) the raw features; 2) the DeepWalk algorithm; 3) the combination of DeepWalk node embeddings and the raw features; 4) GraphSAGE, a flexible inductive graph neural network algorithm; 5) DGI, an unsupervised inductive graph neural network algorithm that maximizes the mutual information between local patch representations and high-level graph representations; 6) an untrained graph convolutional neural network. For inductive learning, the node embeddings generated by DeepWalk are rotated relative to the original embedding space, so the method is not compared with that baseline. Based on the five parts of transaction data divided by Ethereum block number (or timestamp), the transductive and inductive prediction results are shown in Table 3 and Table 4 respectively, using four metrics: accuracy, precision, recall and F1-score.

Table 3: Prediction results of transductive learning compared with the six baseline methods

Table 4: Prediction results of inductive learning compared with the six baseline methods

The results in Table 3 and Table 4 show that the method described in this patent clearly outperforms the baseline models.
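For reference, the four metrics reported in the tables can be computed from a binary confusion matrix as sketched below in Python. The function name `binary_metrics` and the convention that label 1 marks a phishing node are assumptions made for this illustration.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1-score with label 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # phishing, flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # normal, flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # phishing, missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # normal, passed

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

Since normal nodes outnumber phishing nodes by about 3 to 1 in the labeled data, accuracy alone can look high for a classifier that rarely flags anything, which is why the F1-score is the headline metric.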

A user applies the method described in this patent as follows:

a. Collect the Ethereum transaction data to be used for training, and filter out transactions that failed or whose transaction value is 0;

b. Divide the collected transaction data into five parts by block number (or timestamp), and further split each part into five slices containing similar numbers of transactions;

c. Using the 17 transaction graph node attributes defined in Table 1, compute the hand-crafted node feature vectors for each slice of transaction data and construct one Ethereum transaction graph per slice;

d. Set the model parameters of the method and choose transductive or inductive learning;

e. Taking the first part of the transaction data as an example: if transductive learning is chosen, input the first three transaction graphs in sequence for model training and evaluate the training result on the fourth transaction graph; if inductive learning is chosen, input the first four transaction graphs in sequence for model training and evaluate the training result on the fifth transaction graph;

f. Repeat step e on each of the five parts of transaction data to obtain the final converged model.
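Steps a, b and e can be sketched in Python as follows. This is a simplified illustration under assumptions: the `Tx` record and its field names are hypothetical, parts and slices are both cut by transaction count after sorting by block number (the text only requires similar counts for slices), and a cut may therefore fall inside a block.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Tx:
    """Hypothetical transaction record with only the fields this sketch needs."""
    block_number: int
    value: int
    success: bool


def clean(transactions: Sequence[Tx]) -> List[Tx]:
    """Step a: drop failed transactions and transactions with value 0."""
    return [tx for tx in transactions if tx.success and tx.value > 0]


def chunk_evenly(items: Sequence[Tx], n: int) -> List[List[Tx]]:
    """Split into n contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), n)
    chunks, start = [], 0
    for i in range(n):
        end = start + base + (1 if i < extra else 0)
        chunks.append(list(items[start:end]))
        start = end
    return chunks


def partition(transactions: Sequence[Tx], s: int = 5) -> List[List[List[Tx]]]:
    """Step b: s parts in block order, each split into s slices."""
    ordered = sorted(clean(transactions), key=lambda tx: tx.block_number)
    return [chunk_evenly(part, s) for part in chunk_evenly(ordered, s)]


def select_slices(slices: List[List[Tx]], mode: str) -> Tuple[List[List[Tx]], List[Tx]]:
    """Step e: return (training slices, evaluation slice) for one part."""
    s = len(slices)
    if mode == "transductive":
        return slices[: s - 2], slices[s - 2]
    if mode == "inductive":
        return slices[: s - 1], slices[s - 1]
    raise ValueError(f"unknown mode: {mode!r}")
```

With five slices per part, the transductive setting trains on slices 1 to 3 and evaluates on slice 4, while the inductive setting trains on slices 1 to 4 and evaluates on slice 5. In both cases the evaluation slice is later in block order than every training slice.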

A technician deploys the model returned by steps a to f above, collects the Ethereum transaction data to be checked as in step a, filters out the noisy data, builds the corresponding Ethereum transaction graph as in step c, and feeds it into the deployed model to generate the detection result.

The Ethereum phishing fraud detection results were visualized, as shown in Figures 4 to 9. To keep the plots legible, the number of nodes in each plot is limited to 500. The ratio of randomly sampled normal nodes to nodes exhibiting phishing fraud is about 4:1. In Figures 4 to 9, circular nodes represent normal nodes and five-pointed star nodes represent nodes exhibiting phishing fraud.

Figures 4 to 6 show the node feature visualizations of three methods side by side. Compared with the raw features and the untrained graph convolutional neural network, the method described in this patent separates the two kinds of points better and clusters them more tightly, which further demonstrates its effectiveness. This shows that the method achieves an effect that the prior art cannot.

Figures 7 to 9 show the visualizations of the method described in this patent after different numbers of training iterations. Within a certain range, as the number of training iterations increases, the features of nodes of the same type become more concentrated, and normal nodes become easier to separate from phishing fraud nodes.

Claims

1. An Ethereum phishing fraud detection method based on self-supervised deep graph learning, characterized by comprising the following steps:

step 1: data mapping: based on the acquired Ethereum data, information is extracted automatically and merged into nodes that originally have no usable attributes, so as to obtain a transaction graph with node features;

step 2: model preparation: a spatial self-supervised pre-task is set, and a model and training tasks are constructed, which are used to mine and represent the node attribute information and topological structure information in the graph;

step 3: model training: the training scale and convergence conditions are set to obtain an optimized trained model for detection.

2. The Ethereum phishing fraud detection method based on self-supervised deep graph learning as claimed in claim 1, wherein the specific steps of step 1 are as follows:

step 1.1: collecting the Ethereum transaction data used for training, and filtering out transactions that failed or whose transaction value is 0;

step 1.2: dividing the transaction data obtained in step 1.1 into S parts by block number or timestamp, and further dividing each part into S slices containing similar numbers of transactions;

step 1.3: computing the node feature vectors in each slice of transaction data according to the transaction graph node attributes, and constructing one Ethereum transaction graph per slice.

3. The Ethereum phishing fraud detection method based on self-supervised deep graph learning as claimed in claim 2, wherein the specific steps of step 2 are as follows:

step 2.1: setting the model parameters, and selecting transductive or inductive learning;

step 2.2: if transductive learning is selected, for each part of transaction data, sequentially inputting the transaction graphs built from its first S-2 slices for model training, and evaluating the training result on the (S-1)-th transaction graph; if inductive learning is selected, sequentially inputting the first S-1 transaction graphs for model training, and evaluating the training result on the S-th transaction graph.

4. The Ethereum phishing fraud detection method based on self-supervised deep graph learning as claimed in claim 3, wherein the specific step of step 3 is as follows: repeating step 2.2 on each of the S parts of transaction data to obtain the final converged model.

5. The Ethereum phishing fraud detection method based on self-supervised deep graph learning as claimed in claim 2, wherein S is 5.

6. The Ethereum phishing fraud detection method based on self-supervised deep graph learning as claimed in claim 5, wherein the specific steps of step 2 are as follows: the transaction graph obtained in step 1 is divided, in order of block number or timestamp, into a number of subgraphs, each subgraph consisting of a set of nodes, the set of edges in the subgraph, a feature matrix constructed from the 17 features of the subgraph nodes, and an adjacency matrix representing the subgraph, in which the entry for any two nodes m, n of the subgraph is 1 if there is an edge between them and 0 otherwise; a self-supervised training method is then applied to the transaction subgraph, and the loss value of the pre-task is computed over the set of transaction graph nodes used in the pre-task, where g is the encoder of the graph neural network, i.e. the feature extractor, and an error measure compares the feature embedding vector of each pre-task training node with the true feature value of that node; after this round of pre-task training, the encoder g of the graph neural network is retained for the next round of training, and the representations of the subgraph nodes are generated at the same time; these node representations are fed into the final classifier for the downstream phishing node detection task; the feature extractor of the graph neural network computes the feature embedding vector representations of all nodes in the given subgraph;

Next, for a node of the subgraph, if another node lies in the set of all nodes reachable from it within k hops (i.e. via paths of k hops or fewer), irrespective of the direction of the edges, i.e.