CN116628699A

CN116628699A - DApp risk detection method, system and device based on multi-classification model

Info

Publication number: CN116628699A
Application number: CN202310530700.0A
Authority: CN
Inventors: 彭滔; 陈厚积; 王国军; 李旭彬; 张雨恒; 李培强; 朱津瑶; 刘雪蕾
Original assignee: Guangzhou University
Current assignee: Guangzhou University
Priority date: 2023-05-10
Filing date: 2023-05-10
Publication date: 2023-08-22

Abstract

The invention discloses a DApp risk detection method, system and device based on a multi-classification model. The method comprises the following steps: S1, acquiring sequence data to be detected and training set data; S2, preprocessing the sequence data to be detected and the training set data to obtain preprocessed sequence data to be detected and preprocessed training set data; S3, reducing the dimension of the preprocessed training set data to obtain dimension-reduced preprocessed training set data; S4, building a classification model and training it on the dimension-reduced preprocessed training set data to obtain a trained classification model; S5, inputting the preprocessed sequence data to be detected into the trained classification model to obtain the decision probability of each category, and summing the decision probabilities of all risk categories other than the normal category to obtain the decision probability of an unknown risk; if this probability is greater than a set threshold, a DApp risk is detected, and otherwise no DApp risk is detected. The invention can thereby realize DApp risk detection.

Description

DApp risk detection method, system and device based on multi-classification model

Technical Field

The invention relates to the field of risk detection, in particular to a DApp risk detection method, system and device based on a multi-classification model.

Background

In recent years, with the rapid development of deep learning, more and more researchers have begun to explore its application to DApp vulnerability detection. Deep learning can greatly improve the efficiency and accuracy of DApp risk detection and has become a popular research direction in the field of DApp security. Several existing schemes that combine deep learning and smart contract vulnerability detection to implement DApp risk detection are described below.

L. Su et al. [1] propose a combined machine learning framework. The framework analyzes, at the function level, the dependency relationships among calls in an API sequence and extracts features from them. Random forests are then used to detect and analyze various possible risks. In addition, they propose an automated inspection tool named "Sun instror" that helps developers discover and repair potential risks in time. The tool uses symbolic execution, a static analysis technique, so it can simulate how smart contracts execute a DApp without running the actual code and thereby find potential risks.

BNB Chain [2] introduced DAppBay, a new platform for discovering new Web3 projects. DAppBay is equipped with a feature named Red Alarm that evaluates project risk levels in real time and alerts users to the potential risks of a DApp. Red Alarm is a contract risk scanning tool that helps users identify high-risk projects and protects them from fraud. A user can check whether a contract has logic flaws or fraud risks by entering the contract address into Red Alarm. The list of risky DApps is updated every Friday.

GoPlus [3] provides an open, permissionless, user-driven security service as security infrastructure for Web3. The GoPlus security engine covers multi-chain, multi-dimensional risk detection, providing a safer on-chain ecosystem for cryptocurrency projects and ordinary users. Currently, the Go+ API offered by GoPlus is a complete, dynamic, automated security detection platform and a DApp risk analysis tool based on convolutional neural networks. The Go+ API includes token detection, real-time risk early warning, DApp contract security, and interaction security.

[1] Su L, Shen X, Du X, et al. Evil Under the Sun: Understanding and Discovering Attacks on Ethereum Decentralized Applications[C]//USENIX Security Symposium. 2021: 1307-1324.

[2] https://www.bnbchain.org/en/blog/DAppbay-red-alarm-DApp-risk-list-feb-5th-feb-12th/

[3] https://gopluslabs.io/

Disclosure of Invention

The invention aims to provide a DApp risk detection method, system and device based on a multi-classification model, so as to solve the problem of DApp risk detection.

The invention provides a DApp risk detection method based on a multi-classification model, comprising:

S1, acquiring sequence data to be detected and training set data;

S2, preprocessing the sequence data to be detected and the training set data to obtain preprocessed sequence data to be detected and preprocessed training set data;

S3, reducing the dimension of the preprocessed training set data to obtain dimension-reduced preprocessed training set data;

S4, building a classification model, and training the classification model on the dimension-reduced preprocessed training set data to obtain a trained classification model;

S5, inputting the preprocessed sequence data to be detected into the trained classification model to obtain the decision probability of each category, and summing the decision probabilities of all risk categories other than the normal category to obtain the decision probability of an unknown risk; if this probability is greater than a set threshold, a DApp risk is detected, and otherwise no DApp risk is detected.

The invention also provides a DApp risk detection system based on a multi-classification model, comprising:

an acquisition module, configured to acquire sequence data to be detected and training set data;

a preprocessing module, configured to preprocess the sequence data to be detected and the training set data to obtain preprocessed sequence data to be detected and preprocessed training set data;

a dimension reduction module, configured to reduce the dimension of the preprocessed training set data to obtain dimension-reduced preprocessed training set data;

a training module, configured to build a classification model and train it on the dimension-reduced preprocessed training set data to obtain a trained classification model;

and a detection module, configured to input the preprocessed sequence data to be detected into the trained classification model to obtain the decision probability of each category, and to sum the decision probabilities of all risk categories other than the normal category to obtain the decision probability of an unknown risk; if this probability is greater than a set threshold, a DApp risk is detected, and otherwise no DApp risk is detected.

The embodiment of the invention also provides a device for detecting the DApp risk based on the multi-classification model, which comprises the following components: a memory, a processor and a computer program stored on the memory and executable on the processor, which when executed by the processor, performs the steps of the method described above.

The embodiment of the invention also provides a computer readable storage medium, wherein the computer readable storage medium stores an information transmission implementation program, and the program realizes the steps of the method when being executed by a processor.

By adopting the embodiment of the invention, DApp risk detection can be realized.

The foregoing is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the description, and so that the above and other objects, features and advantages of the invention become more apparent, specific embodiments of the invention are set out below.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the present invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of a method of DApp risk detection based on a multi-classification model in accordance with an embodiment of the present invention;

FIG. 2 is a system model schematic diagram of a method of DApp risk detection based on a multi-classification model in accordance with an embodiment of the present invention;

FIG. 3 is a system schematic diagram of DApp risk detection based on a multi-classification model of an embodiment of the present invention;

FIG. 4 is a schematic diagram of an apparatus for DApp risk detection based on a multi-classification model in accordance with an embodiment of the present invention.

Detailed Description

The technical solutions of the present invention will be clearly and completely described in connection with the embodiments, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Method embodiment

According to an embodiment of the present invention, a method for DApp risk detection based on a multi-classification model is provided. FIG. 1 is a flowchart of the method; as shown in FIG. 1, the method specifically includes:

S1, acquiring sequence data to be detected and training set data;

S1 specifically comprises: acquiring sequence data to be detected and training set data, wherein the training set data comprises: a reentrancy attack risk sequence, a malicious authorization risk sequence, an exploit risk sequence, a data tampering attack sequence and an information leakage risk sequence.

S2 specifically comprises: pre-training on the sequence data to be detected and the training set data with a Word2vec neural network to obtain word vectors for the sequence to be detected and for the training set, then padding the resulting feature vector matrices and unifying their length.

S4 specifically comprises: establishing a classification model comprising a CNN (convolutional neural network) for dimension reduction, an LSTM for multi-batch training with parameters optimized through back propagation, and a fully connected layer and an output layer for classification and for outputting model loss and accuracy; and training the classification model on the dimension-reduced preprocessed training set data to obtain a trained classification model.

In this scheme, the detection system is divided into 3 stages: a data preprocessing stage, a model training stage and a risk detection stage.

The MongoDB database stores the DApp's sequences to be detected, normal sequences, and function call sequences labeled with risks. The sequence to be detected is denoted D0, and the labeled function call sequences fall into 6 categories whose data can be used as the training set of the model: R0 (normal sequence), R1 (reentrancy attack risk sequence), R2 (malicious authorization risk sequence), R3 (exploit risk sequence), R4 (data tampering attack sequence) and R5 (information leakage risk sequence). The risk categories are described in detail below:

R1, reentrancy attack risk: an attacker exploits reentrancy to repeatedly execute functions within the attacked contract. For example, before a transfer to another contract, the attacker may pass the address of their own contract as a parameter to the contract function; during execution of that function, the attacker's contract calls the function of the attacked contract again before the attacked contract's balance has been updated, so that the attacked contract performs invalid transfers and the attacker acquires additional tokens.

R2, malicious authorization risk: an attacker can publish a malicious contract disguised as a regular contract, such as a crowdfunding contract, request authorization from the user to obtain the relevant permissions, and then use the obtained authorization to access the user's other contracts and steal assets, causing property loss to the user.

R3, exploit risk: taking an integer overflow vulnerability as an example, an attacker can construct data so that the result of an integer arithmetic operation in the contract deliberately exceeds the maximum value the contract allows, and then obtain the overflowed balance during contract execution to carry out the attack.

R4, data tampering risk: an attacker can tamper with a state variable or the call order during a contract call to change the contract's intended behavior or bypass security checks. For example, an attacker may inject malicious code into a contract and modify all of the contract's state variables within a function of the attacked contract, allowing the attacker to gain more benefit.

R5, information leakage risk: during the DApp's function calls, an attacker can intercept data packets to obtain sensitive account addresses, keys, rules and other important information. For example, an attacker may intercept contract message calls to access the contract's internal state, obtain signature information or encoded parameters, and then break into the user's account to obtain sensitive information.
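
The reentrancy pattern described under R1 can be made concrete with a toy simulation. The sketch below is plain Python rather than contract code and is not part of the described scheme; the `Vault` and `Attacker` classes, their method names and the amounts are all hypothetical. It only models a contract that pays out before updating the caller's recorded balance.

```python
class Vault:
    """Toy stand-in for a vulnerable contract (hypothetical, for illustration)."""

    def __init__(self):
        self.balances = {}   # recorded balance per account
        self.total = 0       # funds actually held by the vault

    def deposit(self, account, amount):
        self.balances[account] = self.balances.get(account, 0) + amount
        self.total += amount

    def withdraw(self, account, on_receive):
        amount = self.balances.get(account, 0)
        if amount == 0 or self.total < amount:
            return
        self.total -= amount          # funds leave the vault
        on_receive(amount)            # external call happens BEFORE the balance update
        self.balances[account] = 0    # too late: the callee may already have re-entered


class Attacker:
    def __init__(self, vault, name="attacker"):
        self.vault = vault
        self.name = name
        self.received = 0

    def on_receive(self, amount):
        self.received += amount
        # Re-enter withdraw while the recorded balance is still unchanged.
        self.vault.withdraw(self.name, self.on_receive)

    def attack(self):
        self.vault.withdraw(self.name, self.on_receive)


vault = Vault()
vault.deposit("victim", 30)
vault.deposit("attacker", 10)
attacker = Attacker(vault)
attacker.attack()
```

In this toy model the attacker deposits 10 units but ends up receiving all 40 held by the vault. Resetting the recorded balance before making the external call stops the re-entry here, which is why the order of calls in a function call sequence carries a risk signal.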

The data preprocessing stage focuses on word vector training and unifying the length of the feature vectors. We obtain 5 labeled function call sequences from MongoDB for training, and use unlabeled function call sequences obtained by replaying Ethereum transactions as the sequences to be detected, in order to detect DApp risks that have not yet been discovered. A Word2vec neural network is used to pre-train on the labeled function call sequences and on the sequences to be detected separately, yielding their respective embedding word vectors; the feature vector matrices are then padded and unified in length, so that the semantic information of the called functions and the logical relationships between preceding and following calls in the sequence are effectively preserved.

In the model training stage, the complete model we build consists of an input layer, a CNN (convolutional neural network), an LSTM (long short-term memory network), a fully connected layer and an output layer. During training, the labeled function call sequences are input into the model; the CNN reduces each 128 x 5000 matrix to a 64 x 1249 matrix, the LSTM performs multi-batch training and optimizes the parameters through back propagation, and the fully connected layer and output layer perform classification and output the model loss and accuracy.

In the risk detection stage, the feature vector matrix of the preprocessed function call sequence to be detected is input into the trained model, and the sum of the probabilities of all risk categories is compared with a threshold to obtain the final detection result. The complete model is extensible: the introduction of the CNN ensures that long sequences can be processed, and flexible adjustment of the threshold makes the risk decision more credible.

The scheme mainly includes 5 workflows: word vector training, data preprocessing, feature vector dimension reduction, classification model training and DApp risk detection.

Training word vector

The training data are the labeled function call sequences. In the parameter settings, the word vector dimension is 128, the number of iterations is 8 (n_epoch=8), and the number of samples fed to the model per batch is 100 (batch_size=100); the skip-gram algorithm is adopted with negative sampling as the optimization. Finally, a word vector index dictionary is generated.
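
As a rough illustration of what skip-gram with negative sampling trains on, the sketch below builds (center, context) pairs from one function call sequence, draws negative samples from the vocabulary, and builds an index dictionary. It does not train any vectors. The window size, the number of negatives, the uniform sampling, the reservation of index 0 for padding and the function names in `calls` are all assumptions; the source states only the vector dimension, the epochs and the batch size.

```python
import random


def skipgram_pairs(sequence, window=2):
    """Return (center, context) pairs within `window` positions of each token."""
    pairs = []
    for i, center in enumerate(sequence):
        lo = max(0, i - window)
        hi = min(len(sequence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, sequence[j]))
    return pairs


def negative_samples(vocab, positive, k, rng):
    """Draw k vocabulary items that differ from the true context word."""
    candidates = [w for w in vocab if w != positive]
    return [rng.choice(candidates) for _ in range(k)]


# Hypothetical function call sequence of a DApp transaction.
calls = ["approve", "transferFrom", "balanceOf", "transfer"]
vocab = sorted(set(calls))

# Index dictionary; index 0 is kept free for padding (an assumption).
word_index = {name: i + 1 for i, name in enumerate(vocab)}

pairs = skipgram_pairs(calls, window=1)
rng = random.Random(0)
negatives = negative_samples(vocab, pairs[0][1], k=2, rng=rng)
```

Each positive pair is pushed toward a high similarity score and each (center, negative) pair toward a low one, which is how neighboring function names in a call sequence end up with similar vectors.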

Data preprocessing

We store the 5 labeled function call sequences in order as lists to form the training set, keeping the sequences and their labels separately. Each function name in the data set is then converted into its corresponding word vector according to the word vector index dictionary. At the same time, the pad_sequences interface of the sequence library is used to unify the sequence length to 5000: sequences shorter than 5000 are padded with zero vectors, and function names beyond position 5000 are discarded. This finally generates two three-dimensional feature vector matrices of 128 x 5000 x num, where num is the number of function call sequences.
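
The length unification step can be sketched in plain Python as follows. This is a simplified stand-in for a `pad_sequences`-style utility, not the library call itself. The source does not say at which end padding and truncation happen; this sketch pads and truncates at the end, which is an assumption (Keras's `pad_sequences`, for instance, defaults to the front for both). The `word_index` contents and the short `max_len` are illustrative only.

```python
def pad_or_truncate(sequences, max_len, pad_value=0):
    """Right-pad short sequences with pad_value and cut long ones at max_len."""
    fixed = []
    for seq in sequences:
        trimmed = list(seq[:max_len])
        trimmed.extend([pad_value] * (max_len - len(trimmed)))
        fixed.append(trimmed)
    return fixed


# Hypothetical index dictionary; 0 is reserved for the padding position.
word_index = {"approve": 1, "transferFrom": 2, "balanceOf": 3}

raw = [["approve", "transferFrom"], ["balanceOf"] * 7]
encoded = [[word_index[name] for name in seq] for seq in raw]

# The source uses a unified length of 5000; 5 keeps the example readable.
padded = pad_or_truncate(encoded, max_len=5)
```

After this step every sequence has the same length, so the indices can be replaced by their 128-dimensional word vectors and stacked into a single matrix per data set.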

Feature vector dimension reduction

Because the feature vector matrix generated by data preprocessing is very large, a CNN is built to reduce its dimension before training. Through convolution and pooling operations, the matrix is finally reduced to a three-dimensional feature vector matrix of 64 x 1249 x num, where num is the number of function call sequences, which speeds up training without affecting the accuracy of the model.
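
The source gives the shapes before and after dimension reduction but not the convolution or pooling hyperparameters. The arithmetic below shows one combination that reproduces 128 x 5000 to 64 x 1249: a 1-D convolution with 64 filters, kernel size 5, stride 1 and no padding, followed by non-overlapping pooling of size 4. These values are assumptions chosen to match the reported shape, not settings confirmed by the source.

```python
def conv1d_out_len(length, kernel_size, stride=1, padding=0):
    """Output length of a 1-D convolution along the sequence axis."""
    return (length + 2 * padding - kernel_size) // stride + 1


def pool1d_out_len(length, pool_size):
    """Output length of non-overlapping 1-D pooling (stride equals pool_size)."""
    return length // pool_size


seq_len, embed_dim = 5000, 128                 # shape stated in the source
filters, kernel_size, pool_size = 64, 5, 4     # assumed hyperparameters

after_conv = conv1d_out_len(seq_len, kernel_size)    # 5000 - 5 + 1
after_pool = pool1d_out_len(after_conv, pool_size)   # 4996 // 4
reduced_shape = (filters, after_pool)
```

Under these assumptions the sequence axis shrinks from 5000 to 1249 and the feature axis from 128 to 64, so the LSTM processes roughly a quarter as many time steps per sequence.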

Classification model training

We select an LSTM neural network as the classification model, with 20 iterations (n_epoch=20), sigmoid as the activation function, sparse_categorical_crossentropy as the loss function, and 100 samples fed to the model per batch (batch_size=100); the trained model is output at the end.
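
For readers unfamiliar with the loss named above, the following minimal sketch computes sparse categorical cross-entropy by hand: "sparse" means the labels are integer class indices rather than one-hot vectors, and the loss for a sample is the negative log of the probability assigned to its true class. The probability values and labels are made up for illustration, and the clipping constant is an assumption.

```python
import math


def sparse_categorical_crossentropy(probs, label, eps=1e-7):
    """Loss for one sample: negative log of the probability of the true class."""
    p = min(max(probs[label], eps), 1.0 - eps)  # clip to avoid log(0)
    return -math.log(p)


def batch_loss(batch_probs, labels):
    """Mean loss over a batch, with labels given as integer class indices."""
    losses = [
        sparse_categorical_crossentropy(p, y)
        for p, y in zip(batch_probs, labels)
    ]
    return sum(losses) / len(losses)


# Hypothetical model outputs for two sequences over the classes R0..R5.
batch_probs = [
    [0.70, 0.10, 0.05, 0.05, 0.05, 0.05],  # true class R0
    [0.10, 0.60, 0.10, 0.10, 0.05, 0.05],  # true class R1
]
labels = [0, 1]
loss = batch_loss(batch_probs, labels)
```

The loss falls toward zero as the probability assigned to the correct category approaches 1, which is the quantity back propagation minimizes during the 20 training epochs.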

DApp risk detection

The function call sequences to be detected are read and stored as a list. The pad_sequences interface of the sequence library is used to unify the sequence length to 5000: sequences shorter than 5000 are padded with zero vectors, and function names beyond position 5000 are discarded, finally generating two three-dimensional vector matrices of 128 x 5000 x num, where num is the number of function call sequences. These are then input into the classification model to obtain the decision probabilities of the 6 categories. The decision probabilities of all risk categories other than the normal category are summed to give the decision probability of an unknown risk; if this probability is greater than a set threshold (threshold=0.5), a new DApp risk is discovered, and otherwise none is discovered.
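
The decision rule just described can be sketched as follows. The class order (normal category at index 0), the names in `CLASS_NAMES` and the example probability vectors are assumptions made for illustration; only the summation over non-normal categories and the threshold of 0.5 come from the text.

```python
CLASS_NAMES = [
    "R0_normal",
    "R1_reentrancy",
    "R2_malicious_authorization",
    "R3_exploit",
    "R4_data_tampering",
    "R5_information_leakage",
]
NORMAL_INDEX = 0  # assumed position of the normal category


def unknown_risk_probability(class_probs):
    """Sum the probabilities of every category except the normal one."""
    return sum(p for i, p in enumerate(class_probs) if i != NORMAL_INDEX)


def is_risky(class_probs, threshold=0.5):
    """Flag the sequence when the summed risk probability exceeds the threshold."""
    return unknown_risk_probability(class_probs) > threshold


# Hypothetical model outputs for two sequences to be detected.
safe = [0.80, 0.05, 0.05, 0.04, 0.03, 0.03]
suspicious = [0.30, 0.25, 0.15, 0.10, 0.10, 0.10]
```

Here `is_risky(suspicious)` is true even though no single risk category dominates, which reflects the intent of the rule: a sequence that resembles several known risk patterns a little is treated as a candidate unknown risk. Raising the threshold trades recall for fewer false alarms.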

Compared with the prior art, the proposed scheme has the advantage of supporting dynamic detection of DApp risk. It is based on the premise that function call sequences carrying undiscovered risks share a certain similarity with some of the labeled function call sequences. Combining the advantages of dynamic detection and deep learning, a Word2vec+CNN+LSTM deep learning multi-classification model is constructed and the trained model is used as a DApp risk detection tool; the accuracy of risk discrimination can be improved by adjusting the user-defined threshold.

Meanwhile, a Word2vec embedding pre-training model is used to construct an embedding word vector dictionary for the function call sequences, which preserves the semantic information of function names and the contextual relationships within the function call sequence, so that the extracted feature vector matrix is more reasonable and the DApp risk detection result is better founded.

System embodiment one

According to an embodiment of the present invention, a system for DApp risk detection based on a multi-classification model is provided. FIG. 3 is a schematic diagram of the system; as shown in FIG. 3, the system specifically includes:

The acquisition module is specifically configured to: acquire sequence data to be detected and training set data, wherein the training set data comprises: a reentrancy attack risk sequence, a malicious authorization risk sequence, an exploit risk sequence, a data tampering attack sequence and an information leakage risk sequence.

The preprocessing module is specifically configured to: pre-train on the sequence data to be detected and the training set data with a Word2vec neural network to obtain word vectors for the sequence to be detected and for the training set, then pad the resulting feature vector matrices and unify their length.

The training module is specifically configured to: establish a classification model comprising a CNN (convolutional neural network) for dimension reduction, an LSTM for multi-batch training with parameters optimized through back propagation, and a fully connected layer and an output layer for classification and for outputting model loss and accuracy; and train the classification model on the dimension-reduced preprocessed training set data to obtain a trained classification model.

The embodiment of the present invention is a system embodiment corresponding to the above method embodiment, and specific operations of each module may be understood by referring to the description of the method embodiment, which is not repeated herein.

Device embodiment one

The embodiment of the invention provides a device for detecting DApp risk based on a multi-classification model, as shown in fig. 4, which comprises: memory 40, processor 42, and a computer program stored on memory 40 and executable on processor 42, which when executed by the processor, performs the steps of the method embodiments described above.

Device embodiment two

The embodiment of the present invention provides a computer readable storage medium, on which a program for implementing information transmission is stored, which when executed by the processor 42 implements the steps in the above-described method embodiments.

Finally, it should be noted that the above embodiments only illustrate the technical solution of the present invention and do not limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments can still be modified, or some or all of their technical features can be replaced by equivalents, and that such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A method of DApp risk detection based on a multi-classification model, comprising:

S1, acquiring sequence data to be detected and training set data;

2. The method according to claim 1, wherein S1 specifically comprises: acquiring sequence data to be detected and training set data, wherein the training set data comprises: a reentrancy attack risk sequence, a malicious authorization risk sequence, an exploit risk sequence, a data tampering attack sequence and an information leakage risk sequence.

3. The method according to claim 2, wherein S2 specifically comprises: pre-training on the sequence data to be detected and the training set data with a Word2vec neural network to obtain word vectors for the sequence to be detected and for the training set, then padding the resulting feature vector matrices and unifying their length.

4. The method according to claim 3, wherein S4 specifically comprises: establishing a classification model comprising a CNN (convolutional neural network) for dimension reduction, an LSTM for multi-batch training with parameters optimized through back propagation, and a fully connected layer and an output layer for classification and for outputting model loss and accuracy; and training the classification model on the dimension-reduced preprocessed training set data to obtain a trained classification model.

5. A system for DApp risk detection based on a multi-classification model, comprising:

6. The system according to claim 5, wherein the acquisition module is specifically configured to: acquire sequence data to be detected and training set data, wherein the training set data comprises: a reentrancy attack risk sequence, a malicious authorization risk sequence, an exploit risk sequence, a data tampering attack sequence and an information leakage risk sequence.

7. The system according to claim 6, wherein the preprocessing module is specifically configured to: pre-train on the sequence data to be detected and the training set data with a Word2vec neural network to obtain word vectors for the sequence to be detected and for the training set, then pad the resulting feature vector matrices and unify their length.

8. The system according to claim 7, wherein the training module is specifically configured to: establish a classification model comprising a CNN (convolutional neural network) for dimension reduction, an LSTM for multi-batch training with parameters optimized through back propagation, and a fully connected layer and an output layer for classification and for outputting model loss and accuracy; and train the classification model on the dimension-reduced preprocessed training set data to obtain a trained classification model.

9. An apparatus for DApp risk detection based on a multi-classification model, comprising: memory, a processor and a computer program stored on the memory and executable on the processor, which when executed by the processor, performs the steps of the method of DApp risk detection based on a multi-classification model as claimed in any of claims 1 to 4.

10. A computer-readable storage medium, wherein a program for implementing information transfer is stored on the computer-readable storage medium, and the program, when executed by a processor, implements the steps of the method for DApp risk detection based on the multi-classification model as set forth in any one of claims 1 to 4.