CN116541845A

CN116541845A - AdaBoost-based smart contract multi-label vulnerability detection method and system

Info

Publication number: CN116541845A
Application number: CN202310419740.8A
Authority: CN
Inventors: 张明武; 黄梦
Original assignee: Hubei University of Technology
Current assignee: Hubei University of Technology
Priority date: 2023-04-13
Filing date: 2023-04-13
Publication date: 2023-08-04

Abstract

The invention discloses an AdaBoost-based smart contract multi-label vulnerability detection method and system. The method first extracts the bytecode of the smart contract to be detected, then decompiles the bytecode into opcodes, then extracts slice features from the opcodes, and finally performs joint vulnerability detection on the slice features using a One-Vs-Rest model and an AdaBoost model. The method effectively improves the detection performance for security vulnerabilities in Ethereum smart contracts and achieves higher accuracy.

Description

AdaBoost-based smart contract multi-label vulnerability detection method and system

Technical Field

The invention belongs to the technical field of information security, relates to a method and system for detecting smart contract vulnerabilities in cyberspace, and in particular relates to an AdaBoost-based smart contract multi-label vulnerability detection method and system.

Background

Smart contracts were first proposed by the computer scientist Nick Szabo as computer protocols that propagate, verify, or execute contracts in an information-based manner. At that time, smart contracts were not used in practice because no trusted execution environment existed. This problem was not solved until the advent of Bitcoin and blockchain technology. However, Bitcoin cannot provide complex services because its scripting language is not Turing complete.

As one of the most popular blockchain platforms, Ethereum has had tens of thousands of smart contracts deployed on it. Ethereum introduced a Turing-complete programming language (Solidity) for smart contracts, enabling developers to deploy smart-contract-based decentralized applications (DApps) on the blockchain. Smart contracts are deterministic, real-time, verifiable and decentralized, and can be used in a variety of scenarios, including digital identity, financial transactions, securities, digital records, the Internet of Things, blockchain supply chains, distributed computing and insurance.

As application scenarios multiply, most smart contracts involve cryptocurrency, often worth millions of dollars. The security of a smart contract therefore directly affects the security of the cryptocurrency it manages. Security vulnerabilities in smart contracts not only cause huge financial losses but also undermine users' trust in smart contracts and blockchains. Smart contract development thus has very high security requirements, and vulnerability detection has become an urgent problem in the field of blockchain security.

Currently, the main methods for detecting smart contract vulnerabilities are feature matching, formal verification, symbolic execution, static analysis, fuzz testing and deep learning. Symbolic execution tools include Oyente, Mythril and Securify. These tools need to explore all executable paths in the contract or analyze the contract's control flow graph to detect vulnerabilities, and therefore suffer from problems such as low detection efficiency and long running time.

Disclosure of Invention

To solve the above technical problems, the invention provides an AdaBoost-based smart contract multi-label vulnerability detection method, system and electronic device for detecting vulnerabilities in Ethereum smart contract source code (Solidity source code).

The technical scheme adopted by the method of the invention is as follows: an AdaBoost-based smart contract multi-label vulnerability detection method comprising the following steps:

step 1: extracting the bytecode of the smart contract to be detected;

step 2: decompiling the bytecode into opcodes;

step 3: extracting slice features of the opcodes;

step 4: performing joint vulnerability detection on the slice features using a One-Vs-Rest model and an AdaBoost model; the vulnerabilities comprise reentrancy vulnerabilities, integer overflow vulnerabilities, exception handling vulnerabilities, call stack overflow vulnerabilities, tx.origin vulnerabilities, timestamp dependency vulnerabilities and transaction order dependency vulnerabilities;

the joint vulnerability detection using the One-Vs-Rest model and the AdaBoost model proceeds as follows: the One-Vs-Rest model converts the multi-class problem into several binary classification problems by selecting one class as the positive class and treating all other classes as the negative class; each binary classification task is then handled by one AdaBoost classifier, six AdaBoost classifiers being used in total; and the classification results are combined to give the final multi-label classification result.

The technical scheme adopted by the system of the invention is as follows: an AdaBoost-based smart contract multi-label vulnerability detection system comprising the following modules:

a bytecode extraction module for extracting the bytecode of the smart contract to be detected;

an opcode extraction module for decompiling the bytecode into opcodes;

a slice feature extraction module for extracting slice features of the opcodes;

a vulnerability detection module for performing joint vulnerability detection on the slice features using a One-Vs-Rest model and an AdaBoost model; the vulnerabilities comprise reentrancy vulnerabilities, integer overflow vulnerabilities, exception handling vulnerabilities, call stack overflow vulnerabilities, tx.origin vulnerabilities, timestamp dependency vulnerabilities and transaction order dependency vulnerabilities.

The invention realizes AdaBoost-based smart contract multi-label vulnerability detection using word2vec and AdaBoost. The scheme focuses on the opcodes of smart contracts, which consist of several opcode fragments. Because the opcodes contain the logic executed by the contract, slice features of the opcodes are extracted with word2vec and PCA and used as input to the AdaBoost model for learning and training, which effectively improves the detection performance for smart contract security vulnerabilities.

Drawings

Fig. 1 is a schematic diagram of the detection method according to an embodiment of the present invention;

Fig. 2 is a training flow chart of the One-Vs-Rest model and the AdaBoost model according to an embodiment of the present invention.

Detailed Description

To help those of ordinary skill in the art understand and practice the invention, it is described in further detail below with reference to the drawings and embodiments. It should be understood that the embodiments described here serve only to illustrate and explain the invention and are not intended to limit it.

Referring to Fig. 1, the AdaBoost-based smart contract multi-label vulnerability detection method provided by the invention comprises the following steps:

step 1: for the smart contract to be detected, extracting the bytecode with the solc command;

step 2: decompiling the bytecode into opcodes with the Vandal tool, the opcodes consisting of several opcode fragments;
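To make the bytecode-to-opcode step concrete, the following is a minimal sketch of a disassembler written in plain Python. It is not the Vandal tool and does not reproduce its output format. The opcode table is deliberately partial, and the rule used by `split_fragments` (start a new fragment at every `JUMPDEST`) is a hypothetical segmentation chosen only for illustration.

```python
# Partial EVM opcode table: only a handful of opcodes are listed for illustration.
OPCODES = {
    0x00: "STOP", 0x01: "ADD", 0x32: "ORIGIN", 0x42: "TIMESTAMP",
    0x52: "MSTORE", 0x54: "SLOAD", 0x55: "SSTORE", 0x56: "JUMP",
    0x57: "JUMPI", 0x5B: "JUMPDEST", 0xF1: "CALL", 0xF3: "RETURN",
    0xFD: "REVERT",
}


def disassemble(bytecode_hex):
    """Translate a hex bytecode string into a list of opcode mnemonics."""
    if bytecode_hex.startswith("0x"):
        bytecode_hex = bytecode_hex[2:]
    code = bytes.fromhex(bytecode_hex)
    ops, pc = [], 0
    while pc < len(code):
        byte = code[pc]
        if 0x60 <= byte <= 0x7F:
            # PUSH1..PUSH32 are followed by 1..32 immediate bytes.
            width = byte - 0x5F
            ops.append(f"PUSH{width}")
            pc += width  # skip the immediate operand
        else:
            ops.append(OPCODES.get(byte, f"UNKNOWN_0x{byte:02x}"))
        pc += 1
    return ops


def split_fragments(ops):
    """Hypothetical segmentation: begin a new fragment at every JUMPDEST."""
    fragments, current = [], []
    for op in ops:
        if op == "JUMPDEST" and current:
            fragments.append(current)
            current = []
        current.append(op)
    if current:
        fragments.append(current)
    return fragments


ops = disassemble("6080604052")
fragments = split_fragments(disassemble("60805b425b00"))
```

Here `ops` holds the mnemonics of the common contract prologue `PUSH1 0x80, PUSH1 0x40, MSTORE`, with operands dropped so that only the opcode sequence remains for feature extraction.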

step 3: extracting slice features of the opcodes;

in this embodiment, the feature of each opcode fragment is extracted with a word2vec model to generate a word vector matrix; the word vector matrix is summed to obtain one feature per fragment; and the features of all opcode fragments are combined into a slice feature, which represents the contract;
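The summation described above can be sketched as follows. The embedding table `EMBEDDINGS` and its three-dimensional vectors are hypothetical stand-ins for the output of a trained word2vec model, and combining the fragment features by concatenation is an assumption, since the text does not state how the per-fragment features are joined.

```python
# Hypothetical 3-dimensional embeddings; real ones would come from word2vec.
EMBEDDINGS = {
    "PUSH1": [0.25, 0.5, 0.0],
    "MSTORE": [0.0, 0.5, 0.5],
    "CALL": [1.0, 0.0, 0.0],
    "SSTORE": [0.0, 1.0, 0.0],
}


def fragment_feature(fragment, embeddings, dim):
    """Sum the word vectors of all opcodes in one fragment."""
    total = [0.0] * dim
    for op in fragment:
        vector = embeddings.get(op, [0.0] * dim)  # unknown opcodes add nothing
        total = [t + v for t, v in zip(total, vector)]
    return total


def slice_feature(fragments, embeddings, dim):
    """Concatenate the per-fragment sums into one contract-level feature."""
    feature = []
    for fragment in fragments:
        feature.extend(fragment_feature(fragment, embeddings, dim))
    return feature


contract = [["PUSH1", "PUSH1", "MSTORE"], ["CALL", "SSTORE"]]
feature = slice_feature(contract, EMBEDDINGS, 3)
```

With concatenation, a contract with more fragments yields a longer feature, which is consistent with the later need to bring all features to a common length.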

in this embodiment, the word2vec model comprises a CBOW model and a skip-gram model, each with an input layer, a hidden layer and an output layer. The CBOW model predicts the middle target word from its context. The one-hot encodings of the context words of the current word are fed to the input layer, and each is multiplied by the same center-word matrix $W_{V \times N}$ to obtain a $1 \times N$ vector, where $V$ is the number of words in the vocabulary and $N$ is the dimension of the word vector. These $1 \times N$ vectors are averaged into a single $1 \times N$ vector, which is multiplied by the context matrix $U_{V \times N}$ to obtain a $1 \times V$ vector. The $1 \times V$ vector is normalized by softmax to give a $1 \times V$ probability vector over the words, and the word with the highest probability is taken as the predicted word. If the prediction does not match the actual word, the center-word matrix $W_{V \times N}$ and the context matrix $U_{V \times N}$ are corrected by the back-propagation algorithm. The skip-gram model predicts the context from the center word. It takes the one-hot encoding of the center word, a $1 \times V$ vector, as input and multiplies it by the center-word matrix $W_{V \times N}$ to obtain a $1 \times N$ vector, which is multiplied by the context matrix $U_{V \times N}$ to obtain a $V$-dimensional vector. The $V$-dimensional vector is normalized by softmax, and the word with the highest probability is taken as the word predicted by the model. If the prediction does not match the context words, the center-word matrix $W_{V \times N}$ and the context matrix $U_{V \times N}$ are corrected by the back-propagation algorithm;

In the CBOW model, the current word $W_t$ is predicted from the known context $W_{t-2}, W_{t-1}, W_{t+1}, W_{t+2}$, and the learning objective $P_{CBOW}$ is to maximize the log-likelihood:

$$P_{CBOW} = \sum \log p(W_t \mid W_{t-2}, W_{t-1}, W_{t+1}, W_{t+2})$$

In the skip-gram model, the context $W_{t-2}, W_{t-1}, W_{t+1}, W_{t+2}$ is predicted from the known current word $W_t$, and the objective function $P_{Skip\text{-}Gram}$ is:

$$P_{Skip\text{-}Gram} = \sum \log p(W_{t-2}, W_{t-1}, W_{t+1}, W_{t+2} \mid W_t)$$
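The CBOW forward pass described above can be sketched in a few lines of plain Python. The matrices `W` and `U` below are tiny hypothetical examples with $V = 4$ and $N = 2$. Because both matrices are stored as $V \times N$, the product with the context matrix is computed as a dot product with each row of `U`, that is, with its transpose. Training by back-propagation is omitted.

```python
import math


def cbow_forward(context_ids, W, U):
    """Return the softmax probability of every vocabulary word being the target."""
    n = len(W[0])
    # Multiplying a one-hot vector by W selects one row; average those rows.
    hidden = [sum(W[i][k] for i in context_ids) / len(context_ids) for k in range(n)]
    # One score per vocabulary word: hidden (1 x N) times U transposed (N x V).
    scores = [sum(h * u for h, u in zip(hidden, row)) for row in U]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical parameters for a 4-word vocabulary and 2-dimensional vectors.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
U = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, -1.0]]

probs = cbow_forward([0, 1], W, U)
predicted = probs.index(max(probs))
log_likelihood = math.log(probs[2])  # one term of the CBOW objective for target word 2
```

Summing such log-probability terms over all positions in the opcode sequence gives the quantity that training maximizes.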

in this embodiment, joint vulnerability detection is performed with the One-Vs-Rest model and the AdaBoost model as follows: the One-Vs-Rest model converts the multi-class problem into several binary classification problems by selecting one class as the positive class and treating all other classes as the negative class; each binary classification task is then handled by one AdaBoost classifier; and the classification results are combined to give the final multi-label classification result.
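The One-Vs-Rest conversion and the final combination step can be illustrated as below. This is a sketch of the label bookkeeping only: the class names and the sample labels are hypothetical, and the per-class binary classifiers (AdaBoost in this document) are left out, so the round trip simply feeds the binary targets back in as if they were predictions.

```python
def binary_targets(label_sets, classes):
    """For each class, mark samples carrying it as +1 and all others as -1."""
    return {c: [1 if c in labels else -1 for labels in label_sets] for c in classes}


def combine_predictions(per_class_predictions):
    """Merge per-class +1/-1 predictions into one label set per sample."""
    n = len(next(iter(per_class_predictions.values())))
    return [
        {c for c, preds in per_class_predictions.items() if preds[i] == 1}
        for i in range(n)
    ]


# Hypothetical class names and labels for three contracts.
classes = ["reentrancy", "timestamp", "tx_origin"]
label_sets = [{"reentrancy", "timestamp"}, set(), {"tx_origin"}]

targets = binary_targets(label_sets, classes)
recovered = combine_predictions(targets)
```

A contract can end up with several labels or none, which is what makes the combined result a multi-label classification. In scikit-learn, a comparable arrangement would likely wrap `AdaBoostClassifier` in `OneVsRestClassifier`, although the document does not say which library, if any, was used.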

Referring to Fig. 2, the One-Vs-Rest model and the AdaBoost model of this embodiment are trained models; the training process comprises the following steps:

(1) Collecting an original data set and preprocessing;

collecting Ethereum smart contracts to form an original data set; preprocessing the original data by labeling it with the smart contract vulnerability detection tool Oyente, recording whether each contract contains a vulnerability and, if so, the vulnerability type; and compiling the contract source code into bytecode with the solc command;

(2) Decompiling the bytecode of the source code into opcodes with the Vandal tool, the opcodes consisting of several opcode fragments;

because the opcode fragments differ in length, the feature lengths of the contracts differ; this embodiment therefore uses the PCA algorithm to reduce the feature dimension and fix the feature length, and normalizes the opcode fragments by zero padding (the features of all other samples are padded with 0 up to the longest feature length);
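The zero-padding rule can be sketched as follows. The function name `pad_features` and the sample values are hypothetical, and the PCA step is not shown.

```python
def pad_features(features):
    """Pad every feature vector with zeros up to the longest length present."""
    longest = max(len(f) for f in features)
    return [list(f) + [0.0] * (longest - len(f)) for f in features]


# Hypothetical contract features of unequal length.
raw = [[0.5, 1.5], [1.0, 0.0, 2.0, 3.0], [4.0]]
padded = pad_features(raw)
```

After padding, every contract is represented by a vector of the same length, so the data set can be passed to PCA and then to the classifiers as a rectangular matrix.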

(3) Extracting the feature of each opcode fragment with word2vec to generate a word vector matrix, summing the word vector matrix to obtain one feature per fragment, and combining the features of all opcode fragments into a slice feature that represents the contract, thereby forming a feature data set;

(4) Applying the oversampling algorithm SMOTE, the SMOTETomek algorithm and the SMOTEENN algorithm to the feature data set to obtain three training data sets;

in this embodiment, the oversampling algorithm SMOTE is applied to the feature data set to obtain the SMOTE data set; the specific implementation comprises the following steps:

1) For each sample a of a minority class (a class whose sample count is below the threshold), calculating the Euclidean distance from a to every sample in the minority class sample set to obtain its k nearest neighbors;

2) Setting a sampling ratio according to the class imbalance ratio to determine the sampling rate N, and, for each minority sample a, randomly selecting several samples from its k nearest neighbors, each selected neighbor being denoted b;

3) For each randomly selected neighbor b, constructing a new sample c from the original sample a according to the formula c = a + rand(0, 1) × |a − b|.
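The three SMOTE steps can be sketched in plain Python as below. The sketch interpolates toward the neighbor, c = a + r × (b − a) with r drawn from [0, 1), which is the usual SMOTE convention and an assumption about how the |a − b| term in the text is meant to be applied. The function names, the toy minority set and the parameter values are hypothetical.

```python
import math
import random


def euclidean(p, q):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def k_nearest(sample, minority, k):
    """Return the k minority samples closest to `sample`, excluding itself."""
    others = [m for m in minority if m is not sample]
    return sorted(others, key=lambda m: euclidean(sample, m))[:k]


def smote(minority, k, n_per_sample, rng):
    """Create n_per_sample synthetic points for every minority sample."""
    synthetic = []
    for a in minority:
        neighbors = k_nearest(a, minority, k)
        for _ in range(n_per_sample):
            b = rng.choice(neighbors)
            r = rng.random()  # uniform in [0, 1)
            # Assumed interpolation: a point on the segment from a to b.
            synthetic.append([ai + r * (bi - ai) for ai, bi in zip(a, b)])
    return synthetic


minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_samples = smote(minority, k=2, n_per_sample=2, rng=random.Random(0))
```

Every synthetic point lies on a line segment between two existing minority samples, so the minority class is enlarged without leaving the region it already occupies.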

In this embodiment, the SMOTETomek algorithm is applied to the feature data set to obtain the SMOTETomek data set; the specific implementation comprises the following steps:

1) Generating new minority samples with the SMOTE method to obtain an expanded data set T;

2) Removing the Tomek link pairs from T to clean the data;

where a Tomek link is a connection between a pair of nearest-neighbor samples of opposite classes: given a sample pair $(x_i, x_j)$ with $x_i \in S_{maj}$ and $x_j \in S_{min}$, let $d(x_i, x_j)$ denote the distance between $x_i$ and $x_j$; if there is no sample $x_k$ such that $d(x_i, x_k) < d(x_i, x_j)$, the pair $(x_i, x_j)$ is called a Tomek link.
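A minimal sketch of Tomek link removal follows. It treats two samples as a Tomek link when they belong to different classes and each is the other's nearest neighbor. The mutual nearest-neighbor check is the common form of the definition and is slightly stricter than the one-sided condition written above, so it should be read as an assumption. The one-dimensional toy data are hypothetical.

```python
import math


def distance(p, q):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def nearest_index(i, X):
    """Index of the sample closest to X[i], excluding i itself."""
    return min((j for j in range(len(X)) if j != i), key=lambda j: distance(X[i], X[j]))


def tomek_links(X, y):
    """Return index pairs (i, j), i < j, that form Tomek links."""
    links = []
    for i in range(len(X)):
        j = nearest_index(i, X)
        if i < j and y[i] != y[j] and nearest_index(j, X) == i:
            links.append((i, j))
    return links


def remove_tomek_links(X, y):
    """Drop both members of every Tomek link, as in step 2) above."""
    drop = {index for pair in tomek_links(X, y) for index in pair}
    keep = [i for i in range(len(X)) if i not in drop]
    return [X[i] for i in keep], [y[i] for i in keep]


X = [[0.0], [0.1], [1.0], [1.1], [5.0]]
y = [0, 1, 0, 0, 1]
links = tomek_links(X, y)
X_clean, y_clean = remove_tomek_links(X, y)
```

Only the pair at 0.0 and 0.1 is removed: the sample at 5.0 has an opposite-class nearest neighbor, but that neighbor is closer to another sample, so the pair is not a link.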

In this embodiment, the SMOTEENN algorithm is applied to the feature data set to obtain the SMOTEENN data set; the specific implementation comprises the following steps:

1) Generating new minority samples with the SMOTE method to obtain an expanded data set T;

2) Predicting each sample in T with the KNN method (K is generally taken as 3), and rejecting the sample if the prediction does not match its actual class label, thereby cleaning the data;

wherein the KNN method comprises four steps: (1) preparing and preprocessing the data; (2) calculating the distance from the test sample point to every other sample point; (3) sorting the distances and selecting the K points with the smallest distance; (4) comparing the classes of the K points and, following the majority rule, assigning the test sample point to the class with the highest proportion among the K points.
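The KNN-based cleaning step can be sketched as below, with K = 3 as suggested in the text. The function names and the one-dimensional toy data are hypothetical, and ties in the vote are resolved by `Counter.most_common`, which is an implementation detail the text does not specify.

```python
import math
from collections import Counter


def distance(p, q):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def knn_predict(i, X, y, k=3):
    """Predict the class of X[i] by majority vote among its k nearest other samples."""
    order = sorted((j for j in range(len(X)) if j != i), key=lambda j: distance(X[i], X[j]))
    votes = Counter(y[j] for j in order[:k])
    return votes.most_common(1)[0][0]


def enn_clean(X, y, k=3):
    """Keep only the samples whose KNN prediction agrees with their label."""
    keep = [i for i in range(len(X)) if knn_predict(i, X, y, k) == y[i]]
    return [X[i] for i in keep], [y[i] for i in keep]


# Two clusters; the sample at 2 carries a label that disagrees with its neighbors.
X = [[0], [1], [2], [3], [50], [51], [52], [53]]
y = [0, 0, 1, 0, 1, 1, 1, 1]
X_clean, y_clean = enn_clean(X, y)
```

The sample at 2 is surrounded by class-0 points, so its prediction contradicts its label and it is rejected, while every other sample is kept.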

(5) Training the One-Vs-Rest model and the AdaBoost model with the training data sets;

in this embodiment, the training of the One-Vs-Rest model is realized by the following steps:

1) Selecting one of the classes as a positive class and making all other classes as negative classes;

2) Training an AdaBoost classifier for each classification task, and finally training six AdaBoost classifiers;

3) The results of the six classifiers are combined to provide the final result of the multi-label classification.

In this embodiment, the training of the AdaBoost model comprises the following steps:

1) Initializing the weight distribution of the training data, each training sample being given the same weight at the beginning:

$$D_1 = (w_{11}, w_{12}, w_{13}, \ldots, w_{1j}, \ldots, w_{1N}), \quad w_{1j} = \frac{1}{N}$$

wherein $w_{1j}$ denotes the initial weight of the $j$-th training sample, $N$ denotes the total number of samples, and $1 \le j \le N$;

2) Performing $M$ iterations, each iteration $m$ comprising the following steps:

a. learning from the training data set with weight distribution $D_m$ to obtain the base classifier $G_m(x)$:

$$G_m(x): x \to \{-1, +1\}$$

b. calculating the classification error rate $e_m$ of $G_m(x)$ on the training data set:

$$e_m = P(G_m(x_i) \neq y_i) = \sum_{i=1}^{N} w_{m,i} \, I(G_m(x_i) \neq y_i)$$

wherein $G_m(x_i)$ denotes the classification result of the base classifier $G_m(x)$ for the training sample $x_i$, $y_i$ denotes the true class of the training sample $x_i$, $w_{m,i}$ denotes the weight of sample $x_i$ at the $m$-th iteration, $P(\cdot)$ denotes the probability of an event, and $I(\cdot)$ equals 1 when the event in brackets is true and 0 otherwise;

c. calculating the weight $\alpha_m$ of the base classifier $G_m(x)$ in the final classifier:

$$\alpha_m = \frac{1}{2} \ln \frac{1 - e_m}{e_m}$$

wherein $e_m$ denotes the classification error rate of $G_m(x)$ on the training data set;

d. updating the weight distribution of the training data set:

$$D_{m+1} = (w_{m+1,1}, w_{m+1,2}, w_{m+1,3}, \ldots, w_{m+1,i}, \ldots, w_{m+1,N})$$

$$w_{m+1,i} = \frac{w_{m,i}}{Z_m} \exp(-\alpha_m y_i G_m(x_i)), \quad Z_m = \sum_{i=1}^{N} w_{m,i} \exp(-\alpha_m y_i G_m(x_i))$$

wherein $w_{m+1,i}$ denotes the updated weight of the $i$-th training sample after $m$ iterations, $Z_m$ denotes the normalization factor, and $\exp(\cdot)$ denotes the exponential function with the natural constant $e$ as its base;

3) Combining all the base classifiers, the final classification result being given by all the base classifiers together:

$$G(x) = \operatorname{sign}\left(\sum_{m=1}^{M} \alpha_m G_m(x)\right)$$

wherein $G_m(x)$ is a base classifier, $\alpha_m$ is the weight of the base classifier $G_m(x)$ in the final classifier, and $M$ is the number of iterations.
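The AdaBoost training loop can be sketched in plain Python as follows. The text does not name the base classifier, so the use of single-feature threshold rules (decision stumps) is an assumption, as are the function names and the ten-point toy data set. The error rate is clamped away from 0 and 1 only to keep the logarithm finite.

```python
import math


def train_stump(X, y, w):
    """Pick the (error, feature, threshold, sign) rule with the lowest weighted error."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({x[f] for x in X}):
            for sign in (1, -1):
                preds = [sign if x[f] <= t else -sign for x in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, f, t, sign)
    return best


def stump_predict(stump, x):
    """Apply a stump: `sign` when the feature is at or below the threshold."""
    _, f, t, sign = stump
    return sign if x[f] <= t else -sign


def adaboost_fit(X, y, rounds):
    """Return a list of (alpha_m, stump_m) pairs after `rounds` iterations."""
    n = len(X)
    w = [1.0 / n] * n                              # step 1): uniform initial weights
    model = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)               # step a: learn G_m under D_m
        e = min(max(stump[0], 1e-10), 1 - 1e-10)   # step b: weighted error rate e_m
        alpha = 0.5 * math.log((1 - e) / e)        # step c: classifier weight alpha_m
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, xi))
             for wi, xi, yi in zip(w, X, y)]       # step d: reweight the samples
        z = sum(w)                                 # normalization factor Z_m
        w = [wi / z for wi in w]
        model.append((alpha, stump))
    return model


def adaboost_predict(model, x):
    """Step 3): sign of the alpha-weighted vote of all base classifiers."""
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in model)
    return 1 if score >= 0 else -1


X = [[v] for v in range(10)]
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
model = adaboost_fit(X, y, rounds=3)
predictions = [adaboost_predict(model, x) for x in X]
```

No single threshold rule separates this toy data set, but the weighted vote of three stumps does, because each round raises the weights of the samples the previous stump got wrong.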

The invention realizes AdaBoost-based smart contract multi-label vulnerability detection using word2vec and AdaBoost. The scheme focuses on the opcodes of smart contracts, which consist of several opcode fragments. Because the opcodes contain the logic executed by the contract, sequence slice features of the opcodes are extracted with word2vec and PCA and used as input to the AdaBoost model for learning and training, which effectively improves the detection performance for smart contract security vulnerabilities.

The foregoing description of the preferred embodiments is not to be construed as limiting the scope of the invention, and persons of ordinary skill in the art may make substitutions or alterations without departing from the scope of the invention as set forth in the appended claims.

Claims

1. An AdaBoost-based smart contract multi-label vulnerability detection method, characterized by comprising the following steps:

step 1: extracting the bytecode of the smart contract to be detected;

step 2: decompiling the bytecode into opcodes;

step 3: extracting slice features of the opcodes;

2. The AdaBoost-based smart contract multi-label vulnerability detection method of claim 1, wherein in step 1 the bytecode is extracted with the solc command.

3. The AdaBoost-based smart contract multi-label vulnerability detection method of claim 1, wherein in step 2 the bytecode is decompiled into opcodes with the Vandal tool, the opcodes consisting of several opcode fragments.

4. The AdaBoost-based smart contract multi-label vulnerability detection method of claim 1, wherein in step 3 the feature of each opcode fragment is extracted with a word2vec model to generate a word vector matrix, the word vector matrix is summed to obtain one feature per fragment, and the features of all opcode fragments are combined into a slice feature, which represents the contract;

the word2vec model comprises a CBOW model and a skip-gram model, each with an input layer, a hidden layer and an output layer; the CBOW model predicts the middle target word from its context: the one-hot encodings of the context words of the current word are fed to the input layer, and each is multiplied by the same center-word matrix $W_{V \times N}$ to obtain a $1 \times N$ vector, where $V$ is the number of words in the vocabulary and $N$ is the dimension of the word vector; these $1 \times N$ vectors are averaged into a single $1 \times N$ vector, which is multiplied by the context matrix $U_{V \times N}$ to obtain a $1 \times V$ vector; the $1 \times V$ vector is normalized by softmax to give a $1 \times V$ probability vector over the words, and the word with the highest probability is taken as the predicted word; if the prediction does not match the actual word, the center-word matrix $W_{V \times N}$ and the context matrix $U_{V \times N}$ are corrected by the back-propagation algorithm; the skip-gram model predicts the context from the center word: it takes the one-hot encoding of the center word, a $1 \times V$ vector, as input and multiplies it by the center-word matrix $W_{V \times N}$ to obtain a $1 \times N$ vector, which is multiplied by the context matrix $U_{V \times N}$ to obtain a $V$-dimensional vector; the $V$-dimensional vector is normalized by softmax, and the word with the highest probability is taken as the word predicted by the model; if the prediction does not match the context words, the center-word matrix $W_{V \times N}$ and the context matrix $U_{V \times N}$ are corrected by the back-propagation algorithm;

$$P_{CBOW} = \sum \log p(W_t \mid W_{t-2}, W_{t-1}, W_{t+1}, W_{t+2})$$

5. The AdaBoost-based smart contract multi-label vulnerability detection method according to any one of claims 1-4, wherein the One-Vs-Rest model and the AdaBoost model are trained models, and the training process comprises the following steps:

(1) Collecting an original data set and preprocessing;

using the PCA algorithm to reduce the feature dimension and fix the feature length, and normalizing the opcode fragments by zero padding;

the training of the One-Vs-Rest model comprises the following steps:

the training of the AdaBoost model comprises the following steps:

$$D_1 = (w_{11}, w_{12}, w_{13}, \ldots, w_{1j}, \ldots, w_{1N})$$

$$G_m(x): x \to \{-1, +1\}$$

b. calculating the classification error rate $e_m$ of $G_m(x)$ on the training data set:

d. updating the weight distribution of the training data set:

wherein $w_{m+1,i}$ denotes the updated weight of the $i$-th training sample after $m$ iterations, $Z_m$ denotes the normalization factor, and $\exp(\cdot)$ denotes the exponential function with the natural constant $e$ as its base;

6. The AdaBoost-based smart contract multi-label vulnerability detection method of claim 5, wherein the oversampling algorithm SMOTE is applied to the feature data set to obtain the SMOTE data set, the specific implementation comprising the following steps:

(1) For each sample a of a minority class (a class whose sample count is below the threshold), calculating the Euclidean distance from a to every sample in the minority class sample set to obtain its k nearest neighbors;

(2) Setting a sampling ratio according to the class imbalance ratio to determine the sampling rate N, and, for each minority sample a, randomly selecting several samples from its k nearest neighbors, each selected neighbor being denoted b;

(3) For each randomly selected neighbor b, constructing a new sample c from the original sample a according to the formula c = a + rand(0, 1) × |a − b|.

7. The AdaBoost-based smart contract multi-label vulnerability detection method of claim 6, wherein the SMOTETomek algorithm is applied to the feature data set to obtain the SMOTETomek data set, the specific implementation comprising the following steps:

(1) Generating new minority samples with the SMOTE method to obtain an expanded data set T;

(2) Removing the Tomek link pairs from T to clean the data;

wherein a Tomek link is a connection between a pair of nearest-neighbor samples of opposite classes: given a sample pair $(x_i, x_j)$ with $x_i \in S_{maj}$ and $x_j \in S_{min}$, let $d(x_i, x_j)$ denote the distance between $x_i$ and $x_j$; if there is no sample $x_k$ such that $d(x_i, x_k) < d(x_i, x_j)$, the pair $(x_i, x_j)$ is called a Tomek link.

8. The AdaBoost-based smart contract multi-label vulnerability detection method of claim 6, wherein the SMOTEENN algorithm is applied to the feature data set to obtain the SMOTEENN data set, the specific implementation comprising the following steps:

(2) Predicting each sample in T with the KNN method, and rejecting the sample if the prediction does not match its actual class label, thereby cleaning the data;

9. An AdaBoost-based smart contract multi-label vulnerability detection system, characterized by comprising the following modules: