CN116150757A

CN116150757A - Smart contract unknown vulnerability detection method based on a CNN-LSTM multi-class classification model

Info

Publication number: CN116150757A
Application number: CN202211228581.5A
Authority: CN
Inventors: 彭滔; 李旭彬; 王国军; 李培强; 顾婉仪; 翟广鑫; 黎相彬
Original assignee: Guangzhou University
Current assignee: Guangzhou University
Priority date: 2022-10-08
Filing date: 2022-10-08
Publication date: 2023-05-23

Abstract

The invention discloses a method for detecting unknown smart contract vulnerabilities based on a CNN-LSTM multi-class classification model, comprising the following steps: S1, instrumenting the Ethereum client Geth; S2, obtaining a normal opcode sequence set and a vulnerability opcode sequence set through the instrumented Geth client; S3, training embedding word vectors with a pre-training model to obtain a word vector index dictionary; S4, converting the replayed opcode sequences into feature vector matrices according to the word vector index dictionary, and reducing their dimensionality with a CNN; S5, training the LSTM classification model on the dimension-reduced feature vector matrices; S6, collecting in real time, through the instrumented Geth client, the opcode sequences to be tested that are generated by transactions; S7, in the detection stage, for each opcode sequence to be tested, summing the probability values that the classification model assigns to all vulnerability categories and comparing the sum with a threshold to decide whether an unknown vulnerability is present.

Description

Smart contract unknown vulnerability detection method based on a CNN-LSTM multi-class classification model

Technical Field

The invention relates to the technical field of vulnerability detection, and in particular to a method for detecting unknown smart contract vulnerabilities based on a CNN-LSTM multi-class classification model.

Background

In recent years deep learning has developed rapidly, and researchers have begun to apply deep learning methods to smart contract vulnerability detection. Several smart contract vulnerability detection schemes that incorporate deep learning are described below.

1) Zhang et al. proposed a smart contract vulnerability detection and analysis method based on n-grams and deep learning. Here an opcode sequence is the instruction sequence obtained by mapping the bytecode produced by compiling a smart contract onto the official Ethereum instruction set; the instructions are assembly opcodes. The contract source code is compiled and the bytecode is parsed to obtain a static opcode stream, a dictionary is built from the correspondence between opcodes and hexadecimal values in the Ethereum Yellow Paper, and different flows are then executed depending on the scheme. N-gram based scheme: the opcodes are simplified, sliced with n-grams, and assembled into a feature matrix, which is then used to train different machine learning classification algorithms. Deep learning based scheme: the opcode stream is converted into an opcode sequence expressed in hexadecimal values and fed into 4 different deep learning network structures for training. Comparing the training results of the two schemes shows that the CNN+LSTM network structure gives the best vulnerability detection performance.

2) He J et al., building on the graph convolutional network (GCN) in deep learning, train a model that mimics a conventional symbolic execution expert. In the model learning stage, the team runs the symbolic execution expert on tens of thousands of smart contracts, producing thousands of high-quality opcode sequences that are fed into the next layer for training, finally yielding a fuzzing policy with higher program coverage. In the usage stage, the fuzzing policy learned by the model generates the opcode sequences to be tested, and the fuzzing results are analyzed to obtain the vulnerability detection result. The opcode sequences in this scheme are transaction executions simulated by symbolic execution and are not necessarily feasible, so their feasibility has to be checked by fuzzing.

3) Huang H D et al. translate the hexadecimal bytecode compiled from smart contracts into RGB color codes, converting each smart contract into a fixed-size image that is used as input to train a CNN (convolutional neural network) deep learning model and to perform detection. However, because this approach converts the code directly into image codes, the processing in the layers of the model may make otherwise unrelated bytecodes appear correlated, or destroy the logical relationships between contexts, so the scheme must choose a suitable CNN model structure to reduce this risk.

In the above schemes, the raw data is generally static smart contract source code or compiled bytecode, so vulnerabilities cannot be detected dynamically from the real-time execution of on-chain transactions. In addition, the feature vectors derived from such raw data lack the semantic information of the opcodes and the contextual relationships of the sequence. Moreover, the classification models can only detect a few known vulnerabilities and cannot detect unknown vulnerabilities that have not yet been discovered. To the applicant's knowledge, little research has so far addressed unknown vulnerability detection. Therefore, the invention provides a method for detecting unknown smart contract vulnerabilities based on a CNN-LSTM multi-class classification model.

Disclosure of Invention

(I) Technical problem to be solved

In view of the shortcomings of the prior art, the invention provides a method for detecting unknown smart contract vulnerabilities based on a CNN-LSTM multi-class classification model, so as to solve the problems described above.

(II) Technical solution

In order to achieve the above purpose, the present invention provides the following technical solutions:

A method for detecting unknown smart contract vulnerabilities based on a CNN-LSTM multi-class classification model comprises the following steps:

S1, instrumenting the Ethereum client Geth;

S2, replaying Ethereum block transactions, and obtaining a normal opcode sequence set and a vulnerability opcode sequence set through the instrumented Geth client;

S3, training embedding word vectors with a Word2vec pre-training model to obtain a word vector index dictionary;

S4, converting the replayed opcode sequences into feature vector matrices according to the word vector index dictionary, and reducing their dimensionality with a CNN;

S5, in the training stage, training the LSTM classification model on the dimension-reduced feature vector matrices;

S6, collecting in real time, through the instrumented Geth client, the opcode sequences to be tested that are generated by transactions;

S7, in the detection stage, for each opcode sequence to be tested, summing the probability values that the classification model assigns to all vulnerability categories and comparing the sum with a threshold to decide whether an unknown vulnerability is present.

Further, in step S1, instrumentation means inserting into the source code of the Ethereum client Geth a code segment, written in Golang, that outputs transaction data. The transaction data collected includes:

block parameters such as block number, timestamp, nonce value, root hash, and gas value; transaction information such as the account addresses and transfer amounts involved in a transaction; the address of the executed smart contract; the account balance; the opcode sequence, consisting of assembly opcodes such as PUSH1, MSTORE, CALLDATASIZE, and ISZERO together with their operands; and low-level state of the Ethereum Virtual Machine such as memory, storage, and stack.
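As an illustration of how the instrumented output might be consumed downstream, the following Python sketch parses a per-transaction trace into an opcode sequence. The JSON-lines layout, the field names (`tx`, `block`, `to`, `op`), and the `TxTrace` class are all hypothetical; the patent does not specify the output format of the inserted Golang code.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class TxTrace:
    """Hypothetical container for what one instrumented transaction emits."""
    tx_hash: str
    block_number: int
    contract: str
    opcodes: List[str] = field(default_factory=list)


def parse_trace(lines):
    """Build a TxTrace from JSON lines: one header line, then one line per EVM step."""
    header = json.loads(lines[0])
    trace = TxTrace(header["tx"], header["block"], header["to"])
    for line in lines[1:]:
        step = json.loads(line)
        trace.opcodes.append(step["op"])  # keep only the mnemonic, e.g. "PUSH1"
    return trace


# Assumed sample output; real instrumented output may look different.
sample = [
    '{"tx": "0xabc", "block": 100, "to": "0xdef"}',
    '{"pc": 0, "op": "PUSH1", "stack": []}',
    '{"pc": 2, "op": "PUSH1", "stack": ["0x80"]}',
    '{"pc": 4, "op": "MSTORE", "stack": ["0x80", "0x40"]}',
]
trace = parse_trace(sample)
print(trace.opcodes)
```

Only the opcode mnemonics are kept here because the later steps operate on opcode sequences; the remaining fields could be retained in the same record if needed.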

Further, in step S2, replay means re-executing, on a local private chain, the transactions already executed on Ethereum. The normal opcode sequence set and the vulnerability opcode sequence set are obtained through the instrumented Geth client, and together they form the dataset fed into the subsequent models. The specific flow is as follows:

S201, replaying Ethereum block transactions;

S202, obtaining the normal opcode sequence set and the vulnerability opcode sequence set through the instrumented Geth client.

Further, in step S3, the Word2vec pre-training model is a neural network from the NLP field for generating embedding word vectors; in the trained word vector index dictionary, each opcode corresponds to one multidimensional word vector. The specific flow is as follows:

S301, building a Word2vec pre-training model with the word vector dimension set to 128, the number of iterations set to 8 (n_epoch=8), and the number of samples fed to the model per batch set to 100 (batch_size=100), using the skip-gram algorithm with negative sampling;

S302, inputting the normal opcode sequence set and the vulnerability opcode sequence set into the Word2vec pre-training model, which outputs a 129 x 128 word vector index dictionary, where 129 is the number of opcode types appearing in all opcode sequences and 128 is the word vector dimension.
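To make the skip-gram step more concrete, the following Python sketch shows the two ingredients it works with: an opcode-to-index mapping and the (center, context) pairs drawn from an opcode sequence. It does not train any vectors. The window size and the function names `build_index` and `skipgram_pairs` are assumptions made for illustration; the patent does not state a window size.

```python
def build_index(sequences):
    """Map each distinct opcode to an integer index, in order of first appearance."""
    index = {}
    for seq in sequences:
        for op in seq:
            index.setdefault(op, len(index))
    return index


def skipgram_pairs(seq, window=2):
    """Return (center, context) pairs for every opcode within `window` positions."""
    pairs = []
    for i, center in enumerate(seq):
        lo, hi = max(0, i - window), min(len(seq), i + window + 1)
        pairs.extend((center, seq[j]) for j in range(lo, hi) if j != i)
    return pairs


# Toy opcode sequence; real sequences come from the instrumented client.
sequences = [["PUSH1", "PUSH1", "MSTORE", "CALLDATASIZE", "ISZERO"]]
index = build_index(sequences)
pairs = skipgram_pairs(sequences[0], window=1)  # window=1 is an assumed value
print(index)
print(pairs)
```

In a trained model each index would select one 128-dimensional row of the dictionary, so a vocabulary of 129 opcode types gives the 129 x 128 shape described above.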

Further, the specific flow of step S4 is as follows:

S401, converting each opcode in an opcode sequence into its corresponding word vector according to the word vector index dictionary, so that each opcode sequence becomes a 128 x 5000 feature vector matrix, where 128 is the word vector dimension and 5000 is the length of the opcode sequence;

S402, constructing a single-layer CNN;

S403, inputting all the multidimensional feature vector matrices into the CNN for dimension reduction, where convolution and pooling operations finally reduce each of them to a 64 x 1249 feature vector matrix.
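The patent gives the input and output shapes but not the kernel or pooling sizes. As a worked check, the Python sketch below computes the output length of an unpadded 1-D convolution followed by non-overlapping pooling. The values `kernel=5`, `pool=4`, and 64 filters are assumptions: they are one combination that reproduces the stated 64 x 1249 shape, not a disclosed configuration.

```python
def conv1d_out_len(n, kernel, stride=1):
    """Output length of a 1-D convolution without padding."""
    return (n - kernel) // stride + 1


def pool1d_out_len(n, pool):
    """Output length of non-overlapping 1-D pooling (stride equal to pool size)."""
    return n // pool


SEQ_LEN = 5000   # opcode sequence length from step S401
FILTERS = 64     # assumed number of convolution filters
KERNEL = 5       # assumed kernel size
POOL = 4         # assumed pooling size

after_conv = conv1d_out_len(SEQ_LEN, KERNEL)
after_pool = pool1d_out_len(after_conv, POOL)
shape = (FILTERS, after_pool)
print(after_conv, after_pool, shape)
```

With these assumptions the convolution shortens the sequence from 5000 to 4996 positions, and pooling by 4 gives 1249, while the 64 filters replace the 128 embedding channels.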

Further, in step S5, the normal opcode sequence set and the vulnerability opcode sequence set form the dataset fed into the LSTM neural network, and the LSTM classification model is trained as follows:

S501, constructing a single-layer LSTM neural network with the number of iterations set to 5 (n_epoch=5), sigmoid as the activation function, sparse_categorical_crossentropy as the loss function, and the number of samples fed to the model per batch set to 100 (batch_size=100);

S502, inputting the feature vector matrices reduced by the CNN into the LSTM neural network to obtain a classification model that can distinguish the various vulnerability categories.

Further, the opcode sequences to be tested that are collected in step S6 are the opcode sequences generated by transactions newly executed on Ethereum.

Further, the specific flow of step S7 is as follows:

S701, inputting each collected opcode sequence to be tested into the trained CNN-LSTM multi-class classification model, obtaining the probability value of each vulnerability category, and summing these values;

S702, comparing the probability sum with a threshold: if the sum is greater than the threshold, a new unknown vulnerability has been found; otherwise none has been found.
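The decision rule of S701 and S702 can be sketched in a few lines of Python. The function names and the convention that index 0 is the normal category are assumptions; the default threshold of 0.5 and the six categories (one normal, five known vulnerabilities) follow the embodiment described later in the document.

```python
def unknown_vulnerability_score(probs, normal_index=0):
    """Sum the probabilities of every category except the normal one."""
    return sum(p for i, p in enumerate(probs) if i != normal_index)


def is_unknown_vulnerability(probs, threshold=0.5, normal_index=0):
    """Flag the sequence when the summed vulnerability probability exceeds the threshold."""
    return unknown_vulnerability_score(probs, normal_index) > threshold


# Made-up model outputs over 6 categories (index 0 assumed to be "normal").
probs_a = [0.70, 0.10, 0.05, 0.05, 0.05, 0.05]
probs_b = [0.20, 0.30, 0.20, 0.10, 0.10, 0.10]

print(is_unknown_vulnerability(probs_a))
print(is_unknown_vulnerability(probs_b))
```

Raising the threshold makes the rule stricter: a sequence is flagged only when the model puts most of its probability mass on vulnerability categories rather than on the normal one.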

(III) Beneficial effects

Compared with the prior art, the method for detecting unknown smart contract vulnerabilities based on the CNN-LSTM multi-class classification model provided by the invention has the following beneficial effects:

1. The method supports dynamic detection of unknown smart contract vulnerabilities. On the premise that the opcode sequence of an unknown vulnerability bears some similarity to the opcode sequences of certain known vulnerabilities, the method combines the advantages of dynamic detection and deep learning: it builds a CNN-LSTM deep learning multi-class classification model, uses the trained model as the tool for unknown vulnerability detection, and improves the accuracy of unknown vulnerability discrimination by means of a user-set threshold.

2. The method uses the Word2vec pre-training model to build an embedding word vector dictionary for opcodes, which preserves the semantic information of the opcodes and the contextual relationships within an opcode sequence, so that the extracted feature vector matrix is more reasonable and the unknown vulnerability detection results are better supported.

Drawings

FIG. 1 is a flow diagram of the smart contract unknown vulnerability detection method;

FIG. 2 is a schematic diagram of the system model structure of the smart contract unknown vulnerability detection method;

FIG. 3 is a system flow diagram of the smart contract unknown vulnerability detection method.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art based on these embodiments without inventive effort fall within the scope of protection of the invention.

Examples

The embodiment of the invention discloses a method for detecting unknown smart contract vulnerabilities based on a CNN-LSTM multi-class classification model. The system model involved is as follows:

The detection system scheme is divided into 2 stages: a data preprocessing stage, and a model training and testing stage.

The data preprocessing stage focuses on word vector training and on unifying the length of the feature vectors. Because opcode sequences of truly unknown vulnerabilities are not available in a real environment, in the experimental stage the invention treats some of the known vulnerability opcode sequences as unknown vulnerability opcode sequences for detection. The invention obtains 8 kinds of opcode sequences from a database: S0 (normal sequence), S1 (reentrancy vulnerability sequence), S2 (unexpected function call vulnerability sequence), S3 (incorrect permission check vulnerability sequence), S4 (mishandled exception vulnerability sequence), S5 (missing standard event vulnerability sequence), S6 (strict balance check vulnerability sequence), and S7 (timestamp/block number dependency vulnerability sequence). These are divided into 2 parts: the known vulnerability opcode sequences with part of the normal sequences, and the unknown vulnerability opcode sequences with part of the normal sequences. The first part is used to train the model and the second part is used to detect unknown vulnerabilities. Because S1 has few samples, which could harm the accuracy of the classification model, it is treated as an unknown vulnerability by default, and one of the other 6 vulnerabilities is additionally selected as an unknown vulnerability. This gives 5 known vulnerabilities and 2 unknown vulnerabilities. All opcode sequences are pre-trained with a Word2vec neural network to obtain embedding word vectors, which fill the feature vector matrices, and the sequence lengths are unified; this effectively preserves the semantic information of the opcodes and the logical relationships between preceding and following opcodes in a sequence.

The model training and testing stage consists of two parts, training and unknown vulnerability determination. The complete model built by the method consists of an input layer, a CNN (convolutional neural network), an LSTM (long short-term memory network), a fully connected layer, and an output layer. Training phase: the known vulnerability sequences and the normal sequences are input into the model; the CNN reduces each 128 x 5000 matrix to a 64 x 1249 matrix; the LSTM is trained over multiple batches, with parameters optimized by backpropagation; and the fully connected layer and the output layer perform classification and output the model loss and accuracy. Unknown vulnerability determination phase: the feature vector matrices of the preprocessed unknown vulnerability sequences are input into the trained model, and the probability assigned to all vulnerability categories is compared with a threshold to obtain the detection result. The complete model is extensible: introducing the CNN makes it possible to process long opcode sequences, and the adjustable threshold makes the determination of unknown vulnerabilities more reliable. High accuracy at a high threshold indicates that the unknown vulnerability is strongly similar to the known vulnerabilities.

Referring to FIGS. 1-2, the method for detecting unknown smart contract vulnerabilities based on the CNN+LSTM multi-class classification model provided in this embodiment mainly comprises 6 procedures: Geth instrumentation, embedding word vector training, data preprocessing, feature vector dimension reduction, classification model training, and unknown vulnerability determination.

(1) Geth instrumentation

The invention inserts code segments into the source code of the Ethereum client Geth so that the client outputs the corresponding opcode sequence whenever it executes a transaction. In this way it collects both the opcode sequences generated by replayed transactions and the opcode sequences to be tested generated by newly executed transactions. The replayed opcode sequences are used to train the embedding word vectors and are fed into the CNN-LSTM multi-class classification model for training; the opcode sequences to be tested are used to detect unknown vulnerabilities.

(2) Training of embedding word vectors

The invention uses the Word2vec interface of the gensim library to train the embedding word vectors. The training data are the normal opcode sequences and all vulnerability opcode sequences. In the parameter settings, the word vector dimension is 128, the number of iterations is 8 (n_epoch=8), and the number of samples fed to the model per batch is 100 (batch_size=100); the skip-gram algorithm is used with negative sampling. Because the Ethereum Virtual Machine supports only 130 kinds of opcodes, deleting low-frequency opcodes would only harm the accuracy of the model, so the invention retains them. The result is a 129 x 128 word vector index dictionary, where 129 is the number of opcode types appearing in all opcode sequences and 128 is the word vector dimension.

(3) Data preprocessing

The 5 kinds of known vulnerability opcode sequences and the normal opcode sequences are stored in 6 files by category. The files are read in turn, and the sequences and their labels are each stored as a list. The data are then split into a training set and a test set through the train_test_split interface of the sklearn library, with training set : test set = 4:1; the test set is used to measure the accuracy of the model in detecting known vulnerabilities. The proportion of each category in both sets matches its proportion in the initial dataset, so that the split does not bias the model results. The dataset is tokenized by opcode, and each opcode is converted into its corresponding word vector according to the word vector index dictionary. The pad_sequences interface of the sequence library unifies the sequence length to 5000: sequences shorter than 5000 are padded with zero vectors, and opcodes beyond the 5000th are discarded. This finally yields two 128 x 5000 x num three-dimensional feature vector matrices, where num is the number of opcode sequences.
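The length unification and word vector lookup described above can be sketched in plain Python as follows. The helper names, the toy 3-dimensional vectors, and the choice to pad and truncate at the end of the sequence are assumptions; if the `pad_sequences` referred to is the Keras utility, its defaults pad and truncate at the front instead. The sketch stores one row per opcode position (length x dimension), which is the transpose of the 128 x 5000 layout named in the text.

```python
def pad_or_truncate(seq, length, pad_token=None):
    """Cut sequences longer than `length`; fill shorter ones with `pad_token` at the end."""
    return list(seq[:length]) + [pad_token] * max(0, length - len(seq))


def to_feature_matrix(seq, vectors, dim, length):
    """Return a length x dim matrix; padding positions become all-zero vectors."""
    matrix = []
    for op in pad_or_truncate(seq, length):
        matrix.append([0.0] * dim if op is None else list(vectors[op]))
    return matrix


# Toy dictionary with 3-dimensional vectors (the patent uses 128 dimensions).
toy_vectors = {"PUSH1": [0.1, 0.2, 0.3], "MSTORE": [0.4, 0.5, 0.6]}

short = to_feature_matrix(["PUSH1", "MSTORE"], toy_vectors, dim=3, length=4)
long_ = to_feature_matrix(["PUSH1"] * 6, toy_vectors, dim=3, length=4)
print(short)
print(len(long_))
```

With `dim=128` and `length=5000` every sequence yields a matrix of the same size, which is what allows the sequences to be stacked into a single three-dimensional array.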

(4) Feature vector dimension reduction

Because the feature vector matrices generated by data preprocessing are very large, a CNN is built to reduce their dimensionality before training. Convolution and pooling operations finally reduce them to a 64 x 1249 x num three-dimensional feature vector matrix, where num is the number of opcode sequences. This speeds up training without affecting the accuracy of the model.

(5) Classification model training

The invention selects an LSTM neural network as the classification model, with the number of iterations set to 5 (n_epoch=5), sigmoid as the activation function, sparse_categorical_crossentropy as the loss function, and the number of samples fed to the model per batch set to 100 (batch_size=100); the model loss and accuracy are output at the end. The classification model distinguishes 6 categories: the normal category and the 5 known vulnerability categories.
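For readers unfamiliar with the loss function named above, the following Python sketch computes sparse categorical cross-entropy by hand: labels are plain integer class indices rather than one-hot vectors, and the loss for a sample is the negative log of the probability the model assigns to the true class. The function names and example probabilities are illustrative assumptions, not output of the patented model.

```python
import math


def sparse_categorical_crossentropy(probs, label):
    """Loss for one sample: negative log of the probability given to the true class."""
    return -math.log(probs[label])


def batch_loss(batch_probs, labels):
    """Mean loss over a batch of probability vectors and integer labels."""
    losses = [sparse_categorical_crossentropy(p, y) for p, y in zip(batch_probs, labels)]
    return sum(losses) / len(losses)


# Two made-up predictions over 6 categories, with integer labels 0 and 2.
batch_probs = [
    [0.50, 0.10, 0.10, 0.10, 0.10, 0.10],
    [0.05, 0.05, 0.25, 0.25, 0.20, 0.20],
]
labels = [0, 2]
print(batch_loss(batch_probs, labels))
```

A confident correct prediction gives a loss close to 0, while a low probability on the true class gives a large loss, which is what backpropagation then reduces.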

(6) Unknown vulnerability determination

The invention sets aside two kinds of vulnerability sequences as unknown vulnerability sequences for testing. The unknown vulnerability sequences are read and stored as a list, and the pad_sequences interface of the sequence library unifies the sequence length to 5000: sequences shorter than 5000 are padded with zero vectors, and opcodes beyond the 5000th are discarded. This finally yields two 128 x 5000 x num three-dimensional vector matrices, where num is the number of opcode sequences. The matrices are then input into the classification model, which gives the predicted probabilities of the 6 categories. The probabilities of all vulnerability categories, excluding the normal category, are summed and used as the probability of an unknown vulnerability. If this probability is greater than the set threshold (threshold=0.5), a new unknown vulnerability has been found; otherwise none has been found.

The method for detecting unknown smart contract vulnerabilities based on the CNN-LSTM multi-class classification model provided by the embodiment of the invention supports dynamic detection of unknown smart contract vulnerabilities. On the premise that the opcode sequence of an unknown vulnerability bears some similarity to the opcode sequences of certain known vulnerabilities, the method combines the advantages of dynamic detection and deep learning: it builds a CNN-LSTM deep learning multi-class classification model, uses the trained model as the tool for unknown vulnerability detection, and improves the accuracy of unknown vulnerability discrimination by means of a user-set threshold.

Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims

1. A method for detecting unknown smart contract vulnerabilities based on a CNN-LSTM multi-class classification model, characterized by comprising the following steps:

S1, instrumenting the Ethereum client Geth;

2. The method for detecting unknown smart contract vulnerabilities based on a CNN-LSTM multi-class classification model according to claim 1, characterized in that: in step S1, instrumentation means inserting into the source code of the Ethereum client Geth a code segment, written in Golang, that outputs transaction data, and the collected transaction data includes:

block parameters such as block number, timestamp, nonce value, root hash, and gas value;

transaction information such as the account addresses and transfer amounts involved in a transaction;

the address of the executed smart contract;

the account balance;

the opcode sequence consisting of assembly opcodes such as PUSH1, MSTORE, CALLDATASIZE, and ISZERO together with their operands;

and low-level information of the Ethereum Virtual Machine such as memory, storage, and stack.

3. The method for detecting unknown smart contract vulnerabilities based on a CNN-LSTM multi-class classification model according to claim 1, characterized in that: in step S2, replay means re-executing, on a local private chain, the transactions already executed on Ethereum; the normal opcode sequence set and the vulnerability opcode sequence set are obtained through the instrumented Geth client, and together they form the dataset fed into the subsequent models; the specific flow is as follows:

S201, replaying Ethereum block transactions;

4. The method for detecting unknown smart contract vulnerabilities based on a CNN-LSTM multi-class classification model according to claim 1, characterized in that: in step S3, the Word2vec pre-training model is a neural network from the NLP field for generating embedding word vectors, and in the trained word vector index dictionary each opcode corresponds to one multidimensional word vector; the specific flow is as follows:

S301, building a Word2vec pre-training model, using the skip-gram algorithm with negative sampling;

S302, inputting the normal opcode sequence set and the vulnerability opcode sequence set into the Word2vec pre-training model, and outputting a word vector index dictionary.

5. The method for detecting unknown smart contract vulnerabilities based on a CNN-LSTM multi-class classification model according to claim 1, characterized in that the specific flow of step S4 is as follows:

S401, converting each opcode in an opcode sequence into its corresponding word vector according to the word vector index dictionary, so that each opcode sequence becomes a multidimensional feature vector matrix;

S402, constructing a single-layer CNN;

S403, inputting all the multidimensional feature vector matrices into the CNN for dimension reduction.

6. The method for detecting unknown smart contract vulnerabilities based on a CNN-LSTM multi-class classification model according to claim 1, characterized in that: in step S5, the normal opcode sequence set and the vulnerability opcode sequence set form the dataset fed into the LSTM neural network, and the LSTM classification model is trained as follows:

S501, constructing a single-layer LSTM neural network;

7. The method for detecting unknown smart contract vulnerabilities based on a CNN-LSTM multi-class classification model according to claim 1, characterized in that: the opcode sequences to be tested that are collected in real time in step S6 are the opcode sequences generated by transactions newly executed on Ethereum.

8. The method for detecting unknown smart contract vulnerabilities based on a CNN-LSTM multi-class classification model according to claim 1, characterized in that the specific flow of step S7 is as follows: