CN114048464A

CN114048464A - Ethereum smart contract security vulnerability detection method and system based on deep learning

Info

Publication number: CN114048464A
Application number: CN202210029518.2A
Authority: CN
Inventors: 陈钟; 关志; 李青山; 杨可静; 崔冬琪; 李悦; 董宇; 陈子明
Original assignee: Boya Chain Beijing Technology Co ltd; Peking University
Current assignee: Boya Chain Beijing Technology Co ltd; Peking University
Priority date: 2022-01-12
Filing date: 2022-01-12
Publication date: 2022-02-15
Anticipated expiration: 2042-01-12
Also published as: CN114048464B

Abstract

The invention discloses a deep-learning-based method and system for detecting security vulnerabilities in Ethereum smart contracts. The Ethereum smart contract vulnerability detection problem is modeled as an end-to-end classification detection model that judges whether smart contract source code contains a vulnerability, thereby realizing detection of smart contract security vulnerabilities. The method comprises the following steps: preprocessing the Ethereum smart contract source code data; constructing a smart contract source code semantic representation learning module, which comprises an encoding layer (encoder), a detection layer (classifier) and a model fusion output module; training the model; and, in the testing stage, using the trained smart contract source code semantic representation learning module to realize machine-learning-based detection of blockchain smart contract security vulnerabilities, effectively improving the detection performance for Ethereum smart contract security vulnerabilities.

Description

Ethereum smart contract security vulnerability detection method and system based on deep learning

Technical Field

The invention belongs to the technical field of information security, relates to cyberspace information security technology, and particularly relates to a method and a system for detecting security vulnerabilities in Ethereum smart contracts based on machine learning/deep learning.

Background

While smart contracts provide flexibility and scalability for various fields and services on the blockchain, their security problems have also emerged over the past decade. To ensure the security of smart contracts, many security analysis tools have been developed, most of which target specific, known types of attack. However, as the complexity of smart contract source code increases, existing tools that analyze contract security only through shallow features and metrics such as LOC (lines of code) and n-grams can no longer meet practical requirements. Meanwhile, the large amount of open-source contract source code offers researchers a new approach, namely using machine learning and deep learning methods to mine the various patterns present in code. Learning-based methods have achieved good results in computer vision and natural language processing, and cyberspace security researchers have begun to apply data-driven methods to problems such as vulnerability detection and vulnerability discovery. Unlike detection methods based on manually designed rules and hard-coded features, machine learning methods, and deep learning methods in particular, can mine pattern features automatically: they extract context features directly from the input source code data without manually predefined strategies and features, and can therefore be used to detect unknown potential threats in code. However, the existing machine-learning-based detection techniques still lack a smart contract security vulnerability detection technique that is both highly effective and highly efficient.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention provides a smart contract security vulnerability detection method and system (device) based on machine learning/deep learning, which are used for detecting potential threats in Ethereum smart contract source code (Solidity source code).

The invention provides a deep-learning-based Ethereum smart contract security vulnerability detection model (algorithm), which models the smart contract vulnerability detection problem as a classification problem and realizes an end-to-end detection method. The method first extracts the relevant contract segments from the abstract syntax tree constructed from the source code, then extracts patterns through a deep learning encoder and generates a semantic embedding vector that expresses vulnerability semantic information. The semantic embedding vector is used as the input of a subsequent fully connected detection network, whose final softmax layer outputs the probability that a sample belongs to the vulnerable class.

The technical scheme provided by the invention is as follows:

A machine-learning-based smart contract security vulnerability detection method models the smart contract vulnerability detection problem as an end-to-end classification detection model and judges whether smart contract source code contains a vulnerability, thereby realizing detection of smart contract security vulnerabilities; the method comprises the following steps:

1) constructing a data preprocessing module for preprocessing the Ethereum smart contract source code data; this comprises the following steps:

11) constructing the abstract syntax tree corresponding to the Ethereum smart contract source code;

the input smart contract source code is processed and converted into an abstract syntax tree;

12) extracting the relevant smart contract fragments from the abstract syntax tree and converting them into a word (token) sequence;

a sequence generation algorithm is designed to extract the semantic modules associated with vulnerability detection from the abstract syntax tree and finally convert them into a word (token) sequence. Specifically, function-level code segments of the abstract syntax tree are extracted so as to retain the syntax information related to program control flow, and are converted into token sequences.

The word sequence obtained from each contract sample is taken as the input of the smart contract source code semantic representation learning module.

In a specific implementation, the sequence generation algorithm extracts the function-level code segments of the abstract syntax tree so as to retain the syntax information related to program control flow and converts them into a token sequence, as follows:

121) first, a depth-first traversal is performed on the abstract syntax tree; when the traversal reaches a subtree whose type is a function definition, that subtree is taken as a code segment to be analyzed;

122) then a depth-first traversal is performed on the subtree to be analyzed: if a key-value pair is a minimal nested unit of string type, the key string and the value string are each returned as a word, and recursion continues on any value that is an object of dictionary or list type.

Through the above process, each function-definition code segment is converted into a sequence. For a smart contract sample containing several function-definition subtrees, the sequences generated from all code-segment subtrees are concatenated to obtain the final token sequence representation.
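The two traversals in steps 121) and 122) can be sketched as follows. This is a minimal illustration and not the patented implementation: it assumes the abstract syntax tree is available as nested Python dictionaries and lists (for example, loaded from a JSON export), and the node-type label `FunctionDefinition`, the `type` key and all function names are assumptions made for the example. Values that are neither strings, dictionaries nor lists are simply skipped here.

```python
from typing import Any, Iterator, List

# Assumed label of function-definition nodes in the JSON-style AST.
FUNCTION_NODE_TYPE = "FunctionDefinition"


def find_function_subtrees(node: Any) -> Iterator[dict]:
    """Step 121: depth-first search for subtrees typed as function definitions."""
    if isinstance(node, dict):
        if node.get("type") == FUNCTION_NODE_TYPE:
            yield node  # the whole subtree becomes one code segment
            return
        for value in node.values():
            yield from find_function_subtrees(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_function_subtrees(item)


def subtree_to_tokens(node: Any) -> List[str]:
    """Step 122: emit key and value for string pairs, recurse into dicts/lists."""
    tokens: List[str] = []
    if isinstance(node, dict):
        for key, value in node.items():
            if isinstance(value, str):
                tokens.extend([key, value])
            elif isinstance(value, (dict, list)):
                tokens.extend(subtree_to_tokens(value))
    elif isinstance(node, list):
        for item in node:
            tokens.extend(subtree_to_tokens(item))
    return tokens


def contract_to_tokens(ast: Any) -> List[str]:
    """Concatenate the sequences of all function-definition subtrees."""
    tokens: List[str] = []
    for function_subtree in find_function_subtrees(ast):
        tokens.extend(subtree_to_tokens(function_subtree))
    return tokens
```

Calling `contract_to_tokens` on the parsed tree of one contract yields a single flat token list, which is the form expected by the encoders described later.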

13) designing a sampler for the training stage, which uses a weight-based sampling method to generate the training data set;

The data of the training stage and the testing stage are processed differently: the data of the training stage is divided into a training set and a validation set, and the data of the testing stage is used as the test set. The data set of the invention contains two classes, a minority class and a majority class, so the task is a binary imbalanced classification problem. In this problem, the minority class consists of unsafe contract code with vulnerability threats, and the majority class consists of safe, standard contract code. Because data imbalance adversely affects model convergence during training, a sampling module is also designed for the training stage.

The weight-based sampling method is specifically as follows: the sampling weight $w_i$ of each smart contract source code sample $i$ is calculated as

$$w_i = \frac{1}{N_{c_i}}$$

where $N_{c_i}$ is the total number of samples of the class $c_i$ to which sample $i$ belongs; that is, the weight is the reciprocal of the total number of samples of that class. This weight-based sampling algorithm is applied only when generating the training data set in the training stage. The reason is that the training set is used to learn the patterns of positive and negative samples, and data imbalance interferes with the convergence of the learning model, so sampling is used when iterating over the training set. The validation set and the test set are used to verify the model's performance and generalization ability on the evaluation metrics, so the original true distribution of the data must be preserved for them.
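A minimal sketch of this weighting, assuming class labels are given as a plain list and that training indices are drawn with replacement (the text does not say whether sampling is with or without replacement, so that choice and the function names are assumptions of the example):

```python
import random
from collections import Counter
from typing import List, Sequence


def sample_weights(labels: Sequence[int]) -> List[float]:
    """Weight of each sample = 1 / (number of samples in its class)."""
    class_counts = Counter(labels)
    return [1.0 / class_counts[label] for label in labels]


def draw_training_indices(labels: Sequence[int], num_draws: int,
                          seed: int = 0) -> List[int]:
    """Draw sample indices for training, with replacement (an assumption)."""
    rng = random.Random(seed)
    weights = sample_weights(labels)
    return rng.choices(range(len(labels)), weights=weights, k=num_draws)
```

With these weights, the total weight of each class is 1, so both classes are drawn with roughly equal frequency regardless of how imbalanced the raw data is. The validation and test sets would be iterated without this sampler.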

2) constructing the smart contract source code semantic representation learning module, which comprises an encoding layer (encoder), a detection layer (classifier) and a model fusion output module.

The encoding layer converts an input token sequence into a fixed-dimension vector, which is the semantic vector of the sample (smart contract source code); the detection layer learns a classifier that gives the probability that each sample is a white (benign) sample or a black (vulnerable) sample. The invention designs two main semantic coding models: a coding model based on bidirectional Long Short-Term Memory (LSTM) and a coding model based on a pre-trained model. The coding model based on the bidirectional LSTM neural network mines vulnerability-related semantic patterns directly from the smart contract training data; the coding model based on a pre-trained model is fine-tuned on the downstream task using the smart contract data set to obtain a classification model for the smart contract vulnerability detection task. Each classification network outputs a probability score that a sample is a white sample or a black sample.

The model fusion output module combines the probability scores output by the different types of classification networks to form the final probability output score.

21) Bidirectional LSTM-based coding model

In the coding model based on the bidirectional LSTM network, the encoding layer mainly comprises a word embedding layer and a bidirectional LSTM layer. The word embedding layer converts each input token into a low-dimensional vector, commonly referred to as a distributed representation of the word; the tokens are then fed, in the order in which they occur in the sequence, as the inputs of the LSTM at the corresponding time steps. To obtain contextual semantics, the bidirectional LSTM layer learns semantic information from front to back and from back to front at the same time. In addition, to obtain the abstract semantics of the sentence, a stacked bidirectional LSTM structure is used to learn high-level semantics.

To map each token to its embedding vector, a one-to-one mapping from words to integers is first constructed as a dictionary. For the bidirectional LSTM model, all tokens contained in the collected Solidity smart contract source code corpus form the dictionary; for the pre-trained model, the dictionary of the pre-trained model itself is used. Each word is then mapped to an integer according to the dictionary and mapped to a fixed-dimension vector through a matrix.

The stacked bidirectional LSTM layer accepts the word vector sequence as input. The entire module can be divided into a forward encoding layer and a backward encoding layer, which model the input token sequence from front to back and from back to front, respectively. The forward layer is a network of stacked LSTM layers with tanh as the activation function; the input of the first layer is the word vector at the corresponding time step, and the input of each subsequent layer is the hidden state of the previous layer at the corresponding time step. In addition, dropout is added to the stacked LSTM structure to avoid overfitting; dropout can be understood as a regularization method that "deactivates" neurons (sets their output to 0) with a certain probability. In the model structure of the invention, the dropout probability between LSTM layers is 0.5. The backward layer has a similar structure, except that it accepts the word vector sequence in the direction opposite to that of the forward layer. In order to obtain a semantic vector containing contextual semantic information, the invention joins the last layer of the forward layer and of the backward layer and selects the last hidden state as the final semantic vector.

The classifier for the bidirectional LSTM network uses several fully connected layers and takes as input the semantic vector produced by the preceding encoding module. Because the task is binary classification, the output layer has 2 neurons, and the final classification probabilities are obtained by applying the softmax function to the output layer result.
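As a worked form of this final step, writing the two output-layer values as $z_0$ and $z_1$ (symbols introduced here for illustration only; they do not appear in the original text), the softmax calculation gives the two class probabilities:

```latex
p_k = \frac{e^{z_k}}{e^{z_0} + e^{z_1}}, \qquad k \in \{0, 1\}, \qquad p_0 + p_1 = 1
```

For example, equal output values $z_0 = z_1$ give $p_0 = p_1 = 0.5$, and the class with the larger output value always receives the larger probability.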

Through the learning of the stacked bidirectional LSTM, the extracted abstract syntax tree is modeled at both word granularity and sentence granularity, finally yielding an embedding vector that contains semantic information. The deep model strengthens the representation capability in code representation learning, and LSTM sequence modeling abstracts the semantics of the whole sentence into a compact vector, which is more robust than a simple bag-of-words model. The bidirectional model learns the preceding and following context at the same time, which to some extent alleviates the forgetting of information from the extracted subtree caused by sequence length. The stacked model can be understood as a form of regularization, such that the representation vector output by the higher-level network contains less noise. The final semantic vector obtained by the stacked bidirectional LSTM encoder contains rich and abstract semantic information, so the classifier can output the probability that the sample contains a threat through several fully connected layers.

22) Coding model based on pre-training model

Sequence models based on RNNs and their variants have been used in the field of code learning, in particular for source-code-based vulnerability detection and defect localization tasks. The invention applies a pre-trained model as the coding model for the problem of smart contract source code vulnerability detection. Considering the differences between programming languages and natural language, the invention fine-tunes the pre-trained model on the specific downstream task.

When the pre-trained model is fine-tuned, a token sequence is input and processed into the form required by BERT (Bidirectional Encoder Representations from Transformers). Since the invention models smart contract source code vulnerability mining as a classification problem, predefined special tokens are added before and after the sequence (the [CLS] token for aggregation and the [SEP] token for separation).

For an input sequence, the tokenizer of the pre-trained model maps each word to its ID in the dictionary of the pre-trained model, pads (or truncates) the sequence to a fixed length, and generates an attention mask that distinguishes padded positions from non-padded ones. The data is then input into a BERT model composed of multiple Transformer encoder layers.
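The input preparation described above can be illustrated with a simplified stand-in for a tokenizer. This sketch is not the tokenizer of any particular pre-trained model: it maps whole tokens rather than sub-word pieces, and the vocabulary, the special-token IDs (`pad_id`, `unk_id`, `cls_id`, `sep_id`) and the function name are all assumptions made for the example.

```python
from typing import Dict, List, Sequence, Tuple


def encode_for_bert(tokens: Sequence[str], vocab: Dict[str, int], max_len: int,
                    pad_id: int = 0, unk_id: int = 1,
                    cls_id: int = 2, sep_id: int = 3) -> Tuple[List[int], List[int]]:
    """Map tokens to IDs, add [CLS]/[SEP], pad or truncate, build the mask."""
    # Reserve two positions for the special tokens, truncating if necessary.
    body = [vocab.get(token, unk_id) for token in tokens][: max_len - 2]
    input_ids = [cls_id] + body + [sep_id]
    attention_mask = [1] * len(input_ids)  # 1 marks a real (non-padded) position
    padding = max_len - len(input_ids)
    return input_ids + [pad_id] * padding, attention_mask + [0] * padding
```

Both returned lists always have length `max_len`, so samples can be stacked into batches; the mask lets the model ignore the padded tail.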

The BERT coding model as a whole produces a vector sequence of the same length as the input sequence, in which each token is represented by a vector of a certain dimension; in a specific implementation, the dimension of the pre-trained model used is 768.

In order to increase the extensibility of the whole system, a selector module is designed after the BERT output. The selector module generates and outputs different semantic vectors from the output layer, namely the CLS vector, the SEP vector or the mean. It implements three semantic vector generation methods: outputting the output position corresponding to the [CLS] token of the input; outputting the output position corresponding to the [SEP] token of the input; and outputting the average over the positions of the output, which is the method used in implementations.
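The three selection strategies can be sketched as follows, with the encoder output represented as a plain list of per-position vectors. The function name, the `sep_id` value and the choice to average only over non-padded positions (using the attention mask) are assumptions of this example; the text only says that the average over the output positions is taken.

```python
from typing import List, Sequence


def select_semantic_vector(hidden_states: Sequence[Sequence[float]],
                           input_ids: Sequence[int],
                           attention_mask: Sequence[int],
                           mode: str, sep_id: int = 3) -> List[float]:
    """Pick one semantic vector from the per-position encoder outputs."""
    if mode == "cls":
        # [CLS] is assumed to be the first position of the input.
        return list(hidden_states[0])
    if mode == "sep":
        return list(hidden_states[list(input_ids).index(sep_id)])
    if mode == "mean":
        # Assumption: padded positions are excluded from the average.
        kept = [h for h, m in zip(hidden_states, attention_mask) if m == 1]
        dim = len(hidden_states[0])
        return [sum(h[d] for h in kept) / len(kept) for d in range(dim)]
    raise ValueError(f"unknown selector mode: {mode}")
```

Keeping the strategy behind a single `mode` argument means the downstream classifier does not change when a different semantic vector is chosen.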

The output of the selector module is input into a classifier, and supervised fine-tuning is performed using a detection network composed of two fully connected layers and a softmax function, where the activation function of the fully connected layers is ReLU.
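A forward pass through such a detection head can be sketched in plain Python. The weights in the usage below are arbitrary illustrative values, the function names are hypothetical, and applying ReLU only after the first fully connected layer (not before the softmax) is an assumption of this sketch.

```python
import math
from typing import List, Sequence


def dense(x: Sequence[float], weights: Sequence[Sequence[float]],
          bias: Sequence[float]) -> List[float]:
    """Fully connected layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(values: Sequence[float]) -> List[float]:
    return [max(0.0, v) for v in values]


def softmax(logits: Sequence[float]) -> List[float]:
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def detect(x, w1, b1, w2, b2) -> List[float]:
    """Two fully connected layers followed by softmax over the two classes."""
    hidden = relu(dense(x, w1, b1))
    return softmax(dense(hidden, w2, b2))
```

The two returned values sum to 1 and can be read as the probabilities of the two classes; during fine-tuning the weights would be learned jointly with the encoder rather than fixed by hand.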

Because only the single downstream task of vulnerability detection is performed on the basis of the pre-trained model, the method does not distinguish whether each network layer in the BERT model is shared or task-specific; instead, gradients are back-propagated to every layer during fine-tuning, and no layer is frozen.

23) Model fusion

A model fusion module is designed. Its input is the probability scores output by the preceding classification models, and these scores serve as the features from which the model fusion module learns. The outputs of the two classification models are fused to give the classification result, thereby realizing machine-learning-based smart contract security vulnerability detection.
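The text does not specify which learner the fusion module uses. As one possible illustration, the sketch below fits a small logistic regression on the probability scores of the base classifiers with plain gradient descent; the choice of logistic regression, the hyperparameters and the function names are all assumptions of this example, not the method of the invention.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fuse(weights: Sequence[float], bias: float, scores: Sequence[float]) -> float:
    """Final probability that a sample is vulnerable, from the base scores."""
    return sigmoid(sum(w * s for w, s in zip(weights, scores)) + bias)


def train_fusion(score_rows: Sequence[Sequence[float]], labels: Sequence[int],
                 lr: float = 1.0, epochs: int = 2000) -> Tuple[List[float], float]:
    """Fit logistic regression on base-model scores (full-batch gradient descent)."""
    weights = [0.0] * len(score_rows[0])
    bias = 0.0
    n = len(score_rows)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for scores, label in zip(score_rows, labels):
            error = fuse(weights, bias, scores) - label
            for j, s in enumerate(scores):
                grad_w[j] += error * s
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias
```

Each row of `score_rows` would hold the vulnerable-class probability from the LSTM-based classifier and from the pre-trained-model classifier for one sample; the fitted weights then decide how much each base model contributes to the final score.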

3) In the training stage, the training data set is input into the constructed smart contract source code semantic representation learning module, and the forward process of the neural network is defined. After the network structure is defined, the model is trained through gradient back-propagation, using mini-batch gradient descent to iterate over batches of data and optimize the defined objective function. Training stops once a given condition is met, and the model parameters are saved to obtain the trained smart contract source code semantic representation learning module.
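Two pieces of this training procedure can be sketched independently of any deep learning framework: splitting the data into mini-batches, and deciding when to stop. The text only says that training stops after a certain condition is met; stopping when the validation loss has not improved for a number of consecutive epochs (the `patience` parameter) is an assumption of this example, as are the names used.

```python
from typing import Iterator, List, Sequence


def minibatches(samples: Sequence, batch_size: int) -> Iterator[List]:
    """Yield consecutive batches; the last one may be smaller."""
    for start in range(0, len(samples), batch_size):
        yield list(samples[start:start + batch_size])


class EarlyStopping:
    """Assumed stopping rule: no validation improvement for `patience` epochs."""

    def __init__(self, patience: int):
        self.patience = patience
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def should_stop(self, validation_loss: float) -> bool:
        if validation_loss < self.best_loss:
            self.best_loss = validation_loss
            self.bad_epochs = 0  # improvement: reset the counter
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

In a full training loop, each epoch would iterate over `minibatches(...)` of the sampled training set, run the forward pass and a gradient step per batch, then evaluate on the validation set and save the parameters whenever the validation loss improves.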

4) In the testing stage, only the model parameters are loaded, forward computation is performed, and the network's classification result for the test data is output.

The invention also provides a device for implementing the above machine-learning-based smart contract security vulnerability detection method, comprising a data preprocessing module and a semantic representation learning module, wherein:

the data preprocessing module comprises an abstract syntax tree generation module, a sequence generation module and a sampler; the abstract syntax tree generation module is used for constructing the abstract syntax tree corresponding to the Ethereum smart contract source code; the sequence generation module is used for extracting the function-level code segments of the abstract syntax tree so as to retain the syntax information related to program control flow and converting them into a token sequence; the sampler is used for generating the training data set in the training stage through a weight-based sampling method;

the smart contract source code semantic representation learning module comprises an encoding layer, a detection layer and a model fusion output module. The encoding layer of the semantic representation module encodes an input word sequence into a fixed-length semantic vector, which is then input into the detection layer (classifier) composed of several fully connected layers; in the detection layer (classifier), each classification network outputs a probability score that a sample is a white sample or a black sample. The model fusion output module combines the outputs of the different types of classification networks to form the final probability output score.

Compared with the prior art, the invention has the beneficial effects that:

The invention provides a machine-learning-based smart contract security vulnerability detection method and device for detecting security vulnerabilities in Ethereum smart contract source code (Solidity source code). The method and device extract context features directly from the input source code data without manually predefined strategies and features, so they can be used to detect unknown potential threats in code and effectively improve the detection performance for smart contract security vulnerabilities.

Drawings

Fig. 1 is a block diagram of the structure of the machine-learning-based smart contract security vulnerability detection apparatus according to the present invention.

FIG. 2 is a block diagram of a method for preprocessing data according to the present invention.

Fig. 3 is a schematic diagram of a network structure of a stacked bidirectional LSTM encoder established in an embodiment of the present invention.

Fig. 4 is a schematic diagram of the structure of the encoder based on a pre-trained model according to an embodiment of the present invention.

Fig. 5 is a block flow diagram of model fusion in an embodiment of the invention.

Detailed Description

The invention will be further described by way of examples, without in any way limiting the scope of the invention, with reference to the accompanying drawings.

The invention provides a machine-learning-based smart contract security vulnerability detection method and device, comprising data preprocessing and semantic representation learning. The smart contract vulnerability detection problem is modeled as a classification problem with end-to-end detection, which judges whether given smart contract source code contains a vulnerability. The method first extracts the relevant contract segments from the abstract syntax tree constructed from the source code, then extracts patterns through a deep learning encoder and generates a semantic embedding vector that expresses vulnerability semantic information. The semantic embedding vector is used as the input of a subsequent fully connected detection network, whose final softmax layer outputs the probability that a sample belongs to the vulnerable class.

Because machine learning methods, and deep learning methods in particular, can mine pattern features automatically, they extract context features directly from the input source code data without manually predefined strategies and features, and are therefore able to detect unknown potential threats in code. In addition, the deep learning architecture allows smart contract analysis to be processed in batches, so that both the effectiveness and the efficiency of smart contract security vulnerability detection can be improved.

First, constructing the machine-learning-based smart contract security vulnerability detection apparatus

Fig. 1 shows the structure of the machine-learning-based smart contract security vulnerability detection apparatus according to the present invention. The input of the whole system is Ethereum smart contract source code (Solidity source code). The data preprocessing module first constructs the abstract syntax tree corresponding to the source code; a sequence generation algorithm then extracts the semantic modules associated with vulnerability detection from the syntax tree and finally converts them into a word (token) sequence. The word sequence obtained from each contract sample is used as the input of the smart contract source code semantic representation learning module, which comprises an encoding layer, a detection layer and a model fusion output module. The encoding layer of the semantic representation module encodes the input sequence into a fixed-length semantic vector, which is then input into a detection layer (classifier) composed of several fully connected layers, and each classification network outputs a probability score that a sample is a white sample or a black sample. The model fusion module combines the outputs of the different types of classification networks to form the final probability output score.

Second, preprocessing module

The preprocessing module as a whole converts the input smart contract source code text into the sequence required as input by the deep learning model. The data preprocessing module mainly comprises three sub-modules: the abstract syntax tree generation module, the sequence generation module and the sampler. The overall flow of data preprocessing is shown in Fig. 2.

The abstract syntax tree generation module processes the input smart contract source code and converts it into an abstract syntax tree.

The sequence generation module extracts the function-level code segments of the abstract syntax tree so as to retain the syntax information related to program control flow and converts them into token sequences. First, a depth-first traversal is performed on the abstract syntax tree; when the traversal reaches a subtree whose type is a function definition, that subtree is taken as a code segment to be analyzed. Then a depth-first traversal is performed on the subtree to be analyzed: if a key-value pair is a minimal nested unit of string type, the key string and the value string are each returned as a word, and recursion continues on any value that is an object of dictionary or list type. In this way, each function-definition code segment is converted into a sequence. For a smart contract sample containing several function-definition subtrees, the final token sequence representation is obtained by concatenating the sequences generated from all code-segment subtrees.

For the data in the training phase and the testing phase, there are different data processing modes: the data of the training phase is divided into a training set and a verification set, and the data of the testing phase is used as a testing set.

The problem studied by the invention suffers from class imbalance: the data set contains two classes, a minority class with few samples and a majority class with relatively many samples. Because imbalanced data adversely affects model convergence during training, a sampling module is also designed for the training stage. In the binary imbalanced classification problem analyzed by the invention, the minority class consists of unsafe contract code with vulnerability threats, and the majority class consists of safe, standard contract code. The invention introduces a weight-based sampling method in the training stage, in which the sampling weight $w_i$ of each sample $i$ in the sampler is calculated as

$$w_i = \frac{1}{N_{c_i}}$$

where $N_{c_i}$ is the total number of samples of the class $c_i$ to which sample $i$ belongs; that is, the weight is the reciprocal of the total number of samples of that class. This weight-based sampling algorithm is applied only when generating the training data set in the training stage. The reason is that the training set is what lets the model learn the patterns of positive and negative samples, and data imbalance interferes with the convergence of the learning model, so sampling is needed when iterating over the training set. The validation set and the test set are used to verify the model's performance and generalization ability on the evaluation metrics, so the original true distribution of the data must be preserved for them.

Third, semantic coding model

The network structure of the semantic coding model used by the invention can be divided into two parts: a semantic representation module and a detection module. The code semantic representation module is in effect an encoder module that converts an input token sequence into a fixed-dimension vector, which is the semantic vector of the sample (smart contract source code). The detection module learns a classifier that gives the probability that each sample is a white sample or a black sample. The invention designs two major semantic coding models, both of which model smart contract vulnerability detection as a classification task.

3.1 problem modeling

In order to apply a deep learning model to vulnerability detection on smart contract source code, smart contract vulnerability detection is modeled as a classification problem. If the specific vulnerability type is not considered and only the presence or absence of a vulnerability matters, vulnerability detection is a binary classification task.

The semantic representation learning module learns from the training data using a supervised method.

3.2 semantic representation learning

The semantic representation learning module obtains the high-level abstract semantics of the smart contract source code through a deep model and represents each input sample as a semantic vector. This vector contains the semantic information of the sentences extracted from the smart contract abstract syntax tree and is used as the input of the detection module to obtain the result of the target classification task.

The two major semantic representation learning modules implemented by the invention are introduced below. The first is a coding model based on a bidirectional Long Short-Term Memory (LSTM) neural network, which mines vulnerability-related semantic patterns directly from the smart contract training data. The second is a coding model based on a pre-trained model, which already contains general features learned on other source code data sets and is then fine-tuned on the downstream task using the smart contract data set to obtain a classification model for the smart contract vulnerability detection task.

3.2.1 bidirectional LSTM-based coding model

The first coding model adopted by the invention is based on a bidirectional LSTM network, and the coding layer mainly comprises a word embedding layer and a bidirectional LSTM layer. The word embedding layer converts the input tokens into a low-dimensional vector, also commonly referred to as a distributed representation of words, and then each token is sequentially used as the input of the LSTM corresponding to the time step according to the order of their occurrence in the sequence. To obtain contextual semantics, the bi-directional LSTM layer learns semantic information from front to back and back to front simultaneously by constructing bi-directional LSTM. In addition, in order to obtain abstract semantics of sentences, a stacked bidirectional LSTM structure is also used to learn high-level semantics, and the stacked bidirectional LSTM encoder network structure is shown in fig. 3.

In the stacked structure, the input of each layer above the first is the hidden state of the previous layer at the corresponding time step. In addition, dropout is added to the stacked LSTM structure to avoid overfitting; dropout can be understood as a regularization method that "deactivates" neurons (sets their output to 0) with a certain probability. The dropout probability between LSTM layers in the model structure of the invention is 0.5. The forward and backward layers have similar structures, except that the backward layer receives the word vector sequence in the opposite direction to the forward layer. To obtain a semantic vector containing contextual semantic information, the invention concatenates the outputs of the last stacked layer of the forward layer and of the backward layer, and selects the last hidden state as the final semantic vector.

The classifier consists of several fully connected layers and takes as input the semantic vector produced by the preceding coding module. Because the task is binary classification, the output layer has 2 neurons, and the final classification probabilities are obtained by applying the softmax function to the output layer result.
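As an illustration of the last step only, the following minimal sketch applies a numerically stable softmax to two output-layer scores. The score values are hypothetical and are not taken from the invention; the sketch uses only the Python standard library rather than the deep learning framework used in the embodiment.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw output-layer scores."""
    m = max(logits)                       # subtract the max to avoid overflow
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores of the two output neurons: (white sample, black sample).
probs = softmax([0.3, 1.7])
```

The two returned values sum to 1, and the second one can be read as the probability that the sample is a black sample.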

Through the learning of the stacked bidirectional LSTM, the extracted abstract syntax tree is modeled at both word granularity and sentence granularity, and an embedded vector containing semantic information is finally obtained. The deep model strengthens the representation capability in code representation learning, and LSTM sequence modeling abstracts the semantics of a whole sentence into a compact vector, which is more robust than a simple bag-of-words model. The bidirectional model learns the preceding and following context at the same time, which alleviates, to a certain extent, the forgetting of information from the extracted subtree caused by long sequences. The stacked model can be understood as a regularized model, so that the representation vectors output by the higher layers contain less noise. The final semantic vector obtained by the stacked bidirectional LSTM encoder contains rich and abstract semantic information, and after several fully connected layers the detection layer outputs the probability that a sample contains a vulnerability.

3.2.2 coding model based on Pre-training model

Sequence models based on RNN and its variants are mature and widely used models in the field of natural language processing, but two problems generally exist in the field of code learning, particularly in vulnerability detection and defect localization tasks based on source codes: firstly, the lack of the labeled data set makes the quality of supervised learning possibly affected; secondly, it is difficult to fully utilize the open source un-labeled code data set to learn the domain knowledge.

In recent years, a series of methods based on pre-trained language models in the NLP field have set new best results on many tasks by first pre-training on large-scale data in a specific field and then fine-tuning for specific downstream tasks. This approach also suggests a way to learn general representations from unlabeled data in the field of code representation learning. This section introduces the design and implementation details of applying a pre-training model as the coding model for intelligent contract source code vulnerability detection. In consideration of the difference between programming languages and natural language, the invention fine-tunes on the specific downstream task on the basis of the pre-training model.

When the pre-training model is fine-tuned, the input is a token sequence, as for the LSTM model, but the sequence must be processed into the form required by the BERT model. Because the invention models intelligent contract source code vulnerability mining as a classification problem, the predefined special tokens [CLS] and [SEP] are added before and after the sequence, respectively.

For an input sequence, the tokenizer of the pre-training model maps each word to the corresponding ID in the pre-training model dictionary and then pads (or truncates) the sequence to a fixed length; the tokenizer also generates an attention mask to distinguish padded words from non-padded words. The data are then input into a model formed by multi-layer Transformer encoders. The structure of the encoder based on the pre-training model is shown in FIG. 4.
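The input preparation described above can be sketched as follows. This is a simplified illustration, not the tokenizer of the pre-training model: it looks up whole tokens in a toy dictionary instead of splitting them into subwords, and the dictionary contents, the [PAD] and [UNK] entries, and the function name `encode` are assumptions made for the example.

```python
def encode(tokens, vocab, max_len):
    """Wrap tokens in [CLS]/[SEP], map them to IDs, pad or truncate, build the mask."""
    body = tokens[: max_len - 2]              # leave room for the two special tokens
    wrapped = ["[CLS]"] + body + ["[SEP]"]
    ids = [vocab.get(t, vocab["[UNK]"]) for t in wrapped]
    mask = [1] * len(ids)                     # 1 marks a real (non-padded) position
    pad = max_len - len(ids)
    ids += [vocab["[PAD]"]] * pad
    mask += [0] * pad                         # 0 marks a padded position
    return ids, mask

# Hypothetical toy dictionary; a real pre-training model ships its own.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "FunctionDefinition": 4, "name": 5, "withdraw": 6}

ids, mask = encode(["FunctionDefinition", "name", "withdraw", "payable"],
                   toy_vocab, max_len=8)
# ids  -> [2, 4, 5, 6, 1, 3, 0, 0]
# mask -> [1, 1, 1, 1, 1, 1, 0, 0]
```

The token "payable" is absent from the toy dictionary and is therefore mapped to the [UNK] ID in this sketch.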

The vector sequence with the same length as the input sequence is obtained through the whole coding model, each token in the sequence is a vector with a certain dimension, and the dimension of the pre-training model used in the invention is 768.

Experiments on the intelligent contract vulnerability detection problem show that using the mean of the vectors at all positions of the BERT output gives better results on this problem than using the final hidden state at the [CLS] position as the semantic vector. To increase the scalability of the whole system, a selector module is placed after the BERT output. The selector module implements three semantic vector generation methods: the output at the [CLS] position; the output at the [SEP] position; and the mean of the outputs at all positions, which is the one used in the implementation.
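A minimal sketch of such a selector is given below, operating on plain Python lists rather than framework tensors. Two points are assumptions of the sketch and are not stated in the text: padded positions are excluded from the mean, and the [SEP] token is taken to be the last non-padded position. The function name `select` and the mode strings are likewise hypothetical.

```python
def select(outputs, mask, mode="mean"):
    """Pick one semantic vector from the per-position encoder outputs.

    outputs: list of vectors, one per input position
    mask:    attention mask, 1 for real tokens and 0 for padding
    """
    real = [vec for vec, m in zip(outputs, mask) if m == 1]
    if mode == "cls":
        return real[0]                  # [CLS] is the first position
    if mode == "sep":
        return real[-1]                 # assumed: [SEP] is the last real position
    if mode == "mean":
        dim = len(real[0])
        return [sum(v[d] for v in real) / len(real) for d in range(dim)]
    raise ValueError("unknown mode: " + mode)

# Toy 2-dimensional outputs for 3 real positions and 1 padded position.
outputs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]
mean_vec = select(outputs, mask, "mean")   # [3.0, 4.0]
```

Keeping the three strategies behind one function makes it easy to switch the semantic vector without touching the encoder or the detection layer.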

Finally, the obtained semantic vector is input into the detection layer, and the whole network is trained with the defined loss function to perform fine-tuning. The parameters of the BERT network are updated through back propagation during fine-tuning. The invention uses a detection network composed of two fully connected layers and a softmax function, and the activation function of the fully connected layers is ReLU.

3.3 loss function

Intelligent contract vulnerability detection is a problem with imbalanced data set categories. To mitigate the influence of the imbalanced data, the loss function is defined with a cost-sensitive learning algorithm, namely focal loss.
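For reference, the commonly used binary form of focal loss is FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), where p_t is the predicted probability of the true class. The sketch below computes it for a single sample. The values gamma = 2 and alpha = 0.25 are widely used defaults and are assumptions here; the text does not state which values the invention uses.

```python
import math

def focal_loss(p_black, label, gamma=2.0, alpha=0.25, eps=1e-12):
    """Binary focal loss for one sample.

    p_black: predicted probability that the sample is a black sample
    label:   1 for a black sample, 0 for a white sample
    gamma, alpha: assumed hyperparameters, not taken from the source
    """
    p_t = p_black if label == 1 else 1.0 - p_black
    alpha_t = alpha if label == 1 else 1.0 - alpha
    p_t = min(max(p_t, eps), 1.0)        # guard against log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy = focal_loss(0.9, 1)   # confidently correct prediction, small loss
hard = focal_loss(0.1, 1)   # confidently wrong prediction, large loss
```

With gamma = 0 the modulating factor disappears and the expression reduces to class-weighted cross entropy, which is why focal loss can be compared directly against cross entropy in experiments.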

Model fusion

In order to fuse the results of the two types of classification models realized by the invention into final output and make the overall model have higher robustness and stronger generalization capability in effect, a model fusion module is designed after the classification models. The input of the model fusion module is the probability score of the previous classification model output, and the model fusion process is shown in fig. 5.

For the fusion learner, either a Bagging-based method or a Boosting-based method can be selected. The invention uses a random forest and a gradient boosting tree as the learners for model fusion.

And the two classification models are fused to output the classification result, so that the intelligent contract security vulnerability detection based on machine learning can be realized.

The data set used in the following embodiment of machine-learning-based intelligent contract security vulnerability detection comes from the SmartBugs open source project, a public data set containing about forty thousand Solidity contract source files. It consists of three parts: the curated data set contains 69 vulnerable intelligent contracts covering 10 vulnerability types in total, with the position and type of each vulnerability marked in the code; the wild data set contains the source code of about 40,000 intelligent contracts from the Ethereum blockchain platform; the bug data set is a black sample data set constructed by code injection, in which 9369 different bugs are injected into 50 contract source files. In the implementation, the curated data set and the bug data set are regarded as black sample data sets, and for the wild data set the published detection results of existing open source tools on that data set are used as labels. For model training and effect verification, 80% of the whole data set is taken as training data and 20% as test data; the validation set used during training is split from the training data, and the test data is completely invisible during training and is only used for the final prediction and evaluation of the trained model.

The implementation uses the PyTorch deep learning framework, version 1.1.0. The experiment is divided into a training stage and a testing stage. The training set contains 30907 samples, of which 9060 are black samples and 21847 are white samples; the test set contains 7726 samples.

After the data preprocessing module, the token sequences of the data set are obtained. Considering the balance between computational cost and information retention, the length of the text sequence input to the model is set to 200. Samples longer than 200 are truncated to 200; samples shorter than 200 are padded to 200.
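The length normalization can be sketched in a few lines. The padding token name `<pad>` and the function name `fix_length` are hypothetical; only the target length of 200 comes from the text.

```python
def fix_length(tokens, max_len=200, pad_token="<pad>"):
    """Truncate or right-pad a token sequence to exactly max_len items."""
    if len(tokens) >= max_len:
        return tokens[:max_len]
    return tokens + [pad_token] * (max_len - len(tokens))

short = fix_length(["name", "withdraw"])      # padded up to 200
long_ = fix_length(["name"] * 350)            # truncated down to 200
```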

Inputting a data set in a training stage, and defining a forward process of a neural network; after the network structure is defined, the model is trained through gradient back propagation, iterative training is carried out on batch data to optimize a loss function, the model stops training after certain conditions are met, and model parameters are stored. In the testing stage, only model parameters need to be loaded and then forward calculation is carried out to obtain the network output of the testing data.

Training uses mini-batch gradient descent: in each iteration the model computes the loss on one mini-batch of data and updates the parameters through the back propagation (BP) algorithm, and one pass of all training data through training is called one round (epoch). To observe model convergence during training and to avoid problems such as overfitting, the training data is divided into a training set and a validation set in a ratio of 4:1. During iteration, the training set is sampled with a weight-based sampling algorithm, so that the generated batches have balanced sample classes; the validation set is used to check for overfitting during training, so it keeps the original data distribution and is not resampled. To compare and observe how well the model fits the data during training while making full use of all the data, 5-fold cross validation is performed on the training data: the training data is divided into five parts, one of which is used as the validation set each time and the rest as the training set.
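The weight-based sampling can be illustrated with the following sketch, in which each sample's weight is the reciprocal of the size of its class, as stated in the claims. Sampling with replacement, the batch size of 1000, and the function names are assumptions of the example; the class counts are the training set counts given in the embodiment.

```python
import random
from collections import Counter

def sample_weights(labels):
    """Weight of each sample = 1 / (number of samples in its class)."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]

def draw_batch(labels, batch_size, rng):
    """Draw sample indices with replacement, proportionally to their weights."""
    weights = sample_weights(labels)
    return rng.choices(range(len(labels)), weights=weights, k=batch_size)

# 9060 black samples (label 1) and 21847 white samples (label 0).
labels = [1] * 9060 + [0] * 21847
rng = random.Random(0)                     # fixed seed for reproducibility
batch = draw_batch(labels, 1000, rng)
black_fraction = sum(labels[i] for i in batch) / len(batch)
```

Because the weights of each class sum to 1, both classes are drawn with equal total probability, so `black_fraction` is close to 0.5 even though black samples make up less than a third of the training set.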

To control training time and avoid overfitting, an early stopping strategy is also adopted during training. At the end of each epoch in the training stage, the evaluation index is computed once, using the AUC on the validation set as the index. Because the training process may oscillate, the best validation AUC so far is recorded during training; when 10 consecutive epochs fail to exceed this value, the validation performance is considered to have stabilized, continued training would overfit the model, and the iteration is stopped. During training, whenever the validation AUC of the current epoch exceeds the best validation AUC on record, the parameters of the current model are saved.
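The early stopping rule can be sketched as follows. To keep the example self-contained, the per-epoch validation AUC values are supplied as a list standing in for real training; the function name and the AUC values are hypothetical, while the patience of 10 epochs and the maximum of 50 epochs follow the text.

```python
from itertools import islice

def run_early_stopping(val_aucs, patience=10, max_epochs=50):
    """Return (best_epoch, best_auc, last_epoch) under the early stopping rule."""
    best_auc, best_epoch, stale, last_epoch = float("-inf"), 0, 0, 0
    for epoch, auc in enumerate(islice(val_aucs, max_epochs), start=1):
        last_epoch = epoch
        if auc > best_auc:
            # New best validation AUC: a real loop would save the parameters here.
            best_auc, best_epoch, stale = auc, epoch, 0
        else:
            stale += 1
            if stale >= patience:          # 10 consecutive epochs without improvement
                break
    return best_epoch, best_auc, last_epoch

# Hypothetical curve: improves for 3 epochs, then plateaus slightly lower.
result = run_early_stopping([0.70, 0.75, 0.80] + [0.79] * 20)
# result -> (3, 0.8, 13)
```

In this example the best parameters come from epoch 3, and training stops at epoch 13 after ten epochs without improvement.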

The LSTM model is trained by gradient descent with the Adam optimizer, with the learning rate set to 1e-3; the BERT fine-tuning task uses AdamW, with the learning rate set to 2e-5. In the training stage, the maximum number of epochs for all models is 50, but because of the early stopping strategy, training generally stops within 30 epochs.

Comparative experiments were performed to compare different methods and models. For comparison with conventional machine learning methods, static features extracted from the abstract syntax tree are fed to machine learning classification models, including Logistic Regression, Decision Tree, linear-kernel support vector machine (Linear-SVM), and Gaussian-kernel support vector machine (RBF-SVM). To verify the effect of sequence modeling on the intelligent contract abstract syntax tree sequence, a recurrent neural network (RNN), a bidirectional LSTM model, and a bidirectional LSTM model with an attention mechanism are compared; the bidirectional LSTM model is used to compare the cross entropy and focal loss functions, and the bidirectional LSTM with attention implements two attention score calculation methods, one based on the dot product and one using a neural network as the attention score function. For the pre-training plus fine-tuning method, two experiments are designed to compare cross entropy and focal loss. To fuse the prediction results of the single models and make them more robust, model fusion based on Bagging (random forest) and Boosting (gradient boosting tree) is also carried out.

The results of all the above models on the test set for the four indices precision, recall, F1 score (F1-score), and AUC are shown in Table 1.

TABLE 1 evaluation index of each model on test set

It can be seen that the models based on deep learning outperform the traditional machine learning methods: among the traditional methods, only the decision tree model performs almost as well as the sequence models, while the AUC of the other models is only slightly better than random classification. The sequence models have a clear advantage in precision, with the LSTM models reaching about 0.7; their recall is poor, but on the two comprehensive indices, F1 score and AUC, they are clearly superior to the traditional machine learning models. This shows that the syntactic and semantic information of intelligent contract source code can be learned by serializing the abstract syntax tree and modeling it with natural language processing models. Fine-tuning the BERT pre-training model achieves the best F1 score and AUC among all single models; although its precision is slightly lower than that of the sequence models, its recall is good. Compared with the BERT model, the fusion model improves precision by 10 percent while keeping the F1 score and AUC largely unchanged, making its precision higher than that of all single models, at the cost of lower recall.

In actual use, a suitable model should be selected according to the specific service requirements: for service scenarios that care more about recall, the fine-tuned pre-training model can be selected, whereas for scenarios requiring higher precision, an LSTM model or the fusion model can be used.
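For readers weighing precision against recall, the four evaluation indices used above can be computed as in the following sketch. The labels, predictions, and scores are toy values invented for the example and are unrelated to Table 1; the AUC is computed by the pairwise definition (the fraction of black/white pairs in which the black sample receives the higher score), which is simple but quadratic in the number of samples.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for the black (positive, label 1) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def auc(y_true, scores):
    """Pairwise AUC: ties between a black and a white score count as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data: 3 black samples followed by 5 white samples.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.05]
p, r, f1 = precision_recall_f1(y_true, y_pred)
a = auc(y_true, scores)
```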

It is noted that the disclosed embodiments are intended to aid further understanding of the invention, but those skilled in the art will appreciate that various alternatives and modifications are possible without departing from the spirit and scope of the invention and the appended claims. Therefore, the invention should not be limited to the disclosed embodiments, and the scope of protection of the invention is defined by the appended claims.

Claims

1. A deep-learning-based Ethereum intelligent contract security vulnerability detection method, wherein the intelligent contract vulnerability detection problem is modeled as an end-to-end classification detection model, and whether a vulnerability is contained is judged for the intelligent contract source code, thereby realizing the detection of intelligent contract security vulnerabilities; the method comprises the following steps:

1) preprocessing the Ethereum intelligent contract source code data, comprising:

12) extracting relevant Ethereum intelligent contract fragments from the abstract syntax tree, and converting them to obtain a word sequence/token sequence; each Ethereum intelligent contract sample obtains a corresponding word sequence;

designing a sequence generation algorithm, extracting a semantic module associated with vulnerability detection from the abstract syntax tree constructed in the step 11), and converting the semantic module into a word sequence; specifically, function level code segments of an abstract syntax tree are extracted to reserve syntax information related to a program control flow and convert the syntax information into a token sequence;

13) generating a training data set in a training stage by adopting a weight-based sampling method; calculating the sampled weight of each intelligent contract source code sample, wherein the weight is the reciprocal of the total number of samples of the class to which the sample belongs;

dividing the generated data set of the training stage into a training set and a verification set;

2) constructing an intelligent contract source code semantic representation learning module which comprises a coding layer/coder, a detection layer/classifier and a model fusion output module;

2a) the coding layer converts an input token sequence into a vector with fixed dimensionality, wherein the vector is a semantic vector of an intelligent contract source code sample;

2b) the detection layer is used for learning the classifier to obtain the probability that each intelligent contract source code sample belongs to the white sample and the black sample; the semantic coding model of the classifier comprises a coding model based on bidirectional long short-term memory (LSTM) and a coding model based on a pre-training model; the intelligent contract vulnerability detection is modeled as a binary classification task, and the probability scores of intelligent contract source code samples belonging to white samples or black samples are output; specifically, the LSTM-based coding model is adopted to directly mine vulnerability-related semantic patterns from intelligent contract training data; and the coding model based on the pre-training model is fine-tuned on the intelligent contract data set as a downstream task to obtain a classification model for the intelligent contract vulnerability detection classification task;

2c) the model fusion output module takes the output probability score result of the classification network as the characteristic for learning, and generates a final probability output score by fusing the two classification models to obtain a classification result;

3) in the training stage, inputting a training data set to the constructed intelligent contract source code semantic representation learning module, defining the forward process of the neural network, training by gradient back propagation, performing iterative training on batch data by mini-batch gradient descent, and storing the model parameters to obtain the trained intelligent contract source code semantic representation learning module;

4) in the testing stage, a trained intelligent contract source code semantic representation learning module is used, model parameters are loaded, and then forward calculation is carried out, so that a network classification result of the testing data is output;

by the steps, intelligent contract security vulnerability detection based on machine learning can be realized.

2. The deep-learning-based Ethereum intelligent contract security vulnerability detection method as claimed in claim 1, wherein in step 12) relevant intelligent contract segments are extracted from the abstract syntax tree through a sequence generation algorithm and converted to obtain word sequences, comprising:

122) then depth-first traversal is carried out on the subtree to be analyzed;

when a key value pair is a minimum nested unit of a character string type, returning the key character string and the value character string as words respectively; and for the object whose value is dictionary or list type, continue recursion;

through the process, each function definition code segment is converted into a segment of sequence;

and for an intelligent contract sample containing a plurality of function definition subtrees, splicing sequences generated by all code fragment subtrees to obtain a final token sequence representation.

3. The deep-learning-based Ethereum intelligent contract security vulnerability detection method of claim 1, wherein in step 2b), the coding layer based on the bidirectional LSTM network comprises a word embedding layer and a bidirectional LSTM layer; wherein:

a) the word embedding layer converts each input token into a low-dimensional vector, namely the distributed representation of the word; the tokens are taken, in their order of appearance in the sequence, as the inputs of the corresponding LSTM time steps; the embedded vectors corresponding to the tokens form a word vector sequence/token sequence;

b) the bidirectional LSTM layer learns semantic information from front to back and from back to front simultaneously by constructing the bidirectional LSTM so as to obtain context semantics;

c) further using the structure of the stacked bidirectional LSTM layer to learn high-level semantics so as to obtain abstract semantics of the sentence;

the classifier based on the bidirectional LSTM network adopts a plurality of fully connected layers; its input is the semantic vector obtained by the coding layer; the output layer outputs the binary classification result; and softmax function calculation is performed on the output layer result to obtain the final classification probability.

4. The deep-learning-based Ethereum intelligent contract security vulnerability detection method of claim 3, wherein the stacked bidirectional LSTM layer comprises a forward coding layer and a backward coding layer, takes the word vector sequence as input, and models the token sequence from front to back and from back to front respectively; the forward coding layer is a network of K stacked LSTMs whose activation function is the tanh function, the input of the first layer is the word vector at the corresponding time step, and the input of the k-th layer is the hidden state of the previous layer at the corresponding time step; dropout is also added as a regularization method to avoid overfitting; the backward coding layer receives the word vector sequence in the direction opposite to that of the forward coding layer; and the vectors of the last layer of the forward layer and the backward layer are concatenated, and the last hidden state is selected as the final semantic vector, thereby obtaining a semantic vector containing contextual semantic information.

5. The deep-learning-based Ethereum intelligent contract security vulnerability detection method of claim 1, wherein a one-to-one mapping from words to integers is constructed as a dictionary, so that each token forms a corresponding embedded vector; each word is mapped to an integer according to the dictionary and then mapped to a vector of fixed dimensionality through a matrix.

6. The deep-learning-based Ethereum intelligent contract security vulnerability detection method of claim 5, wherein, for the bidirectional LSTM model, all tokens contained in the collected Solidity intelligent contract source code corpus are used as the dictionary; for the pre-training model, the dictionary of the pre-training model is used.

7. The deep-learning-based Ethereum intelligent contract security vulnerability detection method as claimed in claim 1, wherein in step 2b), fine-tuning for the downstream classification task by adopting the coding model based on the pre-training model comprises:

a) inputting a token sequence, processing the input token sequence by adding predefined special tokens, comprising the [CLS] token and the [SEP] token, before and after the token sequence, and converting it into the form required by the Transformer-based bidirectional encoder representation model (BERT model);

b) mapping each word of the input sequence to the corresponding ID in the pre-training model dictionary through the tokenizer of the pre-training model, padding or truncating the sequence to a fixed length, generating an attention mask by the tokenizer to distinguish padded words from non-padded words, and inputting the result into a BERT coding model formed by multi-layer Transformer encoders;

c) obtaining a vector sequence with the same length as the input sequence through a BERT coding model;

d) designing a selector module behind an output layer of the BERT coding model to increase system expandability;

the selector module generates and outputs different semantic vectors after the output layer, selected from: CLS, SEP, or mean;

e) inputting the output of the selector module into a classifier, and performing supervised fine-tuning using a detection network formed by two fully connected layers and a softmax function, wherein the activation function of the fully connected layers is ReLU; during fine-tuning, gradients are back-propagated to every layer, and no layer is frozen.

8. The deep-learning-based Ethereum intelligent contract security vulnerability detection method of claim 7, wherein in step d), the selector module adopts three semantic vector generation methods, comprising: the output at the position corresponding to the [CLS] token in the input; the output at the position corresponding to the [SEP] token in the input; and the mean of the outputs at all positions.

9. An Ethereum intelligent contract security vulnerability detection system for implementing the deep-learning-based Ethereum intelligent contract security vulnerability detection method of claim 1, characterized by comprising a data preprocessing module and a semantic representation learning module; wherein:

the intelligent contract source code semantic representation learning module comprises a coding layer, a detection layer/classifier, and a model fusion output module; the coding layer of the semantic representation learning module encodes the input word sequence into a semantic vector of fixed length, which is then input into the detection layer/classifier composed of a plurality of fully connected layers; in the detection layer/classifier, each classification network outputs a probability score of a sample belonging to the white sample/black sample class; and the model fusion output module combines the output results of the different types of classification networks to form a final probability output score.
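By way of illustration only, the depth-first traversal recited in claim 2 can be sketched as follows for an abstract syntax tree held as nested dictionaries and lists. The field names in the toy subtree and the function name `flatten` are hypothetical, and skipping non-string scalar values (numbers, booleans, empty values) is an assumption of the sketch that the claims do not address.

```python
def flatten(node):
    """Depth-first traversal of a dict/list syntax tree into a token sequence."""
    tokens = []
    if isinstance(node, dict):
        for key, value in node.items():
            if isinstance(value, str):
                # Smallest nested unit: emit the key and the value as two words.
                tokens.extend([key, value])
            elif isinstance(value, (dict, list)):
                tokens.extend(flatten(value))   # keep recursing
    elif isinstance(node, list):
        for item in node:
            tokens.extend(flatten(item))
    return tokens

# Hypothetical function-definition subtree.
func = {
    "nodeType": "FunctionDefinition",
    "name": "withdraw",
    "body": {"nodeType": "Block", "statements": [{"nodeType": "Return"}]},
}
sequence = flatten(func)
# -> ['nodeType', 'FunctionDefinition', 'name', 'withdraw',
#     'nodeType', 'Block', 'nodeType', 'Return']
```

For a contract with several function definition subtrees, the per-function sequences would be concatenated to give the final token sequence.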