CN110737899B

CN110737899B - Smart contract security vulnerability detection method based on machine learning

Info

Publication number: CN110737899B
Application number: CN201910904539.2A
Authority: CN
Inventors: 翁健; 陈新凯; 李明; 袁浩宸; 张斌; 卢贺贤
Original assignee: Jinan University
Current assignee: Jinan University
Priority date: 2019-09-24
Filing date: 2019-09-24
Publication date: 2022-09-06
Anticipated expiration: 2039-09-24
Also published as: CN110737899A

Abstract

The invention discloses a machine-learning-based method for detecting security vulnerabilities in smart contracts. The method first collects smart contract source code, preprocesses it, and builds a sample set for machine learning. It then uses publicly available smart contract vulnerability detectors to assign vulnerability labels to the samples, translates the smart contract source code into XML structured text, and extracts features of the source code from that text. Because Solidity smart contract sample data is currently limited and varies across vulnerability types, the method applies one of two machine learning algorithms depending on the number of labeled samples: a random forest model is built for labels with many samples, and transfer learning is used to build a detection model for labels with few samples. In this way a model for detecting Solidity smart contract vulnerabilities can be obtained more efficiently and automatically.

Description

Smart contract security vulnerability detection method based on machine learning

Technical Field

The invention relates to the technical field of cyberspace security, and in particular to a machine-learning-based method for detecting security vulnerabilities in smart contracts.

Background

Ethereum is the most mature public blockchain after Bitcoin and, with its continued development and maturation worldwide, has become the industry's leading development platform for underlying blockchain infrastructure. Ethereum supports Turing-complete smart contracts, which breaks through the limitations that Bitcoin places on blockchain applications: blockchain is no longer understood only as digital currency, and its application extends to many industries in the form of smart contracts, for example blockchain decentralized applications (DApps). Economic losses caused by flaws in blockchain mechanisms themselves, ecosystem security, and user security have reached billions of dollars, and losses caused by smart contract security vulnerabilities account for the largest share, up to 41.8%. As the economic value of blockchain grows, criminals are motivated to use a variety of attack methods, such as theft, ransomware, and illicit mining, to obtain more sensitive data, and they exploit blockchain concepts and technology in doing so, which makes the blockchain security situation more complex. According to survey data from the network security company Besec, cryptocurrency worth several billion dollars in total has been stolen in recent years, and the amount lost to blockchain security incidents is rising worldwide. A series of theft scandals has pushed the cryptocurrency market, whose market value is as high as 1 trillion dollars, into the center of public attention.

Moreover, a smart contract deployed on Ethereum cannot be tampered with, so a vulnerability discovered after deployment cannot be fixed by patching or updating; in most cases the only option is to disable the contract to keep the loss from growing. Traditional analysis of smart contract security vulnerabilities is valuable for analyzing predefined vulnerability properties. However, most traditional analysis tools must perform complex analysis steps, such as searching execution paths up to a predetermined call depth, and the search time increases as the depth increases. Since December 2015, the number of contracts on blockchains such as Ethereum has increased 176-fold. If these tools cannot analyze the growing number of contracts in time, a growing number of security vulnerabilities will cause irreparable harm to the smart contract community.

Disclosure of Invention

The invention aims to overcome the above defects in the prior art by providing a machine-learning-based smart contract security vulnerability detection method. By building a smart contract security vulnerability detection model, the method can detect security vulnerabilities in smart contract code as well as problems arising from the properties of the blockchain, and it displays specific vulnerability information so that participating users clearly understand the security vulnerabilities in the contracts they care about. Given the growing number of smart contracts, the method also obtains detection results faster than traditional analysis approaches.

The purpose of the invention can be achieved by adopting the following technical scheme:

A smart contract security vulnerability detection method based on machine learning comprises the following steps:

S1, collecting a large volume of Solidity smart contract code and Java/C++ code from the network to form a basic data set for machine learning, and selecting, from the basic data set, contracts whose Solidity compiler version is higher than a specified version number and whose code content repetition rate is lower than a repetition threshold as the machine learning sample set;

S2, determining vulnerability labels for the sample set data by means of smart contract vulnerability detectors, generating vulnerability label data based on Solidity smart contract vulnerability detection tools, and counting the number of Solidity smart contract samples for each vulnerability label in the sample set;

S3, branching according to the number of Solidity smart contract samples per label in the sample set: for labels with many samples, that is, a sample count greater than or equal to a preset comparison threshold, constructing a detection model with a random forest algorithm; for labels whose sample count is below the preset comparison threshold, constructing a detection model by transfer learning from a Java/C++ vulnerability model;

S4, performing security vulnerability detection on the smart contract to be tested using the constructed detection model, to obtain the security vulnerability information present in the smart contract.

Further, the procedure of step S1 is as follows:

S11, collecting Solidity smart contract code from the Ethereum smart contract platform using a crawler script, and at the same time collecting Java/C++ code from open source communities;

S12, converting the Solidity smart contract code into XML text and reading the Solidity compiler version directly from it, then comparing the code segments within the converted XML text and calculating the proportion of identical code segments to obtain the content repetition rate;

S13, selecting, from the basic data set, contracts whose Solidity compiler version is higher than the specified version number and whose code content repetition rate is lower than the repetition threshold as the machine learning sample set.

Further, in step S13, contracts with a Solidity compiler version higher than 4.14 and a code content repetition rate lower than 30% are selected as the machine learning sample set.
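The selection rule of step S13 amounts to a filter over the basic data set. The following Python sketch is illustrative only: the `Contract` record, its field names, and the reading of "4.14" as compiler release 0.4.14 are assumptions and are not specified by the invention.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Contract:
    """Hypothetical record for one collected contract (field names are assumptions)."""
    name: str
    compiler_version: str   # e.g. "0.4.24", as read from the converted XML text
    repetition_rate: float  # share of code segments identical to other contracts


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '0.4.24' into (0, 4, 24) so that versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def select_samples(contracts: List[Contract],
                   min_version: str = "0.4.14",  # assumption: "4.14" means 0.4.14
                   max_repetition: float = 0.30) -> List[Contract]:
    """Keep contracts above the version floor and below the repetition threshold."""
    floor = parse_version(min_version)
    return [c for c in contracts
            if parse_version(c.compiler_version) > floor
            and c.repetition_rate < max_repetition]


# Made-up basic data set
basic_data_set = [
    Contract("Token", "0.4.24", 0.10),
    Contract("OldWallet", "0.4.11", 0.05),
    Contract("CopiedToken", "0.5.0", 0.80),
]
sample_set = select_samples(basic_data_set)
print([c.name for c in sample_set])
```

Comparing versions as integer tuples avoids the string-comparison pitfall in which "0.4.9" would sort above "0.4.14".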

Further, the procedure of step S2 is as follows:

S21, feeding the Solidity smart contract code in the sample set to one or more Solidity smart contract vulnerability detection tools, which output a plurality of vulnerability labels;

S22, aggregating the detection results of the different detection tools, and recording a vulnerability label as a Solidity smart contract vulnerability label only when it appears in at least 50% of the detection results;

S23, counting the number of Solidity smart contract samples for each vulnerability label in the sample set.
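Steps S22 and S23 can be read as a vote across detectors followed by a per-label tally. The Python sketch below shows one such reading; the label strings and detector outputs are made up, and treating "frequency" as the share of tools that reported a label is an assumption.

```python
from collections import Counter
from typing import Dict, List, Set


def vote_labels(tool_results: Dict[str, Set[str]], min_share: float = 0.5) -> Set[str]:
    """Keep a label when at least min_share of the tools reported it (S22)."""
    counts = Counter(label for labels in tool_results.values() for label in labels)
    n_tools = len(tool_results)
    return {label for label, hits in counts.items() if hits / n_tools >= min_share}


def count_samples_per_label(sample_labels: List[Set[str]]) -> Counter:
    """Count how many samples carry each vulnerability label (S23)."""
    return Counter(label for labels in sample_labels for label in labels)


# Made-up outputs of three detectors for a single contract
results = {
    "Oyente": {"reentrancy", "integer_overflow"},
    "ZEUS": {"reentrancy"},
    "Osiris": {"integer_overflow", "timestamp_dependency"},
}
labels = vote_labels(results)

# Tally over two labeled samples
label_counts = count_samples_per_label([labels, {"reentrancy"}])
print(sorted(labels), dict(label_counts))
```

Here "timestamp_dependency" is dropped because only one of the three tools reported it.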

Further, the process of constructing the detection model by the random forest algorithm is as follows:

P1, converting the Solidity smart contract code into XML text, in which each node represents a syntax element of the contract code and which provides full detail about the characteristics of the source code;

P2, extracting features from the XML text according to the characteristics of Solidity smart contracts, considering Solidity syntax, contract semantics and function behavior in turn;

P3, training a random forest model with the random forest algorithm, taking the feature vectors and label data of the Solidity smart contracts as input and, in view of the inherent characteristics of Solidity smart contracts, treating the representative execution-path features, namely function calls and code flow, as high-weight features during training.

Further, the procedure of step P2 is as follows:

The XML text is traversed using the dom4j package and the XPath language, and the Solidity source code information contained in the XML text is packaged into a SolFileBean entity. Here dom4j is an open source XML parsing package used to parse the XML text, XPath is the XML Path Language, a computer language for locating parts of an XML document, and SolFileBean is a programming entity that encapsulates the Solidity source code information. The SolFileBean provides complete detail about the characteristics of the Solidity source code, including the contract set, method set, variable set and modifier set;

According to the characteristics of Solidity smart contracts, and considering Solidity syntax, contract semantics and function behavior in turn, a variety of features are extracted from the SolFileBean. The features fall into four categories: 1) contract basic information features; 2) binary operator features; 3) code complexity features; 4) path features.
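Two of the four categories, binary operator features and code complexity features, can be illustrated with simple counting. The Python sketch below works directly on source text rather than on a SolFileBean, which is a simplification; the tracked operator set, the feature names, and the sample function are assumptions.

```python
import re
from collections import Counter
from typing import Dict

# Multi-character operators come first so that "==" or "++" is not split in two.
OPERATOR_TOKEN = re.compile(r"\+\+|--|==|!=|<=|>=|=>|[-+*/<>=]")
TRACKED = ["+", "-", "/", ">", "<", "="]  # illustrative operator set


def operator_features(code: str) -> Dict[str, float]:
    """Occurrence count and relative frequency of each tracked operator."""
    counts = Counter(t for t in OPERATOR_TOKEN.findall(code) if t in TRACKED)
    total = sum(counts.values())
    features: Dict[str, float] = {}
    for op in TRACKED:
        features[f"count[{op}]"] = counts[op]
        features[f"freq[{op}]"] = counts[op] / total if total else 0.0
    return features


def complexity_features(code: str) -> Dict[str, int]:
    """Rough complexity proxies: non-blank lines, character length, loop statements."""
    lines = [line for line in code.splitlines() if line.strip()]
    loops = re.findall(r"\b(?:for|while)\b", code)
    return {"lines": len(lines), "length": len(code), "loops": len(loops)}


SAMPLE = """function sum(uint n) public returns (uint) {
    uint total = 0;
    for (uint i = 0; i < n; i++) {
        total = total + i;
    }
    return total;
}"""

op_feats = operator_features(SAMPLE)
cx_feats = complexity_features(SAMPLE)
print(op_feats["count[=]"], op_feats["freq[=]"], cx_feats)
```

The number of basic blocks in the code flow graph is left out here because it requires a control flow analysis rather than text counting.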

Further, the process of constructing the detection model by transfer learning from the Java/C++ vulnerability model is as follows:

Q1, identifying the vulnerability types in Solidity smart contracts that are similar to those of the programming languages Java or C++, including integer overflow vulnerabilities, reentrancy vulnerabilities and inter-function call exception vulnerabilities;

Q2, training a detection model covering integer overflow vulnerabilities, reentrancy vulnerabilities and inter-function call exception vulnerabilities on a large amount of sample data in the programming language Java or C++;

Q3, using transfer learning, applying the vulnerability detection model for traditional code to Solidity smart contract test samples, checking the accuracy of the results, and adjusting the detection model trained on the traditional programming language accordingly.

Further, the Solidity smart contract vulnerability detection tools include Oyente, ZEUS and Osiris.

Compared with the prior art, the invention has the following advantages and effects:

1) The invention builds a smart contract security vulnerability detection model and performs combined multi-feature extraction and analysis on Solidity code, so that it can detect security vulnerabilities in smart contract code as well as problems arising from the properties of the blockchain, and it displays specific vulnerability information so that participating users can see at a glance the security vulnerabilities in the smart contracts they care about.

2) Based on the features of the smart contract source code and the vulnerability labels, the method uses random forest and transfer learning to learn smart contract detection models for different vulnerability types automatically. Because the behavior of a contract, as reflected in its source code, is closely related to its vulnerabilities, extracting source code features for machine learning allows better features to be learned and the vulnerabilities in the smart contract to be detected. The invention can therefore obtain a model for detecting Solidity smart contract vulnerabilities more efficiently and automatically.

Drawings

Fig. 1 is an operational flow diagram of the machine-learning-based smart contract security vulnerability detection method according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of the XML structured text used in the machine-learning-based smart contract security vulnerability detection method according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Examples

This embodiment discloses a machine-learning-based smart contract security vulnerability detection method. As shown in Fig. 1, the detection method comprises the following steps:

S1, collecting a large volume of Solidity smart contract code and Java/C++ code from the network to form a basic data set for machine learning, and selecting, from the basic data set, contracts whose Solidity compiler version is higher than a specified version number and whose code content repetition rate is lower than a repetition threshold as the machine learning sample set;

Specifically, in this embodiment, the process of step S1 is as follows:

S11, collecting Solidity smart contract code from the Ethereum smart contract platform using a crawler script, and collecting Java/C++ code from open source communities;

S12, converting the Solidity smart contract code into structured XML text and reading the Solidity compiler version directly from it, then comparing the code segments within the converted XML text and calculating the proportion of identical code segments to obtain the content repetition rate;

In this embodiment, contracts with a Solidity compiler version higher than 4.14 and a code content repetition rate lower than 30% are selected as the machine learning sample set.

In the above embodiment the specified version number is Solidity compiler version 4.14 and the repetition threshold is 30%; these values do not limit the technical solution of the invention, and other values still fall within its scope of protection.
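The embodiment does not spell out how code segments are compared in step S12. One possible reading is sketched below in Python: each function definition node of the XML text is taken as one code segment, and the repetition rate is the share of a contract's segments already seen in other contracts. The element names `sourceUnit` and `functionDefinition`, the whitespace normalization, and the exact-match comparison are all assumptions.

```python
import xml.etree.ElementTree as ET
from typing import List, Set


def code_segments(xml_text: str, tag: str = "functionDefinition") -> List[str]:
    """Return the whitespace-normalized text of every element with the given tag."""
    root = ET.fromstring(xml_text)
    return [" ".join("".join(node.itertext()).split()) for node in root.iter(tag)]


def repetition_rate(xml_text: str, seen_segments: Set[str]) -> float:
    """Share of this contract's segments that already occur in seen_segments."""
    segments = code_segments(xml_text)
    if not segments:
        return 0.0
    repeated = sum(1 for segment in segments if segment in seen_segments)
    return repeated / len(segments)


# Two made-up contracts already converted to XML text
contract_a = (
    "<sourceUnit>"
    "<functionDefinition>function f() { x = 1; }</functionDefinition>"
    "<functionDefinition>function g() { y = 2; }</functionDefinition>"
    "</sourceUnit>"
)
contract_b = (
    "<sourceUnit>"
    "<functionDefinition>function  f()  { x = 1; }</functionDefinition>"
    "<functionDefinition>function h() { z = 3; }</functionDefinition>"
    "</sourceUnit>"
)

seen = set(code_segments(contract_a))
rate_b = repetition_rate(contract_b, seen)
print(rate_b)
```

With a 30% threshold, `contract_b` would be excluded, since half of its segments repeat code from `contract_a`.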

S2, determining vulnerability labels for the sample set data by means of smart contract vulnerability detectors, generating vulnerability label data based on Solidity smart contract vulnerability detection tools (including Oyente, ZEUS and Osiris), and counting the number of Solidity smart contract samples for each vulnerability label in the sample set;

Specifically, in this embodiment, the process of step S2 is as follows:

The smart contract vulnerability detector is a smart contract security detection tool based on semantic analysis that can automatically detect the following recent Ethereum security vulnerability types: 1) Integer Underflow; 2) Integer Overflow; 3) the multi-signature wallet vulnerability (Parity Multisig Bug 2); 4) Callstack Depth Attack Vulnerability; 5) Transaction-Ordering Dependence (TOD); 6) Timestamp Dependency; 7) Re-Entrancy Vulnerability.

Table 1. Vulnerability categories of the machine-learning-based smart contract security vulnerability detection mechanism

S22, aggregating the detection results of the different detection tools; a vulnerability label is recorded as a Solidity smart contract vulnerability label only when it appears in at least 50% of the detection results.

S3, branching according to the number of Solidity smart contract samples per label in the sample set: for labels with many samples, that is, a sample count greater than or equal to a preset comparison threshold, constructing a detection model with a random forest algorithm; for labels whose sample count is below the preset comparison threshold, constructing a detection model by transfer learning from the Java/C++ vulnerability model.

S4, performing smart contract security vulnerability detection using the constructed detection model.
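The branching in step S3 is a per-label comparison against the preset threshold. A minimal Python sketch follows; the threshold of 500 and the sample counts are made-up values, since the embodiment does not fix them.

```python
from typing import Dict


def choose_strategy(label_counts: Dict[str, int], threshold: int) -> Dict[str, str]:
    """Assign a model-building strategy to each vulnerability label.

    Labels with at least `threshold` samples use a random forest;
    labels with fewer samples use transfer learning.
    """
    return {
        label: "random_forest" if count >= threshold else "transfer_learning"
        for label, count in label_counts.items()
    }


# Made-up sample counts per vulnerability label
counts = {"timestamp_dependency": 1200, "reentrancy": 40, "integer_overflow": 500}
plan = choose_strategy(counts, threshold=500)  # 500 is a hypothetical threshold
print(plan)
```

A label whose count equals the threshold falls on the random forest side, matching the "greater than or equal to" wording of the step.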

The detection model is constructed by the random forest algorithm as follows:

To convert the smart contract source code into formatted XML structured text, this embodiment adopts ANTLR, a parser generator based on the LL algorithm (left-to-right scan, leftmost derivation), and uses its top-down recursive-descent LL parsing method to convert the Solidity smart contract code into XML structured text. The XML text retains all the information of the Solidity contract, which facilitates the subsequent security analysis. The generated XML structured data can be regarded as an abstract syntax tree (AST) of the Solidity source code. Each node in the XML represents a syntax element of the programming language; for example, a <functionDefinition> node represents a function definition statement in the Solidity code. The XML can therefore provide rich detail about the characteristics of the source code, such as the number of contracts, the number of functions, and the specific content of each function.
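Once the source code is available as XML, questions such as "how many contracts" or "which functions belong to which contract" become path queries over the tree. The Python sketch below uses the standard library's `xml.etree.ElementTree`, which supports a subset of XPath; the element names, the `name` attribute, and the XML layout are assumptions about what the converter emits.

```python
import xml.etree.ElementTree as ET

# Made-up XML structured text for a file with two contracts
XML_TEXT = """<sourceUnit>
  <contractDefinition name="Bank">
    <functionDefinition name="deposit"/>
    <functionDefinition name="withdraw"/>
  </contractDefinition>
  <contractDefinition name="Token">
    <functionDefinition name="transfer"/>
  </contractDefinition>
</sourceUnit>"""

root = ET.fromstring(XML_TEXT)

# ".//tag" selects every descendant element with that tag
contracts = root.findall(".//contractDefinition")
functions = root.findall(".//functionDefinition")

# A relative path selects only the direct children of each contract node
functions_per_contract = {
    contract.get("name"): [fn.get("name") for fn in contract.findall("functionDefinition")]
    for contract in contracts
}

print(len(contracts), len(functions), functions_per_contract)
```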

Features are then extracted from the XML structured text. The XML text is traversed using the dom4j package and the XPath language, and the Solidity source code information contained in the XML text is packaged into a SolFileBean entity. Here dom4j is an open source XML parsing package used to parse the XML text, XPath is the XML Path Language, a computer language for locating parts of an XML document, and SolFileBean is a programming entity that encapsulates the Solidity source code information. The SolFileBean provides complete detail about the characteristics of the Solidity source code, including all source code information such as the contract set, method set, variable set and modifier set. According to the characteristics of Solidity smart contracts, and considering Solidity syntax, contract semantics, function behavior and other aspects in turn, a variety of features are extracted from the SolFileBean. The features fall into four categories: 1) contract basic information features; 2) binary operator features; 3) code complexity features; 4) path features.
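The packaging step can be pictured as filling a small container object from the XML tree. The embodiment does this in Java with dom4j; the Python sketch below is only an analogue built on the standard library, and the `SolFileBean` field names, the XML element names, and the `name` attribute are assumptions.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List


@dataclass
class SolFileBean:
    """Python stand-in for the SolFileBean entity (field names are assumptions)."""
    contracts: List[str] = field(default_factory=list)
    methods: List[str] = field(default_factory=list)
    variables: List[str] = field(default_factory=list)
    modifiers: List[str] = field(default_factory=list)


def build_bean(xml_text: str) -> SolFileBean:
    """Traverse the XML text with path queries and package the result."""
    root = ET.fromstring(xml_text)

    def names(tag: str) -> List[str]:
        # ".//tag" matches every descendant element with that tag
        return [node.get("name") for node in root.findall(f".//{tag}")]

    return SolFileBean(
        contracts=names("contractDefinition"),
        methods=names("functionDefinition"),
        variables=names("stateVariableDeclaration"),
        modifiers=names("modifierDefinition"),
    )


# Made-up XML structured text for one contract
XML_TEXT = """<sourceUnit>
  <contractDefinition name="Bank">
    <stateVariableDeclaration name="balances"/>
    <modifierDefinition name="onlyOwner"/>
    <functionDefinition name="withdraw"/>
  </contractDefinition>
</sourceUnit>"""

bean = build_bean(XML_TEXT)
print(bean)
```

Collecting everything into one object means later feature extraction can work on plain lists and does not have to query the XML again.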

1) Contract basic information features are the number and definitions of the contracts (contract), functions (function), events (event) and modifiers (modifier) of a smart contract. The contract definition records whether a contract has a parent contract; the function definition covers the function's access modifier, return value and input parameter list; the event definition covers the event's input parameter list; and the modifier definition covers the modifier's input parameter list;

2) Binary operator features are the number and frequency of occurrences of binary operators such as +, -, /, >, <, = in each contract and each function;

3) Code complexity features approximate the complexity of the code by the number of code lines, the code length, the number of loop statements and the number of basic blocks in the code flow graph;

4) Path features are the call relations between functions, the relations between functions and the modifiers applied to them, and the control statements in the code flow graph. The call relation between functions covers both a function calling another function and a function being called by other functions. The modifier relation means that a function is decorated by a modifier and can be used normally only when the modifier's condition is satisfied. The control statements in the code flow graph are the branch statements, each of which represents a different code execution path.
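As an illustration of the call-relation part of the path features, the Python sketch below derives two numbers per function, how many distinct functions it calls and how many functions call it, from a caller-to-callee map. The map is assumed to have been extracted from the XML text already, and the function names and feature names are made up.

```python
from typing import Dict, List


def call_relation_features(calls: Dict[str, List[str]]) -> Dict[str, Dict[str, int]]:
    """For each function, count distinct callees and distinct callers."""
    features = {
        fn: {"calls_out": len(set(callees)), "called_by": 0}
        for fn, callees in calls.items()
    }
    for callees in calls.values():
        for callee in set(callees):
            if callee in features:  # ignore calls to functions outside the contract
                features[callee]["called_by"] += 1
    return features


# Made-up call relations inside one contract
calls = {
    "withdraw": ["checkBalance", "send"],
    "deposit": ["checkBalance"],
    "checkBalance": [],
    "send": [],
}
path_feats = call_relation_features(calls)
print(path_feats)
```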

Model training uses the random forest algorithm with 10-fold cross-validation, taking the feature vectors and label data of the Solidity smart contracts as input. In view of the inherent characteristics of Solidity smart contracts, path features, such as function call and code flow features, are treated as high-weight features during training, and the detection model with the highest accuracy is obtained by repeatedly adjusting the weights and testing.
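The 10-fold cross-validation loop can be sketched independently of the classifier. In the Python sketch below the random forest itself is not implemented: `fit_stump` is a deliberately trivial stand-in classifier, the data are made up, and feature weighting is not shown. In practice `fit` would wrap a random forest trainer.

```python
import random
from typing import Callable, List, Sequence, Tuple

Predictor = Callable[[Sequence[float]], int]


def k_fold_indices(n_samples: int, k: int = 10, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Shuffle the sample indices and return k (train, test) index splits."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


def cross_validate(features, labels, fit: Callable[..., Predictor], k: int = 10) -> float:
    """Mean test accuracy over k folds; fit(train_x, train_y) returns a predictor."""
    scores = []
    for train, test in k_fold_indices(len(labels), k):
        predict = fit([features[i] for i in train], [labels[i] for i in train])
        correct = sum(predict(features[i]) == labels[i] for i in test)
        scores.append(correct / len(test))
    return sum(scores) / len(scores)


def fit_stump(train_x, train_y) -> Predictor:
    """Stand-in classifier: threshold on feature 0, halfway between the class means."""
    zeros = [x[0] for x, y in zip(train_x, train_y) if y == 0]
    ones = [x[0] for x, y in zip(train_x, train_y) if y == 1]
    threshold = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2
    return lambda x: int(x[0] > threshold)


# Made-up, clearly separated data: 10 samples per class
features = [[float(v)] for v in range(10)] + [[float(v)] for v in range(20, 30)]
labels = [0] * 10 + [1] * 10

splits = k_fold_indices(len(labels))
accuracy = cross_validate(features, labels, fit_stump)
print(accuracy)
```

Each sample is used for testing exactly once, so the mean accuracy reflects performance on data the model did not see during training.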

The detection model is constructed by transfer learning from the Java/C++ vulnerability model as follows:

The vulnerability types in Solidity smart contracts that are close to those of traditional programming languages (Java/C++) are identified, including integer overflow vulnerabilities, reentrancy vulnerabilities, inter-function call exception vulnerabilities and the like;

A detection model covering integer overflow vulnerabilities, reentrancy vulnerabilities, inter-function call exception vulnerabilities and the like is trained on a large amount of sample data in a traditional programming language (Java/C++), following the machine learning steps of VulDeePecker;

VulDeePecker is a known deep-learning-based method for detecting Java code vulnerabilities. Its training process consists of the following four steps:

1) Library/API function calls and the corresponding program slices are extracted from the training data (program source code). One or more program slices are extracted for each argument of a library/API function call; a program slice consists of one or more lines of code related to that argument;

2) Code gadgets and their labels are generated. A code gadget is composed of multiple semantically related lines of code (code in the CFG), and each code gadget is labeled 1 (vulnerable) or 0 (not vulnerable);

3) Code gadgets are converted into vector representations. Each code gadget is first converted into a semantic representation, which preserves the semantic information of the training data, and the semantic representation is then encoded into a vector that serves as the input to the BLSTM;

4) The BLSTM neural network is trained on the training sample data set following the standard model training procedure.
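Step 3, turning a code gadget into a vector, can be illustrated with a much simpler encoding than the one a real system would use. The Python sketch below tokenizes the lines of a gadget, maps each token to an integer id, and pads or truncates to a fixed length; integer ids stand in for learned token embeddings, and the tokenizer, the vocabulary scheme, and the sample gadget are assumptions.

```python
import re
from typing import Dict, List

# Identifiers, integer literals, or single punctuation characters
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|[^\s\w]")


def tokenize(gadget: List[str]) -> List[str]:
    """Split every line of a code gadget into tokens."""
    return [token for line in gadget for token in TOKEN.findall(line)]


def build_vocab(gadgets: List[List[str]]) -> Dict[str, int]:
    """Assign an integer id to each distinct token; 0 pads, 1 marks unknown tokens."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for gadget in gadgets:
        for token in tokenize(gadget):
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(gadget: List[str], vocab: Dict[str, int], length: int) -> List[int]:
    """Encode a gadget as a fixed-length list of token ids."""
    ids = [vocab.get(token, vocab["<unk>"]) for token in tokenize(gadget)][:length]
    return ids + [vocab["<pad>"]] * (length - len(ids))


# Made-up code gadget with its label (1 = vulnerable)
gadget = ["char buf[8];", "strcpy(buf, src);"]
label = 1

vocab = build_vocab([gadget])
vector = encode(gadget, vocab, length=16)
unseen = encode(["foo();"], vocab, length=4)
print(vector, label, unseen)
```

A fixed length is used so that every gadget yields an input of the same shape for the recurrent network.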

Using the parameter/model transfer approach of transfer learning, the vulnerability detection model for traditional code is applied to Solidity smart contract test samples, the accuracy of the results is checked, and the parameters of the detection model trained on the traditional programming language are adjusted accordingly so that the model better fits smart contract security vulnerability detection.

The parameter/model transfer approach assumes that the source tasks and the target tasks share some parameters, or share a prior distribution over the model hyperparameters, and on that basis transfers the original model to a new domain to achieve better accuracy.

The source tasks are the vulnerability detection tasks for traditional code (covering integer overflow vulnerabilities, reentrancy vulnerabilities and the like), and the target tasks are the Solidity smart contract vulnerability detection tasks. The source tasks and the target tasks keep the same label space to make the transfer more effective. The same label space means that labels such as integer overflow vulnerability and reentrancy vulnerability have the same actual meaning in both the traditional-code and the Solidity smart contract vulnerability detection tasks and are kept consistent. The VulDeePecker training method learns from the semantic relations within code and does not depend on the grammar of a specific programming language. This also indicates that the feature space and probability distribution of vulnerability detection for Solidity smart contracts (the target domain) are highly similar to those for traditional code (the source domain), which supports building the model by transfer learning.
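Parameter transfer can be illustrated on a model far smaller than a BLSTM. In the Python sketch below a logistic regression is trained on plentiful source-domain samples, its weights are copied as the starting point for the target domain, and a few gradient steps on a handful of target samples adjust them. The model choice, the two-dimensional features, and all data values are assumptions made purely for illustration.

```python
import math
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]


def predict(weights: List[float], x: Sequence[float]) -> float:
    """Probability of the 'vulnerable' class; weights[0] is the bias."""
    z = weights[0] + sum(w * v for w, v in zip(weights[1:], x))
    return 1.0 / (1.0 + math.exp(-z))


def loss(weights: List[float], data: List[Sample]) -> float:
    """Mean cross-entropy loss over the data."""
    total = 0.0
    for x, y in data:
        p = predict(weights, x)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(data)


def train(weights: List[float], data: List[Sample], steps: int, lr: float = 0.5) -> List[float]:
    """Full-batch gradient descent starting from the given weights."""
    w = list(weights)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, y in data:
            err = predict(w, x) - y
            grad[0] += err
            for i, v in enumerate(x):
                grad[i + 1] += err * v
        w = [wi - lr * g / len(data) for wi, g in zip(w, grad)]
    return w


# Source domain (traditional code): plentiful labeled samples (made-up values)
source_data: List[Sample] = [
    ([1.0, 0.0], 1), ([0.9, 0.2], 1), ([0.8, 0.1], 1),
    ([0.0, 1.0], 0), ([0.2, 0.9], 0), ([0.1, 0.8], 0),
]
# Target domain (Solidity contracts): few labeled samples, same label space
target_data: List[Sample] = [([0.9, 0.1], 1), ([0.2, 0.7], 0)]

source_weights = train([0.0, 0.0, 0.0], source_data, steps=200)
loss_before = loss(source_weights, target_data)  # source model applied as-is

# Transfer: start from the source weights and adjust them on the target samples
target_weights = train(source_weights, target_data, steps=20)
loss_after = loss(target_weights, target_data)

print(loss_before, loss_after)
```

Because both tasks use the same labels, the source weights already separate the target samples better than an untrained model, and the adjustment on the target data only needs to refine them.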

The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims

1. A smart contract security vulnerability detection method based on machine learning, characterized by comprising the following steps:

S1, collecting Solidity smart contract code and Java/C++ code from the network to form a basic data set for machine learning, and selecting, from the basic data set, contracts whose Solidity compiler version is higher than a specified version number and whose code content repetition rate is lower than a repetition threshold as the machine learning sample set;

S2, determining vulnerability labels for the sample set data by means of smart contract vulnerability detectors, generating vulnerability label data based on Solidity smart contract vulnerability detection tools, and counting the number of Solidity smart contract samples for each vulnerability label in the sample set; the procedure of step S2 is as follows:

S21, feeding the Solidity smart contract code in the sample set to one or more Solidity smart contract vulnerability detection tools, which output a plurality of vulnerability labels;

S23, counting the number of Solidity smart contract samples for each vulnerability label in the sample set;

S3, branching according to the number of Solidity smart contract samples per label in the sample set: for labels with many samples, that is, a sample count greater than or equal to a preset comparison threshold, constructing a detection model with a random forest algorithm; for labels whose sample count is below the preset comparison threshold, constructing a detection model by transfer learning from a Java/C++ vulnerability model;

S4, performing security vulnerability detection on the smart contract to be tested using the constructed detection model, to obtain the security vulnerability information present in the smart contract; the process of constructing the detection model by the random forest algorithm is as follows:

P1, converting the Solidity smart contract code into XML text, in which each node represents a syntax element of the contract code and which provides full detail about the characteristics of the source code;

P3, training a random forest model with the random forest algorithm, taking the feature vectors and label data of the Solidity smart contracts as input and, in view of the inherent characteristics of Solidity smart contracts, treating the representative execution paths as high-weight features during training;

the procedure of step P2 is as follows:

traversing the XML text using the dom4j package and the XPath language, and packaging the Solidity source code information contained in the XML text into a SolFileBean entity, wherein dom4j is an open source XML parsing package used to parse the XML text, XPath is the XML Path Language, a computer language for locating parts of an XML document, SolFileBean is a programming entity that encapsulates the Solidity source code information, and the SolFileBean provides complete detail about the characteristics of the Solidity source code, including the contract set, method set, variable set and modifier set;

according to the characteristics of Solidity smart contracts, and considering Solidity syntax, contract semantics and function behavior in turn, extracting a variety of features from the SolFileBean, the features falling into four categories: 1) contract basic information features; 2) binary operator features; 3) code complexity features; 4) path features;

the process of constructing the detection model by transfer learning from the Java/C++ vulnerability model is as follows:

2. The smart contract security vulnerability detection method based on machine learning according to claim 1, wherein the procedure of step S1 is as follows:

3. The smart contract security vulnerability detection method based on machine learning according to claim 2, wherein in step S13, contracts with a Solidity compiler version higher than 4.14 and a code content repetition rate lower than 30% are selected as the machine learning sample set.

4. The smart contract security vulnerability detection method based on machine learning according to claim 1, wherein the Solidity smart contract vulnerability detection tools comprise Oyente, ZEUS and Osiris.