CN113486357A

CN113486357A - Intelligent contract security detection method based on static analysis and deep learning

Info

Publication number: CN113486357A
Application number: CN202110766768.XA
Authority: CN
Inventors: 周福才; 罗熙霖; 焦梓; 孙劲桐
Original assignee: Northeastern University China
Current assignee: Northeastern University China
Priority date: 2021-07-07
Filing date: 2021-07-07
Publication date: 2021-10-08
Anticipated expiration: 2041-07-07
Also published as: CN113486357B

Abstract

The invention discloses a smart contract security detection method based on static analysis and deep learning, and relates to the technical field of blockchain smart contract security. The method includes: performing static analysis on the Solidity source program of a smart contract to obtain a graph structure of the Solidity source program; extracting abstract facts from the graph structure; building, according to the abstract facts of the Solidity source program, a deep learning model for classifying the vulnerabilities of the Solidity source program, the model including an input module, an attention module, a residual connection module and an output module; constructing a training data set; training the deep learning model with the training data set; and using the trained deep learning model to perform vulnerability detection on an input smart contract and output the security detection result of its Solidity source program. The method can comprehensively analyze the behavior of the Solidity source program of a smart contract and improve the accuracy of its security detection.

Description

Smart contract security detection method based on static analysis and deep learning

Technical Field

The invention relates to the technical field of blockchain smart contract security, and in particular to a smart contract security detection method based on static analysis and deep learning.

Background

A smart contract is a special protocol deployed on a blockchain. Buterin recognized the applicability of decentralized computing beyond transactions and designed the Ethereum blockchain, which supports the execution of smart contracts. A smart contract contains code functions for trading, decision making and sending Ether. Smart contracts have proven useful in many areas, including securities, communications, banking and medicine. However, smart contracts are transparent: all participants can view their source code. Smart contracts also cannot be changed once deployed, so the software cannot be updated promptly after a bug is found, and losses can only be reduced by measures such as suspending transactions or forking. If a smart contract does not undergo security detection, it cannot be repaired in time, which affects the normal use of its functions and may even harm the interests of its users, with serious consequences. Examples include the DAO attack, in which an anonymous hacker exploited a reentrancy vulnerability in a smart contract to steal 3.6 million Ether; the Parity incident, in which an attacker found a timestamp vulnerability in the smart contract code library and destroyed the library by exploiting inconsistent timestamps, causing a loss of 150 million dollars; and the malicious contract incident, in which five hackers maliciously released 34,000 flawed smart contracts, congesting Ethereum and triggering an abnormal chain reaction that led to the theft of Ether worth 4.4 million dollars.
Under such a severe security threat, there is currently no good general means of detecting smart contract vulnerabilities, and smart contract security still depends mainly on the security skills of the contract developer and on code audits based on expert experience. An effective scheme for automatically detecting smart contract security is therefore urgently needed. Existing automatic security detection has the following problems: 1. the smart contract code cannot be analyzed with full coverage; 2. the false positive rate of security detection is high; 3. only specific attacks are considered, and other attacks are difficult to detect.

Disclosure of Invention

To address the defects of the prior art, the invention provides a smart contract security detection method based on static analysis and deep learning, with the aim of solving the problem of smart contract security detection.

The technical scheme of the invention is as follows:

A smart contract security detection method based on static analysis and deep learning, comprising the following steps:

Step 1: perform static analysis on the Solidity source program of the smart contract to obtain a graph structure of the Solidity source program; the static analysis comprises lexical analysis and syntactic analysis; the graph structure comprises an abstract syntax tree (AST) and a control flow graph (CFG);

Step 2: extract abstract facts from the graph structure of the Solidity source program obtained in step 1;

Step 3: according to the abstract facts of the Solidity source program obtained in step 2, build a deep learning model for vulnerability classification of the Solidity source program, the deep learning model comprising an input module, an attention module, a residual connection module and an output module;

Step 4: construct a training data set for the deep learning model;

Step 5: train the deep learning model using the training data set;

Step 6: perform vulnerability detection on an input smart contract using the trained deep learning model, and output the security detection result of the Solidity source program of the smart contract.

Further, according to the smart contract security detection method based on static analysis and deep learning, step 1 specifically includes the following steps:

Step 1.1: preprocess the Solidity source program of the smart contract, deleting all content irrelevant to the security detection of the Solidity source program;

Step 1.2: import the source code files referenced by import statements into the preprocessed Solidity source program to obtain a complete Solidity source program;

Step 1.3: convert the complete Solidity source program into an abstract syntax tree using the ANTLR parser;

Step 1.4: construct the control flow graph (CFG) of the Solidity source program from the abstract syntax tree.

Further, according to the smart contract security detection method based on static analysis and deep learning, step 1.3 specifically includes the following steps:

Step 1.3.1: perform lexical analysis on the complete Solidity source program using the ANTLR parser, and label the attribute of each word in the Solidity source program according to predefined word attribute categories, obtaining for each program statement a word sequence with word attribute labels;

Step 1.3.2: perform syntactic analysis with the ANTLR parser on the word sequence of each program statement produced by the lexical analysis, and determine the syntax structure of each program statement according to predefined grammar rules; the syntax structures comprise the contract structure, function structure, variable structure, expression structure and statement control flow structure;

Step 1.3.3: convert the Solidity source program into an abstract syntax tree using the ANTLR parser according to the syntax structure of each program statement.

Further, according to the smart contract security detection method based on static analysis and deep learning, the word attribute categories include keyword <keyword>, visibility qualifier <qualifier>, variable data type <variabletype>, identifier <identifier>, operator <operator> and constant <constant>.

Further, according to the smart contract security detection method based on static analysis and deep learning, the predefined grammar rules are as follows:

a) Contract ::= "contract" <identifier> "{" [ContractBlock] "}" ;

b) ContractBlock ::= [Function] | [Variable] ;

c) Function ::= "function" <identifier> "(" [Variable] ")" <qualifier> [keyword] [Return] [";" | Block] ;

d) Variable ::= <variabletype> <qualifier> <identifier> ["=" Expression] ";" ;

e) Expression ::= Functioncall | <identifier> | Expression <operator> | Expression <operator> Expression | <identifier> <operator> <constant> ;

f) Functioncall ::= Expression "(" Variable ")" ;

g) Block ::= "{" Statement "}" ;

i) IfStatement ::= "if" "(" Expression ")" Block ["else" Block] ;

j) WhileStatement ::= "while" "(" Expression ")" Block ;

k) ForStatement ::= "for" "(" [Variable] ";" [Expression] ";" [Expression] ")" Block ;

l) Return ::= "return" [Expression] .

Further, according to the smart contract security detection method based on static analysis and deep learning, step 1.4 specifically includes the following steps:

Step 1.4.1: for program statements belonging to the statement control flow structure, construct basic blocks according to the Block nodes in the abstract syntax tree (AST), record the statement number StmtId of each statement in each basic block, and record the incoming and outgoing edges of each basic block;

Step 1.4.2: connect the basic blocks: two basic blocks are connected when the outgoing edge of one equals the incoming edge of the other, and when a basic block has more than one outgoing edge, its jump condition is recorded;

Step 1.4.3: record the variable number VarId and the assignment operation Assign in each program statement using static single assignment (SSA) form, in which each variable is assigned only once and a variable that is assigned a second time is renamed.
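
As an illustrative sketch of steps 1.4.1 and 1.4.2 only, the following Python fragment links basic blocks whose outgoing and incoming edges match and records a jump condition only for blocks that branch. The `BasicBlock` class, the edge labels `e1` to `e3` and the condition strings are hypothetical names chosen for illustration and do not come from the method itself.

```python
from dataclasses import dataclass, field


@dataclass
class BasicBlock:
    block_id: str
    stmt_ids: list                                  # StmtId of each statement in the block
    in_edges: set = field(default_factory=set)      # labels of incoming edges
    out_edges: dict = field(default_factory=dict)   # edge label -> jump condition or None


def link_blocks(blocks):
    """Return (source, target, condition) triples for every matching edge."""
    paths = []
    for src in blocks:
        for edge, condition in src.out_edges.items():
            for dst in blocks:
                if edge in dst.in_edges:
                    # A jump condition is recorded only when the block branches,
                    # that is, when it has more than one outgoing edge.
                    cond = condition if len(src.out_edges) > 1 else None
                    paths.append((src.block_id, dst.block_id, cond))
    return paths


b0 = BasicBlock("B00", ["S00"], out_edges={"e1": "x > 0", "e2": "!(x > 0)"})
b1 = BasicBlock("B01", ["S01"], in_edges={"e1"}, out_edges={"e3": None})
b2 = BasicBlock("B02", ["S02"], in_edges={"e2", "e3"})
paths = link_blocks([b0, b1, b2])
```

Here `paths` contains one triple per connection between basic blocks, which is the information later expressed as block-path facts.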

Further, according to the smart contract security detection method based on static analysis and deep learning, the abstract facts, which contain all control flow information, data information and function information of the smart contract, are written in the Datalog language and are structured as follows:

predicate(arg1, ..., argn)

where predicate is defined according to the structure of the Solidity source program and covers data types, function types, expression structures and control flow structures; arg1, ..., argn are the other parameters related to the content of the specific Solidity program statement.

Further, according to the smart contract security detection method based on static analysis and deep learning, the abstract facts are extracted from the graph structure of the Solidity source program obtained in step 1 as follows: traverse the graph structure of the Solidity source program and extract its abstract facts by keyword matching.

Further, according to the smart contract security detection method based on static analysis and deep learning, step 3 specifically includes the following steps:

Step 3.1: build the input module: represent the abstract facts obtained in step 2 with a 0-1 coding matrix X, apply word embedding and position embedding to the abstract facts represented by X, and splice the matrix obtained from word embedding with the matrix obtained from position embedding to obtain the E matrix, which is the input of the attention module;

Step 3.2: build the attention module, specifically as follows:

Step 3.2.1: obtain the Q matrix, K matrix and V matrix of the abstract facts by applying three linear transformations to the E matrix, and obtain the attention coefficient matrix A of the abstract facts according to formula (4);

A = QK^T (4)

where the Q matrix is the Query matrix of the abstract facts and consists of the Query vectors corresponding to each word of each abstract fact; the K matrix is the Key matrix of the abstract facts and consists of the Key vectors corresponding to each word of each abstract fact; the V matrix is the Value matrix of the abstract facts and consists of the Value vectors corresponding to each word of each abstract fact;

Step 3.2.2: update the element values in the V matrix according to formula (5), using the attention coefficient matrix A of the abstract facts, to obtain the updated V matrix V′;

V′ = softmax(A / √d_k) V (5)

where d_k denotes the arithmetic sum of squares of the K matrix; the softmax function is an activation function;

Step 3.2.3: apply a layer normalization mechanism to the matrix V′ of the attention module, yielding V″, so that the elements of the matrix are normalized, which accelerates convergence and keeps the feature distribution stable;

Step 3.3: build the residual connection module, whose matrix calculation formula is:

Z = H(E) = E + F(E) = E + V″ (9)

where the matrix E is the input of the attention module; V″ is the output of the attention module; Z is the output of the residual connection module; F is the residual function. In the attention module, the mapping H(E) → Z is obtained through back propagation; if there is no residual connection module, F(E) → 0;

Step 3.4: build the output module, which outputs the probability of each vulnerability that may exist in the abstract facts; the output module is built as follows:

Step 3.4.1: define the vulnerability category output formula shown in formula (10), which outputs the vulnerability category result for the abstract facts of the smart contract;

P_k = softmax(Linear(Z)) (10)

where Linear denotes a linear function that applies one linear transformation to the matrix Z, and P_k is the probability value of each vulnerability type;

Step 3.4.2: construct the loss function of the deep learning model so that the model acquires vulnerability classification capability.

Further, according to the smart contract security detection method based on static analysis and deep learning, the loss function is the multi-class cross entropy loss function shown in formula (11):

Loss₁ = −∑_k y_k log(P_k) (11)

where y_k is the one-hot encoded label corresponding to the abstract facts, and k denotes the vulnerability category corresponding to the abstract facts.

Compared with the prior art, the invention has the following beneficial effects:

1. The behavior of the smart contract Solidity source program can be analyzed comprehensively. Security detection of a smart contract first requires a comprehensive analysis of its code behavior. The method analyzes the abstract syntax tree and the control flow graph of the Solidity source program and then abstracts the graph structure into a fact representation, so that the abstract facts cover the code behavior more comprehensively, represent the semantic features of the program effectively, and provide support for the subsequent deep learning model.

2. The extensibility of security detection for the smart contract Solidity source program is enhanced. Traditional security detection methods are mainly based on predefined rules and focus only on known security vulnerabilities. The deep learning model used by the method is not limited to specific security vulnerabilities: it can be trained on a supplemented training set to detect additional kinds of vulnerabilities, and is therefore easy to extend. In addition, for previously unknown vulnerabilities, the method gains the ability to detect them simply by retraining the model, so it is more extensible than traditional security detection methods.

3. The accuracy of security detection for the smart contract Solidity source program is improved. The method combines static analysis with deep learning to perform security detection on smart contracts, and improves on existing deep learning models by adding an attention module that learns the key information in the abstract facts. By improving the vectorized representation of the abstract facts, it effectively increases the accuracy of security detection classification and reduces the false negative rate for security vulnerabilities.

Drawings

FIG. 1 is a schematic flow chart of the smart contract security detection method based on static analysis and deep learning according to the present invention;

FIG. 2 is a diagram of an abstract syntax tree of example code in an embodiment of the present invention;

FIG. 3 is a diagram of a deep learning model architecture in an embodiment of the present invention;

FIG. 4 is a schematic diagram of an attention module according to an embodiment of the invention.

Detailed Description

Embodiments of the invention are described in detail below with reference to the accompanying drawings. The following examples are intended only to illustrate the invention, not to limit its scope.

FIG. 1 is a schematic flow chart of the smart contract security detection method based on static analysis and deep learning according to the present invention. The method includes the following steps:

Step 1: perform static analysis on the Solidity source program of the smart contract to obtain a graph structure of the Solidity source program; the static analysis comprises lexical analysis and syntactic analysis; the graph structure includes an abstract syntax tree (AST) and a control flow graph (CFG).

In a preferred embodiment, preprocessing the Solidity source program of the smart contract includes deleting single-line comments ("// ..."), multi-line comments ("/* ... */"), spaces (" "), line breaks ("\n"), and all other content irrelevant to the security detection of the Solidity source program.
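
The preprocessing described above can be sketched roughly as follows. This is a minimal illustration using regular expressions, not the preprocessing of the embodiment itself: the function name `preprocess` and the sample contract are hypothetical, and the sketch collapses whitespace to single spaces rather than deleting every space, an assumption made so that tokens such as `uint x` remain separable. It also does not protect string literals that happen to contain `//` or `/*`.

```python
import re


def preprocess(source: str) -> str:
    """Strip comments and redundant whitespace from Solidity source text."""
    # Remove /* ... */ block comments (non-greedy, spanning lines).
    source = re.sub(r"/\*.*?\*/", " ", source, flags=re.DOTALL)
    # Remove // line comments up to the end of the line.
    source = re.sub(r"//[^\n]*", " ", source)
    # Collapse runs of spaces, tabs and line breaks into one space.
    return re.sub(r"\s+", " ", source).strip()


raw = """
// simple storage
contract C {
    uint storedData; /* state
    variable */
    function set(uint x) public { storedData = x; }
}
"""
clean = preprocess(raw)
```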

Step 1.2: import the source code files referenced by import statements into the preprocessed Solidity source program to obtain the complete Solidity source program.

Step 1.3: convert the complete Solidity source program into an abstract syntax tree using the ANTLR parser.

The word attribute categories include keyword <keyword>, visibility qualifier <qualifier>, variable data type <variabletype>, identifier <identifier>, operator <operator> and constant <constant>.
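
To make the labeling of word attributes concrete, the following hand-written Python tokenizer assigns each word of a statement to one of the six categories above. It is only a sketch and is not the ANTLR-based lexer of the embodiment: the word lists `KEYWORDS`, `QUALIFIERS` and `VARIABLE_TYPES` are small hypothetical samples, and the extra label `punctuation` is an assumption added here for delimiters.

```python
import re

KEYWORDS = {"contract", "function", "return", "if", "else", "while", "for"}
QUALIFIERS = {"public", "private", "internal", "external"}
VARIABLE_TYPES = {"uint", "int", "bool", "address", "string"}

# Identifiers/words, integer constants, two-character and one-character
# operators, then delimiters.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|==|!=|<=|>=|[-+*/=<>!]|[{}();,]")


def label(token: str) -> str:
    """Map one token to its word attribute category."""
    if token in KEYWORDS:
        return "keyword"
    if token in QUALIFIERS:
        return "qualifier"
    if token in VARIABLE_TYPES:
        return "variabletype"
    if token.isdigit():
        return "constant"
    if re.fullmatch(r"[A-Za-z_]\w*", token):
        return "identifier"
    if re.fullmatch(r"==|!=|<=|>=|[-+*/=<>!]", token):
        return "operator"
    return "punctuation"  # assumption: delimiters get their own label


def tokenize(statement: str):
    """Return the word sequence of a statement with attribute labels."""
    return [(tok, label(tok)) for tok in TOKEN_RE.findall(statement)]


tokens = tokenize("uint public total = 10;")
```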

In a preferred embodiment, the grammar rules, predefined in BNF (Backus-Naur Form) according to the properties of the Solidity language, are as follows:

m) Contract ::= "contract" <identifier> "{" [ContractBlock] "}" ;

n) ContractBlock ::= [Function] | [Variable] ;

o) Function ::= "function" <identifier> "(" [Variable] ")" <qualifier> [keyword] [Return] [";" | Block] ;

p) Variable ::= <variabletype> <qualifier> <identifier> ["=" Expression] ";" ;

q) Expression ::= Functioncall | <identifier> | Expression <operator> | Expression <operator> Expression | <identifier> <operator> <constant> ;

r) Functioncall ::= Expression "(" Variable ")" ;

s) Block ::= "{" Statement "}" ;

u) IfStatement ::= "if" "(" Expression ")" Block ["else" Block] ;

v) WhileStatement ::= "while" "(" Expression ")" Block ;

w) ForStatement ::= "for" "(" [Variable] ";" [Expression] ";" [Expression] ")" Block ;

x) Return ::= "return" [Expression] .

Step 1.3.3: convert the Solidity source program into an abstract syntax tree using the ANTLR parser according to the syntax structure of each program statement;

For example, the code shown below is converted by the ANTLR parser into the abstract syntax tree shown in FIG. 2.

Step 1.4: construct the control flow graph (CFG) of the Solidity source program from the abstract syntax tree, specifically as follows:

Step 1.4.3: record the variable number VarId and the assignment operation Assign in each program statement using static single assignment form (SSA form), in which each variable is assigned only once and a variable that is assigned a second time is renamed.

For example, for the assignment operations "x = 1; y = x + 1; x = y;", the static single assignment form is "x1 = 1; y = x1 + 1; x2 = y;". Recording the assignment operations of variables in static single assignment form facilitates the subsequent analysis of abstract facts.
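
The renaming in this example can be reproduced with the short Python sketch below. It handles only straight-line sequences of simple `name = expression` statements and performs no merging at control flow joins, so it illustrates the idea rather than a full SSA construction; the function name `to_ssa` is hypothetical. Following the example, a variable that is assigned only once keeps its original name.

```python
import re
from collections import Counter


def to_ssa(statements):
    """Rename multiply-assigned variables so each name is assigned once."""
    targets = [s.split("=", 1)[0].strip() for s in statements]
    assigned_count = Counter(targets)
    version = Counter()   # number of assignments seen so far per variable
    current = {}          # variable -> name of its latest version
    result = []
    for stmt in statements:
        lhs, rhs = (part.strip() for part in stmt.split("=", 1))
        # Uses on the right-hand side refer to the latest version.
        rhs = re.sub(r"[A-Za-z_]\w*",
                     lambda m: current.get(m.group(), m.group()), rhs)
        if assigned_count[lhs] > 1:
            version[lhs] += 1
            current[lhs] = f"{lhs}{version[lhs]}"
        else:
            current[lhs] = lhs
        result.append(f"{current[lhs]} = {rhs}")
    return result


ssa = to_ssa(["x = 1", "y = x + 1", "x = y"])
```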

Step 2: extract abstract facts from the graph structure of the Solidity source program obtained in step 1; specifically, traverse the graph structure of the Solidity source program and extract its abstract facts by keyword matching.

The abstract facts are written in the Datalog language and contain all control flow information, data information and function information of the smart contract; this information constitutes the key features related to security vulnerabilities.

In a preferred embodiment, the abstract facts are structured as follows:

predicate(arg1, ..., argn)

where predicate is the name of the corresponding predicate defined according to the structure of the Solidity source program, and arg1, ..., argn are the other parameters related to the content of the specific Solidity program statement.

In the preferred embodiment, there are four kinds of predicates: data type, function type, expression structure and control flow structure. The specific predicate names and parameters are defined as follows:

Traverse all nodes of the AST of the Solidity source program. For a Variable node (data type), the predicate name is VarDecl; for a Function node (function type), the predicate name is FunDecl; for a function call node Functioncall in an expression structure, the predicate name is FunCall, except that a call to a special function keeps the original function name as its predicate name. The special functions are the address-related functions call, delegatecall, send and transfer, and the error handling functions revert, assert and require. The parameters are the related statement number, the variable number and the parameters of all leaf nodes under the node.

Traverse the control flow graph of the Solidity source program. For an assignment operation Assign between variables, the predicate name is VarAss and the parameters are the corresponding statement number StmtId and the related variable numbers VarId; for statements in the same basic block, the predicate name is Block and the parameters are the basic block number BlockId and the statement number StmtId; when a path exists between basic blocks, the predicate name is BlockPath and the parameters are the corresponding basic block numbers BlockId.

For example, the abstract facts extracted by traversing the graph structure generated by the example code in step 1.3.2 are as follows:

VarDecl(StmtId='S00',VarId='V00',variabletype='uint',identifier='storedData')

Block(BlockId='B00',StmtId='S00')

FunDecl(identifier='set',VarId='V01',qualifier='public')

VarDecl(VarId='V01',variabletype='uint',identifier='x')

Block(BlockId='B01',StmtId='S01')

VarAss(StmtId='S01',VarId='V00',VarId='V01')
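
As a rough sketch of how facts of this shape could be emitted while traversing the graph structure, the Python fragment below maps node kinds to predicate names and renders each fact as text. The flattened node list, the `PREDICATE_BY_KIND` table and the parameters chosen for the `transfer` call are hypothetical inputs for illustration; a real traversal would walk the AST and CFG produced in step 1.

```python
PREDICATE_BY_KIND = {
    "Variable": "VarDecl",
    "Function": "FunDecl",
    "Functioncall": "FunCall",
    "Assign": "VarAss",
}
# Special functions keep their original name as the predicate name.
SPECIAL_CALLS = {"call", "delegatecall", "send", "transfer",
                 "revert", "assert", "require"}


def render_fact(predicate, args):
    """Render one fact as predicate(name='value',...)."""
    body = ",".join(f"{name}='{value}'" for name, value in args)
    return f"{predicate}({body})"


def extract_facts(nodes):
    """Turn (node kind, parameter list) pairs into abstract fact strings."""
    facts = []
    for kind, args in nodes:
        callee = dict(args).get("identifier")
        if kind == "Functioncall" and callee in SPECIAL_CALLS:
            predicate = callee
        else:
            predicate = PREDICATE_BY_KIND[kind]
        facts.append(render_fact(predicate, args))
    return facts


# Parameters are kept as ordered pairs because a fact such as VarAss
# repeats the VarId parameter name.
nodes = [
    ("Variable", [("StmtId", "S00"), ("VarId", "V00"),
                  ("variabletype", "uint"), ("identifier", "storedData")]),
    ("Assign", [("StmtId", "S01"), ("VarId", "V00"), ("VarId", "V01")]),
    ("Functioncall", [("StmtId", "S02"), ("identifier", "transfer")]),
]
facts = extract_facts(nodes)
```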

Step 3: according to the abstract facts of the Solidity source program obtained in step 2, build a deep learning model for vulnerability classification of the Solidity source program;

In a preferred embodiment, the deep learning model is designed on the basis of the Transformer model and, as shown in FIG. 3, includes four modules: an input module, an attention module, a residual connection module and an output module. The deep learning model is built as follows:

Step 3.1: build the input module: vectorize the abstract facts obtained in step 2 by representing the input abstract facts with a 0-1 coding matrix X. Because the 0-1 coding matrix X is too sparse, dimension reduction is applied, namely word embedding and position embedding of the abstract facts represented by X; the matrix obtained after dimension reduction is the input required by the attention module. The specific steps are as follows:

Step 3.1.1: apply word embedding to the abstract facts represented by the 0-1 coding matrix X according to formula (1) to obtain the word matrix X′:

X′_{l×d} = tanh(X_{l×v} W₁) (1)

where W₁ is a parameter matrix to be trained in the input module; l is the number of rows of the longest abstract fact among the abstract facts of the different Solidity source programs; v is the vocabulary size of the abstract facts; d is the dimension of a word after dimension reduction.

Step 3.1.2: performing position embedding processing on abstract facts represented by a 0-1 coding matrix X;

in order to ensure that the deep learning model can better acquire the position information of the abstract facts, the input module introduces a position coding mechanism of the abstract facts, namely position embedding.

In the preferred embodiment, the position information of each statement in the abstract facts is represented by a matrix P, and an activation function is applied to P according to formula (2) to obtain the position coding matrix P′:

P′_{l×d} = tanh(P_{l×d}) (2)

The matrix P is initialized randomly before training; after training, the position coding matrix P′ consists of the position vector corresponding to each position.

Step 3.1.3: for the abstract facts of a smart contract, splice the position coding matrix P′ and the word matrix X′ according to formula (3) to obtain the E matrix, which is the input of the attention module.

E = Concat(X′, P′) (3)
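
The input module of steps 3.1.1 to 3.1.3 can be sketched in plain Python as follows. This is an untrained forward pass under stated assumptions: W₁ and P are filled with random values in place of trained parameters, the splice is assumed to be a concatenation along the feature axis (giving an l × 2d matrix), and padding of shorter abstract facts to length l is omitted. The helper names are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def tanh_matrix(m):
    return [[math.tanh(v) for v in row] for row in m]


def one_hot(indices, vocab_size):
    """Build the 0-1 coding matrix X (one row per word)."""
    return [[1.0 if j == i else 0.0 for j in range(vocab_size)]
            for i in indices]


def input_module(token_ids, vocab_size, d, rng):
    l = len(token_ids)
    X = one_hot(token_ids, vocab_size)                       # l x v
    W1 = [[rng.uniform(-1, 1) for _ in range(d)]
          for _ in range(vocab_size)]                        # v x d
    P = [[rng.uniform(-1, 1) for _ in range(d)]
         for _ in range(l)]                                  # l x d
    X_emb = tanh_matrix(matmul(X, W1))   # formula (1)
    P_emb = tanh_matrix(P)               # formula (2)
    # Assumed splice: concatenate each row of X' with the matching row of P'.
    return [x_row + p_row for x_row, p_row in zip(X_emb, P_emb)]


E = input_module([0, 2, 1, 2], vocab_size=3, d=4, rng=random.Random(0))
```

Rows that carry the same word share the same word-embedding half, while the position half differs from row to row.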

Step 3.2: constructing an attention module, wherein a schematic diagram of the attention module is shown in FIG. 4;

the attention module is the core of the deep learning model. Through the attention mechanism of the module, attention coefficients among abstract fact words can be calculated, and the vector corresponding to each word of each abstract fact contains information of vectors corresponding to other words, so that key information in the abstract facts can be better acquired. The principle of the attention mechanism is that the attention coefficient between each word and other words in the abstract fact is obtained by matrix multiplication.

In a preferred embodiment, the specific steps of building the attention module are as follows:

Step 3.2.1: calculate the attention coefficients between the words of the abstract facts to obtain the attention coefficient matrix of the abstract facts;

In the preferred embodiment, the attention coefficients are calculated in a manner similar to BERT, involving three matrices: the Q matrix, the K matrix and the V matrix. The Q matrix is the Query matrix of the abstract facts and consists of the Query vectors corresponding to each word of each abstract fact; the K matrix is the Key matrix of the abstract facts and consists of the Key vectors corresponding to each word of each abstract fact; the V matrix is the Value matrix of the abstract facts and consists of the Value vectors corresponding to each word of each abstract fact. The three matrices are obtained from the matrix E by three separate linear transformations, whose values are random in the initial state, and they acquire representational meaning after training.

The attention coefficient matrix of the abstract facts is obtained according to formula (4):

A = QK^T (4)

Step 3.2.2: update the element values in the V matrix according to the attention coefficient matrix A of the abstract facts to obtain the updated V matrix V′;

In a preferred embodiment, after the attention coefficient matrix A is obtained, the element values in the V matrix are updated according to formula (5) to obtain the updated V matrix V′:

V′ = softmax(A / √d_k) V (5)

where d_k denotes the arithmetic sum of squares of the K matrix; in formula (5) it scales the values, which were enlarged quadratically by the multiplication, back to their original magnitude, and it reduces the jitter of gradient updates during back propagation. Softmax is an activation function; it adds a nonlinear transformation that strengthens the representational capability of the V′ matrix.
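
A minimal sketch of steps 3.2.1 and 3.2.2 is given below. It assumes that formula (5) has the scaled dot-product form V′ = softmax(A / √d_k) V used in Transformer-style attention, and it takes d_k to be the dimension of the Key vectors, which is the usual convention and an assumption here. The projection matrices `Wq`, `Wk` and `Wv` are random stand-ins for trained parameters.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Apply a numerically stable softmax to each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention(E, Wq, Wk, Wv):
    Q, K, V = matmul(E, Wq), matmul(E, Wk), matmul(E, Wv)
    A = matmul(Q, transpose(K))              # formula (4): A = Q K^T
    d_k = len(K[0])                          # assumed: Key vector dimension
    scaled = [[a / math.sqrt(d_k) for a in row] for row in A]
    weights = softmax_rows(scaled)
    return matmul(weights, V), weights       # assumed form of formula (5)


rng = random.Random(0)


def rand(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


E = rand(3, 4)
V_new, weights = attention(E, rand(4, 4), rand(4, 4), rand(4, 4))
```

Each row of `weights` sums to 1, so every row of `V_new` is a weighted mixture of the Value vectors of all words.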

Step 3.2.3: add a layer normalization mechanism to the matrix V′ of the attention module;

The layer normalization mechanism takes the inputs of all dimensions of the matrix V′ into account, computes their mean and variance, and then transforms the input of every dimension with the same normalization operation. The mean of all elements of the V′ matrix is:

μ^(v) = (1 / n^(v)) ∑_i v_i (6)

The variance of all elements of the V′ matrix is:

(σ^(v))² = (1 / n^(v)) ∑_i (v_i − μ^(v))² (7)

where n^(v) is the number of elements in V′, μ^(v) is the mean, (σ^(v))² is the variance, and σ^(v) is the standard deviation. Each element v_i of the matrix V′ is normalized according to formula (8):

v_i′ = (v_i − μ^(v)) / σ^(v) (8)

In the above formula, v_i′ is the normalized value of each element v_i of the matrix V′; the normalized elements form the matrix V″.
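
The normalization over all elements of the matrix can be illustrated as follows. The sketch assumes the plain form (v_i − μ) / σ with the mean and variance taken over every element of the matrix, as the text describes; the small constant `eps` is an added numerical safeguard and is not part of the description.

```python
import math


def layer_norm(matrix, eps=1e-8):
    """Normalize all elements of a matrix to zero mean and unit variance."""
    values = [v for row in matrix for v in row]
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    std = math.sqrt(variance + eps)   # eps avoids division by zero (assumption)
    return [[(v - mean) / std for v in row] for row in matrix]


normalized = layer_norm([[1.0, 2.0], [3.0, 4.0]])
flat = [v for row in normalized for v in row]
out_mean = sum(flat) / len(flat)
out_var = sum((v - out_mean) ** 2 for v in flat) / len(flat)
```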

Step 3.3: build the residual connection module;

In the preferred embodiment, because the vocabulary of the source input (the abstract facts) of the deep learning model is small, the attention module may capture the relationships between words excessively; the residual connection module is added to mitigate this problem to some extent.

In a preferred embodiment, the matrix calculation formula of the residual connection module is as follows:

Z = H(E) = E + F(E) = E + V″ (9)

where the matrix E is the input of the attention module and V″ is the output of the attention module; adding these two matrices gives the output Z of the residual connection module. F is the residual function. In the attention module, the mapping H(E) → Z is obtained through back propagation; if there is no residual connection module, F(E) → 0.
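
Formula (9) is an element-wise sum, which requires the attention output to have the same shape as E. The short sketch below makes that constraint explicit; the function name and the sample values are hypothetical.

```python
def residual_connection(E, V_norm):
    """Compute Z = E + V'' element-wise, as in formula (9)."""
    same_shape = len(E) == len(V_norm) and all(
        len(e_row) == len(v_row) for e_row, v_row in zip(E, V_norm))
    if not same_shape:
        raise ValueError("E and V'' must have the same shape")
    return [[e + v for e, v in zip(e_row, v_row)]
            for e_row, v_row in zip(E, V_norm)]


E = [[1.0, 2.0], [3.0, 4.0]]
Z = residual_connection(E, [[0.5, -0.5], [0.0, 1.0]])
# When the attention branch contributes nothing, the input passes through.
Z_identity = residual_connection(E, [[0.0, 0.0], [0.0, 0.0]])
```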

Step 3.4: building an output module;

the output module is used for outputting the possible vulnerability probability of the abstract fact and maximizing the security vulnerability detection capability of the deep learning model according to the loss function.

In a preferred embodiment, the specific steps of constructing the output module are as follows:

Step 3.4.1: define the vulnerability category output formula shown in formula (10), which outputs the vulnerability category result for the abstract facts of the smart contract.

P_k = softmax(Linear(Z)) (10)

where Linear denotes a linear function, that is, one linear transformation applied to the matrix Z; the softmax function is an activation function; and P_k is the probability value of each vulnerability type.

Step 3.4.2: construct the loss function of the deep learning model; the loss function gives the model its vulnerability classification capability and is the multi-class cross entropy loss function shown in formula (11).

Loss₁ = −∑_k y_k log(P_k) (11)
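
Formulas (10) and (11) can be illustrated with the following sketch. The description does not state how the matrix Z is reduced to one score per vulnerability category, so the sketch assumes that the rows of Z are averaged into a single feature vector before the linear transformation; this pooling step, the weight matrix `W` and the bias `b` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def output_module(Z, W, b):
    # Assumption: average the rows of Z into one feature vector.
    pooled = [sum(col) / len(Z) for col in zip(*Z)]
    # Linear(Z): one score per vulnerability category.
    logits = [sum(p * w for p, w in zip(pooled, w_col)) + bias
              for w_col, bias in zip(zip(*W), b)]
    return softmax(logits)                    # formula (10)


def cross_entropy(y_onehot, probs):
    # Formula (11): Loss = -sum_k y_k * log(P_k); zero labels are skipped.
    return -sum(y * math.log(p) for y, p in zip(y_onehot, probs) if y)


Z = [[1.0, 0.0], [0.0, 1.0]]
W = [[1.0, 0.0, -1.0], [0.0, 1.0, -1.0]]      # 2 features -> 3 categories
P = output_module(Z, W, b=[0.0, 0.0, 0.0])
loss = cross_entropy([1, 0, 0], P)
```

With a one-hot label, the loss reduces to the negative log probability that the model assigns to the labeled category.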

And 4, step 4: constructing a training data set of a deep learning model;

vulnerability detection problems can be considered as multi-classification problems in machine learning. Because the classification problem belongs to supervised learning, data (relevance program) and tags for data (vulnerability type) are required. Therefore, the construction of the training data set of the deep learning model comprises the steps of acquiring data and labeling the data with label types.

In the preferred embodiment, the program files of 1500 real-world Ethereum smart contracts are first collected. The 1500 program files are then manually labeled according to the SWC Registry's definitions of smart contract vulnerabilities to construct the training data set of the deep learning model. The SWC Registry is currently a mainstream standard library for annotating smart contract vulnerabilities. It is built by Ethereum Security and developers in the Smart Contract Security organization. The library provides a classification of Ethereum smart contract security vulnerabilities, partial test cases, and the consequences caused by the vulnerabilities. The number and proportion of vulnerabilities of each category in the training data set are shown in Table 2.

Table 2 Vulnerability counts and proportions

Vulnerability category	Number	Proportion
Reentrancy vulnerability	1014	67.6%
Timestamp dependency vulnerability	715	46.7%
Infinite loop vulnerability	326	21.7%
No vulnerability	293	19.5%

Step 5: Train the deep learning model using the training data set.

In a preferred embodiment, the training of the deep learning model is divided into two steps. The first step is pre-training (Pre-train), which aims to reduce the value of the loss function of the deep learning model rapidly. The second step is fine-tuning training (Finetune Train), which aims to further improve the security detection capability of the deep learning model. Combining pre-training with fine-tuning training gives the deep learning model better robustness and extensibility.

In a preferred embodiment, a Jupyter Notebook platform with GPU resources is used for pre-training and fine-tuning training of the deep learning model. During pre-training, the batch size is set to 16, the number of epochs is set to 80, and Adam is selected as the optimizer; pre-training stops and fine-tuning training starts when the loss value stabilizes at 1. During fine-tuning training, the batch size is set to 4, the number of epochs is set to 20, and SGD is selected as the optimizer; fine-tuning training stops when the loss value stabilizes at 0.1. After pre-training and fine-tuning training, the deep learning model is able to classify vulnerabilities in smart contracts.
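The two-phase schedule can be summarized as a small configuration sketch. The hyperparameters come from the paragraph above; the stopping rule is an assumption, since the text does not define what a stable loss is. Here the loss counts as settled once the last `window` epoch losses are all at or below the phase's target. The names `Phase`, `loss_has_settled`, and `epochs_to_run`, and the loss trace, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    name: str
    batch_size: int
    max_epochs: int
    optimizer: str    # optimizer name only; no training framework is assumed here
    stop_loss: float  # loss level at which the phase stops


PRETRAIN = Phase("pre-train", 16, 80, "Adam", 1.0)
FINETUNE = Phase("fine-tune", 4, 20, "SGD", 0.1)


def loss_has_settled(history, stop_loss, window=3):
    """Assumed rule: the last `window` epoch losses are all at or below stop_loss."""
    recent = history[-window:]
    return len(recent) == window and all(v <= stop_loss for v in recent)


def epochs_to_run(phase, epoch_losses):
    """Count the epochs a phase would run over a given loss trace."""
    history = []
    for epoch, loss in enumerate(epoch_losses[:phase.max_epochs], start=1):
        history.append(loss)
        if loss_has_settled(history, phase.stop_loss):
            return epoch
    return len(history)


trace = [2.4, 1.6, 1.1, 0.98, 0.97, 0.96, 0.95]  # made-up losses, not measured data
```

In a real run the per-epoch losses would come from the training loop, and the window size would be tuned to how noisy the loss curve is.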

Step 6: Perform vulnerability detection on the input smart contract using the trained deep learning model, and output the security detection result of the solidity source program of the smart contract.

The trained deep learning model is used to detect vulnerabilities in the smart contract, and its output is a probability value for each vulnerability type. If an output probability value is greater than or equal to 0.5, the smart contract is considered to have that vulnerability; if it is less than 0.5, the vulnerability is considered absent. The method can thus detect the security of smart contracts effectively and automatically.
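The decision rule in step 6 amounts to a threshold on each output probability. A minimal sketch, with the function name `flag_vulnerabilities` and the sample scores chosen for illustration only:

```python
def flag_vulnerabilities(probabilities, threshold=0.5):
    """Map each vulnerability type to True when its probability is >= threshold."""
    return {name: p >= threshold for name, p in probabilities.items()}


# Illustrative model output, not a measured result.
scores = {"reentrancy": 0.50, "timestamp dependency": 0.30, "infinite loop": 0.20}
report = flag_vulnerabilities(scores)
```

Note that the comparison is inclusive, so a probability of exactly 0.5 is reported as a vulnerability.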

It is to be understood that the above-described embodiments are only a few embodiments of the present invention, and not all embodiments. The above examples are only for explaining the present invention and do not constitute a limitation to the scope of protection of the present invention. All other embodiments, which can be derived by those skilled in the art from the above-described embodiments without any creative effort, namely all modifications, equivalents, improvements and the like made within the spirit and principle of the present application, fall within the protection scope of the present invention claimed.

Claims

1. A smart contract security detection method based on static analysis and deep learning, characterized in that it comprises the following steps:

Step 1: statically analyze the solidity source program of the smart contract to obtain a graph structure of the solidity source program of the smart contract; the static analysis includes lexical analysis and syntax analysis; the graph structure includes an abstract syntax tree AST and a control flow graph CFG;

Step 2: Extract abstract facts from the graph structure of the solidity source program obtained in step 1;

Step 3: According to the abstract facts of the solidity source program obtained in step 2, build a deep learning model for classifying vulnerabilities in the solidity source program; the deep learning model includes an input module, an attention module, a residual connection module and an output module;

Step 4: construct the training data set of the deep learning model;

Step 5: using the training data set to train the deep learning model;

Step 6: Use the trained deep learning model to perform vulnerability detection on the input smart contract, and output the security detection result of the solidity source program of the smart contract.

2. The smart contract security detection method based on static analysis and deep learning according to claim 1, characterized in that step 1 specifically comprises the following steps:

Step 1.1: Preprocess the solidity source program of the smart contract, and delete all content unrelated to the security detection of the solidity source program;

Step 1.2: Import the source code file corresponding to the import statement to the preprocessed smart contract solidity source program to obtain the complete solidity source program;

Step 1.3: For the complete solidity source program, use the ANTLR analyzer to convert the solidity source program into an abstract syntax tree;

Step 1.4: Construct the control flow graph CFG of the solidity source program according to the abstract syntax tree.

3. The smart contract security detection method based on static analysis and deep learning according to claim 2, characterized in that step 1.3 specifically comprises the following steps:

Step 1.3.1: Use the ANTLR analyzer to perform lexical analysis on the complete solidity source program, mark the attributes of the words in the solidity source program according to the predefined word attribute categories, and obtain the attribute-annotated word sequence corresponding to each program statement;

Step 1.3.2: For the word sequence corresponding to each program statement generated by the lexical analysis, use the ANTLR analyzer to perform syntax analysis, and determine the grammatical structure of each program statement according to predefined grammar rules; the grammatical structure includes contract structure, function structure, variable structure, expression structure and statement control flow structure;

Step 1.3.3: According to the grammatical structure of each program statement, use the ANTLR parser to convert the solidity source program into an abstract syntax tree.

4. The smart contract security detection method based on static analysis and deep learning according to claim 3, wherein the word attribute category includes keyword <keyword>, visibility definer <qualifier>, variable data type <variabletype>, identifier <identifier>, operator <operator>, and constant <constant>.
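The word attribute categories can be illustrated with a small regular-expression tagger. This is a simplified stand-in for the ANTLR lexical analysis named in the claims, not the ANTLR lexer itself; the word lists are small assumed subsets of Solidity, and the extra `delimiter` tag for punctuation is an addition of this sketch.

```python
import re

# Assumed, deliberately incomplete word lists for illustration.
KEYWORDS = {"contract", "function", "return", "returns", "if", "else", "while", "for"}
QUALIFIERS = {"public", "private", "internal", "external"}
VARIABLE_TYPES = {"uint", "uint256", "int", "bool", "address", "string"}
DELIMITERS = set("{}();,")

TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|==|!=|<=|>=|&&|\|\||[-+*/%=<>!]|[{}();,]")


def tag_words(source):
    """Return (word, attribute) pairs for one program statement."""
    tagged = []
    for word in TOKEN_RE.findall(source):
        if word in KEYWORDS:
            attr = "keyword"
        elif word in QUALIFIERS:
            attr = "qualifier"
        elif word in VARIABLE_TYPES:
            attr = "variabletype"
        elif word.isdigit():
            attr = "constant"
        elif word[0].isalpha() or word[0] == "_":
            attr = "identifier"
        elif word in DELIMITERS:
            attr = "delimiter"
        else:
            attr = "operator"
        tagged.append((word, attr))
    return tagged


tokens = tag_words("uint public balance = 10;")
```

The resulting attribute-annotated word sequence is the kind of input that the syntax analysis in step 1.3.2 would consume.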

5. The smart contract security detection method based on static analysis and deep learning according to claim 3, characterized in that the predefined grammar rules are as follows:

a)Contract::="contract"<identifier>"{"[contractBlock]"}";

b)ContractBlock::=[Function]|[Variable];

c)Function::="function"<identifier>"("[Variable]")"<qualifier>[keyword][Return][";"|Block];

d)Variable::=<variabletype><qualifier><identifier>["="Expression]";";

e)Expression::=Functioncall|<identifier>|Expression<operator>|Expression<operator>Expression|<identifier><operator><constant>;

f)Functioncall::=Expression"("Variable")";

g) Block::="{"Statement"}";

i)IfStatement::="if""("Expression")"Block["else"Block];

j)WhileStatment::="while""("Expression")"Block;

k)ForStatement::="for""("[Variable]";"[Expression]";"[Expression]")"Block;

l)Return::="return"[Expression].

6. The smart contract security detection method based on static analysis and deep learning according to claim 2, characterized in that step 1.4 specifically comprises the following steps:

Step 1.4.1: According to the Block nodes in the abstract syntax tree AST, use the program statements belonging to the statement control flow structure to construct different basic blocks (Block), record the statement number StmtId of each statement in each basic block, and record the incoming and outgoing edges of each basic block;

Step 1.4.2: Connect different basic blocks. When the outgoing edge of one basic block is equal to the incoming edge of another basic block, the two basic blocks can be connected. When the number of outgoing edges of one basic block is greater than 1, record the jump condition of this basic block;

Step 1.4.3: Use the static single assignment form to record the variable number VarId in the program statement and the assignment operation Assign, that is, a variable only performs an assignment operation once, and the variable name is modified for the variable that is assigned twice.
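The static single assignment idea in step 1.4.3 can be sketched for straight-line code as follows. This is a simplified illustration: the representation of a statement as a `(target, operands)` pair, the `name_version` naming scheme, and the function name `to_ssa` are assumptions of the sketch, and branches or loops (which need merge functions in full SSA) are out of scope.

```python
def to_ssa(assignments):
    """Rename straight-line assignments so that every variable is assigned exactly once.

    `assignments` is a list of (target, operands) pairs in program order.
    """
    version = {}
    result = []
    for target, operands in assignments:
        # Uses refer to the latest version; names never assigned (e.g. parameters) stay as-is.
        renamed = [f"{v}_{version[v]}" if v in version else v for v in operands]
        version[target] = version.get(target, 0) + 1
        result.append((f"{target}_{version[target]}", renamed))
    return result


# x = a; y = x + b; x = x + y  (operators omitted, only data flow is kept)
program = [("x", ["a"]), ("y", ["x", "b"]), ("x", ["x", "y"])]
ssa = to_ssa(program)
```

After renaming, the second assignment to `x` gets a fresh name, so each variable name identifies exactly one definition.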

7. The smart contract security detection method based on static analysis and deep learning according to claim 1, wherein the abstract fact contains all control flow information, data information and function information of the smart contract and is written in the Datalog language; the structure of the abstract fact is as follows:

predicate(arg1, ..., argn)

wherein predicate is the corresponding predicate name defined according to the solidity source program structure, including data type, function type, expression structure and control flow structure; arg1,...,argn are other parameters related to the content of the specific solidity program statement.

8. The smart contract security detection method based on static analysis and deep learning according to claim 1 or 7, wherein the method for extracting abstract facts from the graph structure of the solidity source program obtained in step 1 is: Traverse the graph structure of the solidity source program, and extract the abstract facts of the solidity source program according to keyword matching.
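A minimal sketch of extraction by keyword matching, assuming the graph structure is available as a nested dictionary with `type`, `args`, and `children` keys. That node layout, the node type names, and the predicate names in `PREDICATES` are all hypothetical; the facts are emitted in the Datalog style of a predicate name followed by its arguments.

```python
# Hypothetical mapping from node-type keywords to predicate names.
PREDICATES = {
    "FunctionDefinition": "function",
    "VariableDeclaration": "variable",
    "IfStatement": "if",
    "WhileStatement": "while",
    "FunctionCall": "call",
}


def extract_facts(node, facts=None):
    """Depth-first traversal that emits one fact per node whose type matches a keyword."""
    facts = [] if facts is None else facts
    predicate = PREDICATES.get(node.get("type"))
    if predicate is not None:
        args = ", ".join(str(a) for a in node.get("args", []))
        facts.append(f"{predicate}({args}).")
    for child in node.get("children", []):
        extract_facts(child, facts)
    return facts


tree = {"type": "ContractDefinition", "args": ["Bank"], "children": [
    {"type": "VariableDeclaration", "args": ["uint", "balance"]},
    {"type": "FunctionDefinition", "args": ["withdraw", "public"], "children": [
        {"type": "IfStatement", "args": ["stmt1"], "children": [
            {"type": "FunctionCall", "args": ["stmt2", "msg.sender.call"]},
        ]},
    ]},
]}
facts = extract_facts(tree)
```

Nodes whose type has no entry in the mapping, such as the contract node here, are traversed but produce no fact.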

9. The smart contract security detection method based on static analysis and deep learning according to claim 1, characterized in that step 3 specifically comprises the following steps:

Step 3.1: Build the input module: use the 0-1 encoding matrix X to represent the abstract facts obtained in step 2, and perform word embedding processing and position embedding processing, respectively, on the abstract facts represented by X; the matrix E, obtained by splicing the matrix produced by the word embedding processing with the matrix produced by the position embedding processing, is used as the input of the attention module;
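The input module can be sketched in plain Python as follows. The vocabulary, the embedding tables `W_word` and `W_pos`, and their sizes are illustrative assumptions; in particular, the claim does not say how the position embedding is produced, so a lookup table indexed by position is assumed here. Splicing is taken to mean concatenation along the feature axis.

```python
def one_hot(indices, vocab_size):
    """Build the 0-1 encoding matrix: one row per word, a single 1 per row."""
    return [[1 if j == i else 0 for j in range(vocab_size)] for i in indices]


def matmul(A, B):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


vocab = {"call": 0, "assign": 1, "if": 2}            # hypothetical vocabulary of fact words
words = ["if", "call", "assign"]
X = one_hot([vocab[w] for w in words], len(vocab))   # 0-1 encoding matrix X

W_word = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]        # illustrative word-embedding table (3 x 2)
W_pos = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]         # illustrative position table (3 x 2)

word_emb = matmul(X, W_word)                         # multiplying by X selects one row per word
pos_emb = [W_pos[p] for p in range(len(words))]      # one row per position in the sequence
E = [w + p for w, p in zip(word_emb, pos_emb)]       # list concatenation splices the two matrices
```

Each row of `E` therefore has the word features followed by the position features, giving a 3 x 4 matrix in this example.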

Step 3.2: Build the attention module, which includes the following steps:

Step 3.2.1: Obtain the Q matrix, K matrix and V matrix of the abstract fact from the E matrix through three linear transformations, and obtain the attention coefficient matrix A of the abstract fact according to formula (4);

A = QK^T (4)

wherein the Q matrix is the Query matrix of the abstract fact, which is composed of the query vector corresponding to each word of each abstract fact; the K matrix is the Key matrix of the abstract fact, which is composed of the key vector corresponding to each word of each abstract fact; the V matrix is the Value matrix of the abstract fact, which is composed of the value vector corresponding to each word of each abstract fact;

Step 3.2.2: According to the attention coefficient matrix A of the abstract fact, update the element values in the V matrix according to formula (5) to obtain the updated V matrix V';

wherein d_k represents the arithmetic sum of squares of the K matrix; the softmax function is the activation function;

Step 3.2.3: Add a layer normalization mechanism to the matrix V' of the attention module to make the elements in the matrix V' more standardized to speed up the convergence and ensure the stability of the feature distribution;
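The attention module can be sketched in plain Python under explicit assumptions. Formula (4) is taken from the claim. Formula (5) is not reproduced in the text, so the sketch assumes the standard scaled dot-product form V' = softmax(A / sqrt(d_k))·V, with d_k taken to be the length of a key vector; the layer normalization is assumed to be the usual per-row standardization without learned scale or shift. The weight matrices and input values are illustrative.

```python
import math


def matmul(A, B):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(M):
    return [list(col) for col in zip(*M)]


def softmax(row):
    """Numerically stable softmax over one row."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def layer_norm(row, eps=1e-5):
    """Standardize one row to zero mean and (approximately) unit variance."""
    mean = sum(row) / len(row)
    var = sum((v - mean) ** 2 for v in row) / len(row)
    return [(v - mean) / math.sqrt(var + eps) for v in row]


def attention(E, W_q, W_k, W_v):
    Q, K, V = matmul(E, W_q), matmul(E, W_k), matmul(E, W_v)  # three linear transformations
    A = matmul(Q, transpose(K))                               # formula (4): A = QK^T
    d_k = len(K[0])                                           # assumed: key vector length
    weights = [softmax([a / math.sqrt(d_k) for a in row]) for row in A]
    V1 = matmul(weights, V)                                   # assumed form of formula (5): V'
    return [layer_norm(row) for row in V1]                    # normalized output V''


E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # illustrative input, one row per word
I2 = [[1.0, 0.0], [0.0, 1.0]]              # identity used as query and key weights
W_v = [[1.0, 2.0], [3.0, 5.0]]             # illustrative value weights
V2 = attention(E, I2, I2, W_v)
```

Each output row is a normalized, attention-weighted mixture of the value vectors, so every word's representation reflects the other words of the abstract fact.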

Step 3.3: Build the residual connection module. The matrix calculation formula of the residual connection module is as follows:

Z=H(E)=E+F(E)=E+V″ (9)

wherein the matrix E is the input of the attention module; V″ is the output of the attention module; Z is the output of the residual connection module; F is the residual function; in the attention module, a mapping H(E)→Z is obtained after back propagation, and if there is no residual connection module, then F(E)→0;

Step 3.4: Build an output module to output the possible vulnerability probability of abstract facts. The specific steps for building an output module are as follows:

Step 3.4.1: Define the vulnerability category output formula shown in formula (10), which is used to output the abstract fact vulnerability category result of the smart contract;

P_k = softmax(Linear(Z)) (10)

wherein Linear represents a linear function, which performs a linear transformation on the matrix Z; P_k is the probability value of different vulnerability types;

Step 3.4.2: Build the loss function of the deep learning model, so that the model has the ability to classify vulnerabilities.

10. The smart contract security detection method based on static analysis and deep learning according to claim 9, wherein the loss function is a multi-category cross-entropy loss function shown in formula (11):

Loss₁ = -∑_k y_k log(P_k) (11)

wherein y_k represents the one-hot encoded label corresponding to the abstract fact, and k represents the vulnerability category corresponding to the abstract fact.