CN115758388A

CN115758388A - Smart contract vulnerability detection method based on low-dimensional bytecode features

Info

Publication number: CN115758388A
Application number: CN202211540037.4A
Authority: CN
Inventors: 王昊龙; 谢学说; 简兆龙; 李涛; 任奎
Original assignee: Zhejiang University ZJU; Nankai University
Current assignee: Zhejiang University ZJU; Nankai University
Priority date: 2022-11-30
Filing date: 2022-11-30
Publication date: 2023-03-07

Abstract

The invention provides a smart contract vulnerability detection method based on low-dimensional bytecode features, comprising the following steps: constructing an opcode sequence control flow graph from Ethereum smart contract bytecode; recording and numbering each basic block in the opcode sequence control flow graph, traversing the edges associated with each basic block, and building an adjacency matrix; based on the adjacency matrix, obtaining the features of the opcode sequence control flow graph, namely the number of edges, the number of nodes, the maximum out-degree, and the maximum in-degree; obtaining the vulnerability categories and the opcode categories that may be related to them, and taking the proportion of each opcode category in the total number of opcodes in the opcode sequence as features; normalizing the features, then training a machine learning model with the normalized features as input and the vulnerability category as output; and detecting vulnerabilities with the trained machine learning model. The invention effectively improves the detection efficiency of the machine learning model and the ability to analyze and interpret problem contracts.

Description

Smart contract vulnerability detection method based on low-dimensional bytecode features

Technical Field

The invention belongs to the technical field of smart contract vulnerability detection, and in particular relates to a smart contract vulnerability detection method based on low-dimensional bytecode features.

Background

A smart contract is program code that runs on a blockchain. In a zero-trust environment it executes automatically and correctly and updates state once predefined conditions are met, and it offers decentralization, tamper resistance, transparency, and traceability. With the introduction of the smart contract programming language Solidity and the Turing-complete execution engine EVM into the Ethereum blockchain, blockchain applications have expanded from simple cryptocurrency transactions to secure sharing of data resources and trusted coordination of computing tasks in complex scenarios such as edge intelligence, federated learning, and swarm learning. Smart contracts manage and operate on sensitive information such as digital assets, cryptographic tokens, and model parameters or gradients. Once a contract is deployed on chain, its code can be accessed and analyzed by blockchain nodes, vulnerable or faulty contract code is difficult to modify or update, and erroneous transactions cannot be revoked once they have been confirmed by consensus, which leads to serious security problems such as loss of digital assets and leakage of sensitive data. Efficient and rapid security detection of smart contracts at scale is therefore needed to improve the runtime security of on-chain contracts and to reduce security risks.

Existing smart contract vulnerability detection methods fall into traditional methods and machine learning methods. Traditional methods include formal verification, symbolic execution, fuzz testing, and intermediate representation methods. Formal verification uses mathematical methods to prove that a system is free of vulnerabilities. Symbolic execution traverses all executable paths in a program to detect vulnerabilities. Fuzz testing feeds automatically or semi-automatically generated random data into the system and identifies vulnerabilities from the system's responses. Intermediate representation methods typically convert smart contract source code or bytecode into a dedicated intermediate representation to detect specific types of vulnerability. Machine learning methods include graph neural networks, random forests, gradient boosting, and the like. Graph neural network methods build a graph from the program source code to detect vulnerabilities. Other machine learning methods analyze opcode sequence features directly.

Hildenbrandt et al. propose KEVM, based on formal verification, which provides an executable formal specification of the EVM language using the K framework. Luu et al. propose Oyente, based on symbolic execution, which uses the contract control flow graph to traverse smart contract execution paths and detect vulnerabilities. Jiang et al. propose ContractFuzzer, based on fuzz testing, which detects vulnerabilities by generating test cases and analyzing smart contract behavior logs. Albert et al. propose EthIR, based on an intermediate representation, which analyzes security properties of bytecode by converting Oyente's control flow graph into a rule-based representation (RBR). Tann et al. propose SafeSC, which uses a long short-term memory (LSTM) model to analyze Ethereum opcode sequences and can detect contracts that freeze assets indefinitely, contracts that easily leak assets to arbitrary accounts, and contracts that can be destroyed by anyone. Qian et al. propose ReChecker, which converts smart contract source code into contract snippets and detects reentrancy vulnerabilities with a bidirectional long short-term memory (BLSTM) network and an attention mechanism, making use of captured semantic and control flow information. Zhuang et al. propose DR-GCN and TMP: DR-GCN converts smart contract source code into a contract graph and builds a vulnerability detection model with a graph convolutional network, while TMP builds the detection model with a temporal graph neural network based on the temporal information in the contract graph. Wang et al. propose ContractWard, which treats smart contract opcodes as a text sequence, extracts bigrams from them with N-gram tokenization, and constructs features with thousands of dimensions to train five machine learning models for contract vulnerability detection.

The prior art has the following disadvantages:

(1) Traditional methods have low detection efficiency

Traditional vulnerability detection methods, mainly formal verification, symbolic execution, and fuzz testing, focus on analyzing the vulnerability type or root cause of contracts already known to be problematic. Most of them must traverse code execution paths, so on large test sets they suffer from path explosion in symbolic execution, heavy computation, and a large search space. Detection is slow and inefficient, and cannot meet the need for rapid analysis of smart contracts at scale.

(2) Machine learning methods have excessively high feature dimensions and lack interpretability

Existing machine learning methods mainly take the bytecode or source code sequence of an Ethereum smart contract as model input, treat it as a text sequence for feature analysis, and extract sequence features with N-gram tokenization for model training and vulnerability analysis. Because the bytecode or source code sequences of Ethereum smart contracts are long, the feature dimension of the model is too high, and the efficiency of machine learning methods is not much better than that of traditional methods. Moreover, the text features are extracted simply and directly from the bytecode or source code sequence, so the features lack interpretability, and the model results lack interpretability accordingly.

Disclosure of Invention

To address the above technical problems in the prior art, the invention provides a smart contract vulnerability detection method based on low-dimensional bytecode features. First, the method improves vulnerability detection efficiency for smart contracts at scale by reducing the feature dimension. Second, it improves the interpretability of the detection model and the detection results by combining adjacency matrix features of the contract bytecode with feature contribution analysis. It thereby effectively improves the detection efficiency of the machine learning model and the ability to analyze and interpret problem contracts.

The technical solution adopted by the invention is a smart contract vulnerability detection method based on low-dimensional bytecode features, comprising the following steps:

Step 1: constructing an opcode sequence control flow graph from Ethereum smart contract bytecode;

Step 2: recording and numbering each basic block in the opcode sequence control flow graph, traversing the edges associated with each basic block, and building an adjacency matrix;

Step 3: based on the adjacency matrix, obtaining the features of the opcode sequence control flow graph, namely the number of edges, the number of nodes, the maximum out-degree, and the maximum in-degree;

obtaining the vulnerability categories and the opcode categories that may be related to them, and taking the proportion of each opcode category in the total number of opcodes in the opcode sequence as features;

Step 4: normalizing the features of step 3, and then training a machine learning model with the normalized features as input and the vulnerability category as output;

Step 5: detecting vulnerabilities with the trained machine learning model.

Further, in step 1, the Ethereum smart contract bytecode comprises runtime code and metadata; the runtime code is separated from the bytecode, and the bytecode of the runtime code is converted into opcodes according to the bytecode-to-opcode mapping defined by the Ethereum blockchain, yielding an opcode sequence; an opcode sequence control flow graph is constructed from the opcode sequence, wherein a basic block of the graph is the opcode sequence from one jump destination to the next jump instruction, and an edge of the graph is the jump relation between a jump instruction and its destination address.

Further, in step 1, the number of opcodes of each category in the opcode sequence is counted and the proportion of each category is calculated.

Further, in step 2, if the element in row i and column j of the adjacency matrix is 1, there is an edge between the i-th basic block and the j-th basic block, directed from the i-th basic block to the j-th basic block; if the element is 0, there is no edge from the i-th basic block to the j-th basic block.

Further, in step 3, the number of edges: the sum of all elements of the adjacency matrix is the number of edges of the opcode control flow graph;

the number of nodes: the number of rows (or columns) of the adjacency matrix is the number of nodes of the opcode control flow graph;

the maximum out-degree: the sum of the elements in row i of the adjacency matrix is the out-degree of the i-th basic block, and the maximum out-degree is the largest of these row sums;

the maximum in-degree: the sum of the elements in column i of the adjacency matrix is the in-degree of the i-th basic block, and the maximum in-degree is the largest of these column sums.

Further, in step 4, the machine learning model is one of eXtreme Gradient Boosting (XGBoost), K-Nearest Neighbors, Logistic Regression, Decision Tree, Random Forest, Naive Bayes, and Long Short-Term Memory.

Further, the machine learning model is a Decision Tree.

Further, in step 4, the features are normalized as follows: the extreme values of each feature dimension in the data set are found, the data of each dimension is scaled proportionally to a decimal between 0 and 1, and the decimal is then expanded to an integer within 1000.

Further, the method comprises step 6: based on the trained machine learning model, calculating the contribution of each feature to each vulnerability category using the feature_importances_ method, the permutation_importance() method, or a model.

Further, in step 3, the vulnerability categories are the integer overflow vulnerability, the integer underflow vulnerability, the call stack depth attack vulnerability, the transaction ordering dependence vulnerability, the timestamp dependence vulnerability, and the reentrancy vulnerability; the opcode categories are unary opcodes, binary opcodes, block information opcodes, control flow opcodes, environment opcodes, system opcodes, stack opcodes, and invalid opcodes;

in step 6, the integer overflow vulnerability, the integer underflow vulnerability, the call stack depth attack vulnerability, the transaction ordering dependence vulnerability, the timestamp dependence vulnerability, and the reentrancy vulnerability are all related to the number of edges, the number of nodes, the maximum out-degree, and the maximum in-degree; the integer overflow vulnerability is also related to the unary opcode proportion and the binary opcode proportion; the integer underflow vulnerability is also related to the unary opcode proportion and the binary opcode proportion; the call stack depth attack vulnerability is also related to the system opcode proportion and the stack opcode proportion; the transaction ordering dependence vulnerability is also related to the environment opcode proportion; the timestamp dependence vulnerability is also related to the block information opcode proportion; and the reentrancy vulnerability is also related to the control flow opcode proportion.

Compared with the prior art, the invention has the following beneficial effects. The invention builds an adjacency matrix from Ethereum smart contract bytecode and extracts scalable, lower-dimensional features for efficient vulnerability detection of smart contracts at scale. Compared with existing machine learning detection methods that are based directly on semantics or on graph neural networks, the method reduces the feature dimension by a factor of hundreds and shortens detection latency by a factor of 25 to 650. In addition to improving detection efficiency, feature normalization allows more machine learning algorithms to be trained and used for detection, and model interpretability is improved by means of feature contributions and adjacency matrix features, without degrading detection performance. The low-dimensional scalable features provided by the invention are both efficient and effective.

Drawings

FIG. 1 is a flow chart of an embodiment of the present invention;

FIG. 2 is a diagram of an embodiment of an adjacency matrix.

Detailed Description

In order to make the technical solutions of the present invention better understood, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.

The embodiment of the invention provides a smart contract vulnerability detection method based on low-dimensional bytecode features which, as shown in FIG. 1, comprises the following steps:

Step 1: An opcode sequence control flow graph is constructed from Ethereum smart contract bytecode, as follows:

Step 1.1: The Ethereum smart contract bytecode comprises runtime code and metadata. The metadata is removed from the bytecode, leaving the runtime code. The metadata is located and removed according to the metadata header obtained by parsing. Some metadata headers are as follows:

Solc 0.4.17: 0xa1 0x65 'b' 'z' 'z' 'r' '0' 0x58 0x20 <32 bytes swarm hash> 0x00 0x29;

Solc 0.5.9: 0xa2 0x65 'b' 'z' 'z' 'r' '0' 0x58 0x20 <32 bytes swarm hash> 0x64 's' 'o' 'l' 'c' 0x43 <3 bytes version encoding> 0x00 0x32;

Solc 0.6.0: 0xa2 0x64 'i' 'p' 'f' 's' 0x58 0x22 <34 bytes IPFS hash> 0x64 's' 'o' 'l' 'c' 0x43 <3 bytes version encoding> 0x00 0x32;

ABIEncoderV2: 0xa2 0x65 'b' 'z' 'z' 'r' '0' 0x58 0x20 <32 bytes swarm hash> 0x6c 'e' 'x' 'p' 'e' 'r' 'i' 'm' 'e' 'n' 't' 'a' 'l' 0xf5 0x00 0x37.
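The metadata removal of step 1.1 can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes that the last two bytes of the bytecode give the length of the metadata block (as the trailing `0x00 0x29`, `0x00 0x32`, and `0x00 0x37` in the headers above suggest), and the function name `strip_metadata` and the marker list are hypothetical.

```python
# Markers taken from the headers listed above; other compiler versions
# may use different ones (assumption: this list is not exhaustive).
MARKERS = (b"bzzr0", b"ipfs")


def strip_metadata(bytecode: bytes) -> bytes:
    """Return the runtime code with a trailing metadata block removed.

    Assumes the final two bytes hold the big-endian length of the
    metadata block that precedes them.
    """
    if len(bytecode) < 2:
        return bytecode
    meta_len = int.from_bytes(bytecode[-2:], "big")
    start = len(bytecode) - 2 - meta_len
    if start < 0:
        return bytecode
    header = bytecode[start:start + 8]
    # Only strip when the candidate region starts like a known header.
    looks_like_metadata = header[:1] in (b"\xa1", b"\xa2") and any(
        marker in header for marker in MARKERS
    )
    return bytecode[:start] if looks_like_metadata else bytecode


# Usage: runtime code followed by a Solc 0.4.17 style metadata block.
runtime = bytes.fromhex("6000600055")
metadata = b"\xa1\x65bzzr0\x58\x20" + bytes(32) + b"\x00\x29"
stripped = strip_metadata(runtime + metadata)
```

Checking the header before stripping avoids cutting into code when the last two bytes merely happen to look like a length.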

step 1.2: and converting the byte codes of the runtime codes into the operation codes according to the conversion rules of the byte codes and the operation codes defined in the block chain of the EtherFang. For example, the bytecode 0x00 conversion opcode STOP; converting byte code 0x01 into operation code ADD; bytecode 0x03 is converted into opcode SUB, etc. For these non-PUSH operation codes, the conversion is performed directly according to the above-mentioned definition rules. For PUSH operation codes, such as PUSH1-PUSH32, the parameters of these operation codes are first obtained and then the byte code conversion is performed. After the conversion of the byte codes and the operation codes is finished, the number of the operation codes in the operation code sequence is counted, and the proportion of the operation codes in different classes is calculated.

Step 1.3: and constructing an operation code sequence control flow graph according to the operation code sequence, wherein main graph information in the operation code sequence control flow graph is divided into edges and basic blocks. The basic block is an operation code sequence from one jump address to the next jump instruction, and the edge is the jump relation between the jump instruction and the jump destination address.

And dividing the operation code sequence into a plurality of basic blocks according to the JUMP type operation codes, and constructing edges among the basic blocks. The basic block has the following parameters:

Type: COMMON, DISPATCHER, FALLBACK, EXIT;

Start: JUMPDEST;

End: JUMP, JUMPI, STOP, REVERT, RETURN, INVALID, SELFDESTRUCT;

Label: the offset of the first opcode in the basic block.

The edges between basic blocks fall into four cases:

JUMP immediately preceded by PUSH: the operand of the PUSH opcode is the offset of the JUMP target address, and the corresponding edge is added to the control flow graph. That is, a basic block ending with JUMP produces one edge starting from that block.

JUMPI immediately preceded by PUSH: JUMPI is a conditional jump. If the condition is true, the edge to the destination offset given by the PUSH operand is added to the control flow graph; if the condition is false, the edge to the opcode following JUMPI is added. That is, a basic block ending with JUMPI produces two edges starting from that block.

JUMP not immediately preceded by PUSH: the destination address of the JUMP is computed by a symbolic stack execution algorithm. For this kind of JUMP, the algorithm traverses the preceding opcode sequence and builds a symbolic execution stack for the opcodes in the basic block. Within the symbolic execution stack, only the opcodes of the PUSH, DUP, and SWAP families and the AND and POP opcodes are modeled. Every other opcode only pops or pushes "unknown" elements on the symbolic stack, since it is not related to the jump address.

REVERT, SELFDESTRUCT, RETURN, INVALID, STOP: these opcodes terminate the control flow, so a basic block ending with one of them has no successor basic block.
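The block splitting and the two PUSH-preceded edge cases can be sketched as below. This is a simplified illustration under assumptions: instructions are given as `(offset, name, operand_hex)` tuples (a layout chosen here), the symbolic stack resolution for computed jumps is not implemented, so such jumps simply produce no edge, and a fall-through edge into a following JUMPDEST is added as an assumption that the text above does not spell out. The name `build_cfg` is hypothetical.

```python
TERMINATORS = {"JUMP", "JUMPI", "STOP", "REVERT", "RETURN", "INVALID", "SELFDESTRUCT"}


def build_cfg(instructions):
    """Return (block_starts, edges) for a list of (offset, name, operand_hex)."""
    # Split into basic blocks: a block starts at JUMPDEST and ends at a terminator.
    blocks, current = [], []
    for ins in instructions:
        if ins[1] == "JUMPDEST" and current:
            blocks.append(current)
            current = []
        current.append(ins)
        if ins[1] in TERMINATORS:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)

    starts = [block[0][0] for block in blocks]
    edges = set()
    for i, block in enumerate(blocks):
        last = block[-1][1]
        next_start = starts[i + 1] if i + 1 < len(blocks) else None
        if last in ("JUMP", "JUMPI"):
            prev = block[-2] if len(block) >= 2 else None
            # Jump target known only when a PUSH immediately precedes the jump.
            if prev and prev[1].startswith("PUSH") and prev[2]:
                target = int(prev[2], 16)
                if target in starts:
                    edges.add((starts[i], target))
            # JUMPI also continues with the next opcode when the condition is false.
            if last == "JUMPI" and next_start is not None:
                edges.add((starts[i], next_start))
        elif last not in TERMINATORS and next_start is not None:
            # Assumption: a block cut short by a JUMPDEST falls through to it.
            edges.add((starts[i], next_start))
    return starts, sorted(edges)


# Usage: a conditional jump to offset 8, otherwise an unconditional jump to 10.
program = [
    (0, "PUSH1", "01"), (2, "PUSH1", "08"), (4, "JUMPI", None),
    (5, "PUSH1", "0a"), (7, "JUMP", None),
    (8, "JUMPDEST", None), (9, "STOP", None),
    (10, "JUMPDEST", None), (11, "STOP", None),
]
block_starts, block_edges = build_cfg(program)
```

Blocks are identified by the offset of their first opcode, which matches the "Label" parameter listed above.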

And 2, step: recording and numbering each basic block in the control flow graph of the operation code sequence, and then traversing the associated edge of each basic block to establish an adjacency matrix. As shown in fig. 2, this figure shows an adjacency matrix composed of n basic blocks, where if the element in the ith row and the jth column in the adjacency matrix is 1, it indicates that there is an edge between the ith basic block and the jth basic block, and the direction of the edge is from the ith basic block to the jth basic block; if the element is 0, no edge is connected between the two basic blocks. The elements have two values of 0 or 1.
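Building the adjacency matrix of step 2 from numbered basic blocks can be sketched as follows. The inputs are assumptions made for illustration: blocks are identified by the offset of their first opcode, and edges are `(source_offset, target_offset)` pairs. The function name `adjacency_matrix` is hypothetical.

```python
def adjacency_matrix(block_starts, edges):
    """Number the blocks in order and return an n x n 0/1 matrix.

    matrix[i][j] == 1 means there is an edge from block i to block j.
    """
    index = {start: i for i, start in enumerate(block_starts)}
    n = len(block_starts)
    matrix = [[0] * n for _ in range(n)]
    for src, dst in edges:
        matrix[index[src]][index[dst]] = 1
    return matrix


# Usage: four blocks starting at offsets 0, 5, 8, and 10.
matrix = adjacency_matrix([0, 5, 8, 10], [(0, 5), (0, 8), (5, 10)])
```

Because the graph is directed, the matrix is generally not symmetric: the edge from block 0 to block 1 sets `matrix[0][1]` but leaves `matrix[1][0]` at 0.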

And step 3: and acquiring the characteristics of the control flow graph of the operation code sequence based on the adjacency matrix, wherein the characteristics are the edge number, the node number, the maximum out degree and the maximum in degree. The four-dimensional characteristic is graph information of an operation code sequence control flow graph, and represents the complexity of the operation code sequence control flow graph, namely the complexity of intelligent contract jumping.

The number of edges: in the adjacency matrix, the sum of elements in the adjacency matrix represents the number of edges of the opcode control flow graph;

maximum output: in the adjacency matrix, the sum of the elements of the ith row represents the degree of the ith basic block;

maximum in degree: in the adjacency matrix, the sum of the elements in the ith column represents the maximum in-degree of the ith basic block.
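The four graph features follow directly from row and column sums of the adjacency matrix. A minimal sketch, with the function name `graph_features` and the dictionary keys chosen here for illustration:

```python
def graph_features(matrix):
    """Compute the four control flow graph features from a 0/1 adjacency matrix."""
    out_degrees = [sum(row) for row in matrix]          # row sums
    in_degrees = [sum(col) for col in zip(*matrix)]     # column sums
    return {
        "edges": sum(out_degrees),
        "nodes": len(matrix),
        "max_out_degree": max(out_degrees, default=0),
        "max_in_degree": max(in_degrees, default=0),
    }


# Usage with a four-block graph: 0 -> 1, 0 -> 2, 1 -> 3.
features = graph_features([
    [0, 1, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
])
```

Here block 0 has out-degree 2 (it ends in a conditional jump), and no block has more than one incoming edge.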

The vulnerability categories and the opcode categories that may be related to them are then obtained, and the proportion of each opcode category in the total number of opcodes in the opcode sequence is taken as a feature; these proportions have already been computed in step 1.2. In this embodiment, six vulnerability categories are selected as detection targets and eight opcode categories are selected. The vulnerability categories are the integer overflow vulnerability, the integer underflow vulnerability, the call stack depth attack vulnerability, the transaction ordering dependence vulnerability, the timestamp dependence vulnerability, and the reentrancy vulnerability. The opcode categories are unary, binary, block information, control flow, environment, system, stack, and invalid opcodes. The type and number of opcode categories are chosen according to the vulnerability categories, so the set of opcode categories is scalable.

The scalable eight-dimensional opcode proportion features of this embodiment, derived from the vulnerability category analysis, are the unary opcode proportion, the binary opcode proportion, the block information opcode proportion, the control flow opcode proportion, the environment opcode proportion, the system opcode proportion, the stack opcode proportion, and the invalid opcode proportion.

Unary opcode proportion: to extract this feature, a function COUNT(opcodes) is defined to count the opcodes of each category. The total number of opcodes is COUNT(total opcodes) and the number of unary opcodes is COUNT(una), so the unary opcode proportion is COUNT(una)/COUNT(total opcodes).

Binary opcode proportion: the number of binary opcodes is COUNT(bin), so the binary opcode proportion is COUNT(bin)/COUNT(total opcodes).

System opcode proportion: the number of system opcodes is COUNT(sys), so the system opcode proportion is COUNT(sys)/COUNT(total opcodes).

Stack opcode proportion: the number of stack opcodes is COUNT(sta), so the stack opcode proportion is COUNT(sta)/COUNT(total opcodes).

Environment opcode proportion: the number of environment opcodes is COUNT(env), so the environment opcode proportion is COUNT(env)/COUNT(total opcodes).

Block opcode proportion: the number of block opcodes is COUNT(blob), so the block opcode proportion is COUNT(blob)/COUNT(total opcodes).

Control flow opcode proportion: the number of control flow opcodes is COUNT(con), so the control flow opcode proportion is COUNT(con)/COUNT(total opcodes).

Invalid opcode proportion: the last category, invalid opcodes, comprises all opcodes not covered by the categories above. The number of invalid opcodes is COUNT(inv), so the invalid opcode proportion is COUNT(inv)/COUNT(total opcodes).
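The eight proportions can be computed in one pass over the opcode sequence. The sketch below is illustrative only: the patent does not list which opcode belongs to which category, so the `CATEGORY_OF` table is an assumed, partial assignment, and the names `categorize` and `opcode_proportions` are hypothetical.

```python
from collections import Counter

CATEGORIES = ("unary", "binary", "block", "control_flow",
              "environment", "system", "stack", "invalid")

# Assumed, partial category assignment (not taken from the patent).
CATEGORY_OF = {
    "ISZERO": "unary", "NOT": "unary",
    "ADD": "binary", "SUB": "binary", "MUL": "binary", "DIV": "binary", "EXP": "binary",
    "BLOCKHASH": "block", "TIMESTAMP": "block", "NUMBER": "block", "COINBASE": "block",
    "JUMP": "control_flow", "JUMPI": "control_flow", "JUMPDEST": "control_flow",
    "CALLER": "environment", "CALLVALUE": "environment", "ORIGIN": "environment",
    "GASPRICE": "environment",
    "CALL": "system", "DELEGATECALL": "system", "CREATE": "system",
    "RETURN": "system", "REVERT": "system", "SELFDESTRUCT": "system",
}


def categorize(name: str) -> str:
    """Map an opcode name to one of the eight categories."""
    if name.startswith(("PUSH", "DUP", "SWAP")) or name == "POP":
        return "stack"
    # Everything not covered above counts as "invalid", as in the text.
    return CATEGORY_OF.get(name, "invalid")


def opcode_proportions(opcode_names):
    """Return COUNT(category) / COUNT(total opcodes) for every category."""
    counts = Counter(categorize(name) for name in opcode_names)
    total = len(opcode_names)
    return {c: (counts[c] / total if total else 0.0) for c in CATEGORIES}


# Usage on a four-opcode sequence.
ratios = opcode_proportions(["PUSH1", "PUSH1", "ADD", "TIMESTAMP"])
```

Because "invalid" absorbs every opcode outside the other seven categories, the eight proportions always sum to 1 for a non-empty sequence.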

In summary, this embodiment uses 12 feature dimensions and 6 vulnerability categories.

Step 4: The 12-dimensional features are normalized, and the machine learning model is then trained with the normalized 12-dimensional features as input and the 6 vulnerability categories as output.

The extracted scalable 12-dimensional features are normalized so that they suit the current mainstream machine learning models. First, because of how the features are extracted, the first four dimensions (the features of the opcode sequence control flow graph) are large integers, while the remaining eight dimensions are decimals between 0 and 1. If the twelve dimensions were fed directly into a machine learning model, some models (such as KNN) would attend only to the first four dimensions during training and ignore the remaining eight. In addition, some machine learning models place requirements on the input data format; for example, LSTM requires integer input. The features are therefore normalized: the extreme values of each feature dimension in the data set are found, the data of each dimension is scaled proportionally to a decimal between 0 and 1, and the decimal is then expanded to an integer within 1000. The normalization formula is as follows:

where x(n, f) denotes the feature value in the n-th row and f-th column of the feature space.
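The normalization formula itself is not reproduced in this text, so the following is only a sketch consistent with the description: it assumes min-max scaling per feature column followed by multiplication by 1000 and rounding to an integer. The exact formula used by the invention may differ (for example, it may divide by the column maximum only), and the function name `normalize_features` is hypothetical.

```python
def normalize_features(rows, scale=1000):
    """Scale every feature column to an integer in [0, scale].

    Assumption: min-max scaling per column, then rounding; rows[n][f]
    plays the role of x(n, f).
    """
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    normalized = []
    for row in rows:
        normalized.append([
            int(round((value - lo) / (hi - lo) * scale)) if hi > lo else 0
            for value, lo, hi in zip(row, lows, highs)
        ])
    return normalized


# Usage: one large-integer column (e.g. edge count) and one ratio column.
scaled = normalize_features([
    [10, 0.25],
    [30, 0.75],
    [20, 0.50],
])
```

After scaling, both columns share the same integer range, so a distance-based model such as KNN no longer sees the graph features dominate the ratio features.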

The mainstream machine learning models are eXtreme Gradient Boosting (XGBoost), K-Nearest Neighbors (KNN), Logistic Regression, Decision Tree, Random Forest, Naive Bayes, and Long Short-Term Memory (LSTM) networks.

Step 5: The trained machine learning model is used to detect vulnerabilities.

Compared with the thousands of feature dimensions extracted by the existing N-gram approach, the features of this embodiment shorten the vulnerability detection latency of the XGBoost model from 15 ms to 0.2 ms, a 75-fold improvement, without reducing accuracy, precision, F1 score, or recall. For the decision tree model, again without reducing any metric, the detection latency is shortened from 2.6 ms to 0.004 ms, a 650-fold improvement and the largest among all models. By comparison, the traditional detection tool Oyente has a detection latency of 7.89 s and the graph neural network based detection method has a detection latency of 1.9 s, so the detection latency of the present method is substantially lower than that of both.
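The improvement factors quoted above can be checked with simple arithmetic. The snippet below only restates the reported latencies in microseconds (to keep the division exact); the function name `speedup` is chosen here for illustration.

```python
def speedup(baseline_us: int, new_us: int) -> float:
    """Ratio of baseline latency to new latency (both in microseconds)."""
    return baseline_us / new_us


# Reported latencies converted to microseconds.
xgboost_gain = speedup(15_000, 200)      # 15 ms  -> 0.2 ms
decision_tree_gain = speedup(2_600, 4)   # 2.6 ms -> 0.004 ms
```

These give factors of 75 and 650 respectively, matching the figures stated for the XGBoost and decision tree models.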

Step 6: and according to the trained machine learning model, calculating the contribution degree of each type of feature of each type of vulnerability by adopting a feature _ importances _ method, a persistence _ import () method or a model. For the same machine learning model, when vulnerability type prediction is carried out on an abnormal contract, if the contribution degree of a certain one-dimensional feature is higher than the mean value of the contribution degrees of the features, the abnormal contract is very likely to have the vulnerability corresponding to the feature knowledge.

And calculating that the integer positive overflow vulnerability, the integer negative overflow vulnerability, the stack call depth attack vulnerability, the transaction execution sequence dependency vulnerability, the timestamp dependency vulnerability and the reentrant vulnerability are all associated with the number of edges, the number of nodes, the maximum out degree and the maximum in degree.

Further, unary opcode ratios and binary opcode ratios are associated with integer positive and negative overflow holes. The cause of the contract integer positive and negative overflow hole is that some variables exceed the word length specified in definition or are smaller than the minimum value during calculation and are closely related to the unary operation code and the binary operation code, so that the arithmetic operation code is divided into the unary operation code and the binary operation code, and the proportion of the operation code is calculated as the characteristic for detecting the hole.

The system opcode scale and the stack opcode scale are associated with a stack call deep attack vulnerability. The cause of the contract stack call depth attack vulnerability is that when the call stack depth exceeds a threshold value when a caller calls other contracts, an instruction does not throw an exception but returns false, so that the caller cannot sense call failure, operation codes related to calling between contracts are related to the vulnerability and can be classified into system operation codes and stack operation codes, and the proportion of the system operation codes and the proportion of the stack operation codes are calculated to be used as characteristics for detecting the vulnerability.

The environment opcode proportion is associated with the transaction execution order dependency vulnerability. This vulnerability arises because the transactions in the Ethereum transaction pool are ordered and executed according to their transaction fee (gas), and miners can determine the values of the internal variables of an intelligent contract through the transaction packaging order once they obtain the information of the transactions to be executed. Such vulnerabilities are associated with the operation codes used to discover transaction information, which are classified as environment operation codes, and the environment opcode proportion is calculated as the feature for detecting the vulnerability.

The block opcode proportion is associated with the timestamp dependency vulnerability. This vulnerability arises when block information in Ethereum, such as the block timestamp, block number and block hash value, is used as the judgment condition of a key operation in an intelligent contract or as the seed for generating random numbers. The operation codes related to this vulnerability mainly read block information and are classified as block operation codes; the block opcode proportion is calculated as the feature for detecting the vulnerability.

The control flow opcode proportion is associated with the reentrant vulnerability. This vulnerability stems from a special property of Solidity, the Ethereum programming language: the fallback mechanism unique to Solidity allows an attacker to re-enter the called function before execution of the program has finished, which leads to repeated calls and can cause huge economic loss. The vulnerability is closely associated with the operation codes that govern control flow, which are classified as control flow operation codes, and the control flow opcode proportion is calculated as the feature for detecting the vulnerability.
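The opcode proportion features described in the preceding paragraphs can be sketched as below. The eight category names follow the categories listed in the claims, but the assignment of individual EVM mnemonics to categories is an illustrative and incomplete assumption, not the classification table used by the invention; the function and variable names are likewise hypothetical.

```python
from collections import Counter

# Illustrative, incomplete mapping of EVM mnemonics to categories.
# ASSUMPTION: this is not the classification table of the invention.
OPCODE_CATEGORY = {
    "ISZERO": "unary", "NOT": "unary",
    "ADD": "binary", "SUB": "binary", "MUL": "binary", "DIV": "binary",
    "TIMESTAMP": "block", "NUMBER": "block", "BLOCKHASH": "block",
    "JUMP": "control_flow", "JUMPI": "control_flow", "JUMPDEST": "control_flow",
    "CALLER": "environment", "CALLVALUE": "environment", "ORIGIN": "environment",
    "CALL": "system", "DELEGATECALL": "system", "CREATE": "system",
    "PUSH1": "stack", "POP": "stack", "DUP1": "stack", "SWAP1": "stack",
    "INVALID": "invalid",
}
CATEGORIES = ("unary", "binary", "block", "control_flow",
              "environment", "system", "stack", "invalid")


def opcode_proportions(opcodes):
    """Return, per category, the share of opcodes in the sequence.

    Unmapped opcodes count toward the total but toward no category.
    """
    total = len(opcodes)
    counts = Counter(OPCODE_CATEGORY.get(op) for op in opcodes)
    return {c: (counts[c] / total if total else 0.0) for c in CATEGORIES}


# Hypothetical opcode sequence of a runtime code fragment.
sequence = ["PUSH1", "PUSH1", "TIMESTAMP", "ADD", "ISZERO", "JUMPI", "CALL", "POP"]
ratios = opcode_proportions(sequence)
print(ratios)
```

Dividing by the total number of opcodes, rather than by the number of classified opcodes, matches the description of each proportion as a share of the whole opcode sequence.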

In this embodiment, a rule-based method for constructing the adjacency matrix of contract bytecode is proposed: the metadata is separated, the runtime code is converted into an operation code sequence, and the operation code sequence is converted into adjacency matrix form according to specific operation codes. Then, 12-dimensional data features covering the detection of 6 contract vulnerabilities are extracted from the adjacency matrix, including 4-dimensional adjacency matrix features used for error diagnosis of problem contracts and 8-dimensional operation code proportion features used for identifying the 6 known contract vulnerabilities; the 12-dimensional data features can be extended for new vulnerabilities. Finally, 7 different machine learning algorithms are used to verify the detection efficiency and performance indexes of the 12-dimensional data features; optional feature normalization supports effective detection with more machine learning algorithms, and the feature contribution degrees are used for error diagnosis and analysis of problem contracts, which improves the interpretability of the features. Compared with the thousands of dimensions of data features extracted by the existing N-gram approach, with the accuracy, precision, F1 score and recall on the XGBoost model not reduced, the vulnerability detection delay is shortened from 15 ms to 0.2 ms, a 75-fold improvement; with all metrics on the decision tree model not reduced, the vulnerability detection delay is shortened from 2.6 ms to 0.004 ms, a 650-fold improvement, which is the largest improvement among all models.
In addition, for a conventional detection method, taking the Oyente vulnerability detection tool as an example, the detection delay is 7.89 s; for a detection method based on a graph neural network, the detection delay is 1.9 s. The detection delay of the present method is significantly lower than that of both.
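The optional feature normalization mentioned above (mapping each feature dimension into a decimal between 0 and 1 and then scaling it to an integer within 1000) might look like the following sketch. It assumes min-max scaling and rounding to the nearest integer; both choices, together with the function names and sample values, are assumptions made for illustration and are not stated by the source.

```python
def normalize_column(values, scale=1000):
    """Min-max scale one feature dimension to integers in [0, scale]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant dimension carries no information; map it to 0.
        return [0 for _ in values]
    return [round((v - lo) / (hi - lo) * scale) for v in values]


def normalize_features(rows, scale=1000):
    """Normalize each feature dimension (column) of a row-major data set."""
    columns = list(zip(*rows))
    normalized = [normalize_column(col, scale) for col in columns]
    return [list(row) for row in zip(*normalized)]


# Hypothetical data set: 3 contracts, 2 feature dimensions
# (number of edges, stack opcode proportion).
rows = [
    [10, 0.20],
    [30, 0.40],
    [50, 0.25],
]
scaled = normalize_features(rows)
print(scaled)
```

Scaling per dimension keeps features with very different ranges, such as edge counts and opcode proportions, on a comparable footing for distance-based models such as K-nearest neighbors.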

The present invention has been described in detail with reference to the embodiments, but the description is only illustrative of the present invention and should not be construed as limiting its scope. The scope of the invention is defined by the claims. The technical solutions of the present invention, technical solutions designed by those skilled in the art based on the teaching of the present invention, and all equivalent changes and modifications made within the scope of the present invention, or equivalent technical solutions designed to achieve the above technical effects, are also within the scope of the present invention.

Claims

1. A vulnerability detection method of an intelligent contract based on low-dimensional byte code features, characterized by comprising the following steps:

step 2: recording and numbering each basic block in an operation code sequence control flow graph, traversing edges associated with each basic block, and establishing an adjacency matrix;

step 3: based on the adjacency matrix, acquiring the characteristics of an operation code sequence control flow graph, namely the number of edges, the number of nodes, the maximum out degree and the maximum in degree;

2. The method of claim 1, wherein the vulnerability detection method comprises: in step 1, separating a runtime code from the Ethereum intelligent contract bytecode, and converting the bytecode of the runtime code into operation codes to obtain an operation code sequence; and constructing an operation code sequence control flow graph according to the operation code sequence, wherein a basic block of the operation code sequence control flow graph is the operation code sequence from one jump address to the next jump instruction, and an edge of the operation code sequence control flow graph is the jump relation between a jump instruction and its jump destination address.

3. The method of claim 1, wherein the vulnerability detection method comprises: in step 1, the number of various operation codes in the operation code sequence is counted and the proportion of the operation codes of different classes is calculated.

4. The method of claim 1, wherein the vulnerability detection method comprises: in step 2, if the element in the ith row and the jth column in the adjacency matrix is 1, it indicates that there is an edge between the ith basic block and the jth basic block, and the direction of the edge is from the ith basic block to the jth basic block; if the element is 0, no edge is connected between the two basic blocks.

5. The method of claim 4, wherein the vulnerability detection method comprises: in step 3, the number of edges: in the adjacency matrix, the sum of the elements of the adjacency matrix represents the number of edges of the operation code sequence control flow graph;

the number of nodes: in the adjacency matrix, the number of rows or the number of columns of the adjacency matrix represents the number of nodes of the operation code sequence control flow graph;

maximum in degree: in the adjacency matrix, the sum of the elements in the i-th column represents the in degree of the i-th basic block, and the largest such sum is the maximum in degree.

6. The method of claim 1, wherein the vulnerability detection method comprises: in step 4, the machine learning model is one of eXtreme Gradient Boosting, K-Nearest Neighbors, Logistic Regression, Decision Tree, Random Forest, Naive Bayes and Long Short-Term Memory.

7. The method of claim 6, wherein the vulnerability detection method comprises: the machine learning model is a Decision Tree.

8. The method of claim 1, wherein the vulnerability detection method comprises: in step 4, the features are normalized: the extreme values of each feature dimension in the data set are found, the data of each feature dimension are mapped proportionally into decimals between 0 and 1, and the decimals are then scaled into integers within 1000.

9. The method of claim 1, wherein the vulnerability detection method comprises: step 6: calculating the contribution degree of each type of feature to each type of vulnerability by adopting the feature_importances_ method, the permutation_importance() method or a model.

10. The method of claim 9 for vulnerability detection of intelligent contracts based on low-dimensional bytecode features, characterized by: in step 3, the vulnerability categories are respectively an integer positive overflow vulnerability, an integer negative overflow vulnerability, a stack call depth attack vulnerability, a transaction execution sequence dependence vulnerability, a time stamp dependence vulnerability and a reentrant vulnerability; the operation codes are respectively unary operation codes, binary operation codes, block information operation codes, control flow operation codes, environment operation codes, system operation codes, stack operation codes and invalid operation codes;

in step 6, the integer positive overflow vulnerability, the integer negative overflow vulnerability, the stack call depth attack vulnerability, the transaction execution sequence dependency vulnerability, the timestamp dependency vulnerability and the reentrant vulnerability are all related to the number of edges, the number of nodes, the maximum out degree and the maximum in degree; the integer positive overflow vulnerability is also related to the unary operation code proportion and the binary operation code proportion; the integer negative overflow vulnerability is also related to the unary operation code proportion and the binary operation code proportion; the stack call depth attack vulnerability is also related to the system operation code proportion and the stack operation code proportion; the transaction execution sequence dependency vulnerability is also related to the environment operation code proportion; the timestamp dependency vulnerability is also related to the block information operation code proportion; the reentrant vulnerability is also related to the control flow operation code proportion.