CN116204885A

CN116204885A - Opcode sequence generation method based on Ethereum transaction data replay

Info

Publication number: CN116204885A
Application number: CN202211531992.1A
Authority: CN
Inventors: 王国军; 李培强; 黎相彬; 邢萧飞; 彭滔; 陈淑红; 刘湘勇
Original assignee: Guangzhou University
Current assignee: Guangzhou University
Priority date: 2022-12-01
Filing date: 2022-12-01
Publication date: 2023-06-02

Abstract

The invention relates to the technical field of Ethereum smart contract vulnerability detection and discloses an opcode sequence generation method based on Ethereum transaction data replay, comprising the following steps: S1, a Geth client instrumentation stage; S2, a data replay stage; S3, a transaction opcode sequence acquisition stage; and S4, a transaction opcode sequence storage stage. When a contract executes in the virtual machine, its bytecode is converted into opcodes using the API provided by Geth, so that one invocation of a method in a smart contract is in fact one execution of the opcode sequence of that method; the opcode sequences executed in call order can therefore be regarded as forming the control flow path of one transaction. The method collects the opcode sequences produced during transaction execution for use in training machine learning and deep learning models, which can improve the accuracy of later detection of known and unknown smart contract vulnerabilities.

Description

Opcode sequence generation method based on Ethereum transaction data replay

Technical Field

The invention relates to the technical field of Ethereum smart contract vulnerability detection, and in particular to an opcode sequence generation method based on Ethereum transaction data replay.

Background

Blockchain, a decentralized distributed ledger technology, has been widely applied in fields including medicine, economics, the Internet of Things, and software engineering by virtue of being tamper-proof, unforgeable, non-repudiable, traceable, and free of any need for a trusted third party. Industries are now exploring blockchain technology to support more complex and diverse business requirements. As application scenarios diversify, smart contracts grow steadily more complex, and the impact of their security vulnerabilities becomes more pronounced. Because smart contracts hold large amounts of funds and cannot be modified once deployed on chain, attacks against them occur frequently and tend to cause more serious damage than attacks on conventional network systems. For example, in June 2016 a hacker exploited the reentrancy vulnerability of The DAO (decentralized autonomous organization) contract to steal Ether worth about 60 million US dollars. Smart contracts are becoming one of the security bottlenecks of blockchain platforms, and smart contract vulnerability detection has become an urgent problem in applying blockchain technology.

Smart contract vulnerabilities have a variety of causes. First, unlike conventional programs that run on a single computer, smart contracts run on a decentralized blockchain and are typically written in a Turing-complete high-level programming language such as Solidity. These languages may themselves be insecure. For example, The DAO incident mentioned above was caused by a callback mechanism specific to Solidity. Second, smart contracts run on the Ethereum virtual machine (EVM), whose operating mechanism can also make contracts vulnerable. For example, the ABI specification of a smart contract requires an input contract address to be 20 bytes long. When the address is shorter than 20 bytes, the EVM's automatic zero-padding mechanism appends zeros at the end to reach the required length, and it is precisely this feature that gives malicious attackers an opening. Third, defects in the blockchain system can also cause vulnerabilities in smart contracts. An example is the block parameter dependency vulnerability: a blockchain cannot provide secure random numbers, and block-related parameters or information may be assigned to predictable variables. A malicious miner who observes a transaction to the corresponding smart contract in a block can change the current contract state by submitting a malicious transaction, and so gains the chance to launch an attack ahead of time. Fourth, smart contract source code is transparent and immutable, which gives hackers an opportunity: once a security incident occurs, operators cannot repair the vulnerability with patches as they would for a conventional program.

Smart contracts have many types of security vulnerabilities. Taking Ethereum as an example, the code layer has reentrancy, integer overflow, access control, and exception handling vulnerabilities; the EVM execution layer has short address and tx.origin vulnerabilities; and the blockchain system layer has timestamp dependency and transaction ordering dependency vulnerabilities. According to CNVD-BC (the national blockchain vulnerability database), nearly 400 smart contract vulnerabilities have been recorded.

At present, the main methods of smart contract vulnerability detection include formal verification, symbolic execution, fuzz testing, and taint analysis. Formal verification checks the functional correctness of the code and the safety of its properties through mathematical reasoning and proof; it guarantees absolute correctness within a certain scope, but it requires manual participation in modeling and reasoning and is therefore inefficient. The core idea of symbolic execution is to execute the program with symbolic values in place of concrete values; it can achieve high coverage with a reduced set of test cases, but it may suffer from path explosion. Fuzz testing identifies software faults by constructing unexpected input data and monitoring for abnormal program behavior; it is fast and cheap, but because the system behaviors it covers are limited, it cannot reach ideal path coverage.

The prior art has the following defects:

1. When the entire smart contract bytecode sequence, or the opcode sequence disassembled from that bytecode, is used as raw data, the feature vectors constructed for deep learning or machine learning lack context.

2. An opcode sequence generated by constructing a CFG from the opcodes and then traversing it depth-first carries context to some extent, but it cannot be used to detect vulnerabilities dynamically according to how the smart contract is actually deployed on chain.

3. Deep learning and machine learning need massive data as a basis. Although smart contracts are numerous, they are far fewer than transactions, so it is believed that a better model can be trained by using opcode sequences collected from transactions as training data.

Disclosure of Invention

The invention aims to provide an opcode sequence generation method based on Ethereum transaction data replay, so as to solve the problems described in the background.

To achieve the above purpose, the invention provides the following technical solution: an opcode sequence generation method based on Ethereum transaction data replay, comprising the following steps:

S1: Geth client instrumentation stage

1. The Geth source code is obtained from the official website and, following the idea of TxSpector, code for collecting transaction-related information is inserted at different locations and in different methods.

2. A MongoDB-related class is created and the mgo.v2 library is imported to open and close MongoDB connections, connect the Geth client to MongoDB, and perform database operations such as deletion and query.

3. The idea of SODA is incorporated: when an opcode sequence is acquired, it is labeled according to vulnerability type. Because the same transaction opcode sequence may contain several vulnerabilities, one transaction may carry several labels.

4. Each opcode of the virtual machine has a corresponding opcode method in the Geth source code. These opcode methods are modified to obtain information such as Address (address of the contract), Balance (balance of the contract), Origin (address of the transaction originator), Caller (address of the message caller), CallValue (amount carried by the message), Blockhash (hash of the block), GasPrice (gas price of the transaction), Coinbase (address of the miner of the current block), GasLimit (gas limit), Timestamp (UNIX timestamp of the current block), Number (block height), and Difficulty (block difficulty).

5. The class that executes transactions is modified to acquire data such as GasUsed (amount of gas used), TxHash, and InputData (input data of the transaction).
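The hooks described in items 4 and 5 can be pictured with a minimal Python sketch. It is only an illustration of the recording logic, not the Go code inserted into Geth; the class name TxTrace, the method on_opcode, and the sample values are hypothetical.

```python
from dataclasses import dataclass, field

# Opcodes whose handlers the text says are modified to capture context.
CONTEXT_OPCODES = {
    "ADDRESS", "BALANCE", "ORIGIN", "CALLER", "CALLVALUE", "BLOCKHASH",
    "GASPRICE", "COINBASE", "GASLIMIT", "TIMESTAMP", "NUMBER", "DIFFICULTY",
}


@dataclass
class TxTrace:
    """Hypothetical per-transaction record filled in by the hooks."""
    tx_hash: str
    opcodes: list = field(default_factory=list)  # executed opcode names, in order
    context: dict = field(default_factory=dict)  # values observed by context opcodes

    def on_opcode(self, name, value=None):
        """Stand-in for the call made from each modified opcode method."""
        self.opcodes.append(name)
        if name in CONTEXT_OPCODES and value is not None:
            # Keep the first value observed for each context opcode.
            self.context.setdefault(name, value)


# Illustrative run with placeholder values.
trace = TxTrace(tx_hash="0xabc")
for name, value in [("PUSH1", None), ("CALLER", "0x01"), ("CALLVALUE", 0), ("STOP", None)]:
    trace.on_opcode(name, value)
```

Keeping the ordered opcode list separate from the context values mirrors the text: the sequence is what gets labeled and stored, while the context fields accompany it as transaction-related information.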

S2: Data replay stage

After the instrumented client is installed, the MongoDB database is started with ./mongod --config /usr/local/mongodb5/etc/mongodb.conf, and Ethereum transaction data replay is started with geth --syncmode full --datadir data.

S3: Transaction opcode sequence acquisition stage

A transaction is represented by a triplet tx = <from, to, data>, where from and to are the account addresses of the sender and the receiver respectively, and data is the input data of the transaction. Here, a transaction opcode sequence is the opcode sequence executed when a transaction calls a smart contract; transactions that are plain transfers are not processed.

One of the most important features of Ethereum is its support for running smart contracts. Because a smart contract is essentially a piece of code that executes automatically on the blockchain, developers can use it to perform any operation an ordinary computer can perform. All operations on Ethereum are driven by transactions, so users must create, deploy, and call smart contracts through transactions. Note that a contract account cannot initiate a transaction by itself, so the first initiator in a transaction's call chain is always an external account.

When an external account initiates a transaction that deploys a smart contract, from is the address of the sender's external account and to is empty. When the EVM receives the smart contract, it invokes the constructor of the contract being deployed, sets the initial state of the contract, and returns the address of the deployed contract account and the runtime bytecode of the contract. When an external account initiates a transaction that calls a smart contract method, from is the address of the sender's external account, to is the address of the called smart contract already deployed on the blockchain, and data specifies the name of the method to call and its parameters.

From the above analysis, whether a transaction deploys a smart contract or calls one, it ultimately triggers the execution of a smart contract method. A call to a method within a smart contract can therefore be understood as traversing one of all the possible paths in the execution of that contract.
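The distinction drawn above between deployment, contract call, and plain transfer can be sketched as follows. This is a minimal illustration under stated assumptions: the names Tx, classify, and should_collect are hypothetical, the field from is renamed sender because from is a Python keyword, an empty to is modeled as None, and is_contract stands for whatever lookup tells the client that an address holds contract code.

```python
from typing import Callable, NamedTuple, Optional


class Tx(NamedTuple):
    sender: str          # "from" in the text
    to: Optional[str]    # None models a deployment transaction
    data: bytes          # input data of the transaction


def classify(tx: Tx, is_contract: Callable[[str], bool]) -> str:
    """Return 'deploy', 'call', or 'transfer' for a transaction triplet."""
    if tx.to is None:
        return "deploy"
    if is_contract(tx.to):
        return "call"
    return "transfer"


def should_collect(tx: Tx, is_contract: Callable[[str], bool]) -> bool:
    """Plain transfers execute no contract code, so no sequence is collected."""
    return classify(tx, is_contract) != "transfer"


# Illustrative data: one known contract address (placeholder value).
contracts = {"0xc0ffee"}
lookup = contracts.__contains__

deploy_tx = Tx("0xa11ce", None, b"\x60\x80")
call_tx = Tx("0xa11ce", "0xc0ffee", b"\x12\x34\x56\x78")
transfer_tx = Tx("0xa11ce", "0xb0b", b"")
```

With these samples, only transfer_tx is skipped; the deployment and the call both run contract code and so yield an opcode sequence.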

S4: Transaction opcode sequence storage stage

A new transaction table, Transaction, is created in the previously installed MongoDB database.

Preferably, in step S3, the smart contract is written in Solidity and compiled into bytecode; when it executes in the virtual machine, the bytecode is converted into opcodes using the API provided by Geth. One invocation of a method in a smart contract is therefore in fact one execution of the opcode sequence of that method, and the opcode sequences executed in call order can be regarded as forming the control flow path of one transaction.

Preferably, as shown in fig. 2, when an opcode sequence is acquired, the smart contract source code is written in the high-level language Solidity, the compiler compiles the source code into contract bytecode, the contract bytecode is disassembled into opcodes through the API provided by Geth, and the CFG of the smart contract is constructed from the opcodes.
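The bytecode-to-opcode conversion can be illustrated with a small Python disassembler. It is a stand-in for the conversion that the text attributes to the Geth API, not that API itself; the opcode table is a deliberately small subset of the EVM instruction set, and the function name disassemble is hypothetical.

```python
# Small subset of the EVM opcode table, enough for the example below.
OPCODES = {
    0x00: "STOP", 0x01: "ADD", 0x10: "LT", 0x11: "GT", 0x15: "ISZERO",
    0x34: "CALLVALUE", 0x35: "CALLDATALOAD", 0x50: "POP", 0x52: "MSTORE",
    0x56: "JUMP", 0x57: "JUMPI", 0x5B: "JUMPDEST", 0xF3: "RETURN", 0xFD: "REVERT",
}


def disassemble(bytecode: bytes) -> list:
    """Translate raw EVM bytecode into a list of opcode mnemonics."""
    ops = []
    pc = 0
    while pc < len(bytecode):
        byte = bytecode[pc]
        if 0x60 <= byte <= 0x7F:
            # PUSH1..PUSH32 carry 1..32 bytes of immediate data.
            size = byte - 0x5F
            operand = bytecode[pc + 1 : pc + 1 + size]
            ops.append(f"PUSH{size} 0x{operand.hex()}")
            pc += 1 + size
        else:
            # Bytes missing from the subset table are reported as UNKNOWN.
            ops.append(OPCODES.get(byte, f"UNKNOWN(0x{byte:02x})"))
            pc += 1
    return ops


# PUSH1 0x01, PUSH1 0x02, ADD, STOP
listing = disassemble(bytes.fromhex("600160020100"))
```

The PUSH handling matters because the immediate bytes must be skipped; otherwise data would be misread as instructions and the resulting sequence would not match what the virtual machine executes.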

Preferably, as shown in fig. 2, when the method called by a transaction initiated from an external account is test() and the passed parameter satisfies num > 6, the executed path is Path1: Block1 -> Block2 -> Block4; when the passed parameter satisfies num <= 6, the executed path is Path2: Block1 -> Block3 -> Block4. The opcode sequences executed in order in the blocks of Path1 or Path2 then form the control flow path of one transaction. The opcode sequence of each transaction, labeled for the different vulnerability types, can be collected in real time through the instrumented Geth client.
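The test() example can be made concrete with a short Python sketch. The branch condition and the block names come from the text; the opcode contents of each block are hypothetical placeholders, since fig. 2 is not reproduced here.

```python
# Hypothetical opcode contents of the four basic blocks in fig. 2.
BLOCKS = {
    "Block1": ["PUSH1", "CALLDATALOAD", "PUSH1", "GT", "PUSH1", "JUMPI"],
    "Block2": ["JUMPDEST", "PUSH1", "PUSH1", "MSTORE", "PUSH1", "JUMP"],
    "Block3": ["PUSH1", "PUSH1", "MSTORE", "PUSH1", "JUMP"],
    "Block4": ["JUMPDEST", "STOP"],
}


def executed_path(num: int) -> list:
    """Blocks visited by a call to test(num), following the rule in the text."""
    middle = "Block2" if num > 6 else "Block3"
    return ["Block1", middle, "Block4"]


def opcode_sequence(num: int) -> list:
    """Concatenate the opcodes of the visited blocks in execution order."""
    return [op for block in executed_path(num) for op in BLOCKS[block]]


path1 = executed_path(7)   # num > 6
path2 = executed_path(6)   # num <= 6
```

Two transactions that call the same method with different arguments therefore produce different opcode sequences, which is why collecting per-transaction sequences yields more training samples than collecting per-contract ones.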

Preferably, in step S3, transaction opcode sequences are currently classified into 9 types: P0 (normal sequence), P1 (reentrancy vulnerability sequence), P2 (unexpected function call vulnerability sequence), P3 (short address vulnerability sequence), P4 (incorrect permission check vulnerability sequence), P5 (mishandled exception vulnerability sequence), P6 (missing standard event vulnerability sequence), P7 (strict balance check vulnerability sequence), and P8 (timestamp/block number dependency vulnerability sequence).
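Because one transaction may carry several labels, the nine types are naturally modeled as a multi-label set rather than a single class. The following Python sketch shows one possible encoding; the names SeqType and labels_for, and the rule that an empty set of findings maps to P0, are assumptions made for illustration.

```python
from enum import Enum


class SeqType(Enum):
    P0 = "normal"
    P1 = "reentrancy"
    P2 = "unexpected function call"
    P3 = "short address"
    P4 = "incorrect permission check"
    P5 = "mishandled exception"
    P6 = "missing standard event"
    P7 = "strict balance check"
    P8 = "timestamp/block number dependency"


def labels_for(findings) -> list:
    """Return sorted label names; no vulnerability findings means P0."""
    vulns = {f for f in findings if f is not SeqType.P0}
    if not vulns:
        return [SeqType.P0.name]
    return sorted(v.name for v in vulns)


clean = labels_for(set())
multi = labels_for({SeqType.P8, SeqType.P1})
```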

Preferably, the new transaction table Transaction in step S4 contains information such as the block height of the transaction, the transaction time, the initiator, the receiver, the transaction amount, the gas limit, the gas consumed, the transaction fee, the transaction status, the transaction identifier, the input data, the opcode sequence executed by the transaction, and the vulnerability types corresponding to the opcode sequence. The opcode sequences labeled by vulnerability type are stored in the database for later detection of known or unknown smart contract vulnerabilities.
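One row of the Transaction table might look like the following Python sketch. The field names and sample values are hypothetical; only the list of stored items comes from the text. The record is converted to a plain dictionary, which is the shape a document database such as MongoDB stores.

```python
from dataclasses import asdict, dataclass


@dataclass
class TransactionRecord:
    """Hypothetical layout of one document in the Transaction table."""
    block_number: int   # block height of the transaction
    timestamp: int      # transaction time (UNIX seconds)
    sender: str         # initiator
    receiver: str       # receiver
    value: int          # transaction amount
    gas_limit: int
    gas_used: int
    fee: int            # transaction fee
    status: int         # transaction status
    tx_hash: str        # transaction identifier
    input_data: str
    opcodes: list       # opcode sequence executed by the transaction
    labels: list        # vulnerability types, e.g. ["P1", "P8"]


# Placeholder values for illustration only.
record = TransactionRecord(
    block_number=1, timestamp=0, sender="0xa11ce", receiver="0xc0ffee",
    value=0, gas_limit=100000, gas_used=21000, fee=0, status=1,
    tx_hash="0xabc", input_data="0x12345678",
    opcodes=["PUSH1", "CALLDATALOAD", "STOP"], labels=["P0"],
)
document = asdict(record)
```

Storing the labels as a list next to the opcode sequence keeps the multi-label property described for step S1, so a later training job can read sequence and labels from the same document.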

Compared with the prior art, the opcode sequence generation method based on Ethereum transaction data replay provided by the invention has the following beneficial effects:

1. The method generates dynamic opcode sequences by replaying Ethereum transaction data. Such sequences naturally carry context, which can improve the accuracy of later detection of known and unknown smart contract vulnerabilities when they are used to train deep learning and machine learning models.

2. A smart contract may contain tens or hundreds of control flow paths, and each control flow path corresponds to the execution of one transaction, so the volume of transaction data on Ethereum is likely to be far larger than the number of smart contracts. Massive transaction data naturally yields massive opcode sequences, which provides a good data basis for smart contract vulnerability detection.

3. The method provides a way of generating dynamic opcode sequences by replaying Ethereum transaction data, which offers a new approach to detecting known and unknown smart contract vulnerabilities.

Drawings

For a clearer description of the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it will be apparent that the drawings in the description below are only some embodiments of the present invention, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art:

FIG. 1 is a diagram of the steps performed by the system of the present invention;

FIG. 2 is a schematic diagram of opcode sequence acquisition according to the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Referring to figs. 1-2, the present invention provides a technical solution: an opcode sequence generation method based on Ethereum transaction data replay, comprising the following steps:

S1: Geth client instrumentation stage

5. The class that executes transactions is modified to acquire data such as GasUsed (amount of gas used), TxHash, and InputData (input data of the transaction).

S2: Data replay stage

S3: Transaction opcode sequence acquisition stage

A transaction is represented by a triplet tx = <from, to, data>, where from and to are the account addresses of the sender and the receiver respectively, and data is the input data of the transaction. Here, a transaction opcode sequence is the opcode sequence executed when a transaction calls a smart contract; when the transaction is only a transfer, nothing is collected.

From the above analysis, whether a transaction deploys a smart contract or calls one, it ultimately triggers the execution of a smart contract method. By the definition of a control flow graph, each path in the graph is one execution of the program, so a call to a method within a smart contract can be understood as traversing one of all the possible paths in the execution of that contract.

S4: Transaction opcode sequence storage stage

Preferably, when the method called by a transaction initiated from an external account is test() and the passed parameter satisfies num > 6, the executed path is Path1: Block1 -> Block2 -> Block4; when the passed parameter satisfies num <= 6, the executed path is Path2: Block1 -> Block3 -> Block4. The opcode sequences executed in order in the blocks of Path1 or Path2 then form the control flow path of one transaction. The opcode sequence of each transaction, labeled for the different vulnerability types, can be collected in real time through the instrumented Geth client.

In actual operation, it should be noted that relational terms such as "first" and "second" are used solely to distinguish one entity or action from another and do not necessarily require or imply any actual relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.

Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims

1. An opcode sequence generation method based on Ethereum transaction data replay, characterized in that the method comprises the following steps:

S1: Geth client instrumentation stage

1. Obtaining the Geth source code from the official website and, following the idea of TxSpector, inserting further code for collecting transaction-related information at different locations and in different methods;

3. Incorporating the idea of SODA: when an opcode sequence is acquired, labeling it according to vulnerability type; because the same transaction opcode sequence may contain several vulnerabilities, one transaction may carry several labels;

4. Each opcode of the virtual machine has a corresponding opcode method in the Geth source code, and the opcode methods are modified;

5. Modifying the class that executes transactions so as to acquire data such as GasUsed, TxHash, and InputData;

S2: Data replay stage

After the instrumented client is installed, the MongoDB database is started with ./mongod --config /usr/local/mongodb5/etc/mongodb.conf, and Ethereum transaction data replay is started with geth --syncmode full --datadir data;

S3: Transaction opcode sequence acquisition stage

A transaction is represented by a triplet tx = <from, to, data>, where from and to are the account addresses of the sender and the receiver respectively, and data is the input data of the transaction;

S4: Transaction opcode sequence storage stage

2. The opcode sequence generation method based on Ethereum transaction data replay according to claim 1, wherein: in step S3, the smart contract is written in Solidity and compiled into bytecode; when it executes in the virtual machine, the bytecode is converted into opcodes using the API provided by Geth, so that one invocation of a method in a smart contract is in fact one execution of the opcode sequence of that method, and the opcode sequences executed in call order can be regarded as forming the control flow path of one transaction.

3. The opcode sequence generation method based on Ethereum transaction data replay according to claim 1, wherein: as shown in fig. 2, when an opcode sequence is acquired, the smart contract source code is written in the high-level language Solidity, the compiler compiles it into contract bytecode, the contract bytecode is disassembled into opcodes through the API provided by Geth, and the CFG of the smart contract is constructed from the opcodes.

4. The opcode sequence generation method based on Ethereum transaction data replay according to claim 1, wherein: as shown in fig. 2, when the method called by a transaction initiated from an external account is test() and the passed parameter satisfies num > 6, the executed path is Path1: Block1 -> Block2 -> Block4; when the passed parameter satisfies num <= 6, the executed path is Path2: Block1 -> Block3 -> Block4; the opcode sequences executed in order in the blocks of Path1 or Path2 then form the control flow path of one transaction; and the opcode sequence of each transaction, labeled for the different vulnerability types, can be collected in real time through the instrumented Geth client.

5. The opcode sequence generation method based on Ethereum transaction data replay according to claim 1, wherein: in step S3, transaction opcode sequences are currently classified into 9 types in total.

6. The opcode sequence generation method based on Ethereum transaction data replay according to claim 1, wherein: the new transaction table Transaction in step S4 contains information such as the block height of the transaction, the transaction time, the initiator, the receiver, the transaction amount, the gas limit, the transaction fee, the transaction status, the transaction identifier, the input data, the opcode sequence executed by the transaction, and the vulnerability types corresponding to the opcode sequence; the opcode sequences labeled by vulnerability type are stored in the database for later detection of known or unknown smart contract vulnerabilities.