CN117195220A - Intelligent contract vulnerability detection method and system based on Tree-LSTM and BiLSTM - Google Patents

Publication number
CN117195220A
Authority
CN
China
Prior art keywords
tree
intelligent contract
grammar
lstm
bilstm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310899637.8A
Other languages
Chinese (zh)
Inventor
张鹏程
唐凌军
李雯睿
吉顺慧
楚涵婷
王萧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hohai University HHU
Nanjing Xiaozhuang University
Original Assignee
Hohai University HHU
Nanjing Xiaozhuang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hohai University HHU, Nanjing Xiaozhuang University
Priority to CN202310899637.8A
Publication of CN117195220A
Legal status: Pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Machine Translation (AREA)

Abstract

The application discloses a smart contract vulnerability detection method based on Tree-LSTM and BiLSTM, comprising the following steps: collecting data to form a smart contract data set and processing the data; parsing the processed smart contract source code with a syntax parser to obtain an abstract syntax tree (AST), and processing the AST; performing program slice normalization and tokenization on the smart contract source code text, and training and saving a word2vec word embedding model; extracting features from the AST with a Tree-LSTM model to obtain a grammar feature vector; extracting features from the program slice text with a BiLSTM+attention model to obtain a semantic feature vector; and fusing the grammar feature vector and the semantic feature vector, and using a classifier network to detect vulnerabilities in the smart contract source code from the fused feature vector. The method and system detect vulnerabilities in smart contract source code automatically and thereby help ensure the security of smart contracts.

Description

Intelligent contract vulnerability detection method and system based on Tree-LSTM and BiLSTM
Technical Field
The application belongs to the field of smart contract security, relates to deep learning technology, and in particular relates to a smart contract vulnerability detection method and system based on Tree-LSTM and BiLSTM.
Background
Blockchain is a new distributed computing and storage paradigm that integrates several existing technologies and is characterized by transparency, tamper resistance, traceability and decentralization. A smart contract is a piece of code that runs automatically on a blockchain platform to describe and automatically execute a contract. Smart contracts can be enforced without third-party supervision, enabling people to conduct secure and reliable transactions in an untrusted environment. Ethereum is currently the most influential open-source blockchain platform, and so far it is also the platform with the largest number of smart contracts, the most vulnerability types and the largest losses caused by vulnerabilities. The main programming language for writing smart contracts on Ethereum is Solidity, a high-level contract-development language that draws on JavaScript and Python. It is well suited to developing public decentralized applications (DApps) that run on Ethereum, such as voting, crowdfunding, auctions and multi-signature wallets. Ethereum DApps account for 82% of the total value created on Ethereum, of which high-risk categories such as gambling and games account for 80%.
Because the financial nature of smart contracts can bring enormous gains to attackers, attacks that exploit their vulnerabilities have increased in recent years. Compared with traditional applications, a smart contract is a decentralized application running on a blockchain, and its life cycle and the characteristics of its development language introduce some special security vulnerabilities. In addition, the transparency of the blockchain makes it easy for any user to obtain the bytecode of smart contracts running on the chain, which gives attackers an opening, and transactions between users are not guaranteed by a third party, so an attacker may attack a contract at any time. The immutability of the blockchain also means that, unlike an ordinary program, a smart contract cannot repair a vulnerability by patching. This makes vulnerability detection for smart contracts particularly important.
Because smart contracts are complex and vulnerabilities are widespread, detecting smart contract vulnerabilities by manual effort alone incurs a huge labor cost and still cannot guarantee the security of the contracts. Research on smart contract vulnerability detection therefore aims to detect the vulnerabilities in smart contracts automatically and ensure their security.
Current smart contract vulnerability detection methods can be divided into traditional methods and deep-learning-based methods. Traditional methods mainly include static analysis, dynamic analysis, fuzzing and formal verification. Static analysis analyzes code without executing the program, whereas dynamic analysis analyzes code while the program runs. Static analysis depends heavily on expert-defined rules and is prone to false positives and false negatives; dynamic analysis has a high performance cost and its findings are difficult to reproduce. Fuzzing tests a program for vulnerabilities with randomly or semi-randomly generated input data, which rarely covers all code paths and requires substantial computing resources. Formal verification detects vulnerabilities by mathematical and logical reasoning; it requires specialized mathematical knowledge, does not support complex language features, and is difficult to apply widely.
Compared with traditional methods, deep-learning-based smart contract vulnerability detection achieves higher accuracy and completeness. However, most existing machine learning methods apply natural language processing techniques directly to smart contract vulnerability detection. Programming languages differ from natural languages in that they have strict grammar rules and pronounced structural features, so treating program code as natural language text is not reasonable. An abstract syntax tree (AST) is a tree representation of the abstract syntactic structure of source code in which each node represents one syntactic construct, which makes it well suited to representing the grammar and structural features of a program. Existing AST-based smart contract vulnerability detection methods mostly flatten the AST into a token sequence by traversal, which destroys the original structure, so the grammar features of the program cannot be fully extracted.
Disclosure of Invention
Purpose of the invention: to overcome the shortcomings of the prior art, a smart contract vulnerability detection method and system based on Tree-LSTM and BiLSTM are provided. Grammar features in the AST are extracted with a Tree-LSTM model, semantic features in the source code text are extracted with a BiLSTM+attention model, the features are fused, and a classifier network then performs vulnerability detection, so that vulnerabilities in smart contract source code are detected automatically and smart contract security is ensured.
Technical solution: to achieve the above purpose, the application provides a smart contract vulnerability detection method based on Tree-LSTM and BiLSTM, comprising the following steps:
S1: collecting data to form a smart contract data set, and processing the data;
S2: parsing the smart contract source code processed in step S1 with a syntax parser to obtain an AST, and processing the AST;
S3: performing program slice normalization and tokenization on the smart contract source code text processed in step S1, and training and saving a word2vec word embedding model;
S4: extracting features from the AST processed in step S2 with the constructed Tree-LSTM model to obtain a grammar feature vector;
extracting features from the program slice text processed in step S3 with the constructed BiLSTM+attention model to obtain a semantic feature vector;
S5: fusing the grammar feature vector and the semantic feature vector, and using a classifier network to detect vulnerabilities in the smart contract source code from the fused feature vector.
Further, step S1 specifically includes the following steps:
A1: collecting smart contract source code from GitHub and Etherscan with a web crawler;
A2: labeling the contract source code with tools such as Slither and checking the labels manually;
A3: removing comments and non-ASCII characters from the source code;
A4: according to the characteristics of the vulnerability, finding the key statements in the program and the key methods that contain them; finding all methods that call a key method directly or through nested calls and combining them with the key methods into a method set; and finding the variables on which the methods in the method set depend directly or indirectly to form a variable set;
A5: extracting a program slice from the smart contract source code according to the method set and the variable set obtained in step A4;
A6: checking the text similarity of the program slice texts obtained in step A5 with Python's built-in difflib library, and discarding program slices whose similarity is too high.
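Building the method set in step A4 amounts to a reachability query on the call graph. The following is a minimal sketch, assuming the call graph has already been extracted as a mapping from each method name to the methods it calls; the `calls` dictionary and the method names are hypothetical, and the variable-dependency analysis is omitted.

```python
from collections import deque

def method_set(calls, key_methods):
    """Return the key methods plus every method that calls them,
    directly or through a chain of nested calls."""
    # Invert the call graph: callee -> set of direct callers.
    callers = {}
    for caller, callees in calls.items():
        for callee in callees:
            callers.setdefault(callee, set()).add(caller)

    selected = set(key_methods)
    queue = deque(key_methods)
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, ()):
            if caller not in selected:
                selected.add(caller)
                queue.append(caller)
    return selected

# Hypothetical call graph of a small contract (assumption, for illustration).
calls = {
    "withdraw": ["_send"],
    "withdrawAll": ["withdraw"],
    "deposit": [],
    "_send": [],
}
slice_methods = method_set(calls, ["withdraw"])
```

Passing several key methods at once yields the union of their method sets, since every key method seeds the same search.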
Further, step S2 specifically includes the following steps:
B1: parsing the program slices obtained in step A6 with an ANTLR parser to obtain the parse class of the abstract syntax tree;
B2: traversing the abstract syntax tree depth-first and numbering each node; to better extract grammar information, each node is represented only by its type, and the abstract syntax tree is stored in the form of an adjacency list;
B3: counting the occurrences of the words in the adjacency lists and creating a dictionary that stores each word and its corresponding index;
B4: computing the computation order of each node iteratively, starting from the leaf nodes, and storing it in an array: the adjacency list is traversed once to mark all leaf nodes as computed, with computation order 0; it is traversed again to mark every node whose child nodes are all computed as computed, with computation order 1; this is repeated until every node in the adjacency list has been computed, and the array is finally stored together with the adjacency list.
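Steps B2 and B3 can be sketched as follows, assuming the parse tree is available as nested `(type, children)` tuples. That tuple form is a hypothetical stand-in for the ANTLR parse-tree objects, and the node type names are illustrative only.

```python
from collections import Counter

def to_adjacency_list(tree):
    """Number nodes in depth-first pre-order and return a list of
    {'type': ..., 'child': [...]} entries indexed by node number."""
    nodes = []

    def visit(node):
        node_type, children = node
        index = len(nodes)
        entry = {"type": node_type, "child": []}
        nodes.append(entry)
        for child in children:
            entry["child"].append(visit(child))
        return index

    visit(tree)
    return nodes

def build_vocab(adjacency_lists):
    """Map each node type to an index, most frequent type first."""
    counts = Counter(n["type"] for al in adjacency_lists for n in al)
    return {t: i for i, (t, _) in enumerate(counts.most_common())}

# Toy tree (assumption): one function with two expression statements.
tree = ("FunctionDefinition", [
    ("Block", [
        ("ExpressionStatement", [("FunctionCall", [])]),
        ("ExpressionStatement", [("Assignment", [])]),
    ]),
])
al = to_adjacency_list(tree)
vocab = build_vocab([al])
```

The recursive traversal is fine for small slices; a very deep AST would need an explicit stack to stay under Python's recursion limit.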
Further, step S3 specifically includes the following steps:
C1: contract normalization: replacing function names in the program slices with FUN{#}, where # is a number, and replacing variable names in the program slices with VAR{#}, where # is a number;
C2: tokenizing the program slices;
C3: using all words in the program slices as the training corpus, training a word2vec word embedding model, and saving the trained word embedding model.
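Steps C1 and C2 can be illustrated with a small sketch. It assumes the sets of user-defined function and variable names are already known (for example from the AST), renders the placeholders as `FUN1`, `VAR1` and so on, and uses a deliberately simple regular-expression tokenizer. All three choices are assumptions for illustration, and the sketch tokenizes before renaming purely for convenience.

```python
import re

# Identifiers, integer literals, or any single non-space character.
TOKEN_RE = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*|\d+|\S")

def tokenize(code):
    return TOKEN_RE.findall(code)

def normalize(tokens, function_names, variable_names):
    """Replace user-defined names with numbered placeholders,
    numbering each name by its first appearance."""
    fun_ids, var_ids = {}, {}
    normalized = []
    for token in tokens:
        if token in function_names:
            number = fun_ids.setdefault(token, len(fun_ids) + 1)
            normalized.append("FUN%d" % number)
        elif token in variable_names:
            number = var_ids.setdefault(token, len(var_ids) + 1)
            normalized.append("VAR%d" % number)
        else:
            normalized.append(token)
    return normalized

code = "function withdraw(uint amount) { balances[msg.sender] -= amount; }"
tokens = normalize(tokenize(code), {"withdraw"}, {"amount", "balances"})
```

Because the same name always maps to the same placeholder within a slice, the data-flow pattern survives normalization while project-specific identifiers do not leak into the vocabulary.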
Further, before feature extraction in step S4, the semantic information representation obtained in step S3, the grammar information representation obtained in step S2 and the labels are combined into a data set. Because contracts with vulnerabilities are far fewer in the data set than contracts without vulnerabilities, the numbers of contracts with and without vulnerabilities are counted separately, and the smart contract data set is balanced by weighted random sampling.
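Weighted random sampling can be sketched as below. The inverse-class-frequency weights are an assumption chosen for illustration and are not taken from the application; sampling is done with replacement, and the 256-of-3000 class split mirrors the training set described in the experiment.

```python
import random
from collections import Counter

def sampling_weights(labels):
    """Give every sample the weight 1 / (size of its class), so each
    class carries the same total weight (assumed weighting scheme)."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]

def sample_epoch(labels, k, rng):
    """Draw k sample indices with replacement according to the weights."""
    weights = sampling_weights(labels)
    return rng.choices(range(len(labels)), weights=weights, k=k)

labels = [1] * 256 + [0] * 2744      # 1 = vulnerable, 0 = not vulnerable
rng = random.Random(0)               # fixed seed for repeatability
batch = sample_epoch(labels, 1000, rng)
share = sum(labels[i] for i in batch) / len(batch)
```

With these weights roughly half of the drawn samples come from the minority class, although individual vulnerable contracts are then seen many times per epoch.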
Further, the grammar feature vector in step S4 is obtained as follows: for the grammar information representation, the type information of the nodes is encoded as vectors using the dictionary created in step B3 and a randomly initialized embedding matrix $M \in \mathbb{R}^{d \times v}$, where v is the size of the vocabulary and d is the length of the output vector; grammar features are then extracted with the Tree-LSTM model, which is evaluated in the computation order obtained in step B4, finally yielding the grammar feature vector.
The Tree-LSTM model used for grammar feature extraction is a Multi-way Tree-LSTM, an improvement on the Child-sum Tree-LSTM: instead of adding up the feature vectors of the child nodes, it uses a BiLSTM to learn the relationships among the child nodes and can therefore learn their order.
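For reference, the Child-sum Tree-LSTM that the Multi-way variant builds on is commonly written as below for a node $j$ with input vector $x_j$ and child set $C(j)$. The notation is the usual one from the Tree-LSTM literature rather than anything stated in this application, and the last line is only one plausible way to write the Multi-way change: how the BiLSTM outputs are summarized into a single vector is an assumption.

```latex
\begin{aligned}
\tilde{h}_j &= \sum_{k \in C(j)} h_k \\
i_j &= \sigma\left(W^{(i)} x_j + U^{(i)} \tilde{h}_j + b^{(i)}\right) \\
f_{jk} &= \sigma\left(W^{(f)} x_j + U^{(f)} h_k + b^{(f)}\right), \quad k \in C(j) \\
o_j &= \sigma\left(W^{(o)} x_j + U^{(o)} \tilde{h}_j + b^{(o)}\right) \\
u_j &= \tanh\left(W^{(u)} x_j + U^{(u)} \tilde{h}_j + b^{(u)}\right) \\
c_j &= i_j \odot u_j + \sum_{k \in C(j)} f_{jk} \odot c_k \\
h_j &= o_j \odot \tanh(c_j) \\
\text{Multi-way (assumed):} \quad \tilde{h}_j &= \mathrm{BiLSTM}\left(h_1, \ldots, h_{|C(j)|}\right)
\end{aligned}
```

The sum in the first line is invariant to the order of the children, which is exactly the information a sequence model over $h_1, \ldots, h_{|C(j)|}$ can retain.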
Further, the semantic feature vector in step S4 is obtained as follows: each word in the code slice is encoded as a vector with the word2vec word embedding model obtained in step C3 to form an embedding matrix, and semantic features are extracted from this embedding matrix with the BiLSTM+attention model to obtain the semantic feature vector.
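The attention step can be pictured as a weighted average of the per-token hidden states. The sketch below is an illustration under assumptions: the application does not spell out the attention form, so dot-product scoring against a single query vector is assumed, and the BiLSTM outputs are replaced by toy vectors.

```python
import math

def attention_pool(hidden_states, query):
    """Return the attention-weighted sum of the hidden states and the
    weights, where the weights are a softmax over dot products between
    each hidden state and the query vector."""
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query))
              for h in hidden_states]
    top = max(scores)                       # subtract max for stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(hidden_states[0])
    pooled = [sum(w * h[d] for w, h in zip(weights, hidden_states))
              for d in range(dim)]
    return pooled, weights

# Toy hidden states for three tokens (assumption, not BiLSTM output).
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]
vector, weights = attention_pool(states, query)
```

Tokens whose hidden state aligns with the query receive larger weights, so the pooled vector emphasizes the parts of the slice the model has learned to treat as relevant.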
Further, step S5 specifically includes:
D1: concatenating the grammar feature vector and the semantic feature vector to obtain a fused feature vector;
D2: using the fused feature vector as the input of the classifier network, which predicts the probability p that a vulnerability exists; the smart contract is judged to be vulnerable when p is greater than 0.5;
D3: training the model multiple times on the training data set, tuning and optimizing the model parameters, and saving the parameters of the best-performing model.
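Steps D1 and D2 can be sketched in a few lines. The classifier here is reduced to a single linear layer followed by a two-class softmax, and the weights and feature values are hypothetical; only the concatenation and the 0.5 threshold come from the text.

```python
import math

def fuse(grammar_vec, semantic_vec):
    """Concatenate the two feature vectors."""
    return list(grammar_vec) + list(semantic_vec)

def softmax(logits):
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(fused, weights, biases, threshold=0.5):
    """One linear layer + softmax; index 1 is the 'vulnerable' class.
    Returns the probability p and whether p exceeds the threshold."""
    logits = [sum(w * x for w, x in zip(row, fused)) + b
              for row, b in zip(weights, biases)]
    p = softmax(logits)[1]
    return p, p > threshold

# Hypothetical feature vectors and classifier parameters.
fused = fuse([0.2, -0.1], [0.7])
weights = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
biases = [0.0, 0.0]
p, vulnerable = predict(fused, weights, biases)
```

In a real model the linear layer would be a trained multi-layer network, but the decision rule on p stays the same.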
The detection method of the application can be summarized as follows. A large number of smart contract source files are obtained from public resources and labeled, the programs are sliced according to the vulnerability type, and code fragments whose similarity is too high are removed. In the grammar information extraction part, a syntax parser parses the contract source code to obtain the AST; the AST is traversed depth-first, each node is represented by its node type, the index number of each node and the adjacency-list representation of the AST are obtained, and the computation order of each node is computed and stored. In the semantic information extraction and representation part, the method names and variable names in the program are replaced and a word2vec word embedding model is trained. In the feature extraction part, the grammar information and semantic information are encoded as vectors, and a Tree-LSTM model and a BiLSTM+attention model extract the grammar feature vector and the semantic feature vector respectively. In the vulnerability detection part, the features are fused and a classifier network performs vulnerability detection.
The method first collects and processes an Ethereum smart contract data set from public network resources, and then performs program slicing and filtering. The grammar and semantic features of each contract are then extracted from the smart contract source code. Grammar information is obtained by analyzing the AST of the smart contract: a syntax parser parses the smart contract program slice to obtain the AST, the type of each AST node is extracted to represent the node, and a Multi-way Tree-LSTM model then extracts the grammar feature vector of the smart contract without destroying the grammatical structure of the AST. Semantic features are obtained by analyzing the source code text of the smart contract: the text is normalized and tokenized, and a BiLSTM+attention model then extracts the semantic feature vector. Finally, the grammar feature vector and the semantic feature vector are fused, and a classifier network classifies smart contracts as vulnerable or non-vulnerable.
The application also provides a smart contract vulnerability detection system based on Tree-LSTM and BiLSTM, comprising:
a data collection module, used for collecting a smart contract data set from network resources, preprocessing and labeling the data set, performing program slicing according to vulnerability information, screening the program slices and removing duplicate program slices;
a grammar information representation acquisition module, used for parsing the program slices to obtain abstract syntax trees, traversing each abstract syntax tree, representing each node by its type value, obtaining the index of each node, counting the occurrences of each word to obtain a vocabulary, and finally computing and storing the computation order of each node;
a semantic information representation acquisition module, used for normalizing the program slices, tokenizing the text, and training and saving a word2vec word embedding model;
a grammar feature and semantic feature extraction module, used for encoding the grammar information and semantic information as vectors and extracting the grammar feature vector and the semantic feature vector with a Tree-LSTM model and a BiLSTM+attention model respectively;
and a vulnerability detection module, used for concatenating the grammar feature vector and the semantic feature vector to obtain a fused feature vector and predicting with a classifier network whether the smart contract has a vulnerability.
Beneficial effects: compared with the prior art, the method makes full use of both the grammar features and the semantic features of smart contract code. Because the Tree-LSTM model extracts features directly from tree-structured data, the grammar feature vector of a smart contract can be extracted without destroying the grammatical structure of the AST, so smart contract vulnerabilities can be detected automatically, quickly and accurately.
Drawings
FIG. 1 is a general step diagram of the method of the present application;
FIG. 2 is a flow chart of the method of the present application;
FIG. 3 is a block diagram of the Tree-LSTM.
Detailed Description
The application is further illustrated below with reference to the drawings and specific embodiments. It should be understood that these embodiments only illustrate the application and do not limit its scope; after reading the application, modifications of equivalent form made by those skilled in the art fall within the scope defined by the appended claims.
As shown in FIG. 1 and FIG. 2, the application provides a smart contract vulnerability detection method based on Tree-LSTM and BiLSTM, comprising the following steps:
step 1: collecting a smart contract data set from network resources, labeling the data set, performing program slicing according to vulnerability information, preprocessing the data, screening the program slices and removing duplicate program slices;
step 2: parsing the program slices to obtain abstract syntax trees, traversing each abstract syntax tree to represent each node by its type value and obtain the index of each node, counting the occurrences of each word to obtain a vocabulary, and finally computing and storing the computation order of each node;
step 3: normalizing the program slices of the smart contract source code text, tokenizing the text, and training and saving a word2vec word embedding model;
step 4: performing word embedding, encoding the grammar information and semantic information as vectors, and extracting the grammar feature vector and the semantic feature vector with a Tree-LSTM model and a BiLSTM+attention model respectively;
step 5: concatenating the grammar feature vector and the semantic feature vector obtained in step 4 to obtain a fused feature vector, and predicting with a classifier network whether the smart contract has a vulnerability according to the fused feature vector.
Step 1 specifically comprises the following steps:
step 11: a web crawler calls the GitHub API, searches for repositories matching 'language: Solidity stars: >50', and crawls the files ending in .sol in those repositories, which are the contract source code. For Etherscan, the pages of published smart contract source code are requested and parsed with regular expressions and XPath to obtain the source code;
step 12: performing vulnerability detection on the contract source code with tools such as Slither, manually checking and labeling the contracts that Slither judges to be vulnerable, and discarding any contract that cannot be compiled;
step 13: deleting comments and non-ASCII characters from the smart contracts with regular expressions to obtain the smart contract vulnerability data set $CONTRACTS = \{CONTRACT_1, \ldots, CONTRACT_t\}$, where t is the number of contracts;
step 14: for a smart contract $CONTRACT_k$ in the data set CONTRACTS, finding, according to the characteristics of the vulnerability, the key statements in the program (such as statements that contain a call) and the key method $FUNC_k$ that contains the key statements. Then, using the method name and parameter list of the key method $FUNC_k$, all methods that call the key method directly or through nested calls are found; together with $FUNC_k$ they form the method set $FUNS_k = \{FUNC_k, FUN_{1k}, \ldots, FUN_{nk}\}$, where n is the number of methods that call $FUNC_k$ directly or through nested calls. If there are several key methods, the union of their method sets is taken. All variables on which the methods in the method set depend directly or indirectly are found and form the variable set $VARS_k = \{VAR_{1k}, \ldots, VAR_{mk}\}$, where m is the number of variables;
step 15: for a smart contract $CONTRACT_k$ in the data set CONTRACTS, traversing $CONTRACT_k$ and, according to the method set $FUNS_k$ and the variable set $VARS_k$ obtained in step 14, extracting every statement of each method and the declaration statements of the variables to generate the program slice of $CONTRACT_k$, $CODESEG_k = \{LINE_{1k}, \ldots, LINE_{qk}\}$, where q is the number of statement lines. Doing so for all of CONTRACTS yields the program slice set $CODESEGS = \{CODESEG_1, \ldots, CODESEG_t\}$, where t is the number of slices;
step 16: checking the text similarity of the program slice set CODESEGS obtained in step 15 with Python's built-in difflib library; when the similarity between the program slices of several smart contracts is higher than 0.95, only one of them is kept, which yields the program slice data set $CSDS = \{CSD_1, \ldots, CSD_r\}$, where r is the number of program slices after screening.
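The similarity filter of step 16 can be sketched with `difflib.SequenceMatcher`. The text names only the difflib library and the 0.95 threshold; the use of `ratio()` as the similarity measure, the greedy keep-first strategy and the example slices are assumptions.

```python
import difflib

def deduplicate(slices, threshold=0.95):
    """Keep a slice only if it is not more than `threshold` similar
    to any slice that has already been kept."""
    kept = []
    for text in slices:
        is_duplicate = any(
            difflib.SequenceMatcher(None, text, other, autojunk=False).ratio()
            > threshold
            for other in kept
        )
        if not is_duplicate:
            kept.append(text)
    return kept

# Hypothetical slices: the second differs from the first by one space.
slices = [
    "function withdraw() { msg.sender.call.value(balance)(); balance = 0; }",
    "function withdraw() { msg.sender.call.value(balance)(); balance = 0;  }",
    "function deposit() payable { balance += msg.value; }",
]
unique = deduplicate(slices)
```

The pairwise comparison is quadratic in the number of slices, which is acceptable for a few thousand contracts but would need bucketing or hashing for much larger crawls.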
Step 2 specifically comprises the following steps:
step 21: for a program slice $CSD_k$ in the data set CSDS, parsing it with an ANTLR parser to obtain the parse class $AST_k$ of the abstract syntax tree;
step 22: traversing the parse class $AST_k$ of the abstract syntax tree depth-first, extracting the node information and numbering the nodes in traversal order. To better extract grammar information, each node is represented only by its type, and the parent-child relationships of the abstract syntax tree are stored as an adjacency list $AL_k = [NODE_{1k}, \ldots, NODE_{sk}]$, where s is the number of nodes and the o-th node is $NODE_{ok} = \{\text{'type'}: type_{ok}, \text{'child'}: [index_{ok1}, \ldots, index_{oku}]\}$, where u is the number of child nodes, index is a node number and $type_{ok}$ is the node type. This finally yields the set of abstract syntax tree representations of all program slices, $ALS = \{AL_1, \ldots, AL_r\}$, where r is the number of program slices;
step 23: because each type is a single word, no tokenization is needed; the occurrences of each type in ALS are counted and a dictionary is created that stores each word and its corresponding index;
step 24: for the adjacency list $AL_k$ of any abstract syntax tree, computing the computation order of each node iteratively, starting from the leaf nodes. An array $ORDER_k$ stores the computation order. The adjacency list is traversed once to mark all leaf nodes as computed, with computation order 0. It is traversed again to mark every node whose child nodes are all computed as computed, with computation order 1, and the iteration continues until every node in the adjacency list has been computed, giving the array $ORDER_k = [i_{1k}, \ldots, i_{sk}]$, where s is the number of nodes. Different entries may have the same value, because several nodes may share the same computation order. This yields the set of node computation orders for all abstract syntax trees, $ORDERS = \{ORDER_1, \ldots, ORDER_r\}$, where r is the number of program slices.
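The iteration in step 24 can be sketched as follows, assuming the adjacency list is a Python list of `{'type': ..., 'child': [...]}` dictionaries indexed by node number. The toy tree is hypothetical.

```python
def computation_order(adjacency_list):
    """Assign order 0 to leaves, then repeatedly assign the next order
    to every node whose children were all assigned in earlier passes."""
    order = [None] * len(adjacency_list)
    remaining = len(adjacency_list)
    level = 0
    while remaining:
        # Collect the whole batch first, so nodes assigned in this pass
        # do not unlock their parents until the next pass.
        ready = [
            i for i, node in enumerate(adjacency_list)
            if order[i] is None
            and all(order[c] is not None for c in node["child"])
        ]
        if not ready:
            raise ValueError("adjacency list does not describe a tree")
        for i in ready:
            order[i] = level
        remaining -= len(ready)
        level += 1
    return order

# Toy adjacency list (assumption): node 0 is the root.
al = [
    {"type": "FunctionDefinition", "child": [1]},
    {"type": "Block", "child": [2, 4]},
    {"type": "ExpressionStatement", "child": [3]},
    {"type": "FunctionCall", "child": []},
    {"type": "ExpressionStatement", "child": [5]},
    {"type": "Assignment", "child": []},
]
order = computation_order(al)
```

Each node's order equals its height above the deepest leaf beneath it, so the root always receives the largest value.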
Step 3 specifically comprises the following steps:
step 31: processing the program slice data set CSDS: user-defined function names are replaced with the normalized form FUN{#}, where # is a number, and user-defined variable names in the program slices are replaced with VAR{#}, where # is a number;
step 32: tokenizing the processed program slices to obtain the set $CSDT = \{CSDT_1, \ldots, CSDT_r\}$, where r is the number of program slices;
step 33: using all words in the data set as the training corpus, training a word2vec word embedding model, and saving the trained word embedding model.
Step 4 specifically comprises the following steps:
step 41: combining the semantic information representation, the grammar information representation and the labels into a data set $DATASETS = \{\{AL_1, ORDER_1, CSDT_1, label_1\}, \ldots, \{AL_r, ORDER_r, CSDT_r, label_r\}\}$, where r is the number of program slices. Because contracts with vulnerabilities are far fewer in the data set than contracts without vulnerabilities, the number of contracts with vulnerabilities, $num_1$, and the number of contracts without vulnerabilities, $num_2$, are counted separately; the weight of contracts with vulnerabilities and the weight of contracts without vulnerabilities are then reset from these counts (the two weight expressions are not preserved in this text), and samples are drawn at random according to the weights in each training round;
step 42: for the grammar information representation, encoding the type value of each node as a vector using the dictionary created in step 23 and a randomly initialized embedding matrix $M \in \mathbb{R}^{d \times v}$, where v is the size of the vocabulary and d is the length of the output vector;
step 43: extracting features from $AL_k$ with the Multi-way Tree-LSTM model. The Multi-way Tree-LSTM model is an improvement on the Child-sum Tree-LSTM model and, like it, can handle any number of child nodes. Each Multi-way Tree-LSTM unit takes the state vectors of the child nodes and the word vector representation of the parent node as input and outputs the state vector of the parent node, so the model can process tree-structured data. Unlike the Child-sum Tree-LSTM, the Multi-way Tree-LSTM takes the hidden states of the sequence of child nodes as input and uses a BiLSTM to learn the relationships among the child nodes, rather than simply adding up their hidden states. The Child-sum Tree-LSTM ignores the order of the child nodes, whereas a BiLSTM, which is commonly used to learn long-term dependencies in sequence data, can learn it. To speed up computation, the Tree-LSTM model is evaluated in the computation order obtained in step 24, starting from the bottom nodes, and the hidden state of the root node is finally taken as the grammar feature vector. For example, if the computation order array is $ORDER_k$, the order value runs from 1 to $\max(ORDER_k)$, and at each step all nodes with $ORDER_k[index] = order$ are computed in parallel, where index is the node number;
step 44: encoding each word in $CSDT_k$ as a vector with the word2vec word embedding model from step 33 to obtain an embedding matrix, and then extracting semantic features from the embedding matrix with the BiLSTM+attention model to obtain the semantic feature vector.
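The level-wise parallel evaluation described in step 43 relies on grouping node numbers by their computation order. A minimal sketch of that grouping is shown below; the order array is a hypothetical example, and scheduling the order-0 leaves as the first batch is an assumption about how leaf states are initialized.

```python
from collections import defaultdict

def batches_by_level(order):
    """Group node numbers by computation order; nodes in one batch do
    not depend on each other and can be evaluated together."""
    levels = defaultdict(list)
    for index, level in enumerate(order):
        levels[level].append(index)
    return [levels[level] for level in sorted(levels)]

# Hypothetical order array for a six-node tree whose root is node 0.
order = [3, 2, 1, 0, 1, 0]
schedule = batches_by_level(order)
```

The final batch contains only the root, whose hidden state serves as the grammar feature vector; every earlier batch can be fed to the Tree-LSTM cell as one tensor operation.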
Step 5 specifically comprises the following steps:
step 51: feature fusion: concatenating the grammar feature vector and the semantic feature vector to obtain a fused feature vector;
step 52: using a multi-layer perceptron as the classifier, with the fused feature vector as its input, and predicting the probability p that a vulnerability exists with a softmax function; the smart contract is judged to be vulnerable when p is greater than 0.5;
step 53: training, tuning and optimizing the model multiple times. The model is evaluated comprehensively using accuracy, precision, recall and F1 score, and the parameters of the best-performing model are saved.
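The four evaluation measures named in step 53 follow directly from the confusion-matrix counts. The sketch below uses made-up labels purely to show the arithmetic; they are not results from the application.

```python
def evaluate(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels,
    where 1 means 'vulnerable'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Illustrative labels only (assumption): 3 vulnerable, 5 safe contracts.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
scores = evaluate(y_true, y_pred)
```

On an imbalanced test set, precision and recall on the vulnerable class are more informative than accuracy alone, which is why all four measures are reported together.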
The application also provides a smart contract vulnerability detection system based on Tree-LSTM and BiLSTM, comprising:
a data collection module, used for collecting a smart contract data set from network resources, preprocessing and labeling the data set, performing program slicing according to vulnerability information, screening the program slices and removing duplicate program slices;
a grammar information representation acquisition module, used for parsing the program slices to obtain abstract syntax trees, traversing each abstract syntax tree, representing each node by its type value, obtaining the index of each node, counting the occurrences of each word to obtain a vocabulary, and finally computing and storing the computation order of each node;
a semantic information representation acquisition module, used for normalizing the program slices, tokenizing the text, and training and saving a word2vec word embedding model;
a grammar feature and semantic feature extraction module, used for encoding the grammar information and semantic information as vectors and extracting the grammar feature vector and the semantic feature vector with a Tree-LSTM model and a BiLSTM+attention model respectively;
and a vulnerability detection module, used for concatenating the grammar feature vector and the semantic feature vector to obtain a fused feature vector and predicting with a classifier network whether the smart contract has a vulnerability.
Based on the above scheme, in order to verify the effectiveness of the present application, the following experiments were performed:
The data collection module of the application is used to obtain a reentrancy vulnerability data set containing 3000 intelligent contracts as the training data set, 256 of which contain reentrancy vulnerabilities. The test data set contains 100 intelligent contracts, 30 of which have reentrancy vulnerabilities. After training is completed, the model is comprehensively evaluated using accuracy, precision, recall and F1 score, with the following results:
Table 1 Experimental results
The comparison experiments show that the Tree-LSTM model used by the application effectively extracts the grammar features of intelligent contracts. Compared with the BiLSTM model and the BiLSTM+attention model, the method greatly improves the recall rate, better addresses the problem of a high false positive rate, and improves the practical value of the vulnerability detection model. The method also scores highest among all models on accuracy, precision and F1 score, which shows that the method is effective and better extracts the grammar features and semantic features of intelligent contracts.
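For reference, the four evaluation metrics named above can be computed from predicted and true labels as in the following sketch. The label lists are made-up illustrations and are not the experimental data of the application; label 1 is assumed to mean "vulnerable".

```python
def evaluate(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels (1 = vulnerable)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels for illustration only.
truth = [1, 1, 1, 0, 0, 0, 0, 0]
guess = [1, 1, 0, 1, 0, 0, 0, 0]
print(evaluate(truth, guess))
```

Recall falls with missed vulnerable contracts (false negatives), while precision falls with false alarms (false positives), which is why both are reported alongside accuracy.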
The present embodiment also provides a computer storage medium storing a computer program which, when executed by a processor, implements the method described above. The computer-readable medium may be considered tangible and non-transitory. Non-limiting examples of non-transitory tangible computer readable media include non-volatile memory circuits (e.g., flash memory circuits, erasable programmable read-only memory circuits, or masked read-only memory circuits), volatile memory circuits (e.g., static random access memory circuits or dynamic random access memory circuits), magnetic storage media (e.g., analog or digital magnetic tape or hard disk drives), and optical storage media (e.g., CDs, DVDs, or blu-ray discs), among others. The computer program includes processor-executable instructions stored on at least one non-transitory tangible computer-readable medium. The computer program may also include or be dependent on stored data. The computer programs may include a basic input/output system (BIOS) that interacts with the hardware of the special purpose computer, device drivers that interact with particular devices of the special purpose computer, one or more operating systems, user applications, background services, background applications, and so forth.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Claims (9)

1. The intelligent contract vulnerability detection method based on Tree-LSTM and BiLSTM is characterized by comprising the following steps:
s1: collecting data to form an intelligent contract data set, and performing data processing;
s2: analyzing the intelligent contract source code subjected to the data processing in the step S1 by using a grammar analyzer to obtain an AST, and processing the AST;
s3: performing program slice normalization and text word segmentation on the intelligent contract source code text subjected to the data processing in the step S1, and training and saving the word embedding model word2vec;
s4: performing feature extraction on the AST processed in the step S2 by using the constructed Tree-LSTM model to obtain a grammar feature vector;
performing feature extraction on the program slice text processed in the step S3 by using the constructed BiLSTM+attention model to obtain a semantic feature vector;
s5: fusing the grammar feature vector and the semantic feature vector, and performing vulnerability detection on the intelligent contract source code using the classifier network according to the fused feature vector.
2. The intelligent contract vulnerability detection method based on Tree-LSTM and BiLSTM according to claim 1, wherein the step S1 specifically comprises the following steps:
a1: collecting smart contract source code from GitHub and Etherscan using a web crawler;
a2: labeling the contract source code and checking the labels manually;
a3: removing comments and non-ASCII characters from the source code;
a4: finding the key statements, and the key methods containing them, in the program according to the characteristics of the vulnerability; finding all methods that call the key methods directly or through nested calls, which together with the key methods form a method set; and finding the variables on which the methods in the method set depend directly or indirectly, which form a variable set;
a5: extracting program slices from the intelligent contract source code according to the method set and the variable set obtained in the step A4;
a6: performing a text similarity check on the program slice text obtained in the step A5 using Python's built-in difflib library, and discarding program slices whose similarity is too high.
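Steps A3 and A6 can be sketched with the Python standard library. The claim does not state a similarity threshold, so the value 0.9 below is an assumption, as are the function names. The regular expressions are a simplification that would also strip `//` sequences inside string literals.

```python
import difflib
import re


def strip_comments_and_non_ascii(source):
    """Remove /* ... */ and // comments, then drop non-ASCII characters."""
    source = re.sub(r"/\*.*?\*/", "", source, flags=re.DOTALL)
    source = re.sub(r"//[^\n]*", "", source)
    return source.encode("ascii", "ignore").decode("ascii")


def deduplicate(slices, threshold=0.9):
    """Keep a slice only if it is not too similar to any slice already kept.

    threshold is an assumed cut-off on difflib's similarity ratio (0.0 to 1.0).
    """
    kept = []
    for text in slices:
        if all(difflib.SequenceMatcher(None, text, other).ratio() < threshold
               for other in kept):
            kept.append(text)
    return kept


slices = [
    "function a() { x = 1; }",
    "function a() { x = 1; }",
    "contract Z { mapping(address => uint) b; }",
]
print(deduplicate(slices))
```

Pairwise comparison is quadratic in the number of slices, which is acceptable for a sketch but may need a cheaper pre-filter on a data set of several thousand contracts.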
3. The intelligent contract vulnerability detection method based on Tree-LSTM and BiLSTM according to claim 2, wherein the step S2 specifically comprises the following steps:
b1: parsing the program slices obtained in the step A6 using the ANTLR parser to obtain the parsing class of the abstract syntax tree;
b2: traversing each node of the abstract syntax tree depth-first, numbering each node, representing each node only by its type, and storing the abstract syntax tree in the form of an adjacency list;
b3: counting the occurrences of the words in the adjacency list and creating a dictionary that stores each word and its corresponding index;
b4: iteratively calculating the computation order of each node starting from the leaf nodes and storing it in an array: traversing the adjacency list to mark all leaf nodes as computed and setting their computation order to 0; traversing the adjacency list again to mark every node whose child nodes have all been computed and setting its computation order to 1; repeating this until all nodes in the adjacency list have been computed; and finally saving the array together with the adjacency list.
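The iterative procedure of step B4 can be sketched as follows, assuming the adjacency list is given as `children[i]`, the list of child indices of node `i`. This representation and the function name are assumptions made for illustration.

```python
def computation_order(children):
    """Assign each node the pass in which it becomes computable.

    Leaves get order 0; a node gets order k once all of its children were
    assigned in earlier passes, so the order equals the node's height.
    """
    n = len(children)
    order = [None] * n
    level = 0
    while any(o is None for o in order):
        # Collect first, assign afterwards, so nodes marked in this pass
        # do not unlock their parents within the same pass.
        ready = [i for i in range(n)
                 if order[i] is None
                 and all(order[c] is not None for c in children[i])]
        if not ready:
            raise ValueError("adjacency list contains a cycle")
        for i in ready:
            order[i] = level
        level += 1
    return order


# Hypothetical six-node tree: 0 -> (1, 4), 1 -> (2, 3), 4 -> (5).
tree = [[1, 4], [2, 3], [], [], [5], []]
print(computation_order(tree))  # [2, 1, 0, 0, 1, 0]
```

Storing the resulting array next to the adjacency list means the order is computed once per tree rather than on every training pass.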
4. The intelligent contract vulnerability detection method based on Tree-LSTM and BiLSTM according to claim 1, wherein the step S3 specifically comprises the following steps:
c1: contract normalization, replacing function names in the program slices with FUN{#} and variable names in the program slices with VAR{#}, wherein # represents a number;
c2: segmenting the program slices into words;
c3: taking all words in the program slices as the training corpus, training the word embedding model word2vec, and saving the trained word embedding model.
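Steps C1 and C2 can be illustrated with a small sketch. It assumes the function names and variable names of a slice are already known (for example from the abstract syntax tree) and are passed in as lists, and it reads FUN{#} and VAR{#} as FUN1, FUN2, VAR1, VAR2 and so on. The tokenizer is a simple regular expression, not the segmentation procedure of the application.

```python
import re

IDENTIFIER = re.compile(r"[A-Za-z_]\w*")
# Identifiers, integer literals, or any single punctuation character.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|[^\w\s]")


def normalize(slice_text, function_names, variable_names):
    """Replace user-defined names with numbered FUN/VAR placeholders."""
    mapping = {}
    for number, name in enumerate(function_names, start=1):
        mapping[name] = f"FUN{number}"
    for number, name in enumerate(variable_names, start=1):
        mapping[name] = f"VAR{number}"
    # Substitute whole identifiers only, leaving keywords and types untouched.
    return IDENTIFIER.sub(lambda m: mapping.get(m.group(0), m.group(0)), slice_text)


def tokenize(text):
    """Split normalized source text into word and symbol tokens."""
    return TOKEN.findall(text)


code = "function withdraw(uint amount) { balance -= amount; }"
normalized = normalize(code, ["withdraw"], ["amount", "balance"])
print(normalized)
print(tokenize(normalized))
```

Normalization keeps the vocabulary small, because contract-specific identifiers collapse onto a shared set of placeholder words before word2vec training.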
5. The intelligent contract vulnerability detection method based on Tree-LSTM and BiLSTM according to claim 1, wherein in the step S4, before feature extraction, the semantic information representation obtained in the step S3, the grammar information representation obtained in the step S2 and the labels are combined into a dataset, the number of contracts with vulnerabilities and the number of contracts without vulnerabilities are counted respectively, and the intelligent contract dataset is balanced by using a weighted random sampling method.
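The weighted random sampling mentioned in claim 5 can be sketched as follows. Weighting each sample by the inverse of its class frequency is one common way to balance a data set and is an assumption here, as are the function names; the claim does not specify the weights.

```python
import random
from collections import Counter


def sample_weights(labels):
    """Weight each sample by 1 / (number of samples sharing its label)."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


def balanced_sample(items, labels, k, seed=0):
    """Draw k items with replacement so that each class is equally likely."""
    rng = random.Random(seed)
    return rng.choices(items, weights=sample_weights(labels), k=k)


# Hypothetical data set: one vulnerable contract (1), three clean ones (0).
labels = [1, 0, 0, 0]
items = ["c0", "c1", "c2", "c3"]
print(sample_weights(labels))
print(balanced_sample(items, labels, k=8))
```

With these weights the total probability mass of each class is equal, so the minority class is drawn about as often as the majority class.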
6. The intelligent contract vulnerability detection method based on Tree-LSTM and BiLSTM according to claim 5, wherein the grammar feature vector acquisition process in step S4 is as follows: for the grammar information representation, the type information of the nodes is encoded into vectors using the dictionary created in step B3 and a randomly initialized embedding matrix M ∈ R^(d×v), wherein v is the size of the vocabulary and d is the length of the output vector; grammar features are then extracted using the Tree-LSTM model, which is computed in the computation order obtained in step B4, to finally obtain the grammar feature vector.
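The encoding of node types with a dictionary and a randomly initialized d×v embedding matrix can be sketched as follows. Assigning indices by descending frequency and the initialization range are assumptions, and in a real model the matrix would be a trainable parameter.

```python
import random
from collections import Counter


def build_vocab(node_types):
    """Map each node type to an index, most frequent type first (assumed order)."""
    counts = Counter(node_types)
    return {word: index for index, (word, _) in enumerate(counts.most_common())}


def init_embedding(d, v, seed=0):
    """Create a d x v matrix M with small random entries."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(v)] for _ in range(d)]


def embed(node_type, vocab, matrix):
    """Return the column of M for this node type, a vector of length d."""
    column = vocab[node_type]
    return [row[column] for row in matrix]


# Hypothetical node types collected from an abstract syntax tree.
types = ["Block", "Identifier", "Identifier", "Block", "Identifier", "FunctionCall"]
vocab = build_vocab(types)
M = init_embedding(d=4, v=len(vocab))
print(vocab)
print(embed("Identifier", vocab, M))
```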
7. The intelligent contract vulnerability detection method based on Tree-LSTM and BiLSTM according to claim 6, wherein the Tree-LSTM model is a Multi-Way Tree-LSTM model, each Multi-Way Tree-LSTM unit takes state vectors of a plurality of child nodes and word vector representations of parent nodes as inputs to obtain state vectors of the parent nodes as outputs, and the Multi-Way Tree-LSTM model takes hidden states of a sequence of child nodes as inputs to learn correlations among the child nodes by using BiLSTM.
8. The intelligent contract vulnerability detection method based on Tree-LSTM and BiLSTM according to claim 5, wherein the semantic feature vector acquisition process in step S4 is as follows: each word in the code slice is encoded into a vector using the word embedding model word2vec obtained in step C3 to obtain an embedding matrix; BiLSTM processes all words from both directions to obtain a context information representation vector for each word; an attention mechanism calculates the importance of each word to the code slice to obtain attention weights; and a weighted summation yields the semantic feature vector.
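The final attention step of claim 8 (score each word, normalize the scores into weights, take the weighted sum) can be sketched as follows. The claim does not specify the scoring function, so scoring by a dot product with a context vector is an assumption; `hidden_states` stands in for the BiLSTM outputs, which are not computed here.

```python
import math


def attention_pool(hidden_states, context):
    """Return (weights, pooled) for a list of per-word vectors.

    Assumption: the importance score of a word is the dot product of its
    hidden state with a context vector that would be learned in training.
    """
    scores = [sum(h * c for h, c in zip(state, context)) for state in hidden_states]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(hidden_states[0])
    pooled = [sum(w * state[j] for w, state in zip(weights, hidden_states))
              for j in range(dim)]
    return weights, pooled


# Hypothetical BiLSTM outputs for a three-word slice.
states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
weights, semantic_vec = attention_pool(states, context=[2.0, 0.0])
print(weights)
print(semantic_vec)
```

The weights sum to 1, so the pooled vector has the same dimension as one hidden state regardless of the length of the slice.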
9. An intelligent contract vulnerability detection system based on Tree-LSTM and BiLSTM, comprising:
the data collection module is used for collecting intelligent contract data sets from network resources, preliminarily processing and labeling the data sets, performing program slicing according to vulnerability information, screening the program slices and removing duplicate program slices;
the grammar information representation acquisition module is used for parsing the program slices to obtain an abstract syntax tree, traversing the abstract syntax tree, representing each node by its type value, obtaining the index of each node, counting the occurrence frequency of each word to obtain a vocabulary, and finally calculating and storing the computation order of each node;
the semantic information representation acquisition module is used for normalizing the program slices, segmenting the text into words, and training and saving the word embedding model word2vec;
the grammar feature and semantic feature extraction module is used for encoding the grammar information and the semantic information into vectors, and extracting the grammar feature vector and the semantic feature vector using the Tree-LSTM model and the BiLSTM+attention model respectively;
and the vulnerability detection module is used for concatenating the grammar feature vector and the semantic feature vector to obtain a fusion feature vector, and predicting whether the intelligent contract has a vulnerability using the classifier network.
CN202310899637.8A 2023-07-21 2023-07-21 Intelligent contract vulnerability detection method and system based on Tree-LSTM and BiLSTM Pending CN117195220A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310899637.8A CN117195220A (en) 2023-07-21 2023-07-21 Intelligent contract vulnerability detection method and system based on Tree-LSTM and BiLSTM

Publications (1)

Publication Number Publication Date
CN117195220A true CN117195220A (en) 2023-12-08

Family

ID=88983952

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310899637.8A Pending CN117195220A (en) 2023-07-21 2023-07-21 Intelligent contract vulnerability detection method and system based on Tree-LSTM and BiLSTM

Country Status (1)

Country Link
CN (1) CN117195220A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117573096A (en) * 2024-01-17 2024-02-20 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Intelligent code completion method integrating abstract syntax tree structure information
CN117573096B (en) * 2024-01-17 2024-04-09 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Intelligent code completion method integrating abstract syntax tree structure information
CN117668237A (en) * 2024-01-29 2024-03-08 深圳开源互联网安全技术有限公司 Sample data processing method and system for intelligent model training and intelligent model
CN117668237B (en) * 2024-01-29 2024-05-03 深圳开源互联网安全技术有限公司 Sample data processing method and system for intelligent model training and intelligent model


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination