CN110011986A

CN110011986A - A source code vulnerability detection method based on deep learning

Info

Publication number: CN110011986A
Application number: CN201910214764.3A
Authority: CN
Inventors: 金舒原; 吴跃隆
Original assignee: National Sun Yat Sen University
Current assignee: Sun Yat Sen University; National Sun Yat Sen University
Priority date: 2019-03-20
Filing date: 2019-03-20
Publication date: 2019-07-12
Anticipated expiration: 2039-03-20
Also published as: CN110011986B

Abstract

The invention proposes a source code vulnerability detection method based on deep learning. The method uses deep learning to extract source code features automatically, combines code metrics with the automatically extracted features, and builds a vulnerability detection model with the random forest algorithm. The method offers a higher degree of automation, reduces the dependence on domain expert knowledge, substantially lowers the cost of code auditing, and improves its efficiency. Compared with other methods that use deep learning for vulnerability detection, this method combines multiple representations of the code so as to retain as much of its syntactic and semantic information as possible, which lets the features extracted automatically by the deep learning algorithm characterize the code better. Common code metrics are used as additional detection features, which further improves detection performance.

Description

A source code vulnerability detection method based on deep learning

Technical field

The present invention relates to the technical field of network security, and more particularly to a source code vulnerability detection method based on deep learning.

Background

In today's highly informatized environment, every aspect of people's lives is closely tied to all kinds of software. In daily life, people communicate through instant messaging software, shop online through shopping software, and pay with payment software. Software plays an equally important role in organizations, for example school financial systems, self-service systems in various institutions, and database management systems in enterprises. Because of errors that may exist in the design, implementation, and use of software, most software inevitably contains vulnerabilities. Once a vulnerability is exploited by criminals, it not only directly harms the interests of the software's users but also indirectly affects the interests of the software vendor. Vendors therefore often invest heavily in code audits to reduce the vulnerabilities that may exist in their software and to improve its security.

Code audits usually use vulnerability detection tools to locate vulnerabilities quickly and improve audit efficiency. Software vulnerability detection methods are divided into static analysis and dynamic analysis according to whether the program is executed. Static analysis determines whether a program contains vulnerabilities mainly by analyzing its syntactic and semantic information, and is mainly applied to source code. The main problem of the existing, relatively mature static analysis methods is their heavy dependence on domain expert knowledge: domain experts must spend considerable time and effort analyzing the source code, and both the false-negative rate and the false-positive rate are relatively high. Dynamic analysis determines whether a program contains vulnerabilities by analyzing information generated while the program runs, and is usually applied to executable files. The mainstream dynamic analysis techniques are taint analysis and symbolic execution. Taint analysis cannot cover all execution paths of a program, also requires domain experts to spend considerable time and effort, and has a high false-negative rate. Symbolic execution symbolizes the program input and turns program execution into formulas, so it can in theory cover all execution paths, but the high cost of constraint solving makes it hard to apply in practice.

Summary of the invention

To address the high false-negative rate of existing vulnerability detection methods and their heavy dependence on domain expert knowledge, the present invention proposes a source code vulnerability detection method based on deep learning. The technical solution adopted by the invention is as follows:

A source code vulnerability detection method based on deep learning, comprising the following steps:

S1. Calculate the code metrics of each function in the source code file and assemble them into a code metric vector V_cm;

S2. Taking the function in the source code as the basic unit, extract function features automatically using deep learning; extract the abstract syntax tree (Abstract Syntax Tree, AST), control flow graph (Control Flow Graph, CFG), and program dependency graph (Program Dependency Graph, PDG) from the source code;

S3. Merge the vectors V_cm, V_ast, V_cfg, and V_pdg into one vector V_f as the feature vector of the function, and input the feature vector V_f together with the label of the function into the random forest algorithm for training to obtain the final vulnerability detection model M(V_f);

S4. For a function in the source code to be detected, obtain the four vectors V_cm, V_ast, V_cfg, and V_pdg through S1 and S2 and concatenate them into the feature vector V_f as the input of the vulnerability detection model M(V_f). An output of 1 means the function contains a vulnerability, and an output of 0 means the function contains no vulnerability.

In a preferred embodiment, the specific steps of S2 are as follows:

S21. Use a bidirectional long short-term memory network (Bidirectional Long Short-Term Memory, BLSTM) to extract features automatically from the function AST;

S211. Perform a depth-first traversal of the AST and store the tokens (Token) of the AST, in order, in a token vector;

S212. Use the set of tokens in all token vectors as the dictionary, apply word embedding (Word Embedding) to the tokens, and convert each AST token vector into a numerical vector. Select a suitable value as the fixed length according to the distribution of all vector lengths, and truncate or zero-pad the vectors accordingly so that the numerical vectors of all functions have a standardized length;

S213. Add a label to each function according to whether it contains a vulnerability, and input the length-standardized vectors and their corresponding labels into the BLSTM network for training. Test the trained model using cross-validation, with the F1 score as the evaluation metric. Tune the parameters and repeat the training and testing process; when the F1 score reaches its maximum, take the output of the global max pooling layer as the feature vector V_ast extracted from the function AST.

S22. Represent the extracted CFG and PDG as adjacency matrices and input them separately into a graph embedding model to obtain two fixed-length vectors V_cfg and V_pdg, which serve as the features extracted from the function CFG and PDG.

In a preferred embodiment, the code metrics include statistical metrics and complexity metrics. The statistical metrics comprise line count metrics (total lines, code lines, blank lines, comment lines, preprocessor lines, inactive lines, and the comment-to-code ratio) and statement count metrics (total statements, declaration statements, executable statements, and empty statements). The complexity metrics include cyclomatic complexity, modified cyclomatic complexity, strict cyclomatic complexity, essential complexity, and maximum nesting depth.

In a preferred embodiment, the code line, blank line, comment line, and statement count metrics count only active code, that is, code that is not inside a preprocessor code block.

In a preferred embodiment, in step S213, the labels added to the functions are as follows: label 0 means the function contains no vulnerability, and label 1 means the function contains a vulnerability.

In a preferred embodiment, the network structure of the BLSTM network includes one input layer, one BLSTM layer, one global max pooling layer, several fully connected layers, and one output layer.

In a preferred embodiment, the parameters to be tuned in the BLSTM network include: learning rate, number of epochs, batch size, number of units in each hidden layer, number of fully connected layers, number of units in the fully connected layers, activation functions of the hidden and fully connected layers, loss function, and optimizer (Optimizer).

Compared with the prior art, the beneficial effects of the technical solution of the present invention are:

The source code vulnerability detection method based on deep learning provided by the invention offers a higher degree of automation, reduces the dependence on domain expert knowledge, substantially lowers the cost of code auditing, and improves its efficiency. Compared with other methods that use deep learning for vulnerability detection, this method combines multiple representations of the code so as to retain as much of its syntactic and semantic information as possible, which lets the features extracted automatically by the deep learning algorithm characterize the code better. Common code metrics are used as additional detection features, which further improves detection performance.

Brief description of the drawings

Fig. 1 is the overall framework diagram of the source code vulnerability detection method based on deep learning provided by the invention;

Fig. 2 is the example function used in Embodiment 2 of the source code vulnerability detection method based on deep learning;

Fig. 3 shows the basic control structures of a program in Embodiment 2;

Fig. 4 shows the recursive reduction of the basic structures of the example function in Embodiment 2;

Fig. 5 shows the extraction of the AST, PDG, and CFG of a function in Embodiment 2;

Fig. 6 shows the network structure of the BLSTM in Embodiment 2.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the invention; they are for illustration only and shall not be construed as limiting this patent. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the invention without creative effort fall within the protection scope of the invention.

The technical solution of the invention is further described below with reference to the drawings and embodiments.

Embodiment 1

Referring to Fig. 1, a source code vulnerability detection method based on deep learning comprises the following steps:

S2. Taking the function in the source code as the basic unit, extract function features automatically using deep learning; extract the abstract syntax tree AST, control flow graph CFG, and program dependency graph PDG from the source code;

S21. Use a bidirectional long short-term memory network BLSTM to extract features automatically from the function AST;

S211. Perform a depth-first traversal of the AST and store the tokens of the AST, in order, in a token vector;

S212. Use the set of tokens in all token vectors as the dictionary, apply word embedding to the tokens, and convert each AST token vector into a numerical vector. Select a suitable value as the fixed length according to the distribution of all vector lengths, and truncate or zero-pad the vectors accordingly so that the numerical vectors of all functions have a standardized length;

Embodiment 2

This embodiment is consistent with Embodiment 1. The precondition for implementation is that a large software vulnerability database is available from which the vulnerability type and the location of each vulnerability in the source code can be clearly determined. Source code that contains a given type of vulnerability and is written in the same programming language can be collected from this database as the data set.

S1. Calculate the code metrics of each function in the source code. The code metrics of a function include statistical metrics and complexity metrics. The statistical metrics comprise line count metrics (total lines, code lines, blank lines, comment lines, preprocessor lines, inactive lines, and the comment-to-code ratio) and statement count metrics (total statements, declaration statements, executable statements, and empty statements). The code line, blank line, comment line, and statement count metrics count only active code, that is, code that is not inside a preprocessor code block.

To illustrate the calculation of these code metrics, take the example function of Fig. 2 as the target function. The total line count is 35, the code line count is 28, the blank line count is 3, the comment line count is 2, and the preprocessor line count is 2 (lines 31 and 34 of the preprocessor code block). The inactive line count is 2 (lines 32 and 33 inside the preprocessor code block). The comment-to-code ratio is 2/28, which is 0.07 to two decimal places. The total statement count is 19, the executable statement count is 16, the declaration statement count is 2 (int count = 5 on line 2 and int j = 0 on line 5), and the empty statement count is 1 (lines 32 and 33 inside the preprocessor code block).
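The line count metrics can be sketched in a few lines of Python. This is only a minimal illustration under simplifying assumptions that are not taken from the patent: comments are recognized only as whole lines starting with `//`, any line starting with `#` counts as a preprocessor line, and lines between `#if 0` and `#endif` are treated as inactive (nested conditionals are not handled). The names `classify_lines`, `comment_to_code_ratio`, and `SAMPLE` are hypothetical.

```python
def classify_lines(source: str) -> dict:
    """Count line categories for a C-like function body (simplified rules)."""
    counts = {"total": 0, "code": 0, "blank": 0,
              "comment": 0, "preprocessor": 0, "inactive": 0}
    inactive = False  # True while inside an '#if 0' block (assumed rule)
    for line in source.splitlines():
        counts["total"] += 1
        text = line.strip()
        if text.startswith("#"):
            counts["preprocessor"] += 1
            if text.startswith("#if 0"):
                inactive = True
            elif text.startswith("#endif"):
                inactive = False
        elif inactive:
            counts["inactive"] += 1
        elif not text:
            counts["blank"] += 1
        elif text.startswith("//"):
            counts["comment"] += 1
        else:
            counts["code"] += 1
    return counts


def comment_to_code_ratio(counts: dict) -> float:
    """Comment lines divided by code lines, rounded to two decimals."""
    if counts["code"] == 0:
        return 0.0
    return round(counts["comment"] / counts["code"], 2)


# Hypothetical toy function, not the Fig. 2 example.
SAMPLE = """int f(int x) {
    // double the input

    int y = x * 2;
#if 0
    y = 0;
#endif
    return y;
}"""

COUNTS = classify_lines(SAMPLE)
RATIO = comment_to_code_ratio(COUNTS)
```

A production tool would work on the parsed program rather than on raw text, since trailing comments, block comments, and nested preprocessor conditionals all need real lexing.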

The complexity metrics include cyclomatic complexity, modified cyclomatic complexity, strict cyclomatic complexity, essential complexity, and maximum nesting depth. Cyclomatic complexity equals the number of decision points in the function plus 1; in C/C++ the decision points are if, for, while, case, catch, and the ?: operator. Modified cyclomatic complexity is calculated in the same way as cyclomatic complexity, except that a multi-decision structure is counted as a whole: for example, the individual case branches of a switch structure in C are not counted, and the whole structure counts as 1. Strict cyclomatic complexity equals the cyclomatic complexity plus the number of logical AND and logical OR operators in the conditional expressions of the decision points. Essential complexity is the cyclomatic complexity calculated after all basic structures in the function have been recursively reduced; the basic structures of a structured language are the sequence, selection, and loop structures, whose control flow graphs are shown in Fig. 3. Maximum nesting depth is the maximum nesting depth of the control structures in the function; for the example function of Fig. 2 it is 3.
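As a rough illustration of how decision points might be counted, the following Python sketch scans C-like source text with a regular expression. It is a deliberately crude approximation and not the patent's method: it does not strip comments or string literals, it counts every `&&` and `||` in the text rather than only those inside decision-point conditions, and the names `count_decision_points`, `count_logical_operators`, and `SNIPPET` are hypothetical.

```python
import re

# Keywords listed in the text as C/C++ decision points; '?' is counted separately.
DECISION_KEYWORDS = re.compile(r"\b(if|for|while|case|catch)\b")


def count_decision_points(c_source: str) -> int:
    """Count decision keywords plus ternary '?' operators (crude text scan)."""
    return len(DECISION_KEYWORDS.findall(c_source)) + c_source.count("?")


def count_logical_operators(c_source: str) -> int:
    """Count logical AND / OR operators anywhere in the text (approximation)."""
    return c_source.count("&&") + c_source.count("||")


# Hypothetical toy function, not the Fig. 2 example.
SNIPPET = """
int g(int a, int b) {
    if (a > 0 && b > 0) {
        for (int i = 0; i < a; i++) { b += i; }
    }
    return b > 10 ? 1 : 0;
}
"""

DP = count_decision_points(SNIPPET)      # if, for, and one '?'
V_G = DP + 1                             # decision points plus 1
LCE = count_logical_operators(SNIPPET)   # one '&&'
```

A real implementation would count these constructs on the AST or CFG produced by a static analysis tool, which avoids false matches in comments and strings.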

Denote the cyclomatic complexity by v(G), the modified cyclomatic complexity by mv(G), the strict cyclomatic complexity by sv(G), and the essential complexity by ev(G). They are calculated as follows:

v(G) = dp(F) + 1    (1)

mv(G) = v(G) - c(F) + 1    (2)

sv(G) = v(G) + lce(F)    (3)

ev(G) = v(SG)    (4)

Here dp(F) is the number of decision points in the function, c(F) is the number of non-default branches of the multi-branch structures in the function, lce(F) is the number of logical AND and logical OR operators in the decision-point conditions of the function, and v(SG) is the cyclomatic complexity of the function after the basic program structures have been recursively reduced. The recursive reduction of the Fig. 2 example function is shown in Fig. 4: first extract the control flow graph of the function, then recursively reduce the control flow graph starting from the innermost level, where every basic structure can be reduced to a single flow node, and finally calculate the cyclomatic complexity of the reduced function.

As shown in Fig. 2, the function has 6 decision points, so dp(F) is 6. The multi-decision structure switch has 2 non-default branches, so c(F) is 2. The conditional expressions contain one logical AND, so lce(F) is 1. By formulas (1), (2), and (3), v(G) of the example function is 7, mv(G) is 6, and sv(G) is 8. Reducing the control flow graph of the example function according to the basic structures of Fig. 3, the cyclomatic complexity of the fully reduced control flow graph is 1, so ev(G) is 1.
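Formulas (1) to (3) can be checked against the worked example with a short Python sketch. The function names `v_g`, `mv_g`, and `sv_g` are illustrative only; the input values dp(F) = 6, c(F) = 2, and lce(F) = 1 are the ones stated for the Fig. 2 example.

```python
def v_g(dp: int) -> int:
    """Formula (1): v(G) = dp(F) + 1."""
    return dp + 1


def mv_g(v: int, c: int) -> int:
    """Formula (2): mv(G) = v(G) - c(F) + 1."""
    return v - c + 1


def sv_g(v: int, lce: int) -> int:
    """Formula (3): sv(G) = v(G) + lce(F)."""
    return v + lce


# Values stated in the text for the Fig. 2 example function.
V = v_g(dp=6)
MV = mv_g(V, c=2)
SV = sv_g(V, lce=1)
print(V, MV, SV)  # prints "7 6 8"
```

Formula (4) is not sketched here because it needs the reduced control flow graph, which depends on the graph representation chosen by the static analysis tool.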

Finally, all calculated code metrics are stored in the vector V_cm.

S2. Taking the function as the basic unit, extract function features automatically using deep learning. A static code analysis tool can be used to extract the abstract syntax tree AST, control flow graph CFG, and program dependency graph PDG of a function. Fig. 5 takes a simple function as an example and shows the AST, CFG, and PDG extracted from it;

S21. For the AST of the function, use a bidirectional long short-term memory network BLSTM to extract features from it automatically;

S211. First perform a depth-first traversal of the function's AST and store the contents of the AST nodes, in traversal order, in a token vector. For example, a depth-first traversal of the AST in Fig. 5(b) yields the token vector [func, Fool, DECL, int, =, temp, CALL, number, IF, PRED, ==, temp, DATA, CALL, print, ARG, temp];
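The depth-first traversal in S211 can be sketched as a pre-order walk over a tree of nodes. The `Node` class and the exact nesting of `TREE` below are assumptions made for illustration, since Fig. 5(b) is not reproduced in the text; the tree is simply one shape whose pre-order traversal yields the token vector quoted above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical AST node: a token plus ordered children."""
    token: str
    children: List["Node"] = field(default_factory=list)


def dfs_tokens(root: Node) -> List[str]:
    """Pre-order depth-first traversal that collects node tokens in visit order."""
    tokens = []
    stack = [root]
    while stack:
        node = stack.pop()
        tokens.append(node.token)
        # Push children in reverse so the leftmost child is visited first.
        stack.extend(reversed(node.children))
    return tokens


# Assumed tree shape; only the resulting token order comes from the text.
TREE = Node("func", [
    Node("Fool"),
    Node("DECL", [
        Node("int"),
        Node("=", [Node("temp"), Node("CALL", [Node("number")])]),
    ]),
    Node("IF", [
        Node("PRED", [Node("==", [Node("temp"), Node("DATA")])]),
        Node("CALL", [Node("print"), Node("ARG", [Node("temp")])]),
    ]),
])

TOKENS = dfs_tokens(TREE)
```

An explicit stack is used instead of recursion so that very deep ASTs do not hit Python's recursion limit.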

S212. Build a dictionary from the tokens in all token vectors and encode each token with one-hot encoding as the input of the word embedding model word2vec. Let the total number of tokens be n and the embedding dimension be m. After one-hot encoding, the i-th token is represented by a vector v_i of length n whose i-th element is 1 and whose remaining elements are 0. After the word2vec model has been trained, it yields a word embedding matrix E_n,m of dimension (n, m). After word embedding, the i-th token becomes a vector V_i of length m, where V_i = v_i * E_n,m, and the AST token vector is thereby converted into a numerical vector. Because the ASTs of different functions differ in size, the extracted token vectors and their corresponding numerical vectors also differ in length, whereas the input of the BLSTM must be a vector of fixed length. A suitable value is therefore selected as the fixed length according to the distribution of all vector lengths, and the vectors are truncated or zero-padded to standardize their length, giving the vector V_std;
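The relation V_i = v_i * E_n,m and the length standardization can be illustrated with plain Python lists. This sketch assumes an already trained embedding matrix; the matrix `E` below holds made-up values rather than word2vec output, and the helper names `one_hot`, `embed`, and `pad_or_truncate` are hypothetical.

```python
def one_hot(i: int, n: int) -> list:
    """Length-n vector with a 1 at position i and 0 elsewhere."""
    return [1.0 if j == i else 0.0 for j in range(n)]


def embed(v: list, E: list) -> list:
    """Row vector v (length n) times matrix E (n x m), giving a length-m vector."""
    m = len(E[0])
    return [sum(v[r] * E[r][c] for r in range(len(E))) for c in range(m)]


def pad_or_truncate(seq: list, length: int, m: int) -> list:
    """Cut a sequence of m-dimensional vectors to `length`, or zero-pad it."""
    seq = seq[:length]
    return seq + [[0.0] * m for _ in range(length - len(seq))]


# Made-up embedding matrix for a 3-token dictionary (n = 3, m = 2).
E = [[0.1, 0.2],
     [0.3, 0.4],
     [0.5, 0.6]]

# Multiplying a one-hot vector by E selects the matching row of E.
V_1 = embed(one_hot(1, 3), E)

token_ids = [0, 2]  # hypothetical token indices of one function
sequence = [embed(one_hot(i, 3), E) for i in token_ids]
V_STD = pad_or_truncate(sequence, length=4, m=2)
```

Because a one-hot product only selects a row, practical implementations usually skip the multiplication and index the embedding matrix directly.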

S213. Add a label to each function according to whether it contains a vulnerability: label 0 means the function contains no vulnerability, and label 1 means the function contains a vulnerability. Input the data set formed by the length-standardized vectors V_std and their corresponding labels into the BLSTM network for training and testing, and evaluate the BLSTM model using n-times k-fold cross-validation with the F1 score as the evaluation metric, where n and k are chosen according to the size of the data set. The network structure of the BLSTM is shown in Fig. 6. The parameters to be tuned in the BLSTM network include: learning rate, number of epochs, batch size, number of units in each hidden layer, number of fully connected layers, number of units in the fully connected layers, activation functions of the hidden and fully connected layers, loss function, and optimizer;

To explain the calculation of the F1 score, denote the four possible prediction outcomes of the BLSTM model, true positive (True Positive), false positive (False Positive), true negative (True Negative), and false negative (False Negative), by TP, FP, TN, and FN respectively. A true positive means the test sample actually is a vulnerable function and is predicted to be vulnerable. A false positive means the test sample actually is a non-vulnerable function but is predicted to be vulnerable. A true negative means the test sample actually is a non-vulnerable function and is predicted to be non-vulnerable. A false negative means the test sample actually is a vulnerable function but is predicted to be non-vulnerable. F1 is calculated from the precision P = TP / (TP + FP) and the recall R = TP / (TP + FN) as F1 = 2 * P * R / (P + R).
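The F1 score follows from the TP, FP, and FN counts through the standard definitions of precision and recall. The sketch below uses hypothetical counts, not results from the patent, and returns 0.0 in the degenerate cases where a denominator would be zero, which is a convention chosen here rather than one stated in the text.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2 * P * R / (P + R), with P = TP/(TP+FP) and R = TP/(TP+FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical confusion counts for one test fold.
F1 = f1_score(tp=8, fp=2, fn=2)  # precision and recall are both 0.8
```

True negatives do not enter the formula, which is why F1 is a common choice when non-vulnerable functions heavily outnumber vulnerable ones.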

Tune the parameters and repeat the training and testing process; when the cross-validated F1 score reaches its maximum, take the output vector of the global max pooling layer as the feature vector V_ast extracted from the function AST;
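The n-times k-fold cross-validation used to evaluate each parameter setting can be sketched as a generator of train/test index splits. This is a minimal standard-library illustration: the function name `repeated_k_fold`, its parameters, and the seed are assumptions, and the BLSTM training call that would consume each split is omitted.

```python
import random


def repeated_k_fold(num_samples: int, k: int, n_repeats: int, seed: int = 0):
    """Yield (train_indices, test_indices) for n_repeats rounds of k-fold CV."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(num_samples))
        rng.shuffle(indices)  # a fresh shuffle for every repetition
        folds = [indices[i::k] for i in range(k)]
        for i in range(k):
            test = folds[i]
            train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
            yield train, test


# Example: 10 samples, 5 folds, repeated twice, giving 10 splits in total.
SPLITS = list(repeated_k_fold(num_samples=10, k=5, n_repeats=2))
```

For each parameter setting, the F1 scores of all splits would be averaged, and the setting with the highest mean F1 would be kept.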

S22. Represent the extracted CFG and PDG as adjacency matrices. Take the adjacency matrices of the CFGs of all functions as a data set and use it as the input of the graph embedding model graph2vec, whose output is a fixed-length vector representation V_cfg of each CFG. Likewise, take the adjacency matrices of the PDGs of all functions as a data set and input it into the graph2vec graph embedding model to obtain the fixed-length vector representation V_pdg of each PDG. V_cfg and V_pdg then serve as the features extracted from the function CFG and PDG;
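Building the adjacency matrix of a CFG from its edge list can be sketched as follows. The node numbering and the diamond-shaped example graph (an if/else that rejoins) are hypothetical; the graph embedding step itself is not shown, because its interface depends on the graph2vec implementation used.

```python
def adjacency_matrix(num_nodes: int, edges: list) -> list:
    """Directed adjacency matrix: entry [src][dst] is 1 if an edge src -> dst exists."""
    matrix = [[0] * num_nodes for _ in range(num_nodes)]
    for src, dst in edges:
        matrix[src][dst] = 1
    return matrix


# Hypothetical CFG of an if/else: node 0 branches to 1 and 2, both rejoin at 3.
CFG_EDGES = [(0, 1), (0, 2), (1, 3), (2, 3)]
CFG_MATRIX = adjacency_matrix(4, CFG_EDGES)
```

The same representation applies to a PDG, where the edges would stand for data and control dependencies instead of control flow.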

S3. Merge the vectors V_cm, V_ast, V_cfg, and V_pdg into one vector V_f as the feature vector of the function, and input the feature vector V_f together with the label of the function into the random forest algorithm for training to obtain the final vulnerability detection model M(V_f);

S4. For a function in the source code to be detected, obtain the four vectors V_cm, V_ast, V_cfg, and V_pdg through S1 and S2 and concatenate them into the feature vector V_f as the input of the vulnerability detection model M(V_f). An output of 1 means the function contains a vulnerability, and an output of 0 means the function contains no vulnerability.
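Steps S3 and S4 amount to concatenating the four vectors and letting an ensemble of decision trees vote on the label. The sketch below shows only the concatenation and the majority-vote aggregation that random forest classifiers classically use; the three "trees" are hypothetical hand-written threshold rules standing in for trained trees, and all feature values are made up.

```python
from collections import Counter


def build_feature_vector(v_cm, v_ast, v_cfg, v_pdg) -> list:
    """Concatenate the four per-function vectors into V_f."""
    return list(v_cm) + list(v_ast) + list(v_cfg) + list(v_pdg)


def majority_vote(tree_predictions: list) -> int:
    """Return the most common label among the trees' predictions."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical stand-ins for trained decision trees (an odd count avoids ties).
STUB_TREES = [
    lambda f: int(f[0] > 5),    # e.g. a complexity metric above a threshold
    lambda f: int(f[3] > 0.5),  # e.g. one CFG embedding component
    lambda f: int(f[4] > 0.5),  # e.g. one PDG embedding component
]

V_F = build_feature_vector([7, 3], [0.9], [0.2], [0.8])
VOTES = [tree(V_F) for tree in STUB_TREES]
LABEL = majority_vote(VOTES)  # 1 means vulnerable, 0 means not vulnerable
```

In practice the trees would be learned from the labeled feature vectors by a random forest implementation rather than written by hand.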

Evidently, the above embodiments are merely examples given to illustrate the invention clearly and do not limit its embodiments. A person of ordinary skill in the art can make other changes or variations in different forms on the basis of the above description. It is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within the protection scope of the claims of the invention.

Claims

1. A source code vulnerability detection method based on deep learning, characterized by comprising the following steps:

S2. Taking the function in the source code as the basic unit, extract function features automatically using deep learning; use a static code analysis tool to extract the abstract syntax tree AST, control flow graph CFG, and program dependency graph PDG of the function, and convert them in turn into the numerical vectors V_ast, V_cfg, and V_pdg.

S3. Merge the vectors V_cm, V_ast, V_cfg, and V_pdg into one vector V_f as the feature vector of the function, and input the feature vector V_f together with the label of the function into the random forest algorithm for training to obtain the final vulnerability detection model M(V_f);

S4. For a function in the source code to be detected, obtain the four vectors V_cm, V_ast, V_cfg, and V_pdg through S1 and S2 and concatenate them into the feature vector V_f as the input of the vulnerability detection model M(V_f); an output of 1 means the function contains a vulnerability, and an output of 0 means the function contains no vulnerability.

2. The source code vulnerability detection method based on deep learning according to claim 1, characterized in that the specific steps of S2 are as follows:

S21. Use a bidirectional long short-term memory network to extract features automatically from the function AST;

S211. Perform a depth-first traversal of the AST and store the tokens of the AST, in order, in a token vector;

S212. Use the set of tokens in all token vectors as the dictionary, apply word embedding to the tokens, and convert each AST token vector into a numerical vector; select a suitable value as the fixed length according to the distribution of all vector lengths, and truncate or zero-pad the vectors accordingly so that the numerical vectors of all functions have a standardized length;

S213. Add a label to each function according to whether it contains a vulnerability, and input the length-standardized vectors and their corresponding labels into the bidirectional long short-term memory network for training; test the trained model using cross-validation, with the F1 score as the evaluation metric; tune the parameters and repeat the training and testing process; when the F1 score reaches its maximum, take the output of the global max pooling layer as the feature vector V_ast extracted from the function AST.

S22. Represent the extracted CFG and PDG as adjacency matrices and input them separately into a graph embedding model to obtain two fixed-length vectors V_cfg and V_pdg, which serve as the features extracted from the function CFG and PDG.

3. The source code vulnerability detection method based on deep learning according to claim 1, characterized in that the code metrics include statistical metrics and complexity metrics, wherein the statistical metrics comprise line count metrics (total lines, code lines, blank lines, comment lines, preprocessor lines, inactive lines, and the comment-to-code ratio) and statement count metrics (total statements, declaration statements, executable statements, and empty statements); the complexity metrics include cyclomatic complexity, modified cyclomatic complexity, strict cyclomatic complexity, essential complexity, and maximum nesting depth.

4. The source code vulnerability detection method based on deep learning according to claim 3, characterized in that the code line, blank line, comment line, and statement count metrics count only active code.

5. The source code vulnerability detection method based on deep learning according to claim 2, characterized in that, in step S213, the labels added to the functions are as follows: label 0 means the function contains no vulnerability, and label 1 means the function contains a vulnerability.

6. The source code vulnerability detection method based on deep learning according to claim 2, characterized in that the network structure of the bidirectional long short-term memory network includes one input layer, one BLSTM layer, one global max pooling layer, several fully connected layers, and one output layer.

7. The source code vulnerability detection method based on deep learning according to claim 6, characterized in that the parameters to be tuned in the BLSTM network include: learning rate, number of epochs, batch size, number of units in each hidden layer, number of fully connected layers, number of units in the fully connected layers, activation functions of the hidden and fully connected layers, loss function, and optimizer.