CN107967208A - A Python resource-sensitive defect code detection method based on a deep neural network - Google Patents

A Python resource-sensitive defect code detection method based on a deep neural network

Info

Publication number
CN107967208A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610915633.4A
Other languages
Chinese (zh)
Other versions
CN107967208B (en)
Inventor
陈林
潘陶
陈芝菲
李言辉
徐宝文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Application filed by Nanjing University
Priority to CN201610915633.4A
Publication of CN107967208A
Application granted
Publication of CN107967208B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/36: Preventing errors by testing or debugging software
    • G06F11/3668: Software testing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Stored Programmes (AREA)

Abstract

The present invention is a Python resource-sensitive defect code detection method based on a deep neural network, comprising the following steps: 1) obtain the source code of the historical versions and of the version under test of the same software; 2) use type inference to extract the resource-sensitive code patterns of each version; 3) extract the relevant features of the resource-sensitive code patterns; 4) compute the per-feature similarity between defect code patterns and safe code patterns, and between defect code patterns and code patterns under test, generate feature vectors, and obtain a training set and a test set; 5) train a deep neural network model on the training set, then use the model to merge the features of each pattern pair in the test set, compute the relevance of each pair, and rank the pairs; 6) during program development and maintenance, flag the resource object operations that may fail according to the relevance ranking, to assist development and maintenance. The present invention addresses the current lack of automated methods for identifying Python resource-sensitive code and detecting defect code, thereby reducing software risk, improving software quality, and improving the efficiency with which developers and maintainers develop and maintain software.

Description

A Python resource-sensitive defect code detection method based on a deep neural network
Technical Field
The invention belongs to the field of computer technology, in particular software technology, and specifically relates to a Python resource-sensitive defect code detection method based on a deep neural network.
Background
As software application technology develops, users demand ever higher software quality, and software developers use a variety of techniques to meet those demands. Resource-sensitive code is a code block or statement that handles a resource object. During development and maintenance, much resource-sensitive code carries a hidden risk of exceptions that is often discovered only during maintenance. As agile development spreads and versions are upgraded frequently, cases in which resource-sensitive code suddenly raises an exception keep occurring. The most traditional way to handle exceptions in resource-sensitive code is to catch and handle them with the try-except keywords. However, developers often neglect exception handling during development, which leads to sudden exceptions and application crashes. Identifying and detecting dangerous resource object operations is therefore an indispensable step in program development and maintenance: it effectively improves program quality and helps developers and maintainers find problems in time and devise more effective solutions.
Python has become a programming language that developers strongly favor. Python applications keep emerging in the major open-source communities and have formed a large ecosystem. Python is an object-oriented, interpreted programming language that is concise, elegant, and practical. As a dynamic language, Python is widely used for Internet applications, graphical user interfaces, script embedding, and so on, and therefore touches many types of resources. Because of Python's dynamic nature, developers often change variable types at run time, which produces many unsafe operations. In addition, when Python operates on resource objects, exceptions often arise from causes such as resource configuration, and the problems produced by such resource-sensitive operations are hard to find. At present, developers control these code defects with condition checks, exception handling, and similar means.
Existing methods for identifying and detecting resource objects fall roughly into two classes. The first is based on program analysis data and locates dangerous resource object operations through logical and semantic analysis. The second uses information retrieval and relies on machine learning to identify resource objects and detect defect code. The first class, being based on semantic analysis, produces results quickly but suffers from low accuracy and from semantic rules that are hard to define. The second class extracts features from context and similar sources and then learns and predicts with machine learning; it produces results more slowly but is more accurate and more practical. The present invention performs detection by machine learning.
During maintenance, a single commit by a developer may fix the same defect in several places at once, so the defect code within one version is strongly correlated. The present invention distinguishes defect code from safe code according to historical fix information and uses the correlation among defect code to infer that code similar to historical defect code is likely to be defective. On this basis it provides a Python resource-sensitive defect code detection method based on a deep neural network.
Summary of the Invention
The present invention provides a Python resource-sensitive defect code detection method based on a deep neural network. By mining and comparing the defect code that was fixed in historical versions, the method finds similar code in the code under test and reminds developers and maintainers that the same problem may be present, so that it can be fixed as early as possible. The method collects the historical versions and the version under test of the same Python software from a software version control system. For the historical versions, it identifies resource-sensitive code patterns by type inference, extracts the corresponding pattern features, combines the defect code patterns and safe code patterns into related pattern pairs and unrelated pattern pairs according to historical fix information, and computes feature similarities to generate feature vectors, yielding a training set. For the version under test, it extracts the patterns and their features by the same method, pairs the historical defect code patterns with the patterns of the version under test, and computes feature similarities to generate feature vectors, yielding a test set. It then trains a deep neural network model on the training set and uses the trained model to merge the features of the test set, obtaining the relevance between code under test and defect code. Finally, it ranks the pairs by relevance and identifies the potentially dangerous code in the code under test that closely resembles resource-sensitive code fixed in historical versions, so that suggestions can be made to developers and maintainers and exceptions can be prevented. The present invention aims to address the current lack of automated methods for identifying Python resource-sensitive code and detecting defect code, thereby reducing software risk, improving software quality, and improving the efficiency with which developers develop software.
To achieve the above purpose, the present invention proposes a Python resource-sensitive defect code detection method based on a deep neural network, comprising the following steps:
1) obtain the source code of the historical versions and of the version under test of the same software;
2) use type inference to extract the resource-sensitive code patterns of each version;
3) extract the relevant features of the resource-sensitive code patterns;
4) compute the per-feature similarity between defect code patterns and safe code patterns, and between defect code patterns and code patterns under test, generate feature vectors, and obtain a training set and a test set;
5) train a deep neural network model on the training set, then use the model to merge the features of each pattern pair in the test set, compute the relevance of each pair, and rank the pairs;
6) during program development and maintenance, flag the resource object operations that may fail according to the relevance ranking, to assist development and maintenance.
Further, step 1) proceeds as follows:
Step 1)-1: Initial state;
Step 1)-2: According to file name and version information, obtain from the open-source version control system the source programs that were fixed in the historical versions of the same software and the source program of the version under test;
Step 1)-3: Collection of the source programs of the different software versions is complete.
Further, step 2) proceeds as follows:
Step 2)-1: Initial state;
Step 2)-2: Perform lexical analysis and syntax analysis on the source program of each version, and use the ast module in the Python standard library to generate the abstract syntax tree of each version;
Step 2)-3: According to the abstract syntax defined in the Python standard library, encapsulate each Python type; each type has a mapping table, table, that contains the built-in attribute names or API names of that type.
Step 2)-4: Traverse the abstract syntax tree and, based on the encapsulated types and modules, infer the possible types of each variable. Extract the variables of resource object types.
Step 2)-5: For a variable whose type is not yet identified: if the variable is an API name and a resource object type appears among its parameters, identify it as a resource object type; otherwise the variable is a member of another variable, and if the calling variable is of a resource object type, it is also identified as a resource object type.
Step 2)-6: Take the code snippets that call variables of resource object types as resource-sensitive code patterns.
Step 2)-8: Collection of resource-sensitive code pattern information is complete.
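The sub-steps of step 2) can be sketched with the standard ast module. This is a minimal illustration, not the patent's implementation: the table RESOURCE_CONSTRUCTORS and the function find_resource_patterns are hypothetical names, and the inference is reduced to "a variable assigned from a known resource constructor is a resource object".

```python
import ast

# Hypothetical mapping table: callables assumed to return resource objects.
RESOURCE_CONSTRUCTORS = {"open", "socket"}

def find_resource_patterns(source):
    """Return (variable, api, line) for each method call on a resource variable."""
    tree = ast.parse(source)
    resource_vars = set()
    # Pass 1: a variable assigned from a known resource constructor is a resource.
    for node in ast.walk(tree):
        if isinstance(node, ast.Assign) and isinstance(node.value, ast.Call):
            func = node.value.func
            name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
            if name in RESOURCE_CONSTRUCTORS:
                for target in node.targets:
                    if isinstance(target, ast.Name):
                        resource_vars.add(target.id)
    # Pass 2: a method call on a resource variable is a resource-sensitive pattern.
    patterns = []
    for node in ast.walk(tree):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and isinstance(node.func.value, ast.Name)
                and node.func.value.id in resource_vars):
            patterns.append((node.func.value.id, node.func.attr, node.lineno))
    return sorted(patterns, key=lambda item: item[2])

sample = "f = open('data.txt')\ntext = f.read()\nn = len(text)\nf.close()\n"
print(find_resource_patterns(sample))
```

A real type inference would also follow aliases, attributes, and function parameters; the two-pass walk above only shows where the abstract syntax tree and the per-type table enter the process.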
Further, step 3) proceeds as follows:
Step 3)-1: Initial state;
Step 3)-2: Based on the resource-sensitive code pattern information, locate the resource object operations and extract the API (parameter types, reference order), resource name, call structure, function internal structure, and so on as features.
Step 3)-3: Normalize the names of the API (parameter types, quantity), resource name, call structure, and function structure.
Step 3)-4: Extraction of resource-sensitive code pattern features is complete.
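As a rough illustration of step 3), the sketch below collects a few features for each call made on a named resource variable. The feature names and the function extract_features are assumptions made for this example; argument "types" are approximated by abstract syntax tree node class names, and the function structure by the statement types of the enclosing function body. It assumes Python 3.8 or later, where literals are ast.Constant nodes.

```python
import ast

def extract_features(source, resource_name):
    """Collect simple features for each call made on `resource_name`."""
    tree = ast.parse(source)
    features = []
    for func in ast.walk(tree):
        if not isinstance(func, ast.FunctionDef):
            continue
        # Function structure: statement type names of the function body, in order.
        body_types = [type(stmt).__name__ for stmt in func.body]
        for node in ast.walk(func):
            if (isinstance(node, ast.Call)
                    and isinstance(node.func, ast.Attribute)
                    and isinstance(node.func.value, ast.Name)
                    and node.func.value.id == resource_name):
                features.append({
                    "api": node.func.attr,
                    "arg_types": [type(arg).__name__ for arg in node.args],
                    "resource": resource_name,
                    "function_structure": body_types,
                })
    return features

sample = (
    "def load(stream, size):\n"
    "    try:\n"
    "        return stream.read(size, 0)\n"
    "    except IOError:\n"
    "        return None\n"
)
feats = extract_features(sample, "stream")
print(feats)
```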
Further, step 4) proceeds as follows:
Step 4)-1: Initial state;
Step 4)-2: Divide the identified resource-sensitive code patterns into three classes: defect code patterns, safe code patterns, and code patterns under test;
Step 4)-3: For the historical versions, pair similar defect code patterns with one another according to historical fix information to form related pattern pairs; pair each defect code pattern with the safe code patterns similar to it to form unrelated pattern pairs;
Step 4)-4: For the version under test, pair the defect code patterns with the code patterns under test to form test pattern pairs;
Step 4)-5: Compute each feature similarity for the different pattern pairs and generate feature vectors;
Step 4)-6: The set of feature vectors formed from the code pattern pairs of the historical versions is the training set; the set of feature vectors formed from the code pattern pairs of the version under test is the test set;
Step 4)-7: Collection of training set and test set information is complete;
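The pairing in step 4) can be sketched as follows. This is an assumption-laden simplification: build_pairs is a hypothetical helper, every defect pattern passed in is assumed to come from the same fix (so all of them are mutually "similar"), and patterns are plain labels rather than feature records.

```python
from itertools import combinations, product

def build_pairs(defect, safe, under_test):
    """Form labeled training pairs and unlabeled test pairs from pattern lists."""
    # Related pairs: defect patterns paired with one another (label 1).
    related = [(a, b, 1) for a, b in combinations(defect, 2)]
    # Unrelated pairs: each defect pattern with each safe pattern (label 0).
    unrelated = [(d, s, 0) for d, s in product(defect, safe)]
    # Test pairs: each defect pattern with each pattern under test (no label).
    test_pairs = list(product(defect, under_test))
    return related + unrelated, test_pairs

training, testing = build_pairs(["d1", "d2"], ["s1"], ["c1", "c2"])
print(training)
print(testing)
```

Each pair would then be turned into a feature vector by computing one similarity per feature, which is what the later similarity formulas describe.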
Further, step 5) proceeds as follows:
Step 5)-1: Initial state;
Step 5)-2: Train the deep neural network on the similarity data of the training set generated in step 4), obtaining the value of each model parameter;
Step 5)-3: Feed the test set generated in step 4) into the trained deep neural network model to obtain the relevance values;
Step 5)-4: Sort all code pairs by the computed relevance in descending order, take the top k test pattern pairs as the resource-sensitive code detection result, and label the code under test in those pairs as possible resource-sensitive defect code.
Step 5)-5: Labeling of possible resource-sensitive defect code is complete.
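The ranking in step 5)-4 amounts to a descending sort followed by a cut at k. A minimal sketch, assuming each scored pair is a hypothetical tuple (defect pattern, pattern under test, relevance):

```python
def top_k_suspects(scored_pairs, k):
    """Return the patterns under test from the k most relevant pairs."""
    ranked = sorted(scored_pairs, key=lambda item: item[2], reverse=True)
    return [under_test for _, under_test, _ in ranked[:k]]

scored = [("d1", "c1", 0.2), ("d1", "c2", 0.9), ("d2", "c3", 0.5)]
print(top_k_suspects(scored, 2))
```

If the same pattern under test appears in several high-ranking pairs it is returned more than once; deduplicating is a design choice left out of this sketch.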
Further, step 6) proceeds as follows:
Step 6)-1: Initial state;
Step 6)-2: For code labeled as resource-sensitive, show developers and maintainers the locations where related code occurred in historical versions, recommend that they modify it, and provide a repair scheme.
Step 6)-3: During program development and maintenance, the system automatically checks submitted code and issues a warning for operations on potentially dangerous resources.
Step 6)-4: Use the newly submitted version as historical version data for the next comparison, so that detection results become more accurate.
Step 6)-5: Prompting for resource-sensitive defect code in the code under test is complete.
The present invention merges features with a deep neural network and uses a standard metric to measure the relevance between the code under test and the defect code in historical versions, thereby locating resource-sensitive defect code blocks down to the level of individual statements. After resource-sensitive code is identified by type inference, it is repaired automatically according to the solution used in similar historical versions, and developers and maintainers are notified. In this way, resource-sensitive code and its dangerous operations are identified, the efficiency of software development is improved, and the development of high-quality software products is facilitated.
Brief Description of the Drawings
Fig. 1 is the overall framework diagram of the Python resource-sensitive defect code detection method based on a deep neural network according to an embodiment of the present invention.
Fig. 2 is the flow chart of the Python resource-sensitive defect code detection method based on a deep neural network according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of a possible abstract syntax tree for a loop control structure.
Detailed Description
The method of the present invention first collects, through a software version control system such as CVS, the source code fixed in all historical versions of the same Python software. It then performs lexical analysis and syntax analysis on the source code of the historical versions and of the version under test, performs type inference on the generated abstract syntax trees, marks the variables involved in resource object operations, and identifies resource-sensitive code patterns. According to historical fix information, it picks out the defect code patterns and safe code patterns from the resource-sensitive code patterns of each historical version and forms related pattern pairs and unrelated pattern pairs. Next, it pairs the resource-sensitive code patterns of the version under test with the historical defect code patterns to form test pattern pairs. Based on the extracted pattern features, it computes the similarity of each feature for each pattern pair and generates feature vectors, yielding the training set and the test set. It then trains the deep neural network model on the training set and uses the trained model to merge the features of the test set, obtaining the relevance between each code pattern under test and the historical defect code patterns. Finally, it ranks the pairs by relevance, selects the top k related pattern pairs as the result, and labels the code under test in those pairs as resource-sensitive code with potential defects, so as to assist developers and maintainers during development and maintenance and to prevent exceptions.
To better explain the technical content of the present invention, it is described below with reference to the accompanying drawings.
The overall framework of the present invention is shown in Fig. 1 and its flow chart in Fig. 2. The Python resource-sensitive defect code detection method based on a deep neural network proposed by the present invention comprises the following six steps:
Step 1: Obtain the fixed source code of the historical versions of the same software and the source code of the version under test. Software version control systems such as CVS keep all versions of the program, labeled with version numbers. The historical versions and the version under test of the same Python software can be obtained according to the specified version number.
Step 2: Use type inference to extract the resource-sensitive code patterns from the source code of each version. First, perform lexical analysis and syntax analysis on the source code of each version obtained in step 1, and use the corresponding functions of the ast module in the Python standard library to generate an abstract syntax tree. In the abstract syntax tree, each node and each subtree corresponds to a source code entity. To support type inference, we encapsulate several abstract types, Types, according to the types defined by Python. Each Types has a table attribute that holds the names in the abstract syntax tree related to the attributes or calls of the current type, such as append. For each node in the abstract syntax tree we set a type and a value, together with a unique identifier id. For each node x, t(x) denotes the type of the node, for example an assignment statement; v(x) denotes the value of the node, which is its text representation, for example the content of the assignment statement; id(x) denotes the unique identifier of the node and is used to distinguish nodes.
For example, an assignment statement is a simple statement that corresponds to a leaf node in the abstract syntax tree; the type of that leaf node is "assign statement" and its value is the content of the assignment statement. A while loop statement corresponds to a subtree in the abstract syntax tree; the type of the root node of the subtree is "while statement", its value is the condition of the while statement, and its child nodes are the statements inside the while body and the statements that exit the loop. Fig. 3 shows a possible abstract syntax tree for a loop statement structure.
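The node attributes t(x), v(x), and id(x) described above can be approximated with the standard ast module. This sketch is illustrative only: describe_statements is a hypothetical helper, the node type is taken from the ast class name rather than from the patent's own type names, and ast.get_source_segment requires Python 3.8 or later.

```python
import ast

def describe_statements(source):
    """Return one record per statement node: id(x), t(x), and v(x)."""
    tree = ast.parse(source)
    records = []
    for node in ast.walk(tree):
        if not isinstance(node, ast.stmt):
            continue
        # For a while loop the value is its condition; otherwise the statement text.
        target = node.test if isinstance(node, ast.While) else node
        records.append({
            "id": len(records),                     # unique identifier id(x)
            "type": type(node).__name__,            # node type t(x)
            "value": ast.get_source_segment(source, target),  # node value v(x)
        })
    return records

sample = "i = 0\nwhile i < 3:\n    i = i + 1\n"
for record in describe_statements(sample):
    print(record)
```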
Finally, traverse the whole abstract syntax tree in post-order and infer the types of variables from the type information in the tree and from the attributes of each type recorded in the table mapping of the nodes. The code snippets that call the inferred resource object variables are labeled as resource-sensitive code patterns. A resource-sensitive code pattern is a code snippet that operates on a resource object (a file object, a graphical user interface object, and so on).
For example, in a code snippet where self is a resource object and the switch_backends function is called to operate on it, the snippet is a resource-sensitive code pattern.
Step 3: After step 2, the resource-sensitive code patterns have been extracted from the source code. The features of a resource-sensitive code pattern extracted by the present invention are: API (parameter types, reference order), resource name, call structure, and function structure.
The names of the extracted features are then normalized. For the API feature, the feature similarity is computed from the parameter types and the reference order; for the resource name feature, it is computed from the word sequence in the resource name; for the call structure feature, the call structure similarity is used as the feature similarity; for the function structure feature, the function structure similarity is used as the feature similarity.
Step 4: First, for the historical versions, pair similar defect code patterns with one another according to historical fix information to form related pattern pairs, and pair each defect code pattern with the safe code patterns similar to it to form unrelated pattern pairs. For the version under test, pair the defect code patterns with the code patterns under test to form test pattern pairs. With the pattern features extracted in step 3, each feature similarity of the different pattern pairs can be computed.
The API feature similarity uses the rVSM algorithm. For the parameter types, the weight is computed with the TF-IDF algorithm, which in its standard form reads:
$$w_{type} = TF \times \log\frac{Total_{api}}{Contain_{type}}$$
where $TF$ is the frequency with which the type appears in the API, $Total_{api}$ is the total number of APIs, and $Contain_{type}$ is the number of APIs that contain the type. The present invention uses this weight for the feature vector formed from the API; the type sequence is measured with 2-grams, which is robust to changes in the type sequence. The measures of the type sequence and of the parameter types together form one feature vector. For the feature vectors generated from two versions, the similarity is computed with the rVSM algorithm. In this method, the similarity is the cosine between the historical version feature vector a and the feature vector b of the version under test:
$$\cos(\vec{a}, \vec{b}) = \frac{\vec{a} \cdot \vec{b}}{\lVert \vec{a} \rVert \, \lVert \vec{b} \rVert}$$
where $\vec{a}$ and $\vec{b}$ denote the historical version feature vector a and the feature vector b of the version under test, and $\vec{a} \cdot \vec{b}$ denotes the inner product of the two feature vectors.
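The API similarity described above can be sketched in plain Python. This is an illustration under stated assumptions, not the rVSM implementation: the helper names are hypothetical, the weight uses the standard TF-IDF form TF * log(Total_api / Contain_type), each API is represented simply as a list of parameter type names, and the 2-gram counts are appended to the TF-IDF weights to form one sparse vector.

```python
import math
from collections import Counter

def tfidf_weights(param_types, all_apis):
    """Weight each parameter type of one API by TF * log(total / containing)."""
    weights = {}
    for type_name, count in Counter(param_types).items():
        containing = max(1, sum(1 for api in all_apis if type_name in api))
        tf = count / len(param_types)
        weights[type_name] = tf * math.log(len(all_apis) / containing)
    return weights

def bigram_counts(param_types):
    """Count adjacent type pairs (2-grams) to capture parameter order."""
    return Counter(zip(param_types, param_types[1:]))

def api_vector(param_types, all_apis):
    """Combine type weights and 2-gram counts into one sparse vector."""
    vector = dict(tfidf_weights(param_types, all_apis))
    vector.update(bigram_counts(param_types))
    return vector

def cosine(u, v):
    """Cosine of two sparse vectors stored as dicts."""
    dot = sum(value * v.get(key, 0.0) for key, value in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

all_apis = [["str", "int"], ["str", "bool"], ["bytes"]]
a = api_vector(["str", "int"], all_apis)
b = api_vector(["str", "bool"], all_apis)
print(cosine(a, a), cosine(a, b))
```

A type that appears in every API gets weight zero under this form, which is the usual TF-IDF behavior: it carries no information for telling APIs apart.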
The resource name feature similarity uses a text similarity algorithm. First, the resource name is parsed into a sequence of words. Next, a similarity is computed for the resource name R1 in the historical version and the resource name R2 in the version under test (the formula is not reproduced in this text).
In that formula, lcs(R1, R2) denotes the number of times the sub-words of R1 appear in R2, which gives a quantized value for the resource name and generates the corresponding vector. The example pairs given are "length" with "getLength", and "getLength" with "getlength" (their computed values are not reproduced in this text).
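Because the resource name formula is not available in this text, the sketch below only illustrates the surrounding idea: split each name into a word sequence and count how many words of one name occur in the other. The normalization by the longer word sequence is an assumption made for this example and should not be read as the patent's formula; split_words and name_similarity are hypothetical helpers.

```python
import re

def split_words(name):
    """Split a camelCase or snake_case identifier into lower-case words."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", name)
    return [part.lower() for part in parts]

def name_similarity(r1, r2):
    """Share of words of r1 found in r2, relative to the longer word sequence.

    The ratio is an assumed normalization, chosen only for illustration.
    """
    words1, words2 = split_words(r1), split_words(r2)
    if not words1 or not words2:
        return 0.0
    shared = sum(1 for word in words1 if word in words2)
    return shared / max(len(words1), len(words2))

print(split_words("getLength"), split_words("read_bytes"))
print(name_similarity("length", "getLength"))
```

Under this assumed ratio, "getLength" and "getlength" share no words, because the all-lower-case name cannot be split; a real implementation might fall back to substring matching in that case.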
For the function structure feature similarity and the call structure feature similarity, the tree structure of the abstract syntax tree obtained in step 2 is traversed, and the similarity is obtained by counting the identical tree nodes and computing their proportion. Finally, the set of feature vectors formed from the code pattern pairs of the historical versions is the training set, and the set of feature vectors formed from the code pattern pairs of the version under test is the test set.
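One way to read "count the identical tree nodes and compute their proportion" is a multiset overlap of node types. The sketch below follows that reading; the exact measure is not specified in this text, so the ratio and the helper names node_type_counts and structure_similarity are assumptions.

```python
import ast
from collections import Counter

def node_type_counts(source):
    """Count the abstract syntax tree node types that occur in a snippet."""
    return Counter(type(node).__name__ for node in ast.walk(ast.parse(source)))

def structure_similarity(source_a, source_b):
    """Shared node types (as a multiset) divided by the larger tree size."""
    counts_a, counts_b = node_type_counts(source_a), node_type_counts(source_b)
    shared = sum((counts_a & counts_b).values())
    return shared / max(sum(counts_a.values()), sum(counts_b.values()))

print(structure_similarity("x = f.read()", "y = g.read()"))
print(structure_similarity("x = 1", "while x:\n    pass"))
```

This measure ignores where nodes sit in the tree, so two snippets with the same node types in a different arrangement score 1.0; a tree edit distance would be stricter.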
Step 5: After step 4, the training set and test set made up of feature vectors are available. Because no single feature can indicate on its own whether a pattern is related to a dangerous resource object operation, a deep neural network is used here to merge the features and compute the relevance.
First, the deep neural network is trained on the generated training set. The neural network designed in the present invention comprises an input layer, two hidden layers (hidden layer 1 and hidden layer 2), and an output layer. Hidden layer 1 has twice as many nodes as the input layer, and hidden layer 2 has half as many nodes as the input layer. Each node $H1_i$ of hidden layer 1 is computed from the input nodes (the formula is not reproduced in this text).
In that formula, $w1_i$ and $b$ are the parameters to be trained and $Input_i$ is the value of an input node. Hidden layer 2 is derived from hidden layer 1 by the same formula. To train w and b, the present invention uses batch gradient descent, with the following steps:
1) Initialize: $\Delta w^{(l)} = 0$, $\Delta b^{(l)} = 0$, then randomly initialize w and b to small values;
2) Let the number of iterations be m; for i from 1 to m, compute the gradients with the BP (backpropagation) algorithm and accumulate them (the formulas are not reproduced in this text);
3) Update the parameters (the formula is not reproduced in this text),
where λ is an optional parameter, set to 2 in the present invention. The deep neural network model is trained by this method.
In the detection stage, the feature vector of each pattern pair in the test set is used as input and propagated through the node formula above. The final output is a relevance value that represents how strongly the pattern pair is related. A nonlinear neural network reflects the degree of relevance better than a linear information retrieval method, and the improvement is significant.
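The detection-stage computation can be sketched as a forward pass through a network with the layer sizes described above (hidden layer 1 with twice the input size, hidden layer 2 with half the input size, one output). The node formula is not reproduced in this text, so this sketch assumes the common form sigmoid(sum(w * input) + b); the sigmoid activation, the weight range, and all function names are assumptions, and the weights are random rather than trained.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def make_layer(n_in, n_out, rng):
    """Small random weights and biases for one fully connected layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [rng.uniform(-0.1, 0.1) for _ in range(n_out)]
    return weights, biases

def forward(layer, inputs):
    """Compute sigmoid(w . inputs + b) for every node of the layer."""
    weights, biases = layer
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def build_network(n_features, seed=0):
    """Layer sizes follow the description: 2n, n/2, then a single output."""
    rng = random.Random(seed)
    hidden1 = 2 * n_features
    hidden2 = max(1, n_features // 2)
    return [make_layer(n_features, hidden1, rng),
            make_layer(hidden1, hidden2, rng),
            make_layer(hidden2, 1, rng)]

def relevance(network, feature_vector):
    """Propagate one feature vector through all layers; return the output value."""
    activations = feature_vector
    for layer in network:
        activations = forward(layer, activations)
    return activations[0]

# Four similarity features: API, resource name, call structure, function structure.
network = build_network(4)
print(relevance(network, [0.9, 0.5, 1.0, 0.3]))
```

With untrained weights the output is meaningless as a relevance score; the sketch only shows the shape of the computation that training would then fit.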
In the deep neural network, the weight of each link between the intermediate layer and the input layer is obtained by training on the data of the historical versions, and the same holds for the other weights. Through extensive training, some of the links and weights among the neurons change, which optimizes the output.
The obtained relevance values are sorted in descending order, and the top k pattern pairs are chosen as the output.
Step 6: For code under test with very high relevance, remind developers and maintainers of the location where it occurs and of the related historical resource operations, provide the exception handling scheme previously applied to this resource, and issue a warning. Python source code that has already been checked is used as historical version data for the next detection, which improves detection accuracy. Newly submitted Python source code is checked automatically, and alerts are sent to developers and maintainers according to the result.
For example, a historical version contains an operation on a resource object (the listing is not reproduced in this text).
In that historical version, the variable self is a resource object, and the code performs a read operation on it. To prevent exceptions, the developer wrapped the statement in try-except exception handling.
The following statements appear in the source code of the version under test:
    def read_bytes(self, num_bytes, callback=None, streaming_callback=None,
                   partial=False):
        self._try_inline_read()
Here a read operation is also performed on a resource object, using the same API, but without exception handling. The two code snippets are combined into a code pair, and the present invention can identify and detect whether the two are related, thereby determining whether the code under test is resource-sensitive code, reminding developers and maintainers to handle it, and providing the related historical version code information.
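The contrast between the two versions can be illustrated with a self-contained example. The functions below are hypothetical stand-ins, not the code from the patent or from any library: one read is guarded by try-except, as in the historical version, and the other is not, as in the version under test.

```python
import io

def read_guarded(stream, num_bytes):
    """Historical-style code: the read is wrapped in exception handling."""
    try:
        return stream.read(num_bytes)
    except (OSError, ValueError):
        # Reading from a closed or broken stream is handled instead of crashing.
        return None

def read_unguarded(stream, num_bytes):
    """Code-under-test style: same API call, no exception handling."""
    return stream.read(num_bytes)

closed = io.BytesIO(b"abc")
closed.close()
print(read_guarded(closed, 1))           # handled: returns None
print(read_guarded(io.BytesIO(b"abc"), 2))
```

Calling read_unguarded(closed, 1) raises ValueError, which is the kind of unhandled failure the method is meant to flag by pairing the unguarded call with the guarded historical one.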
In conclusion, the present invention provides a Python resource-sensitive defect code detection method based on a deep neural network. It addresses the current lack of automated methods for detecting Python resource-sensitive code and identifying dangerous operations, improves the quality of software applications, and ensures controllability during software evolution.

Claims (7)

1. A Python resource-sensitive defect code detection method based on a deep neural network, characterized in that: the historical versions and the version under test of the same Python software are collected from a software version control system; for the historical versions, resource-sensitive code patterns are identified by type inference and the corresponding pattern features are extracted, the defect code patterns and safe code patterns are combined into related pattern pairs and unrelated pattern pairs according to historical fix information, and feature similarities are computed to generate feature vectors, yielding a training set; for the version under test, the patterns and their features are extracted by the same method, the historical defect code patterns are paired with the patterns of the version under test, and feature similarities are computed to generate feature vectors, yielding a test set; then a deep neural network model is trained on the training set, and the trained model merges the features of the test set to obtain the relevance between the code under test and the defect code; finally, the pairs are ranked by relevance, the top k related code pairs are selected as the result, and the code under test in those pairs is labeled as resource-sensitive code with potential defects, so as to detect dangerous resource object operations and provide auxiliary information; the method comprises the following steps:
1) source code of old version and the source code of version to be measured of same software are obtained;The software version control system such as CVS In save all versions of the program and submit, and have standardized version number, can be obtained same according to the version number of formulation The old version of Python softwares and version source code to be measured;
2) use pattern infers the resource sensitive code pattern for extracting each version;To the old version gathered in step 1 Morphology and syntactic analysis are carried out with version source code to be measured, it is corresponding abstract using the ast modules generation in Python java standard libraries Syntax tree, Python types are abstracted, and set type and value to each node, reuse global type inference Method, extract resource sensitive code pattern;
A resource-sensitive code pattern is a code fragment that operates on a resource object (a file object, a graphical user interface object, etc.);
For example:
In this code snippet, self is a resource object, and the switch_backends function is called to operate on it; it is therefore a resource-sensitive code pattern;
Definition 1: The Python standard library is distributed with Python and contains built-in modules that provide various system-level functions;
Definition 2: Type inference is a method of inferring the types of variables in a dynamic language by static analysis of the source code;
Definition 3: The type identifies the kind of a node in the abstract syntax tree; its values come from Python's abstract grammar definition and include function_call, etc.;
Definition 4: The value is the textual representation of a node's content in the abstract syntax tree, for example the condition of a while control structure;
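As an illustration of step 2), the following is a minimal sketch of how the ast module can locate calls made on an object. It only finds calls of the form receiver.method(...) and performs no type inference, so it is a simplification of the method described above. The sample class Canvas and the helper find_receiver_calls are hypothetical names introduced for this sketch; only switch_backends is taken from the text.

```python
import ast

# Hypothetical sample source; only the name switch_backends comes from the text.
SOURCE = '''
class Canvas:
    def refresh(self, name):
        self.switch_backends(name)
        helper(name)
'''


def find_receiver_calls(source):
    """Return (receiver, method, line) for every call of the form receiver.method(...)."""
    tree = ast.parse(source)
    calls = []
    for node in ast.walk(tree):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and isinstance(node.func.value, ast.Name)):
            calls.append((node.func.value.id, node.func.attr, node.lineno))
    # Sort by line number so the result follows source order.
    return sorted(calls, key=lambda call: call[2])


calls = find_receiver_calls(SOURCE)
```

A real implementation would additionally need the inferred type of each receiver to decide whether it is a resource object; this sketch only shows the syntax-tree traversal.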
3) Extract the relevant features of the resource-sensitive code patterns; step 2) has extracted the resource-sensitive code patterns from the source code; the features that the present invention extracts from each pattern are: API (parameter types, reference order), resource name, call structure, and function structure; finally, the names of the extracted features are normalized;
Definition 1: For the API feature, the feature similarity is calculated from the parameter types and the reference order;
Definition 2: For the resource name feature, the feature similarity is calculated from the word sequence in the resource name;
Definition 3: For the call structure feature, the call structure similarity is used as the feature similarity;
Definition 4: For the function structure feature, the function structure similarity is used as the feature similarity;
4) Calculate the similarity of each feature between defect code patterns and safe code patterns, and between defect code patterns and code patterns under test, generate feature vectors, and obtain the training set and the test set; for the historical versions, according to the historical repair information, similar defect code patterns are paired two by two to form related pattern pairs, and each defect code pattern is paired with the safe code patterns similar to it to form unrelated pattern pairs; for the version under test, defect code patterns and code patterns under test are paired two by two to form test pattern pairs; then, from the feature information extracted in step 3), the similarity of each feature is calculated for every pattern pair and a feature vector is generated; finally, the feature vectors of the code pattern pairs of the historical versions form the training set, and the feature vectors of the code pattern pairs of the version under test form the test set;
Definition 1: A defect code pattern is a resource-sensitive defect code pattern that was subsequently repaired according to the historical repair information;
Definition 2: A safe code pattern is a resource-sensitive code pattern that is similar to a defect code pattern but in which no defect has been found;
Definition 3: The feature similarity of the API uses the rVSM algorithm, in which the weight of each parameter type is calculated with TF-IDF, as follows:
$$\mathit{TF\text{-}IDF} = TF \times \log\left(\frac{Total_{api}}{Contain_{type} + 1}\right)$$
where $TF$ is the frequency with which the type appears in the API, $Total_{api}$ is the total number of APIs, and $Contain_{type}$ is the number of APIs containing the type; the present invention uses these values as the weights of the feature vector formed for the API; the type sequence is measured with 2-grams, which are robust to changes in the type sequence; the type-sequence measurements and the parameter-type measurements together form one feature vector; for the feature vectors generated for the two versions, the similarity is calculated with the rVSM algorithm; in this method, the similarity is represented by the cosine similarity between the historical-version feature vector a and the feature vector b of the version under test, as follows:
$$Sim(a, b) = Cos(a, b) = \frac{\vec{V_a} \cdot \vec{V_b}}{|\vec{V_a}|\,|\vec{V_b}|}$$
where $\vec{V_a}$ and $\vec{V_b}$ denote the historical-version feature vector a and the feature vector b of the version under test, respectively, and $\vec{V_a} \cdot \vec{V_b}$ denotes the inner product of the two feature vectors;
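The two formulas of Definition 3 can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the patented implementation: TF is taken to be the share of an API's parameters that have the given type, each 2-gram is weighted by its raw count, and the similarity is a plain cosine rather than a full rVSM. The names tf_idf, bigrams, api_vector, cosine, and the sample data ALL_APIS are all hypothetical.

```python
import math
from collections import Counter


def tf_idf(type_name, api_types, all_apis):
    """TF * log(Total_api / (Contain_type + 1)) for one parameter type of one API."""
    tf = api_types.count(type_name) / len(api_types)  # assumed definition of TF
    containing = sum(1 for types in all_apis if type_name in types)
    return tf * math.log(len(all_apis) / (containing + 1))


def bigrams(sequence):
    """Consecutive pairs (2-grams) of the parameter type sequence."""
    return list(zip(sequence, sequence[1:]))


def api_vector(api_types, all_apis):
    """Sparse feature vector: TF-IDF weight per type plus a count per 2-gram."""
    vector = Counter()
    for type_name in set(api_types):
        vector[("type", type_name)] = tf_idf(type_name, api_types, all_apis)
    for gram in bigrams(api_types):
        vector[("2gram", gram)] += 1.0
    return vector


def cosine(a, b):
    """Inner product of a and b divided by the product of their norms."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical parameter type lists of four APIs.
ALL_APIS = [["str", "int"], ["str"], ["file", "str"], ["file", "str", "int"]]
historical = api_vector(["file", "str"], ALL_APIS)
under_test = api_vector(["file", "str", "int"], ALL_APIS)
similarity = cosine(historical, under_test)
```

Because of the +1 in the denominator, a type that occurs in every API receives a negative weight; the sketch keeps the formula as written rather than clamping it.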
Definition 4: The feature similarity of resource names uses a text similarity algorithm; first, each resource name is parsed into a sequence of words; then, for a resource name $R_1$ in the historical version and a resource name $R_2$ in the version under test, the similarity is calculated as follows:
$$Sim(R_1, R_2) = \frac{|lcs(R_1, R_2)| + |lcs(R_2, R_1)|}{|R_1| + |R_2|}$$
where $lcs(R_1, R_2)$ denotes the number of sub-words of $R_1$ that occur in $R_2$; this yields a quantified value for the resource name, from which the corresponding vector is generated;
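The resource name similarity of Definition 4 can be sketched as below. The sketch follows the textual reading of lcs as a count of shared words rather than a longest common subsequence. How a name is split into words is not specified in the text, so the snake_case and camelCase tokenizer here is an assumption, and split_words, occurrences, and name_similarity are hypothetical helper names.

```python
import re


def split_words(name):
    """Split a snake_case or camelCase name into lowercase words (assumed tokenization)."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [part.lower() for part in parts]


def occurrences(words_a, words_b):
    """Number of words in words_a that also occur in words_b."""
    present = set(words_b)
    return sum(1 for word in words_a if word in present)


def name_similarity(r1, r2):
    """(|lcs(R1, R2)| + |lcs(R2, R1)|) / (|R1| + |R2|), with lcs read as a shared-word count."""
    w1, w2 = split_words(r1), split_words(r2)
    if not w1 and not w2:
        return 0.0
    return (occurrences(w1, w2) + occurrences(w2, w1)) / (len(w1) + len(w2))


same = name_similarity("config_file", "configFile")
partial = name_similarity("log_file", "configFile")
unrelated = name_similarity("socket", "configFile")
```

With this reading the value lies between 0 (no shared words) and 1 (every word of each name occurs in the other).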
Definition 5: The rVSM algorithm is a vector space model, an algorithm for calculating similarity;
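The pairing described in step 4) can be sketched as follows: similar defect patterns are paired as related (label 1), each defect pattern is paired with its similar safe patterns as unrelated (label 0), and every defect pattern is paired with every pattern under test to form the test set. The dictionary layout of a pattern, the two toy similarity functions, and all function names are assumptions made for this sketch; the real method would use the four feature similarities of step 3).

```python
from itertools import combinations


def jaccard_calls(a, b):
    """Toy call-structure similarity: overlap of the called function names."""
    sa, sb = set(a["calls"]), set(b["calls"])
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def same_resource(a, b):
    """Toy resource similarity: 1.0 when both patterns use the same resource kind."""
    return 1.0 if a["resource"] == b["resource"] else 0.0


def feature_vector(a, b, similarity_fns):
    """One similarity value per feature for the pattern pair (a, b)."""
    return [fn(a, b) for fn in similarity_fns]


def build_training_set(defect_groups, safe_for, similarity_fns):
    """Related pairs (label 1) from similar defects, unrelated pairs (label 0) from safe neighbours."""
    rows = []
    for group in defect_groups:
        for a, b in combinations(group, 2):
            rows.append((feature_vector(a, b, similarity_fns), 1))
        for defect in group:
            for safe in safe_for.get(defect["id"], []):
                rows.append((feature_vector(defect, safe, similarity_fns), 0))
    return rows


def build_test_set(defects, candidates, similarity_fns):
    """Pair every historical defect pattern with every pattern of the version under test."""
    return [((d["id"], c["id"]), feature_vector(d, c, similarity_fns))
            for d in defects for c in candidates]


FNS = [jaccard_calls, same_resource]
d1 = {"id": "d1", "resource": "file", "calls": ["open", "read"]}
d2 = {"id": "d2", "resource": "file", "calls": ["open", "read", "seek"]}
s1 = {"id": "s1", "resource": "file", "calls": ["open", "read", "close"]}
t1 = {"id": "t1", "resource": "socket", "calls": ["connect", "send"]}

training = build_training_set([[d1, d2]], {"d1": [s1]}, FNS)
test = build_test_set([d1, d2], [t1], FNS)
```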
5) Train the deep neural network model on the training set, then use the model to perform feature merging on the pattern pairs in the test set, calculate their relevance, and sort them; the deep neural network model is trained with the training set generated in step 4), and the trained model then performs feature merging on the test set generated in step 4) and calculates the relevance; finally, the relevance values between defect code patterns and code patterns under test are sorted in descending order, and the top k code pairs are selected as the output result;
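The final ranking of step 5) can be sketched as follows. The network itself is not reproduced here: stand_in_score is a placeholder that averages the feature similarities and merely stands in for the relevance that the trained deep neural network would output. The function names and the sample test set are hypothetical.

```python
import heapq


def stand_in_score(features):
    """Placeholder for the trained network's relevance output: the mean feature similarity."""
    return sum(features) / len(features)


def top_k_pairs(scored_pairs, k):
    """Return the k pairs with the highest relevance, in descending order."""
    return heapq.nlargest(k, scored_pairs, key=lambda item: item[1])


# Hypothetical test set: (defect id, candidate id) with one similarity per feature.
test_set = [
    (("d1", "t1"), [0.9, 0.8, 0.7]),
    (("d1", "t2"), [0.1, 0.2, 0.0]),
    (("d2", "t1"), [0.5, 0.5, 0.5]),
]
scored = [(pair, stand_in_score(features)) for pair, features in test_set]
flagged = top_k_pairs(scored, 2)
flagged_ids = [pair for pair, _ in flagged]
```

The candidates in flagged_ids are the ones that would be marked as potentially defective resource-sensitive code.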
6) During program development and maintenance, issue reminders about resource object operations that may go wrong according to the relevance ranking, to assist development and maintenance; for code under test with a very high relevance, remind developers and maintainers of the location where it occurs, provide the related historical resource operations and the earlier exception-handling scheme for the resource, and issue a warning; Python source code that has already been checked is used as historical version data for the next detection, thereby improving detection accuracy; newly committed Python source code is checked automatically, and alerts are sent to developers and maintainers according to the result.
2. The Python resource-sensitive defect code detection method based on a deep neural network according to claim 1, characterized in that, in step 1), the source code of the historical versions and of the version under test of the same Python software is obtained according to the specified version numbers.
3. The Python resource-sensitive defect code detection method based on a deep neural network according to claim 1, characterized in that, in step 2), lexical and syntactic analysis is performed on the collected source code of the historical versions and of the version under test, the corresponding abstract syntax trees are generated using the ast module in the Python standard library, the Python types are abstracted, and global type inference is then applied to extract the resource-sensitive code patterns.
4. The Python resource-sensitive defect code detection method based on a deep neural network according to claim 1, characterized in that, in step 3), the following features are extracted from the collected resource-sensitive code pattern information: API (parameter types, reference order), resource name, call structure similarity, and function structure similarity; finally, the names of the extracted features are normalized.
5. The Python resource-sensitive defect code detection method based on a deep neural network according to claim 1, characterized in that, in step 4), the similarity of each feature between defect code patterns and safe code patterns, and between defect code patterns and code patterns under test, is calculated, feature vectors are generated, and the training set and the test set are obtained; for the historical versions, according to the historical repair information, similar defect code patterns are paired two by two to form related pattern pairs, and each defect code pattern is paired with the safe code patterns similar to it to form unrelated pattern pairs; for the version under test, defect code patterns and code patterns under test are paired two by two to form test pattern pairs; then, from the feature information extracted in step 3), the similarity of each feature is calculated for every pattern pair and a feature vector is generated; finally, the feature vectors of the code pattern pairs of the historical versions form the training set, and the feature vectors of the code pattern pairs of the version under test form the test set.
6. The Python resource-sensitive defect code detection method based on a deep neural network according to claim 1, characterized in that, in step 5), the deep neural network model is trained and used for feature merging, and the test code pairs are sorted by relevance; the deep neural network model is trained with the training set generated in step 4), the trained model then performs feature merging on the test set generated in step 4), and the relevance between the code under test and the historical defect code is calculated; finally, the obtained relevance values are sorted in descending order, and the top k code pairs are selected as the output result.
7. The Python resource-sensitive defect code detection method based on a deep neural network according to claim 1, characterized in that, in step 6), during program development and maintenance, reminders about resource object operations that may go wrong are issued according to the relevance ranking, to assist development and maintenance.
CN201610915633.4A 2016-10-20 2016-10-20 Python resource sensitive defect code detection method based on deep neural network Active CN107967208B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610915633.4A CN107967208B (en) 2016-10-20 2016-10-20 Python resource sensitive defect code detection method based on deep neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610915633.4A CN107967208B (en) 2016-10-20 2016-10-20 Python resource sensitive defect code detection method based on deep neural network

Publications (2)

Publication Number Publication Date
CN107967208A true CN107967208A (en) 2018-04-27
CN107967208B CN107967208B (en) 2020-01-17

Family

ID=61996517

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610915633.4A Active CN107967208B (en) 2016-10-20 2016-10-20 Python resource sensitive defect code detection method based on deep neural network

Country Status (1)

Country Link
CN (1) CN107967208B (en)

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109241739A (en) * 2018-07-19 2019-01-18 中国科学院信息工程研究所 Android malware detection methods, device and storage medium based on API
CN109446078A (en) * 2018-10-18 2019-03-08 网易(杭州)网络有限公司 Code test method and device, storage medium, electronic equipment
CN109657461A (en) * 2018-11-26 2019-04-19 浙江大学 RTL hardware Trojan horse detection method based on gradient boosting algorithm
CN109726120A (en) * 2018-12-05 2019-05-07 北京计算机技术及应用研究所 A kind of software defect confirmation method based on machine learning
CN110162245A (en) * 2019-04-11 2019-08-23 北京达佳互联信息技术有限公司 Analysis method, device, electronic equipment and the storage medium of graphic operation
CN110175128A (en) * 2019-05-29 2019-08-27 北京百度网讯科技有限公司 A kind of similar codes case acquisition methods, device, equipment and storage medium
CN110349477A (en) * 2019-07-16 2019-10-18 湖南酷得网络科技有限公司 A kind of misprogrammed restorative procedure, system and server based on history learning behavior
CN110597735A (en) * 2019-09-25 2019-12-20 北京航空航天大学 Software defect prediction method for open-source software defect feature deep learning
CN110780878A (en) * 2019-10-25 2020-02-11 湖南大学 Method for carrying out JavaScript type inference based on deep learning
CN110825642A (en) * 2019-11-11 2020-02-21 浙江大学 Software code line-level defect detection method based on deep learning
CN111427775A (en) * 2020-03-12 2020-07-17 扬州大学 Method level defect positioning method based on Bert model
CN111459789A (en) * 2019-08-28 2020-07-28 南京意博软件科技有限公司 Detection method and device for application programming interface
CN111858323A (en) * 2020-07-11 2020-10-30 南京工业大学 Code representation learning-based instant software defect prediction method
CN111913874A (en) * 2020-06-22 2020-11-10 西安交通大学 Software defect tracing method based on syntactic structure change analysis
CN111913718A (en) * 2020-06-22 2020-11-10 西安交通大学 Binary function differential analysis method based on basic block context information
CN112131120A (en) * 2020-09-27 2020-12-25 北京软安科技有限公司 Source code defect detection method and device
CN112328475A (en) * 2020-10-28 2021-02-05 南京航空航天大学 Defect positioning method for multiple suspicious code files
CN113407442A (en) * 2021-05-27 2021-09-17 杭州电子科技大学 Pattern-based Python code memory leak detection method
CN113408597A (en) * 2021-06-10 2021-09-17 北京工业大学 Java method name recommendation method based on two-stage framework
CN113722239A (en) * 2021-11-01 2021-11-30 南昌航空大学 Airborne embedded software quality detection method, device, medium and electronic equipment
CN113836020A (en) * 2021-09-24 2021-12-24 中国电信股份有限公司 Code detection method, device and storage medium
US11449610B2 (en) * 2018-03-20 2022-09-20 WithSecure Corporation Threat detection system
CN115454855A (en) * 2022-09-16 2022-12-09 中国电信股份有限公司 Code defect report auditing method and device, electronic equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1609855A (en) * 2003-06-23 2005-04-27 微软公司 Query optimizer system and method
CN101441571A (en) * 2008-12-02 2009-05-27 南京大学 Gridding system implementing method based on Python language
CN105159715A (en) * 2015-09-01 2015-12-16 南京大学 Python code change reminding method on basis of abstract syntax tree node change extraction

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1609855A (en) * 2003-06-23 2005-04-27 微软公司 Query optimizer system and method
CN100517307C (en) * 2003-06-23 2009-07-22 微软公司 Query optimizer system and method
CN101441571A (en) * 2008-12-02 2009-05-27 南京大学 Gridding system implementing method based on Python language
CN105159715A (en) * 2015-09-01 2015-12-16 南京大学 Python code change reminding method on basis of abstract syntax tree node change extraction

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHIFEI CHEN 等: "Tracking Down Dynamic Feature Code Changes Against Python Software Evolution", 《2016 THIRD INTERNATIONAL CONFERENCE ON TRUSTWORTHY SYSTEMS AND THEIR APPLICATIONS》 *
李清言: "Pyreview:一个基于抽象语法树差异提取的Python源代码分析工具", 《中国优秀硕士学位论文全文数据库信息科技辑》 *

Cited By (36)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11449610B2 (en) * 2018-03-20 2022-09-20 WithSecure Corporation Threat detection system
CN109241739A (en) * 2018-07-19 2019-01-18 中国科学院信息工程研究所 Android malware detection methods, device and storage medium based on API
CN109241739B (en) * 2018-07-19 2021-01-05 中国科学院信息工程研究所 API-based android malicious program detection method and device and storage medium
CN109446078A (en) * 2018-10-18 2019-03-08 网易(杭州)网络有限公司 Code test method and device, storage medium, electronic equipment
CN109446078B (en) * 2018-10-18 2022-02-18 网易(杭州)网络有限公司 Code testing method and device, storage medium and electronic equipment
CN109657461A (en) * 2018-11-26 2019-04-19 浙江大学 RTL hardware Trojan horse detection method based on gradient boosting algorithm
CN109657461B (en) * 2018-11-26 2020-12-08 浙江大学 RTL hardware Trojan horse detection method based on gradient lifting algorithm
CN109726120A (en) * 2018-12-05 2019-05-07 北京计算机技术及应用研究所 A kind of software defect confirmation method based on machine learning
CN109726120B (en) * 2018-12-05 2022-03-08 北京计算机技术及应用研究所 Software defect confirmation method based on machine learning
CN110162245A (en) * 2019-04-11 2019-08-23 北京达佳互联信息技术有限公司 Analysis method, device, electronic equipment and the storage medium of graphic operation
CN110175128A (en) * 2019-05-29 2019-08-27 北京百度网讯科技有限公司 A kind of similar codes case acquisition methods, device, equipment and storage medium
CN110349477A (en) * 2019-07-16 2019-10-18 湖南酷得网络科技有限公司 A kind of misprogrammed restorative procedure, system and server based on history learning behavior
CN110349477B (en) * 2019-07-16 2022-01-07 长沙酷得网络科技有限公司 Programming error repairing method, system and server based on historical learning behaviors
CN111459789B (en) * 2019-08-28 2023-11-03 南京意博软件科技有限公司 Detection method and device for application programming interface
CN111459789A (en) * 2019-08-28 2020-07-28 南京意博软件科技有限公司 Detection method and device for application programming interface
CN110597735A (en) * 2019-09-25 2019-12-20 北京航空航天大学 Software defect prediction method for open-source software defect feature deep learning
CN110780878A (en) * 2019-10-25 2020-02-11 湖南大学 Method for carrying out JavaScript type inference based on deep learning
CN110825642A (en) * 2019-11-11 2020-02-21 浙江大学 Software code line-level defect detection method based on deep learning
CN111427775A (en) * 2020-03-12 2020-07-17 扬州大学 Method level defect positioning method based on Bert model
CN111427775B (en) * 2020-03-12 2023-05-02 扬州大学 Method level defect positioning method based on Bert model
CN111913874B (en) * 2020-06-22 2021-12-28 西安交通大学 Software defect tracing method based on syntactic structure change analysis
CN111913874A (en) * 2020-06-22 2020-11-10 西安交通大学 Software defect tracing method based on syntactic structure change analysis
CN111913718A (en) * 2020-06-22 2020-11-10 西安交通大学 Binary function differential analysis method based on basic block context information
CN111858323B (en) * 2020-07-11 2021-06-01 南京工业大学 Code representation learning-based instant software defect prediction method
CN111858323A (en) * 2020-07-11 2020-10-30 南京工业大学 Code representation learning-based instant software defect prediction method
CN112131120B (en) * 2020-09-27 2022-09-30 北京智联安行科技有限公司 Source code defect detection method and device
CN112131120A (en) * 2020-09-27 2020-12-25 北京软安科技有限公司 Source code defect detection method and device
CN112328475B (en) * 2020-10-28 2021-11-30 南京航空航天大学 Defect positioning method for multiple suspicious code files
CN112328475A (en) * 2020-10-28 2021-02-05 南京航空航天大学 Defect positioning method for multiple suspicious code files
CN113407442A (en) * 2021-05-27 2021-09-17 杭州电子科技大学 Pattern-based Python code memory leak detection method
CN113408597A (en) * 2021-06-10 2021-09-17 北京工业大学 Java method name recommendation method based on two-stage framework
CN113408597B (en) * 2021-06-10 2024-06-04 北京工业大学 Java method name recommendation method based on two-stage framework
CN113836020A (en) * 2021-09-24 2021-12-24 中国电信股份有限公司 Code detection method, device and storage medium
CN113722239A (en) * 2021-11-01 2021-11-30 南昌航空大学 Airborne embedded software quality detection method, device, medium and electronic equipment
CN115454855A (en) * 2022-09-16 2022-12-09 中国电信股份有限公司 Code defect report auditing method and device, electronic equipment and storage medium
CN115454855B (en) * 2022-09-16 2024-02-09 中国电信股份有限公司 Code defect report auditing method, device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN107967208B (en) 2020-01-17

Similar Documents

Publication Publication Date Title
CN107967208A (en) A kind of Python resource sensitive defect code detection methods based on deep neural network
Li et al. Dear: A novel deep learning-based approach for automated program repair
Kovbasistyi et al. Method for detection of non-relevant and wrong information based on content analysis of web resources
Di Lucca et al. An approach to identify duplicated web pages
CN109214191A (en) A method of utilizing deep learning forecasting software security breaches
Shen et al. A survey of automatic software vulnerability detection, program repair, and defect prediction techniques
CN117951701A (en) Method for determining flaws and vulnerabilities in software code
CN109657473A (en) A kind of fine granularity leak detection method based on depth characteristic
CN114297654A (en) Intelligent contract vulnerability detection method and system for source code hierarchy
CN111459799A (en) Software defect detection model establishing and detecting method and system based on Github
CN107066262A (en) Source code file clone's adjacency list merges detection method
CN115495755B (en) Codebert and R-GCN-based source code vulnerability multi-classification detection method
CN103914374B (en) The aacode defect detection method and device extracted based on program slice and frequent mode
CN109067800A (en) A kind of cross-platform association detection method of firmware loophole
CN115033895B (en) Binary program supply chain safety detection method and device
CN105279086A (en) Flow chart-based method for automatically detecting logic loopholes of electronic commerce websites
CN106682507A (en) Virus library acquiring method and device, equipment, server and system
CN106330861A (en) Website detection method and apparatus
CN114900346A (en) Network security testing method and system based on knowledge graph
Yang et al. Smart contract vulnerability detection based on abstract syntax tree
CN114398069A (en) Method and system for identifying accurate version of public component library based on cross fingerprint analysis
CN110049052A (en) The malice domain name detection method of label and attribute similarity based on dom tree
Kim Enhancing code clone detection using control flow graphs.
CN109670311A (en) Malicious code analysis and detection method based on high-level semantics
Zhao et al. Suzzer: A vulnerability-guided fuzzer based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant