CN107967208B - Python resource sensitive defect code detection method based on deep neural network - Google Patents

Info

Publication number
CN107967208B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610915633.4A
Other languages
Chinese (zh)
Other versions
CN107967208A (en)
Inventor
陈林
潘陶
陈芝菲
李言辉
徐宝文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University
Priority to CN201610915633.4A
Publication of CN107967208A
Application granted
Publication of CN107967208B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/36: Preventing errors by testing or debugging software
    • G06F11/3668: Software testing

Abstract

The invention relates to a Python resource-sensitive defect code detection method based on a deep neural network, comprising the following steps: 1) acquiring the source code of the historical versions and of the version under test of the same software; 2) extracting the resource-sensitive code patterns of every version by type inference; 3) extracting the relevant features of the resource-sensitive code patterns; 4) calculating the similarity of each feature between defect code patterns and safe code patterns, and between defect code patterns and the code patterns under test, generating feature vectors, and obtaining a training set and a test set; 5) training the deep neural network model on the training set to perform feature merging, then using the model to calculate and rank the correlation of the patterns in the test set; 6) during program development and maintenance, flagging resource object operations that may be erroneous according to the correlation ranking, to assist development and maintenance. The invention addresses the current lack of automated methods for identifying resource-sensitive code and detecting defect code in Python, thereby reducing software risk, improving software quality, and improving the efficiency of developers and maintainers.

Description

Python resource sensitive defect code detection method based on deep neural network
Technical Field
The invention belongs to the field of computer technology, in particular software technology, and specifically relates to a Python resource-sensitive defect code detection method based on a deep neural network.
Background
With the continuous development of software application technology, users demand ever higher software quality, and software developers use a variety of techniques to meet those demands. Resource-sensitive code is a code block or statement that processes a resource object. During software development and maintenance, much resource-sensitive code harbors latent exceptions that are often discovered only during maintenance. As agile development becomes more popular, versions change frequently, which often causes resource-sensitive code to raise exceptions unexpectedly. The most traditional way to handle exceptions in resource-sensitive code is to catch and handle them with the try-except keywords. However, developers often neglect exception handling during development, so the program raises an unexpected exception and the application crashes. Identifying and detecting dangerous operations on resource objects is therefore an indispensable step in program development and maintenance: it effectively improves program quality and helps developers and maintainers find problems in time and devise more effective solutions.
Python has become a very popular programming language among developers. Python applications continue to emerge in the major open-source communities, forming a huge ecosystem. Python is an object-oriented, interpreted programming language that is concise, elegant, and practical. As a dynamic language, Python is widely used for internet applications, graphical user interfaces, and scripting, all of which involve various types of resources. Because Python is a dynamic language, developers tend to change variable types dynamically, which leads to many unsafe operations. In addition, when Python operates on resource objects, various exceptions often occur because of resource configuration and similar causes, and problems caused by resource-sensitive operations are not easy to discover. At present, developers control such code defects through condition checks, exception handling, and similar means.
At this stage, methods for identifying and detecting resource objects fall roughly into two categories. The first is program-based data analysis, which locates dangerous resource object operations through logical and semantic analysis. The second uses information retrieval to identify resource objects and detects defect code by machine learning. The first kind of method is based on semantic analysis and produces results quickly, but suffers from low accuracy and from the difficulty of defining semantic rules. The second kind extracts features from context and similar sources and then learns and predicts by machine learning; although it produces results more slowly, it offers high accuracy and strong practicability. The invention adopts the machine learning approach.
In the maintenance phase, a single commit by a developer may repair several instances of the same defect at once, so the defect code within one version is strongly correlated. The invention distinguishes defect code from safe code according to historical repair information and, exploiting the correlation between defect codes, presumes that code similar to historical defect code may itself be defective. On this basis it provides a Python resource-sensitive defect code detection method based on a deep neural network.
Disclosure of Invention
The invention provides a Python resource-sensitive defect code detection method based on a deep neural network. By mining and comparing the defect code repaired in historical versions, the method finds code in the version under test that is similar to repaired defect code and reminds developers and maintainers that the same problem may exist, so that it can be repaired as early as possible. The method collects the historical versions and the version under test of the same Python software from a software version control system. For the historical versions, it identifies resource-sensitive code patterns through type inference, extracts the corresponding pattern features, forms related pattern pairs and unrelated pattern pairs from the defect code patterns and safe code patterns according to historical repair information, and calculates feature similarities to generate feature vectors, yielding a training set. For the version under test, it extracts patterns and their features by the same method, forms pattern pairs from the historical-version defect code patterns and the patterns of the version under test, and calculates feature similarities to generate feature vectors, yielding a test set. The deep neural network model is then trained on the training set, and the trained model performs feature merging on the test set to obtain the correlation between the code under test and the defect code. Finally, the results are ranked by correlation to identify potentially dangerous code in the version under test that closely resembles resource-sensitive code repaired in historical versions, so that suggestions can be given to developers and maintainers and exceptions can be prevented.
The invention aims to address the current lack of automated methods for identifying resource-sensitive code and detecting defect code in Python, so as to reduce software risk, improve software quality, and improve developers' efficiency.
In order to achieve the above object, the present invention provides a Python resource sensitive defect code detection method based on a deep neural network, which comprises the following steps:
1) acquiring the source code of the historical versions and of the version under test of the same software;
2) extracting the resource-sensitive code patterns of every version by type inference;
3) extracting the relevant features of the resource-sensitive code patterns;
4) calculating the similarity of each feature between defect code patterns and safe code patterns, and between defect code patterns and the code patterns under test, generating feature vectors, and obtaining a training set and a test set;
5) training the deep neural network model on the training set to perform feature merging, then using the model to calculate and rank the correlation of the patterns in the test set;
6) during program development and maintenance, flagging resource object operations that may be erroneous according to the correlation ranking, to assist development and maintenance.
Further, the specific steps of step 1) are as follows:
Step 1)-1: initial state;
Step 1)-2: obtain, from an open-source version control system, the repaired source programs of the historical versions and the source program of the version under test of the same software, according to file name and version information;
Step 1)-3: the acquisition of the source programs of the different software versions is complete.
Further, the specific steps of step 2) are as follows:
Step 2)-1: initial state;
Step 2)-2: perform lexical and syntactic analysis on the source program of each version, and generate the abstract syntax tree of each version using the ast module of the Python standard library;
Step 2)-3: encapsulate each Python type according to the abstract syntax defined in the Python standard library; each type has a mapping table containing the internal attribute names or API interface names of that type;
Step 2)-4: traverse the abstract syntax tree and infer the possible types of each variable from the encapsulated types and modules; extract the variables of resource object type;
Step 2)-5: for a variable whose type has not been identified: if the variable is an interface name and one of its parameters has a resource object type, mark the variable as a resource object type; otherwise, if it is a member of another variable and the variable it is called on is a resource object type, also mark it as a resource object type;
Step 2)-6: take the code segments that call variables of resource object type as resource-sensitive code patterns;
Step 2)-7: the collection of resource-sensitive code pattern information is complete.
Further, the specific steps of step 3) are as follows:
Step 3)-1: initial state;
Step 3)-2: using the resource code pattern information, locate the position where the resource object is operated on, and extract the API (parameter type, parameter order), resource name, call structure, and function internal structure as features;
Step 3)-3: name the API (parameter type, parameter order), resource name, call structure, and function structure uniformly;
Step 3)-4: the extraction of resource code pattern feature information is complete.
Further, the specific steps of step 4) are as follows:
Step 4)-1: initial state;
Step 4)-2: divide the identified resource-sensitive code patterns into three classes: defect code patterns, safe code patterns, and code patterns under test;
Step 4)-3: for the historical versions, pair similar defect code patterns with each other according to historical repair information to form related pattern pairs; pair each defect code pattern with the safe code patterns similar to it to form unrelated pattern pairs;
Step 4)-4: for the version under test, pair each defect code pattern with each code pattern under test to form pattern pairs under test;
Step 4)-5: calculate the similarity of each feature for every pattern pair and generate a feature vector;
Step 4)-6: the set of feature vectors built from the historical-version pattern pairs forms the training set, and the set of feature vectors built from the pattern pairs of the version under test forms the test set;
Step 4)-7: the collection of training set and test set information is complete.
Further, the specific steps of step 5) are as follows:
Step 5)-1: initial state;
Step 5)-2: train the deep neural network on the training-set similarity data generated in step 4) to obtain the parameter values of the model;
Step 5)-3: feed the test set generated in step 4) into the trained deep neural network model to obtain the correlation values;
Step 5)-4: sort all code pairs by the calculated correlation value in descending order, take the first k test pattern pairs as the resource-sensitive code detection result, and mark the code under test in those pairs as possible resource-sensitive defect code;
Step 5)-5: the marking of possible resource-sensitive defect code is complete.
Further, the specific steps of step 6) are as follows:
Step 6)-1: initial state;
Step 6)-2: for code marked as resource-sensitive defect code, show developers and maintainers the location of the related historical-version code, recommend modification, and present a repair solution;
Step 6)-3: during program development and maintenance, the system automatically checks submitted code and issues a warning for potentially dangerous resource operations;
Step 6)-4: the newly submitted version is used as historical version data for the next comparison, which makes the detection result more accurate;
Step 6)-5: the prompting of resource-sensitive defect code in the code under test is complete.
The invention performs feature merging with a deep neural network and uses a standard metric to measure the level of correlation between the code under test and the defect code in historical versions, so that resource-sensitive defect code blocks can be located down to the level of individual statements. After resource-sensitive code is identified through type inference, repairs are proposed automatically and developers and maintainers are prompted on the basis of the solutions used in similar historical versions. The method identifies resource-sensitive code and its dangerous operations, improves software development efficiency, and supports the development of high-quality software products.
Drawings
Fig. 1 is an overall architecture diagram of a Python resource-sensitive defect code detection method based on a deep neural network according to an embodiment of the present invention.
Fig. 2 is a flowchart of a Python resource sensitive defect code detection method based on a deep neural network according to an embodiment of the present invention.
Fig. 3 is a diagram of a possible abstract syntax tree for a loop control structure.
Detailed Description
The method first collects the source code of all repaired historical versions of the same Python software through a software version control system such as CVS. It then performs lexical and syntactic analysis on the source code of the historical versions and of the version under test, performs type inference on the generated abstract syntax trees, labels the variables involved in resource object operations, and identifies resource code patterns. According to historical repair information, it selects defect code patterns and safe code patterns from the resource-sensitive code patterns of the historical versions to form related pattern pairs and unrelated pattern pairs. Next, it forms test pattern pairs from the resource-sensitive code patterns of the version under test and the historical defect code patterns. From the extracted pattern features, it calculates the similarity of each feature for every pattern pair and generates feature vectors, yielding the training set and the test set. The deep neural network model is then trained on the training set, and the trained model performs feature merging on the test set to obtain the correlation between each code pattern under test and the historical defect code patterns. Finally, the pairs are ranked by correlation, the first k related pattern pairs are selected as the result, and the code under test in those pairs is marked as resource-sensitive code with potential defects, which assists developers and maintainers and helps them avoid exceptions.
The technical content of the invention is explained below with reference to the accompanying drawings.
The general architecture of the present invention is shown in fig. 1, and the flow chart is shown in fig. 2. The invention provides a Python resource sensitive defect code detection method based on a deep neural network, which comprises the following 6 steps:
Step 1: Acquire the source code of the repaired historical versions of the same software and the source code of the version under test. All versions of the program are stored in a software version control system such as CVS and are labeled with version numbers. The source code of the historical versions and of the version under test of the same Python software is obtained according to the established version numbers.
Step 2: Extract the resource code patterns from the source code of each version by type inference. First, lexical and syntactic analysis is performed on the source code of each version acquired in step 1, and an abstract syntax tree is generated using the corresponding functions of the ast module in the Python standard library. In the abstract syntax tree, each node and each subtree corresponds to a source code entity. To support type inference, several abstract types (Types) are encapsulated according to the types defined by Python. Each type has a table attribute holding the names in the abstract syntax tree that relate to the attributes or calls of that type, such as an apn. For each node in the abstract syntax tree, a type and a value are set, together with a unique node identifier id. For a node x, T(x) denotes the type of the node, such as an assignment statement; V(x) denotes the value of the node, which is a textual representation of the node, such as the specific content of the assignment statement; Id(x) denotes the unique identifier used to distinguish nodes.
For example, an assignment statement is a simple statement and corresponds to a leaf node in the abstract syntax tree, whose type is "assignment statement" and whose value is the content of the assignment statement. A while loop statement corresponds to a subtree in the abstract syntax tree: the type of the subtree's root node is "while statement", its value is the loop condition, and its child nodes are the statements inside the loop and the statements that break out of the loop. Fig. 3 shows a possible abstract syntax tree for a loop statement structure.
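The node annotation described above can be sketched with the standard ast module. This is a minimal illustration, not the patented implementation: the function name `describe_nodes` is hypothetical, the sequence index produced by `ast.walk` stands in for Id(x), the AST class name stands in for T(x), and the full source text of the statement stands in for V(x) (the patent uses the loop condition as the value of a while node).

```python
import ast

SOURCE = """\
count = 0
while count < 3:
    count += 1
"""

def describe_nodes(source):
    """Return (id, type, value) triples for every statement node.

    Hypothetical helper: id is the breadth-first index from ast.walk,
    type is the AST class name, value is the statement's source text.
    """
    tree = ast.parse(source)
    rows = []
    for node_id, node in enumerate(ast.walk(tree)):
        if isinstance(node, ast.stmt):
            node_type = type(node).__name__                # stands in for T(x)
            value = ast.get_source_segment(source, node)   # stands in for V(x)
            rows.append((node_id, node_type, value))       # node_id stands in for Id(x)
    return rows

rows = describe_nodes(SOURCE)
for row in rows:
    print(row)
```

`ast.get_source_segment` requires Python 3.8 or later.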
Finally, the entire abstract syntax tree is traversed in post-order. The type of each variable is inferred from the type information in the abstract syntax tree and from the attributes and other information associated with each type through the table mapping, and the code segments found to call resource object variables are marked as resource-sensitive code patterns. A resource-sensitive code pattern is a code fragment that operates on a resource object (a file object, a graphical user interface object, and so on).
For example:
[Code listing not reproduced in the source text]
In this code segment, self is a resource object, and the switch_backings function is called to operate on the resource object. The segment is therefore a resource-sensitive code pattern.
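A much reduced sketch of this kind of inference is shown below. It is an assumption-laden illustration rather than the global type inference of the invention: the set `RESOURCE_CONSTRUCTORS` is a hypothetical stand-in for the per-type mapping table, only top-level assignments are tracked, and a method call on a variable inferred to hold a resource object is reported as a resource-sensitive pattern.

```python
import ast

# Hypothetical mapping table: call names assumed to create resource objects.
RESOURCE_CONSTRUCTORS = {"open", "socket"}

SOURCE = """\
f = open("data.txt")
n = 10
g = f
data = g.read()
print(n)
"""

def find_resource_patterns(source):
    """Return (resource variable names, [(line, variable, method), ...])."""
    tree = ast.parse(source)
    resource_vars = set()
    patterns = []
    for stmt in tree.body:  # top-level statements, in program order
        # Propagate the resource type through simple `name = ...` assignments.
        if (isinstance(stmt, ast.Assign) and len(stmt.targets) == 1
                and isinstance(stmt.targets[0], ast.Name)):
            target, value = stmt.targets[0].id, stmt.value
            if (isinstance(value, ast.Call) and isinstance(value.func, ast.Name)
                    and value.func.id in RESOURCE_CONSTRUCTORS):
                resource_vars.add(target)
            elif isinstance(value, ast.Name) and value.id in resource_vars:
                resource_vars.add(target)
        # A method call on a resource variable is a resource-sensitive pattern.
        for node in ast.walk(stmt):
            if (isinstance(node, ast.Call)
                    and isinstance(node.func, ast.Attribute)
                    and isinstance(node.func.value, ast.Name)
                    and node.func.value.id in resource_vars):
                patterns.append((node.lineno, node.func.value.id, node.func.attr))
    return resource_vars, patterns

resource_vars, patterns = find_resource_patterns(SOURCE)
print(resource_vars, patterns)
```

A fuller version would also handle attributes such as `self.stream`, function parameters, and imported modules, which this sketch deliberately ignores.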
Step 3: Step 2 has extracted the resource code patterns from the source code. The relevant features of a resource-sensitive code pattern extracted by the invention are: API (parameter type, parameter order), resource name, call structure, and function structure.
The extracted feature names are then normalized. For the API feature, the feature similarity is calculated from the parameter types and the parameter order; for the resource name feature, the feature similarity is calculated from the word sequences in the resource names; for the call structure feature, the call structure similarity is used as the feature similarity; for the function structure feature, the function structure similarity is used as the feature similarity.
Step 4: For the historical versions, similar defect code patterns are first paired with each other according to historical repair information to form related pattern pairs, and each defect code pattern is paired with the safe code patterns similar to it to form unrelated pattern pairs. For the version under test, each defect code pattern is paired with each code pattern under test to form pattern pairs under test. Using the feature information extracted in step 3, the similarity of each feature is calculated for every pattern pair.
The feature similarity of the API is calculated with the rVSM algorithm. For the parameter types, the weight is calculated with the TF-IDF algorithm, as follows:
[Formula image not reproduced in the source text]
where TF is the frequency of occurrence of the type in the API, Total_api is the total number of APIs, and Contain_type is the number of APIs containing that type. This value serves as the weight in the feature vector formed from the API. The type order is measured with 2-grams, which is robust to changes in the type order. The measurements of the type order and of the parameter types together form a feature vector. The similarity of the feature vectors generated by the two versions is calculated with the rVSM algorithm. In this method, the cosine distance between the historical-version feature vector a and the feature vector b of the version under test represents the similarity:
$$\cos(\vec{a}, \vec{b}) = \frac{\vec{a} \cdot \vec{b}}{\lVert \vec{a} \rVert \, \lVert \vec{b} \rVert}$$
where $\vec{a}$ and $\vec{b}$ denote the historical-version feature vector a and the feature vector b of the version under test, respectively, and $\vec{a} \cdot \vec{b}$ denotes the inner product of the two feature vectors.
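The API feature similarity can be sketched as follows. Since the weight formula survives only as a lost image, the sketch assumes the common form TF × log(Total_api / Contain_type); this form, the helper names `api_vector` and `cosine`, and the toy corpus of parameter-type lists are all assumptions. The similarity is a plain cosine over sparse vectors, not a full rVSM implementation.

```python
import math
from collections import Counter

def api_vector(param_types, corpus):
    """Build a sparse feature vector for one API's parameter-type list.

    Assumes `param_types` itself appears in `corpus`, so every type is
    contained in at least one API.
    """
    total = len(corpus)
    vector = {}
    for ptype, tf in Counter(param_types).items():
        containing = sum(1 for api in corpus if ptype in api)
        # Assumed TF-IDF form: TF * log(Total_api / Contain_type).
        vector[("type", ptype)] = tf * math.log(total / containing)
    # 2-grams over adjacent parameter types capture the parameter order.
    for gram, count in Counter(zip(param_types, param_types[1:])).items():
        vector[("2gram", gram)] = float(count)
    return vector

def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(key, 0.0) for key, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical corpus: one parameter-type list per API.
corpus = [["str", "int"], ["str", "int"], ["bytes"], ["str", "bool", "int"]]
vectors = [api_vector(api, corpus) for api in corpus]
same = cosine(vectors[0], vectors[1])      # identical signatures
disjoint = cosine(vectors[0], vectors[2])  # no shared types
partial = cosine(vectors[0], vectors[3])   # shared types, different order
print(same, disjoint, partial)
```

Under the assumed weight, a type that occurs in every API gets weight zero, so only the 2-gram components distinguish such signatures.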
The resource name feature similarity uses a text similarity algorithm. First, each resource name is parsed into a sequence of words. Then, for a resource name R1 in the historical version and a resource name R2 in the version under test, the similarity is calculated as follows:
[Formula image not reproduced in the source text]
where lcs(R1, R2) represents the sub-words of R1 that also appear in R2. This yields a quantized value for the resource name, from which the related vector is generated. The patent gives two worked examples, "length" with "getLength" and "getLength" with "getLength", whose resulting values were formula images that are not reproduced in the source text.
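Because the formula itself is missing, the following is only one plausible reading: resource names are split into lower-cased words, lcs is taken as the length of the longest common subsequence of the two word sequences, and the result is normalized by the longer sequence. The normalization, the splitting regular expression, and the function names are assumptions, so the values printed here need not match the ones in the lost images.

```python
import re

def split_words(name):
    """Split camelCase or snake_case names into lower-cased words."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [part.lower() for part in parts]

def lcs_length(a, b):
    """Length of the longest common subsequence of two word lists."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, word_a in enumerate(a, 1):
        for j, word_b in enumerate(b, 1):
            if word_a == word_b:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]

def name_similarity(r1, r2):
    """Assumed normalization: LCS length divided by the longer word count."""
    words1, words2 = split_words(r1), split_words(r2)
    if not words1 or not words2:
        return 0.0
    return lcs_length(words1, words2) / max(len(words1), len(words2))

print(name_similarity("length", "getLength"))
print(name_similarity("getLength", "getLength"))
```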
For the function structure feature similarity and the call structure feature similarity, the tree structure is traversed using the abstract syntax tree obtained in step 2, and the corresponding similarity is obtained from the number of identical tree nodes and the calculated probability. Finally, the set of feature vectors built from the historical-version pattern pairs forms the training set, and the set of feature vectors built from the pattern pairs of the version under test forms the test set.
Step 5: Step 4 yields a training set and a test set composed of feature vectors. Because the individual feature similarities cannot indicate, taken as a whole, whether a pattern is related to a dangerous resource object operation, a deep neural network is used here to merge the features and calculate the correlation.
First, the deep neural network is trained using the generated training set. The neural network designed by the invention consists of an input layer, hidden layer-1, hidden layer-2, and an output layer. Hidden layer-1 has twice as many nodes as the input layer, and hidden layer-2 has half as many nodes as the input layer. Each node H1_i of hidden layer-1 is calculated as follows:
[Formula image not reproduced in the source text]
where w1_i and b are the parameters to be trained and Input_i is the input node value. Hidden layer-2 is derived from hidden layer-1 by the same formula. To train w and b, the invention uses batch gradient descent, as follows:
1) Initialization: set Δw^(l) = 0 and Δb^(l) = 0, and randomly initialize w and b to small values;
2) Assuming the number of iterations is m, for i from 1 to m, calculate the gradient with the BP (backpropagation) algorithm and accumulate it:
[Formula images not reproduced in the source text]
3) Update the parameters:
[Formula images not reproduced in the source text]
where λ is an optional parameter, set to 2 in the invention. The deep neural network model is trained by this method.
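A pure-Python sketch of such a network is given below. The layer sizes follow the text (hidden layer-1 with twice the input size, hidden layer-2 with half of it, one output node); everything else is an assumption, because the node formula and the update formulas are lost: sigmoid activations, squared-error loss, the standard accumulate-then-update form of batch gradient descent, and the reading of λ as a weight-decay coefficient (disabled by default here, whereas the patent reports λ = 2). The training samples are invented similarity vectors.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def build_network(n_in, seed=0):
    """Layer sizes: n_in -> 2*n_in -> n_in/2 -> 1, with small random weights."""
    rng = random.Random(seed)
    sizes = [n_in, 2 * n_in, max(1, n_in // 2), 1]
    layers = []
    for fan_in, fan_out in zip(sizes, sizes[1:]):
        w = [[rng.uniform(-0.1, 0.1) for _ in range(fan_in)] for _ in range(fan_out)]
        layers.append((w, [0.0] * fan_out))
    return layers

def forward(layers, x):
    """Return the activations of every layer, starting with the input."""
    acts = [list(x)]
    for w, b in layers:
        acts.append([sigmoid(sum(wi * ai for wi, ai in zip(row, acts[-1])) + bj)
                     for row, bj in zip(w, b)])
    return acts

def mean_loss(layers, samples):
    return sum(0.5 * (forward(layers, x)[-1][0] - y) ** 2
               for x, y in samples) / len(samples)

def train(layers, samples, epochs=200, lr=1.0, lam=0.0):
    m = len(samples)
    for _ in range(epochs):
        # 1) Reset the gradient accumulators.
        grad_w = [[[0.0] * len(row) for row in w] for w, _ in layers]
        grad_b = [[0.0] * len(b) for _, b in layers]
        # 2) Accumulate backpropagated gradients over the whole batch.
        for x, y in samples:
            acts = forward(layers, x)
            delta = [(a - y) * a * (1.0 - a) for a in acts[-1]]
            for l in range(len(layers) - 1, -1, -1):
                w, _ = layers[l]
                prev = acts[l]
                for j, dj in enumerate(delta):
                    grad_b[l][j] += dj
                    for i, ai in enumerate(prev):
                        grad_w[l][j][i] += dj * ai
                if l > 0:
                    delta = [sum(w[j][i] * delta[j] for j in range(len(delta)))
                             * prev[i] * (1.0 - prev[i]) for i in range(len(prev))]
        # 3) Update; lam is treated as weight decay (an assumption).
        for l, (w, b) in enumerate(layers):
            for j in range(len(w)):
                b[j] -= lr * grad_b[l][j] / m
                for i in range(len(w[j])):
                    w[j][i] -= lr * (grad_w[l][j][i] / m + lam * w[j][i])
    return layers

# Invented data: four feature similarities per pair, label 1 = related pair.
samples = [([0.9, 0.8, 0.9, 0.7], 1.0), ([0.8, 0.9, 0.7, 0.9], 1.0),
           ([0.1, 0.2, 0.1, 0.3], 0.0), ([0.2, 0.1, 0.3, 0.1], 0.0)]
net = build_network(4)
loss_before = mean_loss(net, samples)
train(net, samples)
loss_after = mean_loss(net, samples)
score = forward(net, samples[0][0])[-1][0]  # correlation value in (0, 1)
```

With so few epochs the toy network mainly demonstrates the mechanics; a practical model would need more data, more iterations, and tuned hyperparameters.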
In the detection stage, the feature vector of each pattern pair in the test set is used as input and is propagated through the node formula above. The final output is a correlation value representing the degree of correlation of the pattern pair. The nonlinear neural network method is more effective than linear information retrieval methods and reflects the correlation level better.
In the deep neural network, the weights of the links between the intermediate layers and the input layer are obtained by training on historical version data. Extensive training also adjusts some of the links and weights between neurons, which optimizes the output.
The resulting correlation values are sorted in descending order, and the first k pattern pairs are selected as the output.
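The ranking step amounts to a descending sort followed by a cut at k. In the sketch below, the tuple layout and the pattern identifiers are hypothetical, and the scores are made-up values rather than model outputs.

```python
def top_k(scored_pairs, k):
    """Return the k pattern pairs with the highest correlation value."""
    ranked = sorted(scored_pairs, key=lambda pair: pair[2], reverse=True)
    return ranked[:k]

# (pattern under test, historical defect pattern, correlation): invented values.
scored = [
    ("new.py:read_bytes", "old.py:read_until", 0.91),
    ("new.py:write", "old.py:read_until", 0.12),
    ("new.py:connect", "old.py:open_socket", 0.78),
    ("new.py:close", "old.py:open_socket", 0.35),
]

flagged = top_k(scored, 2)
for tested, historical, value in flagged:
    print(f"{tested} resembles {historical} (correlation {value:.2f})")
```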
Step 6: For each piece of code under test found to be highly correlated, developers and maintainers are shown its location and the related historical resource operations, are given the exception handling scheme previously used for that resource, and receive a warning. The Python source code that has been checked is then used as historical version data for the next detection, which improves detection accuracy. Newly submitted Python source code is checked automatically, and developers and maintainers are alerted according to the result.
For example, in the historical version, the operation on the resource object at some location is as follows:
[Code listing not reproduced in the source text]
In the historical version, the self variable is a resource object, and a read operation is performed on it. To prevent exceptions, the developer wrapped the statement in a try-except block.
And the source code of the version to be tested has the following statements:
def read_bytes(self, num_bytes, callback=None, streaming_callback=None,
               partial=False):
    self._try_inline_read()
Here the resource object is read again through the same API, but no exception handling is performed. The two code fragments are combined into a code pair, and the method identifies whether they are related, thereby determining whether the code under test is resource-sensitive defect code. If so, developers and maintainers are reminded to handle it and are given the related historical-version code information.
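The contrast between a protected and an unprotected call can be illustrated with a small ast-based check. This is a simplified stand-in for the learned detection, not a part of it: the `Stream` class, its `read_safe` method, and the helper `unprotected_calls` are hypothetical, and only the call name `_try_inline_read` comes from the example above.

```python
import ast

SOURCE = '''\
class Stream:
    def read_safe(self):
        try:
            self._try_inline_read()
        except Exception:
            self.close()

    def read_bytes(self, num_bytes):
        self._try_inline_read()
'''

def unprotected_calls(source, method_name):
    """Return (function name, line) for calls to method_name outside a try body."""
    tree = ast.parse(source)
    protected = set()
    # Collect every call that sits inside the body of a try statement.
    for node in ast.walk(tree):
        if isinstance(node, ast.Try):
            for stmt in node.body:
                for inner in ast.walk(stmt):
                    if isinstance(inner, ast.Call):
                        protected.add(id(inner))
    hits = []
    for func in ast.walk(tree):
        if isinstance(func, ast.FunctionDef):
            for node in ast.walk(func):
                if (isinstance(node, ast.Call)
                        and isinstance(node.func, ast.Attribute)
                        and node.func.attr == method_name
                        and id(node) not in protected):
                    hits.append((func.name, node.lineno))
    return hits

hits = unprotected_calls(SOURCE, "_try_inline_read")
print(hits)
```

A check of this kind only covers one syntactic symptom; the method described in the text instead ranks pattern pairs by learned correlation.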
In conclusion, the Python resource-sensitive defect code detection method based on a deep neural network addresses the current lack of automated methods for detecting resource-sensitive code and identifying dangerous operations in Python, improves software quality, and helps keep the software evolution process under control.

Claims (1)

1. A method for detecting resource-sensitive defect code in Python based on a deep neural network, characterized in that a historical version and a version under test of the same Python software are collected from a software version control system; for the historical version, resource-sensitive code patterns are identified through type inference and the corresponding pattern features are extracted, defect code patterns and safe code patterns are combined into related pattern pairs and non-related pattern pairs according to historical repair information, and feature similarities are calculated to generate feature vectors, yielding a training set; for the version under test, patterns and their features are extracted by the same method, each historical-version defect code pattern is paired with a code pattern of the version under test, and feature similarities are calculated to generate feature vectors, yielding a test set; next, a deep neural network model is trained on the training set, and the trained model performs feature merging on the test set to obtain the relevance between the code under test and the defect code; finally, the pairs are sorted by relevance, the top k related code pairs are selected as the result, the code under test in those pairs is marked as resource-sensitive code with potential defects, dangerous resource object operations are detected, and auxiliary information is provided; the method comprises the following steps:
1) acquiring the source code of a historical version and of the version under test of the same software; all versions of the software are committed to and stored in the software version control system, and the version numbers are standardized; the source code of the historical version and of the version under test of the same Python software can be obtained by the assigned version number;
2) extracting the resource-sensitive code patterns of all versions by type inference; lexical and syntactic analysis is performed on the source code of the historical version and the version under test collected in step 1), the corresponding abstract syntax tree is generated with the ast module of the Python standard library, Python types are abstracted, a type and a value are set for each node, and resource-sensitive code patterns are extracted by a global type inference method;
a resource-sensitive code pattern is a code segment that operates on a resource object;
definition 1: the Python standard library is distributed with the Python language and comprises built-in modules that provide various system-level functions;
definition 2: type inference is a method of inferring variable types in dynamic languages by performing static analysis on source code;
definition 3: the type identifies the kind of a node in the abstract syntax tree; its concrete values come from the abstract grammar defined by Python;
definition 4: the value is a textual representation of the contents of a node in the abstract syntax tree;
3) extracting the relevant features of the resource-sensitive code patterns; step 2) has extracted the resource-sensitive code patterns from the source code; the features the method extracts from each pattern are: API (parameter types, parameter order), resource name, call structure and function structure; finally, the extracted feature names are normalized;
definition 1: for the API feature, the feature similarity is calculated from the parameter types and the parameter order;
definition 2: for the resource name feature, the feature similarity is calculated from the word sequence in the resource name;
definition 3: for the call structure feature, the call structure similarity is used as the feature similarity;
definition 4: for the function structure feature, the function structure similarity is used as the feature similarity;
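Part of the feature extraction in step 3) can be sketched as follows, assuming the simplest case of an assignment whose right-hand side is a direct call. The sketch covers only the API name, the ordered parameter types and the resource name; call structure and function structure are left out. The helper names and the normalization rule (lower-casing and stripping underscores) are assumptions made for illustration.

```python
import ast


def literal_type(node):
    """Best-effort type label for an argument; non-literals are labelled 'unknown'."""
    if isinstance(node, ast.Constant):
        return type(node.value).__name__
    if isinstance(node, (ast.List, ast.Tuple, ast.Dict, ast.Set)):
        return type(node).__name__.lower()
    return "unknown"


def normalize(name):
    """Assumed normalization: strip surrounding underscores and lower-case."""
    return name.strip("_").lower()


def extract_features(source):
    """Collect API name, ordered parameter types and resource name per assigned call."""
    features = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Assign) and isinstance(node.value, ast.Call):
            call, target = node.value, node.targets[0]
            if isinstance(call.func, ast.Name) and isinstance(target, ast.Name):
                features.append({
                    "api": normalize(call.func.id),
                    "param_types": [literal_type(arg) for arg in call.args],
                    "resource_name": normalize(target.id),
                })
    return features
```

Keeping the parameter types as an ordered list preserves the parameter order that the API feature relies on.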
4) calculating each feature similarity between defect code patterns and safe code patterns, and between defect code patterns and code patterns under test, generating feature vectors, and obtaining a training set and a test set; for the historical version, similar defect code patterns are paired with one another according to the historical repair information to form related pattern pairs; each defect code pattern is paired with the safe code patterns similar to it to form non-related pattern pairs; for the version under test, each defect code pattern is paired with each code pattern under test to form pattern pairs under test; then, from the features extracted in step 3), each feature similarity of every pattern pair is calculated and a feature vector is generated; finally, the set of feature vectors formed from the historical-version pattern pairs gives the training set, and the set of feature vectors formed from the pattern pairs of the version under test gives the test set;
definition 1: a defect code pattern is a resource-sensitive code pattern that the historical repair information shows was later repaired;
definition 2: a safe code pattern is a resource-sensitive code pattern that is similar to a defect code pattern but in which no defect has been found;
definition 3: the feature similarity of the API is computed with the VSM algorithm; for the parameter types, the weight is calculated with the TF-IDF algorithm by the formula:
$$ w = TF \times \log\frac{Total_{api}}{Contain_{type}} $$
where $TF$ is the frequency with which the type occurs in the API, $Total_{api}$ is the total number of APIs, and $Contain_{type}$ is the number of APIs that contain the type; the method uses this value as the weight in the feature vector formed from the API, measures the type order with 2-grams, which is robust to changes in the type order, and combines the type-order measure and the parameter-type measure into a single feature vector; the similarity of the feature vectors generated for the two versions is calculated with the VSM algorithm; in the method, the cosine of the angle between the historical-version feature vector a and the feature vector b of the version under test represents the similarity, by the formula:
$$ \cos(a, b) = \frac{\vec{a} \cdot \vec{b}}{\lVert \vec{a} \rVert \, \lVert \vec{b} \rVert} $$
where $\vec{a}$ and $\vec{b}$ respectively represent the historical-version feature vector a and the feature vector b of the version under test, and $\vec{a} \cdot \vec{b}$ represents the inner product of the two feature vectors;
definition 4: the feature similarity of the resource names is computed with a text similarity algorithm; first, each resource name is split into a sequence of words; next, for a resource name R_1 in the historical version and a resource name R_2 in the version under test, the calculation formula is:
Figure FSB0000184214660000026
where lcs(R_1, R_2) is the number of sub-words of R_1 that appear in R_2; this gives a quantized value for the resource names, from which the related vector is generated;
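The resource-name comparison in definition 4 could be sketched as below. The text only defines the lcs(R_1, R_2) term, so the normalization used here (dividing by the length of the longer word sequence) is an assumption for illustration, not the formula from the patent figure. The function names are likewise hypothetical.

```python
import re


def split_words(name):
    """Split a name such as 'logFile_handle' or 'HTTPResponse' into lower-case words."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [part.lower() for part in parts]


def shared_word_count(r1, r2):
    """Number of words of r1 that also occur in r2 (the lcs(R_1, R_2) term)."""
    words2 = set(split_words(r2))
    return sum(1 for word in split_words(r1) if word in words2)


def name_similarity(r1, r2):
    """Assumed normalization: shared words divided by the longer word sequence."""
    longest = max(len(split_words(r1)), len(split_words(r2)))
    return shared_word_count(r1, r2) / longest if longest else 0.0
```

With this normalization, identical names score 1.0 and names with no common word score 0.0, which keeps the value on the same scale as the cosine similarity used for the API feature.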
definition 5: the VSM algorithm is the vector space model, an algorithm for calculating similarity;
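The API similarity of definition 3 can be sketched with the standard library alone. The sketch assumes the conventional TF-IDF form TF × log(Total_api / Contain_type) with TF taken as a raw count, represents each API as an ordered list of parameter-type names, and stores vectors as sparse dictionaries. All function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_weights(param_types, all_apis):
    """Weight each type of one API by TF * log(Total_api / Contain_type)."""
    weights = {}
    for type_name, freq in Counter(param_types).items():
        containing = sum(1 for api in all_apis if type_name in api)
        # If the type is missing from the corpus, count the API itself once.
        weights[type_name] = freq * math.log(len(all_apis) / max(containing, 1))
    return weights


def bigrams(param_types):
    """2-grams over the ordered parameter types, e.g. ('str', 'int')."""
    return Counter(zip(param_types, param_types[1:]))


def feature_vector(param_types, all_apis):
    """Combine TF-IDF type weights and 2-gram counts into one sparse vector."""
    vec = {("type", t): w for t, w in tfidf_weights(param_types, all_apis).items()}
    vec.update({("2gram", g): float(c) for g, c in bigrams(param_types).items()})
    return vec


def cosine(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is all zero."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Two APIs with the same parameter types in a different order share the type components but not the 2-gram components, so their similarity falls between 0 and 1 rather than reaching 1.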
5) training the deep neural network model on the training set to perform feature merging, then using the model to calculate and rank the relevance of the patterns in the test set; the deep neural network model is trained on the training set generated in step 4), the trained model then performs feature merging on the test set generated in step 4) and calculates the relevance; finally, the relevance values between the defect code patterns and the code patterns under test are sorted in descending order, and the top k code pairs are selected as the output;
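The ranking part of step 5) can be sketched as follows. The trained deep neural network is not reproduced here: `mean_similarity` is a deliberately trivial stand-in scorer that averages the feature similarities, used only so that the example runs. Any callable mapping a feature vector to a relevance value could take its place. The names `rank_top_k` and `mean_similarity` are hypothetical.

```python
def mean_similarity(feature_vector):
    """Stand-in for the trained model: averages the per-feature similarities."""
    return sum(feature_vector) / len(feature_vector)


def rank_top_k(pairs, score, k):
    """Sort (defect_id, candidate_id, feature_vector) pairs by descending
    relevance and return the first k as (relevance, defect_id, candidate_id)."""
    scored = [
        (score(vector), defect_id, candidate_id)
        for defect_id, candidate_id, vector in pairs
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


# Each vector holds the API, resource-name, call-structure and
# function-structure similarities of one pattern pair (illustrative values).
pairs = [
    ("defect_1", "candidate_a", [0.9, 0.8, 0.7, 0.6]),
    ("defect_1", "candidate_b", [0.1, 0.2, 0.0, 0.1]),
    ("defect_2", "candidate_c", [0.5, 0.7, 0.4, 0.4]),
]
top_pairs = rank_top_k(pairs, mean_similarity, k=2)
```

The candidates in `top_pairs` are the ones that would be marked as resource-sensitive code with potential defects and reported together with the matching historical defect pattern.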
6) during program development and maintenance, flagging resource object operations that may be erroneous according to the relevance ranking, to assist development and maintenance; for each piece of resource-sensitive code under test with high relevance, developers and maintainers are notified of its location and of the related historical resource operations, shown the exception handling scheme previously applied to that resource, and given an alarm; the Python source code that has been checked is used as historical-version data for the next detection, which improves detection accuracy; newly committed Python source code is checked automatically, and an alarm is sent to developers and maintainers according to the result.
CN201610915633.4A 2016-10-20 2016-10-20 Python resource sensitive defect code detection method based on deep neural network Active CN107967208B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610915633.4A CN107967208B (en) 2016-10-20 2016-10-20 Python resource sensitive defect code detection method based on deep neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610915633.4A CN107967208B (en) 2016-10-20 2016-10-20 Python resource sensitive defect code detection method based on deep neural network

Publications (2)

Publication Number Publication Date
CN107967208A CN107967208A (en) 2018-04-27
CN107967208B true CN107967208B (en) 2020-01-17

Family

ID=61996517

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610915633.4A Active CN107967208B (en) 2016-10-20 2016-10-20 Python resource sensitive defect code detection method based on deep neural network

Country Status (1)

Country Link
CN (1) CN107967208B (en)

Families Citing this family (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2572155B (en) * 2018-03-20 2022-12-28 Withsecure Corp Threat detection system
CN109241739B (en) * 2018-07-19 2021-01-05 中国科学院信息工程研究所 API-based android malicious program detection method and device and storage medium
CN109446078B (en) * 2018-10-18 2022-02-18 网易(杭州)网络有限公司 Code testing method and device, storage medium and electronic equipment
CN109657461B (en) * 2018-11-26 2020-12-08 浙江大学 RTL hardware Trojan horse detection method based on gradient lifting algorithm
CN109726120B (en) * 2018-12-05 2022-03-08 北京计算机技术及应用研究所 Software defect confirmation method based on machine learning
CN110162245B (en) * 2019-04-11 2020-12-08 北京达佳互联信息技术有限公司 Analysis method and device of graphic operation, electronic equipment and storage medium
CN110175128B (en) * 2019-05-29 2023-04-07 北京百度网讯科技有限公司 Similar code case acquisition method, device, equipment and storage medium
CN110349477B (en) * 2019-07-16 2022-01-07 长沙酷得网络科技有限公司 Programming error repairing method, system and server based on historical learning behaviors
CN111459789B (en) * 2019-08-28 2023-11-03 南京意博软件科技有限公司 Detection method and device for application programming interface
CN110597735B (en) * 2019-09-25 2021-03-05 北京航空航天大学 Software defect prediction method for open-source software defect feature deep learning
CN110780878A (en) * 2019-10-25 2020-02-11 湖南大学 Method for carrying out JavaScript type inference based on deep learning
CN110825642B (en) * 2019-11-11 2021-01-01 浙江大学 Software code line-level defect detection method based on deep learning
CN111427775B (en) * 2020-03-12 2023-05-02 扬州大学 Method level defect positioning method based on Bert model
CN111913718B (en) * 2020-06-22 2022-02-11 西安交通大学 Binary function differential analysis method based on basic block context information
CN111913874B (en) * 2020-06-22 2021-12-28 西安交通大学 Software defect tracing method based on syntactic structure change analysis
CN111858323B (en) * 2020-07-11 2021-06-01 南京工业大学 Code representation learning-based instant software defect prediction method
CN112131120B (en) * 2020-09-27 2022-09-30 北京智联安行科技有限公司 Source code defect detection method and device
CN112328475B (en) * 2020-10-28 2021-11-30 南京航空航天大学 Defect positioning method for multiple suspicious code files
CN113407442B (en) * 2021-05-27 2022-02-18 杭州电子科技大学 Pattern-based Python code memory leak detection method
CN113408597A (en) * 2021-06-10 2021-09-17 北京工业大学 Java method name recommendation method based on two-stage framework
CN113836020A (en) * 2021-09-24 2021-12-24 中国电信股份有限公司 Code detection method, device and storage medium
CN113722239B (en) * 2021-11-01 2022-01-25 南昌航空大学 Airborne embedded software quality detection method, device, medium and electronic equipment
CN115454855B (en) * 2022-09-16 2024-02-09 中国电信股份有限公司 Code defect report auditing method, device, electronic equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1609855A (en) * 2003-06-23 2005-04-27 微软公司 Query optimizer system and method
CN101441571A (en) * 2008-12-02 2009-05-27 南京大学 Gridding system implementing method based on Python language
CN105159715A (en) * 2015-09-01 2015-12-16 南京大学 Python code change reminding method on basis of abstract syntax tree node change extraction

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1609855A (en) * 2003-06-23 2005-04-27 微软公司 Query optimizer system and method
CN100517307C (en) * 2003-06-23 2009-07-22 微软公司 Query optimizer system and method
CN101441571A (en) * 2008-12-02 2009-05-27 南京大学 Gridding system implementing method based on Python language
CN105159715A (en) * 2015-09-01 2015-12-16 南京大学 Python code change reminding method on basis of abstract syntax tree node change extraction

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Pyreview:一个基于抽象语法树差异提取的Python源代码分析工具;李清言;《中国优秀硕士学位论文全文数据库信息科技辑》;20161015;第2016年卷(第10期);I138-175 *
Tracking Down Dynamic Feature Code Changes Against Python Software Evolution;Zhifei Chen 等;《2016 Third International Conference on Trustworthy Systems and their Applications》;20160920;54-63 *

Also Published As

Publication number Publication date
CN107967208A (en) 2018-04-27

Similar Documents

Publication Publication Date Title
CN107967208B (en) Python resource sensitive defect code detection method based on deep neural network
CN109697162B (en) Software defect automatic detection method based on open source code library
CN110737899B (en) Intelligent contract security vulnerability detection method based on machine learning
CN117951701A (en) Method for determining flaws and vulnerabilities in software code
CN114297654A (en) Intelligent contract vulnerability detection method and system for source code hierarchy
Vendome et al. Machine learning-based detection of open source license exceptions
CN112733156B (en) Intelligent detection method, system and medium for software vulnerability based on code attribute graph
CN108520180A (en) A kind of firmware Web leak detection methods and system based on various dimensions
US20230273776A1 (en) Code Processing Method and Apparatus, Device, and Medium
US20210389997A1 (en) Techniques for detecting atypical events in event logs
CN115033895B (en) Binary program supply chain safety detection method and device
CN115168856A (en) Binary code similarity detection method and Internet of things firmware vulnerability detection method
CN106874762B (en) Android malicious code detecting method based on API dependence graph
CN113742205A (en) Code vulnerability intelligent detection method based on man-machine cooperation
CN109670311A (en) Malicious code analysis and detection method based on high-level semantics
CN116702160B (en) Source code vulnerability detection method based on data dependency enhancement program slice
CN116975881A (en) LLVM (LLVM) -based vulnerability fine-granularity positioning method
Fujita et al. Towards hybrid intelligence for logic error detection
Rajbahadur et al. Pitfalls analyzer: quality control for model-driven data science pipelines
CN110989991A (en) Method and system for detecting source code clone open source software in application program
CN116340185A (en) Method, device and equipment for analyzing software open source code components
CN115859307A (en) Similar vulnerability detection method based on tree attention and weighted graph matching
CN115438341A (en) Method and device for extracting code loop counter, storage medium and electronic equipment
CN116401145A (en) Source code static analysis processing method and device
CN114398069A (en) Method and system for identifying accurate version of public component library based on cross fingerprint analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant