CN113742732A

CN113742732A - Code vulnerability scanning and positioning method

Info

Publication number: CN113742732A
Application number: CN202010487204.8A
Authority: CN
Inventors: 房春荣; 葛宇; 刘子夕; 葛修婷; 钱美缘; 李宁
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2020-05-27
Filing date: 2020-05-27
Publication date: 2021-12-03

Abstract

A code vulnerability scanning and positioning method first scans a code data set with a plurality of vulnerability scanning tools, then parses the scan results to extract basic vulnerability information. Next, a voting strategy labels each reported vulnerability as a true positive or a false positive, and the false positives are filtered out. Finally, based on the known basic vulnerability information, the source code is sliced with the existing slicing tool Wala. The slicing module of the invention is improved in three respects: selection of the slicing mode, type-specific handling of the instruction at the slice point, and filtering of irrelevant statements. Together these effectively improve the precision of vulnerability localization.

Description

Code vulnerability scanning and positioning method

Technical Field

The invention belongs to the field of software engineering, and in particular to the application of static program analysis within that field. It is used to process the vulnerability information generated by static code vulnerability scanning tools.

Background

Static code analysis tools are known to report a large number of false positives. On the one hand, static analysis has inherent limits: by Rice's theorem, non-trivial static analysis problems are undecidable in the worst case. On the other hand, most static analysis tools model programs imprecisely, so the analysis model differs considerably from the actual execution of the program; many tools are conservative and use flow-insensitive or context-insensitive analysis, which produces high false-positive and missed-report rates. How to reduce both rates has become a much-studied problem in software vulnerability analysis. Triaging a large number of false positives is time-consuming for developers and may reduce their confidence in static code analysis tools, so false positives are one of the biggest obstacles to using static code analysis. How to determine whether a warning from a static code analysis tool really indicates an error, and how to reduce the number of false positives that developers must deal with, are therefore urgent problems.

Academia has already studied the false-positive problem. One review of software vulnerability reports introduces a way of analyzing false positives, and Samsung's bug feedback system offers another line of thought. Both, however, face a data-granularity problem that cannot be avoided. Samsung uses an internally implemented checker to extract the statements associated with a vulnerability, and the specific implementation is unknown. The review uses manual slicing: source code elements unrelated to the false positive are deleted one by one until deleting the next element would make the false-positive message disappear. Manual slicing certainly gives the best result, but its cost is very high for the large volume of data in a data set (a point the literature also makes).

Many vulnerability scanning tools are currently available in industry, but missed reports remain a serious problem. The result of scanning a source file with a single scanning tool is very unstable, so the likelihood of false positives is high and the vulnerability is ultimately located inaccurately.

Some slicing tools on the market are already mature, and automatic slicing is not difficult. However, the following problems remain:

1. The slice may contain statements that are irrelevant to the source file, for example statements from Java class libraries imported by the source file and from the runtime classes of the Wala tool.

2. The choice of data-flow and control-flow dependence parameters strongly affects the final slicing result.

3. The slicing tool generates an intermediate representation of the source file while analyzing it. Because of optimizations in that representation, certain statements, e.g. simple assignments (x = y, y = z), do not appear in it, so these Java statements never appear in the slice. By the principle of taint propagation, however, such assignment statements may well be a source of taint.

4. The selection of the appropriate slicing mode has a great influence on the final slicing result.

We therefore propose the following idea: scan the code source files with a plurality of vulnerability scanning tools and compare the scan reports of the different tools to obtain basic vulnerability information. From that information, a near-ideal slice is produced automatically, with only a small amount of redundancy compared with manual slicing, which yields more accurate vulnerability localization.

Disclosure of Invention

The invention aims to solve the following problems: the missed-report rate of current scanning tools is too high, and vulnerability localization is inaccurate.

1) Scan the code source files with three vulnerability scanning tools, mainly through the following steps:

1.1) First, package the code source files into a jar package, invoke the command line from Java to automatically execute each vulnerability scanning tool's command-line instruction, and save the scan results.

1.2) Extract the basic vulnerability information from the result files saved in step 1.1: vulnerability category, vulnerability level, vulnerability ID, vulnerability method name and vulnerability class name. A voting mechanism decides whether a vulnerability is a true positive. Each vulnerability is weighted by its level: low risk has weight 1, medium risk weight 2 and high risk weight 3. Suppose the three scanning tools scan the same source file: if they report a vulnerability warning it is a true positive, otherwise it is a false positive. Taking as the critical case three scanning tools flagging the same source file, the weighted sum for that source file is 3, so a vulnerability is judged a true positive by checking whether the weighted sum for each source file is greater than 3.

2) Obtain the relevant information about the code source files, mainly through the following steps:

2.1) Read the source files (in jar package form) and generate their class inheritance graph (class hierarchy). From the class inheritance graph, build the call graph (CallGraph) of the source files.

2.2) Compute the data-flow and control-flow dependency graphs from the call graph obtained above.

3) Find the exact slice point, mainly through the following steps:

3.1) Traverse each CGNode of the call graph, stopping once a node is found whose method name and class name match those in the basic vulnerability information;

3.2) For the CGNode obtained in step 3.1), traverse each instruction in it, stopping once an instruction is found that meets the condition given by the basic vulnerability information;

3.3) If the instruction is an ordinary statement, wrap it directly in a Wala NormalStatement object; if it is an inter-method call statement, then in addition to the slice-point information, locate the called method by the method name invoked at the slice point and add that method's information, to obtain an accurate NormalStatement.

4) Generate the final slicing result, mainly through the following steps:

4.1) Select the slicing mode with a suitable strategy, based on the number of references made by the slice point and the number of times it is referenced.

Source files are sliced using Wala.

4.2) Pruning. Based on the system dependency graph, define a filter that removes the common Java base class libraries and the classes imported by the source file.

The invention is characterized in that: 1. vulnerabilities are labeled as true positives or false positives with a voting strategy, according to the scan results of a plurality of vulnerability scanning tools; 2. inter-method call statements receive special handling according to the instruction type of the slice point, which enriches the features of the slice point; 3. a suitable slicing mode is selected according to the number of references made by the slice point and the number of times it is referenced; 4. the relevant classes of the Java base class library are filtered out based on the system dependency graph. Combining these four points, the vulnerability localization obtained by the method reduces the likelihood of false positives to a certain extent, and the resulting slices are very fine-grained.

The benefits of the above steps include, but are not limited to, the following: some false positives are filtered out more accurately; the runtime classes of the Wala slicing tool are effectively filtered out; and statements that do not appear in the source file are eliminated.

Drawings

FIG. 1 is an overall flow chart of the present invention.

FIG. 2 shows the source code (left) and the resulting slice (right).

Detailed Description

The key technology involved in the invention is Wala.

Wala

The T.J. Watson Libraries for Analysis (WALA) provide static analysis for Java bytecode and related languages, as well as for JavaScript. WALA offers interprocedural dataflow analysis, a context-sensitive tabulation-based slicer, pointer analysis and call graph construction, and a general framework for iterative dataflow analysis on Java bytecode. In the invention, the position of the vulnerability given in the vulnerability report is taken as the starting point, and the program is sliced by using WALA to analyze the dependence relationships of the program, producing a simplified vulnerable code fragment.

The following describes the steps of the method with a specific example and shows the results.

In order to explain the technical contents of the present invention, the objects achieved, and the final effects in detail, specific embodiments will be described in more detail below.

The slicing tool we use is Wala, and the data set is Juliet Java.

We selected a piece of code from the Juliet Java test case data set.

The method comprises the following specific implementation steps:

1. Package the code source files into a jar package, invoke the command line from Java to automatically execute the command-line instructions of the three vulnerability scanning tools, and save the scan results.

2. Extract the basic vulnerability information from the saved result files: vulnerability category, vulnerability level, vulnerability ID, vulnerability method name and vulnerability class name. A voting mechanism decides whether a vulnerability is a true positive. Each vulnerability is weighted by its level: low risk has weight 1, medium risk weight 2 and high risk weight 3. Suppose the three scanning tools scan the same source file: if they report a vulnerability warning it is a true positive, otherwise it is a false positive. Taking as the critical case three scanning tools flagging the same source file, the weighted sum for that source file is 3, so a vulnerability is judged a true positive by checking whether the weighted sum for each source file is greater than 3.
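By way of illustration only, the weighted vote in step 2 might be sketched as follows. The class and method names (VotingSketch, weightedSum, isTruePositive) are hypothetical and do not come from the invention. The strict "greater than" comparison follows the wording of the text; whether a sum of exactly 3 should count as a true positive is not settled by the text, so the threshold is left as a parameter.

```java
import java.util.Arrays;
import java.util.List;

/** Minimal sketch of severity-weighted voting over scanner warnings (hypothetical names). */
class VotingSketch {

    /** Levels with the weights given in the text: low = 1, medium = 2, high = 3. */
    enum Severity {
        LOW(1), MEDIUM(2), HIGH(3);

        final int weight;

        Severity(int weight) {
            this.weight = weight;
        }
    }

    /** Sums the weights of all warnings that the tools raised for one source file. */
    static int weightedSum(List<Severity> warnings) {
        int sum = 0;
        for (Severity s : warnings) {
            sum += s.weight;
        }
        return sum;
    }

    /** Assumption: strict comparison, following the wording "greater than 3". */
    static boolean isTruePositive(List<Severity> warnings, int threshold) {
        return weightedSum(warnings) > threshold;
    }

    public static void main(String[] args) {
        List<Severity> threeTools = Arrays.asList(Severity.HIGH, Severity.MEDIUM, Severity.LOW);
        List<Severity> oneTool = Arrays.asList(Severity.MEDIUM);
        System.out.println(isTruePositive(threeTools, 3)); // true, the sum is 6
        System.out.println(isTruePositive(oneTool, 3));    // false, the sum is 2
    }
}
```

With a strict comparison, three low-risk warnings (sum 3) are not accepted; switching to "greater than or equal" would change only that boundary case.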

3. List the Wala runtime classes in Wala's analysis exclusion option, which restricts the scope of the Wala slicing tool and shrinks its analysis domain. Here we define the class libraries to be excluded, including the irrelevant Java class libraries.

4. Package the Juliet Java test case data set into a jar package. Because the data set is built with Ant, we recommend packaging it with the Ant command: ant -f MyProject\build.xml clean jar, which ensures that the jar package is usable. Wala reads the jar package of the source files and, with the exclusion file defined in step 3 as the analysis-domain parameter, loads all classes in the jar package into its analysis pool and generates the class inheritance graph of the jar package; the call graph is then computed from the class inheritance graph.

5. Find the current slice point from the vulnerability location information. Using the source file class name and method name recorded in that information, obtain from the call graph the CGNode of the function containing the slice point (Wala wraps function bodies in the CGNode class). Then traverse the IR instructions in that CGNode (the IR is the intermediate representation used during Wala's analysis), mapping the position of each IR instruction back to the source file, until an IR instruction is found whose mapped position matches the line number recorded in the vulnerability location information; this is the target instruction. If it is an ordinary instruction, wrap it directly in a Wala NormalStatement object; if it is an inter-method call statement, then in addition to the slice-point information, find the CGNode of the called method by the method name invoked at the slice point and add the called method's information.
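The lookup in step 5 can be illustrated with a small self-contained sketch. The Node and Instr classes below are hypothetical stand-ins for Wala's CGNode and IR instructions, introduced only for this example; they are not Wala's API, and the mapping from an IR instruction to a source line is assumed to have been done already.

```java
import java.util.Arrays;
import java.util.List;

/** Minimal sketch of locating a slice point; Node and Instr are hypothetical stand-ins. */
class SlicePointSketch {

    /** Stand-in for one IR instruction already mapped back to a source line. */
    static final class Instr {
        final int sourceLine;
        final String calledMethod; // null unless the instruction is a method call

        Instr(int sourceLine, String calledMethod) {
            this.sourceLine = sourceLine;
            this.calledMethod = calledMethod;
        }
    }

    /** Stand-in for a call graph node, i.e. one method body. */
    static final class Node {
        final String className;
        final String methodName;
        final List<Instr> instructions;

        Node(String className, String methodName, List<Instr> instructions) {
            this.className = className;
            this.methodName = methodName;
            this.instructions = instructions;
        }
    }

    /** Returns the first node whose class and method names match the report, or null. */
    static Node findNode(List<Node> callGraph, String className, String methodName) {
        for (Node n : callGraph) {
            if (n.className.equals(className) && n.methodName.equals(methodName)) {
                return n;
            }
        }
        return null;
    }

    /** Returns the first instruction whose mapped source line equals the reported line, or null. */
    static Instr findInstruction(Node node, int reportedLine) {
        for (Instr i : node.instructions) {
            if (i.sourceLine == reportedLine) {
                return i;
            }
        }
        return null;
    }

    /** For a call instruction, looks up the callee node by method name; null for ordinary instructions. */
    static Node resolveCallee(List<Node> callGraph, Instr instr) {
        if (instr == null || instr.calledMethod == null) {
            return null;
        }
        for (Node n : callGraph) {
            if (n.methodName.equals(instr.calledMethod)) {
                return n;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Node helper = new Node("Demo", "helper", Arrays.asList(new Instr(20, null)));
        Node bad = new Node("Demo", "bad", Arrays.asList(new Instr(10, null), new Instr(12, "helper")));
        List<Node> callGraph = Arrays.asList(helper, bad);

        Node node = findNode(callGraph, "Demo", "bad");
        Instr target = findInstruction(node, 12);
        Node callee = resolveCallee(callGraph, target);
        System.out.println(node.methodName + " line " + target.sourceLine + " calls " + callee.methodName);
    }
}
```

Matching the callee by method name alone mirrors the wording of the step; a real implementation would presumably also need the declaring class and signature to disambiguate overloads.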

6. Select the slicing mode with a suitable strategy, based on the number of references made by the slice point and the number of times it is referenced (getNumberOfUses and getNumberOfDefs return these two counts). Choosing the right slicing mode is necessary because it affects not only the slicing result but also whether slicing can complete at all. For a seed statement s (the slice point), if only a few statements depend on the result of s but s depends on the results of many other statements, a forward slice can occupy a large amount of memory and directly cause an OOM (out-of-memory) exception; in that case a backward slice is used. Both what s depends on and what depends on s therefore need to be measured, and the most intuitive attributes are the number of references and the number of times referenced. We use a rough comparison here: if the number of references made by the seed statement is larger than the number of times it is referenced, a backward slice is used; otherwise a forward slice is used.
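The rough comparison in step 6 amounts to a single conditional, sketched below under one assumption: that the "number of references" is the value returned by getNumberOfUses and the "number of times referenced" is the value returned by getNumberOfDefs, as the order of the parenthetical suggests. The class SliceModeSketch and its members are hypothetical; the two counts are passed in as plain integers so that the example runs without Wala.

```java
/** Minimal sketch of choosing the slicing direction (hypothetical names). */
class SliceModeSketch {

    enum Mode { BACKWARD, FORWARD }

    /**
     * Backward slice when the seed statement makes more references (uses)
     * than it is referenced (defs); forward slice otherwise, including ties.
     */
    static Mode choose(int numberOfUses, int numberOfDefs) {
        return numberOfUses > numberOfDefs ? Mode.BACKWARD : Mode.FORWARD;
    }

    public static void main(String[] args) {
        System.out.println(choose(3, 1)); // BACKWARD
        System.out.println(choose(1, 1)); // FORWARD
    }
}
```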

7. Slice the source files with Wala. According to the slicing mode determined in step 6, call the static method computeBackwardSlice or computeForwardSlice of the Slicer class on the seed statement, passing the call graph and the data-flow and control-flow parameters. The call graph was obtained in step 4. We set the data-flow and control-flow parameters to Full; experiments show that this setting produces more statements in the slice.
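To make concrete what a backward or forward slice contains, the following sketch computes both over a tiny hand-written dependence map. It is not Wala's Slicer and does not use its API; the class DependenceSliceSketch, the statement numbering and the map layout are assumptions made for this example only.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Minimal sketch of slicing as reachability over a dependence map (not Wala's Slicer). */
class DependenceSliceSketch {

    /** Backward slice: the seed plus every statement it transitively depends on. */
    static Set<Integer> backwardSlice(Map<Integer, List<Integer>> dependsOn, int seed) {
        return reach(dependsOn, seed);
    }

    /** Forward slice: the seed plus every statement that transitively depends on it. */
    static Set<Integer> forwardSlice(Map<Integer, List<Integer>> dependsOn, int seed) {
        // Invert the edges so that each statement maps to its dependents.
        Map<Integer, List<Integer>> dependents = new HashMap<>();
        for (Map.Entry<Integer, List<Integer>> e : dependsOn.entrySet()) {
            for (Integer target : e.getValue()) {
                dependents.computeIfAbsent(target, k -> new ArrayList<>()).add(e.getKey());
            }
        }
        return reach(dependents, seed);
    }

    /** Worklist traversal collecting all statements reachable from the seed. */
    private static Set<Integer> reach(Map<Integer, List<Integer>> edges, int seed) {
        Set<Integer> seen = new TreeSet<>();
        Deque<Integer> work = new ArrayDeque<>();
        work.push(seed);
        while (!work.isEmpty()) {
            int s = work.pop();
            if (seen.add(s)) {
                for (Integer next : edges.getOrDefault(s, Collections.<Integer>emptyList())) {
                    work.push(next);
                }
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        // 1: x = input();  2: y = 5;  3: z = x + 1;  4: sink(z);
        Map<Integer, List<Integer>> dependsOn = new HashMap<>();
        dependsOn.put(3, Arrays.asList(1)); // statement 3 uses x from statement 1
        dependsOn.put(4, Arrays.asList(3)); // statement 4 uses z from statement 3
        System.out.println(backwardSlice(dependsOn, 4)); // [1, 3, 4]
        System.out.println(forwardSlice(dependsOn, 2));  // [2]
    }
}
```

Statement 2 is absent from the backward slice of statement 4 because nothing on the path to the sink depends on it, which is the kind of reduction the slicing step relies on.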

8. Pruning. Based on the system dependency graph, define a Predicate filter that removes the common Java base class libraries and the statements of classes imported by the source file. Generating the system dependency graph requires the call graph and the data-flow and control-flow parameters, all of which were produced in the previous steps and can be used directly.
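A Predicate-based filter of the kind described in step 8 might look like the sketch below. The prefix list and the representation of a slice statement by the name of its declaring class are assumptions made for this example; the actual filter of the invention operates on Wala's system dependency graph, which is not reproduced here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

/** Minimal sketch of pruning library statements from a slice (hypothetical names). */
class PruneSketch {

    /** Assumed package prefixes treated as library or tool code. */
    static final List<String> EXCLUDED_PREFIXES =
            Arrays.asList("java.", "javax.", "sun.", "com.ibm.wala.");

    /** Accepts a declaring class name only if it is not under an excluded prefix. */
    static final Predicate<String> IS_APPLICATION_CLASS = className -> {
        for (String prefix : EXCLUDED_PREFIXES) {
            if (className.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    };

    /** Keeps the entries of the slice whose declaring class passes the filter, in order. */
    static List<String> prune(List<String> declaringClasses) {
        List<String> kept = new ArrayList<>();
        for (String c : declaringClasses) {
            if (IS_APPLICATION_CLASS.test(c)) {
                kept.add(c);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> slice = Arrays.asList("java.lang.String", "demo.app.Bad", "com.ibm.wala.ipa.slicer.Slicer");
        System.out.println(prune(slice)); // [demo.app.Bad]
    }
}
```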

Claims

1. A code vulnerability scanning and positioning method, characterized by comprising the following steps: (1) labeling vulnerabilities as true positives or false positives with a voting strategy, according to the scan results of a plurality of vulnerability scanning tools; (2) giving special handling to inter-method call statements according to the instruction type of the slice point, thereby enriching the features of the slice point; (3) selecting a suitable slicing mode according to the number of references made by the slice point and the number of times it is referenced; (4) filtering out the relevant classes of the Java base class library based on the system dependency graph.

2. The code vulnerability scanning and positioning method according to claim 1, wherein the vulnerabilities are labeled as true positives or false positives with a voting strategy according to the scan results of a plurality of vulnerability scanning tools, mainly through the following steps:

First, the code source files are packaged into a jar package, the command line is invoked from Java to automatically execute each vulnerability scanning tool's command-line instruction, and the scan results are saved. Then the basic vulnerability information is extracted from the saved result files: vulnerability category, vulnerability level, vulnerability ID, vulnerability method name and vulnerability class name. A voting mechanism decides whether a vulnerability is a true positive. Each vulnerability is weighted by its level: low risk has weight 1, medium risk weight 2 and high risk weight 3. Suppose the three scanning tools scan the same source file: if they report a vulnerability warning it is a true positive, otherwise it is a false positive. Taking as the critical case three scanning tools flagging the same source file, the weighted sum for that source file is 3, so a vulnerability is judged a true positive by checking whether the weighted sum for each source file is greater than 3.

3. The code vulnerability scanning and positioning method according to claim 1, wherein inter-method call statements receive special handling according to the instruction type of the slice point, thereby enriching the features of the slice point. First, the source files (in jar package form) are read and their class inheritance graph is generated. Then the call graph (CallGraph) of the source files is obtained from the class inheritance graph, and each CGNode of the call graph is traversed, stopping once a node is found whose method name and class name match those in the basic vulnerability information. Next, each instruction in the obtained CGNode is traversed, stopping once an instruction is found that meets the condition given by the basic vulnerability information. Finally, if the instruction is an ordinary statement, it is wrapped directly in a Wala NormalStatement object; if it is an inter-method call statement, then in addition to the slice-point information, the called method is located by the method name invoked at the slice point and that method's information is added, to obtain an accurate NormalStatement.

4. The code vulnerability scanning and positioning method according to claim 1, wherein a suitable slicing mode is selected according to the number of references made by the slice point and the number of times it is referenced, characterized in that the slicing mode is selected from these two counts (which can be obtained with getNumberOfUses and getNumberOfDefs). Both what the slice point depends on and what depends on it need to be measured, and the most intuitive attributes are the number of references and the number of times referenced. A rough comparison is used: if the number of references made by the slice point is greater than the number of times it is referenced, a backward slice is used; otherwise a forward slice is used.

5. The code vulnerability scanning and positioning method according to claim 1, wherein the relevant classes of the Java base class library are filtered out based on the system dependency graph: a Predicate filter is defined on the system dependency graph to remove the common Java base class libraries and the statements of classes imported by the source file.