CN112131122B

CN112131122B - Method and device for false alarm evaluation of source code defect detection tools

Info

Publication number: CN112131122B
Application number: CN202011031690.9A
Authority: CN
Inventors: 笋大伟; 华嘉仪
Original assignee: Beijing Zhilian Anhang Technology Co ltd
Current assignee: Beijing Zhilian Anhang Technology Co ltd
Priority date: 2020-09-27
Filing date: 2020-09-27
Publication date: 2022-09-30
Anticipated expiration: 2040-09-27
Also published as: CN112131122A

Abstract

The invention discloses a method and a device for evaluating false alarms of a source code defect detection tool. The method comprises: acquiring feature code segments containing source code sensitive point feature information of known samples as training data; training a neural network model on the acquired feature code segments using a deep learning method to obtain a trained neural network model, wherein the trained model is used to judge whether the sensitive point corresponding to a feature code segment was falsely reported by a defect detection tool; and inputting a feature code segment containing source code sensitive point feature information of an unknown sample into the trained model to determine whether the corresponding sensitive point was falsely reported by the defect detection tool. With this method and device, whether the sensitive point corresponding to the feature code segment of an unknown sample was falsely reported by the defect detection tool can be evaluated quickly and accurately.

Description

Method and device for false alarm evaluation of source code defect detection tools

Technical Field

The invention relates to the technical field of artificial intelligence, in particular to a source code defect detection tool false alarm evaluation method and device.

Background

Many tools and systems for software code defect detection exist in China and abroad. Common foreign code detection tools and vendors include Checkmarx, Parasoft, Coverity and Fortify. Checkmarx is a source code security detection product from Israel, designed to identify, track and repair technical and logical security vulnerabilities in software source code; it can detect many kinds of security defects and supports multiple platforms, programming languages and development frameworks. Its advantages are customizable rules and a high degree of integration; its disadvantages are low speed, low precision and high cost. Parasoft is a leading global provider of software quality, service virtualization and software lifecycle management solutions. Its products include C/C++test and Jtest, which offer complete testing functions at a relatively low price and good value for money, but support too few language environments. Coverity is a brand of Synopsys that includes static and dynamic code analysis tools and offers both paid services and free open-source products, so it can serve individuals and enterprises with different needs. Its shortcomings are a poor page user experience, in which target defects are hard to find in the list, relatively few detectable languages, and some false alarms that the tool does not filter out. Fortify SCA is a static white-box software source code security testing tool from Hewlett-Packard. It converts front-end languages (such as C/C++ and Java) into an intermediate file, NST, through a language compiler, analyzes the calling relations, environments and so on among the source code, and then uses an analysis engine to match against a rule base, capturing security vulnerabilities in the source code and producing an FPR result file. However, Fortify SCA is expensive to use, and according to feedback from some users its defect false alarm rate is high.

Common code detection tools in China include CoBOT, the 360 code security assurance system, OpenRASP, Scantist SCA, Pinpoint and DMSCA. CoBOT (Kubo) is the first commercial code security detection tool in China with independent intellectual property rights. It was created by the Peking University software engineering center together with the research and development team of a Beijing software enterprise, broke the monopoly of foreign products in software defect detection and security vulnerability analysis, and, with the help of the Peking University platform, is relatively widely adopted in the military, government and research institutes. The 360 code security assurance system is well known in the industry thanks to the company's brand promotion, free antivirus software and so on, and covers three major directions: source code defect detection, source code compliance detection and source code traceability detection. OpenRASP is a free, open-source runtime application self-protection product launched by Baidu Security. It is injected directly into the service of the protected application to provide real-time protection at the function level, can detect and protect against unknown vulnerabilities without updating policies or upgrading the protected application's code, and is suited to Internet applications that make heavy use of open-source components and to financial applications developed by third-party integrators. Scantist SCA is a solution created by Shanghai-control Ann that combines source code analysis with binary file analysis and can give users or enterprises a detailed analysis of the third-party components used in their applications. Pinpoint is developed by Sourcebrella (源伞科技). It is based on fifth-generation theorem-proving technology from academia and grew out of eight years of research at the Hong Kong University of Science and Technology. It understands software behavior by analyzing source code and binary code, searches for code defects, and finds violations of coding specifications and malicious or illegal behavior; it is fast and accurate, flexible to customize and extend, transparently priced, and strong at detecting C/C++ and Java/Android code. DMSCA is a scanning and analysis service platform for source code security vulnerabilities, quality defects and logic defects launched by Shanghai Mar science and technology. Based on static analysis technology, it analyzes security vulnerabilities, quality defects and logic defects in software source code, making it easy for enterprises to evaluate, monitor and improve software security and product quality and to manage development and outsourcing teams, and it supports platform customization.

However, existing source code defect detection tools have high false alarm rates and high missed detection rates, and many users can only invest a great deal of manpower to judge manually, one by one, whether reported code really contains a vulnerability. Current research mainly pursues the combination of dynamic and static detection and information fusion across multiple static detection tools. For example, source code static analysis tools with different mechanisms are integrated into one tool platform, the output of each tool is preprocessed into a uniform format and stored in a database, and the data in the database are statistically analyzed to obtain a comprehensive detection result. Through such multi-level comprehensive detection, security vulnerabilities can be found effectively when only the software source code is available, and the location and severity of each vulnerability can be reported accurately so that it can be handled in a targeted way. Alternatively, a multi-strategy software code defect detection system is designed and implemented by combining dynamic and static detection and integrating several static detection tools. Such a system can not only detect source code in languages such as Java and C/C++, but can also, when only executable program code is available, apply decompilation and then run detection on the loaded data.

However, the prior art has the following problems:

1. The speed is slow. Existing methods are time-consuming, whether the results given by a detection tool are checked manually for false alarms or possible false alarms are reduced by combining the results of different types of tools. For example, information fusion based on multiple static detection tools requires different tools to each scan the software under test; detection times may differ greatly between tools, and although the tools can run in parallel, the overall completion time is bounded by the slowest one. When dynamic detection is combined with static detection, the dynamic detection must compile and run the code, which is relatively slow and not necessarily suitable for all usage scenarios.

2. The cost is high. Manually checking a detection tool's false alarms adds a large amount of labor cost. Combining the results of multiple tools means purchasing and using multiple detection tools. Only a few code detection tools on the market are free; most require payment, and some are expensive, so combining multiple tools can easily multiply the cost of code detection.

3. The effect is limited by the detection tools used. Fully accurate analyses, such as directional analysis, guard value analysis and non-guard value analysis, are technically difficult to implement and often have high time and space complexity, so the analysis methods of existing detection tools cannot be fully accurate. Insufficient acquisition of context information introduces uncertainty when formal methods are used to check coding rules, and rule matching mechanisms are imperfect, so detection tools produce false alarms or missed detections. The mechanisms of different source code static analysis tools differ, each tool is good at detecting different kinds of code defects, and a tool is relatively more accurate on the problems it is good at. Even so, the problems the tools are good at do not cover all problem types, the results for those problems are not guaranteed to be accurate, and false alarms for most code problems cannot be effectively filtered.

Disclosure of Invention

The invention aims to provide a method and a device for evaluating the false alarm of a source code defect detection tool, which can quickly and accurately evaluate whether a sensitive point corresponding to a feature code fragment of a current unknown sample is falsely reported by the defect detection tool.

In order to achieve the above object, the present invention provides a method for evaluating false alarm of source code defect detection tool, comprising:

acquiring a feature code segment containing source code sensitive point feature information of a known sample as training data;

training a neural network model by adopting a deep learning method according to the acquired feature code segments to obtain the trained neural network model, wherein the trained neural network model is used for judging whether the sensitive points corresponding to the feature code segments are misreported by a defect detection tool;

and inputting a feature code segment containing source code sensitive point feature information of an unknown sample into the trained neural network model to determine whether the sensitive point corresponding to the feature code segment is misreported by a defect detection tool.

In order to achieve the above object, the present invention further provides a false alarm evaluation device for a source code defect detection tool, the device comprising:

the acquisition module is used for acquiring a feature code segment containing source code sensitive point feature information of a known sample as training data;

the training module is used for training a neural network model by adopting a deep learning method according to the acquired feature code segments to obtain the trained neural network model, wherein the trained neural network model is used for judging whether the sensitive points corresponding to the feature code segments are misreported by a defect detection tool;

and the detection module is used for inputting a feature code segment of an unknown sample containing the feature information of the source code sensitive point into the trained neural network model so as to determine whether the sensitive point corresponding to the feature code segment is misreported by a defect detection tool.

In summary, the present invention provides a method and a device for evaluating false alarms of a source code defect detection tool, wherein the method comprises: acquiring a feature code segment containing source code sensitive point feature information of a known sample as training data; training a neural network model by adopting a deep learning method according to the acquired feature code segments to obtain the trained neural network model, wherein the trained neural network model is used for judging whether the sensitive points corresponding to the feature code segments are misreported by a defect detection tool; and inputting a feature code segment containing source code sensitive point feature information of an unknown sample into the trained neural network model to determine whether the sensitive point corresponding to the feature code segment is misreported by a defect detection tool. Because the scheme of the invention makes the false alarm evaluation increasingly accurate by continuously training the neural network model, it addresses the prior-art problems of low speed, high cost, and an effect limited by the detection tools used.

Drawings

Fig. 1 is a schematic flow chart of a source code feature extraction method according to an embodiment of the present invention.

Fig. 2 is a schematic diagram of a simplified abstract syntax tree according to a second embodiment of the present invention.

Fig. 3 is a schematic diagram of a feature code segment code line arrangement process according to a second embodiment of the present invention.

Fig. 4 is a flow chart illustrating a method for evaluating false alarms of a source code defect detection tool in accordance with a third embodiment of the present invention.

Fig. 5 is a schematic structural diagram of a false alarm evaluation device for a source code defect detection tool according to a fifth embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the embodiments of the present invention will be described in detail with reference to the accompanying drawings.

Code review is an important link in the software development process and is very important for maintaining software security. Although various techniques and tools have been proposed to detect code defects, reducing the false negative and false positive rates of static analysis tools remains an active problem. Artificial intelligence, which sits at the intersection of statistics and computer science, is now widely applied in many fields and plays an increasingly important supporting role in scientific research and applications. Combining artificial intelligence with code detection brings a new approach to evaluating the false alarms of detection tools. The invention provides an artificial-intelligence-based false alarm evaluation system for code detection tools. The system scans and analyzes source code samples, extracts a set of code feature vectors, trains an artificial intelligence model on that set, and uses the model's predictions to refine the detection report of an existing tool and remove false alarms. It thereby offers a way to explore the application of artificial intelligence in the security field, compensates for the shortcomings of static code detection tools, and saves labor cost.

Example one

Because the input to the neural network model of the invention is a vector formed from feature code segments, and the feature code segments are obtained by feature extraction from the source code file, this embodiment first explains how each feature code segment in the source code file is obtained.

This embodiment first preprocesses the source code file; it then performs data flow analysis, with each sensitive point as the entry point, to obtain a feature code segment that retains the key code lines; finally, it post-processes the source code file to obtain the number of each feature code segment, the path of the source code file and the line number of the sensitive point, forming a feature code file that contains the feature code segments.

A schematic flow diagram of a method for extracting source code features provided in an embodiment of the present invention is shown in fig. 1, and includes the following steps:

Step 11, preprocessing the source code file, specifically including normalizing the source code file and expanding macro definitions.

In source code written in different programming languages, statements and character strings may be written across lines, null values may take different forms, and support for unary operators may differ. In C and C++, developers may also define macros to stand in for portions of code. To remove the influence of these cases on code analysis, the source code is normalized in advance: statements and strings written across lines are merged, null values are unified, unary operators are converted (e.g., a++ to a = a + 1), and macro definitions are expanded.
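The normalization described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the patent's implementation: the function names (`strip_comments`, `expand_increments`, `merge_split_statements`, `normalize`) are hypothetical, the rules are plain regular expressions rather than a real C front end, macro expansion and null-value unification are omitted, and the unbalanced-parenthesis test for a statement written across lines is only a heuristic (it ignores parentheses and comment markers inside string literals). Merged lines leave a blank line behind so that later line numbers do not shift.

```python
import re


def strip_comments(code: str) -> str:
    """Drop /* ... */ and // comments, keeping newlines so line numbers are stable."""
    code = re.sub(r"/\*.*?\*/", lambda m: "\n" * m.group(0).count("\n"),
                  code, flags=re.DOTALL)
    return re.sub(r"//[^\n]*", "", code)


def expand_increments(code: str) -> str:
    """Rewrite standalone `x++;` and `x--;` as `x = x + 1;` and `x = x - 1;`."""
    code = re.sub(r"\b([A-Za-z_]\w*)\s*\+\+\s*;", r"\1 = \1 + 1;", code)
    return re.sub(r"\b([A-Za-z_]\w*)\s*--\s*;", r"\1 = \1 - 1;", code)


def merge_split_statements(code: str) -> str:
    """Join a line with an unbalanced '(' to the following line(s)."""
    merged, pending, consumed = [], "", 0
    for line in code.split("\n"):
        current = f"{pending} {line.strip()}" if pending else line
        if current.count("(") > current.count(")"):
            pending, consumed = current, consumed + 1
            continue
        merged.append(current)
        merged.extend([""] * consumed)   # placeholders keep later line numbers
        pending, consumed = "", 0
    if pending:                          # input ended inside an open statement
        merged.append(pending)
        merged.extend([""] * (consumed - 1))
    return "\n".join(merged)


def normalize(code: str) -> str:
    """Apply the three heuristic passes in order."""
    return merge_split_statements(expand_increments(strip_comments(code)))
```

Keeping the line count unchanged is a deliberate choice in this sketch: the later steps report sensitive points by line number, so the normalized text should still line up with the original file.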

Step 12, performing data flow analysis according to each sensitive point in the source code file to obtain a feature code segment containing feature information of the source code sensitive point.

The method specifically comprises the following steps:

S121, determining each sensitive point in the source code file.

The source code refers to uncompiled text code files, such as .c, .cpp and .java files, written by developers according to a programming language specification during software development. The programming languages supported for source code files include Java, C and C++.

The sensitive points are divided into three types, including sensitive function call, array sensitive operation or pointer sensitive operation; wherein the sensitive function comprises a library function or an API function; array sensitive operations include access to or assignment of array elements; pointer sensitive operations include access to or assignment of a pointer.

The sensitive point is a heuristic concept: it can be regarded as the "center" of a vulnerability or defect, a position where defects easily occur. Research and analysis of defective code samples show that a large number of vulnerabilities and defects are associated with improper use of certain library functions or API functions and with improper operations on arrays and pointers. For vulnerabilities caused by incorrect use of library or API functions, the sensitive point is the library or API function call; for vulnerabilities caused by incorrect use of arrays, the sensitive point is the operation on the array. A single kind of vulnerability or defect may have several kinds of sensitive points; for example, buffer errors may be associated with library or API function calls and with array and pointer operations. Conversely, the same kind of sensitive point may appear in several kinds of vulnerabilities; for example, both buffer error and resource management error vulnerabilities contain library or API function call sensitive points. The method takes three types of code behavior (sensitive function calls, pointer sensitive operations and array sensitive operations) as sensitive points and uses them as entry points for extracting the features of the code.

A sensitive function is a library function or API function that easily causes security problems when used improperly by a program developer. Library functions are predefined functions, provided by a compiler or development tool, that can be called in a source program; they include the library functions specified by the C/C++ language standards, the JDK (Java Development Kit), and Java library functions provided by enterprises and organizations. API functions are predefined functions provided by the operating system to program developers. Sensitive operations are operations on arrays or pointers that easily cause security problems, including access to and assignment of array elements and pointers.

Wherein, step S121 specifically includes:

S1211, generating an abstract syntax tree from the source code file;

An abstract syntax tree is an abstract representation of the syntactic structure of source code. It represents the syntactic structure of the programming language in the form of a tree, in which each node represents a construct in the source code.

S1212, analyzing the abstract syntax tree and recording the position of each sensitive point in the source code file to form a sensitive point list; and storing the function name and arguments of each call to a user-defined function, the function name and parameters of each user-defined function, and the calling relations among the functions.
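As a rough illustration of the sensitive point list, the sketch below records only the first of the three sensitive point types, sensitive function calls, together with the identifiers found in their arguments. It is a simplification under stated assumptions: it scans lines with regular expressions instead of walking an abstract syntax tree, the function set `malloc`/`free`/`strcpy` is just the subset named in embodiment two, and all Python names (`SensitivePoint`, `find_sensitive_calls`, `argument_text`, `NOT_VARIABLES`) are hypothetical. Array and pointer operations are left out because recognizing them reliably needs the declaration information that a syntax tree provides.

```python
import re
from typing import List, NamedTuple

SENSITIVE_FUNCTIONS = ("malloc", "free", "strcpy")   # illustrative subset only
NOT_VARIABLES = {"sizeof", "char", "int", "long", "unsigned", "void", "NULL"}

CALL_RE = re.compile(r"\b(%s)\s*\(" % "|".join(SENSITIVE_FUNCTIONS))
IDENT_RE = re.compile(r"[A-Za-z_]\w*")


class SensitivePoint(NamedTuple):
    line: int                      # 1-based line number in the source file
    function: str                  # name of the sensitive function called
    initial_variables: List[str]   # identifiers appearing in the arguments


def argument_text(text: str, open_index: int) -> str:
    """Return the text between the '(' at open_index and its matching ')'."""
    depth = 0
    for i in range(open_index, len(text)):
        if text[i] == "(":
            depth += 1
        elif text[i] == ")":
            depth -= 1
            if depth == 0:
                return text[open_index + 1:i]
    return text[open_index + 1:]   # the call continues on a later line


def find_sensitive_calls(source: str) -> List[SensitivePoint]:
    """List every call to a sensitive function with its candidate variables."""
    points = []
    for number, text in enumerate(source.splitlines(), start=1):
        for match in CALL_RE.finditer(text):
            args = argument_text(text, match.end() - 1)
            names = [n for n in IDENT_RE.findall(args) if n not in NOT_VARIABLES]
            points.append(SensitivePoint(number, match.group(1), names))
    return points
```

Each recorded point carries its line number, which is what the later post-processing step needs in order to map a feature code segment back to the source file.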

S122, for each sensitive point, tracking the data flow starting from the initial sensitive variable of that sensitive point, acquiring the other sensitive variables, and obtaining the set of code lines semantically related to each sensitive variable.

For each sensitive point, the initial sensitive variable is the first sensitive variable added to that sensitive point's sensitive variable list. The sensitive variable list includes a plurality of sensitive variables.

When the sensitive point is a sensitive function call, the initial sensitive variable is a parameter of the sensitive function call;

when the sensitive point is an array sensitive operation, the initial sensitive variable is the array;

when the sensitive point is a pointer sensitive operation, the initial sensitive variable is the pointer.

The method specifically comprises the following steps:

S1221, traversing the sensitive point list in a loop, acquiring the initial sensitive variable of each sensitive point, adding it to the sensitive variable list, tracking the data flow of the initial sensitive variable by analyzing the abstract syntax tree, and recording the set of code lines semantically related to the initial sensitive variable;

S1222, iteratively tracking other sensitive variables with the initial sensitive variable as the starting point, wherein, in the process of tracking the data flow of the initial sensitive variable,

when a first variable is transferred to a sensitive variable in a code line of the code line set, the first variable is added to the sensitive variable list as a first sensitive variable, the data flow of the first sensitive variable is tracked by analyzing the abstract syntax tree, and the set of code lines semantically related to the first sensitive variable is recorded;

or, according to the calling relations among the functions, the function names and arguments of calls to user-defined functions, and the function names and parameters of the user-defined functions, a second sensitive variable is determined and added to the sensitive variable list, the data flow of the second sensitive variable is tracked by analyzing the abstract syntax tree, and the set of code lines semantically related to the second sensitive variable is recorded.

Determining the second sensitive variable according to the calling relations among the functions, the function names and arguments of calls to user-defined functions, and the function names and parameters of the user-defined functions specifically comprises the following steps:

determining, according to the calling relations among the functions, that the function invoked by a user-defined function call is the same function as a given user-defined function;

and, given that the argument passed in that call is an actual parameter and is a sensitive variable, and that the parameter in the definition of the same user-defined function is a formal parameter, determining that the formal parameter corresponding to the actual parameter is the second sensitive variable.

And then, a recursive processing method is adopted to enter the definition of the called function, sensitive variables are continuously tracked, the sensitive variables are added into a sensitive variable list, and code line sets related to the sensitive variables in a semantic mode are recorded.

That is, each sensitive point in the sensitive point list corresponds to one sensitive variable list, and that list includes a plurality of sensitive variables, which are not limited to the first sensitive variable and the second sensitive variable.
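The mapping from sensitive actual parameters to formal parameters, and the recursive entry into called functions, might look like the following sketch. The data layout is an assumption made for illustration: `definitions` maps each user-defined function name to its formal parameter names, and `calls` lists `(caller, callee, actual arguments)` triples; neither structure nor the name `propagate_into_callees` comes from the patent. The sketch covers only propagation across calls and does not track data flow inside a function body.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple

Call = Tuple[str, str, List[str]]   # (caller, callee, actual arguments)


def propagate_into_callees(
    definitions: Dict[str, List[str]],
    calls: List[Call],
    function: str,
    sensitive: Iterable[str],
    result: Optional[Dict[str, Set[str]]] = None,
) -> Dict[str, Set[str]]:
    """Collect, per function, the variables that become sensitive through calls."""
    if result is None:
        result = {}
    known = result.setdefault(function, set())
    new = set(sensitive) - known
    if not new:                      # nothing new: stops recursive call cycles
        return result
    known |= new
    for caller, callee, arguments in calls:
        if caller != function or callee not in definitions:
            continue
        # A formal parameter is sensitive when its actual argument is sensitive.
        formals = [formal
                   for actual, formal in zip(arguments, definitions[callee])
                   if actual in known]
        if formals:
            propagate_into_callees(definitions, calls, callee, formals, result)
    return result
```

For the call described in embodiment two, where `stonesoup_handle_taint` passes `stonesoup_buffer` to `stonesoup_process_buffer`, this yields `buffer_param` as a sensitive variable inside the callee.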

In the tracking step above, a first variable is "transferred" to a sensitive variable when information from that variable flows into the initial sensitive variable or into another sensitive variable; such a variable is then also a sensitive variable. Transfer occurs in the following cases: in an assignment statement (e.g., "variable = expression;", where "=" may also be a compound assignment operator such as +=, -=, *=, /=, %=, >>=, <<=, &=, ^= or |=) and in a conditional assignment statement (e.g., "variable = condition ? expression 1 : expression 2"), information is transferred from the right side of the "=" or similar operator to its left side; in a for-each loop (e.g., "for (element data type element variable : traversal object)"), information is transferred from the right side of ":" to the left side.
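The transfer rule for assignments can be sketched as a small worklist algorithm. This is a line-based approximation under stated assumptions rather than the abstract-syntax-tree analysis the text describes: "semantically related" is reduced to "the line mentions the variable", only plain and compound assignments are recognized (a conditional assignment is covered because every identifier on the right-hand side is treated as a source, while for-each loops are not handled), and the names `split_assignment` and `track_sensitive_variables` are hypothetical.

```python
import re
from typing import Dict, List, Optional, Tuple

# "=" or a compound assignment, but not "==", "!=", "<=" or ">=".
ASSIGN_OP = re.compile(r"(?<![=!<>])(?:<<|>>|[-+*/%&^|])?=(?!=)")
IDENT = re.compile(r"[A-Za-z_]\w*")


def split_assignment(text: str) -> Optional[Tuple[str, List[str]]]:
    """Return (target, source identifiers) if the line is an assignment."""
    match = ASSIGN_OP.search(text)
    if match is None:
        return None
    left = re.sub(r"\[[^\]]*\]", "", text[:match.start()])   # a[i] -> a
    targets = IDENT.findall(left)
    if not targets:
        return None
    right = text[match.end():]
    sources = [m.group(0) for m in IDENT.finditer(right)
               if not right[m.end():].lstrip().startswith("(")]  # skip callee names
    return targets[-1], sources


def track_sensitive_variables(
    lines: Dict[int, str], initial: str
) -> Tuple[List[str], List[int]]:
    """Follow data flow backwards from `initial`; return variables and lines."""
    sensitive = [initial]            # ordered sensitive variable list
    related = set()                  # line numbers mentioning a sensitive variable
    queue = [initial]
    while queue:
        variable = queue.pop(0)
        mention = re.compile(r"\b%s\b" % re.escape(variable))
        for number, text in lines.items():
            if not mention.search(text):
                continue
            related.add(number)
            flow = split_assignment(text)
            if flow and flow[0] == variable:
                for source in flow[1]:        # right-hand side flows into target
                    if source not in sensitive:
                        sensitive.append(source)
                        queue.append(source)
    return sensitive, sorted(related)
```

Because each variable enters the queue at most once, the loop terminates even when assignments form a cycle. Type names that appear in casts on the right-hand side would be picked up as sources here; a real implementation would rely on the syntax tree to exclude them.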

S123, arranging the acquired code line sets according to their order in the source code file to obtain a feature code segment containing feature information of the source code sensitive point.

When there is a calling relation between functions, after the acquired code line sets are arranged according to their order in the source code file, the method further includes: splicing the code lines according to the execution order of the functions.

Step 13, post-processing the source code file to obtain the number of the feature code segment, the path of the source code file and the line number of the sensitive point.

The feature code file comprises the feature code segment number, the source code file path, the line number of the sensitive point and the feature code segment itself. When the feature code file is used for false alarm evaluation of code defect detection, a defect that the code defect detection tool reported for a certain feature code segment and that is judged to be a false alarm can be located at the corresponding position in the source code through the line number.
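One possible shape for a feature code file entry is sketched below. The four fields mirror the items listed above; everything else is an assumption made for illustration, in particular the JSON-lines storage format and the names `FeatureCodeRecord`, `dump_records`, `load_records` and `locate`. The patent text does not specify a file format.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Tuple


@dataclass
class FeatureCodeRecord:
    number: int            # feature code segment number
    source_path: str       # path of the source code file
    sensitive_line: int    # line number of the sensitive point
    snippet: List[str]     # code lines of the feature code segment


def dump_records(records: List[FeatureCodeRecord]) -> str:
    """Serialize records as one JSON object per line (assumed format)."""
    return "\n".join(json.dumps(asdict(r)) for r in records)


def load_records(text: str) -> List[FeatureCodeRecord]:
    """Parse the format written by dump_records."""
    return [FeatureCodeRecord(**json.loads(line))
            for line in text.splitlines() if line.strip()]


def locate(records: List[FeatureCodeRecord], number: int) -> Tuple[str, int]:
    """Map a segment number back to (source file path, sensitive line)."""
    for record in records:
        if record.number == number:
            return record.source_path, record.sensitive_line
    raise KeyError(number)
```

With such a record, a segment that the model flags as a false alarm can be traced back to its file and line through `locate`.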

This completes the source code feature extraction method of the present invention. Each sensitive point yields one feature code segment containing the feature information of that source code sensitive point. Many software security problems are caused by defects in software source code, so code review has become an important link in the software development process and is very important for maintaining software security. A code defect generally involves several lines of code, and the source code features referred to in the invention are mainly code defect features, comprising the sensitive points and the related code lines that are closely tied to them semantically. A feature code segment captures the context information of a code sensitive point and removes irrelevant information. The source code features extracted by the method can be applied to false alarm evaluation of source code defect detection.

Example two

Based on the first embodiment, a specific scenario is given below to illustrate the present invention clearly. The C source file code.c shown in Table 1 is explained in detail as an example.

TABLE 1

The code is an example of a double free vulnerability. The stonesoup_handle_taint function receives a character string and copies it into heap memory, that is, into stonesoup_buffer, and then checks whether the first character is greater than or equal to 'a'; if so, it calls the stonesoup_process_buffer function. When the first character of the string is greater than or equal to 'a', stonesoup_buffer is freed once in the stonesoup_process_buffer function and freed again in the 25th line of the stonesoup_handle_taint function, forming a double free vulnerability.

Step 21, preprocessing the source code file.

A macro is defined in line 1 of the code, the "++" operator is used in line 14, comments appear in lines 23 and 28, and a statement is written across lines 24 and 25. To remove the influence of these non-code characters and writing habits on code analysis, the following preprocessing is required:

1) the macro used in line 17 of the code is expanded according to its definition in line 1,

2) line 14 is replaced with "stonesoup_global_variable = stonesoup_global_variable + 1",

3) the comments at the end of lines 23 and 28 are removed,

4) lines 24 and 25 are merged (keeping the line numbers of the code after line 25 unchanged).

The result of preprocessing the source code file is shown in Table 2.

TABLE 2

And step 22, analyzing the data stream according to each sensitive point in the source code file to obtain a feature code segment containing feature information of the sensitive point of the source code.

First, an abstract syntax tree of a code is generated, and fig. 2 is a simplified abstract syntax tree diagram. Under the root file, the definition (function definition) of the method includes the function stop _ process _ buffer and the parameter buffer _ param, and also includes the function stop _ handle _ target and the parameter champion _ advertisement. The call of the method (funcnameexpr) includes the function call relationship of stonesoup _ handle _ taint to stonesoup _ process _ buffer.

The sensitive spot is then recorded. In the code example, a free function, a strcpy function and a malloc function are sensitive functions which are easy to generate bugs, the access and assignment of a pointer are sensitive operations which are easy to cause security problems, so that the code example comprises sensitive function calling, the 5 th, 6 th, 16 th, 17 th, 22 th and 28 th sensitive operations of the pointer all contain sensitive points, and parameters of the sensitive functions and the pointer subjected to the sensitive operations are initial sensitive variables. Meanwhile, the function name and its parameters (function stop _ process _ buffer and parameter stop _ buffer in row 24) of the user-defined function call, the function name and parameters (function stop _ process _ buffer and parameter buffer _ param, function stop _ handle _ tint and parameter change _ address) of the user-defined function in the code, and the function call relation (stop _ handle _ tint call stop _ process _ buffer) included in the code are stored.

Take the free function on line 28 as an example. The parameter of free is taken as the initial sensitive variable, and the abstract syntax tree is analyzed to obtain the lines of the code that are semantically related to this parameter, stonesoup_buffer: lines 12, 17, 18, 22, 23, 24, 27 and 28, identified by line segment 1 in Table 3.

On line 17 of the code, information in the variable royalising_resaw is passed to the sensitive variable stonesoup_buffer, i.e. data flows from royalising_resaw to stonesoup_buffer. royalising_resaw is therefore also added to the sensitive variable list, and the lines semantically related to it are obtained: lines 13, 16 and 17. Similarly, on line 16, data flows from champion_adamello to royalising_resaw and from there to stonesoup_buffer, so champion_adamello also becomes a sensitive variable and is added to the sensitive variable list; the lines semantically related to it are lines 10, 15 and 16. The lines identified by line segment 2 in Table 3 are the lines obtained by this data flow analysis.
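The iterative tracking described above can be sketched as a small worklist loop. This is only an illustration under stated assumptions, not the patented implementation: the inputs `LINE_VARS` and `FLOWS` are hypothetical hand-written stand-ins for facts that would be read from the abstract syntax tree, and "semantically related" is approximated as "the line mentions the variable".

```python
from collections import deque

def track_sensitive_lines(initial_var, line_vars, flows):
    """Return (sensitive variables, related line numbers) for one sensitive point.

    line_vars: {line number: set of variable names mentioned on that line}
    flows:     list of (line number, source variable, destination variable)
    Both inputs are hypothetical stand-ins for facts extracted from the AST.
    """
    sensitive = {initial_var}
    lines = set()
    work = deque([initial_var])
    while work:
        var = work.popleft()
        # Approximation: a line is related to a variable if it mentions it.
        lines.update(n for n, names in line_vars.items() if var in names)
        # A variable whose value flows into a sensitive variable becomes sensitive.
        for _, src, dst in flows:
            if dst == var and src not in sensitive:
                sensitive.add(src)
                work.append(src)
    return sensitive, sorted(lines)

# Toy facts mirroring the worked example (line numbers taken from the text).
LINE_VARS = {
    10: {"champion_adamello"},
    12: {"stonesoup_buffer"},
    13: {"royalising_resaw"},
    15: {"champion_adamello"},
    16: {"royalising_resaw", "champion_adamello"},
    17: {"stonesoup_buffer", "royalising_resaw"},
    18: {"stonesoup_buffer"},
    22: {"stonesoup_buffer", "royalising_resaw"},
    23: {"stonesoup_buffer"},
    24: {"stonesoup_buffer"},
    27: {"stonesoup_buffer"},
    28: {"stonesoup_buffer"},
}
FLOWS = [
    (16, "champion_adamello", "royalising_resaw"),
    (17, "royalising_resaw", "stonesoup_buffer"),
]

variables, related = track_sensitive_lines("stonesoup_buffer", LINE_VARS, FLOWS)
```

Starting from stonesoup_buffer, the loop adds royalising_resaw and then champion_adamello to the sensitive variable set, and `related` is the union of the lines listed in the text for the three variables.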

On line 24, the code calls the user-defined function stonesoup_process_buffer with stonesoup_buffer, which is known to be a sensitive variable. Using the stored function name and parameters of the call and the function name and parameters of the user-defined function, the actual parameter stonesoup_buffer is matched to the formal parameter buffer_param in the function definition. The analysis then enters the definition of stonesoup_process_buffer, treats buffer_param as a sensitive variable, and extracts the semantically related code lines: lines 3, 5 and 6, identified by line segment 3 in Table 3.

TABLE 3

Finally, the obtained code lines are arranged in order and spliced in function execution order according to the function call relationship. Since stonesoup_handle_taint calls stonesoup_process_buffer on line 24, the code lines extracted from stonesoup_process_buffer are placed after line 24 and before line 27 of stonesoup_handle_taint. This completes the data flow analysis phase. The arrangement of the code lines of the feature code segment is illustrated in fig. 3, where the numbers are the line numbers of the code lines.
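The splicing rule can be illustrated with a minimal sketch. The function name `splice_callee_lines` is hypothetical, and the sketch assumes a single call site with one callee; the line numbers are those of the worked example.

```python
def splice_callee_lines(caller_lines, callee_lines, call_line):
    """Place the callee's extracted lines immediately after the calling line."""
    ordered = []
    for n in sorted(caller_lines):
        ordered.append(n)
        if n == call_line:
            # Execution enters the callee here, so its lines follow the call.
            ordered.extend(sorted(callee_lines))
    return ordered

caller = [10, 12, 13, 15, 16, 17, 18, 22, 23, 24, 27, 28]  # stonesoup_handle_taint
callee = [3, 5, 6]                                          # stonesoup_process_buffer
order = splice_callee_lines(caller, callee, call_line=24)
```

The resulting order places lines 3, 5 and 6 between lines 24 and 27, which is the order in which the code lines of the feature code segment appear in Table 7.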

Step 23: the source code file is post-processed to obtain the serial number of each feature code segment, the path of the source code file, and the line number of the sensitive point.

The previous step yields a sequence of line numbers containing vulnerability features. In this step, the serial number of the feature code segment, the path of the source code and the line number of the sensitive point are output together with the code lines corresponding to those line numbers; each sensitive point corresponds to one feature code segment. The code features extracted from the source code file shown in Table 1 are shown in Table 4. There are 6 sensitive points, numbered 1 to 6. Taking the 6th sensitive point as an example, the source code path is /path/to/code.c, the sensitive point is on line 28, and the feature code segment of the 6th sensitive point is shown in expanded form.

TABLE 4

In summary, the above embodiment has 6 sensitive points, and for each one a feature code segment containing source code sensitive point feature information is obtained, consisting of the sensitive point and the code lines closely related to it semantically. If the false alarm evaluation identifies a feature code segment as a false alarm, the code line on which the false alarm was raised can be determined precisely, which makes it easy for the user to check.

EXAMPLE III

This embodiment illustrates the core idea of the invention. The method has three parts: collecting known samples, training the neural network model, and using the neural network model to judge, for an unknown sample, whether the defect detection tool has misreported.

The flow diagram of the method for evaluating the false alarm of the source code defect detection tool provided by the third embodiment of the invention is shown in fig. 4, and the method comprises the following steps:

Step 41: feature code segments containing source code sensitive point feature information of known samples are collected as training data.

Techniques based on artificial intelligence typically need a large amount of data. Researchers have shown that AI techniques can work well for code defect detection, but they require sufficient and comprehensive training data, and the present invention likewise needs a sufficient data set to train the model. The invention downloads or crawls a large number of source code files from public data sets such as SARD, OWASP and NVD to construct the training sample set. These public data sets contain over a hundred thousand test cases, cover programming languages such as C, C++ and Java, and include CWE defect classes. That is, the known samples of the invention may be feature code segments obtained from the source code files of these data sets.

Step 42: a neural network model is trained by a deep learning method on the acquired feature code segments to obtain the trained neural network model, which is used to judge whether the sensitive point corresponding to a feature code segment is misreported by a defect detection tool.

Training the neural network model by a deep learning method on the acquired feature code segments to obtain the trained neural network model specifically includes the following steps:

S421: each feature code segment is symbolized and vectorized and then packaged into a vector, which is used as the i-th of the M samples of the deep learning training set and input into the neural network model to obtain a predicted value of whether the defect detection tool misreported; i indexes the M samples, and M is a natural number;

In step S421, the first step is symbolization, which converts the feature code segments into their symbolic representation. This step aims to heuristically capture some of the semantic information in the program for use in training the neural network.

Symbolizing the feature code segment includes: mapping each user-defined variable in the feature code segment, one by one, to a first symbol name; and mapping each user-defined function in the feature code segment, one by one, to a second symbol name. The first symbol names are a family of symbol names of the same form (e.g., "VAR1", "VAR2"); note that user-defined variables appearing in different feature code segments may map to the same symbol name. The second symbol names are a family of symbol names of the same form that is distinct from the first (e.g., "FUN1", "FUN2"); likewise, user-defined functions appearing in different feature code segments may map to the same symbol name.

It should be further noted that the user-defined variables in this embodiment include, but are not limited to, sensitive variables.
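As an illustration of the symbolization step, the following sketch renames identifiers with a regular expression. It is a simplification under stated assumptions: the lists of user-defined variables and functions are supplied by hand (in the method they would come from the abstract syntax tree), the function name `symbolize` is hypothetical, and identifiers inside string literals or comments are not treated specially.

```python
import re

# Matches C-style identifiers and keywords; numbers and punctuation are skipped.
IDENTIFIER = re.compile(r"[A-Za-z_]\w*")

def symbolize(lines, variables, functions):
    """Rename user-defined variables to VAR1.. and functions to FUN1.. in order."""
    mapping = {name: f"VAR{i}" for i, name in enumerate(variables, start=1)}
    mapping.update({name: f"FUN{i}" for i, name in enumerate(functions, start=1)})

    def rename(match):
        word = match.group(0)
        # Keywords and library names are not in the mapping and stay unchanged.
        return mapping.get(word, word)

    return [IDENTIFIER.sub(rename, line) for line in lines]

segment = [
    "void stonesoup_handle_taint(char*champion_adamello)",
    "char*stonesoup_buffer=0;",
]
symbolized = symbolize(
    segment,
    variables=["champion_adamello", "stonesoup_buffer"],
    functions=["stonesoup_handle_taint"],
)
```

Because the mapping is rebuilt for each feature code segment, variables from different segments can receive the same symbol name, as noted above.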

In step S421, the second step is vectorization. A machine learning algorithm can only accept input in the form of vectors, so a word vector model is chosen to map each symbol to an integer and then convert the result into a vector of fixed length. Since feature code segments may contain different numbers of symbols, the corresponding vectors may have different lengths. A vector that is too long is truncated at the beginning, and a vector that is too short is padded with 0 at the beginning. Through vectorization, the processing of text content is reduced to vector operations in a vector space, and similarity in the vector space expresses similarity in text semantics.

Vectorizing the symbolized feature code segments comprises: splitting the feature code segment containing the first symbol names and the second symbol names into a symbol sequence; and using the selected word vector model to map each symbol in the symbol sequence to an integer, which is then converted into a vector of predetermined length.
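The splitting and integer mapping can be sketched as follows. This is a minimal illustration, not the word vector model of the invention: the tokenizer regular expression and the first-occurrence vocabulary are assumptions, 0 is reserved here for padding, and the resulting integers therefore differ from those a trained word vector model would assign.

```python
import re

# Identifiers, integer literals, a few two-character operators, then any
# other single non-space character. The operator list is illustrative only.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|==|!=|>=|<=|\S")

def tokenize(lines):
    """Split symbolized code lines into one flat symbol sequence."""
    return [tok for line in lines for tok in TOKEN.findall(line)]

def to_ids(tokens, vocab=None):
    """Map each symbol to an integer; identical symbols share one integer."""
    vocab = {} if vocab is None else vocab
    ids = []
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab) + 1  # 0 is kept free for padding
        ids.append(vocab[tok])
    return ids, vocab

tokens = tokenize(["void FUN1(char*VAR1)", "char*VAR2=0;"])
ids, vocab = to_ids(tokens)
```

The two lines yield 13 symbols, and the repeated symbols "char" and "*" receive the same integers on both occurrences. Passing the same `vocab` to later calls keeps the mapping consistent across feature code segments.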

It should be noted that packaging each feature code segment into a vector after both symbolization and vectorization, as in step S421, is a preferred embodiment. A feature code segment can also be vectorized directly and input into the neural network model, but inputting vectors that have not been symbolized may degrade the training of the neural network model; symbolizing the feature code segment first reduces this adverse effect.

S422: a loss function is constructed from the predicted value and the true value of whether the defect detection tool misreported for the i-th sample, the loss function is minimized, and the network weight parameters are updated to obtain the trained neural network model.

The true value of whether the sensitive point corresponding to each sample's feature code segment was misreported by the defect detection tool can be determined by running the defect detection tool in advance and inspecting the sample's feature code segment. For data sets such as SARD and OWASP, which directly provide source code and labels, if a line contained in a feature code segment is marked as defective in the data set, or the source code from which the feature code segment was generated is marked as defective, the sensitive point of the feature code segment does have the security defect. For a data set such as NVD, which is not directly labeled, if the feature code segment is observed to contain at least one statement deleted or modified by the patch, the sensitive point of the feature code segment is confirmed to have the security defect.

If the defect detection tool judges that the sensitive point of the feature code segment has a security defect and the defect does exist, the true value of the sample is set to 0, indicating that the defect detection tool did not misreport. If the defect detection tool judges that the sensitive point has a security defect but the defect does not actually exist, the true value of the sample is set to 1, indicating that the defect detection tool raised a false alarm.

The closer the predicted values are to the true values of the samples, the smaller the loss function; when the loss function reaches its minimum, the network weight parameters are optimal and model training is complete. The trained neural network model is used in the subsequent step 43 to determine, for the feature code segment of an unknown sample, whether the sensitive point corresponding to the feature code segment is misreported by the defect detection tool.
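The labeling rule can be written down as a short sketch. The function names and the example line sets are hypothetical; the set `defective_lines` stands for whatever the data set marks as defective (or what a patch deleted or modified), and the sketch assumes every sample comes from a sensitive point that the tool reported.

```python
def has_real_defect(segment_lines, defective_lines):
    """True if any line of the feature code segment is marked defective."""
    return any(n in defective_lines for n in segment_lines)

def false_alarm_label(segment_lines, defective_lines):
    """True value of a tool-reported sample: 0 = genuine report, 1 = false alarm."""
    return 0 if has_real_defect(segment_lines, defective_lines) else 1

# Hypothetical example: the data set marks line 28 as defective.
marked = {28}
label_genuine = false_alarm_label([10, 12, 17, 24, 27, 28], marked)
label_false_alarm = false_alarm_label([3, 5, 6], marked)
```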

The loss function is minimized by gradient descent: the gradient of the loss function is computed, and the weights are updated in the direction in which the loss decreases fastest. After training, the neural network model corresponds to a set of optimal network weight parameters.
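To make the training loop concrete, the sketch below runs gradient descent on a binary cross-entropy loss. Several things here are assumptions rather than details from the text: the text does not name the loss function or the network architecture, so a single logistic unit stands in for the neural network, and the two-dimensional toy inputs stand in for the vectors formed from feature code segments.

```python
import math

def predict(w, b, x):
    """Predicted probability that the sample is a false alarm (label 1)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, xs, ys):
    """Mean binary cross-entropy between predicted and true values."""
    eps = 1e-12  # avoids log(0)
    total = 0.0
    for x, y in zip(xs, ys):
        p = predict(w, b, x)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(xs)

def train(xs, ys, lr=0.5, epochs=200):
    """Batch gradient descent: step against the gradient of the loss."""
    w, b = [0.0] * len(xs[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x, y in zip(xs, ys):
            err = predict(w, b, x) - y  # derivative of the loss w.r.t. z
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / len(xs) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / len(xs)
    return w, b

# Toy, hand-made samples; real inputs would be the segment vectors.
XS = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1], [0.1, 0.8]]
YS = [1, 0, 0, 1]
w, b = train(XS, YS)
```

On this toy data the loss after training is lower than at the all-zero starting weights, and rounding the predictions reproduces the true values.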

Step 43: a feature code segment containing source code sensitive point feature information of an unknown sample is input into the trained neural network model to determine whether the sensitive point corresponding to the feature code segment is misreported by the defect detection tool.

Further, between step 42 and step 43, after the neural network model has been trained by the deep learning method, the method further includes, in order to optimize the accuracy of the neural network model: evaluating the performance of the trained model until the performance requirement is met.

It should be noted that the feature code segments of the known samples in step 41 and of the unknown sample in step 43 are obtained according to the first embodiment. Likewise, the inputs to the neural network model in steps 42 and 43 are the vectors formed from the feature code segments.

Furthermore, once an accurate result has been obtained as to whether the sensitive point corresponding to the feature code segment of the unknown sample was misreported by the defect detection tool, the type and position of each false alarm can be displayed through a user interface, and the genuine defect reports can be highlighted. The type of a false alarm is the specific security defect reported for the sensitive point corresponding to the feature code segment, and the position is the line number of the sensitive point.

Take the C source file code.c shown in Table 1 as an example. Table 5 shows part of the code detection result report of a certain code detection tool. For the feature code segments generated from the two sensitive points on lines 6 and 28, the trained neural network model outputs "1" and "0", respectively.

Serial number	Source code path	Location of defect	Type of defect
1	/path/to/code.c	6	Double free vulnerability
2	/path/to/code.c	28	Double free vulnerability

TABLE 5

The detection tool reports double free vulnerabilities on lines 6 and 28; the false alarm evaluation indicates that the report for line 6 is a false alarm and the report for line 28 is not. The detection tool's report can be refined with the false alarm evaluation result, as shown in Table 6.

Serial number	Source code path	Location of defect	Type of defect	False alarm
1	/path/to/code.c	6	Double free vulnerability	Yes
2	/path/to/code.c	28	Double free vulnerability	No

TABLE 6
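The step from Table 5 to Table 6 amounts to joining the tool's report with the model's outputs. A minimal sketch follows; the dictionary keys, the function name `annotate_report` and the in-memory report format are hypothetical, since the text does not specify how the tool's report is stored.

```python
def annotate_report(rows, predictions):
    """Add a false-alarm column to each row of the tool's report.

    rows:        list of dicts with at least "path" and "line"
    predictions: {(path, line): model output, 1 = false alarm, 0 = genuine}
    """
    annotated = []
    for row in rows:
        flag = predictions[(row["path"], row["line"])]
        annotated.append({**row, "false_alarm": "Yes" if flag == 1 else "No"})
    return annotated

report = [
    {"id": 1, "path": "/path/to/code.c", "line": 6, "type": "Double free vulnerability"},
    {"id": 2, "path": "/path/to/code.c", "line": 28, "type": "Double free vulnerability"},
]
model_outputs = {("/path/to/code.c", 6): 1, ("/path/to/code.c", 28): 0}
refined = annotate_report(report, model_outputs)
```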

In summary, through the training process the neural network model of this embodiment can be updated dynamically as the known sample data set grows, continuously improving the efficiency and accuracy of the false alarm evaluation. After an unknown sample is input into the trained neural network model, an accurate result as to whether the sensitive point corresponding to its feature code segment was misreported by the defect detection tool is obtained quickly, which achieves the purpose of the invention.

EXAMPLE IV

Building on the third embodiment, a specific scenario is given below to illustrate the invention clearly. The feature code segment of the 6th sensitive point shown in Table 4 is taken as an example and is symbolized and vectorized. The feature code segment is shown in Table 7.

void stonesoup_handle_taint(char*champion_adamello)

char*stonesoup_buffer=0;

char*royalising_resaw=0;

if(champion_adamello!=0){

royalising_resaw=((char*)champion_adamello);

stonesoup_buffer=malloc((strlen(royalising_resaw)+1)*sizeof(char));

if(stonesoup_buffer==0){

strcpy(stonesoup_buffer,royalising_resaw);

if(stonesoup_buffer[0]>=97){

stonesoup_printf("Index of first char: %i\n",stonesoup_process_buffer(stonesoup_buffer));

char stonesoup_process_buffer(char*buffer_param){

first_char=buffer_param[0]-97;

free(buffer_param);

if(stonesoup_buffer!=0){

free(stonesoup_buffer);

TABLE 7

First, symbolization is performed.

In the feature code segment, champion_adamello, stonesoup_buffer, royalising_resaw, buffer_param and first_char are user-defined variable names, which are mapped one by one to the symbols VAR1, VAR2, VAR3, VAR4 and VAR5, as shown in Table 8.

void stonesoup_handle_taint(char*VAR1)

char*VAR2=0;

char*VAR3=0;

if(VAR1!=0){

VAR3=((char*)VAR1);

VAR2=malloc((strlen(VAR3)+1)*sizeof(char));

if(VAR2==0){

strcpy(VAR2,VAR3);

if(VAR2[0]>=97){

stonesoup_printf("Index of first char: %i\n",stonesoup_process_buffer(VAR2));

char stonesoup_process_buffer(char*VAR4){

VAR5=VAR4[0]-97;

free(VAR4);

if(VAR2!=0){

free(VAR2);

TABLE 8

In the feature code segment, stonesoup_handle_taint, stonesoup_printf and stonesoup_process_buffer are user-defined functions, whose names are mapped one by one to the symbols FUN1, FUN2 and FUN3, as shown in Table 9.

void FUN1(char*VAR1)

char*VAR2=0;

char*VAR3=0;

if(VAR1!=0){

VAR3=((char*)VAR1);

VAR2=malloc((strlen(VAR3)+1)*sizeof(char));

if(VAR2==0){

strcpy(VAR2,VAR3);

if(VAR2[0]>=97){

FUN2("Index of first char: %i\n",FUN3(VAR2));

char FUN3(char*VAR4){

VAR5=VAR4[0]-97;

free(VAR4);

if(VAR2!=0){

free(VAR2);

TABLE 9

The symbolized feature code segment is then vectorized.

The symbolized feature code segment is split into individual symbols. For example, the first two lines:

void FUN1(char*VAR1)

char*VAR2=0;

are split into: "void", "FUN1", "(", "char", "*", "VAR1", ")", "char", "*", "VAR2", "=", "0", ";".

The following lines are handled similarly. The symbolized feature code segment shown in Table 9 is decomposed into a symbol sequence of 137 symbols.

A word vector model is selected, each symbol in the symbol sequence is mapped to an integer, and the result is converted into a vector of fixed length. For the first two lines, the result of the mapping is: 38, 15, 0, 12, 6, 4, 1, 12, 6, 7, 5, 14, 2.

The following lines are handled similarly, and identical symbols are mapped to identical numbers. For example, 12 and 6 each occur twice because they are the mappings of the symbols "char" and "*", respectively. The entire symbol sequence is converted into a number vector containing 137 numbers.

If the machine learning model happens to require a vector of length 137 as input, the number sequence needs no further processing. If the model requires a vector of length 130, the number sequence is too long and is truncated at the beginning: the first seven numbers are deleted and the sequence becomes 12, 6, 7, 5, 14, 2, …. If the model requires a vector of length 140, the number sequence is too short and is padded with 0 at the beginning, becoming 0, 0, 0, 38, 15, 0, 12, 6, 4, 1, 12, 6, 7, 5, 14, 2, ….
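The length adjustment can be sketched in a few lines. The function name `fit_length` is hypothetical. Only the 13 numbers quoted for the first two lines are used here, so the target lengths 6 and 16 play the role of 130 and 140 in the text (seven numbers dropped, three zeros added).

```python
def fit_length(ids, length, pad=0):
    """Truncate or pad at the beginning so the result has exactly `length` items."""
    if len(ids) >= length:
        # Too long (or exact): keep only the last `length` numbers.
        return ids[len(ids) - length:]
    # Too short: fill the front with the padding value.
    return [pad] * (length - len(ids)) + ids

first_two_lines = [38, 15, 0, 12, 6, 4, 1, 12, 6, 7, 5, 14, 2]
truncated = fit_length(first_two_lines, 6)
padded = fit_length(first_two_lines, 16)
```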

The vectorized number vectors are input into the machine learning model, i.e. the neural network model. Conceptually, each vector is the mapping of text content into a vector space, and one vector represents one feature code segment. Vectorization reduces operations on text such as feature code segments to operations on vectors.

EXAMPLE V

Based on the same inventive concept as the above embodiments, the invention also discloses a source code defect detection tool false alarm evaluation device, whose structure is shown schematically in fig. 5. The device comprises:

the acquisition module 501 is used for acquiring a feature code segment containing feature information of a source code sensitive point of a known sample as training data;

the training module 502 is used for training a neural network model by adopting a deep learning method according to the acquired feature code segments to obtain the trained neural network model, and is used for judging whether the sensitive points corresponding to the feature code segments are misreported by a defect detection tool;

the detection module 503 inputs a feature code segment of an unknown sample, which includes feature information of a source code sensitive point, into the trained neural network model to determine whether a sensitive point corresponding to the feature code segment is misreported by the defect detection tool.

The acquisition of the feature code fragment comprises the following steps: and analyzing the data stream according to each sensitive point in the source code file to obtain a feature code segment containing feature information of the sensitive point of the source code.

The data flow analysis is performed according to each sensitive point in the source code file to obtain a feature code segment containing feature information of the sensitive point of the source code, and the method specifically includes:

determining each sensitive point in a source code file;

for each sensitive point, tracking the data stream of the initial sensitive variable according to the initial sensitive variable of each sensitive point, and acquiring other sensitive variables to obtain a code line set related to the semantics of each sensitive variable;

arranging the acquired code line sets according to the sequence in the source code file to obtain a feature code segment containing feature information of the source code sensitive points;

the sensitive point comprises a sensitive function call, an array sensitive operation or a pointer sensitive operation; the sensitive function comprises a library function or an API function; array-sensitive operations include access to or assignment of array elements; pointer sensitive operations include access to or assignment of a pointer;

when the sensitive point is a pointer sensitive operation, the initial sensitive variable is the pointer.

Before determining each sensitive point in the source code file, the source code file is preprocessed, including normalizing the source code file and expanding macro definitions;

after arranging the acquired code line sets according to the sequence in the source code file, post-processing the source code file to acquire the feature code segment number, the source code file path and the line number of the sensitive point.

The determining each sensitive point in the source code file comprises:

generating an abstract syntax tree from the source code file;

analyzing the abstract syntax tree, and recording the position of each sensitive point in a source code file to form a sensitive point list; and storing the function name and the parameter called by the user-defined function, the function name and the parameter of the user-defined function and the calling relation among the functions.

For each sensitive point, tracking the data stream of the initial sensitive variable according to the initial sensitive variable of each sensitive point, acquiring other sensitive variables, and obtaining a code line set semantically related to each sensitive variable, specifically comprising:

circularly traversing the sensitive point list, acquiring an initial sensitive variable of each sensitive point, adding the initial sensitive variable into the sensitive variable list, tracking a data stream of the initial sensitive variable by analyzing the abstract syntax tree, and recording a code line set related to the initial sensitive variable semanteme;

iteratively tracking other sensitive variables by taking the initial sensitive variable as a starting point, and in the process of tracking the data stream of the initial sensitive variable,

when a first variable exists in a code line in the code line set and is transmitted to a sensitive variable, adding the first variable serving as the first sensitive variable into a sensitive variable list, tracking a data stream of the first sensitive variable by analyzing the abstract syntax tree, and recording a code line set semantically related to the first sensitive variable;

According to the calling relation among the functions, the function name and the parameter called by the user-defined function, and the function name and the parameter of the user-defined function, determining a second sensitive variable, which specifically comprises the following steps:

according to the calling relation among the functions, the function called by the user-defined function is the same as the user-defined function;

The training module 502 trains the neural network model by a deep learning method according to the obtained feature code segments to obtain a trained neural network model, and is specifically configured to:

performing symbolization and vectorization on each feature code segment, then packaging the feature code segment into a vector, using the vector as the ith sample in M samples of a deep learning training set, inputting the ith sample into a neural network model, and obtaining a predicted value whether a defect detection tool misreports; i belongs to M, wherein M is a natural number;

and constructing a loss function according to the predicted value and the true value of whether the defect detection tool of the ith sample misreports, optimizing the minimum value of the loss function, and updating the network weight parameters to obtain the trained neural network model.

The training module 502, symbolizing the feature code segment, includes: mapping each user-defined variable in the feature code segment to a first symbol name one by one; mapping each user-defined function in the feature code segment to a second symbol name one by one;

vectorizing the tokenized feature code snippets includes: splitting the characteristic code segment comprising the first symbol name and the second symbol name to form a symbol sequence; the word vector model is selected to map each symbol in the sequence of symbols to an integer and then converted to a vector having a predetermined length.

After the training module 502 trains the neural network model by using a deep learning method, the training module is further configured to: and performing performance evaluation on the trained model until the performance requirement is met.

In conclusion, the beneficial effects of the invention are as follows:

First, it is fast

Existing methods combine the detection results of different types of tools to reduce possible false alarms, which means that several tools must each scan the software under test. The detection times of different tools can differ considerably, and even when the tools run in parallel the total completion time is bounded by the slowest one. The method of the invention only requires an initial investment of time to collect past false alarm code samples, summarize the code sensitive points and train the model; once training is complete, false alarms are evaluated quickly and no other tools are needed to scan the code again.

Second, the cost is low

Manually checking a detection tool's false alarms, or combining the results of several tools, increases labor cost and capital investment. The method of the invention filters false alarms from the results of the existing detection tool before showing them to the relevant personnel, which reduces the manual effort of judging false alarms. The user also does not need to buy additional code detection tools and can simply keep using the existing one, so the cost is lower.

Thirdly, the evaluation effect can be continuously improved

Once the rules or logic of existing methods and tools have been defined, their false alarm evaluation effect is fixed unless manpower and resources are invested in improving them. With the artificial-intelligence-based method, the trained model is derived from the constructed samples and is able to detect the false alarm patterns present in all the training samples. As samples accumulate, the set of code false alarm patterns is updated iteratively and automatically, and the detection capability of the trained model improves accordingly. When a new type of false alarm appears, existing methods cannot easily handle it, whereas the present method gains the new evaluation capability by collecting new code samples and training the model.

Fourth, the evaluation effect is not limited by the detection tool

Existing methods exploit the fact that different static source code analysis tools use different mechanisms and report accurately on the problems they are good at, and they combine the strengths of the tools to evaluate false alarms. However, they cannot guarantee that the problems the tools are good at cover all problem types, nor that the results for those problems are accurate, so false alarms for most code problems cannot be filtered effectively. With the method of the invention, the evaluation effect is not limited by the capability of the detection tool: for any code problem, the system can evaluate false alarms whether or not the existing tool is good at detecting it.

The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A source code defect detection tool false alarm evaluation method is characterized by comprising the following steps:

inputting a feature code segment containing source code sensitive point feature information of an unknown sample into a trained neural network model to determine whether a sensitive point corresponding to the feature code segment is misreported by a defect detection tool;

the method for acquiring the feature code fragment comprises the following steps: according to each sensitive point in the source code file, carrying out data flow analysis to obtain a feature code segment containing feature information of the sensitive point of the source code;

the method for analyzing the data stream according to each sensitive point in the source code file to obtain the feature code segment containing the feature information of the sensitive point of the source code specifically comprises the following steps:

determining each sensitive point in a source code file;

the determining each sensitive point in the source code file comprises:

generating an abstract syntax tree from the source code file;

analyzing the abstract syntax tree, and recording the position of each sensitive point in a source code file to form a sensitive point list; storing the function name and parameter called by the user-defined function, the function name and parameter of the user-defined function and the calling relation among the functions;

circularly traversing the sensitive point list, acquiring an initial sensitive variable of each sensitive point, adding the initial sensitive variable into the sensitive variable list, tracking the data stream of the initial sensitive variable by analyzing the abstract syntax tree, and recording a code line set related to the initial sensitive variable semantics;

2. The method of claim 1,

3. The method of claim 2,

before determining each sensitive point in the source code file, the method further comprises preprocessing the source code file, including normalizing the source code file and developing macro definition;

after arranging the acquired code line sets according to the sequence in the source code file, the method further comprises the step of carrying out post-processing on the source code file to acquire the feature code segment number, the source code file path and the line number of the sensitive point.

4. The method of claim 3, wherein determining the second sensitive variable according to the calling relationship among the functions, the function name and parameter called by the user-defined function, and the function name and parameter of the user-defined function comprises:

5. The method of claim 1, wherein training the neural network model by a deep learning method according to the acquired feature code segments to obtain the trained neural network model specifically comprises:

symbolizing and vectorizing each feature code segment, packaging it into a vector, using the vector as the i-th sample of the M samples in a deep learning training set, and inputting the sample into the neural network model to obtain a predicted value of whether the defect detection tool has misreported; where i = 1, ..., M, and M is a natural number;

6. The method of claim 5,

symbolizing the feature code segment comprises: mapping each user-defined variable in the feature code segment one-to-one to a first symbol name; and mapping each user-defined function in the feature code segment one-to-one to a second symbol name;

vectorizing the symbolized feature code segment comprises: splitting the feature code segment comprising the first symbol names and the second symbol names to form a symbol sequence; and using a word vector model to map each symbol in the symbol sequence to an integer, which is then converted to a vector having a predetermined length.
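The symbolization and vectorization recited in this claim can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: the symbol names `VAR1`, `FUN1`, the sets `KEYWORDS` and `LIBRARY_FUNCS`, the regular-expression tokenizer, and the functions `symbolize` and `vectorize` are all hypothetical. In place of a trained word vector model, the sketch uses a plain vocabulary index and zero padding to reach the predetermined length.

```python
import re

# Assumed, incomplete lists: names that must NOT be renamed.
KEYWORDS = {"int", "char", "return", "if", "else", "for", "while", "void"}
LIBRARY_FUNCS = {"strcpy", "malloc", "free", "memcpy"}

IDENT_RE = re.compile(r"[A-Za-z_]\w*")
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")  # identifiers, numbers, single symbols

def symbolize(code):
    """Split code into symbols, renaming user-defined variables and functions."""
    tokens = TOKEN_RE.findall(code)
    var_map, fun_map, out = {}, {}, []
    for i, tok in enumerate(tokens):
        if not IDENT_RE.fullmatch(tok) or tok in KEYWORDS or tok in LIBRARY_FUNCS:
            out.append(tok)                                  # keep as-is
        elif i + 1 < len(tokens) and tokens[i + 1] == "(":
            # identifier followed by "(" is treated as a user-defined function
            out.append(fun_map.setdefault(tok, f"FUN{len(fun_map) + 1}"))
        else:
            out.append(var_map.setdefault(tok, f"VAR{len(var_map) + 1}"))
    return out

def vectorize(symbols, vocab, length):
    """Map each symbol to an integer, then pad or truncate to a fixed length."""
    ids = [vocab.setdefault(s, len(vocab) + 1) for s in symbols]  # 0 = padding
    return (ids + [0] * length)[:length]

segment = "char buf[10]; strcpy(buf, copy_input(src));"
symbols = symbolize(segment)
vocab = {}
vector = vectorize(symbols, vocab, 20)
print(symbols)
print(vector)
```

Because every user-defined name is replaced by a positional symbol, two feature code segments that differ only in identifier names yield the same symbol sequence, which is the purpose of the one-to-one mapping.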

7. The method of claim 1, wherein after training the neural network model using the deep learning method, the method further comprises:

evaluating the performance of the trained model until the performance requirement is met.
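The claim does not state which indicators are used for the performance evaluation. As a minimal sketch, assuming a binary labeling in which 1 means "the tool's report is a false alarm" and 0 means "the report is a real defect", common classification metrics can be computed as below. The function names `evaluate` and `meets_requirement` and the threshold value are hypothetical.

```python
def evaluate(y_true, y_pred):
    """Accuracy, precision, recall and F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

def meets_requirement(metrics, min_f1=0.9):
    """Assumed stopping rule: training continues until F1 reaches min_f1."""
    return metrics["f1"] >= min_f1

# Toy labels for six feature code segments from a held-out set.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
metrics = evaluate(y_true, y_pred)
print(metrics, meets_requirement(metrics))
```

With a check of this kind, the train-and-evaluate cycle would be repeated, for example with adjusted hyperparameters or more training samples, until `meets_requirement` returns true.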

8. A source code defect detection tool false alarm assessment apparatus, the apparatus comprising:

the detection module is used for inputting a feature code segment containing source code sensitive point feature information of an unknown sample into the trained neural network model so as to determine whether a sensitive point corresponding to the feature code segment is misreported by a defect detection tool;

wherein acquiring the feature code segment specifically comprises: performing data flow analysis on each sensitive point in the source code file to obtain a feature code segment containing source code sensitive point feature information;

wherein performing data flow analysis on each sensitive point in the source code file to obtain the feature code segment containing source code sensitive point feature information specifically comprises:

determining each sensitive point in a source code file;

wherein determining each sensitive point in the source code file specifically comprises:

generating an abstract syntax tree from the source code file;

for each sensitive point, tracking the data flow of the initial sensitive variable of that sensitive point, acquiring other sensitive variables, and obtaining the set of code lines semantically related to each sensitive variable, which specifically comprises:

or determining a second sensitive variable according to the calling relation among the functions, the function name and parameter called by the user-defined function, and the function name and parameter of the user-defined function, adding the second sensitive variable to the sensitive variable list, tracking the data flow of the second sensitive variable by analyzing the abstract syntax tree, and recording the set of code lines semantically related to the second sensitive variable.