CN113742731A

CN113742731A - Data collection method for code vulnerability intelligent detection

Info

Publication number: CN113742731A
Application number: CN202010487163.2A
Authority: CN
Inventors: 房春荣; 钱美缘; 葛修婷; 王旭; 曹振飞; 李彤宇
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2020-05-27
Filing date: 2020-05-27
Publication date: 2021-12-03

Abstract

A data collection method for intelligent code vulnerability detection first constructs an initial code vulnerability data set, then uses a trained machine learning model to process unlabeled code, and expands the data set according to the results of model labeling and manual labeling. The initial data set is constructed by combining the results of a code vulnerability detection tool with the judgment of testers. The machine learning model is trained on this initial data set. For unlabeled code, whether a reported vulnerability is a false alarm is determined by combining the judgment of the machine learning model with the judgment of the testers, and the data set is expanded with the false alarms so identified.

Description

Data collection method for code vulnerability intelligent detection

Technical Field

The invention belongs to the field of software engineering, and in particular relates to the application of code vulnerability false-alarm detection and machine learning methods in software engineering, for constructing and collecting a code vulnerability data set.

Background

As modern software products grow more complex, manual testing alone can no longer detect software vulnerabilities quickly enough. Traditional vulnerability discovery techniques are theoretically mature: vulnerabilities can be mined from code by means of model checking, fuzz testing, symbolic execution, and binary comparison. These mature techniques have been largely automated and can scan the software under test for specific types of vulnerabilities with minimal human intervention. However, automated code vulnerability detection tools also face problems, such as:

1) Code vulnerability detection tools must trade off detection efficiency against accuracy. Whether a tool analyzes the syntax or the execution paths of the code, it must construct a complex analysis model, which easily leads to an excessively large solving scale or to path explosion. Because of these limitations of vulnerability detection technology, accurate analysis requires considerable analysis time, which is often unacceptable in practical applications.

2) Code vulnerability detection tools rely on rules preset by human experts, so the detected vulnerabilities are often limited to certain specific types. Manually defined vulnerability rules are highly subjective and can hardly cover every situation, and imperfect rules cause both missed detections and false alarms.

3) The detection capability of a code vulnerability detection tool is fixed. For programs with a low security level, most detected vulnerabilities are real vulnerabilities. However, as vulnerabilities are fixed, the security of the program improves and the proportion of false alarms rises. If the capability of the tool is not improved, most of the developers' time is wasted on manually checking and marking invalid vulnerabilities.

In summary, missed detections and false alarms are very common when automatic detection tools are used. The problem of excessive false alarms can be addressed by improving the model. With continued breakthroughs in machine learning and deep learning, machine learning can help a code vulnerability detection tool improve its detection accuracy and reduce its false alarm ratio. However, the accuracy of a machine learning model depends heavily on the size of the data set, and overfitting may occur when too little data is available for training.

The existing code vulnerability data set construction and collection technology has the following problems:

1) Vulnerability data sets differ in collection method, quality, and format. A universal and efficient data set is currently lacking, so during research a data set can only be constructed automatically by web crawling.

2) Existing techniques do not account for the continuous growth of code vulnerabilities during use, so the data set cannot be updated and the detection model therefore cannot be effectively improved.

Disclosure of Invention

In view of the defects of the prior art, the technical problem to be solved by the invention is as follows: in intelligent code vulnerability detection, the original data set cannot be expanded, which limits the accuracy of vulnerability false-alarm detection.

To solve this problem, the invention adopts the following technical scheme: a data set expansion method for intelligent code vulnerability detection, comprising the following steps:

1) sending the original code to an automatic vulnerability detection tool for detection;

2) delivering the original code to a tester for vulnerability marking;

3) comparing the detection result with the tester's marks, determining which vulnerabilities are false alarms, and constructing an initial code vulnerability data set;

4) using a machine learning model to learn the relation between the vulnerability code and whether a false alarm occurs;

5) processing the unmarked code with the trained machine learning model;

6) submitting the vulnerability code identified by the model as a false alarm to a tester for auditing;

7) adding the vulnerability and the auditing result to the previously constructed code vulnerability data set.

With this technical scheme, the invention provides a method for expanding the code vulnerability data set. The original data set can be expanded continuously while the false-alarm detection model is in use, the model can then be retrained iteratively, and higher accuracy is obtained in later detection.
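
As an illustration only, steps 1) to 3) could be sketched in Python as follows. The patent does not specify a data format, so the Finding and Record classes, the location strings, and the build_initial_dataset function are hypothetical names chosen for this sketch, and the sample findings are invented. The three record fields follow the data items listed in claim 2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One vulnerability reported by a detection tool (hypothetical schema)."""
    snippet: str    # source code segment flagged by the tool
    vuln_type: str  # vulnerability type taken from the detection report
    location: str   # e.g. "a.c:10"; this format is an assumption


@dataclass(frozen=True)
class Record:
    """One data set row: code segment, vulnerability type, false-alarm flag."""
    snippet: str
    vuln_type: str
    false_alarm: bool


def build_initial_dataset(findings, tester_confirmed):
    """Label each tool finding by comparing it with the tester's marks.

    tester_confirmed is the set of locations the tester marked as real
    vulnerabilities. A finding the tester did not mark is a false alarm.
    """
    return [
        Record(f.snippet, f.vuln_type, f.location not in tester_confirmed)
        for f in findings
    ]


# Invented example: the tool reports two findings, the tester confirms one.
findings = [
    Finding("strcpy(buf, user_input);", "buffer-overflow", "a.c:10"),
    Finding("strncpy(buf, s, sizeof(buf) - 1);", "buffer-overflow", "a.c:20"),
]
dataset = build_initial_dataset(findings, tester_confirmed={"a.c:10"})
```

In this sketch the second finding is labeled as a false alarm because the tester did not mark its location. Matching findings to tester marks by location is a simplification; a real system would need a more robust way to align the two.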

Drawings

FIG. 1 is an overall flow chart of the present invention.

Detailed Description

To explain the technical content, objectives, and results of the invention in detail, specific embodiments are described below:

1) Automatic detection is performed with a code vulnerability detection tool, and the vulnerability types and vulnerability locations are obtained from the detection reports.

2) The code segments that may contain vulnerabilities are extracted, and a tester judges whether each one contains a vulnerability.

3) The detection result of the tool is compared with the judgment of the tester. If the two agree, the detection is considered correct; if a vulnerability detected by the tool is not marked as a vulnerability by the tester, it is considered a false alarm. The vulnerability source code segments and the judgment of whether each is a false alarm are then combined to construct an initial code vulnerability false-alarm data set.

4) A machine learning algorithm is trained on this data set to learn the relation between the vulnerability source code text and whether a false alarm occurs, yielding a trained model.

5) After the code vulnerability detection tool has detected vulnerabilities in other code, the trained model processes the vulnerability code indicated in the detection report. If the model identifies a code segment as a false alarm, the segment is handed to a tester for judgment; if the tester judges that it contains no vulnerability, it is labeled as a false alarm and added to the data set.
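
Steps 4) and 5) can be illustrated with the following minimal Python sketch. The patent does not name a learning algorithm or a feature representation, so the choice of a multinomial naive Bayes classifier over code tokens is an assumption made here for brevity. The names tokenize, FalseAlarmClassifier, and expand_dataset are hypothetical, the training snippets are invented, and the tester is replaced by a stand-in callable.

```python
import math
import re
from collections import Counter


def tokenize(code):
    """Split code text into identifier and number tokens (a crude choice)."""
    return re.findall(r"[A-Za-z_]\w*|\d+", code)


class FalseAlarmClassifier:
    """Multinomial naive Bayes over code tokens with add-one smoothing."""

    def fit(self, snippets, labels):
        self.class_counts = Counter(labels)
        self.token_counts = {label: Counter() for label in self.class_counts}
        for snippet, label in zip(snippets, labels):
            self.token_counts[label].update(tokenize(snippet))
        self.vocab = set().union(*self.token_counts.values())
        return self

    def _log_score(self, snippet, label):
        counts = self.token_counts[label]
        denom = sum(counts.values()) + len(self.vocab)
        score = math.log(self.class_counts[label] / sum(self.class_counts.values()))
        for token in tokenize(snippet):
            if token in self.vocab:  # skip tokens never seen in training
                score += math.log((counts[token] + 1) / denom)
        return score

    def predict(self, snippet):
        """Return True when the snippet is predicted to be a false alarm."""
        return max(self.class_counts, key=lambda c: self._log_score(snippet, c))


def expand_dataset(dataset, reported, model, tester_finds_vulnerability):
    """Step 5: audit model-flagged snippets and append confirmed false alarms.

    dataset: list of (snippet, false_alarm) pairs, extended in place.
    reported: snippets flagged by the detection tool in new code.
    tester_finds_vulnerability: callable standing in for the manual audit.
    """
    added = []
    for snippet in reported:
        if model.predict(snippet) and not tester_finds_vulnerability(snippet):
            dataset.append((snippet, True))
            added.append(snippet)
    return added


# Toy initial data set of (snippet, false_alarm) pairs, invented for illustration.
dataset = [
    ("strcpy(buf, user_input);", False),
    ("gets(line);", False),
    ("strncpy(buf, s, sizeof(buf) - 1);", True),
    ("snprintf(out, sizeof(out), fmt, v);", True),
]
model = FalseAlarmClassifier().fit(*zip(*dataset))  # step 4: train

reported = ["strcpy(dst, user_input);", "strncpy(dst, s, sizeof(dst) - 1);"]
# The stand-in tester below judges that no audited snippet is a real vulnerability.
added = expand_dataset(dataset, reported, model, lambda snippet: False)
```

Only snippets that the model flags as false alarms reach the tester, which is where the method saves manual effort. After the data set grows, calling fit again on the enlarged data set would correspond to the iterative retraining described in the disclosure.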

Claims

1. A data collection method for intelligent code vulnerability detection, characterized in that an initial code vulnerability data set is constructed by combining the results of a code vulnerability detection tool with the judgment of a tester; a machine learning model for false-alarm judgment is then trained on the initial code vulnerability data set; and finally, vulnerabilities that are false alarms are determined by combining the judgment of the machine learning model with the judgment of the tester, and are added to the code vulnerability data set.

2. The data collection method for intelligent code vulnerability detection according to claim 1, characterized in that the results of the code vulnerability detection tool are combined with the judgment of the tester as follows: first, the detection reports of several different code vulnerability detection tools are integrated as the final tool detection result; then, the tester judges each vulnerability detected by the tools, and if the tester does not judge it to be a vulnerability, a false-alarm result is recorded; the data items of the data set include: vulnerability code segment, vulnerability type, and whether it is a false alarm.

3. The data collection method for intelligent code vulnerability detection according to claim 1, characterized in that the code vulnerability data set of claim 2 is used to train a machine learning model for false-alarm judgment, and after training is completed, the model can be used to predict, for a newly given code segment, whether it is a false-alarm vulnerability.

4. The data collection method for intelligent code vulnerability detection according to claim 1, characterized in that vulnerabilities that are false alarms are determined by combining the judgment of the machine learning model with the judgment of the tester as follows: each code segment that may contain a vulnerability is first input into the machine learning model of claim 3, which judges whether it is a false alarm produced during vulnerability detection; if the model judges it to be a false alarm, the code segment is delivered to a tester for inspection; and if the inspection finds that the code segment does not in fact contain a vulnerability, the code segment is added to the initial data set.