CN113221960A - Construction method and collection method of high-quality vulnerability data collection model


Publication number
CN113221960A
CN113221960A (application CN202110424826.0A; granted as CN113221960B)
Authority
CN
China
Prior art keywords
sample set
change submission
vulnerability
repository
model
Prior art date
Legal status: Granted (the legal status is an assumption and is not a legal conclusion)
Application number
CN202110424826.0A
Other languages
Chinese (zh)
Other versions
CN113221960B (en)
Inventor
房鼎益
胡飞
徐榕泽
叶贵鑫
王焕廷
汤战勇
Current and Original Assignee: Northwestern University (the listed assignees may be inaccurate)
Application filed by Northwestern University
Priority to CN202110424826.0A
Publication of CN113221960A
Application granted
Publication of CN113221960B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/214 Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 18/2411 Classification based on the proximity to a decision surface, e.g. support vector machines
    • G06F 18/24147 Distances to closest patterns, e.g. nearest-neighbour classification
    • G06F 18/2415 Classification based on parametric or probabilistic models, e.g. based on likelihood ratio
    • G06F 18/24323 Tree-organised classifiers
    • G06F 40/284 Natural language analysis; recognition of textual entities; lexical analysis, e.g. tokenisation or collocates
    • G06N 3/08 Neural networks; learning methods
    • Y02P 90/30 Computing systems specially adapted for manufacturing

Abstract

The invention discloses a construction method and a collection method for a high-quality vulnerability data collection model. Change submission files are collected as a sample set, and the sample set is labeled to obtain a positive sample set and a negative sample set; numerical features, change submission description information and code blocks are then extracted from the change submission files in the sample set. The expert integration model combines several strong classifiers, avoiding the weaknesses of any single machine learning model and improving accuracy in vulnerability identification. By combining the expert integration model with a conformal evaluation classifier, i.e. combining probabilistic learning with statistical evaluation, the method markedly improves the accuracy and reliability of the expert integration model's predictions, lowers the false positive rate, addresses the false alarm problem of some existing vulnerability data collection models, and offers a feasible solution to the shortage of high-quality source code vulnerability data.

Description

Construction method and collection method of high-quality vulnerability data collection model
Technical Field
The invention belongs to the field of code auditing, relates to source code feature extraction technology, and particularly relates to a construction method and a collection method for a high-quality vulnerability data collection model.
Background
Traditional deep learning algorithms usually need millions of vulnerability samples to learn an effective model; only with a sufficient amount of training data can deep learning realize its potential for learning latent vulnerability patterns. However, because real-world high-quality vulnerability samples are very scarce, the lack of training data limits the quality of vulnerability detection models. Some previous approaches use program generation to synthesize vulnerability samples and thereby alleviate the shortage of training samples, but program generation has two distinct disadvantages: on the one hand, the output is constrained by the grammar, template or model used to generate the programs; on the other hand, generated programs cannot reflect the diversity and evolving patterns of real-world programs.
Some standard vulnerability databases, such as the SARD and SAMATE data sets, do exist and make it convenient for security researchers to analyze and apply known vulnerabilities. But the defect samples in these standard vulnerability data sets still have many problems: first, the sample size is small; typically only a few hundred bugs of each type exist, which is insufficient to support the training of a high-quality vulnerability detection model. Second, the sample types are narrow; the standard vulnerability libraries contain only a few of the conditions that can cause a vulnerability. Third, the standard vulnerability data sets are updated slowly.
Given the above deficiencies of the standard vulnerability databases, we use GitHub as the platform for data collection. As the largest code hosting platform in the world, GitHub provides a rich source of data. If a high-quality vulnerability collection model can be constructed to automatically obtain high-quality defect samples from GitHub, the data shortage problem can be solved.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to provide a construction method and a collection method of a high-quality vulnerability data collection model, and solve the technical problem of lack of high-quality source code vulnerability data in the prior art.
In order to solve the technical problems, the invention adopts the following technical scheme:
a method for constructing a high-quality vulnerability data collection model is carried out according to the following steps:
step 1, collecting a change submission file as a sample set, and performing label processing on the sample set to obtain a positive sample set and a negative sample set;
the change submission files comprise vulnerable change submission files submitted to the CVE and vulnerable change submission files not submitted to the CVE;
step 2, extracting numerical characteristics of the change submission files in the sample set, extracting change submission description information of the change submission files in the sample set, extracting code blocks in the change submission files in the sample set, and storing the code blocks in a code set;
the numerical features comprise star count, total number of commits, total number of releases, number of repository contributors, contribution rate and total number of branches;
the code block comprises a deleted line code and an added line code in a modified file;
step 3, performing a first vectorization on the change submission description information to obtain feature vector 1, performing a second vectorization on the code set to obtain feature vector 2, clipping and tiling the numerical features, feature vector 1 and feature vector 2 into a one-dimensional feature, and taking this one-dimensional feature as the digital feature vector of the change submission file;
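The clip-and-tile step can be sketched as follows; the 2000-dimensional target comes from the vectors shown in FIG. 4, and zero-padding for short inputs is an assumption, since the text does not define the tiling rule:

```python
def clip_and_tile(numeric, feature_vec_1, feature_vec_2, target_dim=2000):
    """Splice the numerical features with feature vectors 1 and 2 into one
    flat list, then clip to target_dim, or pad with zeros (an assumed rule)
    when the spliced features are shorter than target_dim."""
    flat = list(numeric) + list(feature_vec_1) + list(feature_vec_2)
    if len(flat) >= target_dim:
        return flat[:target_dim]                 # clip long inputs
    return flat + [0.0] * (target_dim - len(flat))  # pad short inputs
```

For example, 6 numerical features plus two 1500-dimensional vectors are clipped down to exactly 2000 dimensions.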
step 4, using the digital feature vector as a training set, wherein the training set comprises a positive training set and a negative training set;
step 5, constructing an expert integration model and a conformal evaluation classifier;
step 5.1, training each single classifier with the collected change submission files and determining the optimal hyper-parameters of each classifier; combining the five classifiers into an expert integration model, with soft voting selected as the voting mechanism of the expert integration model;
the five classifiers are a support vector machine, random forest, k-nearest neighbors, logistic regression and gradient boosting;
step 5.2, inputting the positive training set and the negative training set into the expert integration model for training to obtain the trained expert integration model;
step 5.3, setting the threshold C of the conformal evaluation classifier, attaching the trained expert integration model to the conformal evaluation classifier to obtain the constructed conformal evaluation classifier, and inputting the digital feature vectors into the constructed conformal evaluation classifier for training to obtain the trained conformal evaluation classifier.
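As a sketch of step 5.1, the five classifiers can be combined into a soft-voting ensemble. scikit-learn is assumed as the implementation library, and the hyper-parameters below are illustrative defaults, not the tuned optima the patent refers to:

```python
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def build_expert_model():
    """Soft-voting ensemble of the five classifiers named in step 5.1.
    Hyper-parameters are placeholders, not the patent's tuned values."""
    return VotingClassifier(
        estimators=[
            ("svm", SVC(probability=True)),  # probability=True enables soft voting
            ("rf", RandomForestClassifier(n_estimators=100)),
            ("knn", KNeighborsClassifier(n_neighbors=5)),
            ("lr", LogisticRegression(max_iter=1000)),
            ("gb", GradientBoostingClassifier()),
        ],
        voting="soft",  # average the predicted probabilities, then take argmax
    )
```

Soft voting averages the five classifiers' class probabilities rather than counting hard votes, which lets a confident classifier outweigh several uncertain ones.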
Specifically, the process of acquiring and labeling the data is performed according to the following steps:
step 1.1, crawling repository names through the API (application programming interface) provided by the hosting platform, and selecting the names of Java repositories with a star count higher than 10 according to the repositories' star rankings on the hosting platform;
step 1.2, using the repository names from step 1.1 and the hosting platform API's download rule for change submission files, splicing each repository name with the keywords to form download links, and downloading high-quality vulnerability samples through these links as the first original sample set;
the keywords are FIX CVE and the CVE ID;
step 1.3, checking whether the change submission description information in the first original sample set contains a CVE ID that maps to the CVE standard vulnerability library, and whether the change submission file repairs the vulnerability described by that CVE ID; if not, discarding the data; if so, retaining the data as the first sample set;
step 1.4, crawling repository names through the API provided by the hosting platform, and selecting the names of the repositories ranked in the top 1000 according to the repositories' star rankings on the hosting platform;
step 1.5, using the repository names from step 1.4 and the hosting platform API's download rule for change submission files, splicing each repository name with a regular expression to form download links, and downloading high-quality vulnerability samples through these links as the second sample set;
step 1.6, taking the first sample set and the second sample set together as the sample set for model training.
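The splicing in steps 1.2 and 1.5 can be sketched as a small URL builder. The public GitHub commit-search API format is assumed here; the exact download rule and endpoint used by the model are not specified in the text, so the URL shape below is illustrative only:

```python
from urllib.parse import quote_plus

def build_download_links(repo_full_name, patterns):
    """Splice a repository name with keywords (step 1.2) or a regular
    expression (step 1.5) into commit-search links; illustrative
    GitHub search-API format, not the patent's actual download rule."""
    base = "https://api.github.com/search/commits"
    return [f"{base}?q=repo:{quote_plus(repo_full_name)}+{quote_plus(p)}"
            for p in patterns]
```

For example, `build_download_links("apache/struts", ["FIX CVE"])` yields one search link per keyword, with the repository name and keyword URL-encoded.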
Specifically, the label processing is to manually determine whether each change submission file in the sample set is defect data; if so, the change submission file goes into the positive sample set, and if not, into the negative sample set for model training.
Specifically, the first vectorization is performed according to the following steps:
step 4.1, extracting the highly relevant numerical features of the repository to which each change submission file in the training set belongs;
step 4.2, dividing the change submission description information extracted in step 3 into a series of tokens by lexical analysis, discarding tokens consisting of Chinese word descriptions, using a tool to generate a 50-dimensional vector for each token, and splicing the vectors together to obtain feature vector 1.
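Step 4.2 can be sketched as follows. The embedding table stands in for the word2vec model named in the embodiment, and the tokenization regex and the zero-vector fallback for unseen tokens are assumptions:

```python
import re

def vectorize_description(message, embeddings, dim=50):
    """Split a change submission description into tokens, drop tokens
    containing non-ASCII (e.g. Chinese) characters, map each remaining
    token to its 50-dimensional vector, and splice the vectors.
    embeddings: token -> 50-dim list, standing in for a word2vec model."""
    tokens = [t for t in re.findall(r"\w+", message) if t.isascii()]
    vec = []
    for t in tokens:
        vec.extend(embeddings.get(t, [0.0] * dim))  # zeros for unseen tokens
    return vec
```

A description with two retained tokens thus yields a 100-dimensional spliced vector.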
Specifically, the second vectorization is performed according to the following steps:
step 4.2, dividing the code set extracted in step 3 into a series of tokens by lexical analysis, using a tool to generate a 50-dimensional vector for each token, and splicing the vectors together to obtain feature vector 2;
the tokens comprise identifiers, keywords, operators and symbols.
A method for collecting high-quality vulnerability data is carried out according to the following steps:
step one, collecting change submission files and processing them according to steps 2 and 3 in claim 1 to obtain digital feature vectors for evaluation;
step two, inputting the digital feature vectors for evaluation into the trained expert integration model and the trained conformal evaluation classifier; the expert integration model predicts on the digital feature vectors and gives a prediction result, and the conformal evaluation classifier scores the prediction result. When the score is higher than 1-C, the corresponding change submission file is kept as high-quality vulnerability data; when the score is below 1-C, the corresponding change submission file is discarded.
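The retention rule in step two reduces to a single threshold comparison; a sketch, with C = 0.3 taken from the value used later in the embodiment:

```python
def keep_sample(score, C=0.3):
    """Keep a change submission file as high-quality vulnerability data
    only when the conformal evaluation score exceeds 1 - C."""
    return score > 1 - C
```

With C = 0.3 the threshold is 0.7, so a sample scored 0.8 is kept and one scored 0.6 is discarded.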
Compared with the prior art, the invention has the following beneficial technical effects:
The expert integration model combines several strong classifiers, avoiding the weaknesses of any single machine learning model and improving accuracy in vulnerability identification. By combining the expert integration model with the conformal evaluation classifier, i.e. combining probabilistic learning with statistical evaluation, the method markedly improves the accuracy and reliability of the expert integration model's predictions, lowers the false positive rate, addresses the false alarm problem of some existing vulnerability data collection models, and offers a feasible solution to the shortage of high-quality source code vulnerability data.
Drawings
FIG. 1 is a flow chart of the present invention.
FIG. 2 is an example change submission file.
FIG. 3 is a code set extracted from a patch.
FIG. 4 shows the three features vectorized, then clipped and tiled into a 2000-dimensional digital feature vector.
FIG. 5 is a diagram of the expert integration model architecture.
FIG. 6 is the regular expression used in the step of filtering vulnerability-related change submission files.
FIG. 7 is the comparison experiment between the integrated classifier and single classifiers.
FIG. 8 is the comparison experiment of vulnerability data collection methods.
FIG. 9 is the data improvement experiment based on the VulDeePecker method.
FIG. 10 is the data improvement experiment based on the μVulDeePecker method.
FIG. 11 is the conformal evaluation classifier comparison experiment.
The present invention will be explained in further detail with reference to examples.
Detailed Description
It should be noted that in this application SARD stands for Software Assurance Reference Dataset.
It should be noted that in this application SAMATE stands for Software Assurance Metrics And Tool Evaluation.
It should be noted that in this application CVE stands for Common Vulnerabilities and Exposures.
It should be noted that in this application the CVE ID is the Common Vulnerabilities and Exposures identifier, i.e. the number of an entry in the common vulnerabilities and exposures library.
It should be noted that in this application GitHub is a hosting platform for open source and private software projects.
It should be noted that in this application a change submission file refers to one code submission and includes the code repair submission and an information description of the code repair.
It should be noted that in this application the change submission description information refers to the modification description information in one code submission.
It should be noted that in this application API stands for Application Programming Interface.
It should be noted that in this application word2vec is a model for generating word vectors; these models are shallow two-layer neural networks trained to reconstruct the linguistic contexts of words.
It should be noted that in this application LR stands for Logistic Regression.
It should be noted that in this application RF stands for Random Forest.
It should be noted that in this application GB stands for Gradient Boosting.
It should be noted that in this application SVM stands for Support Vector Machine.
It should be noted that in this application KNN stands for K-Nearest Neighbor.
The following embodiments of the present invention are provided. It should be noted that the present invention is not limited to the following embodiments, and all equivalent changes made on the basis of the technical solutions of the present application fall within the protection scope of the present invention.
Example 1:
the embodiment provides a method for constructing a high-quality vulnerability data collection model, which is carried out according to the following steps:
step 1, collecting a change submission file as a sample set, and performing label processing on the sample set to obtain a positive sample set and a negative sample set;
in this embodiment, the collected change submission file is as shown in FIG. 2;
the change submission files comprise vulnerable change submission files submitted to the CVE and vulnerable change submission files not submitted to the CVE;
step 2, extracting numerical characteristics of the change submission files in the sample set, extracting change submission description information of the change submission files in the sample set, extracting code blocks in the change submission files in the sample set, and storing the code blocks in a code set;
the numerical features comprise star count, total number of commits, total number of releases, number of repository contributors, contribution rate and total number of branches;
in this embodiment, the code blocks in the change submission file of FIG. 2 are extracted from the sample set and stored in the code set shown in FIG. 3.
The code block comprises a deleted line code and an added line code in a modified file;
step 3, performing a first vectorization on the change submission description information to obtain feature vector 1, performing a second vectorization on the code set to obtain feature vector 2, clipping and tiling the numerical features, feature vector 1 and feature vector 2 into a one-dimensional feature, and taking this one-dimensional feature as the digital feature vector of the change submission file;
in this embodiment, the first vectorization is performed on the change submission description information to obtain feature vector 1, the second vectorization is performed on the code set to obtain feature vector 2, and the numerical features, feature vector 1 and feature vector 2 are clipped and tiled into a one-dimensional feature that serves as the digital feature vector of the change submission file; the digital feature vector obtained by this vectorization is shown in FIG. 4.
Step 4, using the digital feature vector as a training set, wherein the training set comprises a positive training set and a negative training set;
step 5, constructing an expert integration model and a conformal evaluation classifier;
step 5.1, training each single classifier with the collected change submission files and determining the optimal hyper-parameters of each classifier; combining the five classifiers into an expert integration model, with soft voting selected as the voting mechanism of the expert integration model;
the five classifiers are a support vector machine, random forest, k-nearest neighbors, logistic regression and gradient boosting;
In this embodiment, a single classifier is first trained with the collected change submission files, and the optimal hyper-parameters of each classifier are determined; the five classifiers are then combined into an expert integration model. The conformal evaluation classifier gives an evaluation based on the calculated likelihood; if the evaluation is higher than the set threshold, the prediction result of the expert integration model is confirmed. Finally the expert integration model gives the prediction result, where 0 means the change submission file is not defect data and 1 means it is defect data. The digital feature vectors of the positive and negative training sets are input into the expert integration model and the conformal evaluation classifier to obtain the prediction results. The process is shown in FIG. 5.
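The accept/reject step of FIG. 5 can be sketched as a small gate. The names are illustrative; the ensemble's class probabilities and the conformal evaluation score are assumed to be computed upstream:

```python
def gated_prediction(class_probs, conformal_score, C=0.3):
    """Confirm the expert integration model's prediction only when the
    conformal evaluation score clears the 1 - C threshold.
    class_probs: (p_not_defect, p_defect) from soft voting.
    Returns 1 (defect data), 0 (not defect data), or None (rejected)."""
    if conformal_score <= 1 - C:
        return None  # evaluation below threshold: reject the prediction
    return 1 if class_probs[1] >= class_probs[0] else 0
```

Returning None rather than a forced 0/1 is what lets the collection pipeline discard low-confidence samples instead of mislabeling them.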
Step 5.2, inputting the positive training set and the negative training set into an expert integrated model for training to obtain a trained expert integrated model;
and 5.3, setting a threshold C of the shape preserving evaluation classifier, inputting the trained expert integration model into the shape preserving evaluation classifier to obtain the constructed shape preserving evaluation classifier, and inputting the digital feature vector into the constructed shape preserving evaluation classifier for training to obtain the trained shape preserving evaluation classifier.
As a preferred solution of this embodiment, the data acquisition process is performed according to the following steps:
step 1.1, crawling repository names through the API (application programming interface) provided by the hosting platform, and selecting the names of Java repositories with a star count higher than 10 according to the repositories' star rankings on the hosting platform;
step 1.2, using the repository names from step 1.1 and the hosting platform API's download rule for change submission files, splicing each repository name with the keywords to form download links, and downloading high-quality vulnerability samples through these links as the first original sample set;
the keywords are FIX CVE and the CVE ID;
step 1.3, checking whether the change submission description information in the first original sample set contains a CVE ID that maps to the CVE standard vulnerability library, and whether the change submission file repairs the vulnerability described by that CVE ID; if not, discarding the data; if so, retaining the data as the first sample set;
step 1.4, crawling repository names through the API provided by the hosting platform, and selecting the names of the repositories ranked in the top 1000 according to the repositories' star rankings on the hosting platform;
step 1.5, using the repository names from step 1.4 and the hosting platform API's download rule for change submission files, splicing each repository name with a regular expression to form download links, and downloading high-quality vulnerability samples through these links as the second sample set;
in this embodiment, the regular expression is shown in FIG. 6; the hosting platform employed is GitHub.
step 1.6, taking the first sample set and the second sample set together as the sample set for model training;
As a preferred scheme of this embodiment, the label processing is to manually determine whether each change submission file in the sample set is defect data; if so, the change submission file goes into the positive sample set, and if not, into the negative sample set for model training.
As a preferred solution of this embodiment, the first vectorization is performed according to the following steps:
dividing the change submission description information extracted in step 3 into a series of tokens by lexical analysis, discarding tokens consisting of Chinese word descriptions, using a tool to generate a 50-dimensional vector for each token, and splicing the vectors together to obtain feature vector 1;
in this embodiment, the tool used is word2vec.
As a preferred solution of this embodiment, the second vectorization is performed according to the following steps:
dividing the code set extracted in step 3 into a series of tokens by lexical analysis, using a tool to generate a 50-dimensional vector for each token, and splicing the vectors together to obtain feature vector 2;
the tokens comprise identifiers, keywords, operators and symbols.
Example 2:
the embodiment provides a method for collecting high-quality vulnerability data, which is carried out according to the following steps:
step one, collecting change submission files and processing them according to steps 2 and 3 in claim 1 to obtain digital feature vectors for evaluation;
step two, inputting the digital feature vectors for evaluation into the trained expert integration model and the trained conformal evaluation classifier; the expert integration model predicts on the digital feature vectors and gives a prediction result, and the conformal evaluation classifier scores the prediction result. When the score is higher than 1-C, the corresponding change submission file is kept as high-quality vulnerability data; when the score is below 1-C, the corresponding change submission file is discarded.
In this embodiment, a conformal evaluation classifier is applied to calculate a confidence value pv for the class y of an input digital feature vector x. To calculate pv for each x, the calibration score of the weakly supervised model h's prediction on x is computed with a measure function A(x, y, h) specific to each weakly supervised model.

To calculate the confidence value pv, 10% of the digital feature vectors are held out as a calibration set, and the calibration score of each weakly supervised model on each of the n input samples in the calibration set is computed:

$$\alpha_i = A(x_i, y_i, h), \quad i = 1, \dots, n$$

Given a new input sample $x_{n+1}$, its calibration score is computed with the same measure function A(x, y, h):

$$\alpha_{n+1} = A(x_{n+1}, y, h)$$

The pv of sample $x_{n+1}$ is then:

$$pv = \frac{\mathrm{COUNT}\{\, i : y_i = y_p,\ \alpha_i \ge \alpha_{n+1} \,\} + 1}{n + 1}$$

wherein:
pv represents the evaluation score of the conformal evaluation classifier on the input x;
COUNT represents the number of samples satisfying the condition;
i indexes the i-th sample;
$y_i$ is the label of the i-th sample;
$y_p$ is the sample label category;
x is a digital feature vector;
$x_{n+1}$ is the (n+1)-th digital feature vector;
C is the set confidence threshold, set to 0.3 in this embodiment;
A(x, y, h) is the prediction function of the weakly supervised model on an input digital feature vector x.

If the calculated pv value is close to the lower bound 1/(n+1), the prediction is unreliable; if it is close to the upper bound 1, the prediction is reliable. In this embodiment, only predictions with pv greater than 1-C are accepted, and the expert integration model performs best when C is set to 0.3.
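The p-value computation above can be sketched directly; the label-conditional filtering is folded in by passing only the calibration scores of the matching class, and the measure function A is assumed to be given:

```python
def conformal_p_value(calibration_scores, new_score):
    """Conformal p-value of a new sample, given the calibration scores
    of its class: the fraction of scores at least as nonconforming as
    the new one, counting the new sample itself, out of n + 1.
    The result lies in [1/(n+1), 1]."""
    n = len(calibration_scores)
    count = sum(1 for a in calibration_scores if a >= new_score)
    return (count + 1) / (n + 1)
```

With calibration scores [0.1, 0.2, 0.9] and a new score of 0.5, one calibration score is at least 0.5, so pv = 2/4 = 0.5.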
Experimental example 1:
Following the above technical scheme, the expert integration model and 5 vulnerability data crawling models (VCCFinder, ZvD, VulPecker, Zhou et al. and Sabetta et al.) are each trained on the same data set. The experimental results are shown in FIG. 7. The horizontal axis is the FPR (false positive rate) threshold and the vertical axis is the TPR (true positive rate). Under the same false positive rate threshold, the true positive rate of the expert integration model is higher than that of the other 5 data collection schemes; under the same true positive rate threshold, the false positive rate of the expert integration model is lower than that of the other 5 data collection schemes. The expert integration model thus still achieves a high true positive rate at a low false positive rate.
Experimental example 2:
Following the above technical scheme, the expert integration model is compared with single-classifier methods: the expert integration model and 5 independent classifiers (LR, RF, GB, SVM and KNN) are trained on the same data set. The experimental results are shown in FIG. 8. The horizontal axis is the FPR (false positive rate) threshold and the vertical axis is the TPR (true positive rate). Under the same false positive rate threshold, the true positive rate of the expert integration model is higher than that of the 5 independent classifiers; under the same true positive rate threshold, its false positive rate is lower than that of the 5 independent classifiers. The expert integration model thus still achieves a high true positive rate at a low false positive rate.
Measured example 3:
Following the above technical solution, the vulnerability collection method of the invention is compared with 3 existing vulnerability collection methods (ZvD, ZHOU et al. and SABETTA et al.): the data obtained with the invention and with the other 3 methods are applied to two vulnerability detection methods (VulDeePecker and μVulDeePecker). The experimental results for the VulDeePecker-based detection method are shown in FIG. 9, which reports the detection effect of VulDeePecker when trained with data from the different collection methods, measured by four evaluation indexes: Accuracy, Precision, Recall and F1 score. The horizontal axis shows the four evaluation indexes and the vertical axis the improvement over the reference model; negative values indicate degraded performance. On all four indexes, the data collected by the method of the invention improves the detection effect of VulDeePecker by 10.2% to 12%, whereas the data collected by the other three vulnerability data collection schemes improves it by at most 6%. The improvement achieved by the collection method of the invention is thus nearly twice that of the other three methods, showing that it is superior to the existing vulnerability collection methods.
The experimental results for the μVulDeePecker-based vulnerability detection method are shown in FIG. 10, which reports the detection effect of μVulDeePecker when trained with data from the different collection methods; all four evaluation indexes (Accuracy, Precision, Recall and F1 score) improve by different margins. The horizontal axis shows the four evaluation indexes and the vertical axis the improvement over the reference model; negative values indicate degraded performance. On all four indexes, the data collected by the invention improves the detection effect of μVulDeePecker by 12% to 14%, whereas the data collected by the other three vulnerability data collection schemes improves it by at most 7%. The improvement achieved by the collection method of the invention is again nearly twice that of the other three methods, so the collection method of the invention is superior to the existing vulnerability collection methods.
Measured example 4:
Following the above technical scheme, a plain integrated classifier and the expert integration model with the conformal evaluation classifier were trained on the same data set. The experimental results are shown in FIG. 11, where the horizontal axis represents the experiment number and the vertical axis the accuracy. The accuracy of the expert integration model with the conformal evaluation classifier is 4% to 12% higher than that of the integrated classifier alone and reaches up to 91%, showing that the conformal evaluation classifier plays a key role in the model's classification.

Claims (6)

1. A method for constructing a high-quality vulnerability data collection model is characterized by comprising the following steps:
step 1, collecting change submission files as a sample set, and labeling the sample set to obtain a positive sample set and a negative sample set;
the change submission files comprise vulnerability change submission files that have been submitted to the CVE and vulnerability change submission files that have not been submitted to the CVE;
step 2, extracting the numerical features of the change submission files in the sample set, extracting the change submission description information of the change submission files, and extracting the code blocks in the change submission files and storing them in a code set;
the numerical features comprise the star rating, total number of commits, total number of releases, number of repository contributors, contribution rate and total number of branches;
the code blocks comprise the deleted lines and added lines of code in the modified files;
step 3, performing a first vectorization on the change submission description information to obtain feature vector 1, performing a second vectorization on the code set to obtain feature vector 2, and cutting and tiling the numerical features, feature vector 1 and feature vector 2 into a one-dimensional feature, which serves as the digital feature vector of the change submission file;
step 4, using the digital feature vectors as a training set, wherein the training set comprises a positive training set and a negative training set;
step 5, constructing an expert integration model and a conformal evaluation classifier;
step 5.1, training each single classifier with the collected change submission files and determining the optimal hyperparameters of each classifier; combining the five classifiers into an expert integration model whose voting mechanism is soft voting;
the five classifiers are a support vector machine, random forest, k-nearest neighbors, logistic regression and gradient boosting;
step 5.2, inputting the positive training set and the negative training set into the expert integration model for training to obtain a trained expert integration model;
and step 5.3, setting the threshold C of the conformal evaluation classifier, applying the conformal evaluation classifier to the trained expert integration model to obtain the constructed conformal evaluation classifier, and inputting the digital feature vectors into the constructed conformal evaluation classifier for training to obtain the trained conformal evaluation classifier.
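The five-classifier soft-voting ensemble of step 5.1 can be sketched with scikit-learn as below. This is a minimal illustration: the hyperparameters shown are library defaults, not the tuned optimal hyperparameters the claim refers to.

```python
# Soft-voting ensemble of the five classifiers named in step 5.1:
# SVM, random forest, k-NN, logistic regression, gradient boosting.
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

experts = VotingClassifier(
    estimators=[
        ("svm", SVC(probability=True)),  # soft voting needs probabilities
        ("rf", RandomForestClassifier()),
        ("knn", KNeighborsClassifier()),
        ("lr", LogisticRegression(max_iter=1000)),
        ("gb", GradientBoostingClassifier()),
    ],
    voting="soft",  # average the predicted class probabilities
)
```

After `experts.fit(X_train, y_train)`, `experts.predict_proba(X)` gives the averaged class probabilities that soft voting is based on.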
2. The method for constructing a high-quality vulnerability data collection model according to claim 1, wherein the data acquisition and labeling are performed according to the following steps:
step 1.1, crawling repository names by means of the API provided by the hosting platform, and selecting the names of Java repositories with a star rating higher than 10 according to the star-rating ranking of repositories on the hosting platform;
step 1.2, using the repository names from step 1.1 and the hosting platform API's download rules for change submission files, splicing each repository name with the keywords to form a download link, and downloading high-quality vulnerability samples through the download links as a first original sample set;
the keywords are FIX CVE and CVE ID;
step 1.3, judging whether the change submission description information in the first original sample set contains a CVE ID mapped to the CVE standard vulnerability library and whether the change submission file fixes the vulnerability described by that CVE ID; if so, the data is retained as the first sample set; otherwise, the data is discarded;
step 1.4, crawling repository names by means of the API provided by the hosting platform, and selecting the names of repositories ranked in the top 1000 according to the star-rating ranking of repositories on the hosting platform;
step 1.5, using the repository names obtained in step 1.4 and the hosting platform API's download rules for change submission files, splicing each repository name with the regular expression to form a download link, and downloading high-quality vulnerability samples through the download links as a second sample set;
and step 1.6, taking the first sample set and the second sample set together as the sample set for model training.
3. The method for constructing a high-quality vulnerability data collection model according to claim 1, wherein the labeling consists of manually judging whether each change submission file in the sample set is defect data; if so, it is placed in the positive sample set, and if not, in the negative sample set for model training.
4. The method for constructing a high-quality vulnerability data collection model according to claim 1, wherein the first vectorization is performed according to the following steps:
step 4.1, extracting the highly relevant numerical features of the repositories to which the change submission files in the training set belong;
and step 4.2, dividing the change submission description information extracted in step 3 into a series of tokens by lexical analysis, discarding tokens containing Chinese-word descriptions, using a tool to generate for each token a corresponding 50-dimensional vector, and splicing the vectors to obtain feature vector 1.
5. The method for constructing a high-quality vulnerability data collection model according to claim 1, wherein the second vectorization is performed according to the following steps:
step 4.2, dividing the code set extracted in step 3 into a series of tokens by lexical analysis, using a tool to generate for each token a corresponding 50-dimensional vector, and splicing the vectors to obtain feature vector 2;
the tokens comprise identifiers, keywords, operators and symbols.
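The tokenize-then-embed step of claims 4 and 5 can be sketched as below. The claims only say "a tool" produces the 50-dimensional vectors (word2vec-style embeddings are typical); the hash-seeded random vectors here are a deterministic stand-in for that unnamed tool, purely for illustration.

```python
# Sketch: split text into tokens (identifiers, keywords, operators,
# symbols), map each token to a 50-dimensional vector, concatenate.
import hashlib
import re

import numpy as np

def token_vector(token, dim=50):
    # Seed a RNG from the token's hash so equal tokens get equal vectors;
    # a real embedding tool would be trained on the corpus instead.
    seed = int.from_bytes(hashlib.sha256(token.encode()).digest()[:4], "big")
    return np.random.default_rng(seed).standard_normal(dim)

def feature_vector(code_text, dim=50):
    # Identifiers/keywords, or any single non-space symbol (operators etc.).
    tokens = re.findall(r"[A-Za-z_]\w*|[^\sA-Za-z_]", code_text)
    return np.concatenate([token_vector(t, dim) for t in tokens])
```

Splicing per-token vectors this way yields a vector of length 50 times the token count, matching the concatenation described in the claims.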
6. A method for collecting high-quality vulnerability data is characterized by comprising the following steps:
step one, collecting change submission files, and processing them according to steps 2 and 3 in claim 1 to obtain digital feature vectors for evaluation;
and step two, inputting the digital feature vectors for evaluation into the trained expert integration model and the trained conformal evaluation classifier; the expert integration model predicts the digital feature vectors and gives a prediction result, and the conformal evaluation classifier scores the prediction result; when the score is higher than 1-C, the corresponding change submission file is retained as high-quality vulnerability data; when the score is lower than 1-C, the corresponding change submission file is discarded.
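The collection step of claim 6 can be sketched end to end as below. The prediction function, calibration scores and feature vectors are placeholders: any trained ensemble exposing a positive-class probability, and a held-out calibration set, are assumed.

```python
# Sketch of claim 6, step two: the ensemble scores each change-submission
# feature vector, a conformal-style credibility is derived from the
# calibration scores, and only submissions scoring above 1 - C are kept.
def collect(vectors, predict_vuln_proba, calibration_scores, C=0.3):
    kept = []
    n = len(calibration_scores)
    for v in vectors:
        score = predict_vuln_proba(v)  # ensemble's positive-class probability
        alpha = 1.0 - score            # nonconformity: low probability -> high alpha
        pv = (sum(1 for s in calibration_scores if s >= alpha) + 1) / (n + 1)
        if pv > 1 - C:                 # keep as high-quality vulnerability data
            kept.append(v)
    return kept
```

Submissions the ensemble is confident about score a pv near 1 and are kept; low-credibility ones fall below 1-C and are discarded.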
CN202110424826.0A 2021-04-20 2021-04-20 Construction method and collection method of high-quality vulnerability data collection model Active CN113221960B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110424826.0A CN113221960B (en) 2021-04-20 2021-04-20 Construction method and collection method of high-quality vulnerability data collection model


Publications (2)

Publication Number Publication Date
CN113221960A true CN113221960A (en) 2021-08-06
CN113221960B CN113221960B (en) 2023-04-18

Family

ID=77088249

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110424826.0A Active CN113221960B (en) 2021-04-20 2021-04-20 Construction method and collection method of high-quality vulnerability data collection model

Country Status (1)

Country Link
CN (1) CN113221960B (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109886020A (en) * 2019-01-24 2019-06-14 燕山大学 Software vulnerability automatic classification method based on deep neural network
CN110197286A (en) * 2019-05-10 2019-09-03 武汉理工大学 A kind of Active Learning classification method based on mixed Gauss model and sparse Bayesian
US20200320254A1 (en) * 2019-04-03 2020-10-08 RELX Inc. Systems and Methods for Dynamically Displaying a User Interface of an Evaluation System Processing Textual Data


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ULF JOHANSSON: "Model-agnostic nonconformity functions for conformal classification", 2017 International Joint Conference on Neural Networks (IJCNN) *
REN Cairong: "Urban PM2.5 Concentration Prediction Based on Parallel Random Forest", China Master's Theses Full-text Database *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114120120A (en) * 2021-11-25 2022-03-01 广东电网有限责任公司 Method, device, equipment and medium for detecting illegal building based on remote sensing image
CN115048316A (en) * 2022-08-15 2022-09-13 中国电子科技集团公司第三十研究所 Semi-supervised software code defect detection method and device
CN116302043A (en) * 2023-05-25 2023-06-23 深圳市明源云科技有限公司 Code maintenance problem detection method and device, electronic equipment and readable storage medium
CN116302043B (en) * 2023-05-25 2023-10-10 深圳市明源云科技有限公司 Code maintenance problem detection method and device, electronic equipment and readable storage medium

Also Published As

Publication number Publication date
CN113221960B (en) 2023-04-18

Similar Documents

Publication Publication Date Title
CN113221960B (en) Construction method and collection method of high-quality vulnerability data collection model
CN109697162B (en) Software defect automatic detection method based on open source code library
CN111309912B (en) Text classification method, apparatus, computer device and storage medium
CN109829155B (en) Keyword determination method, automatic scoring method, device, equipment and medium
CN109635108B (en) Man-machine interaction based remote supervision entity relationship extraction method
CN111459799A (en) Software defect detection model establishing and detecting method and system based on Github
CN110162478B (en) Defect code path positioning method based on defect report
Kobayashi et al. Towards an NLP-based log template generation algorithm for system log analysis
CN107862327B (en) Security defect identification system and method based on multiple features
Zhang et al. Large-scale empirical study of important features indicative of discovered vulnerabilities to assess application security
CN111460250A (en) Image data cleaning method, image data cleaning device, image data cleaning medium, and electronic apparatus
CN107368526A (en) A kind of data processing method and device
US11385988B2 (en) System and method to improve results of a static code analysis based on the probability of a true error
CN111860981B (en) Enterprise national industry category prediction method and system based on LSTM deep learning
CN112800232A (en) Big data based case automatic classification and optimization method and training set correction method
CN112035345A (en) Mixed depth defect prediction method based on code segment analysis
CN116523284A (en) Automatic evaluation method and system for business operation flow based on machine learning
CN114416573A (en) Defect analysis method, device, equipment and medium for application program
CN113626241A (en) Application program exception handling method, device, equipment and storage medium
CN114238768A (en) Information pushing method and device, computer equipment and storage medium
CN110941713A (en) Self-optimization financial information plate classification method based on topic model
CN110888977A (en) Text classification method and device, computer equipment and storage medium
CN115686995A (en) Data monitoring processing method and device
CN113722230A (en) Integrated assessment method and device for vulnerability mining capability of fuzzy test tool
CN113448860A (en) Test case analysis method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant