CN113221960A - Construction method and collection method of high-quality vulnerability data collection model


Publication number
CN113221960A
CN113221960A (application CN202110424826.0A; granted as CN113221960B)
Authority
CN
China
Prior art keywords
sample set
change submission
vulnerability
repository
model
Prior art date
Legal status: Granted (the legal status is an assumption and is not a legal conclusion)
Application number
CN202110424826.0A
Other languages
Chinese (zh)
Other versions
CN113221960B (en)
Inventor
房鼎益
胡飞
徐榕泽
叶贵鑫
王焕廷
汤战勇
Current and Original Assignee: Northwestern University (the listed assignees may be inaccurate)
Application filed by Northwestern University
Priority to CN202110424826.0A
Publication of CN113221960A
Application granted
Publication of CN113221960B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/214 Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 18/2411 Classification based on the proximity to a decision surface, e.g. support vector machines
    • G06F 18/24147 Distances to closest patterns, e.g. nearest-neighbour classification
    • G06F 18/2415 Classification based on parametric or probabilistic models, e.g. based on likelihood ratio
    • G06F 18/24323 Tree-organised classifiers
    • G06F 40/284 Natural language analysis; recognition of textual entities; lexical analysis, e.g. tokenisation or collocates
    • G06N 3/08 Neural networks; learning methods
    • Y02P 90/30 Computing systems specially adapted for manufacturing

Abstract

The invention discloses a construction method and a collection method for a high-quality vulnerability data collection model. Change submission files are collected as a sample set, and the sample set is labeled to obtain a positive sample set and a negative sample set; numerical features, change submission description information and code blocks are then extracted from the change submission files in the sample set. The expert integration model combines several strong classifiers, avoiding the weaknesses of any single machine learning model and improving accuracy in vulnerability identification. By combining the expert integration model with a conformal evaluation classifier, i.e. combining probabilistic learning with statistical evaluation, the method markedly improves the accuracy and reliability of the expert integration model's predictions, lowers the false positive rate, addresses the false alarm problem of some existing vulnerability data collection models, and offers a feasible solution to the shortage of high-quality source code vulnerability data.

Description

Construction method and collection method of high-quality vulnerability data collection model
Technical Field
The invention belongs to the field of code auditing, relates to source code feature extraction technology, and particularly relates to a construction method and a collection method for a high-quality vulnerability data collection model.
Background
Traditional deep learning algorithms usually need millions of vulnerability samples to learn an effective model; only with a sufficient amount of training data can deep learning realize its potential for learning latent vulnerability patterns. However, because real-world high-quality vulnerability samples are very scarce, the lack of training data limits the quality of vulnerability detection models. Some previous approaches use program generation to synthesize vulnerability samples and thereby alleviate the shortage of training samples, but program generation has two distinct disadvantages: on the one hand, the output is constrained by the grammar, template or model used to generate the programs; on the other hand, generated programs cannot reflect the diversity and evolving patterns of real-world programs.
Some standard vulnerability databases, such as the SARD and SAMATE data sets, do exist and make it convenient for security researchers to analyze and apply known vulnerabilities. But the defect samples in these standard vulnerability data sets still have many problems: first, the sample size is small; typically only a few hundred bugs of each type exist, which is insufficient to support the training of a high-quality vulnerability detection model. Second, the sample types are narrow; the standard vulnerability libraries contain only a few of the conditions that can cause a vulnerability. Third, the standard vulnerability data sets are updated slowly.
Given the above deficiencies of the standard vulnerability databases, we use GitHub as the platform for data collection. As the largest code hosting platform in the world, GitHub provides a rich source of data. If a high-quality vulnerability collection model can be constructed to automatically obtain high-quality defect samples from GitHub, the data shortage problem can be solved.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to provide a construction method and a collection method of a high-quality vulnerability data collection model, and solve the technical problem of lack of high-quality source code vulnerability data in the prior art.
In order to solve the technical problems, the invention adopts the following technical scheme:
a method for constructing a high-quality vulnerability data collection model is carried out according to the following steps:
step 1, collecting a change submission file as a sample set, and performing label processing on the sample set to obtain a positive sample set and a negative sample set;
the change submission files comprise vulnerable change submission files submitted to the CVE and vulnerable change submission files not submitted to the CVE;
step 2, extracting numerical characteristics of the change submission files in the sample set, extracting change submission description information of the change submission files in the sample set, extracting code blocks in the change submission files in the sample set, and storing the code blocks in a code set;
the numerical features comprise star count, total number of commits, total number of releases, number of repository contributors, contribution rate and total number of branches;
the code block comprises a deleted line code and an added line code in a modified file;
step 3, performing a first vectorization on the change submission description information to obtain feature vector 1, performing a second vectorization on the code set to obtain feature vector 2, clipping and tiling the numerical features, feature vector 1 and feature vector 2 into a one-dimensional feature, and taking this one-dimensional feature as the digital feature vector of the change submission file;
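The clip-and-tile step can be sketched as follows; the 2000-dimensional target comes from the vectors shown in FIG. 4, and zero-padding for short inputs is an assumption, since the text does not define the tiling rule:

```python
def clip_and_tile(numeric, feature_vec_1, feature_vec_2, target_dim=2000):
    """Splice the numerical features with feature vectors 1 and 2 into one
    flat list, then clip to target_dim, or pad with zeros (an assumed rule)
    when the spliced features are shorter than target_dim."""
    flat = list(numeric) + list(feature_vec_1) + list(feature_vec_2)
    if len(flat) >= target_dim:
        return flat[:target_dim]                 # clip long inputs
    return flat + [0.0] * (target_dim - len(flat))  # pad short inputs
```

For example, 6 numerical features plus two 1500-dimensional vectors are clipped down to exactly 2000 dimensions.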
step 4, using the digital feature vector as a training set, wherein the training set comprises a positive training set and a negative training set;
step 5, constructing an expert integration model and a conformal evaluation classifier;
step 5.1, training each single classifier with the collected change submission files and determining the optimal hyper-parameters of each classifier; combining the five classifiers into an expert integration model, with soft voting selected as the voting mechanism of the expert integration model;
the five classifiers are a support vector machine, random forest, k-nearest neighbors, logistic regression and gradient boosting;
step 5.2, inputting the positive training set and the negative training set into the expert integration model for training to obtain the trained expert integration model;
step 5.3, setting the threshold C of the conformal evaluation classifier, attaching the trained expert integration model to the conformal evaluation classifier to obtain the constructed conformal evaluation classifier, and inputting the digital feature vectors into the constructed conformal evaluation classifier for training to obtain the trained conformal evaluation classifier.
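As a sketch of step 5.1, the five classifiers can be combined into a soft-voting ensemble. scikit-learn is assumed as the implementation library, and the hyper-parameters below are illustrative defaults, not the tuned optima the patent refers to:

```python
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def build_expert_model():
    """Soft-voting ensemble of the five classifiers named in step 5.1.
    Hyper-parameters are placeholders, not the patent's tuned values."""
    return VotingClassifier(
        estimators=[
            ("svm", SVC(probability=True)),  # probability=True enables soft voting
            ("rf", RandomForestClassifier(n_estimators=100)),
            ("knn", KNeighborsClassifier(n_neighbors=5)),
            ("lr", LogisticRegression(max_iter=1000)),
            ("gb", GradientBoostingClassifier()),
        ],
        voting="soft",  # average the predicted probabilities, then take argmax
    )
```

Soft voting averages the five classifiers' class probabilities rather than counting hard votes, which lets a confident classifier outweigh several uncertain ones.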
Specifically, the process of acquiring and labeling the data is performed according to the following steps:
step 1.1, crawling repository names through the API (application programming interface) provided by the hosting platform, and selecting the names of Java repositories with a star count higher than 10 according to the repositories' star rankings on the hosting platform;
step 1.2, using the repository names from step 1.1 and the hosting platform API's download rule for change submission files, splicing each repository name with the keywords to form download links, and downloading high-quality vulnerability samples through these links as the first original sample set;
the keywords are FIX CVE and the CVE ID;
step 1.3, checking whether the change submission description information in the first original sample set contains a CVE ID that maps to the CVE standard vulnerability library, and whether the change submission file repairs the vulnerability described by that CVE ID; if not, discarding the data; if so, retaining the data as the first sample set;
step 1.4, crawling repository names through the API provided by the hosting platform, and selecting the names of the repositories ranked in the top 1000 according to the repositories' star rankings on the hosting platform;
step 1.5, using the repository names from step 1.4 and the hosting platform API's download rule for change submission files, splicing each repository name with a regular expression to form download links, and downloading high-quality vulnerability samples through these links as the second sample set;
step 1.6, taking the first sample set and the second sample set together as the sample set for model training.
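The splicing in steps 1.2 and 1.5 can be sketched as a small URL builder. The public GitHub commit-search API format is assumed here; the exact download rule and endpoint used by the model are not specified in the text, so the URL shape below is illustrative only:

```python
from urllib.parse import quote_plus

def build_download_links(repo_full_name, patterns):
    """Splice a repository name with keywords (step 1.2) or a regular
    expression (step 1.5) into commit-search links; illustrative
    GitHub search-API format, not the patent's actual download rule."""
    base = "https://api.github.com/search/commits"
    return [f"{base}?q=repo:{quote_plus(repo_full_name)}+{quote_plus(p)}"
            for p in patterns]
```

For example, `build_download_links("apache/struts", ["FIX CVE"])` yields one search link per keyword, with the repository name and keyword URL-encoded.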
Specifically, the label processing is to manually determine whether each change submission file in the sample set is defect data; if so, the change submission file goes into the positive sample set, and if not, into the negative sample set for model training.
Specifically, the first vectorization is performed according to the following steps:
step 4.1, extracting the highly relevant numerical features of the repository to which each change submission file in the training set belongs;
step 4.2, dividing the change submission description information extracted in step 3 into a series of tokens by lexical analysis, discarding tokens consisting of Chinese word descriptions, using a tool to generate a 50-dimensional vector for each token, and splicing the vectors together to obtain feature vector 1.
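Step 4.2 can be sketched as follows. The embedding table stands in for the word2vec model named in the embodiment, and the tokenization regex and the zero-vector fallback for unseen tokens are assumptions:

```python
import re

def vectorize_description(message, embeddings, dim=50):
    """Split a change submission description into tokens, drop tokens
    containing non-ASCII (e.g. Chinese) characters, map each remaining
    token to its 50-dimensional vector, and splice the vectors.
    embeddings: token -> 50-dim list, standing in for a word2vec model."""
    tokens = [t for t in re.findall(r"\w+", message) if t.isascii()]
    vec = []
    for t in tokens:
        vec.extend(embeddings.get(t, [0.0] * dim))  # zeros for unseen tokens
    return vec
```

A description with two retained tokens thus yields a 100-dimensional spliced vector.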
Specifically, the second vectorization is performed according to the following steps:
step 4.2, dividing the code set extracted in step 3 into a series of tokens by lexical analysis, using a tool to generate a 50-dimensional vector for each token, and splicing the vectors together to obtain feature vector 2;
the tokens comprise identifiers, keywords, operators and symbols.
A method for collecting high-quality vulnerability data is carried out according to the following steps:
step one, collecting change submission files and processing them according to steps 2 and 3 in claim 1 to obtain digital feature vectors for evaluation;
step two, inputting the digital feature vectors for evaluation into the trained expert integration model and the trained conformal evaluation classifier; the expert integration model predicts on the digital feature vectors and gives a prediction result, and the conformal evaluation classifier scores the prediction result. When the score is higher than 1-C, the corresponding change submission file is kept as high-quality vulnerability data; when the score is below 1-C, the corresponding change submission file is discarded.
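The retention rule in step two reduces to a single threshold comparison; a sketch, with C = 0.3 taken from the value used later in the embodiment:

```python
def keep_sample(score, C=0.3):
    """Keep a change submission file as high-quality vulnerability data
    only when the conformal evaluation score exceeds 1 - C."""
    return score > 1 - C
```

With C = 0.3 the threshold is 0.7, so a sample scored 0.8 is kept and one scored 0.6 is discarded.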
Compared with the prior art, the invention has the following beneficial technical effects:
The expert integration model combines several strong classifiers, avoiding the weaknesses of any single machine learning model and improving accuracy in vulnerability identification. By combining the expert integration model with the conformal evaluation classifier, i.e. combining probabilistic learning with statistical evaluation, the method markedly improves the accuracy and reliability of the expert integration model's predictions, lowers the false positive rate, addresses the false alarm problem of some existing vulnerability data collection models, and offers a feasible solution to the shortage of high-quality source code vulnerability data.
Drawings
FIG. 1 is a flow chart of the present invention.
FIG. 2 is an example change submission file.
FIG. 3 is a code set extracted from a patch.
FIG. 4 shows the three features vectorized, then clipped and tiled into a 2000-dimensional digital feature vector.
FIG. 5 is a diagram of the expert integration model architecture.
FIG. 6 is the regular expression used in the step of filtering vulnerability-related change submission files.
FIG. 7 is the comparison experiment between the integrated classifier and single classifiers.
FIG. 8 is the comparison experiment of vulnerability data collection methods.
FIG. 9 is the data improvement experiment based on the VulDeePecker method.
FIG. 10 is the data improvement experiment based on the μVulDeePecker method.
FIG. 11 is the conformal evaluation classifier comparison experiment.
The present invention will be explained in further detail with reference to examples.
Detailed Description
It should be noted that in this application SARD stands for Software Assurance Reference Dataset.
It should be noted that in this application SAMATE stands for Software Assurance Metrics And Tool Evaluation.
It should be noted that in this application CVE stands for Common Vulnerabilities and Exposures.
It should be noted that in this application the CVE ID is the Common Vulnerabilities and Exposures identifier, i.e. the number of an entry in the common vulnerabilities and exposures library.
It should be noted that in this application GitHub is a hosting platform for open source and private software projects.
It should be noted that in this application a change submission file refers to one code submission and includes the code repair submission and an information description of the code repair.
It should be noted that in this application the change submission description information refers to the modification description information in one code submission.
It should be noted that in this application API stands for Application Programming Interface.
It should be noted that in this application word2vec is a model for generating word vectors; these models are shallow two-layer neural networks trained to reconstruct the linguistic contexts of words.
It should be noted that in this application LR stands for Logistic Regression.
It should be noted that in this application RF stands for Random Forest.
It should be noted that in this application GB stands for Gradient Boosting.
It should be noted that in this application SVM stands for Support Vector Machine.
It should be noted that in this application KNN stands for K-Nearest Neighbor.
The following embodiments of the present invention are provided. It should be noted that the present invention is not limited to the following embodiments, and all equivalent changes made on the basis of the technical solutions of the present application fall within the protection scope of the present invention.
Example 1:
the embodiment provides a method for constructing a high-quality vulnerability data collection model, which is carried out according to the following steps:
step 1, collecting a change submission file as a sample set, and performing label processing on the sample set to obtain a positive sample set and a negative sample set;
in this embodiment, the collected change submission file is as shown in FIG. 2;
the change submission files comprise vulnerable change submission files submitted to the CVE and vulnerable change submission files not submitted to the CVE;
step 2, extracting numerical characteristics of the change submission files in the sample set, extracting change submission description information of the change submission files in the sample set, extracting code blocks in the change submission files in the sample set, and storing the code blocks in a code set;
the numerical features comprise star count, total number of commits, total number of releases, number of repository contributors, contribution rate and total number of branches;
in this embodiment, the code blocks in the change submission file of FIG. 2 are extracted from the sample set and stored in the code set shown in FIG. 3.
The code block comprises a deleted line code and an added line code in a modified file;
step 3, performing a first vectorization on the change submission description information to obtain feature vector 1, performing a second vectorization on the code set to obtain feature vector 2, clipping and tiling the numerical features, feature vector 1 and feature vector 2 into a one-dimensional feature, and taking this one-dimensional feature as the digital feature vector of the change submission file;
in this embodiment, the first vectorization is performed on the change submission description information to obtain feature vector 1, the second vectorization is performed on the code set to obtain feature vector 2, and the numerical features, feature vector 1 and feature vector 2 are clipped and tiled into a one-dimensional feature that serves as the digital feature vector of the change submission file; the digital feature vector obtained by this vectorization is shown in FIG. 4.
Step 4, using the digital feature vector as a training set, wherein the training set comprises a positive training set and a negative training set;
step 5, constructing an expert integration model and a conformal evaluation classifier;
step 5.1, training each single classifier with the collected change submission files and determining the optimal hyper-parameters of each classifier; combining the five classifiers into an expert integration model, with soft voting selected as the voting mechanism of the expert integration model;
the five classifiers are a support vector machine, random forest, k-nearest neighbors, logistic regression and gradient boosting;
In this embodiment, a single classifier is first trained with the collected change submission files, and the optimal hyper-parameters of each classifier are determined; the five classifiers are then combined into an expert integration model. The conformal evaluation classifier gives an evaluation based on the calculated likelihood; if the evaluation is higher than the set threshold, the prediction result of the expert integration model is confirmed. Finally the expert integration model gives the prediction result, where 0 means the change submission file is not defect data and 1 means it is defect data. The digital feature vectors of the positive and negative training sets are input into the expert integration model and the conformal evaluation classifier to obtain the prediction results. The process is shown in FIG. 5.
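The accept/reject step of FIG. 5 can be sketched as a small gate. The names are illustrative; the ensemble's class probabilities and the conformal evaluation score are assumed to be computed upstream:

```python
def gated_prediction(class_probs, conformal_score, C=0.3):
    """Confirm the expert integration model's prediction only when the
    conformal evaluation score clears the 1 - C threshold.
    class_probs: (p_not_defect, p_defect) from soft voting.
    Returns 1 (defect data), 0 (not defect data), or None (rejected)."""
    if conformal_score <= 1 - C:
        return None  # evaluation below threshold: reject the prediction
    return 1 if class_probs[1] >= class_probs[0] else 0
```

Returning None rather than a forced 0/1 is what lets the collection pipeline discard low-confidence samples instead of mislabeling them.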
Step 5.2, inputting the positive training set and the negative training set into an expert integrated model for training to obtain a trained expert integrated model;
and 5.3, setting a threshold C of the shape preserving evaluation classifier, inputting the trained expert integration model into the shape preserving evaluation classifier to obtain the constructed shape preserving evaluation classifier, and inputting the digital feature vector into the constructed shape preserving evaluation classifier for training to obtain the trained shape preserving evaluation classifier.
As a preferred solution of this embodiment, the data acquisition process is performed according to the following steps:
step 1.1, crawling repository names through the API (application programming interface) provided by the hosting platform, and selecting the names of Java repositories with a star count higher than 10 according to the repositories' star rankings on the hosting platform;
step 1.2, using the repository names from step 1.1 and the hosting platform API's download rule for change submission files, splicing each repository name with the keywords to form download links, and downloading high-quality vulnerability samples through these links as the first original sample set;
the keywords are FIX CVE and the CVE ID;
step 1.3, checking whether the change submission description information in the first original sample set contains a CVE ID that maps to the CVE standard vulnerability library, and whether the change submission file repairs the vulnerability described by that CVE ID; if not, discarding the data; if so, retaining the data as the first sample set;
step 1.4, crawling repository names through the API provided by the hosting platform, and selecting the names of the repositories ranked in the top 1000 according to the repositories' star rankings on the hosting platform;
step 1.5, using the repository names from step 1.4 and the hosting platform API's download rule for change submission files, splicing each repository name with a regular expression to form download links, and downloading high-quality vulnerability samples through these links as the second sample set;
in this embodiment, the regular expression is shown in FIG. 6; the hosting platform employed is GitHub.
step 1.6, taking the first sample set and the second sample set together as the sample set for model training;
As a preferred scheme of this embodiment, the label processing is to manually determine whether each change submission file in the sample set is defect data; if so, the change submission file goes into the positive sample set, and if not, into the negative sample set for model training.
As a preferred solution of this embodiment, the first vectorization is performed according to the following steps:
dividing the change submission description information extracted in step 3 into a series of tokens by lexical analysis, discarding tokens consisting of Chinese word descriptions, using a tool to generate a 50-dimensional vector for each token, and splicing the vectors together to obtain feature vector 1;
in this embodiment, the tool used is word2vec.
As a preferred solution of this embodiment, the second vectorization is performed according to the following steps:
dividing the code set extracted in step 3 into a series of tokens by lexical analysis, using a tool to generate a 50-dimensional vector for each token, and splicing the vectors together to obtain feature vector 2;
the tokens comprise identifiers, keywords, operators and symbols.
Example 2:
the embodiment provides a method for collecting high-quality vulnerability data, which is carried out according to the following steps:
step one, collecting change submission files and processing them according to steps 2 and 3 in claim 1 to obtain digital feature vectors for evaluation;
step two, inputting the digital feature vectors for evaluation into the trained expert integration model and the trained conformal evaluation classifier; the expert integration model predicts on the digital feature vectors and gives a prediction result, and the conformal evaluation classifier scores the prediction result. When the score is higher than 1-C, the corresponding change submission file is kept as high-quality vulnerability data; when the score is below 1-C, the corresponding change submission file is discarded.
In this embodiment, a conformal evaluation classifier is applied to calculate a confidence value pv for the class y of an input digital feature vector x. To calculate pv for each x, the calibration score of the weakly supervised model h's prediction on x is computed with a measure function A(x, y, h) specific to each weakly supervised model.

To calculate the confidence value pv, 10% of the digital feature vectors are held out as a calibration set, and the calibration score of each weakly supervised model on each of the n input samples in the calibration set is computed:

$$\alpha_i = A(x_i, y_i, h), \quad i = 1, \dots, n$$

Given a new input sample $x_{n+1}$, its calibration score is computed with the same measure function A(x, y, h):

$$\alpha_{n+1} = A(x_{n+1}, y, h)$$

The pv of sample $x_{n+1}$ is then:

$$pv = \frac{\mathrm{COUNT}\{\, i : y_i = y_p,\ \alpha_i \ge \alpha_{n+1} \,\} + 1}{n + 1}$$

wherein:
pv represents the evaluation score of the conformal evaluation classifier on the input x;
COUNT represents the number of samples satisfying the condition;
i indexes the i-th sample;
$y_i$ is the label of the i-th sample;
$y_p$ is the sample label category;
x is a digital feature vector;
$x_{n+1}$ is the (n+1)-th digital feature vector;
C is the set confidence threshold, set to 0.3 in this embodiment;
A(x, y, h) is the prediction function of the weakly supervised model on an input digital feature vector x.

If the calculated pv value is close to the lower bound 1/(n+1), the prediction is unreliable; if it is close to the upper bound 1, the prediction is reliable. In this embodiment, only predictions with pv greater than 1-C are accepted, and the expert integration model performs best when C is set to 0.3.
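The p-value computation above can be sketched directly; the label-conditional filtering is folded in by passing only the calibration scores of the matching class, and the measure function A is assumed to be given:

```python
def conformal_p_value(calibration_scores, new_score):
    """Conformal p-value of a new sample, given the calibration scores
    of its class: the fraction of scores at least as nonconforming as
    the new one, counting the new sample itself, out of n + 1.
    The result lies in [1/(n+1), 1]."""
    n = len(calibration_scores)
    count = sum(1 for a in calibration_scores if a >= new_score)
    return (count + 1) / (n + 1)
```

With calibration scores [0.1, 0.2, 0.9] and a new score of 0.5, one calibration score is at least 0.5, so pv = 2/4 = 0.5.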
Experimental example 1:
Following the above technical scheme, the expert integration model and 5 vulnerability data crawling models (VCCFinder, ZvD, VulPecker, Zhou et al. and Sabetta et al.) are each trained on the same data set. The experimental results are shown in FIG. 7. The horizontal axis is the FPR (false positive rate) threshold and the vertical axis is the TPR (true positive rate). Under the same false positive rate threshold, the true positive rate of the expert integration model is higher than that of the other 5 data collection schemes; under the same true positive rate threshold, the false positive rate of the expert integration model is lower than that of the other 5 data collection schemes. The expert integration model thus still achieves a high true positive rate at a low false positive rate.
Experimental example 2:
Following the above technical scheme, the expert integration model is compared with single-classifier methods: the expert integration model and 5 independent classifiers (LR, RF, GB, SVM and KNN) are trained on the same data set. The experimental results are shown in FIG. 8. The horizontal axis is the FPR (false positive rate) threshold and the vertical axis is the TPR (true positive rate). Under the same false positive rate threshold, the true positive rate of the expert integration model is higher than that of the 5 independent classifiers; under the same true positive rate threshold, its false positive rate is lower than that of the 5 independent classifiers. The expert integration model thus still achieves a high true positive rate at a low false positive rate.
Measured example 3:
Following the above technical solution, the vulnerability collection method of the invention is compared with 3 existing vulnerability collection methods (ZvD, ZHOU et al. and SABETTA et al.): the data obtained with the invention and with the other 3 methods are applied to two vulnerability detection methods (VulDeePecker and μVulDeePecker). The experimental results for the VulDeePecker-based detection method are shown in FIG. 9, which reports the detection effect of VulDeePecker when trained with data from the different collection methods, measured by four evaluation indexes: Accuracy, Precision, Recall and F1 score. The horizontal axis shows the four evaluation indexes and the vertical axis the improvement over the reference model; negative values indicate degraded performance. On all four indexes, the data collected by the method of the invention improves the detection effect of VulDeePecker by 10.2% to 12%, whereas the data collected by the other three vulnerability data collection schemes improves it by at most 6%. The improvement achieved by the collection method of the invention is thus nearly twice that of the other three methods, showing that it is superior to the existing vulnerability collection methods.
The experimental results for the μVulDeePecker-based vulnerability detection method are shown in FIG. 10, which reports the detection effect of μVulDeePecker when trained with data from the different collection methods; all four evaluation indexes (Accuracy, Precision, Recall and F1 score) improve by different margins. The horizontal axis shows the four evaluation indexes and the vertical axis the improvement over the reference model; negative values indicate degraded performance. On all four indexes, the data collected by the invention improves the detection effect of μVulDeePecker by 12% to 14%, whereas the data collected by the other three vulnerability data collection schemes improves it by at most 7%. The improvement achieved by the collection method of the invention is again nearly twice that of the other three methods, so the collection method of the invention is superior to the existing vulnerability collection methods.
Measured example 4:
Following the above technical scheme, a plain integrated classifier and the expert integration model with the conformal evaluation classifier were trained on the same data set. The experimental results are shown in FIG. 11, where the horizontal axis represents the experiment number and the vertical axis the accuracy. The accuracy of the expert integration model with the conformal evaluation classifier is 4% to 12% higher than that of the integrated classifier alone and reaches up to 91%, showing that the conformal evaluation classifier plays a key role in the model's classification.

Claims (6)

1. A method for constructing a high-quality vulnerability data collection model is characterized by comprising the following steps:
step 1, collecting change submission files as a sample set, and labeling the sample set to obtain a positive sample set and a negative sample set;
the change submission files comprise vulnerability change submission files that have been submitted to the CVE and vulnerability change submission files that have not been submitted to the CVE;
step 2, extracting the numerical features of the change submission files in the sample set, extracting the change submission description information of the change submission files, and extracting the code blocks in the change submission files and storing them in a code set;
the numerical features comprise the star rating, total number of commits, total number of releases, number of repository contributors, contribution rate and total number of branches;
the code blocks comprise the deleted lines and added lines of code in the modified files;
step 3, performing a first vectorization on the change submission description information to obtain feature vector 1, performing a second vectorization on the code set to obtain feature vector 2, and cutting and tiling the numerical features, feature vector 1 and feature vector 2 into a one-dimensional feature, which serves as the digital feature vector of the change submission file;
step 4, using the digital feature vectors as a training set, wherein the training set comprises a positive training set and a negative training set;
step 5, constructing an expert integration model and a conformal evaluation classifier;
step 5.1, training each single classifier with the collected change submission files and determining the optimal hyperparameters of each classifier; combining the five classifiers into an expert integration model whose voting mechanism is soft voting;
the five classifiers are a support vector machine, random forest, k-nearest neighbors, logistic regression and gradient boosting;
step 5.2, inputting the positive training set and the negative training set into the expert integration model for training to obtain a trained expert integration model;
and step 5.3, setting the threshold C of the conformal evaluation classifier, applying the conformal evaluation classifier to the trained expert integration model to obtain the constructed conformal evaluation classifier, and inputting the digital feature vectors into the constructed conformal evaluation classifier for training to obtain the trained conformal evaluation classifier.
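The five-classifier soft-voting ensemble of step 5.1 can be sketched with scikit-learn as below. This is a minimal illustration: the hyperparameters shown are library defaults, not the tuned optimal hyperparameters the claim refers to.

```python
# Soft-voting ensemble of the five classifiers named in step 5.1:
# SVM, random forest, k-NN, logistic regression, gradient boosting.
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

experts = VotingClassifier(
    estimators=[
        ("svm", SVC(probability=True)),  # soft voting needs probabilities
        ("rf", RandomForestClassifier()),
        ("knn", KNeighborsClassifier()),
        ("lr", LogisticRegression(max_iter=1000)),
        ("gb", GradientBoostingClassifier()),
    ],
    voting="soft",  # average the predicted class probabilities
)
```

After `experts.fit(X_train, y_train)`, `experts.predict_proba(X)` gives the averaged class probabilities that soft voting is based on.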
2. The method for constructing a high-quality vulnerability data collection model according to claim 1, wherein the data acquisition and labeling are performed according to the following steps:
step 1.1, crawling repository names by means of the API provided by the hosting platform, and selecting the names of Java repositories with a star rating higher than 10 according to the star-rating ranking of repositories on the hosting platform;
step 1.2, using the repository names from step 1.1 and the hosting platform API's download rules for change submission files, splicing each repository name with the keywords to form a download link, and downloading high-quality vulnerability samples through the download links as a first original sample set;
the keywords are FIX CVE and CVE ID;
step 1.3, judging whether the change submission description information in the first original sample set contains a CVE ID mapped to the CVE standard vulnerability library and whether the change submission file fixes the vulnerability described by that CVE ID; if so, the data is retained as the first sample set; otherwise, the data is discarded;
step 1.4, crawling repository names by means of the API provided by the hosting platform, and selecting the names of repositories ranked in the top 1000 according to the star-rating ranking of repositories on the hosting platform;
step 1.5, using the repository names obtained in step 1.4 and the hosting platform API's download rules for change submission files, splicing each repository name with the regular expression to form a download link, and downloading high-quality vulnerability samples through the download links as a second sample set;
and step 1.6, taking the first sample set and the second sample set together as the sample set for model training.
3. The method for constructing a high-quality vulnerability data collection model according to claim 1, wherein the labeling consists of manually judging whether each change submission file in the sample set is defect data; if so, it is placed in the positive sample set, and if not, in the negative sample set for model training.
4. The method for constructing a high-quality vulnerability data collection model according to claim 1, wherein the first vectorization is performed according to the following steps:
step 4.1, extracting the highly relevant numerical features of the repositories to which the change submission files in the training set belong;
and step 4.2, dividing the change submission description information extracted in step 3 into a series of tokens by lexical analysis, discarding tokens containing Chinese-word descriptions, using a tool to generate for each token a corresponding 50-dimensional vector, and splicing the vectors to obtain feature vector 1.
5. The method for constructing a high-quality vulnerability data collection model according to claim 1, wherein the second vectorization is performed according to the following steps:
step 4.2, dividing the code set extracted in step 3 into a series of tokens by lexical analysis, using a tool to generate for each token a corresponding 50-dimensional vector, and splicing the vectors to obtain feature vector 2;
the tokens comprise identifiers, keywords, operators and symbols.
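The tokenize-then-embed step of claims 4 and 5 can be sketched as below. The claims only say "a tool" produces the 50-dimensional vectors (word2vec-style embeddings are typical); the hash-seeded random vectors here are a deterministic stand-in for that unnamed tool, purely for illustration.

```python
# Sketch: split text into tokens (identifiers, keywords, operators,
# symbols), map each token to a 50-dimensional vector, concatenate.
import hashlib
import re

import numpy as np

def token_vector(token, dim=50):
    # Seed a RNG from the token's hash so equal tokens get equal vectors;
    # a real embedding tool would be trained on the corpus instead.
    seed = int.from_bytes(hashlib.sha256(token.encode()).digest()[:4], "big")
    return np.random.default_rng(seed).standard_normal(dim)

def feature_vector(code_text, dim=50):
    # Identifiers/keywords, or any single non-space symbol (operators etc.).
    tokens = re.findall(r"[A-Za-z_]\w*|[^\sA-Za-z_]", code_text)
    return np.concatenate([token_vector(t, dim) for t in tokens])
```

Splicing per-token vectors this way yields a vector of length 50 times the token count, matching the concatenation described in the claims.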
6. A method for collecting high-quality vulnerability data is characterized by comprising the following steps:
step one, collecting change submission files, and processing them according to steps 2 and 3 in claim 1 to obtain digital feature vectors for evaluation;
and step two, inputting the digital feature vectors for evaluation into the trained expert integration model and the trained conformal evaluation classifier; the expert integration model predicts the digital feature vectors and gives a prediction result, and the conformal evaluation classifier scores the prediction result; when the score is higher than 1-C, the corresponding change submission file is retained as high-quality vulnerability data; when the score is lower than 1-C, the corresponding change submission file is discarded.
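The collection step of claim 6 can be sketched end to end as below. The prediction function, calibration scores and feature vectors are placeholders: any trained ensemble exposing a positive-class probability, and a held-out calibration set, are assumed.

```python
# Sketch of claim 6, step two: the ensemble scores each change-submission
# feature vector, a conformal-style credibility is derived from the
# calibration scores, and only submissions scoring above 1 - C are kept.
def collect(vectors, predict_vuln_proba, calibration_scores, C=0.3):
    kept = []
    n = len(calibration_scores)
    for v in vectors:
        score = predict_vuln_proba(v)  # ensemble's positive-class probability
        alpha = 1.0 - score            # nonconformity: low probability -> high alpha
        pv = (sum(1 for s in calibration_scores if s >= alpha) + 1) / (n + 1)
        if pv > 1 - C:                 # keep as high-quality vulnerability data
            kept.append(v)
    return kept
```

Submissions the ensemble is confident about score a pv near 1 and are kept; low-credibility ones fall below 1-C and are discarded.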
CN202110424826.0A 2021-04-20 2021-04-20 Construction method and collection method of high-quality vulnerability data collection model Active CN113221960B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110424826.0A CN113221960B (en) 2021-04-20 2021-04-20 Construction method and collection method of high-quality vulnerability data collection model


Publications (2)

Publication Number Publication Date
CN113221960A true CN113221960A (en) 2021-08-06
CN113221960B CN113221960B (en) 2023-04-18

Family

ID=77088249

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110424826.0A Active CN113221960B (en) 2021-04-20 2021-04-20 Construction method and collection method of high-quality vulnerability data collection model

Country Status (1)

Country Link
CN (1) CN113221960B (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109886020A (en) * 2019-01-24 2019-06-14 燕山大学 Software vulnerability automatic classification method based on deep neural network
CN110197286A (en) * 2019-05-10 2019-09-03 武汉理工大学 A kind of Active Learning classification method based on mixed Gauss model and sparse Bayesian
US20200320254A1 (en) * 2019-04-03 2020-10-08 RELX Inc. Systems and Methods for Dynamically Displaying a User Interface of an Evaluation System Processing Textual Data


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ULF JOHANSSON: "Model-agnostic nonconformity functions for conformal classification", 2017 International Joint Conference on Neural Networks (IJCNN) *
REN Cairong: "Urban PM2.5 Concentration Prediction Based on Parallel Random Forest", China Master's Theses Full-text Database *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114120120A (en) * 2021-11-25 2022-03-01 广东电网有限责任公司 Method, device, equipment and medium for detecting illegal building based on remote sensing image
CN115048316A (en) * 2022-08-15 2022-09-13 中国电子科技集团公司第三十研究所 Semi-supervised software code defect detection method and device
CN116302043A (en) * 2023-05-25 2023-06-23 深圳市明源云科技有限公司 Code maintenance problem detection method and device, electronic equipment and readable storage medium
CN116302043B (en) * 2023-05-25 2023-10-10 深圳市明源云科技有限公司 Code maintenance problem detection method and device, electronic equipment and readable storage medium

Also Published As

Publication number Publication date
CN113221960B (en) 2023-04-18

Similar Documents

Publication Publication Date Title
CN113221960B (en) Construction method and collection method of high-quality vulnerability data collection model
CN109697162B (en) Software defect automatic detection method based on open source code library
CN111309912B (en) Text classification method, apparatus, computer device and storage medium
CN109829155B (en) Keyword determination method, automatic scoring method, device, equipment and medium
CN109635108B (en) Man-machine interaction based remote supervision entity relationship extraction method
CN111459799A (en) Software defect detection model establishing and detecting method and system based on Github
CN110162478B (en) Defect code path positioning method based on defect report
Kobayashi et al. Towards an NLP-based log template generation algorithm for system log analysis
CN107862327B (en) Security defect identification system and method based on multiple features
Zhang et al. Large-scale empirical study of important features indicative of discovered vulnerabilities to assess application security
CN111460250A (en) Image data cleaning method, image data cleaning device, image data cleaning medium, and electronic apparatus
CN107368526A (en) A kind of data processing method and device
US11385988B2 (en) System and method to improve results of a static code analysis based on the probability of a true error
CN111860981B (en) Enterprise national industry category prediction method and system based on LSTM deep learning
CN112800232A (en) Big data based case automatic classification and optimization method and training set correction method
CN112035345A (en) Mixed depth defect prediction method based on code segment analysis
CN116523284A (en) Automatic evaluation method and system for business operation flow based on machine learning
CN114416573A (en) Defect analysis method, device, equipment and medium for application program
CN113626241A (en) Application program exception handling method, device, equipment and storage medium
CN114238768A (en) Information pushing method and device, computer equipment and storage medium
CN110941713A (en) Self-optimization financial information plate classification method based on topic model
CN110888977A (en) Text classification method and device, computer equipment and storage medium
CN115686995A (en) Data monitoring processing method and device
CN113722230A (en) Integrated assessment method and device for vulnerability mining capability of fuzzy test tool
CN113448860A (en) Test case analysis method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant