CN113221960B - Construction method and collection method of high-quality vulnerability data collection model - Google Patents

Construction method and collection method of high-quality vulnerability data collection model

Info

Publication number
CN113221960B
CN113221960B (application CN202110424826.0A)
Authority
CN
China
Prior art keywords
sample set
change submission
repository
vulnerability
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110424826.0A
Other languages
Chinese (zh)
Other versions
CN113221960A (en)
Inventor
房鼎益
胡飞
徐榕泽
叶贵鑫
王焕廷
汤战勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwest University
Original Assignee
Northwest University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwest University filed Critical Northwest University
Priority to CN202110424826.0A priority Critical patent/CN113221960B/en
Publication of CN113221960A publication Critical patent/CN113221960A/en
Application granted granted Critical
Publication of CN113221960B publication Critical patent/CN113221960B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24147 Distances to closest patterns, e.g. nearest neighbour classification
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Abstract

The invention discloses a construction method and a collection method for a high-quality vulnerability data collection model. Change submission files are collected as a sample set, and the sample set is labeled to obtain a positive sample set and a negative sample set; the numerical features of the change submission files in the sample set are extracted, together with their change submission description information and the code blocks they contain. The expert integration model integrates several strong classifiers, avoiding the weaknesses of a single machine learning model and improving accuracy in vulnerability identification. By combining the expert integration model with a conformal evaluation classifier, that is, by combining probabilistic learning with statistical evaluation in machine learning, the method significantly improves the accuracy and reliability of the expert integration model's predictions, reduces the false positive rate, addresses the false alarm problem of some existing vulnerability data collection models, and provides a feasible scheme for the shortage of high-quality source code vulnerability data.

Description

Construction method and collection method of high-quality vulnerability data collection model
Technical Field
The invention belongs to the field of code auditing, relates to source code feature extraction technology, and particularly relates to a construction method and a collection method for a high-quality vulnerability data collection model.
Background
Traditional deep learning algorithms usually need millions of vulnerability samples to learn an effective model; the potential of deep learning for latent vulnerability pattern learning can only be realized with a sufficient amount of training data. However, because real-world high-quality vulnerability samples are very scarce, the lack of training data limits the quality of vulnerability detection models. Some previous approaches used program generation to synthesize vulnerability samples and thereby alleviate the shortage of training samples, but program generation has two distinct disadvantages: on the one hand, the generated programs are constrained by the grammar, template, or model used to generate them; on the other hand, they cannot reflect the diversity and evolving patterns of real-world programs.
Although some standard vulnerability databases, such as the SARD and SAMATE data sets, already exist and provide convenience for security researchers in analyzing and applying known vulnerabilities, the defect samples in these standard vulnerability data sets still have many problems. First, the sample size is small: generally only a few hundred bugs of any one type are available, which is insufficient to support the training of a high-quality vulnerability detection model. Second, the sample types are narrow: a standard vulnerability library covers only a few of the conditions that can trigger a vulnerability. Third, standard vulnerability data sets are updated slowly.
Given the above deficiencies of standard vulnerability databases, we use GitHub as the platform for data collection. As the largest code hosting platform in the world, GitHub provides a rich source of data. If a high-quality vulnerability collection model can be constructed so that high-quality defect samples are obtained from GitHub automatically, the problem of data shortage can be alleviated.
Disclosure of Invention
Aiming at the deficiencies in the prior art, the invention provides a construction method and a collection method for a high-quality vulnerability data collection model, so as to solve the technical problem that high-quality source code vulnerability data are lacking in the prior art.
To solve the above technical problem, the invention adopts the following technical scheme:
a method for constructing a high-quality vulnerability data collection model comprises the following steps:
step 1, collecting a change submission file as a sample set, and performing label processing on the sample set to obtain a positive sample set and a negative sample set;
the change submission files comprise vulnerability-related change submission files submitted to the CVE and vulnerability-related change submission files not submitted to the CVE;
step 2, extracting the numerical features of the change submission files in the sample set, extracting the change submission description information of the change submission files in the sample set, extracting the code blocks in the change submission files in the sample set, and storing the code blocks in a code set;
the numerical features comprise the star rating, total number of commits, total number of releases, number of contributors of the repository, contribution rate, and total number of branches;
the code blocks comprise the deleted lines of code and added lines of code in the modified files;
step 3, performing a first vectorization on the change submission description information to obtain feature vector 1, performing a second vectorization on the code set to obtain feature vector 2, clipping and tiling the numerical features, feature vector 1, and feature vector 2 into a one-dimensional feature, and taking the one-dimensional feature as the digital feature vector of the change submission file;
step 4, using the digital feature vector as a training set, wherein the training set comprises a positive training set and a negative training set;
step 5, constructing an expert integration model and a conformal evaluation classifier;
step 5.1, training single classifiers with the collected change submission files and determining the optimal hyper-parameters of each classifier; combining the five classifiers into an expert integration model, with soft voting selected as the voting mechanism of the expert integration model;
the five classifiers are support vector machine, random forest, k-nearest neighbors, logistic regression, and gradient boosting;
step 5.2, inputting the positive training set and the negative training set into the expert integration model for training to obtain the trained expert integration model;
step 5.3, setting a threshold C for the conformal evaluation classifier, inputting the trained expert integration model into the conformal evaluation classifier to obtain the constructed conformal evaluation classifier, and inputting the digital feature vectors into the constructed conformal evaluation classifier for training to obtain the trained conformal evaluation classifier.
Specifically, collecting the change submission files as the sample set comprises the following steps:
step 1.1, crawling repository names by means of the API provided by the hosting platform, and selecting the names of Java repositories whose star rating is higher than 10 according to the star ranking of repositories on the hosting platform;
step 1.2, using the repository names from step 1.1 and the hosting platform API's download rule for change submission files, splicing each repository name with the keywords to form a download link, and downloading high-quality vulnerability samples through the download links as the first original sample set;
the keywords are FIX CVE and the CVE ID;
step 1.3, judging whether the change submission description information in the first original sample set contains a CVE ID that maps to the CVE standard vulnerability library and whether the change submission file repairs the vulnerability described by that CVE ID; if not, the data is discarded; if so, the data is retained and taken as the first sample set;
step 1.4, crawling repository names by means of the API provided by the hosting platform, and selecting the names of repositories whose star ranking is higher than 1000 according to the star ranking of repositories on the hosting platform;
step 1.5, using the repository names obtained in step 1.4 and the hosting platform API's download rule for change submission files, splicing each repository name with the regular expression to form a download link, and downloading high-quality vulnerability samples through the download links as the second sample set;
step 1.6, taking the first sample set and the second sample set together as the sample set for model training.
Specifically, the label processing is to manually determine whether each change submission file in the sample set is defect data; if so, the change submission file is placed in the positive sample set, and if not, it is placed in the negative sample set for model training.
Specifically, the first vectorization is performed according to the following steps:
step 4.1, extracting the highly relevant numerical features of the repository to which each change submission file in the training set belongs;
step 4.2, dividing the change submission description information extracted in step 3 into a series of tokens by lexical analysis, discarding tokens that are Chinese-word descriptions, generating for each token a corresponding 50-dimensional vector with a tool, and splicing the vectors to obtain feature vector 1.
Specifically, the second vectorization is performed according to the following steps:
step 4.2, dividing the code set extracted in step 3 into a series of tokens by lexical analysis, generating for each token a corresponding 50-dimensional vector with a tool, and splicing the vectors to obtain feature vector 2;
the tokens comprise identifiers, keywords, and operators;
a method for collecting high-quality vulnerability data comprises the following steps:
step one, collecting change submission files, and processing them according to step 2 and step 3 of the construction method above to obtain digital feature vectors for evaluation;
step two, inputting the digital feature vectors for evaluation into the trained expert integration model and the trained conformal evaluation classifier; the expert integration model predicts on the digital feature vectors and gives a prediction result, and the conformal evaluation classifier scores the prediction result; when the score is higher than 1-C, the corresponding change submission file is kept as high-quality vulnerability data; when the score is below 1-C, the corresponding change submission file is discarded.
Compared with the prior art, the invention has the following beneficial technical effects:
the expert integration model integrates several strong classifiers, avoiding the weaknesses of a single machine learning model and improving accuracy in vulnerability identification; by combining the expert integration model with the conformal evaluation classifier, that is, by combining probabilistic learning with statistical evaluation in machine learning, the method remarkably improves the accuracy and reliability of the expert integration model's predictions, reduces the false positive rate, addresses the false alarm problem of some existing vulnerability data collection models, and provides a feasible scheme for the shortage of high-quality source code vulnerability data.
Drawings
FIG. 1 is a flow chart of the present invention.
FIG. 2 shows a change submission file.
FIG. 3 shows the code set extracted from a patch.
FIG. 4 shows the three features being vectorized, then clipped and tiled into a 2000-dimensional digital feature vector.
FIG. 5 is a diagram of the expert integration model architecture.
FIG. 6 shows the regular expression used in the step of filtering vulnerability-related change submission files.
FIG. 7 shows the comparison experiment between the integrated classifier and single classifiers.
FIG. 8 shows the comparison experiment of vulnerability data collection methods.
FIG. 9 shows the data improvement experiment based on the VulDeePecker method.
FIG. 10 shows the data improvement experiment based on the μVulDeePecker method.
FIG. 11 shows the conformal evaluation classifier comparison experiment.
The present invention will be explained in further detail with reference to examples.
Detailed Description
It should be noted that, in this application, SARD stands for Software Assurance Reference Dataset.
It should be noted that, in this application, SAMATE stands for Software Assurance Metrics And Tool Evaluation.
It should be noted that, in this application, CVE stands for Common Vulnerabilities and Exposures.
It should be noted that, in this application, CVE ID stands for Common Vulnerabilities and Exposures Identity Document and denotes the number of an entry in the common vulnerabilities and exposures library.
It should be noted that, in this application, GitHub is a hosting platform for open-source and private software projects.
It should be noted that, in this application, a change submission file refers to one code commit and comprises the code repair and an information description of the code repair.
It should be noted that, in this application, the change submission description information refers to the modification description in one code commit.
It should be noted that, in this application, API stands for Application Programming Interface.
It should be noted that, in this application, word2vec is a model for generating word vectors; such models are shallow two-layer neural networks trained to reconstruct the linguistic contexts of words.
It should be noted that, in this application, LR stands for Logistic Regression.
It should be noted that, in this application, RF stands for Random Forest.
It should be noted that, in this application, GB stands for Gradient Boosting.
It should be noted that, in this application, SVM stands for Support Vector Machine.
It should be noted that, in this application, KNN stands for K-Nearest Neighbor.
The following embodiments of the present invention are given; it should be noted that the invention is not limited to the following embodiments, and all equivalent changes made on the basis of the technical solutions of this application fall within the protection scope of the invention.
Example 1:
This embodiment provides a method for constructing a high-quality vulnerability data collection model, comprising the following steps:
step 1, collecting a change submission file as a sample set, and performing label processing on the sample set to obtain a positive sample set and a negative sample set;
In this embodiment, the collected change submission file is as shown in FIG. 2.
The change submission files comprise vulnerability-related change submission files submitted to the CVE and vulnerability-related change submission files not submitted to the CVE;
step 2, extracting the numerical features of the change submission files in the sample set, extracting the change submission description information of the change submission files in the sample set, extracting the code blocks in the change submission files in the sample set, and storing the code blocks in a code set;
the numerical features comprise the star rating, total number of commits, total number of releases, number of contributors of the repository, contribution rate, and total number of branches;
in this embodiment, the code chunks in the change commit file shown in FIG. 2 in the sample set are extracted and stored in the code set shown in FIG. 3.
The code block comprises a deleted line code and an added line code in a modified file;
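As an illustration of the repository-level numerical features named in step 2, the following is a minimal sketch assuming GitHub as the hosting platform and using only public GitHub REST API endpoints; the helper names and the reading of "contribution rate" as commits per contributor are assumptions of this sketch, not definitions from the patent.

```python
import requests

API = "https://api.github.com"

def _count(session: requests.Session, url: str) -> int:
    """Count the items of a paginated list endpoint via the Link header
    (assumes "page" is the last query parameter of the "last" link)."""
    r = session.get(url, params={"per_page": 1})
    r.raise_for_status()
    if "last" in r.links:
        return int(r.links["last"]["url"].rsplit("=", 1)[-1])
    return len(r.json())

def repo_numerical_features(owner: str, repo: str, token: str) -> dict:
    s = requests.Session()
    s.headers["Authorization"] = f"token {token}"
    meta = s.get(f"{API}/repos/{owner}/{repo}").json()
    commits = _count(s, f"{API}/repos/{owner}/{repo}/commits")
    contributors = _count(s, f"{API}/repos/{owner}/{repo}/contributors")
    return {
        "stars": meta["stargazers_count"],  # star rating of the repository
        "commits_total": commits,           # total number of commits
        "releases_total": _count(s, f"{API}/repos/{owner}/{repo}/releases"),
        "contributors": contributors,
        # "Contribution rate" is read here as commits per contributor (assumption).
        "contribution_rate": commits / max(contributors, 1),
        "branches_total": _count(s, f"{API}/repos/{owner}/{repo}/branches"),
    }
```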
step 3, performing a first vectorization on the change submission description information to obtain feature vector 1, performing a second vectorization on the code set to obtain feature vector 2, clipping and tiling the numerical features, feature vector 1, and feature vector 2 into a one-dimensional feature, and taking the one-dimensional feature as the digital feature vector of the change submission file;
In this embodiment, feature vector 1 is obtained by the first vectorization of the change submission description information, and feature vector 2 is obtained by the second vectorization of the code set; the numerical features, feature vector 1, and feature vector 2 are clipped and tiled into a one-dimensional feature that serves as the digital feature vector of the change submission file. The resulting digital feature vector is shown in FIG. 4.
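The clipping and tiling of step 3 can be sketched as follows; this is a minimal illustration assuming the 2000-dimensional target length shown in FIG. 4, with the per-part budget and zero-padding being assumptions of the sketch rather than requirements of the patent.

```python
import numpy as np

def clip_or_pad(v: np.ndarray, length: int) -> np.ndarray:
    """Clip a vector to `length`, or zero-pad it when it is shorter."""
    out = np.zeros(length, dtype=np.float32)
    n = min(len(v), length)
    out[:n] = v[:n]
    return out

def digital_feature_vector(numeric: np.ndarray,
                           vec1: np.ndarray,  # vectorized change submission description
                           vec2: np.ndarray,  # vectorized code set
                           total: int = 2000) -> np.ndarray:
    """Tile the three features side by side into one one-dimensional vector."""
    # Keep the numerical features whole and split the remaining budget
    # between the two text-derived vectors (an assumption of this sketch).
    budget = (total - len(numeric)) // 2
    return np.concatenate([
        numeric.astype(np.float32),
        clip_or_pad(vec1, budget),
        clip_or_pad(vec2, total - len(numeric) - budget),
    ])
```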
Step 4, using the digital feature vector as a training set, wherein the training set comprises a positive training set and a negative training set;
step 5, constructing an expert integration model and a conformal evaluation classifier;
step 5.1, training single classifiers with the collected change submission files and determining the optimal hyper-parameters of each classifier; combining the five classifiers into an expert integration model, with soft voting selected as the voting mechanism of the expert integration model;
the five classifiers are support vector machine, random forest, k-nearest neighbors, logistic regression, and gradient boosting;
In this embodiment, single classifiers are first trained with the collected change submission files, and the optimal hyper-parameters of each classifier are determined; the five classifiers are then combined into an expert integration model. The conformal evaluation classifier gives an evaluation based on the calculated likelihood; if the evaluation is higher than the set threshold, the prediction result of the expert integration model is confirmed, and the expert integration model finally gives the prediction result, where 0 indicates that the change submission file is not defect data and 1 indicates that it is defect data. The digital feature vectors of the positive training set and the negative training set are input into the expert integration model and the conformal evaluation classifier to obtain the prediction results. The process is shown in FIG. 5.
Step 5.2, inputting the positive training set and the negative training set into the expert integration model for training to obtain the trained expert integration model;
Step 5.3, setting a threshold C for the conformal evaluation classifier, inputting the trained expert integration model into the conformal evaluation classifier to obtain the constructed conformal evaluation classifier, and inputting the digital feature vectors into the constructed conformal evaluation classifier for training to obtain the trained conformal evaluation classifier.
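A minimal sketch of steps 5.1 and 5.2 above, assuming scikit-learn as the implementation library (the patent does not name one); the hyper-parameter values are placeholders standing in for the tuned optimal hyper-parameters of step 5.1.

```python
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def build_expert_integration_model() -> VotingClassifier:
    """Combine the five classifiers with soft voting, as in step 5.1."""
    return VotingClassifier(
        estimators=[
            ("svm", SVC(probability=True)),  # soft voting needs probabilities
            ("rf", RandomForestClassifier(n_estimators=200)),
            ("knn", KNeighborsClassifier(n_neighbors=5)),
            ("lr", LogisticRegression(max_iter=1000)),
            ("gb", GradientBoostingClassifier()),
        ],
        voting="soft",  # average the predicted class probabilities
    )

# Usage (step 5.2): X holds the digital feature vectors, y the 0/1 labels
# from the negative and positive training sets.
# model = build_expert_integration_model().fit(X, y)
```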
As a preferred solution of this embodiment, collecting the change submission files as the sample set specifically comprises the following steps:
step 1.1, crawling repository names by means of the API provided by the hosting platform, and selecting the names of Java repositories whose star rating is higher than 10 according to the star ranking of repositories on the hosting platform;
step 1.2, using the repository names from step 1.1 and the hosting platform API's download rule for change submission files, splicing each repository name with the keywords to form a download link, and downloading high-quality vulnerability samples through the download links as the first original sample set;
the keywords are FIX CVE and the CVE ID;
step 1.3, judging whether the change submission description information in the first original sample set contains a CVE ID that maps to the CVE standard vulnerability library and whether the change submission file repairs the vulnerability described by that CVE ID; if not, the data is discarded; if so, the data is retained and taken as the first sample set;
step 1.4, crawling repository names by means of the API provided by the hosting platform, and selecting the names of repositories whose star ranking is higher than 1000 according to the star ranking of repositories on the hosting platform;
step 1.5, using the repository names obtained in step 1.4 and the hosting platform API's download rule for change submission files, splicing each repository name with the regular expression to form a download link, and downloading high-quality vulnerability samples through the download links as the second sample set;
In this embodiment, the regular expression is shown in FIG. 6; the hosting platform employed is GitHub.
step 1.6, taking the first sample set and the second sample set together as the sample set for model training;
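Steps 1.1 to 1.5 can be illustrated with the following minimal sketch, assuming GitHub's public commit-search endpoint as the hosting-platform API; the query string and the CVE pattern are illustrative stand-ins for the patent's keywords and the regular expression of FIG. 6.

```python
import re
import requests

SEARCH_API = "https://api.github.com/search/commits"
# Illustrative stand-in for the regular expression of FIG. 6:
CVE_RE = re.compile(r"CVE-\d{4}-\d{4,}", re.IGNORECASE)

def search_vulnerability_commits(repo_full_name: str, token: str) -> list:
    """Return commits of one repository whose messages look vulnerability-related."""
    query = f"repo:{repo_full_name} fix CVE"  # repository name spliced with keywords
    r = requests.get(
        SEARCH_API,
        params={"q": query, "per_page": 100},
        headers={"Authorization": f"token {token}",
                 "Accept": "application/vnd.github+json"},
    )
    r.raise_for_status()
    # Keep only commits whose message contains a well-formed CVE ID (step 1.3).
    return [item for item in r.json()["items"]
            if CVE_RE.search(item["commit"]["message"])]
```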
As a preferred solution of this embodiment, the label processing is to manually determine whether each change submission file in the sample set is defect data; if so, the change submission file is placed in the positive sample set, and if not, it is placed in the negative sample set for model training.
As a preferred solution of this embodiment, the first vectorization is performed according to the following steps:
dividing the change submission description information extracted in step 3 into a series of tokens by lexical analysis, discarding tokens that are Chinese-word descriptions, generating for each token a corresponding 50-dimensional vector with a tool, and splicing the vectors to obtain feature vector 1;
In this embodiment, the tool used is word2vec.
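A minimal sketch of this first vectorization with gensim's Word2Vec, assuming gensim as the word2vec tool; vector_size=50 matches the 50-dimensional token vectors described above, and the regular-expression tokenizer is a simple stand-in for full lexical analysis.

```python
import re
import numpy as np
from gensim.models import Word2Vec

def tokenize(text: str) -> list:
    """Rough lexical analysis: identifiers, numbers, and operator characters."""
    tokens = re.findall(r"[A-Za-z_][A-Za-z0-9_]*|\d+|[^\sA-Za-z0-9_]", text)
    return [t for t in tokens if t.isascii()]  # drop Chinese-word tokens

def vectorize_descriptions(descriptions: list) -> list:
    corpus = [tokenize(d) for d in descriptions]
    # 50-dimensional token vectors, matching the step above.
    w2v = Word2Vec(sentences=corpus, vector_size=50, min_count=1, seed=1)
    # Splice (concatenate) the vector of every token into feature vector 1.
    return [np.concatenate([w2v.wv[t] for t in toks])
            if toks else np.zeros(50, dtype=np.float32)
            for toks in corpus]
```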
As a preferred solution of this embodiment, the second vectorization is performed according to the following steps:
dividing the code set extracted in step 3 into a series of tokens by lexical analysis, generating for each token a corresponding 50-dimensional vector with a tool, and splicing the vectors to obtain feature vector 2;
tokens include identifiers, keywords, and operators.
Example 2:
This embodiment provides a method for collecting high-quality vulnerability data, carried out according to the following steps:
step one, collecting change submission files, and processing them according to step 2 and step 3 of embodiment 1 to obtain digital feature vectors for evaluation;
step two, inputting the digital feature vectors for evaluation into the trained expert integration model and the trained conformal evaluation classifier; the expert integration model predicts on the digital feature vectors and gives a prediction result, and the conformal evaluation classifier scores the prediction result; when the score is higher than 1-C, the corresponding change submission file is kept as high-quality vulnerability data; when the score is below 1-C, the corresponding change submission file is discarded.
In this embodiment, the conformal evaluation classifier is applied to calculate a confidence value pv for the class y of an input digital feature vector x. To calculate pv for each x, the calibration score of the weakly supervised model h for its prediction on the digital feature vector x is computed using a metric function A(x, y, h) specific to each weakly supervised model;
To calculate the confidence value pv, 10% of the digital feature vectors are reserved as a calibration set, and the calibration score of each weakly supervised model on the n input samples of the calibration set is calculated as
α_i = A(x_i, y_i, h), i = 1, 2, …, n
Given a new input sample x_(n+1), its calibration score is calculated with the metric function A(x, y, h) as
α_(n+1) = A(x_(n+1), y_p, h)
The pv of sample x_(n+1) is then
pv = ( COUNT{ i : α_i ≥ α_(n+1) } + 1 ) / ( n + 1 )
wherein:
pv represents the evaluation score of the conformal evaluation classifier on the input x;
COUNT represents the number of samples satisfying the condition;
i represents the ith sample;
y_i represents the label of the ith sample;
y_p represents the candidate sample label category;
x represents a digital feature vector;
x_(n+1) represents the (n+1)th digital feature vector;
C represents the set confidence threshold, set to 0.3 in this embodiment;
A(x, y, h) represents the metric function of the weakly supervised model on an input digital feature vector x;
If the calculated pv value is close to the lower bound 1/(n+1), the prediction is unreliable; if it is close to the upper bound 1, the prediction is reliable. In this embodiment, only predictions with pv greater than 1-C are considered, and the expert integration model performs best when C is set to 0.3.
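The conformal evaluation described above can be sketched as follows; the nonconformity measure A(x, y, h) used here, one minus the ensemble's predicted probability of the candidate label, is a common choice assumed for illustration, as the patent does not fix the metric function.

```python
import numpy as np

def nonconformity(model, X, y):
    """A(x, y, h): higher means the label y fits the model's view of x worse."""
    proba = model.predict_proba(X)  # shape (n_samples, 2)
    return 1.0 - proba[np.arange(len(y)), y]

def p_value(model, X_calib, y_calib, x_new, y_candidate):
    """pv = (COUNT{i : a_i >= a_(n+1)} + 1) / (n + 1), bounded in [1/(n+1), 1]."""
    alphas = nonconformity(model, X_calib, y_calib)  # calibration scores
    a_new = nonconformity(model, x_new.reshape(1, -1),
                          np.array([y_candidate]))[0]
    return (np.sum(alphas >= a_new) + 1) / (len(alphas) + 1)

# Collection rule of the method: keep a commit only when pv > 1 - C, C = 0.3.
def keep_sample(model, X_calib, y_calib, x_new, C: float = 0.3) -> bool:
    y_pred = int(model.predict(x_new.reshape(1, -1))[0])
    return p_value(model, X_calib, y_calib, x_new, y_pred) > 1 - C
```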
Actual measurement example 1:
Following the above technical scheme, the expert integration model and 5 vulnerability data crawling models (VCCFinder, ZvD, VulPecker, ZHOU et al., SABETTA et al.) were trained on the same data set. The results of the experiment are shown in FIG. 8. The horizontal axis represents the FPR (false positive rate) threshold, and the vertical axis represents the TPR (true positive rate). Under the same false positive rate threshold, the true positive rate of the expert integration model is higher than that of the other 5 data collection schemes; under the same true positive rate threshold, the false positive rate of the expert integration model is lower than that of the other 5 data collection schemes; the expert integration model thus maintains a high true positive rate even at a low false positive rate.
Actual measurement example 2:
Following the above technical scheme, the expert integration model was compared with single-classifier methods: the expert integration model and 5 independent classifiers (LR, RF, GB, SVM, and KNN) were trained on the same data set. The results are shown in FIG. 7. The horizontal axis represents the FPR (false positive rate) threshold, and the vertical axis represents the TPR (true positive rate). Under the same false positive rate threshold, the true positive rate of the expert integration model is higher than those of the other 5 independent classifiers; under the same true positive rate threshold, the false positive rate of the expert integration model is lower than those of the other 5 independent classifiers; the expert integration model thus maintains a high true positive rate even at a low false positive rate.
Actual measurement example 3:
Following the above technical scheme, the vulnerability collection method of the invention was compared with 3 existing vulnerability collection methods (ZvD, ZHOU et al., SABETTA et al.); the data obtained with the invention and with the other 3 vulnerability collection methods were applied to two vulnerability detection methods (VulDeePecker and μVulDeePecker). The experimental results based on the VulDeePecker vulnerability detection method are shown in FIG. 9, which reports the detection effect of VulDeePecker under the different data collection methods as improvements of varying magnitude on four evaluation indexes: Accuracy, Precision, Recall, and F1 score. The horizontal axis shows the four evaluation indexes, the vertical axis shows the improvement over the reference model, and a negative value indicates degraded performance. On the four evaluation indexes, the data collected by the invention improves the detection effect of VulDeePecker by between 10.2% and 12%, whereas the data collected by the other three vulnerability data collection schemes improves it by at most 6%; the improvement in final detection effect brought by the collection method of the invention is thus nearly twice that of the other three methods, so the collection method is superior to the existing vulnerability collection methods.
The experiment based on the μVulDeePecker vulnerability detection method is shown in FIG. 10, which reports the detection effect of μVulDeePecker under the different data collection methods, again as improvements of varying magnitude on the four evaluation indexes Accuracy, Precision, Recall, and F1 score. The horizontal axis shows the four evaluation indexes, the vertical axis shows the improvement over the reference model, and a negative value indicates degraded performance. On the four evaluation indexes, the data collected by the invention improves the detection effect of μVulDeePecker by between 12% and 14%, whereas the data collected by the other three vulnerability data collection schemes improves it by at most 7%; the improvement is again nearly twice that of the other three methods, so the collection method of the invention is better than the existing vulnerability collection methods.
Actual measurement example 4:
Following the above technical scheme, a plain integrated classifier and the expert integration model with the conformal evaluation classifier were trained on the same data set. The experimental results are shown in FIG. 11; the horizontal axis represents the experiment number and the vertical axis represents accuracy. Compared with the plain integrated classifier, the accuracy gain of the expert integration model with the conformal evaluation classifier is maintained at 4-12%, and its accuracy reaches up to 91%; the conformal evaluation classifier thus plays a key role in the model's classification.

Claims (6)

1. A method for constructing a high-quality vulnerability data collection model is characterized by comprising the following steps:
step 1, collecting a change submission file as a sample set, and performing label processing on the sample set to obtain a positive sample set and a negative sample set;
the change submission files comprise vulnerability-related change submission files submitted to the CVE and vulnerability-related change submission files not submitted to the CVE;
step 2, extracting the numerical features of the change submission files in the sample set, extracting the change submission description information of the change submission files in the sample set, extracting the code blocks in the change submission files in the sample set, and storing the code blocks in a code set;
the numerical features comprise the star rating, total number of commits, total number of releases, number of contributors of the repository, contribution rate, and total number of branches;
the code blocks comprise the deleted lines of code and added lines of code in the modified files;
step 3, performing a first vectorization on the change submission description information to obtain feature vector 1, performing a second vectorization on the code set to obtain feature vector 2, clipping and tiling the numerical features, feature vector 1, and feature vector 2 into a one-dimensional feature, and taking the one-dimensional feature as the digital feature vector of the change submission file;
step 4, taking the digital feature vector as a training set, wherein the training set comprises a positive training set and a negative training set;
step 5, constructing an expert integration model and a conformal evaluation classifier;
step 5.1, training single classifiers with the collected change submission files and determining the optimal hyper-parameters of each classifier; combining the five classifiers into an expert integration model, with soft voting selected as the voting mechanism of the expert integration model;
the five classifiers are support vector machine, random forest, k-nearest neighbors, logistic regression, and gradient boosting;
step 5.2, inputting the positive training set and the negative training set into the expert integration model for training to obtain the trained expert integration model;
step 5.3, setting a threshold C for the conformal evaluation classifier, inputting the trained expert integration model into the conformal evaluation classifier to obtain the constructed conformal evaluation classifier, and inputting the digital feature vectors into the constructed conformal evaluation classifier for training to obtain the trained conformal evaluation classifier.
2. The method for constructing a high-quality vulnerability data collection model according to claim 1, wherein collecting the change submission files as the sample set specifically comprises the following steps:
step 1.1, crawling repository names by means of the API provided by the hosting platform, and selecting the names of Java repositories whose star rating is higher than 10 according to the star ranking of repositories on the hosting platform;
step 1.2, using the repository names from step 1.1 and the hosting platform API's download rule for change submission files, splicing each repository name with the keywords to form a download link, and downloading high-quality vulnerability samples through the download links as the first original sample set;
the keywords are FIX CVE and the CVE ID;
step 1.3, judging whether the change submission description information in the first original sample set contains a CVE ID that maps to the CVE standard vulnerability library and whether the change submission file repairs the vulnerability described by that CVE ID; if not, the data is discarded; if so, the data is retained and taken as the first sample set;
step 1.4, crawling repository names by means of the API provided by the hosting platform, and selecting the names of repositories whose star ranking is higher than 1000 according to the star ranking of repositories on the hosting platform;
step 1.5, using the repository names obtained in step 1.4 and the hosting platform API's download rule for change submission files, splicing each repository name with the regular expression to form a download link, and downloading high-quality vulnerability samples through the download links as the second sample set;
step 1.6, taking the first sample set and the second sample set together as the sample set for model training.
3. The method for constructing a high-quality vulnerability data collection model according to claim 1, wherein the label processing is to manually determine whether each change submission file in the sample set is defect data; if so, the change submission file is placed in the positive sample set, and if not, it is placed in the negative sample set for model training.
4. The method for constructing a high-quality vulnerability data collection model according to claim 1, wherein the first vectorization is performed according to the following steps:
step 4.1, extracting the highly relevant numerical features of the repository to which each change submission file in the training set belongs;
step 4.2, dividing the change submission description information extracted in step 3 into a series of tokens by lexical analysis, discarding tokens that are Chinese-word descriptions, generating for each token a corresponding 50-dimensional vector with a tool, and splicing the vectors to obtain feature vector 1.
5. The method for constructing a high-quality vulnerability data collection model according to claim 1, wherein the second vectorization is performed according to the following steps:
step 4.2, dividing the code set extracted in step 3 into a series of tokens by lexical analysis, generating for each token a corresponding 50-dimensional vector with a tool, and splicing the vectors to obtain feature vector 2;
the tokens comprise identifiers, keywords, and operators.
6. A method for collecting high-quality vulnerability data, characterized by comprising the following steps:
step one, collecting change submission files, and processing them according to step 2 and step 3 of claim 1 to obtain digital feature vectors for evaluation;
step two, inputting the digital feature vectors for evaluation into the trained expert integration model and the trained conformal evaluation classifier; the expert integration model predicts on the digital feature vectors and gives a prediction result, and the conformal evaluation classifier scores the prediction result; when the score is higher than 1-C, the corresponding change submission file is kept as high-quality vulnerability data; when the score is below 1-C, the corresponding change submission file is discarded.
CN202110424826.0A 2021-04-20 2021-04-20 Construction method and collection method of high-quality vulnerability data collection model Active CN113221960B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110424826.0A CN113221960B (en) 2021-04-20 2021-04-20 Construction method and collection method of high-quality vulnerability data collection model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110424826.0A CN113221960B (en) 2021-04-20 2021-04-20 Construction method and collection method of high-quality vulnerability data collection model

Publications (2)

Publication Number Publication Date
CN113221960A CN113221960A (en) 2021-08-06
CN113221960B true CN113221960B (en) 2023-04-18

Family

ID=77088249

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110424826.0A Active CN113221960B (en) 2021-04-20 2021-04-20 Construction method and collection method of high-quality vulnerability data collection model

Country Status (1)

Country Link
CN (1) CN113221960B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114120120A (en) * 2021-11-25 2022-03-01 广东电网有限责任公司 Method, device, equipment and medium for detecting illegal building based on remote sensing image
CN115048316B (en) * 2022-08-15 2022-12-09 中国电子科技集团公司第三十研究所 Semi-supervised software code defect detection method and device
CN116302043B (en) * 2023-05-25 2023-10-10 深圳市明源云科技有限公司 Code maintenance problem detection method and device, electronic equipment and readable storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109886020A (en) * 2019-01-24 2019-06-14 燕山大学 Software vulnerability automatic classification method based on deep neural network
CN110197286A (en) * 2019-05-10 2019-09-03 武汉理工大学 A kind of Active Learning classification method based on mixed Gauss model and sparse Bayesian

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11520990B2 (en) * 2019-04-03 2022-12-06 RELX Inc. Systems and methods for dynamically displaying a user interface of an evaluation system processing textual data

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109886020A (en) * 2019-01-24 2019-06-14 燕山大学 Software vulnerability automatic classification method based on deep neural network
CN110197286A (en) * 2019-05-10 2019-09-03 武汉理工大学 A kind of Active Learning classification method based on mixed Gauss model and sparse Bayesian

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Model-agnostic nonconformity functions for conformal classification; Ulf Johansson; 2017 International Joint Conference on Neural Networks (IJCNN); 2017-07-03; full text *
Urban PM2.5 concentration prediction based on parallel random forests; Ren Cairong; China Masters' Theses Full-text Database; 2018-10-15; full text *

Also Published As

Publication number Publication date
CN113221960A (en) 2021-08-06

Similar Documents

Publication Publication Date Title
CN113221960B (en) Construction method and collection method of high-quality vulnerability data collection model
CN111309912B (en) Text classification method, apparatus, computer device and storage medium
CN110688288B (en) Automatic test method, device, equipment and storage medium based on artificial intelligence
Kitchenham et al. Why comparative effort prediction studies may be invalid
CN108459955B (en) Software defect prediction method based on deep self-coding network
CN110175697B (en) Adverse event risk prediction system and method
CN111459799A (en) Software defect detection model establishing and detecting method and system based on Github
CN109871688B (en) Vulnerability threat degree evaluation method
CN110689368B (en) Method for designing advertisement click rate prediction system in mobile application
CN113779272A (en) Data processing method, device and equipment based on knowledge graph and storage medium
CN111460250A (en) Image data cleaning method, image data cleaning device, image data cleaning medium, and electronic apparatus
Zhang et al. Large-scale empirical study of important features indicative of discovered vulnerabilities to assess application security
CN111047173B (en) Community credibility evaluation method based on improved D-S evidence theory
CN107368526A (en) A kind of data processing method and device
CN108614778B (en) Android App program evolution change prediction method based on Gaussian process regression
CN111199469A (en) User payment model generation method and device and electronic equipment
CN113537807A (en) Enterprise intelligent wind control method and device
CN114416573A (en) Defect analysis method, device, equipment and medium for application program
CN112888008B (en) Base station abnormality detection method, device, equipment and storage medium
CN113722230B (en) Integrated evaluation method and device for vulnerability mining capability of fuzzy test tool
CN115686995A (en) Data monitoring processing method and device
CN114238768A (en) Information pushing method and device, computer equipment and storage medium
CN112148605B (en) Software defect prediction method based on spectral clustering and semi-supervised learning
CN115080386A (en) Scene effectiveness analysis method and device based on automatic driving function requirement
CN114186644A (en) Defect report severity prediction method based on optimized random forest

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant