CN114329482A

CN114329482A - Ranking-based system and method for recovering links between C/C++ vulnerabilities and their patches

Info

Publication number: CN114329482A
Application number: CN202111560155.7A
Authority: CN
Inventors: 孙小兵; 杨云帆; 薄莉莉; 魏颖; 李斌
Original assignee: Yangzhou University
Current assignee: Yangzhou University
Priority date: 2021-12-20
Filing date: 2021-12-20
Publication date: 2022-04-12

Abstract

The invention discloses a ranking-based system and method for recovering links between C/C++ vulnerabilities and their patches. The method comprises the following steps: collecting vulnerability data; generating vulnerability-related features, including named entity recognition based on BERT + Bi-LSTM + CRF, automatic vulnerability detection based on deep learning, and text similarity calculation based on TF-IDF + cosine similarity; and ranking the candidate patches of each vulnerability by these features with the ranking model RankNet, then recovering the links between vulnerabilities and patches with manual verification. The method makes better use of the semantic information in vulnerability descriptions and the type information of patch code, fully mines the correlation between vulnerabilities and patches, and better balances vulnerability patch coverage against manual workload.

Description

Ranking-based system and method for recovering links between C/C++ vulnerabilities and their patches

Technical Field

The invention belongs to the field of software security, and particularly relates to a ranking-based system and method for recovering links between C/C++ vulnerabilities and their patches.

Background

Software vulnerabilities weaken the security of computer software and can cause data loss, tampering, privacy disclosure, and similar problems; more than half of them relate to C/C++ projects. To support the detection, analysis, and diagnosis of vulnerabilities in C/C++ projects, the descriptions of open-source software vulnerabilities and the code of their fixing patches need to be collected. However, most open-source vulnerability fixes are submitted silently, and the entries stored in existing vulnerability repositories such as NVD and X-Force often lack a link to the fixing patch, which makes patch code hard to collect. Link recovery between C/C++ vulnerabilities and their patches is therefore needed. Most previous link recovery work targets defects and patches, and work targeting vulnerabilities and patches is rare. Most of that work uses matching-based methods, which discard a submission whenever its information does not completely match the defect information; many patches with partially matching information are thus discarded, which reduces the coverage of defect-patch link recovery. Many recent works explore the construction of vulnerability datasets, but they often fail to make full use of vulnerability information: they use only the information directly provided in the vulnerability repository and the software repository and do not consider deeper associations between vulnerabilities and patches, so the link recovery results are unsatisfactory.

Some existing work builds vulnerability datasets with a ranking-based approach. For example, the document "Xin Tan, Yuan Zhang, Chenyuan Mi, et al. Locating the Security Patches for Disclosed OSS Vulnerabilities with Vulnerability-Commit Correlation Ranking [C]. Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security, 2021: 3282-3299" extracts vulnerability IDs, vulnerability locations, vulnerability type correlations, and vulnerability text statistics to rank submissions. However, it does not make full use of code modification information when extracting vulnerability type correlation; it uses the vulnerability description and submission information only through statistical indicators, without considering text similarity; and it performs no counter-example reduction before ranking. As a result, patches rank low among the submissions, manual verification requires considerable effort, and patch coverage is low.

Disclosure of Invention

Purpose of the invention: in view of the above problems in the prior art, the invention aims to provide a ranking-based system and method for recovering links between C/C++ vulnerabilities and their patches, which can recover those links and assist downstream vulnerability detection, analysis, diagnosis, and repair tasks.

The technical scheme is as follows: in order to achieve the purpose, the invention specifically adopts the following technical scheme:

The invention provides a ranking-based system for recovering links between C/C++ vulnerabilities and their patches, which comprises:

a vulnerability data acquisition module, used for extracting basic data of vulnerabilities related to open-source C/C++ projects from the vulnerability databases NVD and X-Force, extracting basic data of the related submissions from the software repository of the project corresponding to each vulnerability, and establishing vulnerability-submission pairs from the basic data of the vulnerabilities and of the submissions;

a vulnerability-related feature generation module, used for generating vulnerability-related features for each vulnerability-submission pair from the basic data of the vulnerability and of the related submission;

a ranking module, used for training the neural-network-based ranking model RankNet on the vulnerability-related features to obtain a trained ranking model RankNet, inputting the vulnerability-submission pairs related to a vulnerability whose patch is unknown into the trained model, calculating the score of each such pair with the model, and ranking the pairs by score to obtain ranked vulnerability-submission pairs;

a manual inspection module, used for manually inspecting the ranked vulnerability-submission pairs in ranking order and judging whether the submission in each pair is a patch submission; if so, the vulnerability has been repaired, and the link between the vulnerability and its patch submission is recovered; if not, the vulnerability has not been repaired, and no link is recovered between the vulnerability and the submission.

Furthermore, in the vulnerability data acquisition module,

for the vulnerability database NVD, vulnerabilities related to C/C++ projects are screened according to whether the vulnerability description mentions a .c/.cpp file, and the vulnerability ID, disclosure date, vulnerability description, reference links, software version configurations affected by the vulnerability, and vulnerability type of each related vulnerability entry are obtained; for the vulnerability database X-Force, the description of the related vulnerability is obtained; for the software repository of the project corresponding to the vulnerability, submissions are screened according to whether they modify a .c/.cpp file and whether they were made after the affected version, and the code change, submission date, submission title, and submission message of each submission meeting the screening conditions are obtained, the submission title and submission message being spliced into a submission text; for a given vulnerability entry, the data extracted from NVD and X-Force are taken as vulnerability data, the data extracted from the n qualifying submissions in the software repository of the corresponding project are taken as submission data, and the vulnerability data are merged once with the submission data of each of the n submissions, finally yielding n vulnerability-submission pairs containing vulnerability information and submission information.
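The screening and pairing step can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `Submission` class, the `build_pairs` helper, and the use of a single release date for the affected version are all assumptions introduced here.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Submission:
    """Hypothetical container for the data kept for one submission (commit)."""
    sha: str
    c_date: date
    c_title: str
    c_message: str
    changed_files: list

    @property
    def c_text(self) -> str:
        # Submission title and message spliced into one submission text.
        return f"{self.c_title}\n{self.c_message}"


def build_pairs(v_data, submissions, affected_release):
    """Pair one vulnerability with every submission that passes both screens."""
    kept = [
        s for s in submissions
        # Screen 1: the submission modifies at least one .c/.cpp file.
        if any(path.endswith((".c", ".cpp")) for path in s.changed_files)
        # Screen 2 (assumed form): the submission is dated after the affected release.
        and s.c_date > affected_release
    ]
    return [(v_data, s) for s in kept]
```

With n qualifying submissions, `build_pairs` returns n pairs that all share the same vulnerability data, which mirrors the one-to-n merge described above.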

Further, in the vulnerability-related characteristic generation module,

generating vulnerability identifier features: for each vulnerability-submission pair, based on the reference links extracted from the vulnerability database NVD, regular expressions matching Jira, Bugzilla, and bugs.eclipse links are used to extract the existing defect IDs; combined with the corresponding vulnerability ID, it is judged whether the submission text contains the related defect IDs and vulnerability ID; if so, vulnerability identifier features are generated, comprising: a defect ID feature and a vulnerability ID feature; if not, the pair has no vulnerability identifier features;
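A rough sketch of the identifier features follows. The three patterns are assumptions written for this example and are not the expressions used in the patent; the 0/1 encoding of the features and the helper names are likewise hypothetical.

```python
import re

# Hypothetical patterns for the three tracker families named in the text.
DEFECT_ID_PATTERNS = [
    re.compile(r"jira[^\s]*/browse/([A-Z][A-Z0-9]*-\d+)", re.IGNORECASE),
    re.compile(r"bugzilla[^\s]*[?&]id=(\d+)", re.IGNORECASE),
    re.compile(r"bugs\.eclipse[^\s]*[?&]id=(\d+)", re.IGNORECASE),
]


def extract_defect_ids(reference_links):
    """Collect defect IDs from NVD reference links that point at a bug tracker."""
    ids = set()
    for link in reference_links:
        for pattern in DEFECT_ID_PATTERNS:
            match = pattern.search(link)
            if match:
                ids.add(match.group(1))
    return ids


def identifier_features(cve_id, reference_links, submission_text):
    """Return 0/1 flags: does the submission text mention a defect ID or the CVE ID?"""
    text = submission_text.lower()
    defect_ids = extract_defect_ids(reference_links)
    bug_hit = any(
        re.search(rf"\b{re.escape(d.lower())}\b", text) for d in defect_ids
    )
    return {"Bug_ID": int(bug_hit), "CVE_ID": int(cve_id.lower() in text)}
```

The word-boundary match keeps a defect ID such as 1234 from matching inside a longer number in the submission text.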

generating vulnerability location features: for each vulnerability-submission pair, based on the vulnerability description extracted from the vulnerability database NVD, the file location where the vulnerability occurs is extracted using regular expressions that match .c and .cpp file names, and named entity recognition is then performed on the vulnerability description to extract the function location where the vulnerability occurs; the function location is extracted as follows: the vulnerability description is converted into an input sequence for the BERT pre-trained model, which computes a representation vector; the representation vector is input into the Bi-LSTM layer of the named entity recognition model to obtain text features; the text features are then input into the CRF layer of the named entity recognition model, which labels the text to obtain the function location entity of the vulnerability, and this entity is taken as the function location of the vulnerability;

the file location and function location extracted from the vulnerability description are compared with the file locations and function locations extracted directly from the submission's code change, to judge whether the file and function named in the vulnerability description were changed; if so, vulnerability location features are generated, comprising: a file location feature and a function location feature; if not, the pair has no vulnerability location features;
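The comparison of described and changed locations can be illustrated as below. The file-name pattern is an assumption made for this sketch, and the function names recognised by the BERT + Bi-LSTM + CRF model are simply passed in as a list, since the recogniser itself is outside the scope of a short example.

```python
import re

# Hypothetical pattern for .c/.cpp file names mentioned in a description.
FILE_PATTERN = re.compile(r"[\w./+-]+\.(?:cpp|c)\b")


def file_locations(description):
    """Return the base names of the .c/.cpp files mentioned in the description."""
    return {m.group(0).rsplit("/", 1)[-1] for m in FILE_PATTERN.finditer(description)}


def location_features(description, described_functions, changed_files, changed_functions):
    """0/1 flags: did the submission change a described file or a described function?"""
    changed_names = {path.rsplit("/", 1)[-1] for path in changed_files}
    return {
        "File_Location": int(bool(file_locations(description) & changed_names)),
        "Func_Location": int(bool(set(described_functions) & set(changed_functions))),
    }
```

Comparing base names rather than full paths is a design choice of this sketch: descriptions usually name a file without its directory, while code changes report repository-relative paths.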

generating vulnerability type features: for each vulnerability-submission pair, based on the submission's code change, for each .c (C project)/.cpp (C++ project) file among the changed files, the pre-change code and post-change code of the changed functions are obtained, and function-level vulnerability detection based on a pre-trained graph convolutional neural network is performed on each; if a certain type of vulnerability is detected in the pre-change code but not in the post-change code, the code change is judged to have repaired the vulnerability, and the vulnerability type corresponding to the vulnerability is obtained; the specific process is as follows:

for each affected .c (C project)/.cpp (C++ project) file, the pre-change code of the file is first divided into n functions; the abstract syntax tree (AST), control flow graph (CFG), and data flow graph (DFG) of each function are extracted to represent the syntactic and semantic information of the function; the source code of the function is encoded into a vector and input into the pre-trained graph convolutional neural network model for vulnerability detection; if the model reports that the function is vulnerable, the vulnerability type is extracted; if the model reports that the function has no vulnerability, the next function is examined; after all functions in the file have been examined, a vulnerability dictionary consisting of the function names and vulnerability types of the file is obtained;

the post-change code of the file is then divided into n functions and processed in the same way, yielding a vulnerability dictionary consisting of the function names and vulnerability types of the changed file;

the vulnerability dictionaries before and after the change are compared: if they do not differ, the code change is judged not to have repaired a vulnerability; if a function and its vulnerability in the pre-change dictionary have disappeared from the post-change dictionary, the code change is judged to have repaired that vulnerability, and the vulnerability type of the repaired vulnerability is obtained; this vulnerability type is compared with the vulnerability type extracted from the vulnerability database NVD to generate the vulnerability type feature;
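The dictionary comparison can be sketched as follows, taking the detector's output as given. The two dictionaries map function names to a detected vulnerability type; representing types as CWE strings and features as 0/1 flags are assumptions of this example, as are the helper names.

```python
def fixed_vulnerabilities(before, after):
    """Entries detected before the change that are no longer detected after it."""
    return {fn: vtype for fn, vtype in before.items() if after.get(fn) != vtype}


def type_features(before, after, nvd_cwe):
    """Flags for 'the change repaired something' and 'the repaired type matches NVD'."""
    fixed = fixed_vulnerabilities(before, after)
    return {
        "Vul_Fix": int(bool(fixed)),
        "Vul_Type": int(nvd_cwe in fixed.values()),
    }
```

The first flag corresponds to the vulnerability repair possibility feature that the text derives from the same detection result; the second is the vulnerability type feature.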

generating vulnerability text similarity features: for each vulnerability-submission pair, based on the vulnerability description extracted from the vulnerability database NVD, the text of the function location entity that was identified by named entity recognition when generating the vulnerability location features is deleted; the description is then tokenized, the tokens are combined into a text sequence, and stop-word removal, lemmatization, and synonym replacement are applied to the text sequence to obtain the text sequence N_Sequence;

the following operations are performed on the vulnerability description extracted from the vulnerability database X-Force and on the submission text, respectively:

tokenization, combination of the tokens into a text sequence, and then stop-word removal, lemmatization, and synonym replacement on the text sequence;

finally yielding the text sequence X_Sequence and the text sequence C_Sequence, respectively;

the importance of each word in the text sequence N_Text (that is, the sequence N_Sequence obtained above) is calculated and quantified with the TF-IDF algorithm to generate the vector N_Vector corresponding to N_Text, as follows:

calculating the term frequency TF of the word w in the text sequence t:

$$\mathrm{TF} = \frac{\mathit{w\_num}}{\mathit{t\_num}}$$

wherein w_num is the number of times the word w appears in the text sequence t, and t_num is the total number of words in the text sequence t;

calculating the inverse document frequency IDF of the word w in the text sequence set d:

$$\mathrm{IDF} = \log\frac{|d|}{|\{t \in d : w \in t\}|}$$

wherein |d| is the total number of text sequences in the text sequence set d, and |{t ∈ d : w ∈ t}| is the number of text sequences containing the word w;

calculating the term frequency-inverse document frequency TF-IDF of the word w: TF-IDF = TF × IDF;

vectorizing the text sequence N_Text into the vector N_Vector according to the TF-IDF value of each word;

obtaining the vector X_Vector of the text sequence X_Text and the vector C_Vector of the text sequence C_Text in the same way;

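and finally, calculating the text similarity N_Similarity between the text sequence N_Text and the text sequence C_Text using cosine similarity:

$$\mathit{N\_Similarity} = \frac{\mathit{N\_Vector} \cdot \mathit{C\_Vector}}{\lVert \mathit{N\_Vector} \rVert \, \lVert \mathit{C\_Vector} \rVert}$$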

the text similarity X_Similarity between the text sequence X_Text and the text sequence C_Text is calculated with the same formula, and vulnerability text similarity features are generated from the values of N_Similarity and X_Similarity, comprising: the similarity feature between the NVD vulnerability description and the submission text, and the similarity feature between the X-Force vulnerability description and the submission text;
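The TF-IDF vectorization and cosine similarity described above can be sketched in plain Python. The function names are hypothetical, the input is assumed to be already tokenized and normalized, and the IDF is taken without smoothing, which is one reading of the definitions in the text.

```python
import math
from collections import Counter


def tf_idf_vectors(sequences):
    """Turn each token sequence in the set d into a TF-IDF vector over a shared vocabulary."""
    vocab = sorted({w for seq in sequences for w in seq})
    n_docs = len(sequences)
    # Number of sequences containing each word.
    doc_freq = Counter(w for seq in sequences for w in set(seq))
    vectors = []
    for seq in sequences:
        counts, total = Counter(seq), len(seq)
        vectors.append([
            (counts[w] / total) * math.log(n_docs / doc_freq[w]) if total else 0.0
            for w in vocab
        ])
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

Note that with this unsmoothed IDF a word occurring in every sequence of the set gets weight zero, so the set d needs to hold more than the two sequences being compared, for example one description together with all candidate submission texts.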

generating the vulnerability repair possibility feature: for each vulnerability-submission pair, based on the submission's code change, whether the code change repairs a vulnerability is judged from the vulnerability detection result obtained when generating the vulnerability type feature; if the code change repairs a vulnerability, the vulnerability repair possibility feature is generated; if not, the pair has no vulnerability repair possibility feature;

generating the date difference feature: for each vulnerability-submission pair, the difference between the vulnerability disclosure time extracted from the vulnerability database NVD and the submission time is calculated:

Date_Diff = N_Date - C_Date

wherein Date_Diff is the difference between the two, N_Date is the vulnerability disclosure time, and C_Date is the submission time;

the interval containing the difference is determined from the results of a preliminary investigation, and the date difference feature is generated from the proportion of vulnerability-patch pairs in that interval among all investigated objects; a vulnerability-patch pair is a vulnerability together with its corresponding patch, and the investigated objects are vulnerability-patch pairs.
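One way to picture the date difference feature is a lookup into a table of interval proportions. The interval boundaries and proportions below are invented placeholders, not the figures from the preliminary investigation, and measuring the difference in days is also an assumption.

```python
from bisect import bisect_right
from datetime import date

# Placeholder interval boundaries in days: (-inf,0), [0,30), [30,90), [90,365), [365,inf).
BOUNDS = [0, 30, 90, 365]
# Placeholder share of known vulnerability-patch pairs falling in each interval.
PROPORTIONS = [0.05, 0.50, 0.25, 0.15, 0.05]


def date_diff_feature(n_date, c_date):
    """Map Date_Diff = N_Date - C_Date (in days) to the proportion of its interval."""
    date_diff = (n_date - c_date).days
    return PROPORTIONS[bisect_right(BOUNDS, date_diff)]
```

A submission made ten days before disclosure falls in the second interval and would receive that interval's proportion as its feature value.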

Further, in the ranking module,

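for a vulnerability whose patch is known, any two of its related vulnerability-submission pairs <i, j> are taken, and the true probability that vulnerability-submission pair i ranks before vulnerability-submission pair j is calculated:

$$\bar{P}_{ij} = \frac{1}{2}\left(1 + S_{ij}\right)$$

wherein $S_{ij} = 1$ if pair i should rank before pair j, $S_{ij} = -1$ if pair j should rank before pair i, and $S_{ij} = 0$ if the two rank equally;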

then, according to the vulnerability-related features, the two vulnerability-submission pairs <i, j> are each represented as a feature vector and input into the scoring function S:

$$S = \sum_{n=1}^{k} w_n f_n$$

wherein k is the total number of features; $f_n$ is the nth feature score of the vulnerability-submission pair; and $w_n$ is the scoring coefficient of the nth feature of the vulnerability-submission pair, generated by the ranking model RankNet;

the scoring function S yields the scores <S_i, S_j> of the two vulnerability-submission pairs <i, j>, and the predicted probability $P_{ij}$ that vulnerability-submission pair i ranks before vulnerability-submission pair j is calculated with a sigmoid function:

$$P_{ij} = \frac{1}{1 + e^{-(S_i - S_j)}}$$

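the cross entropy C between the true probability $\bar{P}_{ij}$ and the predicted probability $P_{ij}$ is used as the loss function of the model:

$$C = -\bar{P}_{ij}\log P_{ij} - \left(1 - \bar{P}_{ij}\right)\log\left(1 - P_{ij}\right)$$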

the model is then trained to minimize the loss function;

the trained ranking model RankNet is obtained through the above steps;

the vulnerability-submission pairs related to a vulnerability whose patch is unknown are input into the trained ranking model RankNet, the score of each such pair is calculated, and the pairs are ranked by score to obtain ranked vulnerability-submission pairs.
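The pairwise training loop can be sketched with a linear scorer and plain gradient descent. This is a simplified stand-in for RankNet under stated assumptions: the scorer is the weighted feature sum from the text, the sigmoid has unit slope, and the learning rate, epoch count, and function names are arbitrary choices made for the example.

```python
import math


def score(weights, features):
    """Scoring function S: weighted sum of the k feature scores of one pair."""
    return sum(w * f for w, f in zip(weights, features))


def predicted_probability(s_i, s_j):
    """Sigmoid of the score difference: probability that i ranks before j."""
    return 1.0 / (1.0 + math.exp(-(s_i - s_j)))


def cross_entropy(p_true, p_pred, eps=1e-12):
    """Cross entropy between the true and predicted pairwise probabilities."""
    return -p_true * math.log(p_pred + eps) - (1.0 - p_true) * math.log(1.0 - p_pred + eps)


def train(pairs, n_features, lr=0.1, epochs=200):
    """pairs: list of (features_i, features_j, p_true); returns learned weights."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for f_i, f_j, p_true in pairs:
            p = predicted_probability(score(weights, f_i), score(weights, f_j))
            # For a linear scorer, dC/dw_n = (p - p_true) * (f_i[n] - f_j[n]).
            for n in range(n_features):
                weights[n] -= lr * (p - p_true) * (f_i[n] - f_j[n])
    return weights


def rank(weights, candidates):
    """candidates: list of (submission_id, features); highest score first."""
    return sorted(candidates, key=lambda c: score(weights, c[1]), reverse=True)
```

Training pairs come from vulnerabilities whose patch is known, with the patch submission as i and a non-patch submission as j (true probability 1); `rank` is then applied to the candidates of a vulnerability whose patch is unknown.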

The invention provides a ranking-based method for recovering links between C/C++ vulnerabilities and their patches, which comprises the following steps: step 1, extracting basic data of vulnerabilities related to open-source C/C++ projects from the vulnerability databases NVD and X-Force, extracting basic data of the related submissions from the software repository of the project corresponding to each vulnerability, and establishing vulnerability-submission pairs from the basic data of the vulnerabilities and of the submissions; step 2, for each vulnerability-submission pair, generating vulnerability-related features from the basic data of the vulnerability and of the related submission; step 3, training the neural-network-based ranking model RankNet on the vulnerability-related features to obtain a trained ranking model RankNet, inputting the vulnerability-submission pairs related to a vulnerability whose patch is unknown into the trained model, calculating the score of each such pair with the model, and ranking the pairs by score to obtain ranked vulnerability-submission pairs; step 4, manually inspecting the ranked vulnerability-submission pairs in ranking order and judging whether the submission in each pair is a patch submission; if so, the vulnerability has been repaired, and the link between the vulnerability and its patch submission is recovered; if not, the vulnerability has not been repaired, and no link is recovered between the vulnerability and the submission.
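Step 4 can be pictured as walking down the ranked list until a reviewer confirms a patch. In this sketch the reviewer is modelled by a hypothetical `is_patch` callback, and stopping at the first confirmed patch is an assumption; the text itself does not say when inspection ends.

```python
def recover_link(ranked_submissions, is_patch):
    """Inspect submissions in ranked order; return the confirmed patch and the effort spent."""
    for inspected, submission in enumerate(ranked_submissions, start=1):
        if is_patch(submission):
            # Link recovered: the vulnerability is tied to this patch submission.
            return submission, inspected
    # No submission was confirmed, so no link is recovered.
    return None, len(ranked_submissions)
```

The second return value counts how many submissions had to be inspected, which is the manual workload that a better ranking is meant to reduce.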

Further, step 1 is carried out in the same way as described above for the vulnerability data acquisition module, step 2 in the same way as described above for the vulnerability-related feature generation module, and step 3 in the same way as described above for the ranking module.

The invention provides a computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the above ranking-based method for recovering links between C/C++ vulnerabilities and their patches.

The invention provides a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the above ranking-based method for recovering links between C/C++ vulnerabilities and their patches.

Beneficial effects:

1. targeted features better reflect the implicit relation within each vulnerability-submission pair;

2. two vulnerability data sources are used, and different weights are given to different vulnerability features in the ranking stage, so that cases where part of the information in a vulnerability-submission pair is missing are handled better;

3. counter-example reduction is performed before the vulnerability-submission pairs are ranked, which greatly reduces the manual inspection workload ultimately required without reducing recall.

Drawings

FIG. 1 is a flowchart of the ranking-based method for recovering links between C/C++ vulnerabilities and their patches according to the present invention.

FIG. 2 is a diagram of the preliminary investigation of date differences for vulnerability-submission pairs in the present invention.

FIG. 3 is a flowchart of feature generation for vulnerability-submission pairs in the present invention.

FIG. 4 is a flowchart of the ranking of vulnerability-submission pairs in the present invention.

Detailed Description

The following describes embodiments of the present invention with reference to the drawings.

Example one

As shown in FIG. 1, this embodiment discloses a ranking-based system for recovering links between C/C++ vulnerabilities and their patches, which comprises:

a vulnerability data acquisition module, used for extracting basic data of vulnerabilities related to open-source C/C++ projects from the vulnerability databases NVD and X-Force, extracting basic data of the related submissions from the software repository of the project corresponding to each vulnerability, and establishing vulnerability-submission pairs from the basic data of the vulnerabilities and of the submissions;

a vulnerability-related feature generation module, used for generating vulnerability-related features for each vulnerability-submission pair from the basic data of the vulnerability and of the related submission;

a ranking module, used for training the neural-network-based ranking model RankNet on the vulnerability-related features to obtain a trained ranking model RankNet, inputting the vulnerability-submission pairs related to a vulnerability whose patch is unknown into the trained model, calculating the score of each such pair with the model, and ranking the pairs by score to obtain ranked vulnerability-submission pairs;

a manual inspection module, used for manually inspecting the ranked vulnerability-submission pairs in ranking order and judging whether the submission in each pair is a patch submission; if so, the vulnerability has been repaired, and the link between the vulnerability and its patch submission is recovered; if not, the vulnerability has not been repaired, and no link is recovered between the vulnerability and the submission.

Furthermore, in the vulnerability data acquisition module,

for the vulnerability database NVD, vulnerabilities related to C/C++ projects are screened according to whether the vulnerability description mentions a .c/.cpp file, and the vulnerability ID (CVE-ID), disclosure date (N_Date), vulnerability description (N_Description), reference links (References), software version configurations affected by the vulnerability (N_Version), and vulnerability type (CWE) of each related vulnerability entry are obtained; for the vulnerability database X-Force, the description (X_Description) of the related vulnerability is obtained; for the software repository of the project corresponding to the vulnerability, submissions are screened according to whether they modify a .c/.cpp file and whether they were made after the affected version, and the code change (Codeff), submission date (C_Date), submission title (C_Title), and submission message (C_Message) of each submission meeting the screening conditions are obtained, the submission title and submission message being spliced into a submission text (C_Text); for a given vulnerability entry, the data extracted from NVD and X-Force are taken as vulnerability data (V_Data), the data extracted from the n qualifying submissions in the software repository of the corresponding project are taken as submission data C_Data_i (i ∈ [1, n]), and the vulnerability data are merged once with the submission data of each of the n submissions, finally yielding n vulnerability-submission pairs containing vulnerability information and submission information.

Further, as shown in fig. 3, in the vulnerability-related feature generation module,

generating vulnerability identifier features: for each vulnerability-submission pair, based on the reference links extracted from the vulnerability database NVD, regular expressions matching Jira, Bugzilla, and bugs.eclipse links are used to extract the existing defect IDs; combined with the corresponding vulnerability ID, it is judged whether the submission text contains the related defect IDs and vulnerability ID; if so, vulnerability identifier features are generated, comprising: a defect ID feature (Bug_ID feature) and a vulnerability ID feature (CVE_ID feature); if not, the pair has no vulnerability identifier features;

generating vulnerability location features: for each vulnerability-submission pair, based on the vulnerability description (N_Description) extracted from the vulnerability database NVD, the file location (File Location) where the vulnerability occurs is extracted using regular expressions that match .c and .cpp file names, and named entity recognition is then performed on the vulnerability description to extract the function location where the vulnerability occurs; the function location is extracted as follows: the vulnerability description (N_Description) is converted into an input sequence for the BERT pre-trained model, which computes a representation vector (Desc_Vector); the representation vector is input into the Bi-LSTM layer of the named entity recognition model to obtain a text feature (Desc_Feature); the text feature is then input into the CRF layer of the named entity recognition model, which labels the text to obtain the function location entity of the vulnerability, and this entity is taken as the function location of the vulnerability;

the file location and function location extracted from the vulnerability description are compared with the file locations and function locations extracted directly from the submission's code change, to judge whether the file and function named in the vulnerability description were changed; if so, vulnerability location features are generated, comprising: a file location feature (File_Location feature) and a function location feature (Func_Location feature); if not, the pair has no vulnerability location features;

generating vulnerability type features: for each vulnerability-submission pair, based on the submission's code change (Codeff), the pre-change code and post-change code of the changed functions in the changed files are obtained, and function-level vulnerability detection based on a pre-trained graph convolutional neural network is performed on each; if a certain type of vulnerability is detected in the pre-change code but not in the post-change code, the code change is judged to have repaired the vulnerability, and the vulnerability type corresponding to the vulnerability is obtained; the specific process is as follows:

for each affected .c (C project)/.cpp (C++ project) file, the pre-change code of the file is first divided into n functions; the abstract syntax tree (AST), control flow graph (CFG), and data flow graph (DFG) of each function are extracted to represent the syntactic and semantic information of the function; the source code of the function is encoded into a vector and input into the pre-trained graph convolutional neural network model for vulnerability detection; if the model reports that the function is vulnerable, the vulnerability type (C_TypeBeform_n) is extracted; if the model reports that the function has no vulnerability, the next function is examined; after all functions in the file have been examined, a vulnerability dictionary (C_TypeBeform_Dict) consisting of the function names and vulnerability types of the file is obtained;

the post-change code of the file is then divided into n functions and processed in the same way, yielding a vulnerability dictionary (C_TypeAfter_Dict) consisting of the function names and vulnerability types of the changed file;

the vulnerability dictionaries before and after the change are compared: if they do not differ, the code change is judged not to have repaired a vulnerability; if a function and its vulnerability in the pre-change dictionary have disappeared from the post-change dictionary, the code change is judged to have repaired that vulnerability, and the vulnerability type (C_Type) of the repaired vulnerability is obtained; the vulnerability type (C_Type) is compared with the vulnerability type (CWE) extracted from the vulnerability database NVD to generate the vulnerability type feature (Vul_Type feature);

calculating the word frequency TF of the word w in the text sequence t:

using the formula, the text similarity X_Similarity between the text sequence X_Text and the text sequence C_Text is calculated, and the vulnerability text similarity features are generated from the values of N_Similarity and X_Similarity. The vulnerability text similarity features comprise: the similarity between the vulnerability description in the vulnerability database NVD and the submission text (N_Similarity feature), and the similarity between the vulnerability description in the vulnerability database X-Force and the submission text (X_Similarity feature);
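As an illustration of TF-IDF combined with cosine similarity, the following sketch computes a similarity score between a vulnerability description and a submission text. The formulas of this document are not reproduced in the text above, so the sketch uses one common smoothed TF-IDF variant as an assumption; the helper names (`tokenize`, `tfidf_vectors`, `cosine`, `text_similarity`) and the choice of computing document frequencies over only the two compared texts are also assumptions.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric/underscore tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def tfidf_vectors(texts: List[str]) -> List[Dict[str, float]]:
    """Build one sparse TF-IDF vector per text (smoothed IDF, assumed variant)."""
    docs = [Counter(tokenize(t)) for t in texts]
    n_docs = len(docs)
    # Document frequency: number of texts in which each word occurs.
    df = Counter(word for doc in docs for word in doc)
    vectors = []
    for doc in docs:
        total = sum(doc.values()) or 1
        vectors.append({
            w: (count / total) * (math.log((1 + n_docs) / (1 + df[w])) + 1.0)
            for w, count in doc.items()
        })
    return vectors


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def text_similarity(description: str, submission_text: str) -> float:
    """Similarity between a vulnerability description and a submission text."""
    vec_desc, vec_sub = tfidf_vectors([description, submission_text])
    return cosine(vec_desc, vec_sub)


print(text_similarity("buffer overflow in parse_header",
                      "fix overflow in parse_header"))
```

In practice the document frequencies would more plausibly be computed over a larger corpus of descriptions and submission texts; the two-text corpus here only keeps the example self-contained.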

generating the vulnerability repair possibility feature: for each vulnerability-submission pair, whether the submitted code change repairs a vulnerability is judged from the vulnerability detection result obtained when generating the vulnerability type feature. If the code change repairs a vulnerability, the vulnerability repair possibility feature (Vul_Fix feature) is generated; if it does not, the vulnerability-submission pair has no vulnerability repair possibility feature;

generating the date difference feature (Date_Diff feature): for each vulnerability-submission pair, the difference between the vulnerability disclosure time extracted from the vulnerability database NVD (N_Date) and the submission time of the submission (C_Date) is calculated:

Date_Diff = N_Date - C_Date

as shown in fig. 2, the interval in which the difference falls is determined according to the investigation result, and the date difference feature is generated from the proportion of vulnerability-patch pairs in that interval among all investigated objects. Here a vulnerability-patch pair is a vulnerability together with its corresponding patch, and the investigated objects are vulnerability-patch pairs.
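The date difference feature can be sketched as a lookup of the interval containing Date_Diff. The interval boundaries and proportions below are hypothetical placeholders, because the investigation result of fig. 2 is not reproduced in the text; the names `INTERVALS` and `date_diff_feature` and the fallback value 0.0 are likewise assumptions.

```python
from datetime import date
from typing import List, Tuple

# Hypothetical (low, high, proportion) intervals in days; the real values
# would come from the investigation result shown in fig. 2.
INTERVALS: List[Tuple[int, int, float]] = [
    (-30, 0, 0.10),
    (0, 30, 0.55),
    (30, 365, 0.30),
]


def date_diff_feature(n_date: date, c_date: date) -> float:
    """Map Date_Diff = N_Date - C_Date (in days) to the proportion of its interval."""
    diff = (n_date - c_date).days
    for low, high, proportion in INTERVALS:
        if low <= diff < high:
            return proportion
    # Assumption: differences outside every interval receive a score of 0.
    return 0.0


# Disclosure 14 days after the submission falls into the (0, 30) interval.
print(date_diff_feature(date(2021, 3, 15), date(2021, 3, 1)))
```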

At this point, six categories of features (nine vulnerability features in total) have been generated for each vulnerability-submission pair:

(1) vulnerability identifier features: Bug_ID feature, CVE_ID feature;

(2) vulnerability location features: File_Location feature, Func_Location feature;

(3) vulnerability type feature: Vul_Type feature;

(4) text similarity features: N_Similarity feature, X_Similarity feature;

(5) repair possibility feature: Vul_Fix feature;

(6) date difference feature: Date_Diff feature.
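To make the nine features concrete, one possible record layout is sketched below. The field names mirror the feature names listed above; the class name `PairFeatures`, the `to_vector` helper, and the use of one numeric score per feature are assumptions made for illustration only.

```python
from dataclasses import astuple, dataclass
from typing import Tuple


@dataclass
class PairFeatures:
    """Feature scores of one vulnerability-submission pair (assumed numeric)."""
    bug_id: float
    cve_id: float
    file_location: float
    func_location: float
    vul_type: float
    n_similarity: float
    x_similarity: float
    vul_fix: float
    date_diff: float

    def to_vector(self) -> Tuple[float, ...]:
        """Return the nine scores in the order listed in the text."""
        return astuple(self)


# Hypothetical scores for a single pair.
example = PairFeatures(0.0, 1.0, 1.0, 0.0, 1.0, 0.42, 0.37, 1.0, 0.55)
print(len(example.to_vector()))
```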

Further, as shown in fig. 4, in the sorting module,

wherein k is the total number of features, here k = 9; f_n is the nth feature score of a vulnerability-submission pair; and w_n is the scoring coefficient of the nth feature of the vulnerability-submission pair, generated by the ranking model RankNet;

Using the cross entropy C as a loss function of the model:

then training a model to minimize a loss function;

obtaining a trained ranking model RankNet through the steps;
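For reference, the commonly published RankNet formulation is given below. It is offered as a hedged reading aid and may differ in detail from the formulas of this document, which are not reproduced in the text. The symbols k, f_n, w_n and C follow the text; the linear form of the score and the symbols s_i, P_ij, \bar{P}_ij and S_ij are assumptions.

```latex
% Score of vulnerability-submission pair i (linear form assumed from the
% definitions of k, f_n and w_n):
\[ s_i = \sum_{n=1}^{k} w_n \, f_n^{(i)} \]

% Predicted probability that pair i is ranked before pair j:
\[ P_{ij} = \frac{1}{1 + e^{-(s_i - s_j)}} \]

% True probability, with S_{ij} = 1 if i should rank before j,
% -1 if j should rank before i, and 0 if they are tied:
\[ \bar{P}_{ij} = \tfrac{1}{2}\,(1 + S_{ij}), \qquad S_{ij} \in \{-1, 0, 1\} \]

% Cross-entropy loss for the pair <i, j>:
\[ C = -\bar{P}_{ij} \log P_{ij} - (1 - \bar{P}_{ij}) \log (1 - P_{ij}) \]
```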

the vulnerability-submission pairs related to a vulnerability whose patch is unknown are input into the trained ranking model RankNet, the score of each of these vulnerability-submission pairs is calculated, and the pairs are sorted by score to obtain the vulnerability-submission pairs ranked in the Top 1, Top 5 and Top 30.
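The training and ranking steps can be illustrated with a deliberately small pairwise sketch. It assumes a linear scorer trained by plain gradient descent on the pairwise cross-entropy loss, whereas the text describes a neural-network-based RankNet; the function names, learning rate, epoch count and two-feature toy data are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Features = Sequence[float]


def score(w: Sequence[float], f: Features) -> float:
    """Weighted sum of feature scores for one vulnerability-submission pair."""
    return sum(wi * fi for wi, fi in zip(w, f))


def train_ranknet(pairs: List[Tuple[Features, Features]], k: int,
                  lr: float = 0.1, epochs: int = 200) -> List[float]:
    """Fit weights from (should-rank-higher, should-rank-lower) feature pairs."""
    w = [0.0] * k
    for _ in range(epochs):
        for higher, lower in pairs:
            diff = score(w, higher) - score(w, lower)
            p = 1.0 / (1.0 + math.exp(-diff))  # predicted P(higher before lower)
            # Loss is -log(p) for target probability 1; its gradient with
            # respect to w_n is -(1 - p) * (higher[n] - lower[n]).
            for n in range(k):
                w[n] += lr * (1.0 - p) * (higher[n] - lower[n])
    return w


def rank(w: Sequence[float], candidates: List[Features], top_k: int) -> List[Features]:
    """Sort candidate pairs by descending score and keep the first top_k."""
    return sorted(candidates, key=lambda f: score(w, f), reverse=True)[:top_k]


# Toy training data: the patch submission should outrank the non-patch one.
training = [((1.0, 0.9), (0.0, 0.1))]
weights = train_ranknet(training, k=2)
print(rank(weights, [(0.0, 0.1), (1.0, 0.9)], top_k=1))
```

Setting `top_k` to 1, 5 or 30 yields the Top 1, Top 5 and Top 30 candidate lists mentioned in the text.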

Further, in the manual inspection module,

the vulnerability-submission pairs ranked in the Top 1, Top 5 and Top 30 are checked manually to judge whether the submission in each vulnerability-submission pair is a patch submission; if so, the vulnerability has been repaired, and the link between the vulnerability and its corresponding patch submission is recovered; if not, the vulnerability has not been repaired, and no link between the vulnerability and a patch submission is recovered.
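The manual check can be pictured as a loop over the top-ranked candidates in which a human reviewer supplies the verdict. In this sketch the reviewer is modeled as a callback; the names `recover_link` and `is_patch` and the sample submission identifiers are hypothetical.

```python
from typing import Callable, List, Optional


def recover_link(ranked_submissions: List[str],
                 is_patch: Callable[[str], bool],
                 top_k: int) -> Optional[str]:
    """Return the first submission in the top_k that the reviewer confirms as a patch."""
    for submission in ranked_submissions[:top_k]:
        if is_patch(submission):
            return submission  # link between vulnerability and patch recovered
    return None  # no link recovered within the inspected candidates


# Hypothetical ranked list; the reviewer confirms the second candidate.
ranked = ["a1f3", "b27c", "c9d0"]
print(recover_link(ranked, lambda s: s == "b27c", top_k=5))
```

Choosing a smaller `top_k` reduces the manual workload at the cost of coverage, which is the trade-off the method aims to balance.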

Example two

As shown in fig. 1, this embodiment provides a ranking-based method for recovering links between C/C++ vulnerabilities and their patches, which comprises the following steps:

step 1, extracting basic data of vulnerabilities related to open-source C/C++ projects from the vulnerability database NVD and the vulnerability database X-Force, extracting basic data of the related submissions from the software repository of the project corresponding to each vulnerability, and establishing vulnerability-submission pairs from the basic data of the vulnerability and the basic data of the related submissions;

step 2, for each vulnerability-submission pair, generating vulnerability-related features from the basic data of the vulnerability and the basic data of the related submission;

step 3, training the neural-network-based ranking model RankNet on the vulnerability-related features to obtain a trained ranking model RankNet, inputting the vulnerability-submission pairs related to a vulnerability whose patch is unknown into the trained ranking model RankNet, calculating the score of each of these vulnerability-submission pairs with the ranking model RankNet, and sorting the pairs by score to obtain ranked vulnerability-submission pairs;

step 4, manually checking the ranked vulnerability-submission pairs in ranked order and judging whether the submission in each vulnerability-submission pair is a patch submission; if so, the vulnerability has been repaired, and the link between the vulnerability and its corresponding patch submission is recovered; if not, the vulnerability has not been repaired, and no link between the vulnerability and a patch submission is recovered.

Further, in step 1,

Further, as shown in fig. 3, in step 2,

calculating the word frequency TF of the word w in the text sequence t:

Date_Diff = N_Date - C_Date

(3) vulnerability type feature: Vul_Type feature;

(5) repair possibility feature: Vul_Fix feature;

(6) date difference feature: Date_Diff feature.

Further, as shown in fig. 4, in step 3,

Using the cross entropy C as a loss function of the model:

then training a model to minimize a loss function;

obtaining a trained ranking model RankNet through the steps;

Further, in step 4,

This embodiment provides a computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the above ranking-based method for recovering links between C/C++ vulnerabilities and their patches.

This embodiment provides a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the above ranking-based method for recovering links between C/C++ vulnerabilities and their patches.

There are many methods and approaches for implementing this technical solution, and the foregoing is only a preferred embodiment of the invention. It should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the present invention, and these improvements and modifications shall also be regarded as falling within the protection scope of the present invention. All components not specified in the present embodiment can be realized by the prior art.

Claims

1. A ranking-based system for recovering links between C/C++ vulnerabilities and their patches, characterized by comprising:

the vulnerability data acquisition module, which is used for extracting basic data of vulnerabilities related to open-source C/C++ projects from the vulnerability database NVD and the vulnerability database X-Force, extracting basic data of the related submissions from the software repository of the project corresponding to each vulnerability, and establishing vulnerability-submission pairs from the basic data of the vulnerability and the basic data of the related submissions;

the vulnerability-related feature generation module, which is used for generating, for each vulnerability-submission pair, vulnerability-related features from the basic data of the vulnerability and the basic data of the related submission;

the ranking module, which is used for training the neural-network-based ranking model RankNet on the vulnerability-related features to obtain a trained ranking model RankNet, inputting the vulnerability-submission pairs related to a vulnerability whose patch is unknown into the trained ranking model RankNet, calculating the score of each of these vulnerability-submission pairs with the ranking model RankNet, and sorting the pairs by score to obtain ranked vulnerability-submission pairs;

the manual checking module, which is used for manually checking the ranked vulnerability-submission pairs in ranked order and judging whether the submission in each vulnerability-submission pair is a patch submission; if so, the vulnerability has been repaired, and the link between the vulnerability and its corresponding patch submission is recovered; if not, the vulnerability has not been repaired, and no link between the vulnerability and a patch submission is recovered.

2. The ranking-based system for recovering links between C/C++ vulnerabilities and their patches of claim 1, wherein in the vulnerability data acquisition module,

vulnerabilities related to C/C++ projects are screened according to whether the vulnerability description mentions a .c/.cpp file, and for each relevant vulnerability entry the vulnerability ID, disclosure date, vulnerability description corresponding to the ID, reference links, software version configurations affected by the vulnerability, and vulnerability type are obtained;

for the vulnerability database X-Force, the descriptions of the relevant vulnerabilities are obtained;

for the software repository of the project corresponding to the vulnerability, submissions are screened according to whether a .c/.cpp file is modified in the submission and whether the submission was released after the affected version; for the submissions that meet the screening conditions, the code change, submission date, submission title and submission information are obtained, and the submission title and the submission information are concatenated into a submission text;

for a given vulnerability entry, the data extracted from the vulnerability database NVD and the vulnerability database X-Force are taken as vulnerability data, and the data extracted from the n qualifying submissions in the software repository of the project corresponding to the vulnerability are taken as submission data; the vulnerability data are merged once with the submission data of each of the n submissions, finally yielding n vulnerability-submission pairs containing vulnerability information and submission information.

3. The ranking-based system for recovering links between C/C++ vulnerabilities and their patches of claim 2, wherein in the vulnerability-related feature generation module,

calculating the word frequency TF of the word w in the text sequence t:

Date_Diff = N_Date - C_Date

4. The ranking-based system for recovering links between C/C++ vulnerabilities and their patches of claim 3, wherein in the ranking module,

for a vulnerability whose patch is known, any two related vulnerability-submission pairs <i, j> are extracted, and the true probability that vulnerability-submission pair i is ranked before vulnerability-submission pair j is calculated;

Using the cross entropy C as a loss function of the model:

then training a model to minimize a loss function;

obtaining a trained ranking model RankNet through the steps;

the vulnerability-submission pairs related to a vulnerability whose patch is unknown are input into the trained ranking model RankNet, the score of each of these vulnerability-submission pairs is calculated, and the pairs are sorted by score to obtain ranked vulnerability-submission pairs.

5. A ranking-based method for recovering links between C/C++ vulnerabilities and their patches, comprising the following steps:

step 1, extracting basic data of vulnerabilities related to open-source C/C++ projects from the vulnerability database NVD and the vulnerability database X-Force, extracting basic data of the related submissions from the software repository of the project corresponding to each vulnerability, and establishing vulnerability-submission pairs from the basic data of the vulnerability and the basic data of the related submissions;

step 2, for each vulnerability-submission pair, generating vulnerability-related features from the basic data of the vulnerability and the basic data of the related submission;

step 3, training the neural-network-based ranking model RankNet on the vulnerability-related features to obtain a trained ranking model RankNet, inputting the vulnerability-submission pairs related to a vulnerability whose patch is unknown into the trained ranking model RankNet, calculating the score of each of these vulnerability-submission pairs with the ranking model RankNet, and sorting the pairs by score to obtain ranked vulnerability-submission pairs;

step 4, manually checking the ranked vulnerability-submission pairs in ranked order and judging whether the submission in each vulnerability-submission pair is a patch submission; if so, the vulnerability has been repaired, and the link between the vulnerability and its corresponding patch submission is recovered; if not, the vulnerability has not been repaired, and no link between the vulnerability and a patch submission is recovered.

6. The ranking-based method for recovering links between C/C++ vulnerabilities and their patches according to claim 5, wherein in step 1,

7. The ranking-based method for recovering links between C/C++ vulnerabilities and their patches according to claim 6, wherein in step 2,

calculating the word frequency TF of the word w in the text sequence t:

Date_Diff = N_Date - C_Date

8. The ranking-based method for recovering links between C/C++ vulnerabilities and their patches according to claim 7, wherein in step 3,

Using the cross entropy C as a loss function of the model:

then training a model to minimize a loss function;

obtaining a trained ranking model RankNet through the steps;

9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method according to any of claims 5 to 8 are implemented by the processor when executing the computer program.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 5 to 8.