CN114329482A - Ranking-based system and method for recovering links between C/C++ vulnerabilities and their patches



Publication number
CN114329482A
Authority
CN
China
Prior art keywords
vulnerability
submission
text
file
function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111560155.7A
Other languages
Chinese (zh)
Inventor
孙小兵
杨云帆
薄莉莉
魏颖
李斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yangzhou University
Original Assignee
Yangzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yangzhou University
Priority to CN202111560155.7A
Publication of CN114329482A

Landscapes

  • Stored Programmes (AREA)

Abstract

The invention discloses a ranking-based system and method for recovering links between C/C++ vulnerabilities and their patches. The method comprises the following steps: collecting vulnerability data; generating vulnerability-related features, including named entity recognition based on BERT + BiLSTM + CRF, automatic vulnerability detection based on deep learning, and text similarity calculation based on TF-IDF and cosine similarity; and ranking candidate commits by the features relating them to the vulnerability with the ranking model RankNet, then recovering the links between vulnerabilities and patches through manual verification. The method makes better use of the semantic information in vulnerability descriptions and the type information in patch code, fully mines the correlation between vulnerabilities and patches, and better balances vulnerability-patch coverage against manual workload.

Description

Ranking-based system and method for recovering links between C/C++ vulnerabilities and their patches
Technical Field
The invention belongs to the field of software security, and particularly relates to a ranking-based system and method for recovering links between C/C++ vulnerabilities and their patches.
Background
Software vulnerabilities weaken the security of computer software and can cause data loss, data tampering, and privacy leaks; more than half of them are related to C/C++ projects. To support the detection, analysis, and diagnosis of C/C++ vulnerabilities, it is necessary to collect descriptions of open-source software vulnerabilities together with the patch code that fixes them. However, most open-source vulnerability fixes are committed silently, and the entries stored in existing vulnerability repositories such as NVD and X-Force often lack links to the fixing patches, which makes patch code hard to collect. Link recovery between C/C++ vulnerabilities and their patches is therefore needed. Most previous link recovery work targets defects and patches; work targeting vulnerabilities and patches is rare. Most of that work is matching-based: a commit is discarded as soon as some of its information does not completely match the defect information, so many patches with partially matching information are dropped and the coverage of defect-patch link recovery is reduced. Many recent works explore the construction of vulnerability datasets, but they often fail to make full use of vulnerability information: they use only the partial information provided directly by vulnerability repositories and software repositories and do not consider deeper associations between vulnerabilities and patches, so the link recovery results are unsatisfactory.
Some works build vulnerability datasets with a ranking-based approach. For example, the document "Xin Tan, Yuan Zhang, Chenyuan Mi, et al. Locating the Security Patches for Disclosed OSS Vulnerabilities with Vulnerability-Commit Correlation Ranking [C]. Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security, 2021: 3282-3299" extracts vulnerability identifier, vulnerability location, vulnerability type relevance, and vulnerability text statistics features to rank commits. However, that work does not make full use of code modification information when extracting vulnerability type relevance, it uses the vulnerability description and commit message only through some statistical indicators without considering text similarity, and it does not reduce the negative examples before ranking. As a result, patches rank low among the candidate commits, manual verification takes a large amount of effort, and patch coverage is low.
Disclosure of Invention
Purpose of the invention: in view of the above problems in the prior art, the invention aims to provide a ranking-based system and method for recovering links between C/C++ vulnerabilities and their patches, which can recover those links and assist downstream vulnerability detection, analysis, diagnosis, and repair tasks.
Technical scheme: to achieve this purpose, the invention adopts the following technical scheme:
The invention provides a ranking-based system for recovering links between C/C++ vulnerabilities and their patches, comprising: a vulnerability data acquisition module, used for extracting basic data of vulnerabilities related to open-source C/C++ projects from the vulnerability databases NVD and X-Force, extracting basic data of the relevant commits (submissions) from the software repository of the project corresponding to each vulnerability, and building vulnerability-commit pairs from the two; a vulnerability-related feature generation module, used for generating vulnerability-related features for each vulnerability-commit pair from the basic data of the vulnerability and of the commit; a ranking module, used for training the neural-network-based ranking model RankNet on the vulnerability-related features, inputting the vulnerability-commit pairs of a vulnerability whose patch is unknown into the trained RankNet, computing a score for each such pair, and sorting the pairs by score to obtain ranked vulnerability-commit pairs; and a manual inspection module, used for manually inspecting the ranked vulnerability-commit pairs in ranked order and judging whether the commit in each pair is a patch commit: if so, the vulnerability has been fixed, and the link between the vulnerability and its patch commit is recovered; if not, the commit does not fix the vulnerability, and no link is recovered between them.
Furthermore, in the vulnerability data acquisition module,
for the vulnerability database NVD, vulnerabilities related to C/C++ projects are selected according to whether the vulnerability description mentions a .c/.cpp file, and for each selected entry the vulnerability ID, disclosure date, vulnerability description, reference links, affected software version configuration, and vulnerability type are obtained; for the vulnerability database X-Force, the description of the same vulnerability is obtained; for the software repository of the project corresponding to the vulnerability, commits are filtered according to whether they modify a .c/.cpp file and whether they were made after the affected version was released, the code change, commit date, commit title, and commit message of each commit that passes the filter are obtained, and the commit title and commit message are concatenated into a commit text; for a given vulnerability entry, the data extracted from NVD and X-Force serve as the vulnerability data and the data extracted from the n qualifying commits in the project's software repository serve as the commit data; the vulnerability data are merged once with the commit data of each of the n commits, finally yielding n vulnerability-commit pairs that each contain vulnerability information and commit information.
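The pairing step above can be sketched in a few lines of Python. This is a minimal illustration rather than the patented implementation: the `Vulnerability` and `Commit` records, their field names, and the use of a release date to approximate "made after the affected version" are all assumptions introduced here.

```python
from dataclasses import dataclass
from datetime import date

C_CPP_SUFFIXES = (".c", ".cpp")


@dataclass
class Vulnerability:  # hypothetical record built from NVD / X-Force data
    vuln_id: str
    description: str
    disclosure_date: date


@dataclass
class Commit:  # hypothetical record built from the project repository
    sha: str
    title: str
    message: str
    commit_date: date
    changed_files: list


def touches_c_cpp(commit):
    """True if the commit modifies at least one .c or .cpp file."""
    return any(path.endswith(C_CPP_SUFFIXES) for path in commit.changed_files)


def build_pairs(vuln, commits, affected_release_date):
    """Pair one vulnerability with every commit that passes both filters.

    Assumption: "after the affected version" is approximated by comparing
    the commit date with the release date of the affected version.
    """
    pairs = []
    for commit in commits:
        if touches_c_cpp(commit) and commit.commit_date > affected_release_date:
            commit_text = f"{commit.title}\n{commit.message}"  # title + message
            pairs.append((vuln, commit, commit_text))
    return pairs
```

Each returned tuple corresponds to one of the n vulnerability-commit pairs; the later feature generators would each take one such tuple as input.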
Further, in the vulnerability-related feature generation module,
generating vulnerability identifier features: for each vulnerability-commit pair, based on the reference links extracted from the vulnerability database NVD, the regular expressions ". jira./(. a)", ". bugzilla.,? id ═ ma? id ═" and ". bugs. eclipse.?" are used to extract existing defect IDs; together with the corresponding vulnerability ID, it is judged whether the commit text contains the related defect IDs and the vulnerability ID; if so, vulnerability identifier features are generated, wherein the vulnerability identifier features comprise: a defect ID feature and a vulnerability ID feature; if not, the pair has no vulnerability identifier features;
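The identifier check can be sketched as follows. The regular expressions quoted in the text did not survive extraction, so the two patterns below are illustrative stand-ins chosen for this sketch (an `id=` query parameter and a JIRA-style `/browse/KEY-123` path), not the patent's own expressions; the function names are likewise hypothetical.

```python
import re

# Assumed stand-in patterns, NOT the expressions from the patent text.
DEFECT_ID_PATTERNS = [
    re.compile(r"[?&]id=([A-Za-z0-9_-]+)"),       # e.g. ...show_bug.cgi?id=12345
    re.compile(r"/browse/([A-Z][A-Z0-9]*-\d+)"),  # e.g. .../browse/PROJ-42
]


def defect_ids_from_links(links):
    """Collect defect IDs found in the reference links of a vulnerability."""
    ids = set()
    for link in links:
        for pattern in DEFECT_ID_PATTERNS:
            match = pattern.search(link)
            if match:
                ids.add(match.group(1))
                break
    return ids


def identifier_features(vuln_id, reference_links, commit_text):
    """Return 0/1 flags for the vulnerability ID and defect ID features."""
    text = commit_text.lower()
    has_vuln_id = vuln_id.lower() in text
    has_defect_id = False
    for defect_id in defect_ids_from_links(reference_links):
        # Require non-alphanumeric boundaries so "12345" does not match "123456".
        bounded = r"(?<![a-z0-9])" + re.escape(defect_id.lower()) + r"(?![a-z0-9])"
        if re.search(bounded, text):
            has_defect_id = True
            break
    return {"vuln_id": int(has_vuln_id), "defect_id": int(has_defect_id)}
```

The boundary check is a design choice of this sketch: bare numeric bug IDs are short, and a plain substring test would produce many false matches against unrelated numbers in commit messages.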
generating vulnerability location features: for each vulnerability-commit pair, based on the vulnerability description extracted from the vulnerability database NVD, the regular expressions "(. eras.c)" and "(. eras.cpp)" are used to extract the file location where the vulnerability occurs, and named entity recognition is then performed on the vulnerability description to extract the function location where the vulnerability occurs; the function location is extracted as follows: the vulnerability description is converted into an input sequence for the pre-trained BERT model and fed into it to obtain representation vectors; the representation vectors are fed into the Bi-LSTM layer of the named entity recognition model to obtain text features; the text features are then used as the input of the CRF layer of the named entity recognition model, which performs sequence labeling to obtain the function location entity of the vulnerability, and this entity is taken as the function location of the vulnerability;
the file location and function location extracted from the vulnerability description are compared with the file locations and function locations extracted directly from the code change of the commit, to judge whether the file and function named in the vulnerability description are modified by the commit; if they are, vulnerability location features are generated, wherein the vulnerability location features comprise: a file location feature and a function location feature; if not, the pair has no vulnerability location features;
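Once the file and function names have been extracted, the comparison itself is a simple set membership test. The sketch below assumes the names are already available as lists (the NER model and the diff parser are out of scope here) and compares files by base name, which is an assumption made for this illustration because descriptions usually give only a file name, not a full path.

```python
import posixpath


def location_features(desc_files, desc_functions, changed_files, changed_functions):
    """Return 0/1 flags for the file location and function location features.

    desc_*    : names extracted from the vulnerability description
    changed_* : names extracted from the commit's code change
    """
    changed_basenames = {posixpath.basename(p) for p in changed_files}
    changed_function_set = set(changed_functions)
    file_hit = any(posixpath.basename(f) in changed_basenames for f in desc_files)
    func_hit = any(fn in changed_function_set for fn in desc_functions)
    return {"file_location": int(file_hit), "function_location": int(func_hit)}
```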
generating vulnerability type features: for each vulnerability-commit pair, based on the code change of the commit, for each .c (C project) or .cpp (C++ project) file among the changed files, function-level vulnerability detection based on a graph convolutional neural network is performed on the code of the file before the change and on the code after the change; if a certain type of vulnerability is detected in the code before the change and is not detected in the code after the change, the code change is judged to fix that vulnerability, and the corresponding vulnerability type is obtained; the specific process is as follows:
for an affected .c (C project) or .cpp (C++ project) file, the code of the file before the change is first divided into n functions; the abstract syntax tree (AST), control flow graph (CFG), and data flow graph (DFG) of each function are extracted to represent the syntactic and semantic information of the function, the source code of the function is encoded into a vector, and the vector is input into a pre-trained graph convolutional neural network model for vulnerability detection; if the model reports that the function is vulnerable, the vulnerability type is extracted; if the model reports that the function is not vulnerable, the next function is examined; after all functions in the file have been examined, a vulnerability dictionary mapping function names to vulnerability types is obtained for the file;
the code of the file after the change is then divided into n functions, and the same process is executed to obtain a vulnerability dictionary mapping function names to vulnerability types for the changed file;
the vulnerability dictionaries before and after the change are compared: if they do not differ, the code change is judged not to fix a vulnerability; if a function and its vulnerability present in the dictionary before the change have disappeared from the dictionary after the change, the code change is judged to fix that vulnerability, and the type of the fixed vulnerability is obtained; this vulnerability type is compared with the vulnerability type extracted from the vulnerability database NVD to generate the vulnerability type features;
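The dictionary comparison can be illustrated as below. The graph convolutional detector is treated as a black box whose output is already a `{function name: vulnerability type}` dictionary; the function names and the CWE-style type labels are hypothetical examples, not data from the patent.

```python
def repaired_vulnerabilities(before, after):
    """Entries reported before the change that are no longer reported after it."""
    return {fn: vtype for fn, vtype in before.items() if after.get(fn) != vtype}


def type_features(before, after, nvd_type):
    """Return 0/1 flags derived from the before/after vulnerability dictionaries.

    repaired   : the change removed at least one detected vulnerability
    type_match : a removed vulnerability has the type recorded in NVD
    """
    repaired = repaired_vulnerabilities(before, after)
    return {
        "repaired": int(bool(repaired)),
        "type_match": int(nvd_type in repaired.values()),
    }
```

The `repaired` flag is the same signal that the vulnerability repair possibility feature, described later in the text, reuses.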
generating vulnerability text similarity features: for each vulnerability-commit pair, based on the vulnerability description extracted from the vulnerability database NVD, the text of the function location entity that was extracted by named entity recognition when generating the vulnerability location features is deleted; the remaining text is tokenized, the tokens are combined into a text sequence, and stop word removal, lemmatization, and synonym replacement are applied to the text sequence to obtain the text sequence N_Sequence;
for the vulnerability description extracted from the vulnerability database X-Force and for the commit text, the following operations are performed respectively:
the text is tokenized, the tokens are combined into a text sequence, and stop word removal, lemmatization, and synonym replacement are applied to the text sequence;
finally, the text sequence X_Sequence and the text sequence C_Sequence are obtained respectively;
the importance of each word in the text sequence N_Text is calculated and quantified with the TF-IDF algorithm to generate the vector N_Vector corresponding to N_Text, as follows:
calculating the term frequency TF of the word w in the text sequence t:
$$TF = \frac{w\_num}{t\_num}$$
wherein w_num is the number of times the word w appears in the text sequence t, and t_num is the total number of words in the text sequence t;
calculating the inverse document frequency IDF of the word w over the set of text sequences d:
$$IDF = \log\frac{|d|}{|\{t \in d : w \in t\}|}$$
wherein |d| is the total number of text sequences in the set d, and |{t ∈ d : w ∈ t}| is the number of text sequences containing the word w;
calculating the term frequency-inverse document frequency TF-IDF of the word w: TF-IDF = TF × IDF;
vectorizing the text sequence N_Text into the vector N_Vector according to the TF-IDF value of each word;
obtaining the vector X_Vector of the text sequence X_Text and the vector C_Vector of the text sequence C_Text in the same way;
and finally, calculating the text similarity N_Similarity between the text sequence N_Text and the text sequence C_Text using cosine similarity:
$$N\_Similarity = \frac{N\_Vector \cdot C\_Vector}{\|N\_Vector\| \, \|C\_Vector\|}$$
the text similarity X_Similarity between the text sequence X_Text and the text sequence C_Text is calculated with the same formula, and the vulnerability text similarity features are generated from the values of N_Similarity and X_Similarity, wherein the vulnerability text similarity features comprise: the similarity between the NVD vulnerability description and the commit text, and the similarity between the X-Force vulnerability description and the commit text;
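The TF-IDF and cosine similarity computation can be sketched with the standard library alone. The sketch assumes the texts are already tokenized and cleaned, uses the unsmoothed IDF log(|d| / document frequency), and takes the three token lists themselves as the set d; the tokens are made-up examples and all function names are hypothetical.

```python
import math
from collections import Counter


def term_frequencies(tokens):
    """TF of each word: occurrences divided by the sequence length."""
    total = len(tokens) or 1
    return {w: c / total for w, c in Counter(tokens).items()}


def inverse_document_frequencies(corpus):
    """IDF of each word: log(|d| / number of sequences containing the word)."""
    n_docs = len(corpus)
    doc_freq = Counter(w for doc in corpus for w in set(doc))
    return {w: math.log(n_docs / df) for w, df in doc_freq.items()}


def tfidf_vector(tokens, idf, vocab):
    """Vectorize one sequence over a fixed vocabulary order."""
    tf = term_frequencies(tokens)
    return [tf.get(w, 0.0) * idf[w] for w in vocab]


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Made-up token sequences standing in for N_Text, X_Text and C_Text.
n_text = ["buffer", "overflow", "parse", "header"]
x_text = ["heap", "overflow", "crafted", "packet"]
c_text = ["fix", "buffer", "overflow", "parse"]

idf = inverse_document_frequencies([n_text, x_text, c_text])
vocab = sorted(idf)
n_vec, x_vec, c_vec = (tfidf_vector(t, idf, vocab) for t in (n_text, x_text, c_text))
n_similarity = cosine_similarity(n_vec, c_vec)
x_similarity = cosine_similarity(x_vec, c_vec)
```

In this toy corpus the word "overflow" appears in every sequence, so its IDF is log(1) = 0 and it contributes nothing; X_Text shares only that word with C_Text, which is why `x_similarity` comes out as zero while `n_similarity` is positive.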
generating the vulnerability repair possibility feature: for each vulnerability-commit pair, whether the code change of the commit fixes a vulnerability is judged from the vulnerability detection result obtained when generating the vulnerability type features; if the code change fixes a vulnerability, the vulnerability repair possibility feature is generated; if not, the vulnerability-commit pair has no vulnerability repair possibility feature;
generating the date difference feature: for each vulnerability-commit pair, the difference between the vulnerability disclosure time extracted from the vulnerability database NVD and the commit time is calculated:
Date_Diff = N_Date - C_Date
wherein Date_Diff is the difference between the two, N_Date is the vulnerability disclosure time, and C_Date is the commit time;
the interval in which the difference falls is determined according to the survey result, and the date difference feature is generated as the proportion of the surveyed vulnerability-patch pairs that fall in that interval among all surveyed objects; a vulnerability-patch pair is a vulnerability together with its corresponding patch, and the surveyed objects are vulnerability-patch pairs.
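The interval lookup can be sketched as follows. The text does not give the survey's intervals or proportions, so the boundaries and shares below are placeholder values invented for illustration only.

```python
from bisect import bisect_right
from datetime import date

# Placeholder survey result (NOT from the patent): interval boundaries in days
# and the share of known vulnerability-patch pairs observed in each interval.
BOUNDS = [0, 30, 180]                # diff < 0, 0 <= diff < 30, 30 <= diff < 180, diff >= 180
SHARES = [0.10, 0.50, 0.30, 0.10]    # one share per interval, summing to 1


def date_diff_feature(n_date, c_date):
    """Date_Diff = N_Date - C_Date in days, mapped to the share of its interval."""
    date_diff = (n_date - c_date).days
    return SHARES[bisect_right(BOUNDS, date_diff)]
```

With these placeholder values, a commit made 11 days before disclosure gets the feature value 0.50, while a commit made after disclosure (negative difference) gets 0.10.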
Further, in the ranking module,
for a vulnerability whose patch is known, any two of its vulnerability-commit pairs <i, j> are taken, and the true probability $\bar{P}_{ij}$ that pair i is ranked before pair j is calculated:
$$\bar{P}_{ij} = \frac{1}{2}\left(1 + S_{ij}\right)$$
wherein
$$S_{ij} = \begin{cases} 1, & \text{pair } i \text{ should rank before pair } j \\ 0, & \text{pairs } i \text{ and } j \text{ rank equally} \\ -1, & \text{pair } j \text{ should rank before pair } i \end{cases}$$
Then, according to the vulnerability-related features, the two vulnerability-commit pairs <i, j> are expressed as feature vectors and input into the scoring function S:
$$S = \sum_{n=1}^{k} W_n F_n$$
wherein k is the total number of features; F_n is the score of the nth feature of a vulnerability-commit pair; W_n is the scoring coefficient of the nth feature of the vulnerability-commit pair, generated by the ranking model RankNet;
the scores <S_i, S_j> of the two vulnerability-commit pairs <i, j> are computed with the scoring function S, and the predicted probability P_ij that pair i is ranked before pair j is calculated with the sigmoid function:
$$P_{ij} = \frac{1}{1 + e^{-(S_i - S_j)}}$$
The cross entropy C is used as the loss function of the model:
$$C = -\bar{P}_{ij} \log P_{ij} - \left(1 - \bar{P}_{ij}\right) \log\left(1 - P_{ij}\right)$$
the model is then trained to minimize the loss function;
the trained ranking model RankNet is obtained through the above steps;
the vulnerability-commit pairs of a vulnerability whose patch is unknown are input into the trained ranking model RankNet, the score of each of these pairs is calculated, and the pairs are sorted by score to obtain the ranked vulnerability-commit pairs.
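The pairwise training loop can be sketched in plain Python. This is a simplified illustration, not the patent's network: it assumes a purely linear scorer S = Σ W_n F_n trained by stochastic gradient descent, and the feature vectors, learning rate, and epoch count are made-up values. For a linear scorer the gradient of the cross entropy with respect to the weights is (P_ij − P̄_ij)(F_i − F_j).

```python
import math


def score(weights, features):
    """Linear scoring function S = sum over n of W_n * F_n."""
    return sum(w * f for w, f in zip(weights, features))


def predicted_probability(weights, feats_i, feats_j):
    """Sigmoid of the score difference: probability that i ranks before j."""
    return 1.0 / (1.0 + math.exp(-(score(weights, feats_i) - score(weights, feats_j))))


def pair_loss(p_true, p_pred, eps=1e-12):
    """Cross entropy between the true and predicted pairwise probabilities."""
    return -p_true * math.log(p_pred + eps) - (1.0 - p_true) * math.log(1.0 - p_pred + eps)


def train(pairs, k, lr=0.1, epochs=200):
    """pairs: list of (features_i, features_j, true probability that i precedes j)."""
    weights = [0.0] * k
    for _ in range(epochs):
        for feats_i, feats_j, p_true in pairs:
            grad = predicted_probability(weights, feats_i, feats_j) - p_true
            weights = [w - lr * grad * (a - b) for w, a, b in zip(weights, feats_i, feats_j)]
    return weights


def rank(weights, candidates):
    """Sort (name, features) candidates by descending score."""
    return sorted(candidates, key=lambda c: score(weights, c[1]), reverse=True)


# Made-up feature vectors: [identifier flag, location flag, text similarity].
patch, other_a, other_b = [1, 1, 0.9], [0, 0, 0.2], [0, 1, 0.1]
weights = train([(patch, other_a, 1.0), (patch, other_b, 1.0)], k=3)
ranked = rank(weights, [("a", other_a), ("patch", patch), ("b", other_b)])
```

After training on pairs where the known patch should precede the other commits, `ranked` places the patch commit first, which is the order in which a reviewer would inspect the candidates.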
The invention provides a ranking-based method for recovering links between C/C++ vulnerabilities and their patches, comprising the following steps: step 1, extracting basic data of vulnerabilities related to open-source C/C++ projects from the vulnerability databases NVD and X-Force, extracting basic data of the relevant commits (submissions) from the software repository of the project corresponding to each vulnerability, and building vulnerability-commit pairs from the two; step 2, for each vulnerability-commit pair, generating vulnerability-related features from the basic data of the vulnerability and of the commit; step 3, training the neural-network-based ranking model RankNet on the vulnerability-related features, inputting the vulnerability-commit pairs of a vulnerability whose patch is unknown into the trained RankNet, computing a score for each such pair, and sorting the pairs by score to obtain ranked vulnerability-commit pairs; step 4, manually inspecting the ranked vulnerability-commit pairs in ranked order and judging whether the commit in each pair is a patch commit: if so, the vulnerability has been fixed, and the link between the vulnerability and its patch commit is recovered; if not, the commit does not fix the vulnerability, and no link is recovered between them.
Further, in step 1,
for the vulnerability database NVD, vulnerabilities related to C/C++ projects are selected according to whether the vulnerability description mentions a .c/.cpp file, and for each selected entry the vulnerability ID, disclosure date, vulnerability description, reference links, affected software version configuration, and vulnerability type are obtained; for the vulnerability database X-Force, the description of the same vulnerability is obtained; for the software repository of the project corresponding to the vulnerability, commits are filtered according to whether they modify a .c/.cpp file and whether they were made after the affected version was released, the code change, commit date, commit title, and commit message of each commit that passes the filter are obtained, and the commit title and commit message are concatenated into a commit text; for a given vulnerability entry, the data extracted from NVD and X-Force serve as the vulnerability data and the data extracted from the n qualifying commits in the project's software repository serve as the commit data; the vulnerability data are merged once with the commit data of each of the n commits, finally yielding n vulnerability-commit pairs that each contain vulnerability information and commit information.
Further, in step 2,
generating vulnerability identifier features: for each vulnerability-commit pair, based on the reference links extracted from the vulnerability database NVD, the regular expressions ". jira./(. a)", ". bugzilla.,? id ═ ma? id ═" and ". bugs. eclipse.?" are used to extract existing defect IDs; together with the corresponding vulnerability ID, it is judged whether the commit text contains the related defect IDs and the vulnerability ID; if so, vulnerability identifier features are generated, wherein the vulnerability identifier features comprise: a defect ID feature and a vulnerability ID feature; if not, the pair has no vulnerability identifier features;
generating vulnerability location features: for each vulnerability-commit pair, based on the vulnerability description extracted from the vulnerability database NVD, the regular expressions "(. eras.c)" and "(. eras.cpp)" are used to extract the file location where the vulnerability occurs, and named entity recognition is then performed on the vulnerability description to extract the function location where the vulnerability occurs; the function location is extracted as follows: the vulnerability description is converted into an input sequence for the pre-trained BERT model and fed into it to obtain representation vectors; the representation vectors are fed into the Bi-LSTM layer of the named entity recognition model to obtain text features; the text features are then used as the input of the CRF layer of the named entity recognition model, which performs sequence labeling to obtain the function location entity of the vulnerability, and this entity is taken as the function location of the vulnerability;
the file location and function location extracted from the vulnerability description are compared with the file locations and function locations extracted directly from the code change of the commit, to judge whether the file and function named in the vulnerability description are modified by the commit; if they are, vulnerability location features are generated, wherein the vulnerability location features comprise: a file location feature and a function location feature; if not, the pair has no vulnerability location features;
generating vulnerability type features: for each vulnerability-commit pair, based on the code change of the commit, for each .c (C project) or .cpp (C++ project) file among the changed files, function-level vulnerability detection based on a graph convolutional neural network is performed on the code of the file before the change and on the code after the change; if a certain type of vulnerability is detected in the code before the change and is not detected in the code after the change, the code change is judged to fix that vulnerability, and the corresponding vulnerability type is obtained; the specific process is as follows:
for an affected .c (C project) or .cpp (C++ project) file, the code of the file before the change is first divided into n functions; the abstract syntax tree (AST), control flow graph (CFG), and data flow graph (DFG) of each function are extracted to represent the syntactic and semantic information of the function, the source code of the function is encoded into a vector, and the vector is input into a pre-trained graph convolutional neural network model for vulnerability detection; if the model reports that the function is vulnerable, the vulnerability type is extracted; if the model reports that the function is not vulnerable, the next function is examined; after all functions in the file have been examined, a vulnerability dictionary mapping function names to vulnerability types is obtained for the file;
the code of the file after the change is then divided into n functions, and the same process is executed to obtain a vulnerability dictionary mapping function names to vulnerability types for the changed file;
the vulnerability dictionaries before and after the change are compared: if they do not differ, the code change is judged not to fix a vulnerability; if a function and its vulnerability present in the dictionary before the change have disappeared from the dictionary after the change, the code change is judged to fix that vulnerability, and the type of the fixed vulnerability is obtained; this vulnerability type is compared with the vulnerability type extracted from the vulnerability database NVD to generate the vulnerability type features;
generating vulnerability text similarity features: for each vulnerability-commit pair, based on the vulnerability description extracted from the vulnerability database NVD, the text of the function location entity that was extracted by named entity recognition when generating the vulnerability location features is deleted; the remaining text is tokenized, the tokens are combined into a text sequence, and stop word removal, lemmatization, and synonym replacement are applied to the text sequence to obtain the text sequence N_Sequence;
for the vulnerability description extracted from the vulnerability database X-Force and for the commit text, the following operations are performed respectively:
the text is tokenized, the tokens are combined into a text sequence, and stop word removal, lemmatization, and synonym replacement are applied to the text sequence;
finally, the text sequence X_Sequence and the text sequence C_Sequence are obtained respectively;
the importance of each word in the text sequence N_Text is calculated and quantified with the TF-IDF algorithm to generate the vector N_Vector corresponding to N_Text, as follows:
calculating the term frequency TF of the word w in the text sequence t:
$$TF = \frac{w\_num}{t\_num}$$
wherein w_num is the number of times the word w appears in the text sequence t, and t_num is the total number of words in the text sequence t;
calculating the inverse document frequency IDF of the word w over the set of text sequences d:
$$IDF = \log\frac{|d|}{|\{t \in d : w \in t\}|}$$
wherein |d| is the total number of text sequences in the set d, and |{t ∈ d : w ∈ t}| is the number of text sequences containing the word w;
calculating the term frequency-inverse document frequency TF-IDF of the word w: TF-IDF = TF × IDF;
vectorizing the text sequence N_Text into the vector N_Vector according to the TF-IDF value of each word;
obtaining the vector X_Vector of the text sequence X_Text and the vector C_Vector of the text sequence C_Text in the same way;
and finally, calculating the text similarity N_Similarity between the text sequence N_Text and the text sequence C_Text using cosine similarity:
$$N\_Similarity = \frac{N\_Vector \cdot C\_Vector}{\|N\_Vector\| \, \|C\_Vector\|}$$
the text similarity X_Similarity between the text sequence X_Text and the text sequence C_Text is calculated with the same formula, and the vulnerability text similarity features are generated from the values of N_Similarity and X_Similarity, wherein the vulnerability text similarity features comprise: the similarity between the NVD vulnerability description and the commit text, and the similarity between the X-Force vulnerability description and the commit text;
generating the vulnerability repair possibility feature: for each vulnerability-commit pair, whether the code change of the commit fixes a vulnerability is judged from the vulnerability detection result obtained when generating the vulnerability type features; if the code change fixes a vulnerability, the vulnerability repair possibility feature is generated; if not, the vulnerability-commit pair has no vulnerability repair possibility feature;
generating a date difference characteristic: for each vulnerability-submission pair, calculating a difference value between the vulnerability disclosure time extracted from the vulnerability database NVD and the submission time of the vulnerability:
Date_Diff=N_Date-C_Date
wherein Date_Diff represents the difference between the two, N_Date represents the vulnerability disclosure time, and C_Date represents the submission time;
judging the interval where the difference value is located according to the investigation result, and generating a date difference value characteristic according to the proportion of the number of vulnerability-patch pairs of the interval in the investigation result to the number of all investigation objects; the vulnerability-patch pair is a vulnerability and a patch corresponding to the vulnerability, and the investigation object is the vulnerability-patch pair.
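The date difference feature described above can be sketched as follows. This is a minimal sketch under stated assumptions: the interval boundaries and proportions in `BOUNDARIES` and `PROPORTIONS` are hypothetical placeholders (the real values would come from the investigation result, which is not reproduced in the text), and the function name `date_diff_feature` is illustrative only.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical interval edges in days, giving the intervals
# (-inf, 0), [0, 30), [30, 90), [90, 365), [365, +inf).
BOUNDARIES = [0, 30, 90, 365]
# Hypothetical share of surveyed vulnerability-patch pairs per interval.
PROPORTIONS = [0.05, 0.60, 0.20, 0.10, 0.05]

def date_diff_feature(n_date: date, c_date: date) -> float:
    """Return the share of surveyed pairs whose Date_Diff falls in the
    same interval as this vulnerability-submission pair."""
    date_diff = (n_date - c_date).days          # Date_Diff = N_Date - C_Date
    return PROPORTIONS[bisect_right(BOUNDARIES, date_diff)]
```

A submission made nine days before disclosure falls in the second interval and receives that interval's proportion as its feature value.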
Further, in step 3,
for a vulnerability whose patch is known, extracting any two related vulnerability-submission pairs <i, j>, and calculating the true probability $\bar{P}_{ij}$ that the vulnerability-submission pair i is ranked before the vulnerability-submission pair j:
$$\bar{P}_{ij} = \frac{1}{2}\left(1 + S_{ij}\right)$$
wherein
$$S_{ij} = \begin{cases} 1, & \text{pair } i \text{ is more relevant than pair } j \\ 0, & \text{pair } i \text{ and pair } j \text{ are equally relevant} \\ -1, & \text{pair } i \text{ is less relevant than pair } j \end{cases}$$
then, according to the vulnerability related characteristics, expressing the two vulnerability-submission pairs <i, j> as feature vectors and inputting them into a scoring function S:
$$S = \sum_{n=1}^{k} W_n F_n$$
wherein k is the total number of features; $F_n$ is the n-th feature score of a vulnerability-submission pair; and $W_n$ is the scoring coefficient of the n-th feature of the vulnerability-submission pair, generated by the ranking model RankNet;
computing the scores $S_i$ and $S_j$ of the two vulnerability-submission pairs <i, j> by the scoring function S, and calculating, by using a sigmoid function, the prediction probability $P_{ij}$ that the vulnerability-submission pair i is ranked before the vulnerability-submission pair j:
$$P_{ij} = \frac{1}{1 + e^{-(S_i - S_j)}}$$
using the cross entropy C as the loss function of the model:
$$C = -\bar{P}_{ij} \log P_{ij} - \left(1 - \bar{P}_{ij}\right) \log\left(1 - P_{ij}\right)$$
then training a model to minimize a loss function;
obtaining a trained ranking model RankNet through the steps;
inputting the vulnerability-submission pairs related to a vulnerability whose patch is unknown into the trained ranking model RankNet, calculating the score of each of these vulnerability-submission pairs, and sorting the pairs by score to obtain the ranked vulnerability-submission pairs.
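The training in step 3 can be sketched as follows. This is a minimal sketch, assuming a linear scoring function (a weighted sum of feature scores) trained by plain stochastic gradient descent on the pairwise cross entropy; the names `score` and `train_ranknet`, the learning rate and the epoch count are hypothetical, and the feature values are invented for illustration.

```python
import math

def score(w, f):
    """Scoring function: weighted sum of the feature scores of one pair."""
    return sum(wn * fn for wn, fn in zip(w, f))

def train_ranknet(samples, k, lr=0.1, epochs=200):
    """samples: list of (F_i, F_j, p_true), where p_true is the true
    probability that pair i is ranked before pair j."""
    w = [0.0] * k
    for _ in range(epochs):
        for f_i, f_j, p_true in samples:
            # Predicted probability from the sigmoid of the score difference
            p_ij = 1.0 / (1.0 + math.exp(-(score(w, f_i) - score(w, f_j))))
            # Gradient of the cross entropy w.r.t. the score difference
            grad = p_ij - p_true
            w = [wn - lr * grad * (a - b) for wn, a, b in zip(w, f_i, f_j)]
    return w

# Made-up two-feature example: the patch submission should outrank the other one.
patch, other = [1.0, 0.8], [0.0, 0.1]
weights = train_ranknet([(patch, other, 1.0), (other, patch, 0.0)], k=2)
ranked = sorted([other, patch], key=lambda f: score(weights, f), reverse=True)
```

After training, sorting the candidate pairs by `score` in descending order yields the ranking that is handed to manual inspection.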
The invention provides computer equipment comprising a memory, a processor and a computer program that is stored in the memory and can run on the processor, characterized in that the processor, when executing the computer program, implements the steps of the above ranking-based method for recovering links between C/C++ vulnerabilities and their patches.
The invention provides a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the above ranking-based method for recovering links between C/C++ vulnerabilities and their patches.
Beneficial effects:
1. the implicit relation between vulnerability-submission pairs can be better reflected by using targeted characteristics;
2. two vulnerability data sources are adopted, different weights are given to different vulnerability characteristics in the sequencing stage, and the condition that part of information in a certain vulnerability-submission pair is lost can be better dealt with;
3. counter-example reduction is performed before the vulnerability-submission pairs are sorted, which can greatly reduce the workload of manual inspection ultimately required without reducing recall rates.
Drawings
FIG. 1 is a flowchart of the ranking-based method for recovering links between C/C++ vulnerabilities and their patches according to the present invention.
FIG. 2 is a chart of the preliminary investigation of the date difference of vulnerability-submission pairs in the present invention.
FIG. 3 is a flowchart of feature generation for vulnerability-submission pairs in the present invention.
FIG. 4 is a flowchart of ranking vulnerability-submission pairs in the present invention.
Detailed Description
The following describes embodiments of the present invention with reference to the drawings.
Example one
As shown in fig. 1, the present embodiment discloses a ranking-based system for recovering links between C/C++ vulnerabilities and their patches, which includes:
the vulnerability data acquisition module, used for extracting basic data of vulnerabilities related to open source C/C++ projects from the vulnerability database NVD and the vulnerability database X-Force, extracting basic data of the related submissions from the software repository of the project corresponding to each vulnerability, and establishing vulnerability-submission pairs from the basic data of the vulnerability and the basic data of the related submissions;
the vulnerability-related feature generation module, used for generating vulnerability related characteristics for each vulnerability-submission pair from the basic data of the vulnerability and the basic data of the related submission;
the sorting module, used for training a neural-network-based ranking model RankNet with the vulnerability related characteristics to obtain a trained ranking model RankNet, inputting the vulnerability-submission pairs related to a vulnerability whose patch is unknown into the trained ranking model RankNet, calculating the score of each of these pairs with the ranking model RankNet, and sorting the pairs by score to obtain the ranked vulnerability-submission pairs;
the manual inspection module, used for manually inspecting the ranked vulnerability-submission pairs in ranking order and judging whether the submission in a vulnerability-submission pair is a patch submission; if so, the vulnerability has been repaired, and the link between the vulnerability and its corresponding patch submission is recovered; if not, the submission does not repair the vulnerability, and no link is recovered between the vulnerability and the submission.
Furthermore, in the vulnerability data acquisition module,
for the vulnerability database NVD, screening the vulnerabilities related to C/C++ projects according to whether the vulnerability description mentions .c/.cpp files, and acquiring, for each related vulnerability entry, the vulnerability ID (CVE-ID), the disclosure date (N_Date), the vulnerability description (N_Description), the reference links (References), the software version configuration affected by the vulnerability (N_Version) and the vulnerability type (CWE); for the vulnerability database X-Force, obtaining the description (X_Description) of the related vulnerability; for the software repository of the project corresponding to the vulnerability, screening submissions according to whether the submission modifies a .c/.cpp file and whether the submission was made after the affected version, acquiring the code change (Codeff), submission date (C_Date), submission title (C_Title) and submission message (C_Message) of each submission that meets the screening conditions, and splicing the submission title and the submission message into a submission text (C_Text); for a given vulnerability entry, taking the data extracted from the vulnerability database NVD and the vulnerability database X-Force as the vulnerability data (V_Data), taking the data extracted from the n qualifying submissions in the software repository of the corresponding project as the submission data (C_Data_i, i ∈ [1, n]), and then merging the vulnerability data with the submission data of each of the n submissions in turn, finally obtaining n vulnerability-submission pairs containing vulnerability information and submission information.
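The pairing step above can be sketched as follows. This is a minimal sketch, not the acquisition module itself: the class names, the `changed_files` field and the use of a single `affected_version_date` to approximate "made after the affected version" are assumptions introduced for illustration; only the names N_Date, N_Description, X_Description, CWE, C_Date, C_Title, C_Message and C_Text come from the text.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class VulnData:                       # V_Data: fields from NVD and X-Force
    cve_id: str
    n_date: date
    n_description: str
    x_description: str
    cwe: str

@dataclass
class SubmissionData:                 # C_Data_i: fields from one submission
    c_date: date
    c_title: str
    c_message: str
    changed_files: list = field(default_factory=list)  # hypothetical helper field

    @property
    def c_text(self) -> str:
        # C_Text: submission title spliced with the submission message
        return f"{self.c_title}\n{self.c_message}"

def build_pairs(vuln, submissions, affected_version_date):
    """Keep submissions that modify a .c/.cpp file and postdate the
    affected version, and pair each of them with the vulnerability."""
    return [
        (vuln, s) for s in submissions
        if s.c_date > affected_version_date
        and any(path.endswith((".c", ".cpp")) for path in s.changed_files)
    ]
```

Each returned tuple corresponds to one vulnerability-submission pair, so a vulnerability with n qualifying submissions yields n pairs.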
Further, as shown in fig. 3, in the vulnerability-related feature generation module,
generating vulnerability identifier characteristics: for each vulnerability-submission pair, based on the reference links extracted from the vulnerability database NVD, a regular expression ". jira./(. a)", ". bugzilla.,? id ═ ma? id ═ and ". bugs. eclipse.? Extracting existing defect IDs, judging whether submitted texts contain related defect IDs and vulnerability IDs by combining the corresponding vulnerability IDs, and if so, generating vulnerability identifier characteristics, wherein the vulnerability identifier characteristics comprise: a defect ID feature (Bug _ ID feature) and a vulnerability ID feature (CVE _ ID feature); if not, the vulnerability does not have vulnerability identifier characteristics;
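The identifier check can be illustrated as follows. This is a minimal sketch: the regular expressions in the text above did not survive extraction, so the two patterns in `BUG_ID_PATTERNS` are hypothetical stand-ins for Jira-style and Bugzilla-style reference links, and the 0/1 encoding of the Bug_ID and CVE_ID features is likewise an assumption.

```python
import re

# Hypothetical patterns, NOT the ones from the original text.
BUG_ID_PATTERNS = [
    re.compile(r"jira[^/]*/browse/([A-Z][A-Z0-9]*-\d+)"),   # e.g. .../jira/browse/PROJ-123
    re.compile(r"bugzilla[^?]*\?id=(\d+)"),                 # e.g. .../show_bug.cgi?id=4567
]

def extract_bug_ids(reference_links):
    """Collect defect IDs found in the reference links of a vulnerability."""
    ids = set()
    for link in reference_links:
        for pattern in BUG_ID_PATTERNS:
            match = pattern.search(link)
            if match:
                ids.add(match.group(1))
    return ids

def identifier_features(cve_id, reference_links, c_text):
    """Return 1/0 flags for whether the submission text mentions a
    related defect ID (Bug_ID) or the vulnerability ID (CVE_ID)."""
    text = c_text.lower()
    bug_ids = extract_bug_ids(reference_links)
    return {
        "Bug_ID": int(any(b.lower() in text for b in bug_ids)),
        "CVE_ID": int(cve_id.lower() in text),
    }
```

Plain substring matching is used for brevity; a stricter check would match IDs on word boundaries to avoid hits inside longer numbers.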
generating vulnerability location characteristics: for each vulnerability-submission pair, extracting a File Location (File Location) where the vulnerability occurs using regular expressions "(. is; the extraction process of the function position is as follows: converting the vulnerability Description (N _ Description) into an input sequence of a Bert pre-training model, and inputting the input sequence into the Bert pre-training model for calculation to obtain a representation Vector (Desc _ Vector); inputting the expression vector into a Bi-LSTM layer of the named entity recognition model for calculation to obtain a text Feature (Desc _ Feature); then, the text features are used as the input of a CRF layer of a named entity recognition model, text marking is carried out, a function position entity of the vulnerability is obtained, and the function position entity is used as the function position of the vulnerability;
comparing the file position and the function position extracted from the vulnerability description with the file position and the function position directly extracted from the submitted code change, judging whether the file position and the function position extracted from the vulnerability description are changed or not, and if the file position and the function position are changed, generating vulnerability position characteristics, wherein the vulnerability position characteristics comprise: a File Location feature (File _ Location feature) and a function Location feature (Func _ Location feature); if not, the vulnerability has no vulnerability position characteristics;
generating vulnerability type characteristics: for each vulnerability-submission pair, acquiring a pre-change code and a post-change code of a file change function based on submitted code change (Codeff) in a code change file, respectively performing function level vulnerability detection based on a pre-trained graph convolution neural network on the pre-change code and the post-change code, if a certain type of vulnerability is detected in the pre-change code and the type of vulnerability is not detected in the post-change code, judging that the vulnerability is repaired in the code change, and acquiring a vulnerability type corresponding to the vulnerability, wherein the specific process is as follows:
for an affected .c (C project) or .cpp (C++ project) file, firstly dividing the pre-change code of the file into n functions, extracting the abstract syntax tree AST, the control flow graph CFG and the data flow graph DFG of each function to represent the syntactic and semantic information of the function, encoding the source code of the function into a vector, and inputting the vector into the pre-trained graph convolution neural network model for vulnerability detection; if the model output indicates that the function has a vulnerability, extracting the vulnerability type (C_TypeBeform_n) of the vulnerability; if the model output indicates that the function has no vulnerability, detecting the next function; after all the functions in the file have been detected, obtaining a vulnerability dictionary (C_TypeBeform_Dict) consisting of the function names and vulnerability types of the file;
then dividing the post-change code of the file into n functions and performing the same operations to obtain a vulnerability dictionary (C_TypeAfter_Dict) consisting of the function names and vulnerability types of the file;
comparing the vulnerability dictionaries before and after the file change: if the two dictionaries do not differ, judging that the code change does not repair a vulnerability; if a function and its vulnerability in the pre-change dictionary disappear from the post-change dictionary, judging that the vulnerability is repaired by the code change, and obtaining the vulnerability type (C_Type) of the repaired vulnerability; comparing the vulnerability type (C_Type) with the vulnerability type (CWE) extracted from the vulnerability database NVD, and generating the vulnerability type feature (Vul_Type feature);
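The dictionary comparison can be sketched as follows. This is a minimal sketch, assuming the two vulnerability dictionaries have already been produced by the detector as plain mappings from function name to vulnerability type; the function names and the 0/1 encoding of the resulting features are hypothetical.

```python
def fixed_vulnerabilities(before, after):
    """Return the {function: type} entries that are present in the
    pre-change dictionary and no longer present in the post-change one."""
    return {fn: vtype for fn, vtype in before.items() if after.get(fn) != vtype}

def type_features(before, after, cwe):
    """Derive the repair flag and the type-match flag for one file change."""
    fixed = fixed_vulnerabilities(before, after)
    return {
        "Vul_Fix": int(bool(fixed)),              # some vulnerability disappeared
        "Vul_Type": int(cwe in fixed.values()),   # a repaired type equals the NVD CWE
    }

# Invented detector output for one changed file
c_type_before = {"parse_header": "CWE-119", "copy_buf": "CWE-787"}
c_type_after = {"copy_buf": "CWE-787"}
features = type_features(c_type_before, c_type_after, "CWE-119")
```

Here `parse_header` loses its CWE-119 entry after the change, so both flags are set; identical dictionaries would leave both at zero.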
generating vulnerability text similarity characteristics: for each vulnerability-submission pair, based on the vulnerability description extracted from the vulnerability database NVD, deleting the text corresponding to the function position entity that was extracted by named entity recognition when generating the vulnerability position characteristics, then tokenizing the text, combining the tokens into a text sequence, and performing stop word removal, lemmatization and synonym replacement on the text sequence to obtain a text sequence N_Sequence;
for the vulnerability description extracted from the vulnerability database X-Force and for the submission text, respectively performing the following operations:
tokenizing the text, combining the tokens into a text sequence, and then performing stop word removal, lemmatization and synonym replacement on the text sequence;
finally obtaining a text sequence X_Sequence and a text sequence C_Sequence, respectively;
calculating the importance of each word in the text sequence N_Text by using the TF-IDF algorithm, quantifying the importance of the words, and generating the vector N_Vector corresponding to the text sequence N_Text, wherein the calculation process is as follows:
calculating the term frequency TF of the word w in the text sequence t:
$$\mathrm{TF} = \frac{w\_num}{t\_num}$$
wherein w_num is the number of times the word w appears in the text sequence t, and t_num is the total number of words in the text sequence t;
calculating the inverse document frequency IDF of the word w in the text sequence set d:
$$\mathrm{IDF} = \log \frac{|d|}{|\{t \in d : w \in t\}|}$$
wherein |d| represents the total number of text sequences in the text sequence set d, and |{t ∈ d : w ∈ t}| represents the number of text sequences containing the word w;
calculating the term frequency-inverse document frequency TF-IDF of the word w: TF-IDF = TF × IDF;
vectorizing the text sequence N_Text into a vector N_Vector according to the TF-IDF value of each word;
obtaining a vector X_Vector of the text sequence X_Text and a vector C_Vector of the text sequence C_Text by the same method;
and finally, calculating the text similarity N_Similarity between the text sequence N_Text and the text sequence C_Text by using the cosine similarity:
$$\text{N\_Similarity} = \frac{\text{N\_Vector} \cdot \text{C\_Vector}}{\|\text{N\_Vector}\| \, \|\text{C\_Vector}\|}$$
calculating the text similarity X_Similarity between the text sequence X_Text and the text sequence C_Text by the same formula, and generating vulnerability text similarity characteristics according to the values of N_Similarity and X_Similarity, wherein the vulnerability text similarity characteristics comprise: the NVD vulnerability description-submission text similarity feature (N_Similarity feature) and the X-Force vulnerability description-submission text similarity feature (X_Similarity feature);
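The TF-IDF vectorization and cosine similarity above can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions: the text sequence set d is taken to be all token lists passed in, the natural logarithm is used for IDF, and the names `tf_idf_vectors` and `cosine` as well as the token lists are invented for illustration.

```python
import math
from collections import Counter

def tf_idf_vectors(sequences):
    """sequences: list of token lists forming the text sequence set d.
    Returns one TF-IDF vector per sequence over a shared vocabulary."""
    vocab = sorted({w for t in sequences for w in t})
    doc_freq = Counter(w for t in sequences for w in set(t))
    total = len(sequences)
    vectors = []
    for t in sequences:
        counts = Counter(t)
        length = len(t) or 1                      # guard against empty sequences
        vectors.append([
            (counts[w] / length) * math.log(total / doc_freq[w])   # TF * IDF
            for w in vocab
        ])
    return vectors

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

n_text = ["heap", "overflow", "parse", "header"]
x_text = ["buffer", "overflow", "parse", "header", "crash"]
c_text = ["fix", "overflow", "parse", "header"]
unrelated = ["update", "readme", "docs"]
n_vec, x_vec, c_vec, u_vec = tf_idf_vectors([n_text, x_text, c_text, unrelated])
n_similarity = cosine(n_vec, c_vec)
x_similarity = cosine(x_vec, c_vec)
```

Note that a word occurring in every sequence of d gets an IDF of zero, so the choice of d affects which shared words contribute to the similarity.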
generating vulnerability repair possibility characteristics: for each vulnerability-submission pair, judging whether the code change repairs a vulnerability according to the vulnerability detection result obtained from the submitted code change when generating the vulnerability type characteristics; if the code change repairs a vulnerability, generating the vulnerability repair possibility feature (Vul_Fix feature); if the code change does not repair a vulnerability, determining that the vulnerability-submission pair does not have the vulnerability repair possibility feature;
generating the date difference feature (Date_Diff feature): for each vulnerability-submission pair, calculating the difference between the vulnerability disclosure time extracted from the vulnerability database NVD and the submission time of the submission:
Date_Diff=N_Date-C_Date
wherein Date_Diff represents the difference between the two, N_Date represents the vulnerability disclosure time, and C_Date represents the submission time;
as shown in fig. 2, determining the interval in which the difference falls according to the investigation result, and generating the date difference feature according to the proportion of the number of vulnerability-patch pairs in that interval to the number of all investigation objects in the investigation result; a vulnerability-patch pair is a vulnerability together with its corresponding patch, and the investigation objects are vulnerability-patch pairs.
At this point, 6 categories of characteristics (9 vulnerability features in total) have been generated for each vulnerability-submission pair, namely:
(1) vulnerability identifier characteristics: Bug_ID feature, CVE_ID feature;
(2) vulnerability location characteristics: File_Location feature, Func_Location feature;
(3) vulnerability type characteristics: Vul_Type feature;
(4) text similarity characteristics: N_Similarity feature, X_Similarity feature;
(5) repair possibility characteristics: Vul_Fix feature;
(6) date difference characteristics: Date_Diff feature.
Further, as shown in fig. 4, in the sorting module,
for a vulnerability whose patch is known, extracting any two related vulnerability-submission pairs <i, j>, and calculating the true probability $\bar{P}_{ij}$ that the vulnerability-submission pair i is ranked before the vulnerability-submission pair j:
$$\bar{P}_{ij} = \frac{1}{2}\left(1 + S_{ij}\right)$$
wherein
$$S_{ij} = \begin{cases} 1, & \text{pair } i \text{ is more relevant than pair } j \\ 0, & \text{pair } i \text{ and pair } j \text{ are equally relevant} \\ -1, & \text{pair } i \text{ is less relevant than pair } j \end{cases}$$
then, according to the vulnerability related characteristics, expressing the two vulnerability-submission pairs <i, j> as feature vectors and inputting them into a scoring function S:
$$S = \sum_{n=1}^{k} W_n F_n$$
wherein k is the total number of features, and in this embodiment k = 9; $F_n$ is the n-th feature score of a vulnerability-submission pair; and $W_n$ is the scoring coefficient of the n-th feature of the vulnerability-submission pair, generated by the ranking model RankNet;
computing the scores $S_i$ and $S_j$ of the two vulnerability-submission pairs <i, j> by the scoring function S, and calculating, by using a sigmoid function, the prediction probability $P_{ij}$ that the vulnerability-submission pair i is ranked before the vulnerability-submission pair j:
$$P_{ij} = \frac{1}{1 + e^{-(S_i - S_j)}}$$
using the cross entropy C as the loss function of the model:
$$C = -\bar{P}_{ij} \log P_{ij} - \left(1 - \bar{P}_{ij}\right) \log\left(1 - P_{ij}\right)$$
then training a model to minimize a loss function;
obtaining a trained ranking model RankNet through the steps;
inputting the vulnerability-submission pairs related to a vulnerability whose patch is unknown into the trained ranking model RankNet, calculating the score of each of these vulnerability-submission pairs, and sorting the pairs by score to obtain the vulnerability-submission pairs ranked in the Top 1, Top 5 and Top 30.
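As a worked step for the training described in the sorting module, the cross entropy can be simplified and differentiated once the sigmoid prediction is substituted in. The derivation below is a sketch under stated assumptions: the logarithm is natural, the scoring function is the weighted sum of feature scores, and the superscripts $(i)$ and $(j)$, which mark the feature scores of pair i and pair j, are notation introduced here for illustration.

```latex
\begin{aligned}
x &= S_i - S_j, \qquad P_{ij} = \frac{1}{1 + e^{-x}} \\
C &= -\bar{P}_{ij}\log P_{ij} - \left(1-\bar{P}_{ij}\right)\log\left(1-P_{ij}\right)
   = -\bar{P}_{ij}\,x + \log\left(1 + e^{x}\right) \\
\frac{\partial C}{\partial x} &= P_{ij} - \bar{P}_{ij}, \qquad
\frac{\partial C}{\partial W_n} = \left(P_{ij} - \bar{P}_{ij}\right)\left(F_n^{(i)} - F_n^{(j)}\right)
\end{aligned}
```

The gradient vanishes exactly when the predicted probability equals the true probability, which is the condition that minimizing the loss drives toward.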
Further, in the manual inspection module,
manually checking the vulnerability-submission pairs ranked in the Top 1, Top 5 and Top 30, and judging whether the submission in a vulnerability-submission pair is a patch submission; if so, the vulnerability has been repaired, and the link between the vulnerability and its corresponding patch submission is recovered; if not, the submission does not repair the vulnerability, and no link is recovered between the vulnerability and the submission.
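The inspection order can be sketched as follows. This is a minimal sketch under stated assumptions: the reviewer's decision is modelled by a hypothetical callback `is_patch`, and widening the inspected range from Top 1 to Top 5 to Top 30 in successive passes is one possible reading of the text, not a confirmed procedure.

```python
def ranked_submissions(scored_pairs):
    """scored_pairs: list of (submission_id, score); highest score first."""
    ordered = sorted(scored_pairs, key=lambda p: p[1], reverse=True)
    return [sid for sid, _ in ordered]

def recover_link(scored_pairs, is_patch, cutoffs=(1, 5, 30)):
    """Inspect the ranking in widening Top-k windows and return the first
    submission confirmed as a patch together with the cutoff that found it."""
    ranked = ranked_submissions(scored_pairs)
    checked = 0
    for k in cutoffs:
        for sid in ranked[checked:k]:         # skip submissions already inspected
            if is_patch(sid):
                return sid, k
        checked = max(checked, min(k, len(ranked)))
    return None, None                          # no link recovered
```

Tracking `checked` keeps each submission from being shown to the reviewer more than once as the window widens.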
Example two
As shown in fig. 1, this embodiment provides a ranking-based method for recovering links between C/C++ vulnerabilities and their patches, which includes:
step 1, extracting basic data of vulnerabilities related to open source C/C++ projects from the vulnerability database NVD and the vulnerability database X-Force, extracting basic data of the related submissions from the software repository of the project corresponding to each vulnerability, and establishing vulnerability-submission pairs from the basic data of the vulnerability and the basic data of the related submissions;
step 2, for each vulnerability-submission pair, generating vulnerability related characteristics from the basic data of the vulnerability and the basic data of the related submission;
step 3, training a neural-network-based ranking model RankNet with the vulnerability related characteristics to obtain a trained ranking model RankNet, inputting the vulnerability-submission pairs related to a vulnerability whose patch is unknown into the trained ranking model RankNet, calculating the score of each of these pairs with the ranking model RankNet, and sorting the pairs by score to obtain the ranked vulnerability-submission pairs;
step 4, manually inspecting the ranked vulnerability-submission pairs in ranking order and judging whether the submission in a vulnerability-submission pair is a patch submission; if so, the vulnerability has been repaired, and the link between the vulnerability and its corresponding patch submission is recovered; if not, the submission does not repair the vulnerability, and no link is recovered between the vulnerability and the submission.
Further, in step 1,
for the vulnerability database NVD, screening the vulnerabilities related to C/C++ projects according to whether the vulnerability description mentions .c/.cpp files, and acquiring, for each related vulnerability entry, the vulnerability ID (CVE-ID), the disclosure date (N_Date), the vulnerability description (N_Description), the reference links (References), the software version configuration affected by the vulnerability (N_Version) and the vulnerability type (CWE); for the vulnerability database X-Force, obtaining the description (X_Description) of the related vulnerability; for the software repository of the project corresponding to the vulnerability, screening submissions according to whether the submission modifies a .c/.cpp file and whether the submission was made after the affected version, acquiring the code change (Codeff), submission date (C_Date), submission title (C_Title) and submission message (C_Message) of each submission that meets the screening conditions, and splicing the submission title and the submission message into a submission text (C_Text); for a given vulnerability entry, taking the data extracted from the vulnerability database NVD and the vulnerability database X-Force as the vulnerability data (V_Data), taking the data extracted from the n qualifying submissions in the software repository of the corresponding project as the submission data (C_Data_i, i ∈ [1, n]), and then merging the vulnerability data with the submission data of each of the n submissions in turn, finally obtaining n vulnerability-submission pairs containing vulnerability information and submission information.
Further, as shown in fig. 3, in step 2,
generating vulnerability identifier characteristics: for each vulnerability-submission pair, based on the reference links extracted from the vulnerability database NVD, a regular expression ". jira./(. a)", ". bugzilla.,? id ═ ma? id ═ and ". bugs. eclipse.? Extracting existing defect IDs, judging whether submitted texts contain related defect IDs and vulnerability IDs by combining the corresponding vulnerability IDs, and if so, generating vulnerability identifier characteristics, wherein the vulnerability identifier characteristics comprise: a defect ID feature (Bug _ ID feature) and a vulnerability ID feature (CVE _ ID feature); if not, the vulnerability does not have vulnerability identifier characteristics;
generating vulnerability location characteristics: for each vulnerability-submission pair, extracting a File Location (File Location) where the vulnerability occurs using regular expressions "(. is; the extraction process of the function position is as follows: converting the vulnerability Description (N _ Description) into an input sequence of a Bert pre-training model, and inputting the input sequence into the Bert pre-training model for calculation to obtain a representation Vector (Desc _ Vector); inputting the expression vector into a Bi-LSTM layer of the named entity recognition model for calculation to obtain a text Feature (Desc _ Feature); then, the text features are used as the input of a CRF layer of a named entity recognition model, text marking is carried out, a function position entity of the vulnerability is obtained, and the function position entity is used as the function position of the vulnerability;
comparing the file position and the function position extracted from the vulnerability description with the file position and the function position directly extracted from the submitted code change, judging whether the file position and the function position extracted from the vulnerability description are changed or not, and if the file position and the function position are changed, generating vulnerability position characteristics, wherein the vulnerability position characteristics comprise: a File Location feature (File _ Location feature) and a function Location feature (Func _ Location feature); if not, the vulnerability has no vulnerability position characteristics;
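The comparison against the submitted code change can be sketched as follows. This is a minimal sketch, assuming the code change is available as a git-style unified diff whose hunk headers carry the enclosing function signature; that header convention, the helper names and the 0/1 encoding of the features are assumptions, not details given in the text.

```python
import re

def changed_locations(diff_text):
    """Collect changed file names and function names from a unified diff."""
    files, funcs = set(), set()
    for line in diff_text.splitlines():
        if line.startswith("+++ "):
            path = line[4:].strip()
            if path == "/dev/null":            # file was deleted
                continue
            files.add(path.rsplit("/", 1)[-1])
        elif line.startswith("@@"):
            # First identifier directly followed by "(" after the hunk range
            match = re.search(r"@@.*@@.*?(\w+)\s*\(", line)
            if match:
                funcs.add(match.group(1))
    return files, funcs

def location_features(desc_file, desc_func, diff_text):
    """Flag whether the file and function named in the vulnerability
    description are among the changed files and functions."""
    files, funcs = changed_locations(diff_text)
    return {
        "File_Location": int(desc_file in files),
        "Func_Location": int(desc_func in funcs),
    }

diff = (
    "--- a/src/parser.c\n"
    "+++ b/src/parser.c\n"
    "@@ -10,7 +10,8 @@ static int parse_header(struct buf *b)\n"
    "-    memcpy(dst, src, len);\n"
    "+    if (len <= cap) memcpy(dst, src, len);\n"
)
```

Only the base name of each changed file is compared, since vulnerability descriptions usually mention a file name rather than a full repository path.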
generating vulnerability type characteristics: for each vulnerability-submission pair, acquiring a pre-change code and a post-change code of a file change function based on submitted code change (Codeff) in a code change file, respectively performing function level vulnerability detection based on a pre-trained graph convolution neural network on the pre-change code and the post-change code, if a certain type of vulnerability is detected in the pre-change code and the type of vulnerability is not detected in the post-change code, judging that the vulnerability is repaired in the code change, and acquiring a vulnerability type corresponding to the vulnerability, wherein the specific process is as follows:
for an affected .c (C project) or .cpp (C++ project) file, firstly dividing the pre-change code of the file into n functions, extracting the abstract syntax tree AST, the control flow graph CFG and the data flow graph DFG of each function to represent the syntactic and semantic information of the function, encoding the source code of the function into a vector, and inputting the vector into the pre-trained graph convolution neural network model for vulnerability detection; if the model output indicates that the function has a vulnerability, extracting the vulnerability type (C_TypeBeform_n) of the vulnerability; if the model output indicates that the function has no vulnerability, detecting the next function; after all the functions in the file have been detected, obtaining a vulnerability dictionary (C_TypeBeform_Dict) consisting of the function names and vulnerability types of the file;
then dividing the post-change code of the file into n functions and performing the same operations to obtain a vulnerability dictionary (C_TypeAfter_Dict) consisting of the function names and vulnerability types of the file;
comparing the vulnerability dictionaries before and after the file change: if the two dictionaries do not differ, judging that the code change does not repair a vulnerability; if a function and its vulnerability in the pre-change dictionary disappear from the post-change dictionary, judging that the vulnerability is repaired by the code change, and obtaining the vulnerability type (C_Type) of the repaired vulnerability; comparing the vulnerability type (C_Type) with the vulnerability type (CWE) extracted from the vulnerability database NVD, and generating the vulnerability type feature (Vul_Type feature);
generating vulnerability text similarity characteristics: for each vulnerability-submission pair, based on the vulnerability description extracted from the vulnerability database NVD, deleting the text corresponding to the function position entity that was extracted by named entity recognition when generating the vulnerability position characteristics, then tokenizing the text, combining the tokens into a text sequence, and performing stop word removal, lemmatization and synonym replacement on the text sequence to obtain a text sequence N_Sequence;
for the vulnerability description extracted from the vulnerability database X-Force and for the submission text, respectively performing the following operations:
tokenizing the text, combining the tokens into a text sequence, and then performing stop word removal, lemmatization and synonym replacement on the text sequence;
finally obtaining a text sequence X_Sequence and a text sequence C_Sequence, respectively;
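The preprocessing steps above can be sketched as follows. This is a minimal sketch: the stop word list, the lemma table and the synonym table are deliberately tiny hypothetical stand-ins (a real pipeline would use full linguistic resources), and the `drop` parameter models the removal of function position entities from the NVD description.

```python
import re

# Hypothetical miniature resources for illustration only.
STOP_WORDS = {"a", "an", "the", "in", "of", "to", "via", "is"}
LEMMAS = {"overflows": "overflow", "crashes": "crash", "allows": "allow", "fixed": "fix"}
SYNONYMS = {"bug": "vulnerability", "flaw": "vulnerability"}

def preprocess(text, drop=()):
    """Tokenize, remove stop words, lemmatize and replace synonyms."""
    for entity in drop:                       # e.g. function names found by NER
        text = text.replace(entity, " ")
    tokens = re.findall(r"[a-z0-9_]+", text.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS]
    tokens = [LEMMAS.get(t, t) for t in tokens]
    return [SYNONYMS.get(t, t) for t in tokens]

n_sequence = preprocess("A flaw in parse_header allows heap overflows",
                        drop=["parse_header"])
c_sequence = preprocess("Fixed the bug in the header parser")
```

Applying the same function to the X-Force description and to the submission text keeps the three sequences in a shared vocabulary, which the later TF-IDF step relies on.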
calculating the importance of each word in the text sequence N_Text by using the TF-IDF algorithm, quantifying the importance of the words, and generating the vector N_Vector corresponding to the text sequence N_Text, wherein the calculation process is as follows:
calculating the term frequency TF of the word w in the text sequence t:
$$\mathrm{TF} = \frac{w\_num}{t\_num}$$
wherein w_num is the number of times the word w appears in the text sequence t, and t_num is the total number of words in the text sequence t;
calculating the inverse document frequency IDF of the word w in the text sequence set d:
$$\mathrm{IDF} = \log \frac{|d|}{|\{t \in d : w \in t\}|}$$
wherein |d| represents the total number of text sequences in the text sequence set d, and |{t ∈ d : w ∈ t}| represents the number of text sequences containing the word w;
calculating the term frequency-inverse document frequency TF-IDF of the word w: TF-IDF = TF × IDF;
vectorizing the text sequence N_Text into a vector N_Vector according to the TF-IDF value of each word;
obtaining a vector X_Vector of the text sequence X_Text and a vector C_Vector of the text sequence C_Text by the same method;
and finally, calculating the text similarity N_Similarity between the text sequence N_Text and the text sequence C_Text by using the cosine similarity:
$$\text{N\_Similarity} = \frac{\text{N\_Vector} \cdot \text{C\_Vector}}{\|\text{N\_Vector}\| \, \|\text{C\_Vector}\|}$$
calculating the text similarity X_Similarity between the text sequence X_Text and the text sequence C_Text by the same formula, and generating vulnerability text similarity characteristics according to the values of N_Similarity and X_Similarity, wherein the vulnerability text similarity characteristics comprise: the NVD vulnerability description-submission text similarity feature (N_Similarity feature) and the X-Force vulnerability description-submission text similarity feature (X_Similarity feature);
generating vulnerability repair possibility characteristics: for each vulnerability-submission pair, based on the submitted code change, judging whether the code change repairs a vulnerability according to the vulnerability detection result obtained when generating the vulnerability type characteristics; if the code change repairs a vulnerability, generating the vulnerability repair possibility characteristic (Vul_Fix feature); if not, the vulnerability-submission pair has no vulnerability repair possibility characteristic;
generating the date difference characteristic (Date_Diff feature): for each vulnerability-submission pair, calculating the difference between the vulnerability disclosure time extracted from the vulnerability database NVD and the submission time of the submission:
Date_Diff = N_Date - C_Date
wherein Date_Diff represents the difference between the two, N_Date represents the vulnerability disclosure time, and C_Date represents the submission time;
as shown in fig. 2, determining the interval in which the difference falls according to the investigation result, and generating the date difference characteristic as the proportion of the number of vulnerability-patch pairs in that interval to the number of all investigated objects in the investigation result; a vulnerability-patch pair consists of a vulnerability and the patch corresponding to it, and the investigated objects are vulnerability-patch pairs.
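The date difference characteristic can be illustrated with the short Python sketch below. The interval boundaries and the per-interval proportions are placeholders invented for the example (the actual values come from the investigation result of fig. 2 and are not reproduced here), and the function name `date_diff_feature` is likewise hypothetical.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical interval boundaries in days; the real intervals come from the
# investigation result shown in fig. 2 and are not reproduced here.
BOUNDARIES = [-30, 0, 30, 90]
# Hypothetical share of investigated vulnerability-patch pairs per interval
# (placeholder values for illustration only; they sum to 1).
PROPORTIONS = [0.05, 0.10, 0.50, 0.25, 0.10]


def date_diff_feature(n_date, c_date):
    """Return the share of investigated pairs whose Date_Diff lies in the same interval."""
    date_diff = (n_date - c_date).days          # Date_Diff = N_Date - C_Date
    interval = bisect_right(BOUNDARIES, date_diff)
    return PROPORTIONS[interval]


# Disclosure 14 days after the submission falls in the [0, 30) interval.
feature = date_diff_feature(date(2021, 6, 15), date(2021, 6, 1))
```

Mapping the raw difference to an interval proportion, rather than using the number of days directly, keeps the feature in the range 0 to 1 like the similarity features.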
At this point, 6 categories of characteristics (9 vulnerability features in total) have been formed for each vulnerability-submission pair, namely:
(1) vulnerability identifier characteristics: Bug_ID feature, CVE_ID feature;
(2) vulnerability location characteristics: File_Location feature, Func_Location feature;
(3) vulnerability type characteristics: Vul_Type feature;
(4) text similarity characteristics: N_Similarity feature, X_Similarity feature;
(5) repair possibility characteristics: Vul_Fix feature;
(6) date difference characteristic: Date_Diff feature.
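As an illustration, the nine features listed above can be collected into one feature vector per vulnerability-submission pair. The sketch below is an assumption for clarity only: the class name `PairFeatures`, the field order, and the encoding of present/absent features as 1.0/0.0 are not specified by the embodiment.

```python
from dataclasses import dataclass, astuple


@dataclass
class PairFeatures:
    """The nine features of one vulnerability-submission pair, in the order listed above."""
    bug_id: float         # assumed encoding: 1.0 if the submission text contains the defect ID
    cve_id: float         # assumed encoding: 1.0 if it contains the vulnerability ID
    file_location: float  # 1.0 if the file named in the description is changed
    func_location: float  # 1.0 if the function named in the description is changed
    vul_type: float       # 1.0 if the repaired type matches the NVD type
    n_similarity: float   # cosine similarity, NVD description vs. submission text
    x_similarity: float   # cosine similarity, X-Force description vs. submission text
    vul_fix: float        # 1.0 if the code change is judged to repair a vulnerability
    date_diff: float      # interval proportion of the date difference

    def to_vector(self):
        """Return the features as a plain list suitable for a scoring function."""
        return list(astuple(self))


pair = PairFeatures(1.0, 0.0, 1.0, 1.0, 0.0, 0.42, 0.31, 1.0, 0.50)
vector = pair.to_vector()
```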
Further, as shown in fig. 4, in step 3,
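for a vulnerability whose patch is known, extracting any two related vulnerability-submission pairs <i, j>, and calculating the true probability $\bar{P}_{ij}$ that the vulnerability-submission pair i is ranked before the vulnerability-submission pair j:
$$\bar{P}_{ij} = \frac{1}{2}\left(1 + S_{ij}\right)$$
wherein
$$S_{ij} = \begin{cases} 1, & \text{pair } i \text{ should be ranked before pair } j \\ 0, & \text{pairs } i \text{ and } j \text{ are equally relevant} \\ -1, & \text{pair } j \text{ should be ranked before pair } i \end{cases}$$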
Then, according to the vulnerability related characteristics, expressing the two vulnerability-submission pairs <i, j> as feature vectors and inputting them into a scoring function S:
$$S = \sum_{n=1}^{k} W_n F_n$$
wherein k is the total number of features, here k = 9; $F_n$ is the score of the nth feature of the vulnerability-submission pair; $W_n$ is the scoring coefficient of the nth feature of the vulnerability-submission pair, generated by the ranking model RankNet;
calculating the scores $S_i$ and $S_j$ of the two vulnerability-submission pairs <i, j> by the scoring function S, and calculating the predicted probability $P_{ij}$ that the vulnerability-submission pair i is ranked before the vulnerability-submission pair j by using the sigmoid function:
$$P_{ij} = \frac{1}{1 + e^{-(S_i - S_j)}}$$
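Using the cross entropy C as the loss function of the model:
$$C = -\bar{P}_{ij}\log P_{ij} - \left(1 - \bar{P}_{ij}\right)\log\left(1 - P_{ij}\right)$$
wherein $\bar{P}_{ij}$ denotes the true probability and $P_{ij}$ the predicted probability that the vulnerability-submission pair i is ranked before the vulnerability-submission pair j;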
then training the model to minimize the loss function;
obtaining the trained ranking model RankNet through the above steps;
inputting the vulnerability-submission pairs related to a vulnerability whose patch is unknown into the trained ranking model RankNet, calculating the score of each of these vulnerability-submission pairs, and ranking them by score to obtain the vulnerability-submission pairs ranked in the Top 1, Top 5 and Top 30.
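The pairwise training and ranking procedure of step 3 can be sketched as follows. This is a simplified illustration, not the model of the invention: it assumes a purely linear scoring function (the weighted feature sum), plain stochastic gradient descent on the pairwise cross entropy, the usual RankNet convention that the target probability is 1.0 when pair i should precede pair j and 0.0 in the opposite case, and three-dimensional toy feature vectors instead of the nine features. All names, data and hyperparameters are hypothetical.

```python
import math


def score(weights, features):
    """Scoring function: weighted sum of the feature scores."""
    return sum(w * f for w, f in zip(weights, features))


def pair_probability(weights, f_i, f_j):
    """Predicted probability that pair i is ranked before pair j (sigmoid of S_i - S_j)."""
    return 1.0 / (1.0 + math.exp(-(score(weights, f_i) - score(weights, f_j))))


def train(pairs, k, learning_rate=0.1, epochs=200):
    """Minimize the pairwise cross entropy by gradient descent.

    pairs: list of (f_i, f_j, p_true), p_true being the target probability
    that pair i is ranked before pair j.
    """
    weights = [0.0] * k
    for _ in range(epochs):
        for f_i, f_j, p_true in pairs:
            # For a linear scorer, dC/dW_n = (P_ij - p_true) * (F_i,n - F_j,n).
            error = pair_probability(weights, f_i, f_j) - p_true
            weights = [w - learning_rate * error * (a - b)
                       for w, a, b in zip(weights, f_i, f_j)]
    return weights


def rank(weights, candidates, top_k):
    """Sort candidate feature vectors by descending score and keep the first top_k."""
    return sorted(candidates, key=lambda f: score(weights, f), reverse=True)[:top_k]


# Toy feature vectors for one vulnerability whose patch is known (hypothetical).
patch = [1.0, 0.8, 1.0]
other_a = [0.0, 0.2, 0.0]
other_b = [0.0, 0.5, 1.0]
training_pairs = [(patch, other_a, 1.0), (patch, other_b, 1.0), (other_a, patch, 0.0)]

weights = train(training_pairs, k=3)
ranked = rank(weights, [other_a, other_b, patch], top_k=1)
```

Only the Top 1, Top 5 or Top 30 entries returned by a function such as `rank` would then be passed on to manual checking.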
Further, in step 4,
manually checking the obtained vulnerability-submission pairs ranked in the Top 1, Top 5 and Top 30, and judging whether the submission in each vulnerability-submission pair is a patch submission; if so, the vulnerability has been repaired, and the link between the vulnerability and its corresponding patch submission is recovered; if not, the vulnerability has not been repaired by that submission, and no link is recovered between the vulnerability and the submission.
The embodiment provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the above ranking-based method for recovering links between C/C++ vulnerabilities and their patches.
The present embodiment provides a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the above ranking-based method for recovering links between C/C++ vulnerabilities and their patches.
There are many specific methods and approaches for implementing this technical solution, and the foregoing is only a preferred embodiment of the invention. It should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the invention, and these improvements and modifications shall also be regarded as falling within the protection scope of the invention. Any component not specified in this embodiment can be implemented by the prior art.

Claims (10)

1. A ranking-based system for recovering links between C/C++ vulnerabilities and their patches, characterized by comprising:
the vulnerability data acquisition module is used for extracting basic data of vulnerabilities related to open source C/C++ projects from the vulnerability database NVD and the vulnerability database X-Force, extracting basic data of related submissions from the software repository of the project corresponding to each vulnerability, and establishing vulnerability-submission pairs from the basic data of the vulnerabilities and the basic data of the related submissions;
the vulnerability related characteristic generating module is used for generating vulnerability related characteristics for each vulnerability-submission pair through the basic data of the vulnerability and the basic data of the related submission;
the ranking module is used for training a neural-network-based ranking model RankNet on the vulnerability related characteristics to obtain a trained ranking model RankNet, inputting the vulnerability-submission pairs related to a vulnerability whose patch is unknown into the trained ranking model RankNet, calculating the score of each of these vulnerability-submission pairs with the ranking model RankNet, and ranking them by score to obtain ranked vulnerability-submission pairs;
the manual checking module is used for manually checking the ranked vulnerability-submission pairs in ranking order and judging whether the submission in each vulnerability-submission pair is a patch submission; if so, the vulnerability has been repaired, and the link between the vulnerability and its corresponding patch submission is recovered; if not, the vulnerability has not been repaired by that submission, and no link is recovered between the vulnerability and the submission.
2. The ranking-based system for recovering links between C/C++ vulnerabilities and their patches of claim 1, wherein in the vulnerability data acquisition module,
screening vulnerabilities related to C/C++ projects according to whether the vulnerability description mentions a .c/.cpp file, and obtaining, for the related vulnerability entries, the vulnerability ID, the disclosure date, the vulnerability description corresponding to the ID, the reference links, the software version configurations affected by the vulnerability, and the vulnerability type;
for a vulnerability database X-Force, obtaining the description of a relevant vulnerability;
for the software repository of the project corresponding to the vulnerability, screening submissions according to whether a .c/.cpp file is modified in the submission and whether the submission was released after the affected version, obtaining the code change, submission date, submission title and submission message of each submission that meets the screening conditions, and concatenating the submission title and the submission message into a submission text;
for a given vulnerability entry, taking the data extracted from the vulnerability database NVD and the vulnerability database X-Force as vulnerability data, taking the data extracted from the n qualifying submissions in the software repository of the project corresponding to the vulnerability as submission data, and merging the vulnerability data with the submission data of each of the n submissions in turn, finally obtaining n vulnerability-submission pairs containing vulnerability information and submission information.
3. The ranking-based system for recovering links between C/C++ vulnerabilities and their patches of claim 2, wherein in the vulnerability related characteristic generating module,
generating vulnerability identifier characteristics: for each vulnerability-submission pair, based on the reference links extracted from the vulnerability database NVD, using the regular expressions ". jira./(. a)", ". bugzilla.,? id ═ ma? id ═ and ". bugs. eclipse.? to extract existing defect IDs, and, in combination with the corresponding vulnerability ID, judging whether the submission text contains the related defect ID and vulnerability ID; if so, generating the vulnerability identifier characteristics, which comprise: a defect ID feature and a vulnerability ID feature; if not, the vulnerability-submission pair has no vulnerability identifier characteristics;
generating vulnerability location characteristics: for each vulnerability-submission pair, extracting a file location where the vulnerability occurs using regular expressions "(. eras.c)" and "(. eras.cpp)" based on the vulnerability description extracted from the vulnerability database NVD, and then performing named entity identification on the vulnerability description to extract a function location where the vulnerability occurs; the extraction process of the function position is as follows: converting the vulnerability description into an input sequence of a Bert pre-training model and inputting the input sequence into the Bert pre-training model for calculation to obtain an expression vector; inputting the expression vector into a Bi-LSTM layer of the named entity recognition model for calculation to obtain text features; then, the text features are used as the input of a CRF layer of a named entity recognition model, text marking is carried out, a function position entity of the vulnerability is obtained, and the function position entity is used as the function position of the vulnerability;
comparing the file position and the function position extracted from the vulnerability description with the file position and the function position directly extracted from the submitted code change, judging whether the file position and the function position extracted from the vulnerability description are changed or not, and if the file position and the function position are changed, generating vulnerability position characteristics, wherein the vulnerability position characteristics comprise: file location features and function location features; if not, the vulnerability has no vulnerability position characteristics;
generating vulnerability type characteristics: for each vulnerability-submission pair, based on the submitted code change, for the .c (C project)/.cpp (C++ project) files among the changed files, performing function-level vulnerability detection based on a graph convolutional neural network on the pre-change code and the post-change code of each changed file; if a certain type of vulnerability is detected in the pre-change code and is not detected in the post-change code, judging that the code change repairs the vulnerability, and obtaining the vulnerability type corresponding to the vulnerability, wherein the specific process is as follows:
for each affected .c (C project)/.cpp (C++ project) file, first dividing the pre-change code of the file into n functions, extracting the abstract syntax tree AST, the control flow graph CFG and the data flow graph DFG of each function to represent the syntactic and semantic information of the function, encoding the source code of the function into a vector, and inputting the vector into a pre-trained graph convolutional neural network model for vulnerability detection; if the model output indicates that the function is vulnerable, extracting the vulnerability type of the vulnerability; if the model output indicates that the function has no vulnerability, detecting the next function; after all functions in the file have been detected, obtaining a vulnerability dictionary of the file formed by function names and vulnerability types;
then dividing the changed codes of the file into n functions, and executing the operation process to obtain a vulnerability dictionary formed by the function name and the vulnerability type of the file;
comparing the vulnerability dictionaries of the file before and after the change; if the two vulnerability dictionaries do not differ, judging that the code change does not repair a vulnerability; if a function and its vulnerability present in the vulnerability dictionary before the change have disappeared from the vulnerability dictionary after the change, judging that the code change repairs the vulnerability, and obtaining the vulnerability type of the repaired vulnerability; comparing this vulnerability type with the vulnerability type extracted from the vulnerability database NVD, and generating the vulnerability type characteristics;
generating vulnerability text similarity characteristics: for each vulnerability-submission pair, based on the vulnerability description extracted from the vulnerability database NVD, deleting the text corresponding to the function location entity extracted by named entity recognition when generating the vulnerability location characteristics, then tokenizing the text, combining the resulting tokens into a text sequence, and performing stop word removal, lemmatization and synonym replacement on the text sequence to obtain the text sequence N_Sequence;
based on the vulnerability description extracted from the vulnerability database X-Force and on the submission text of the submission, respectively executing the following operations:
tokenizing the text, combining the resulting tokens into a text sequence, and then performing stop word removal, lemmatization and synonym replacement on the text sequence;
finally obtaining the text sequence X_Sequence and the text sequence C_Sequence, respectively;
calculating the importance of each word in the text sequence N_Text by using the TF-IDF algorithm, quantifying that importance, and generating a vector N_Vector corresponding to the text sequence N_Text, wherein the calculation process is as follows:
calculating the term frequency TF of the word w in the text sequence t:
$$\mathrm{TF} = \frac{w\_num}{t\_num}$$
wherein w_num is the number of times the word w appears in the text sequence t, and t_num is the total number of words in the text sequence t;
calculating the inverse document frequency IDF of the word w in the text sequence set d:
$$\mathrm{IDF} = \log\frac{|d|}{|\{t \in d : w \in t\}|}$$
wherein |d| represents the total number of text sequences in the text sequence set d, and |{t ∈ d : w ∈ t}| represents the number of text sequences containing the word w;
calculating the term frequency-inverse document frequency TF-IDF of the word w: TF-IDF = TF × IDF;
vectorizing the text sequence N_Text into the vector N_Vector according to the TF-IDF value of each word;
obtaining a vector X_Vector of the text sequence X_Text and a vector C_Vector of the text sequence C_Text by the same method;
finally, calculating the text similarity N_Similarity between the text sequence N_Text and the text sequence C_Text by using the cosine similarity:
$$N\_Similarity = \frac{N\_Vector \cdot C\_Vector}{\lVert N\_Vector \rVert \, \lVert C\_Vector \rVert}$$
calculating the text similarity X_Similarity between the text sequence X_Text and the text sequence C_Text by the same formula, and generating the vulnerability text similarity characteristics from the values of N_Similarity and X_Similarity, wherein the vulnerability text similarity characteristics comprise: the similarity characteristic between the vulnerability description in the vulnerability database NVD and the submission text, and the similarity characteristic between the vulnerability description in the vulnerability database X-Force and the submission text;
generating vulnerability repair possibility characteristics: for each vulnerability-submission pair, judging whether the code change repairs the vulnerability or not according to a vulnerability detection result when generating vulnerability type characteristics based on the submitted code change, if the code change repairs the vulnerability, generating vulnerability repair possibility characteristics, and if the code change does not repair the vulnerability, judging that the vulnerability-submission pair does not have the vulnerability repair possibility characteristics;
generating a date difference characteristic: for each vulnerability-submission pair, calculating the difference between the vulnerability disclosure time extracted from the vulnerability database NVD and the submission time of the submission:
Date_Diff = N_Date - C_Date
wherein Date_Diff represents the difference between the two, N_Date represents the vulnerability disclosure time, and C_Date represents the submission time;
determining the interval in which the difference falls according to the investigation result, and generating the date difference characteristic as the proportion of the number of vulnerability-patch pairs in that interval to the number of all investigated objects in the investigation result; a vulnerability-patch pair consists of a vulnerability and the patch corresponding to it, and the investigated objects are vulnerability-patch pairs.
4. The ranking-based system for recovering links between C/C++ vulnerabilities and their patches of claim 3, wherein in the ranking module,
for a vulnerability whose patch is known, extracting any two related vulnerability-submission pairs <i, j>, and calculating the true probability $\bar{P}_{ij}$ that the vulnerability-submission pair i is ranked before the vulnerability-submission pair j:
$$\bar{P}_{ij} = \frac{1}{2}\left(1 + S_{ij}\right)$$
wherein $S_{ij}$ is 1 if pair i should be ranked before pair j, -1 if pair j should be ranked before pair i, and 0 if the two pairs are equally relevant;
Then, according to the vulnerability related characteristics, expressing the two vulnerability-submission pairs <i, j> as feature vectors and inputting them into a scoring function S:
$$S = \sum_{n=1}^{k} W_n F_n$$
wherein k is the total number of features; $F_n$ is the score of the nth feature of the vulnerability-submission pair; $W_n$ is the scoring coefficient of the nth feature of the vulnerability-submission pair, generated by the ranking model RankNet;
calculating the scores $S_i$ and $S_j$ of the two vulnerability-submission pairs <i, j> by the scoring function S, and calculating the predicted probability $P_{ij}$ that the vulnerability-submission pair i is ranked before the vulnerability-submission pair j by using the sigmoid function:
$$P_{ij} = \frac{1}{1 + e^{-(S_i - S_j)}}$$
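Using the cross entropy C as the loss function of the model:
$$C = -\bar{P}_{ij}\log P_{ij} - \left(1 - \bar{P}_{ij}\right)\log\left(1 - P_{ij}\right)$$
wherein $\bar{P}_{ij}$ denotes the true probability and $P_{ij}$ the predicted probability that the vulnerability-submission pair i is ranked before the vulnerability-submission pair j;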
then training the model to minimize the loss function;
obtaining the trained ranking model RankNet through the above steps;
inputting the vulnerability-submission pairs related to a vulnerability whose patch is unknown into the trained ranking model RankNet, calculating the score of each of these vulnerability-submission pairs, and ranking them by score to obtain ranked vulnerability-submission pairs.
5. A ranking-based method for recovering links between C/C++ vulnerabilities and their patches, comprising the following steps:
step 1, extracting basic data of vulnerabilities related to open source C/C++ projects from the vulnerability database NVD and the vulnerability database X-Force, extracting basic data of related submissions from the software repository of the project corresponding to each vulnerability, and establishing vulnerability-submission pairs from the basic data of the vulnerabilities and the basic data of the related submissions;
step 2, for each vulnerability-submission pair, generating vulnerability related characteristics through the basic data of the vulnerability and the related submitted basic data;
step 3, training a neural-network-based ranking model RankNet on the vulnerability related characteristics to obtain a trained ranking model RankNet, inputting the vulnerability-submission pairs related to a vulnerability whose patch is unknown into the trained ranking model RankNet, calculating the score of each of these vulnerability-submission pairs with the ranking model RankNet, and ranking them by score to obtain ranked vulnerability-submission pairs;
step 4, manually checking the ranked vulnerability-submission pairs in ranking order and judging whether the submission in each vulnerability-submission pair is a patch submission; if so, the vulnerability has been repaired, and the link between the vulnerability and its corresponding patch submission is recovered; if not, the vulnerability has not been repaired by that submission, and no link is recovered between the vulnerability and the submission.
6. The ranking-based method for recovering links between C/C++ vulnerabilities and their patches according to claim 5, wherein in step 1,
screening vulnerabilities related to C/C++ projects according to whether the vulnerability description mentions a .c/.cpp file, and obtaining, for the related vulnerability entries, the vulnerability ID, the disclosure date, the vulnerability description corresponding to the ID, the reference links, the software version configurations affected by the vulnerability, and the vulnerability type;
for a vulnerability database X-Force, obtaining the description of a relevant vulnerability;
for the software repository of the project corresponding to the vulnerability, screening submissions according to whether a .c/.cpp file is modified in the submission and whether the submission was released after the affected version, obtaining the code change, submission date, submission title and submission message of each submission that meets the screening conditions, and concatenating the submission title and the submission message into a submission text;
for a given vulnerability entry, taking the data extracted from the vulnerability database NVD and the vulnerability database X-Force as vulnerability data, taking the data extracted from the n qualifying submissions in the software repository of the project corresponding to the vulnerability as submission data, and merging the vulnerability data with the submission data of each of the n submissions in turn, finally obtaining n vulnerability-submission pairs containing vulnerability information and submission information.
7. The ranking-based method for recovering links between C/C++ vulnerabilities and their patches according to claim 6, wherein in step 2,
generating vulnerability identifier characteristics: for each vulnerability-submission pair, based on the reference links extracted from the vulnerability database NVD, using the regular expressions ". jira./(. a)", ". bugzilla.,? id ═ ma? id ═ and ". bugs. eclipse.? to extract existing defect IDs, and, in combination with the corresponding vulnerability ID, judging whether the submission text contains the related defect ID and vulnerability ID; if so, generating the vulnerability identifier characteristics, which comprise: a defect ID feature and a vulnerability ID feature; if not, the vulnerability-submission pair has no vulnerability identifier characteristics;
generating vulnerability location characteristics: for each vulnerability-submission pair, extracting a file location where the vulnerability occurs using regular expressions "(. eras.c)" and "(. eras.cpp)" based on the vulnerability description extracted from the vulnerability database NVD, and then performing named entity identification on the vulnerability description to extract a function location where the vulnerability occurs; the extraction process of the function position is as follows: converting the vulnerability description into an input sequence of a Bert pre-training model and inputting the input sequence into the Bert pre-training model for calculation to obtain an expression vector; inputting the expression vector into a Bi-LSTM layer of the named entity recognition model for calculation to obtain text features; then, the text features are used as the input of a CRF layer of a named entity recognition model, text marking is carried out, a function position entity of the vulnerability is obtained, and the function position entity is used as the function position of the vulnerability;
comparing the file position and the function position extracted from the vulnerability description with the file position and the function position directly extracted from the submitted code change, judging whether the file position and the function position extracted from the vulnerability description are changed or not, and if the file position and the function position are changed, generating vulnerability position characteristics, wherein the vulnerability position characteristics comprise: file location features and function location features; if not, the vulnerability has no vulnerability position characteristics;
generating vulnerability type characteristics: for each vulnerability-submission pair, based on the submitted code change, for the .c (C project)/.cpp (C++ project) files among the changed files, performing function-level vulnerability detection based on a graph convolutional neural network on the pre-change code and the post-change code of each changed file; if a certain type of vulnerability is detected in the pre-change code and is not detected in the post-change code, judging that the code change repairs the vulnerability, and obtaining the vulnerability type corresponding to the vulnerability, wherein the specific process is as follows:
for each affected .c (C project)/.cpp (C++ project) file, first dividing the pre-change code of the file into n functions, extracting the abstract syntax tree AST, the control flow graph CFG and the data flow graph DFG of each function to represent the syntactic and semantic information of the function, encoding the source code of the function into a vector, and inputting the vector into a pre-trained graph convolutional neural network model for vulnerability detection; if the model output indicates that the function is vulnerable, extracting the vulnerability type of the vulnerability; if the model output indicates that the function has no vulnerability, detecting the next function; after all functions in the file have been detected, obtaining a vulnerability dictionary of the file formed by function names and vulnerability types;
then dividing the changed codes of the file into n functions, and executing the operation process to obtain a vulnerability dictionary formed by the function name and the vulnerability type of the file;
comparing the vulnerability dictionaries of the file before and after the change; if the two vulnerability dictionaries do not differ, judging that the code change does not repair a vulnerability; if a function and its vulnerability present in the vulnerability dictionary before the change have disappeared from the vulnerability dictionary after the change, judging that the code change repairs the vulnerability, and obtaining the vulnerability type of the repaired vulnerability; comparing this vulnerability type with the vulnerability type extracted from the vulnerability database NVD, and generating the vulnerability type characteristics;
generating vulnerability text similarity characteristics: for each vulnerability-submission pair, based on the vulnerability description extracted from the vulnerability database NVD, deleting the text corresponding to the function location entity extracted by named entity recognition when generating the vulnerability location characteristics, then tokenizing the text, combining the resulting tokens into a text sequence, and performing stop word removal, lemmatization and synonym replacement on the text sequence to obtain the text sequence N_Sequence;
based on the vulnerability description extracted from the vulnerability database X-Force and on the submission text of the submission, respectively executing the following operations:
tokenizing the text, combining the resulting tokens into a text sequence, and then performing stop word removal, lemmatization and synonym replacement on the text sequence;
finally obtaining the text sequence X_Sequence and the text sequence C_Sequence, respectively;
calculating the importance of each word in the text sequence N_Text by using the TF-IDF algorithm, quantifying that importance, and generating a vector N_Vector corresponding to the text sequence N_Text, wherein the calculation process is as follows:
calculating the term frequency TF of the word w in the text sequence t:
$$\mathrm{TF} = \frac{w\_num}{t\_num}$$
wherein w_num is the number of times the word w appears in the text sequence t, and t_num is the total number of words in the text sequence t;
calculating the inverse document frequency IDF of the word w in the text sequence set d:
$$\mathrm{IDF} = \log\frac{|d|}{|\{t \in d : w \in t\}|}$$
wherein |d| represents the total number of text sequences in the text sequence set d, and |{t ∈ d : w ∈ t}| represents the number of text sequences containing the word w;
calculating the term frequency-inverse document frequency TF-IDF of the word w: TF-IDF = TF × IDF;
vectorizing the text sequence N_Text into the vector N_Vector according to the TF-IDF value of each word;
obtaining a vector X_Vector of the text sequence X_Text and a vector C_Vector of the text sequence C_Text by the same method;
finally, calculating the text similarity N_Similarity between the text sequence N_Text and the text sequence C_Text by using the cosine similarity:
$$N\_Similarity = \frac{N\_Vector \cdot C\_Vector}{\lVert N\_Vector \rVert \, \lVert C\_Vector \rVert}$$
calculating the text similarity X_Similarity between the text sequence X_Text and the text sequence C_Text by the same formula, and generating the vulnerability text similarity characteristics from the values of N_Similarity and X_Similarity, wherein the vulnerability text similarity characteristics comprise: the similarity characteristic between the vulnerability description in the vulnerability database NVD and the submission text, and the similarity characteristic between the vulnerability description in the vulnerability database X-Force and the submission text;
generating vulnerability repair possibility characteristics: for each vulnerability-submission pair, judging whether the code change repairs the vulnerability or not according to a vulnerability detection result when generating vulnerability type characteristics based on the submitted code change, if the code change repairs the vulnerability, generating vulnerability repair possibility characteristics, and if the code change does not repair the vulnerability, judging that the vulnerability-submission pair does not have the vulnerability repair possibility characteristics;
generating a date difference characteristic: for each vulnerability-submission pair, calculating the difference between the vulnerability disclosure time extracted from the vulnerability database NVD and the time of the submission:
Date_Diff=N_Date-C_Date
wherein Date_Diff represents the difference between the two, N_Date represents the vulnerability disclosure time, and C_Date represents the submission time;
determining the interval in which the difference falls according to a survey result, and generating the date difference characteristic as the ratio of the number of vulnerability-patch pairs falling in that interval in the survey result to the total number of surveyed objects; a vulnerability-patch pair is a vulnerability together with its corresponding patch, and the surveyed objects are vulnerability-patch pairs.
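The date difference characteristic can be illustrated with a short Python sketch. The interval boundaries and proportions in SURVEY_BINS are invented placeholders, since the source does not give the survey result, and the function name and the use of whole days as the unit are likewise assumptions.

```python
from datetime import date

# Hypothetical survey result: (lower bound, upper bound, proportion of known
# vulnerability-patch pairs whose Date_Diff in days falls in [lower, upper)).
SURVEY_BINS = [
    (float("-inf"), 0, 0.10),  # submission made after disclosure
    (0, 30, 0.50),
    (30, 90, 0.25),
    (90, float("inf"), 0.15),
]


def date_diff_feature(n_date, c_date, bins=SURVEY_BINS):
    """Return (Date_Diff in days, proportion of the interval it falls in)."""
    date_diff = (n_date - c_date).days  # Date_Diff = N_Date - C_Date
    for lower, upper, proportion in bins:
        if lower <= date_diff < upper:
            return date_diff, proportion
    return date_diff, 0.0


# A submission made 14 days before disclosure falls in the [0, 30) interval.
example = date_diff_feature(date(2021, 3, 15), date(2021, 3, 1))
```

Under these placeholder bins, a vulnerability-submission pair receives a higher feature value when its date difference falls in an interval that is common among known vulnerability-patch pairs.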
8. The method for C/C++ vulnerability and inter-patch link recovery based on ordering according to claim 7, wherein in step 3,
for a vulnerability whose patch is known, extracting any two related vulnerability-submission pairs <i, j>, and calculating the true probability that vulnerability-submission pair i is ranked before vulnerability-submission pair j:
Figure FDA0003420270540000092
Figure FDA0003420270540000093
then, according to the vulnerability-related characteristics, expressing the two vulnerability-submission pairs <i, j> as feature vectors and inputting them into a scoring function S:
$$S = \sum_{n=1}^{k} W_n F_n$$
wherein k is the total number of features; F_n is the n-th feature score of a vulnerability-submission pair; W_n is the scoring coefficient of the n-th feature of the vulnerability-submission pair, generated by the ranking model RankNet;
computing, with the scoring function S, the scores <S_i, S_j> of the two vulnerability-submission pairs <i, j>, and calculating, with a sigmoid function, the predicted probability P_ij that vulnerability-submission pair i is ranked before vulnerability-submission pair j:
$$P_{ij} = \frac{1}{1 + e^{-(S_i - S_j)}}$$
Using the cross entropy C as a loss function of the model:
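$$C = -\bar{P}_{ij}\log P_{ij} - (1 - \bar{P}_{ij})\log(1 - P_{ij})$$
wherein \bar{P}_{ij} is the true probability and P_{ij} is the predicted probability that vulnerability-submission pair i is ranked before vulnerability-submission pair j;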
then training a model to minimize a loss function;
obtaining a trained ranking model RankNet through the steps;
inputting the vulnerability-submission pairs related to a vulnerability whose patch is unknown into the trained ranking model RankNet, calculating the score of each of these vulnerability-submission pairs, and sorting them by score to obtain ranked vulnerability-submission pairs.
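The pairwise training and ranking procedure of claim 8 can be sketched in plain Python as below. It is a minimal illustration under several assumptions: the scoring function is the linear weighted sum described above, the true probability is taken as 1.0 when pair i is the real patch and 0.0 when pair j is (the source formula for the true probability is not reproduced here), and the feature values, learning rate, epoch count, and all names are hypothetical.

```python
import math


def score(weights, features):
    """Scoring function S: weighted sum of the k feature scores."""
    return sum(w * f for w, f in zip(weights, features))


def predicted_probability(s_i, s_j):
    """Sigmoid of the score difference: probability that i ranks before j."""
    return 1.0 / (1.0 + math.exp(-(s_i - s_j)))


def cross_entropy(p_true, p_pred, eps=1e-12):
    """Cross entropy loss C between true and predicted probabilities."""
    p_pred = min(max(p_pred, eps), 1.0 - eps)
    return -p_true * math.log(p_pred) - (1.0 - p_true) * math.log(1.0 - p_pred)


def train(pairs, n_features, lr=0.5, epochs=200):
    """Minimize the cross entropy over pairs by stochastic gradient descent."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for f_i, f_j, p_true in pairs:
            p_pred = predicted_probability(score(weights, f_i), score(weights, f_j))
            grad = p_pred - p_true  # dC/d(S_i - S_j)
            for n in range(n_features):
                weights[n] -= lr * grad * (f_i[n] - f_j[n])
    return weights


# Hypothetical features: [text similarity, repair possibility, date difference]
training_pairs = [
    ([0.90, 1.0, 0.50], [0.20, 0.0, 0.10], 1.0),  # i is the real patch
    ([0.10, 0.0, 0.15], [0.80, 1.0, 0.50], 0.0),  # j is the real patch
]
weights = train(training_pairs, n_features=3)

# Rank candidate submissions for a vulnerability whose patch is unknown.
candidates = {
    "commit_a": [0.85, 1.0, 0.50],
    "commit_b": [0.30, 0.0, 0.25],
    "commit_c": [0.10, 0.0, 0.10],
}
ranked = sorted(candidates.items(), key=lambda kv: score(weights, kv[1]), reverse=True)
```

The ranked list would then be handed to a reviewer for manual verification, starting from the highest-scoring vulnerability-submission pair.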
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method according to any of claims 5 to 8 are implemented by the processor when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 5 to 8.
CN202111560155.7A 2021-12-20 2021-12-20 C/C++ vulnerability based on sequencing and inter-patch link recovery system and method thereof Pending CN114329482A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111560155.7A CN114329482A (en) 2021-12-20 2021-12-20 C/C++ vulnerability based on sequencing and inter-patch link recovery system and method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111560155.7A CN114329482A (en) 2021-12-20 2021-12-20 C/C++ vulnerability based on sequencing and inter-patch link recovery system and method thereof

Publications (1)

Publication Number Publication Date
CN114329482A true CN114329482A (en) 2022-04-12

Family

ID=81052159

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111560155.7A Pending CN114329482A (en) 2021-12-20 2021-12-20 C/C++ vulnerability based on sequencing and inter-patch link recovery system and method thereof

Country Status (1)

Country Link
CN (1) CN114329482A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115563619A (en) * 2022-09-27 2023-01-03 北京墨云科技有限公司 Vulnerability similarity comparison method and system based on text pre-training model
CN117056940A (en) * 2023-10-12 2023-11-14 中关村科学城城市大脑股份有限公司 Method, device, electronic equipment and medium for repairing loopholes of server system
CN117056940B (en) * 2023-10-12 2024-01-16 中关村科学城城市大脑股份有限公司 Method, device, electronic equipment and medium for repairing loopholes of server system
CN117235744A (en) * 2023-11-14 2023-12-15 中关村科学城城市大脑股份有限公司 Source file online method, device, electronic equipment and computer readable medium
CN117235744B (en) * 2023-11-14 2024-02-02 中关村科学城城市大脑股份有限公司 Source file online method, device, electronic equipment and computer readable medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination