CN116561332A

CN116561332A - Software vulnerability feature knowledge extraction method and system based on N-element phrase similarity

Info

Publication number: CN116561332A
Application number: CN202211605951.2A
Authority: CN
Inventors: 吴潇雪; 郑炜; 郑彬; 刘新岩; 高亿人; 单文婧; 薄莉莉; 孙小兵
Original assignee: Yangzhou University
Current assignee: Yangzhou University
Priority date: 2022-12-12
Filing date: 2022-12-12
Publication date: 2023-08-08

Abstract

The invention discloses a software vulnerability feature knowledge extraction method and system based on N-element phrase similarity. First, vulnerability description text is collected from a vulnerability database and cleaned to generate vulnerability key feature description text; N-element candidate keywords are extracted, and vulnerability feature description candidate keywords are generated using the semantic similarity between phrases and texts. Second, keyword entities are extracted: a text similarity model is constructed, the similarity between the text after a MASK operation on each candidate keyword and the standard vulnerability description text is calculated, the candidates are ranked by similarity, and vulnerability feature description entity keywords are generated. Vulnerability feature entity relationships are then defined, and triples representing these relationships are generated. Finally, a knowledge graph is constructed from the triples and used to analyze the vulnerability text to be analyzed. The method and system help analyze vulnerability features and improve the security of software systems.

Description

Software vulnerability feature knowledge extraction method and system based on N-element phrase similarity

Technical Field

The invention relates to the field of software security, in particular to a software vulnerability feature knowledge extraction method and system based on N-element phrase similarity.

Background

Vulnerability research is critical to software security, and vulnerability feature research is an important component of it. Existing deep learning models depend on large numbers of labeled vulnerability samples, which are difficult to obtain in actual development; manual labeling consumes substantial resources, and its correctness remains uncertain. Small-sample data sets, in turn, are not large enough to train a vulnerability classification model with high classification accuracy. As a result, vulnerability classification models trained on small, manually labeled data sets have low classification accuracy and are difficult to apply to vulnerability classification tasks in real environments.

At present, some work analyzes vulnerabilities using the software security vulnerability list database CWE and the vulnerability disclosure database CVE, but the rich textual semantic information in CWE and CVE is not fully used. Even when text information is used, entity recognition relies on a training set produced by manually labeling the data set, followed mainly by deep learning methods, such as the classical BiLSTM-CRF method in "Neural Architectures for Named Entity Recognition" and the IDCNN-CRF method proposed in "Fast and Accurate Entity Recognition with Iterated Dilated Convolutions". These approaches rely heavily on prior expert experience and still require a significant amount of manual work and time.

Disclosure of Invention

Purpose of the invention: the invention aims to provide a software vulnerability feature knowledge extraction method and system based on N-element phrase similarity, which improve entity keyword extraction efficiency and rapidly locate vulnerability feature knowledge.

The technical scheme is as follows: the invention provides a software vulnerability characteristic knowledge extraction method based on N-element phrase similarity, which comprises the following steps:

1) Collecting vulnerability description text in a vulnerability database, preprocessing, and generating a standard vulnerability description text;

2) Extracting N-element candidate keywords from the standard vulnerability description text, and generating a vulnerability feature description candidate keyword set by using the semantic similarity of the phrase and the text;

3) After MASK operation is carried out on candidate keywords in the vulnerability feature description candidate keyword set, calculating similarity between the candidate keywords and the standard vulnerability description text, sequencing the similarity from high to low, generating entity keywords, and forming triples by the entity keywords according to categories;

4) Formatting the triples and constructing a knowledge graph; and extracting knowledge from the vulnerability text to be classified to form a vulnerability triplet, and analyzing and matching the vulnerability text to be classified according to the vulnerability triplet and the knowledge graph to obtain characteristic keywords of the vulnerability text to be classified.

Further, step 1) includes the steps of:

1.1 Collecting vulnerability description texts of a software security vulnerability list database CWE and a vulnerability disclosure database CVE, wherein the vulnerability description texts comprise titles, descriptions and expansion description fields of the texts;

1.2 Preprocessing the collected vulnerability description text, wherein the preprocessing comprises text punctuation removal, stop word removal, word segmentation, part-of-speech tagging and morphological reduction, and generating a standard vulnerability description text without interference information.
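The preprocessing in step 1.2 can be illustrated with a minimal Python sketch. It is not the patented implementation: the stop-word list `STOP_WORDS` and the suffix-stripping helper `naive_lemma` are hypothetical stand-ins for a full stop-word list and a real lemmatizer, and part-of-speech tagging is omitted because it would need an external NLP library.

```python
import re

# Hypothetical, deliberately small stop-word list (assumption, not from the source).
STOP_WORDS = {"a", "an", "the", "of", "to", "in", "is", "are", "that",
              "and", "or", "by", "on", "for", "with", "it", "this"}

def naive_lemma(token):
    """Crude suffix stripping, used only as a stand-in for real lemmatization."""
    if token.endswith("ies") and len(token) > 4:
        return token[:-3] + "y"          # copies -> copy
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]                # buffers -> buffer
    return token

def preprocess(text):
    """Lowercase, remove punctuation, tokenize, drop stop words, reduce word forms."""
    text = re.sub(r"[^\w\s]", " ", text.lower())   # punctuation removal
    tokens = text.split()                          # whitespace word segmentation
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [naive_lemma(t) for t in tokens]

print(preprocess("The software copies buffers, causing overflows."))
```

The resulting token list plays the role of the "standard vulnerability description text" that the later steps consume.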

Further, in the step 2), the following steps are included:

2.1 Performing word segmentation on the standard vulnerability description texts, and generating for each text a phrase set $A_1 = \{w_1, w_2, w_3, w_4, \ldots, w_n\}$ with N=1 and a phrase set $A_2 = \{w_1 w_2, w_2 w_3, w_3 w_4, w_4 w_5, \ldots, w_{n-1} w_n\}$ with N=2;

2.2 Using the pre-trained model BERT to semantically encode the phrase sets $A_1$ and $A_2$ and the texts to which the phrase sets belong, and obtaining the phrase representation of each phrase and the representation of the text to which it belongs by max pooling (MaxPooling);

2.3 Traversing the phrase sets $A_1$ and $A_2$, and calculating the cosine similarity between the phrase representation of each phrase and the representation of the text to which the phrase belongs, with the calculation formula:

$$\cos(w, T) = \frac{w \cdot T}{\lVert w \rVert \, \lVert T \rVert}$$

where $w$ is a phrase representation from phrase set $A_1$ or $A_2$, and $T$ is the representation of the text to which the phrase belongs;

calculating the similarity score between each phrase and its text, sorting the scores from high to low, and taking the first k phrases to form a set, which is the candidate keyword set $B = \{wo_1, wo_2, wo_3, wo_4, \ldots, wo_{20}\}$.
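Steps 2.1 to 2.3 can be sketched as follows. This is an illustration under stated assumptions rather than the patented implementation: `toy_token_vector` is a hypothetical stand-in that maps each token to a fixed pseudo-random vector, where the method itself uses BERT encodings; the max pooling, cosine similarity, and top-k selection follow the description above, with k = 20 matching the size of set B.

```python
import hashlib
import math

def toy_token_vector(token, dim=8):
    """Hypothetical stand-in for a BERT token embedding (assumption, not BERT)."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return [byte / 255.0 - 0.5 for byte in digest[:dim]]

def max_pool(vectors):
    """Element-wise maximum over a list of equal-length vectors."""
    return [max(column) for column in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def ngrams(tokens, n):
    """All contiguous n-token phrases, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def candidate_keywords(tokens, k=20):
    """Rank unigrams (A1) and bigrams (A2) by similarity to the whole text."""
    text_vec = max_pool([toy_token_vector(t) for t in tokens])
    # dict.fromkeys removes duplicate phrases while keeping their order.
    phrases = list(dict.fromkeys(ngrams(tokens, 1) + ngrams(tokens, 2)))
    scored = []
    for phrase in phrases:
        phrase_vec = max_pool([toy_token_vector(t) for t in phrase])
        scored.append((cosine(phrase_vec, text_vec), " ".join(phrase)))
    scored.sort(key=lambda item: item[0], reverse=True)  # high to low
    return [phrase for _, phrase in scored[:k]]

print(candidate_keywords(["heap", "buffer", "overflow", "memory"], k=3))
```

With a real encoder, only `toy_token_vector` would need to be replaced by contextual token embeddings; the pooling and ranking logic would stay the same.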

Further, step 3) includes the following steps:

3.1 After performing a MASK operation on the candidate keywords in the vulnerability feature description candidate keyword set, calculating the similarity between the candidate keywords and the standard vulnerability description text, wherein the MASK operation formula is as follows:

where $T_{mask}$ is the text representation after the MASK operation on the candidate keyword, and $T_c$ is the text to which the candidate keyword belongs; $T_{mask_i}$ is the i-th text representation, and $T_{c_i}$ is the text to which the i-th candidate keyword belongs;

3.2 Obtaining similarity scores of text representations subjected to MASK operation of all candidate keywords and text representations to which the candidate keywords belong, sorting according to the similarity scores, and taking a set formed by the first k (k=6) phrases as an entity keyword set;

3.3 Classifying each entity keyword in the entity keyword set;

3.4 Defining relationships between entities according to categories of the entities;

3.5 Each entity keyword is formed into (entity, relation, entity) triples according to the entity category and relation.
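Steps 3.1 to 3.5 can be sketched in Python as below. Several parts are assumptions made only for illustration: `toy_token_vector` replaces the BERT encoder, cosine similarity is used as the similarity measure between the masked text and the original text, and the category and relation names passed to `build_triples` are hypothetical, since the actual categories and relations are those of Tables 1 and 2. The ranking is from high to low with k = 6, as stated in step 3.2.

```python
import hashlib
import math

def toy_token_vector(token, dim=8):
    """Hypothetical stand-in for a BERT token embedding (assumption)."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return [byte / 255.0 - 0.5 for byte in digest[:dim]]

def encode(tokens):
    """Max-pooled text representation built from the toy token vectors."""
    return [max(col) for col in zip(*(toy_token_vector(t) for t in tokens))]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mask_phrase(tokens, phrase):
    """Replace each occurrence of the phrase with [MASK] tokens, one per word."""
    n = len(phrase)
    if n == 0:
        return list(tokens)
    out, i = [], 0
    while i < len(tokens):
        if tokens[i:i + n] == phrase:
            out.extend(["[MASK]"] * n)
            i += n
        else:
            out.append(tokens[i])
            i += 1
    return out

def rank_entities(tokens, candidates, k=6):
    """Score each candidate by similarity of its masked text to the original text."""
    original = encode(tokens)
    scored = []
    for cand in candidates:
        masked = encode(mask_phrase(tokens, cand.split()))
        scored.append((cosine(masked, original), cand))
    scored.sort(key=lambda item: item[0], reverse=True)  # high to low, per step 3.2
    return [cand for _, cand in scored[:k]]

def build_triples(entity_category, relation_schema):
    """Form (entity, relation, entity) triples from categories and a relation schema."""
    triples = []
    for head, head_cat in entity_category.items():
        for tail, tail_cat in entity_category.items():
            relation = relation_schema.get((head_cat, tail_cat))
            if head != tail and relation:
                triples.append((head, relation, tail))
    return triples

tokens = ["heap", "buffer", "overflow", "corrupt", "memory"]
print(rank_entities(tokens, ["heap", "buffer overflow", "memory"], k=2))
# Hypothetical categories and relation, for illustration only.
print(build_triples({"buffer overflow": "weakness", "memory corruption": "consequence"},
                    {("weakness", "consequence"): "causes"}))
```

Keeping the relation schema as a lookup keyed by category pairs mirrors steps 3.3 to 3.5: classification first, then relations defined per category pair, then triple formation.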

Further, in step 4), the method comprises the following steps:

4.1 Performing format conversion on the formed (entity, relation, entity) triples, and constructing a knowledge graph using Neo4j;

4.2 Extracting knowledge from the vulnerability text to be classified to form a vulnerability triplet;

4.3 Analyzing and matching the vulnerability text to be classified according to the vulnerability triples and the knowledge graph, to obtain the feature keywords of the vulnerability text to be classified.
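The format conversion in step 4.1 could, for example, turn each triple into a parameterized Cypher statement for Neo4j. The sketch below is an assumption about one reasonable conversion, not the format used by the inventors: the node label `Entity`, the property `name`, and the helper names are hypothetical. Because Cypher does not allow relationship types to be passed as query parameters, the relation is sanitized and written into the query text, while entity names are passed as parameters.

```python
import re

def relation_type(name):
    """Convert a relation label into a safe Cypher relationship type."""
    cleaned = re.sub(r"[^A-Za-z0-9_]+", "_", name.strip()).strip("_").upper()
    if not cleaned or cleaned[0].isdigit():
        raise ValueError(f"cannot derive a relationship type from {name!r}")
    return cleaned

def triple_to_cypher(head, relation, tail):
    """Return a (query, parameters) pair that merges one triple into the graph."""
    query = (
        "MERGE (h:Entity {name: $head}) "
        "MERGE (t:Entity {name: $tail}) "
        f"MERGE (h)-[:{relation_type(relation)}]->(t)"
    )
    return query, {"head": head, "tail": tail}

query, params = triple_to_cypher("buffer overflow", "leads to", "memory corruption")
print(query)
print(params)
```

Using `MERGE` rather than `CREATE` avoids duplicate nodes when the same entity appears in several triples. Each pair could then be passed to a Neo4j client, for instance as `session.run(query, params)` with the official Python driver.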

The invention correspondingly provides a feature knowledge extraction system based on N-element phrase similarity learning, which comprises a vulnerability data set construction module, a candidate keyword extraction module, a keyword entity extraction module and a vulnerability classification module;

the vulnerability data set construction module is used for collecting vulnerability description texts in the vulnerability database, preprocessing the vulnerability description texts and generating standard vulnerability description texts;

the candidate keyword extraction module is used for extracting N-element candidate keywords from the standard vulnerability description text and generating a vulnerability feature description candidate keyword set by utilizing the semantic similarity of the phrase and the text;

the keyword entity extraction module is used for calculating the similarity between candidate keywords and the standard vulnerability description text after MASK operation of the candidate keywords in the vulnerability feature description candidate keyword set, sequencing the similarity from high to low to generate entity keywords, and forming triples by the entity keywords according to categories;

the vulnerability classification module is used for formatting the triples and constructing a knowledge graph; and extracting knowledge from the vulnerability text to be classified to form a vulnerability triplet, and analyzing and matching the vulnerability text to be classified according to the vulnerability triplet and the knowledge graph to obtain characteristic keywords of the vulnerability text to be classified.

Further, the vulnerability data set construction module comprises an acquisition unit and a preprocessing unit;

the acquisition unit is used for acquiring vulnerability description texts of the software security vulnerability list database CWE and the vulnerability disclosure database CVE, and comprises a title, description and expansion description fields of the texts;

the preprocessing unit is used for preprocessing the collected vulnerability description text, wherein the preprocessing comprises text punctuation removal, stop word removal, word segmentation, part-of-speech tagging and morphological reduction, and standard vulnerability description text without interference information is generated.

Further, the candidate keyword extraction module comprises a word segmentation unit, a coding unit and a traversal calculation unit;

the word segmentation unit is used for performing word segmentation on the standard vulnerability description texts, and generating for each text a phrase set $A_1 = \{w_1, w_2, w_3, w_4, \ldots, w_n\}$ with N=1 and a phrase set $A_2 = \{w_1 w_2, w_2 w_3, w_3 w_4, w_4 w_5, \ldots, w_{n-1} w_n\}$ with N=2;

the coding unit is used for semantically encoding the phrase sets $A_1$ and $A_2$ and the texts to which the phrase sets belong with the pre-trained model BERT, and obtaining the phrase representation of each phrase and the representation of the text to which it belongs by max pooling (MaxPooling);

the traversal calculation unit is used for traversing the phrase sets $A_1$ and $A_2$ and calculating the cosine similarity between the phrase representation of each phrase and the representation of the text to which the phrase belongs, with the calculation formula:

$$\cos(w, T) = \frac{w \cdot T}{\lVert w \rVert \, \lVert T \rVert}$$

Further, the keyword entity extraction module comprises a MASK operation unit, a sorting unit, a classification unit, a definition unit and a triplet unit;

the MASK operation unit is configured to calculate similarity between the candidate keyword and the standard vulnerability description text after performing MASK operation on the candidate keyword in the vulnerability feature description candidate keyword set, where the MASK operation formula is:

where $T_{mask}$ is the text representation after the MASK operation on the candidate keyword, and $T_c$ is the text to which the candidate keyword belongs; $T_{mask_i}$ is the i-th text representation, and $T_{c_i}$ is the text to which the i-th candidate keyword belongs;

the ordering unit is used for obtaining similarity scores of text tokens after all candidate keyword MASK operations and text tokens to which the candidate keywords belong, ordering the text tokens according to the similarity scores, and taking a set formed by the first k (k=6) phrases as an entity keyword set;

the classification unit is used for classifying each entity keyword in the entity keyword set;

the definition unit is used for defining the relation among the entities according to the categories of the entities;

the triplet unit is used for forming (entity, relation, entity) triples according to the entity category and relation of each entity keyword.

Further, the vulnerability classification module comprises a knowledge graph construction unit, a vulnerability triplet unit and an analysis matching unit;

the knowledge graph construction unit is used for performing format conversion on the formed (entity, relation, entity) triples and constructing a knowledge graph using Neo4j;

the vulnerability triplet unit is used for extracting knowledge from the vulnerability text to be classified to form a vulnerability triplet;

and the analysis matching unit is used for carrying out analysis matching on the vulnerability text to be classified according to the vulnerability triples and the knowledge graph to obtain characteristic keywords of the vulnerability text to be classified.
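The text does not spell out how the analysis matching unit compares vulnerability triples with the knowledge graph. The following Python sketch therefore assumes the simplest possible rule, exact string matching on triples and entity names, over an in-memory list of triples instead of a Neo4j query; the function name `match_features` and the sample triples are hypothetical.

```python
def match_features(vuln_triples, kg_triples):
    """Match vulnerability triples against knowledge-graph triples.

    Returns the triples found verbatim in the graph, plus the entities of the
    vulnerability text that also occur in the graph (used as feature keywords).
    """
    kg_set = set(kg_triples)
    kg_entities = {entity for head, _, tail in kg_triples for entity in (head, tail)}

    exact = [triple for triple in vuln_triples if triple in kg_set]

    keywords = []
    for head, _, tail in vuln_triples:
        for entity in (head, tail):
            if entity in kg_entities and entity not in keywords:
                keywords.append(entity)   # keep first-seen order, no duplicates
    return exact, keywords

# Hypothetical sample data, for illustration only.
kg = [("buffer overflow", "causes", "memory corruption"),
      ("sql injection", "causes", "data leak")]
vuln = [("buffer overflow", "causes", "memory corruption"),
        ("buffer overflow", "affects", "kernel driver")]
print(match_features(vuln, kg))
```

A fuzzier rule, such as comparing entity embeddings instead of strings, would fit the same interface but is not described in the text.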

Beneficial effects: compared with the prior art, the invention has the following notable features. A candidate keyword set is obtained through N-element candidate word extraction, and the candidate keywords are reduced to an entity keyword set through the MASK operation, so that entity keywords are located rapidly and entity keyword extraction efficiency is improved. After the entity keywords are classified, the classified entities form triples according to their categories and relations, and a knowledge graph is constructed; the knowledge graph is then used to analyze the vulnerability text to be analyzed and obtain the feature keywords of the vulnerability text to be classified, achieving rapid location of vulnerability features.

Drawings

Fig. 1 is a schematic flow chart in the present invention.

Detailed Description

The present invention will be described in further detail with reference to specific embodiments and drawings.

Example 1

The invention provides a software vulnerability feature knowledge extraction method based on N-element phrase similarity, which is shown in FIG. 1, and comprises the following steps:

1) Collecting vulnerability description text from a vulnerability database and preprocessing it to generate a standard vulnerability description text.

1.1 Collecting vulnerability description texts of a software security vulnerability list database CWE and a vulnerability disclosure database CVE, wherein the vulnerability description texts comprise titles, descriptions and expansion description fields of the texts.

2) Extracting N-element candidate keywords from the standard vulnerability description text, and generating a vulnerability feature description candidate keyword set by using the semantic similarity of the phrase and the text.

2.1 Word segmentation is carried out on the standard vulnerability description texts, and for each text a phrase set $A_1 = \{w_1, w_2, w_3, w_4, \ldots, w_n\}$ with N=1 and a phrase set $A_2 = \{w_1 w_2, w_2 w_3, w_3 w_4, w_4 w_5, \ldots, w_{n-1} w_n\}$ with N=2 are generated.

2.2 The pre-trained model BERT is used to semantically encode the phrase sets $A_1$ and $A_2$ and the texts to which the phrase sets belong, and the phrase representation of each phrase and the representation of the text to which it belongs are obtained by max pooling (MaxPooling).

where $w$ is a phrase representation from phrase set $A_1$ or $A_2$, and $T$ is the representation of the text to which the phrase belongs.
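As a worked illustration of the cosine similarity between a phrase representation $w$ and a text representation $T$, consider two made-up three-dimensional vectors, $w = (1, 2, 2)$ and $T = (2, 1, 2)$. Real BERT representations have far more dimensions; the numbers here are chosen only to keep the arithmetic easy to follow.

```latex
\begin{aligned}
\cos(w, T) &= \frac{w \cdot T}{\lVert w \rVert \, \lVert T \rVert}
            = \frac{1 \cdot 2 + 2 \cdot 1 + 2 \cdot 2}
                   {\sqrt{1^2 + 2^2 + 2^2}\,\sqrt{2^2 + 1^2 + 2^2}} \\
           &= \frac{8}{3 \cdot 3} = \frac{8}{9} \approx 0.889
\end{aligned}
```

A score close to 1 means the phrase points in nearly the same direction as the whole text in the embedding space, so it ranks high among the candidate keywords.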

3) After a MASK operation is performed on the candidate keywords in the vulnerability feature description candidate keyword set, the similarity between the candidate keywords and the standard vulnerability description text is calculated and sorted from high to low, entity keywords are generated, and the entity keywords are formed into triples according to their categories.

3.1 After a MASK operation is performed on the candidate keywords in the vulnerability feature description candidate keyword set, the similarity between the candidate keywords and the standard vulnerability description text is calculated, and the MASK operation formula is as follows:

where $T_{mask}$ is the text representation after the MASK operation on the candidate keyword, and $T_c$ is the text to which the candidate keyword belongs; $T_{mask_i}$ is the i-th text representation, and $T_{c_i}$ is the text to which the i-th candidate keyword belongs.

3.2 Obtaining similarity scores of text representations subjected to MASK operation of all candidate keywords and text representations to which the candidate keywords belong, sorting according to the similarity scores, and taking a set formed by the first k (k=6) phrases as an entity keyword set.

3.3 Classifying each entity keyword in the entity keyword set; based on software security domain knowledge, the entity keywords are classified into 10 classes, as shown in Table 1 below:

TABLE 1

3.4 Defining relationships between entities according to the categories of the entities; there are 7 relationships in total, as shown in Table 2 below:

TABLE 2

4.1 Format conversion is carried out on the formed (entity, relation, entity) triples, and a knowledge graph is constructed using Neo4j.

4.2 Extracting knowledge from the vulnerability text to be classified to form a vulnerability triplet.

The knowledge extraction method is consistent with steps 2) to 3). First, N-element candidate keywords are extracted from the vulnerability text to be classified to generate a vulnerability keyword set. Then, after a MASK operation is performed on the vulnerability keywords in the set, the similarity between the vulnerability keywords and the standard vulnerability description text is calculated and sorted from high to low to generate the vulnerability keywords, which are formed into vulnerability triples according to their categories.

Example 2

Corresponding to the software vulnerability feature knowledge extraction method based on N-element phrase similarity of Example 1, this example provides a software vulnerability feature knowledge extraction system based on N-element phrase similarity which, as shown in FIG. 1, comprises a vulnerability data set construction module, a candidate keyword extraction module, a keyword entity extraction module and a vulnerability classification module;

the vulnerability data set construction module is used for collecting vulnerability description text in the vulnerability database, preprocessing the vulnerability description text and generating standard vulnerability description text.

The vulnerability data set construction module comprises an acquisition unit and a preprocessing unit;

the acquisition unit is used for acquiring vulnerability description texts of the software security vulnerability list database CWE and the vulnerability disclosure database CVE, and comprises a title, description and expansion description fields of the texts.

The candidate keyword extraction module is used for extracting N-element candidate keywords from the standard vulnerability description text, and generating a vulnerability feature description candidate keyword set by using the semantic similarity of the phrase and the text.

The candidate keyword extraction module comprises a word segmentation unit, a coding unit and a traversal calculation unit;

The word segmentation unit is used for performing word segmentation on the standard vulnerability description texts, and generating for each text a phrase set $A_1 = \{w_1, w_2, w_3, w_4, \ldots, w_n\}$ with N=1 and a phrase set $A_2 = \{w_1 w_2, w_2 w_3, w_3 w_4, w_4 w_5, \ldots, w_{n-1} w_n\}$ with N=2.

The coding unit is used for semantically encoding the phrase sets $A_1$ and $A_2$ and the texts to which the phrase sets belong with the pre-trained model BERT, and obtaining the phrase representation of each phrase and the representation of the text to which it belongs by max pooling (MaxPooling).

The traversal calculation unit is used for traversing the phrase sets $A_1$ and $A_2$ and calculating the cosine similarity between the phrase representation of each phrase and the representation of the text to which the phrase belongs, with the calculation formula:

$$\cos(w, T) = \frac{w \cdot T}{\lVert w \rVert \, \lVert T \rVert}$$

The keyword entity extraction module is used for calculating the similarity between the candidate keywords and the standard vulnerability description text after MASK operation on the candidate keywords in the vulnerability feature description candidate keyword set, sequencing the similarity from high to low, generating entity keywords, and forming triples by the entity keywords according to categories.

The keyword entity extraction module comprises a MASK operation unit, a sorting unit, a classification unit, a definition unit and a triplet unit;

The sorting unit is used for obtaining the similarity scores between the text representations after the MASK operation on each candidate keyword and the representations of the texts to which the candidate keywords belong, sorting according to the similarity scores, and taking the set formed by the first k (k=6) phrases as the entity keyword set.

The classification unit is used for classifying each entity keyword in the entity keyword set; based on software security domain knowledge, the entity keywords are classified into 10 classes, as shown in Table 1 below:

TABLE 1

The definition unit is used for defining the relationships between entities according to the categories of the entities; there are 7 relationships in total, as shown in Table 2 below:

TABLE 2

The vulnerability classification module comprises a knowledge graph construction unit, a vulnerability triplet unit and an analysis matching unit.

The knowledge graph construction unit is used for performing format conversion on the formed (entity, relation, entity) triples and constructing the knowledge graph using Neo4j.

And the vulnerability triplet unit is used for extracting knowledge from the vulnerability text to be classified to form a vulnerability triplet.

The knowledge extraction method is consistent with the processing from the candidate keyword extraction module to the keyword entity extraction module. First, N-element candidate keywords are extracted from the vulnerability text to be classified to generate a vulnerability keyword set. Then, after a MASK operation is performed on the vulnerability keywords in the set, the similarity between the vulnerability keywords and the standard vulnerability description text is calculated and sorted from high to low to generate the vulnerability keywords, which are formed into vulnerability triples according to their categories.

Claims

1. A software vulnerability feature knowledge extraction method based on N-element phrase similarity is characterized by comprising the following steps:

2. The method for extracting knowledge of software vulnerability characteristics based on similarity of N-gram according to claim 1, wherein the step 1) comprises the steps of:

3. The method for extracting knowledge of software vulnerability characteristics based on similarity of N-gram according to claim 1, wherein the step 2) comprises the steps of:

2.1 Word segmentation is carried out on the standard vulnerability description texts, and for each text a phrase set $A_1 = \{w_1, w_2, w_3, w_4, \ldots, w_n\}$ with N=1 and a phrase set $A_2 = \{w_1 w_2, w_2 w_3, w_3 w_4, w_4 w_5, \ldots, w_{n-1} w_n\}$ with N=2 are generated;

4. The method for extracting knowledge of software vulnerability characteristics based on similarity of N-gram according to claim 1, wherein the step 3) comprises the steps of:

where $T_{mask}$ is the text representation after the MASK operation on the candidate keyword, and $T_c$ is the text to which the candidate keyword belongs; $T_{mask_i}$ is the i-th text representation, and $T_{c_i}$ is the text to which the i-th candidate keyword belongs;

3.3 Classifying each entity keyword in the entity keyword set;

5. The method for extracting knowledge of software vulnerability characteristics based on similarity of N-gram according to claim 1, wherein the step 4) comprises the steps of:

6. The feature knowledge extraction system based on N-element phrase similarity learning is characterized by comprising a vulnerability data set construction module, a candidate keyword extraction module, a keyword entity extraction module and a vulnerability classification module;

7. The feature knowledge extraction system based on N-gram similarity learning of claim 6, wherein the vulnerability dataset construction module comprises an acquisition unit and a preprocessing unit;

8. The feature knowledge extraction system based on N-gram similarity learning of claim 6, wherein the candidate keyword extraction module comprises a word segmentation unit, a coding unit and a traversal calculation unit;

9. The feature knowledge extraction system based on N-gram similarity learning of claim 6, wherein the keyword entity extraction module comprises a MASK operation unit, a sorting unit, a classification unit, a definition unit, and a triplet unit;

where $T_{mask}$ is the text representation after the MASK operation on the candidate keyword, and $T_c$ is the text to which the candidate keyword belongs; $T_{mask_i}$ is the i-th text representation, and $T_{c_i}$ is the text to which the i-th candidate keyword belongs;

10. The feature knowledge extraction system based on N-gram similarity learning of claim 6, wherein the vulnerability classification module comprises a knowledge graph construction unit, a vulnerability triplet unit and an analysis matching unit;