CN116561332A - Software vulnerability feature knowledge extraction method and system based on N-element phrase similarity - Google Patents


Info

Publication number
CN116561332A
Authority
CN
China
Prior art keywords
vulnerability
text
phrase
entity
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211605951.2A
Other languages
Chinese (zh)
Inventor
吴潇雪
郑炜
郑彬
刘新岩
高亿人
单文婧
薄莉莉
孙小兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yangzhou University
Original Assignee
Yangzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yangzhou University
Priority to CN202211605951.2A
Publication of CN116561332A
Legal status: Pending


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36: Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367: Ontology
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50: Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/57: Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
    • G06F21/577: Assessing vulnerabilities and evaluating computer system security
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00: Computing arrangements using knowledge-based models
    • G06N5/02: Knowledge representation; Symbolic representation
    • G06N5/022: Knowledge engineering; Knowledge acquisition
    • G06N5/025: Extracting rules from data
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Animal Behavior & Ethology (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a software vulnerability feature knowledge extraction method and system based on N-element (N-gram) phrase similarity. First, vulnerability description text is collected from a vulnerability database and cleaned to generate vulnerability key-feature description text. N-gram candidate keywords are then extracted, and vulnerability feature description candidate keywords are generated using the semantic similarity between phrases and their texts. Next, keyword entities are extracted: a text similarity model is constructed, a MASK operation is applied to the candidate keywords in the text, the similarity to the standard vulnerability description text is calculated, and the results are ranked by similarity to generate vulnerability feature description entity keywords. Vulnerability feature entity relationships are defined, and triples representing those relationships are generated. Finally, a knowledge graph is constructed from the triples and used to analyze the vulnerability text to be analyzed. The method and system help analyze vulnerability features and improve the security of software systems.

Description

Software vulnerability feature knowledge extraction method and system based on N-element phrase similarity
Technical Field
The invention relates to the field of software security, in particular to a software vulnerability feature knowledge extraction method and system based on N-element phrase similarity.
Background
Vulnerability research is critical to software security, and the study of vulnerability features is an important part of it. Existing deep learning models depend on large numbers of labeled vulnerability samples, which are difficult to obtain during real development; manual labeling consumes substantial resources, and its correctness remains open to question. Moreover, small-sample data sets are not large enough to train a vulnerability classification model with high classification accuracy. As a result, vulnerability classification models trained on small, manually labeled data sets have low classification accuracy and are difficult to apply to vulnerability classification tasks in real environments.
At present, some work analyzes vulnerabilities using the software security vulnerability list database CWE (Common Weakness Enumeration) and the vulnerability disclosure database CVE (Common Vulnerabilities and Exposures), but the rich textual semantic information in CWE and CVE is not fully used. Even when text information is used, entity recognition relies on a training set produced by manually labeling the data set, followed mainly by deep learning methods, such as the classic BiLSTM-CRF method in "Neural Architectures for Named Entity Recognition" and the IDCNN-CRF method proposed in "Fast and Accurate Entity Recognition with Iterated Dilated Convolutions". These approaches depend heavily on prior expert experience and still require a large amount of manual work, with significant costs in expertise and time.
Disclosure of Invention
Purpose of the invention: the invention aims to provide a software vulnerability feature knowledge extraction method and system based on N-element (N-gram) phrase similarity that improve the efficiency of entity keyword extraction and quickly locate vulnerability feature knowledge.
Technical solution: the invention provides a software vulnerability feature knowledge extraction method based on N-element phrase similarity, comprising the following steps:
1) Collecting vulnerability description text from a vulnerability database and preprocessing it to generate standard vulnerability description text;
2) Extracting N-gram candidate keywords from the standard vulnerability description text, and generating a vulnerability feature description candidate keyword set using the semantic similarity between each phrase and its text;
3) Performing a MASK operation on the candidate keywords in the vulnerability feature description candidate keyword set, calculating the similarity between the candidate keywords and the standard vulnerability description text, sorting the similarities from high to low to generate entity keywords, and forming triples from the entity keywords according to their categories;
4) Formatting the triples and constructing a knowledge graph; extracting knowledge from the vulnerability text to be classified to form vulnerability triples, and analyzing and matching the vulnerability text to be classified according to the vulnerability triples and the knowledge graph to obtain the feature keywords of the vulnerability text to be classified.
Further, step 1) comprises the following steps:
1.1) Collecting vulnerability description texts from the software security vulnerability list database CWE and the vulnerability disclosure database CVE, including the title, description, and extended description fields of each text;
1.2) Preprocessing the collected vulnerability description texts, where the preprocessing includes punctuation removal, stop-word removal, word segmentation, part-of-speech tagging, and lemmatization, to generate standard vulnerability description text free of interfering information.
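The preprocessing in step 1.2 can be illustrated with a minimal Python sketch. It is not the patent's implementation: the stop-word list and the lemma dictionary below are tiny hypothetical stand-ins for a real NLP toolkit, and part-of-speech tagging is omitted to keep the example dependency-free.

```python
import re

# Illustrative resources only (assumptions, not taken from the patent).
STOP_WORDS = {"a", "an", "the", "of", "to", "in", "is", "that", "by", "and", "or"}
LEMMAS = {"allows": "allow", "attackers": "attacker", "overflows": "overflow"}


def preprocess(text: str) -> list[str]:
    """Lowercase, remove punctuation, tokenize, drop stop words, lemmatize."""
    text = re.sub(r"[^\w\s]", " ", text.lower())          # punctuation removal
    tokens = text.split()                                   # whitespace word segmentation
    tokens = [t for t in tokens if t not in STOP_WORDS]    # stop-word removal
    return [LEMMAS.get(t, t) for t in tokens]              # dictionary-based lemmatization


print(preprocess("A buffer overflow allows attackers to execute code."))
```

In practice the dictionary lookup would be replaced by a proper lemmatizer and tagger; the sketch only shows the order of the cleaning operations.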
Further, step 2) comprises the following steps:
2.1) Performing word segmentation on the standard vulnerability description texts, and generating for each text a phrase set with N=1, $A_1 = \{w_1, w_2, w_3, w_4, \ldots, w_n\}$, and a phrase set with N=2, $A_2 = \{w_1 w_2, w_2 w_3, w_3 w_4, w_4 w_5, \ldots, w_{n-1} w_n\}$;
2.2) Using the pre-trained model BERT to semantically encode the phrase sets $A_1$ and $A_2$ and the texts to which they belong, and applying max pooling (MaxPooling) to obtain the representation of each phrase and the representation of the text to which it belongs;
2.3) Traversing the phrase sets $A_1$ and $A_2$ and calculating the cosine similarity between the representation of each phrase and the representation of the text to which it belongs, using the formula:
$$\cos(w, T) = \frac{w \cdot T}{\lVert w \rVert \, \lVert T \rVert}$$
where $w$ is a phrase representation in phrase set $A_1$ or $A_2$, and $T$ is the representation of the text to which the phrase belongs;
the similarity score between each phrase and its text is thus obtained, the scores are sorted from high to low, and the top k phrases form the candidate keyword set $B = \{wo_1, wo_2, wo_3, wo_4, \ldots, wo_{20}\}$.
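As an illustration of steps 2.1 and 2.3, the following Python sketch builds the N=1 and N=2 phrase sets and keeps the top k phrases by score. The scoring function is passed in as a parameter; in the method described above it would be the cosine similarity between BERT representations, whereas the `score` callable and the function names here are hypothetical.

```python
from typing import Callable


def ngrams(tokens: list[str], n: int) -> list[str]:
    """Contiguous n-grams joined by spaces (A_1 for n=1, A_2 for n=2)."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def candidate_keywords(tokens: list[str],
                       score: Callable[[str], float],
                       k: int = 20) -> list[str]:
    """Rank all unigrams and bigrams by `score` (high to low) and keep the top k."""
    # dict.fromkeys removes duplicate phrases while preserving first-seen order.
    phrases = list(dict.fromkeys(ngrams(tokens, 1) + ngrams(tokens, 2)))
    return sorted(phrases, key=score, reverse=True)[:k]


tokens = ["heap", "buffer", "overflow"]
# Toy score (phrase length) used only to make the example runnable.
print(candidate_keywords(tokens, score=lambda p: float(len(p)), k=2))
```

The default `k = 20` mirrors the size of the candidate set $B$ shown in the text.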
Further, step 3) comprises the following steps:
3.1) Performing a MASK operation on the candidate keywords in the vulnerability feature description candidate keyword set and then calculating the similarity between the candidate keywords and the standard vulnerability description text, where the MASK operation formula is as follows:
where $T_{mask}$ is the text representation after the MASK operation on the candidate keyword, and $T_c$ is the text to which the candidate keyword belongs; $T_{mask_i}$ is the i-th text representation, and $T_{c_i}$ is the text to which the i-th candidate keyword belongs;
3.2) Obtaining the similarity scores between the text representations after the MASK operation on each candidate keyword and the representations of the texts to which the candidate keywords belong, sorting by similarity score, and taking the set formed by the top k (k=6) phrases as the entity keyword set;
3.3) Classifying each entity keyword in the entity keyword set;
3.4) Defining the relations between entities according to the categories of the entities;
3.5) Forming (entity, relation, entity) triples from the entity keywords according to entity category and relation.
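Steps 3.1 and 3.2 can be sketched in Python as follows. This is only one possible reading of the MASK operation: each candidate is replaced by a `[MASK]` token in its text, the masked text is encoded, and its similarity to the original text is scored. The bag-of-words `embed` function is a stand-in assumption for the BERT encoder, and all function names are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Stand-in encoder: bag-of-words counts (the method itself uses BERT)."""
    return Counter(text.split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def mask_rank(text: str, candidates: list[str], k: int = 6) -> list[tuple[str, float]]:
    """Mask each candidate, score masked text against the original, keep the top k."""
    original = embed(text)
    scored = []
    for cand in candidates:
        masked = text.replace(cand, "[MASK]")  # naive substring masking, for illustration
        scored.append((cand, cosine(embed(masked), original)))
    # Scores are sorted from high to low, as stated in step 3.2.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


text = "heap buffer overflow allow remote attacker execute code"
print(mask_rank(text, ["buffer overflow", "code"]))
```

With a contextual encoder the scores would reflect semantic change rather than token overlap; the ranking and top-k selection would stay the same.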
Further, step 4) comprises the following steps:
4.1) Converting the format of the resulting (entity, relation, entity) triples and constructing a knowledge graph using Neo4j;
4.2) Extracting knowledge from the vulnerability text to be classified to form vulnerability triples;
4.3) Analyzing and matching the vulnerability text to be classified according to the vulnerability triples and the knowledge graph to obtain the feature keywords of the vulnerability text to be classified.
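The format conversion in step 4.1 could, for example, turn each triple into a parameterized Cypher `MERGE` statement for Neo4j. The sketch below is an assumption about what such a conversion might look like: the node label `Entity`, the `name` property, and the sample relation are hypothetical, and the source does not specify the import format.

```python
import re


def triple_to_cypher(head: str, relation: str, tail: str) -> tuple[str, dict[str, str]]:
    """Build a parameterized Cypher MERGE query for one (entity, relation, entity) triple."""
    # Relationship types cannot be passed as query parameters, so sanitize the name.
    rel_type = re.sub(r"\W+", "_", relation.strip()).upper()
    if not re.fullmatch(r"[A-Z_][A-Z0-9_]*", rel_type):
        raise ValueError(f"unsupported relation name: {relation!r}")
    query = (
        "MERGE (h:Entity {name: $head}) "
        "MERGE (t:Entity {name: $tail}) "
        f"MERGE (h)-[:{rel_type}]->(t)"
    )
    return query, {"head": head, "tail": tail}


query, params = triple_to_cypher("buffer overflow", "leads to", "code execution")
print(query)
print(params)
```

The returned query and parameter dictionary could then be handed to a Neo4j session's `run` method; using `MERGE` rather than `CREATE` avoids duplicate nodes when the same entity appears in several triples.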
The invention correspondingly provides a feature knowledge extraction system based on N-element phrase similarity learning, comprising a vulnerability data set construction module, a candidate keyword extraction module, a keyword entity extraction module, and a vulnerability classification module;
the vulnerability data set construction module is used for collecting vulnerability description text from a vulnerability database and preprocessing it to generate standard vulnerability description text;
the candidate keyword extraction module is used for extracting N-gram candidate keywords from the standard vulnerability description text and generating a vulnerability feature description candidate keyword set using the semantic similarity between each phrase and its text;
the keyword entity extraction module is used for performing a MASK operation on the candidate keywords in the vulnerability feature description candidate keyword set, calculating the similarity between the candidate keywords and the standard vulnerability description text, sorting the similarities from high to low to generate entity keywords, and forming triples from the entity keywords according to their categories;
the vulnerability classification module is used for formatting the triples and constructing a knowledge graph, extracting knowledge from the vulnerability text to be classified to form vulnerability triples, and analyzing and matching the vulnerability text to be classified according to the vulnerability triples and the knowledge graph to obtain the feature keywords of the vulnerability text to be classified.
Further, the vulnerability data set construction module comprises an acquisition unit and a preprocessing unit;
the acquisition unit is used for collecting vulnerability description texts from the software security vulnerability list database CWE and the vulnerability disclosure database CVE, including the title, description, and extended description fields of each text;
the preprocessing unit is used for preprocessing the collected vulnerability description texts, where the preprocessing includes punctuation removal, stop-word removal, word segmentation, part-of-speech tagging, and lemmatization, to generate standard vulnerability description text free of interfering information.
Further, the candidate keyword extraction module comprises a word segmentation unit, a coding unit, and a traversal calculation unit;
the word segmentation unit is used for segmenting the standard vulnerability description texts and generating for each text a phrase set with N=1, $A_1 = \{w_1, w_2, w_3, w_4, \ldots, w_n\}$, and a phrase set with N=2, $A_2 = \{w_1 w_2, w_2 w_3, w_3 w_4, w_4 w_5, \ldots, w_{n-1} w_n\}$;
the coding unit is used for semantically encoding the phrase sets $A_1$ and $A_2$ and the texts to which they belong with the pre-trained model BERT, and applying max pooling (MaxPooling) to obtain the representation of each phrase and the representation of the text to which it belongs;
the traversal calculation unit is used for traversing the phrase sets $A_1$ and $A_2$ and calculating the cosine similarity between the representation of each phrase and the representation of the text to which it belongs, using the formula:
$$\cos(w, T) = \frac{w \cdot T}{\lVert w \rVert \, \lVert T \rVert}$$
where $w$ is a phrase representation in phrase set $A_1$ or $A_2$, and $T$ is the representation of the text to which the phrase belongs;
the similarity score between each phrase and its text is thus obtained, the scores are sorted from high to low, and the top k phrases form the candidate keyword set $B = \{wo_1, wo_2, wo_3, wo_4, \ldots, wo_{20}\}$.
Further, the keyword entity extraction module comprises a MASK operation unit, a sorting unit, a classification unit, a definition unit, and a triplet unit;
the MASK operation unit is used for performing a MASK operation on the candidate keywords in the vulnerability feature description candidate keyword set and then calculating the similarity between the candidate keywords and the standard vulnerability description text, where the MASK operation formula is as follows:
where $T_{mask}$ is the text representation after the MASK operation on the candidate keyword, and $T_c$ is the text to which the candidate keyword belongs; $T_{mask_i}$ is the i-th text representation, and $T_{c_i}$ is the text to which the i-th candidate keyword belongs;
the sorting unit is used for obtaining the similarity scores between the text representations after the MASK operation on each candidate keyword and the representations of the texts to which the candidate keywords belong, sorting by similarity score, and taking the set formed by the top k (k=6) phrases as the entity keyword set;
the classification unit is used for classifying each entity keyword in the entity keyword set;
the definition unit is used for defining the relations between entities according to the categories of the entities;
the triplet unit is used for forming (entity, relation, entity) triples from the entity keywords according to entity category and relation.
Further, the vulnerability classification module comprises a knowledge graph construction unit, a vulnerability triplet unit, and an analysis matching unit;
the knowledge graph construction unit is used for converting the format of the resulting (entity, relation, entity) triples and constructing a knowledge graph using Neo4j;
the vulnerability triplet unit is used for extracting knowledge from the vulnerability text to be classified to form vulnerability triples;
the analysis matching unit is used for analyzing and matching the vulnerability text to be classified according to the vulnerability triples and the knowledge graph to obtain the feature keywords of the vulnerability text to be classified.
Beneficial effects: compared with the prior art, the notable features of the invention are as follows. A candidate keyword set is obtained through N-gram candidate phrase extraction, and the candidate keywords are reduced to an entity keyword set through the MASK operation, so that entity keywords are located quickly and the efficiency of entity keyword extraction is improved. After the entity keywords are classified, the classified entities are formed into triples according to their categories and relations, and a knowledge graph is constructed; the knowledge graph is then used to analyze the vulnerability text to be classified and obtain its feature keywords, so that vulnerability features are located quickly.
Drawings
FIG. 1 is a schematic flow chart of the present invention.
Detailed Description
The present invention will be described in further detail with reference to specific embodiments and drawings.
Example 1
The invention provides a software vulnerability feature knowledge extraction method based on N-element phrase similarity which, as shown in FIG. 1, comprises the following steps:
1) Collecting vulnerability description text from a vulnerability database and preprocessing it to generate standard vulnerability description text.
1.1) Collecting vulnerability description texts from the software security vulnerability list database CWE and the vulnerability disclosure database CVE, including the title, description, and extended description fields of each text.
1.2) Preprocessing the collected vulnerability description texts, where the preprocessing includes punctuation removal, stop-word removal, word segmentation, part-of-speech tagging, and lemmatization, to generate standard vulnerability description text free of interfering information.
2) Extracting N-gram candidate keywords from the standard vulnerability description text, and generating a vulnerability feature description candidate keyword set using the semantic similarity between each phrase and its text.
2.1) Performing word segmentation on the standard vulnerability description texts, and generating for each text a phrase set with N=1, $A_1 = \{w_1, w_2, w_3, w_4, \ldots, w_n\}$, and a phrase set with N=2, $A_2 = \{w_1 w_2, w_2 w_3, w_3 w_4, w_4 w_5, \ldots, w_{n-1} w_n\}$.
2.2) Using the pre-trained model BERT to semantically encode the phrase sets $A_1$ and $A_2$ and the texts to which they belong, and applying max pooling (MaxPooling) to obtain the representation of each phrase and the representation of the text to which it belongs.
2.3) Traversing the phrase sets $A_1$ and $A_2$ and calculating the cosine similarity between the representation of each phrase and the representation of the text to which it belongs, using the formula:
$$\cos(w, T) = \frac{w \cdot T}{\lVert w \rVert \, \lVert T \rVert}$$
where $w$ is a phrase representation in phrase set $A_1$ or $A_2$, and $T$ is the representation of the text to which the phrase belongs.
The similarity score between each phrase and its text is thus obtained; the scores are sorted from high to low, and the top k phrases form the candidate keyword set $B = \{wo_1, wo_2, wo_3, wo_4, \ldots, wo_{20}\}$.
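The pooling and similarity computation of steps 2.2 and 2.3 can be shown on plain lists of numbers. The sketch below assumes that an encoder such as BERT has already produced one vector per token; the vectors used here are made-up values, and the function names are hypothetical.

```python
import math


def max_pool(token_vectors: list[list[float]]) -> list[float]:
    """Element-wise maximum over token vectors, giving one fixed-size representation."""
    return [max(column) for column in zip(*token_vectors)]


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Made-up token vectors standing in for encoder output.
phrase_repr = max_pool([[1.0, 0.0], [0.5, 2.0]])               # a two-token phrase
text_repr = max_pool([[1.0, 0.0], [0.5, 2.0], [0.2, 0.1]])     # its three-token text
print(phrase_repr, text_repr, cosine(phrase_repr, text_repr))
```

Max pooling keeps, for every dimension, the strongest activation across tokens, so phrases and texts of different lengths are mapped to vectors of the same size and can be compared directly.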
3) Performing a MASK operation on the candidate keywords in the vulnerability feature description candidate keyword set, calculating the similarity between the candidate keywords and the standard vulnerability description text, sorting the similarities from high to low to generate entity keywords, and forming triples from the entity keywords according to their categories.
3.1) Performing a MASK operation on the candidate keywords in the vulnerability feature description candidate keyword set and then calculating the similarity between the candidate keywords and the standard vulnerability description text, where the MASK operation formula is as follows:
where $T_{mask}$ is the text representation after the MASK operation on the candidate keyword, and $T_c$ is the text to which the candidate keyword belongs; $T_{mask_i}$ is the i-th text representation, and $T_{c_i}$ is the text to which the i-th candidate keyword belongs.
3.2) Obtaining the similarity scores between the text representations after the MASK operation on each candidate keyword and the representations of the texts to which the candidate keywords belong, sorting by similarity score, and taking the set formed by the top k (k=6) phrases as the entity keyword set.
3.3) Classifying each entity keyword in the entity keyword set; based on software security domain knowledge, the entity keywords are divided into 10 categories, as shown in Table 1 below:
TABLE 1
3.4) Defining the relations between entities according to the categories of the entities; there are 7 relations in total, as shown in Table 2 below:
TABLE 2
3.5) Forming (entity, relation, entity) triples from the entity keywords according to entity category and relation.
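Steps 3.3 to 3.5 can be sketched as a lookup from category pairs to relations. The category names and relations below are hypothetical placeholders, since the contents of Tables 1 and 2 are not reproduced in this text; only the mechanism of turning classified entities into triples is illustrated.

```python
from itertools import permutations

# Hypothetical schema: (head category, tail category) -> relation name.
RELATION_SCHEMA = {
    ("weakness", "consequence"): "leads_to",
    ("attacker", "weakness"): "exploits",
}


def build_triples(entities: dict[str, str]) -> list[tuple[str, str, str]]:
    """Form (entity, relation, entity) triples for every ordered pair covered by the schema."""
    triples = []
    for (head, head_cat), (tail, tail_cat) in permutations(entities.items(), 2):
        relation = RELATION_SCHEMA.get((head_cat, tail_cat))
        if relation:
            triples.append((head, relation, tail))
    return triples


# Entity keyword -> category, as would be produced by the classification step.
entities = {
    "buffer overflow": "weakness",
    "code execution": "consequence",
    "remote attacker": "attacker",
}
print(build_triples(entities))
```

Keeping the schema in a dictionary means that adding a category or relation only requires a new entry, not a change to the triple-building logic.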
4) Formatting the triples and constructing a knowledge graph; extracting knowledge from the vulnerability text to be classified to form vulnerability triples, and analyzing and matching the vulnerability text to be classified according to the vulnerability triples and the knowledge graph to obtain the feature keywords of the vulnerability text to be classified.
4.1) Converting the format of the resulting (entity, relation, entity) triples and constructing a knowledge graph using Neo4j.
4.2) Extracting knowledge from the vulnerability text to be classified to form vulnerability triples.
The knowledge extraction method is the same as in steps 2) to 3). First, N-gram candidate keywords are extracted from the vulnerability text to be classified to generate a vulnerability keyword set. Then a MASK operation is performed on the vulnerability keywords in that set, the similarity between the vulnerability keywords and the standard vulnerability description text is calculated, and the similarities are sorted from high to low to generate the vulnerability keywords, which are formed into vulnerability triples according to their categories.
4.3) Analyzing and matching the vulnerability text to be classified according to the vulnerability triples and the knowledge graph to obtain the feature keywords of the vulnerability text to be classified.
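The source does not spell out how the analysis and matching in step 4.3 is performed. One simple possibility, sketched below under that caveat, is to treat the knowledge graph as a set of triples and collect the entities of every graph triple that shares a relation and at least one entity with a triple from the text to be classified. The matching rule and all names here are assumptions.

```python
Triple = tuple[str, str, str]


def match_features(query: list[Triple], graph: set[Triple]) -> list[str]:
    """Return entities of graph triples that overlap with any query triple."""
    features: list[str] = []
    for head, relation, tail in query:
        for g_head, g_relation, g_tail in sorted(graph):  # sorted for deterministic output
            if relation == g_relation and (head == g_head or tail == g_tail):
                for name in (g_head, g_tail):
                    if name not in features:
                        features.append(name)
    return features


graph = {
    ("buffer overflow", "leads_to", "code execution"),
    ("sql injection", "leads_to", "data leak"),
}
query = [("stack overflow", "leads_to", "code execution")]
print(match_features(query, graph))
```

With a graph stored in Neo4j, the same overlap test would more naturally be expressed as a graph query; the in-memory version only makes the matching idea concrete.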
Example 2
Corresponding to the software vulnerability feature knowledge extraction method based on N-element phrase similarity of Example 1, this embodiment provides a software vulnerability feature knowledge extraction system based on N-element phrase similarity. Referring to FIG. 1, the system comprises a vulnerability data set construction module, a candidate keyword extraction module, a keyword entity extraction module, and a vulnerability classification module.
The vulnerability data set construction module is used for collecting vulnerability description text from a vulnerability database and preprocessing it to generate standard vulnerability description text.
The vulnerability data set construction module comprises an acquisition unit and a preprocessing unit.
The acquisition unit is used for collecting vulnerability description texts from the software security vulnerability list database CWE and the vulnerability disclosure database CVE, including the title, description, and extended description fields of each text.
The preprocessing unit is used for preprocessing the collected vulnerability description texts, where the preprocessing includes punctuation removal, stop-word removal, word segmentation, part-of-speech tagging, and lemmatization, to generate standard vulnerability description text free of interfering information.
The candidate keyword extraction module is used for extracting N-gram candidate keywords from the standard vulnerability description text and generating a vulnerability feature description candidate keyword set using the semantic similarity between each phrase and its text.
The candidate keyword extraction module comprises a word segmentation unit, a coding unit, and a traversal calculation unit.
The word segmentation unit is used for segmenting the standard vulnerability description texts and generating for each text a phrase set with N=1, $A_1 = \{w_1, w_2, w_3, w_4, \ldots, w_n\}$, and a phrase set with N=2, $A_2 = \{w_1 w_2, w_2 w_3, w_3 w_4, w_4 w_5, \ldots, w_{n-1} w_n\}$.
The coding unit is used for semantically encoding the phrase sets $A_1$ and $A_2$ and the texts to which they belong with the pre-trained model BERT, and applying max pooling (MaxPooling) to obtain the representation of each phrase and the representation of the text to which it belongs.
The traversal calculation unit is used for traversing the phrase sets $A_1$ and $A_2$ and calculating the cosine similarity between the representation of each phrase and the representation of the text to which it belongs, using the formula:
$$\cos(w, T) = \frac{w \cdot T}{\lVert w \rVert \, \lVert T \rVert}$$
where $w$ is a phrase representation in phrase set $A_1$ or $A_2$, and $T$ is the representation of the text to which the phrase belongs.
The similarity score between each phrase and its text is thus obtained; the scores are sorted from high to low, and the top k phrases form the candidate keyword set $B = \{wo_1, wo_2, wo_3, wo_4, \ldots, wo_{20}\}$.
The keyword entity extraction module is used for performing a MASK operation on the candidate keywords in the vulnerability feature description candidate keyword set, calculating the similarity between the candidate keywords and the standard vulnerability description text, sorting the similarities from high to low to generate entity keywords, and forming triples from the entity keywords according to their categories.
The keyword entity extraction module comprises a MASK operation unit, a sorting unit, a classification unit, a definition unit, and a triplet unit.
The MASK operation unit is used for performing a MASK operation on the candidate keywords in the vulnerability feature description candidate keyword set and then calculating the similarity between the candidate keywords and the standard vulnerability description text, where the MASK operation formula is as follows:
where $T_{mask}$ is the text representation after the MASK operation on the candidate keyword, and $T_c$ is the text to which the candidate keyword belongs; $T_{mask_i}$ is the i-th text representation, and $T_{c_i}$ is the text to which the i-th candidate keyword belongs.
The sorting unit is used for obtaining the similarity scores between the text representations after the MASK operation on each candidate keyword and the representations of the texts to which the candidate keywords belong, sorting by similarity score, and taking the set formed by the top k (k=6) phrases as the entity keyword set.
The classification unit is used for classifying each entity keyword in the entity keyword set; based on software security domain knowledge, the entity keywords are divided into 10 categories, as shown in Table 1 below:
TABLE 1
The definition unit is used for defining the relations between entities according to the categories of the entities; there are 7 relations in total, as shown in Table 2 below:
TABLE 2
The triplet unit is used for forming (entity, relation, entity) triples according to the entity category and relation of each entity keyword.
The vulnerability classification module is used for formatting the triples and constructing a knowledge graph; and extracting knowledge from the vulnerability text to be classified to form a vulnerability triplet, and analyzing and matching the vulnerability text to be classified according to the vulnerability triplet and the knowledge graph to obtain characteristic keywords of the vulnerability text to be classified.
The vulnerability classification module comprises a knowledge graph construction unit, a vulnerability triplet unit and an analysis matching unit.
The knowledge graph construction unit is used for converting the format of the formed (entity, relation, entity) triples and constructing the knowledge graph with neo4j.
The vulnerability triplet unit is used for extracting knowledge from the vulnerability text to be classified to form a vulnerability triplet.
The knowledge extraction method is the same as the one used from the candidate keyword extraction module through the keyword entity extraction module: first, N-element candidate keywords are extracted from the vulnerability text to be classified to generate a vulnerability keyword set; then, after the MASK operation is performed on the vulnerability keywords in the set, the similarity between the vulnerability keywords and the standard vulnerability description text is calculated and sorted from high to low to generate the vulnerability keywords; finally, the vulnerability keywords are formed into vulnerability triples according to their categories.
The analysis matching unit is used for analyzing and matching the vulnerability text to be classified according to the vulnerability triples and the knowledge graph to obtain the feature keywords of the vulnerability text to be classified.
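The text does not spell out the matching rule used by the analysis matching unit. The Python sketch below shows one possible interpretation under stated assumptions: the knowledge graph is held in memory as a list of triples, exact triple matches are ranked first, and entities that merely occur somewhere in the graph are ranked after them. The function name `match_feature_keywords` and the sample triples are hypothetical.

```python
def match_feature_keywords(vuln_triples, graph_triples):
    """Return entities of vuln_triples that occur in the graph, exact matches first."""
    graph_set = set(graph_triples)
    graph_entities = {e for head, _, tail in graph_triples for e in (head, tail)}
    exact, partial = [], []
    for head, relation, tail in vuln_triples:
        # A triple found verbatim in the graph counts as an exact match.
        target = exact if (head, relation, tail) in graph_set else partial
        for entity in (head, tail):
            if entity in graph_entities:
                target.append(entity)
    # Deduplicate while preserving order (exact matches come first).
    return list(dict.fromkeys(exact + partial))


# Illustrative data only.
graph = [
    ("buffer overflow", "leads_to", "code execution"),
    ("crafted packet", "exploits", "buffer overflow"),
]
vuln = [
    ("buffer overflow", "leads_to", "code execution"),
    ("crafted packet", "exploits", "heap corruption"),
]
keywords = match_feature_keywords(vuln, graph)
```

In a deployment backed by neo4j, the same lookup would more likely be expressed as graph queries; the in-memory version only illustrates the idea.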

Claims (10)

1. A software vulnerability feature knowledge extraction method based on N-element phrase similarity is characterized by comprising the following steps:
1) Collecting vulnerability description text in a vulnerability database, preprocessing, and generating a standard vulnerability description text;
2) Extracting N-element candidate keywords from the standard vulnerability description text, and generating a vulnerability feature description candidate keyword set by using the semantic similarity of the phrase and the text;
3) After the MASK operation is carried out on the candidate keywords in the vulnerability feature description candidate keyword set, calculating the similarity between the candidate keywords and the standard vulnerability description text, sorting the similarities from high to low, generating entity keywords, and forming triples from the entity keywords according to their categories;
4) Formatting the triples and constructing a knowledge graph; and extracting knowledge from the vulnerability text to be classified to form a vulnerability triplet, and analyzing and matching the vulnerability text to be classified according to the vulnerability triplet and the knowledge graph to obtain characteristic keywords of the vulnerability text to be classified.
2. The method for extracting knowledge of software vulnerability characteristics based on similarity of N-gram according to claim 1, wherein the step 1) comprises the steps of:
1.1 Collecting vulnerability description texts from the software security vulnerability list database CWE and the vulnerability disclosure database CVE, wherein the vulnerability description texts comprise the title, description and extended description fields of the texts;
1.2 Preprocessing the collected vulnerability description text, wherein the preprocessing comprises punctuation removal, stop word removal, word segmentation, part-of-speech tagging and lemmatization, and generating a standard vulnerability description text free of interference information.
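The preprocessing of step 1.2 can be sketched in Python as follows. This is a minimal standard-library illustration, not the claimed implementation: the stop-word list is a tiny assumed sample, `naive_lemma` is a crude stand-in for a real lemmatizer, and part-of-speech tagging is omitted because it would need an NLP library.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much larger).
STOP_WORDS = {"a", "an", "the", "of", "to", "in", "is", "are",
              "that", "and", "or", "by", "via"}


def naive_lemma(token):
    """Crude stand-in for lemmatization: strip a trailing plural/verb 's'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def preprocess(text):
    """Lowercase, drop punctuation, split into words, drop stop words, lemmatize."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    tokens = text.split()
    return [naive_lemma(t) for t in tokens if t not in STOP_WORDS]


sample = ("The product allows remote attackers to execute "
          "arbitrary commands via crafted packets.")
standard_tokens = preprocess(sample)
# standard_tokens == ['product', 'allow', 'remote', 'attacker', 'execute',
#                     'arbitrary', 'command', 'crafted', 'packet']
```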
3. The method for extracting knowledge of software vulnerability characteristics based on similarity of N-gram according to claim 1, wherein the step 2) comprises the steps of:
2.1 Word segmentation is carried out on the standard vulnerability description texts, and for each text a phrase set with N=1, $A_1 = \{w_1, w_2, w_3, w_4, \ldots, w_n\}$, and a phrase set with N=2, $A_2 = \{w_1 w_2, w_2 w_3, w_3 w_4, w_4 w_5, \ldots, w_{n-1} w_n\}$, are generated;
2.2 The pre-training model BERT is used to semantically encode the phrase sets $A_1$ and $A_2$ and the texts to which the phrase sets belong, and MaxPooling (maximum pooling) is used to obtain the phrase representation of each phrase and the representation of the text to which the phrase belongs;
2.3 The phrase sets $A_1$ and $A_2$ are traversed, and the cosine similarity between the phrase representation of each phrase and the representation of the text to which the phrase belongs is calculated; the calculation formula is as follows:
$$\cos(w, T) = \frac{w \cdot T}{\lVert w \rVert \, \lVert T \rVert}$$
wherein $w$ is the representation of a phrase in phrase set $A_1$ or $A_2$, and $T$ is the representation of the text to which the phrase belongs;
the similarity score between each phrase and the text is obtained by calculation, the scores are sorted from high to low, and the first k phrases are taken to form a set, which is the candidate keyword set $B = \{wo_1, wo_2, wo_3, wo_4, \ldots, wo_{20}\}$.
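Steps 2.1 to 2.3 can be sketched in Python as below. To stay self-contained, the sketch replaces BERT with a hypothetical deterministic `toy_embed` function, so the scores carry no semantic meaning; only the pipeline shape (1-gram and 2-gram generation, max pooling, cosine similarity, top-k selection) mirrors the text. All function names are assumptions.

```python
import math


def ngrams(tokens, n):
    """Return the contiguous n-word phrases of a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def toy_embed(token, dim=8):
    """Hypothetical stand-in for a BERT token vector (deterministic, not semantic)."""
    seed = sum(ord(c) for c in token)
    return [math.sin(seed * (i + 1)) for i in range(dim)]


def max_pool(vectors):
    """Element-wise maximum over a list of equal-length vectors."""
    return [max(column) for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_candidates(tokens, k):
    """Rank all 1-grams and 2-grams by similarity to the whole text, keep the top k."""
    text_repr = max_pool([toy_embed(t) for t in tokens])
    phrases = ngrams(tokens, 1) + ngrams(tokens, 2)
    scored = []
    for phrase in dict.fromkeys(phrases):  # drop duplicate phrases, keep order
        phrase_repr = max_pool([toy_embed(t) for t in phrase.split()])
        scored.append((cosine(phrase_repr, text_repr), phrase))
    scored.sort(key=lambda item: item[0], reverse=True)  # high to low
    return [phrase for _, phrase in scored[:k]]


tokens = ["heap", "buffer", "overflow", "parser", "allow", "code", "execution"]
candidates = top_k_candidates(tokens, k=5)
```

With a real encoder, each phrase would normally be embedded from its contextual token vectors rather than from isolated tokens; the candidate set size of 20 given in the text corresponds to `k=20` here.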
4. The method for extracting knowledge of software vulnerability characteristics based on similarity of N-gram according to claim 1, wherein the step 3) comprises the steps of:
3.1 After the MASK operation is performed on the candidate keywords in the vulnerability feature description candidate keyword set, the similarity between the candidate keywords and the standard vulnerability description text is calculated; the MASK operation formula is as follows:
wherein $T_{mask}$ is the text representation after the candidate keyword MASK operation, and $T_c$ is the text to which the candidate keyword belongs; $T_{mask_i}$ is the i-th text representation, and $T_{c_i}$ is the text to which the i-th candidate keyword belongs;
3.2 Obtaining similarity scores of text representations subjected to MASK operation of all candidate keywords and text representations to which the candidate keywords belong, sorting according to the similarity scores, and taking a set formed by the first k (k=6) phrases as an entity keyword set;
3.3 Classifying each entity keyword in the entity keyword set;
3.4 Defining relationships between entities according to categories of the entities;
3.5 Each entity keyword is formed into (entity, relation, entity) triples according to the entity category and relation.
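Steps 3.1 and 3.2 can be illustrated with the Python sketch below. The MASK formula itself is not reproduced in this text, so the sketch only assumes the general shape described: each candidate is replaced by `[MASK]` tokens, the masked text is re-encoded, its representation is compared with that of the original text by cosine similarity, and the scores are sorted from high to low as the text states. `toy_embed` is again a hypothetical non-semantic stand-in for BERT, and all helper names are assumptions.

```python
import math


def toy_embed(token, dim=8):
    """Hypothetical stand-in for a BERT token vector (deterministic, not semantic)."""
    seed = sum(ord(c) for c in token)
    return [math.sin(seed * (i + 1)) for i in range(dim)]


def text_repr(tokens):
    """Max-pool the token vectors into one text representation."""
    return [max(column) for column in zip(*(toy_embed(t) for t in tokens))]


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mask_phrase(tokens, phrase, mask_token="[MASK]"):
    """Replace every occurrence of the phrase's word sequence with mask tokens."""
    words = phrase.split()
    n = len(words)
    if n == 0:
        return list(tokens)
    out, i = [], 0
    while i < len(tokens):
        if tokens[i:i + n] == words:
            out.extend([mask_token] * n)
            i += n
        else:
            out.append(tokens[i])
            i += 1
    return out


def rank_by_mask_similarity(tokens, candidates, k=6):
    """Score each candidate by sim(masked text, original text); keep the first k."""
    original = text_repr(tokens)
    scored = [(cosine(text_repr(mask_phrase(tokens, c)), original), c)
              for c in candidates]
    scored.sort(key=lambda item: item[0], reverse=True)  # high to low, per the text
    return [c for _, c in scored[:k]]


tokens = ["heap", "buffer", "overflow", "parser", "allow", "code", "execution"]
entity_keywords = rank_by_mask_similarity(
    tokens, ["buffer overflow", "parser", "code execution"], k=2)
```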
5. The method for extracting knowledge of software vulnerability characteristics based on similarity of N-gram according to claim 1, wherein the step 4) comprises the steps of:
4.1 Performing format conversion on the formed (entity, relation, entity) triples, and constructing a knowledge graph by utilizing neo4j;
4.2 Extracting knowledge from the vulnerability text to be classified to form a vulnerability triplet;
4.3 According to the vulnerability triples and the knowledge graph, analyzing and matching the vulnerability text to be classified to obtain the feature keywords of the vulnerability text to be classified.
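The format conversion of step 4.1 is not detailed in the text. One plausible form, sketched in Python below, renders each triple as a Cypher `MERGE` statement that neo4j can execute. The node label `Entity`, the property `name`, and the helper names are assumptions made for illustration.

```python
def _quote(value):
    """Escape backslashes and single quotes for a Cypher string literal."""
    return value.replace("\\", "\\\\").replace("'", "\\'")


def triple_to_cypher(head, relation, tail):
    """Render one (entity, relation, entity) triple as a Cypher MERGE statement.

    The 'Entity' label and 'name' property are assumed, not taken from the source.
    """
    # Relationship types may not contain spaces or punctuation unless quoted,
    # so map every non-alphanumeric character to '_' and upper-case the result.
    rel_type = "".join(ch if ch.isalnum() else "_" for ch in relation).upper()
    return (
        f"MERGE (h:Entity {{name: '{_quote(head)}'}}) "
        f"MERGE (t:Entity {{name: '{_quote(tail)}'}}) "
        f"MERGE (h)-[:{rel_type}]->(t)"
    )


statement = triple_to_cypher("crafted packet", "exploits", "buffer overflow")
```

Using `MERGE` rather than `CREATE` avoids duplicate nodes when the same entity appears in several triples. In practice, passing the names as query parameters through a neo4j driver is safer than building strings.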
6. The feature knowledge extraction system based on N-element phrase similarity learning is characterized by comprising a vulnerability data set construction module, a candidate keyword extraction module, a keyword entity extraction module and a vulnerability classification module;
the vulnerability data set construction module is used for collecting vulnerability description texts in the vulnerability database, preprocessing the vulnerability description texts and generating standard vulnerability description texts;
the candidate keyword extraction module is used for extracting N-element candidate keywords from the standard vulnerability description text and generating a vulnerability feature description candidate keyword set by utilizing the semantic similarity of the phrase and the text;
the keyword entity extraction module is used for calculating the similarity between the candidate keywords and the standard vulnerability description text after the MASK operation is performed on the candidate keywords in the vulnerability feature description candidate keyword set, sorting the similarities from high to low to generate entity keywords, and forming triples from the entity keywords according to their categories;
the vulnerability classification module is used for formatting the triples and constructing a knowledge graph; and extracting knowledge from the vulnerability text to be classified to form a vulnerability triplet, and analyzing and matching the vulnerability text to be classified according to the vulnerability triplet and the knowledge graph to obtain characteristic keywords of the vulnerability text to be classified.
7. The feature knowledge extraction system based on N-gram similarity learning of claim 6, wherein the vulnerability dataset construction module comprises an acquisition unit and a preprocessing unit;
the acquisition unit is used for acquiring vulnerability description texts from the software security vulnerability list database CWE and the vulnerability disclosure database CVE, the vulnerability description texts comprising the title, description and extended description fields of the texts;
the preprocessing unit is used for preprocessing the collected vulnerability description text, wherein the preprocessing comprises punctuation removal, stop word removal, word segmentation, part-of-speech tagging and lemmatization, and a standard vulnerability description text free of interference information is generated.
8. The feature knowledge extraction system based on N-gram similarity learning of claim 6, wherein the candidate keyword extraction module comprises a word segmentation unit, a coding unit and a traversal calculation unit;
the word segmentation unit is used for segmenting the standard vulnerability description texts, generating for each text a phrase set with N=1, $A_1 = \{w_1, w_2, w_3, w_4, \ldots, w_n\}$, and a phrase set with N=2, $A_2 = \{w_1 w_2, w_2 w_3, w_3 w_4, w_4 w_5, \ldots, w_{n-1} w_n\}$;
the coding unit is used for semantically encoding the phrase sets $A_1$ and $A_2$ and the texts to which the phrase sets belong with the pre-training model BERT, and obtaining the phrase representation of each phrase and the representation of the text to which the phrase belongs by using MaxPooling (maximum pooling);
the traversal calculation unit is used for traversing the phrase sets $A_1$ and $A_2$ and calculating the cosine similarity between the phrase representation of each phrase and the representation of the text to which the phrase belongs; the calculation formula is as follows:
$$\cos(w, T) = \frac{w \cdot T}{\lVert w \rVert \, \lVert T \rVert}$$
wherein $w$ is the representation of a phrase in phrase set $A_1$ or $A_2$, and $T$ is the representation of the text to which the phrase belongs;
the similarity score between each phrase and the text is obtained by calculation, the scores are sorted from high to low, and the first k phrases are taken to form a set, which is the candidate keyword set $B = \{wo_1, wo_2, wo_3, wo_4, \ldots, wo_{20}\}$.
9. The feature knowledge extraction system based on N-gram similarity learning of claim 6, wherein the keyword entity extraction module comprises a MASK operation unit, a sorting unit, a classification unit, a definition unit, and a triplet unit;
the MASK operation unit is configured to calculate similarity between the candidate keyword and the standard vulnerability description text after performing MASK operation on the candidate keyword in the vulnerability feature description candidate keyword set, where the MASK operation formula is:
wherein $T_{mask}$ is the text representation after the candidate keyword MASK operation, and $T_c$ is the text to which the candidate keyword belongs; $T_{mask_i}$ is the i-th text representation, and $T_{c_i}$ is the text to which the i-th candidate keyword belongs;
the sorting unit is used for obtaining the similarity scores between the text representations after the MASK operation of all candidate keywords and the representations of the texts to which the candidate keywords belong, sorting according to the similarity scores, and taking the set formed by the first k (k=6) phrases as the entity keyword set;
the classification unit is used for classifying each entity keyword in the entity keyword set;
the definition unit is used for defining the relation among the entities according to the categories of the entities;
the triplet unit is used for forming (entity, relation, entity) triples according to the entity category and relation of each entity keyword.
10. The feature knowledge extraction system based on N-gram similarity learning of claim 6, wherein the vulnerability classification module comprises a knowledge graph construction unit, a vulnerability triplet unit and an analysis matching unit;
the knowledge graph construction unit is used for performing format conversion on the formed (entity, relation, entity) triples and constructing a knowledge graph with neo4j;
the vulnerability triplet unit is used for extracting knowledge from the vulnerability text to be classified to form a vulnerability triplet;
and the analysis matching unit is used for carrying out analysis matching on the vulnerability text to be classified according to the vulnerability triples and the knowledge graph to obtain characteristic keywords of the vulnerability text to be classified.
CN202211605951.2A 2022-12-12 2022-12-12 Software vulnerability feature knowledge extraction method and system based on N-element phrase similarity Pending CN116561332A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211605951.2A CN116561332A (en) 2022-12-12 2022-12-12 Software vulnerability feature knowledge extraction method and system based on N-element phrase similarity


Publications (1)

Publication Number Publication Date
CN116561332A true CN116561332A (en) 2023-08-08

Family

ID=87488578

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211605951.2A Pending CN116561332A (en) 2022-12-12 2022-12-12 Software vulnerability feature knowledge extraction method and system based on N-element phrase similarity

Country Status (1)

Country Link
CN (1) CN116561332A (en)

Similar Documents

Publication Publication Date Title
CN110598005B (en) Public safety event-oriented multi-source heterogeneous data knowledge graph construction method
CN106886580B (en) Image emotion polarity analysis method based on deep learning
CN111783394B (en) Training method of event extraction model, event extraction method, system and equipment
CN108563638B (en) Microblog emotion analysis method based on topic identification and integrated learning
CN107193796B (en) Public opinion event detection method and device
CN111368049A (en) Information acquisition method and device, electronic equipment and computer readable storage medium
CN111597328B (en) New event theme extraction method
CN113742733B (en) Method and device for extracting trigger words of reading and understanding vulnerability event and identifying vulnerability type
CN111462752B (en) Customer intention recognition method based on attention mechanism, feature embedding and BI-LSTM
CN116775874B (en) Information intelligent classification method and system based on multiple semantic information
CN112926345A (en) Multi-feature fusion neural machine translation error detection method based on data enhancement training
CN110910175A (en) Tourist ticket product portrait generation method
CN108763192B (en) Entity relation extraction method and device for text processing
CN114266256A (en) Method and system for extracting new words in field
CN116451114A (en) Internet of things enterprise classification system and method based on enterprise multisource entity characteristic information
CN114491062B (en) Short text classification method integrating knowledge graph and topic model
Omayio et al. Historical manuscript dating: traditional and current trends
CN114239579A (en) Electric power searchable document extraction method and device based on regular expression and CRF model
CN116305257A (en) Privacy information monitoring device and privacy information monitoring method
CN115713085A (en) Document theme content analysis method and device
CN112397201B (en) Intelligent inquiry system-oriented repeated sentence generation optimization method
CN114416991A (en) Method and system for analyzing text emotion reason based on prompt
CN114298041A (en) Network security named entity identification method and identification device
CN116561332A (en) Software vulnerability feature knowledge extraction method and system based on N-element phrase similarity
CN114491033A (en) Method for building user interest model based on word vector and topic model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination