CN115563619A - Vulnerability similarity comparison method and system based on text pre-training model

Info

Publication number: CN115563619A
Application number: CN202211182151.4A
Authority: CN
Inventors: 宋同庆; 张佳琪; 何召阳; 董昊辰; 刘兵; 郭路路
Original assignee: Beijing Moyun Technology Co ltd
Current assignee: Beijing Moyun Technology Co ltd
Priority date: 2022-09-27
Filing date: 2022-09-27
Publication date: 2023-01-03
Anticipated expiration: 2042-09-27
Also published as: CN115563619B

Abstract

The application discloses a vulnerability similarity comparison method and system based on a text pre-training model. First, a vulnerability text data set of a vulnerability scanning product is acquired and preprocessed to obtain a target vulnerability text. The target vulnerability text is vectorized based on a Sentence-BERT model to obtain a vulnerability text vector. Text word segmentation and subject-word lexicon filtering are performed on the target vulnerability text to extract its subject words. The target vulnerability text is then processed based on vulnerability keyword regular-expression matching and an HMCN model to obtain its vulnerability type. Finally, vulnerability similarity is calculated separately for the vulnerability text vectors, the subject words, and the vulnerability types, and the results are combined by weighted summation to obtain the vulnerability similarity comparison result. The method judges whether two vulnerability texts describe the same vulnerability along three dimensions (text similarity, subject words, and vulnerability type), thereby improving the accuracy of vulnerability similarity judgment.

Description

Vulnerability similarity comparison method and system based on text pre-training model

Technical Field

The invention relates to the field of vulnerability data detection, in particular to a vulnerability similarity comparison method and system based on a text pre-training model.

Background

At present, vulnerability scanning and assessment products mainly rely on vulnerability knowledge bases. A vulnerability knowledge base is a vulnerability library established by national information security centers, information security vendors, and organizations, such as CVE (Common Vulnerabilities and Exposures). Existing vulnerability scanning products often support multiple vulnerability libraries and may even integrate multiple vulnerability scanning technologies. To improve the accuracy of scanning results and to better support vulnerability analysis and risk assessment, a vulnerability similarity comparison technology is needed to normalize similar vulnerabilities.

Existing vulnerability similarity detection technologies mainly comprise rule-matching methods and text-mining methods. Rule-matching methods extract keywords from vulnerability information and use the keyword overlap ratio as the similarity between vulnerabilities. The keywords are usually extracted from information such as the vulnerability description, vulnerability type, and vulnerability risk level. This approach depends on the completeness and consistency of the vulnerability information and does not capture its deeper semantic information. Because different vulnerability scanning technologies follow different specifications, they often describe vulnerability information differently, so misjudgments occur easily. Text-mining methods model and compare vulnerability information mainly with existing Natural Language Processing (NLP) techniques. They convert the vulnerability similarity comparison problem into an NLP text similarity problem: the vulnerability text is vectorized using a Word2Vec word-vector model with TF-IDF (Term Frequency-Inverse Document Frequency) weighting, and the vector similarity is taken as the vulnerability similarity. Compared with rule matching, this approach is more flexible and can extract deeper semantic information from the vulnerability text, which makes up for the shortcomings of rule matching.

However, given the rapid development of NLP, the Word2Vec + TF-IDF technology choice adopted by existing vulnerability similarity comparison techniques is outdated. It is adequate only for simple similarity judgments involving little information. In practice there are many harder cases: for example, two vulnerability texts that are identical except for the asset type, or two texts that describe different vulnerabilities of the same asset. Because such texts differ only slightly, text-mining techniques may still assign them a high similarity even though they do not describe the same vulnerability. A more refined, multidimensional vulnerability similarity comparison technology is therefore needed to judge vulnerability similarity more accurately.

Disclosure of Invention

In view of this, the present application provides a vulnerability similarity comparison method and system based on a text pre-training model, which judge whether two vulnerability texts describe the same vulnerability along three dimensions (text similarity, subject words, and vulnerability type), thereby improving the accuracy of vulnerability similarity judgment.

In a first aspect, a vulnerability similarity comparison method based on a text pre-training model is provided, and the method includes:

acquiring a vulnerability text data set of a vulnerability scanning product;

preprocessing a vulnerability text data set to obtain a target vulnerability text;

vectorizing the target vulnerability text based on a pre-trained Sentence-BERT model to obtain a vulnerability text vector, the vulnerability text vector representing the semantic information of a sentence in a vector space;

performing text word segmentation and subject-word lexicon filtering on the target vulnerability text, and extracting the subject words of the target vulnerability text;

processing the target vulnerability text based on vulnerability keyword regular-expression matching and an HMCN model to obtain the vulnerability type of the target vulnerability text;

and performing vulnerability similarity calculation separately on the obtained vulnerability text vector, subject words, and vulnerability type, and performing weighted summation on the resulting vulnerability similarity calculation results to obtain a vulnerability similarity comparison result.

Optionally, the preprocessing the vulnerability text data set includes:

filtering out of the vulnerability text data set texts whose descriptions are too short and/or too long, and converting English to lowercase.

Optionally, vectorizing the target vulnerability text based on a pre-trained Sentence-BERT model to obtain a vulnerability text vector includes:

generating semantically meaningful sentence embedding vectors using a Siamese network model and a triplet network model.

Optionally, performing text word segmentation and subject-word lexicon filtering on the target vulnerability text and extracting the subject words of the target vulnerability text includes:

extracting the English part of the vulnerability text, performing word segmentation on it, comparing the result with an English subject-word lexicon, and taking the words that appear in a preset word list as the subject words of the vulnerability text, wherein the preset word list is a manually curated list of words of interest.

Optionally, performing vulnerability similarity calculation on the obtained vulnerability text vectors includes:

calculating a first vulnerability similarity calculation result based on the cosine similarity between the vulnerability text vectors.

Optionally, performing vulnerability similarity calculation on the obtained subject words includes:

acquiring a subject word list and a position weight list for each vulnerability text;

acquiring the intersection parts of the subject word lists and of the position weight lists of the two vulnerability texts;

and obtaining a second vulnerability similarity calculation result according to the formula

wherein A represents the target vulnerability text, B represents the comparison vulnerability text, SPL_A(i) represents the position weight of subject word i in the target vulnerability text, SPL_B(i) represents the position weight of subject word i in the comparison vulnerability text, and n represents the length of the shared subject word list.

Optionally, the vulnerability similarity calculation is performed on the obtained vulnerability types, and includes:

when the types of the two vulnerability texts in a pair are the same, setting the third vulnerability similarity calculation result to 1;

and when the types of the two vulnerability texts in a pair are different, setting the third vulnerability similarity calculation result to 0.

In a second aspect, a vulnerability similarity comparison system based on a text pre-training model is provided, and the system includes:

the acquisition module is used for acquiring a vulnerability text data set of a vulnerability scanning product;

the preprocessing module is used for preprocessing the vulnerability text data set to obtain a target vulnerability text;

the vectorization module is used for vectorizing the target vulnerability text based on a pre-trained Sentence-BERT model to obtain a vulnerability text vector, the vulnerability text vector representing the semantic information of a sentence in a vector space;

the extraction module is used for performing text word segmentation and subject-word lexicon filtering on the target vulnerability text and extracting the subject words of the target vulnerability text;

the processing module is used for processing the target vulnerability text based on vulnerability keyword regular-expression matching and an HMCN model to obtain the vulnerability type of the target vulnerability text;

and the calculation module is used for performing vulnerability similarity calculation separately on the obtained vulnerability text vectors, subject words, and vulnerability types, and performing weighted summation on the resulting vulnerability similarity calculation results to obtain a vulnerability similarity comparison result.

Optionally, the preprocessing module specifically includes:

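filtering out of the vulnerability text data set texts whose descriptions are too short and/or too long.

Optionally, the vectorization module is specifically used for:

generating semantically meaningful sentence embedding vectors using a Siamese network model and a triplet network model.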

According to the technical solution provided by the embodiments of the application, a vulnerability text data set of a vulnerability scanning product is first acquired and preprocessed to obtain a target vulnerability text. The target vulnerability text is vectorized based on a pre-trained Sentence-BERT model to obtain a vulnerability text vector. Text word segmentation and subject-word lexicon filtering are performed on the target vulnerability text to extract its subject words. The target vulnerability text is then processed based on vulnerability keyword regular-expression matching and the HMCN model to obtain its vulnerability type. Finally, vulnerability similarity is calculated separately for the vulnerability text vectors, subject words, and vulnerability types, and the results are combined by weighted summation to obtain the vulnerability similarity comparison result. The beneficial effects of the present invention therefore include at least the following:

(1) The comparison is multidimensional, so the calculation accuracy is high;

(2) Large numbers of matching rules are not needed, so the calculation efficiency is high;

(3) The model can be reused once trained, so the maintenance labor cost is low;

(4) The similarity calculation is flexible, and the false-positive rate is low;

(5) The degree of encapsulation is high, so little specialist expertise is required of the user.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It should be apparent that the drawings in the following description are merely exemplary, and that other embodiments can be derived from the drawings provided by those of ordinary skill in the art without inventive effort.

Fig. 1 is a flowchart of a vulnerability similarity comparison method based on a text pre-training model according to an embodiment of the present application;

Fig. 2 is a flowchart of vulnerability text subject word extraction provided in an embodiment of the present application;

Fig. 3 is a flowchart of vulnerability text type identification provided in an embodiment of the present application.

Detailed Description

To make the objects, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein merely illustrate the present application and do not limit it.

In the description of the present invention, "a plurality" means two or more unless otherwise specified. The terms "first," "second," "third," "fourth," and the like in the description, the claims, and the above-described drawings are used to distinguish between the referenced items. For a scheme with a time-sequenced flow, these terms do not necessarily describe a specific sequence or order; for a scheme with a device structure, they do not imply any distinction in importance, positional relationship, or the like.

Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements specifically listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus or added steps as further optimized based on the inventive concept.

The application provides a multidimensional vulnerability similarity comparison technology based on a text pre-training model. The technology judges whether two vulnerability texts describe the same vulnerability mainly along three dimensions: text similarity, subject words, and vulnerability type. First, a Sentence-BERT text pre-training model is applied to vectorize the vulnerability text and obtain the semantic information of the sentence in a vector space. Next, the subject word list of the vulnerability description is obtained through text word segmentation and subject-word lexicon filtering. Then, the specific vulnerability type of the vulnerability description is obtained through an HMCN (Hierarchical Multi-Label Classification Networks) model. Finally, the scores of the three dimensions are combined by weighted summation to calculate the similarity between the vulnerability texts. Specifically, Fig. 1 shows a flowchart of a vulnerability similarity comparison method based on a text pre-training model according to an embodiment of the present application. The method may include the following steps:

S1: acquiring a vulnerability text data set of a vulnerability scanning product.

In this embodiment, vulnerability text data sets obtained by different vulnerability scanning products can be integrated.

S2: preprocessing the vulnerability text data set to obtain a target vulnerability text.

In this embodiment, a series of data preprocessing operations is performed on the vulnerability text data set, including filtering out texts whose descriptions are too short or too long and converting English to lowercase.
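The preprocessing in S2 can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: the source gives no concrete length thresholds, so `MIN_CHARS` and `MAX_CHARS` are hypothetical values, and `preprocess` is a name chosen only for this example.

```python
from typing import Iterable, List

# Hypothetical bounds: the source only says "too short" and "too long".
MIN_CHARS = 10
MAX_CHARS = 2000


def preprocess(texts: Iterable[str],
               min_chars: int = MIN_CHARS,
               max_chars: int = MAX_CHARS) -> List[str]:
    """Drop descriptions outside the length bounds and lowercase the rest."""
    cleaned = []
    for text in texts:
        text = text.strip()
        if not (min_chars <= len(text) <= max_chars):
            continue  # description too short or too long
        # str.lower() only affects cased characters, so Chinese text is unchanged.
        cleaned.append(text.lower())
    return cleaned
```

Lowercasing before any later matching step means the subject-word lexicon and keyword patterns only need lowercase entries.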

S3: vectorizing the target vulnerability text based on a pre-trained Sentence-BERT model to obtain a vulnerability text vector.

The vulnerability text vector represents the semantic information of the sentence in a vector space.

In this embodiment, the vulnerability text is input into a Sentence-BERT (SBERT) pre-trained model to obtain a sentence vector. The model builds on pre-trained BERT and uses Siamese and triplet network structures to generate semantically meaningful sentence embedding vectors. Because Chinese vulnerability texts contain both Chinese and English (assets are mostly described in English), the multilingual SBERT pre-trained model paraphrase-multilingual-MiniLM-L12-v2 is selected as the base model in this embodiment, and it is fine-tuned on a labeled data set to obtain the final SBERT model. The model can map any text to a sentence vector of a specified dimensionality, and the vector carries rich semantic information.
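As a rough sketch of the vectorization step, the snippet below loads the named base model through the third-party `sentence-transformers` package and encodes a batch of texts. It assumes that package is installed and covers only inference with the base model; the fine-tuning on labeled data described above is not shown. The helper names `load_base_model` and `embed_texts` are hypothetical.

```python
from typing import Any, List, Sequence

# Base model named in the embodiment.
MODEL_NAME = "paraphrase-multilingual-MiniLM-L12-v2"


def load_base_model(name: str = MODEL_NAME) -> Any:
    """Load the SBERT base model (assumes sentence-transformers is installed)."""
    # Imported lazily so embed_texts can be used with any object exposing .encode().
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer(name)


def embed_texts(texts: Sequence[str], model: Any) -> List[List[float]]:
    """Map each text to one sentence vector using the model's encode() method."""
    vectors = model.encode(list(texts))
    return [[float(v) for v in vec] for vec in vectors]
```

A typical call would be `embed_texts(cleaned_texts, load_base_model())`, which yields one vector per vulnerability text for the cosine comparison in S6.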

S4: performing text word segmentation and subject-word lexicon filtering on the target vulnerability text, and extracting the subject words of the target vulnerability text.

In this embodiment, the (English) subject words of the vulnerability text are extracted using word segmentation and an asset lexicon; the extraction flow is shown in Fig. 2. Because most subject words are English, the method extracts all English parts of the vulnerability text, segments them into words, compares the words with an English subject-word lexicon, and retains only the meaningful words as the subject words of the vulnerability text.
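The extraction flow can be illustrated with a small sketch. The lexicon contents and the tokenization rule are assumptions made for this example: `ASSET_LEXICON` is a hypothetical stand-in for the English subject-word lexicon, and a simple regular expression stands in for the word segmentation step.

```python
import re
from typing import List, Set

# Hypothetical lexicon; the real subject-word lexicon is not given in the source.
ASSET_LEXICON: Set[str] = {"apache", "tomcat", "nginx", "openssl", "mysql"}

# Assumed tokenization: runs of ASCII letters/digits starting with a letter.
_ENGLISH_TOKEN = re.compile(r"[a-z][a-z0-9]*")


def extract_subject_words(text: str, lexicon: Set[str] = ASSET_LEXICON) -> List[str]:
    """Return lexicon words found in the English part of the text, in order."""
    tokens = _ENGLISH_TOKEN.findall(text.lower())
    seen = set()
    words = []
    for token in tokens:
        if token in lexicon and token not in seen:
            seen.add(token)        # keep only the first occurrence
            words.append(token)    # order matters for the position weights
    return words
```

The order of the returned list is preserved because the later subject-word similarity depends on each word's position.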

S5: processing the target vulnerability text based on vulnerability keyword regular-expression matching and the HMCN model to obtain the vulnerability type of the target vulnerability text.

In this embodiment, the vulnerability type of the vulnerability text is predicted through vulnerability keyword regular-expression matching and an HMCN (Hierarchical Multi-Label Classification Networks) model. Fig. 3 shows the overall flow of the vulnerability type identification scheme. Keyword regular-expression matching, which is cheap and performs well, is tried first. The vulnerability description or vulnerability name often directly contains common keywords for the vulnerability, so the type can be identified quickly by regular-expression matching. If the text contains no such keywords, the CWE number of the vulnerability text is used: the CWE number indicates a detailed vulnerability type, and the type can be determined directly from the correspondence between CWE numbers and vulnerability types. If neither rule-based approach determines the type, a classification model is used to predict the CWE number for the text, and the vulnerability type is then identified from the predicted CWE number.
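The three-stage fallback described above can be sketched as a simple cascade. Everything concrete here is illustrative: the keyword patterns, the type labels, and the two-entry CWE table are hypothetical, and `predict_cwe` is a placeholder callable standing in for the trained HMCN classifier, which is not implemented.

```python
import re
from typing import Callable, Optional

# Hypothetical keyword rules and type labels for illustration only.
KEYWORD_PATTERNS = {
    "sql_injection": re.compile(r"sql\s*(?:injection|注入)", re.IGNORECASE),
    "xss": re.compile(r"cross[- ]site scripting|跨站脚本", re.IGNORECASE),
}
# Partial, illustrative CWE-to-type table (CWE-89: SQL injection, CWE-79: XSS).
CWE_TO_TYPE = {"CWE-89": "sql_injection", "CWE-79": "xss"}
_CWE_ID = re.compile(r"CWE-\d+", re.IGNORECASE)


def identify_type(text: str, predict_cwe: Callable[[str], str]) -> Optional[str]:
    """Keyword rules first, then an explicit CWE number, then the classifier."""
    # Stage 1: cheap keyword matching on the description or name.
    for vul_type, pattern in KEYWORD_PATTERNS.items():
        if pattern.search(text):
            return vul_type
    # Stage 2: a CWE number present in the text maps directly to a type.
    match = _CWE_ID.search(text)
    if match:
        vul_type = CWE_TO_TYPE.get(match.group(0).upper())
        if vul_type is not None:
            return vul_type
    # Stage 3: fall back to the classifier's predicted CWE number.
    return CWE_TO_TYPE.get(predict_cwe(text))
```

Ordering the stages from cheapest to most expensive means the classifier runs only on texts that the rules cannot resolve.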

S6: performing vulnerability similarity calculation separately on the obtained vulnerability text vector, subject words, and vulnerability type, and performing weighted summation on the resulting vulnerability similarity calculation results to obtain a vulnerability similarity comparison result.

In this embodiment, vulnerability similarity is calculated from the <vulnerability text vector, subject words, vulnerability type> triple obtained through the above steps. For any pair of vulnerability texts, performing the vulnerability text similarity calculation, the subject-word similarity calculation, and the vulnerability type identification described above yields scores in three dimensions (namely the first, second, and third vulnerability similarity calculation results). In this embodiment, the three scores are combined by weighting to obtain the final similarity score of the pair. The vulnerability text similarity is calculated as the cosine similarity between the vulnerability text vectors:

$$\mathrm{TextSimilarity}(x, y) = \cos(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}$$

where x and y denote the text vectors of the pair of vulnerability texts whose text similarity is to be computed.
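For concreteness, cosine similarity between two vulnerability text vectors can be computed as in the following dependency-free sketch. The function name and the choice to return 0.0 for a zero vector are assumptions of this example, not details from the source.

```python
import math
from typing import Sequence


def cosine_similarity(x: Sequence[float], y: Sequence[float]) -> float:
    """Cosine of the angle between x and y; 0.0 if either vector is all zeros."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # assumed convention for degenerate input
    return dot / (norm_x * norm_y)
```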

For the subject-word similarity, the present application first acquires, for each vulnerability text, a subject word list WL (Word List) and a position weight list PL (Position List), where PL(i) = len(WL) - i - 1. It then takes the intersection parts SWL (Same Word List) and SPL (Same Position List) of the WL and PL of the two vulnerability texts and calculates the similarity with the following formula, where n = len(SWL). The formula is modeled on the cosine similarity formula and measures the similarity between subject word lists in terms of both word overlap and position overlap.

In the formula, A represents the target vulnerability text, B represents the comparison vulnerability text, SPL_A(i) represents the position weight of subject word i in the target vulnerability text, SPL_B(i) represents the position weight of subject word i in the comparison vulnerability text, and n represents the length of the shared subject word list SWL.
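The subject-word similarity formula itself is not reproduced in this text, so the sketch below is only one plausible reading of the description, not the patented formula. It assumes a cosine-style ratio: the numerator sums SPL_A(i) × SPL_B(i) over the shared words, and the denominator is the product of the Euclidean norms of the two full position weight lists, so that unshared words lower the score. Position weights follow PL(i) = len(WL) - i - 1 with zero-based i.

```python
import math
from typing import Dict, List


def position_weights(words: List[str]) -> Dict[str, int]:
    """PL(i) = len(WL) - i - 1, keyed by word (first occurrence wins)."""
    weights: Dict[str, int] = {}
    for i, word in enumerate(words):
        weights.setdefault(word, len(words) - i - 1)
    return weights


def subject_word_similarity(wl_a: List[str], wl_b: List[str]) -> float:
    """Assumed cosine-style similarity over shared subject words."""
    pl_a = position_weights(wl_a)
    pl_b = position_weights(wl_b)
    shared = [w for w in pl_a if w in pl_b]            # SWL
    numerator = sum(pl_a[w] * pl_b[w] for w in shared)  # sum of SPL_A(i) * SPL_B(i)
    norm_a = math.sqrt(sum(v * v for v in pl_a.values()))
    norm_b = math.sqrt(sum(v * v for v in pl_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # e.g. empty or single-word lists, whose weights are all 0
    return numerator / (norm_a * norm_b)
```

Under this reading the last word of each list has weight 0, so a single-word list always scores 0.0; a real implementation would need to decide how to handle that case.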

For the vulnerability type similarity, an AND-style comparison is used directly: the result is 1 if the two vulnerability texts have the same type and 0 otherwise. After the scores of the three dimensions are obtained, the final similarity score is calculated as Score(X, Y) = 0.6 × TextSimilarity(X, Y) + 0.2 × EntitySimilarity(X, Y) + 0.2 × (VulType(X) & VulType(Y)), where the weight of each dimension can be adjusted according to actual conditions.
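The weighted combination can be written out directly. In this sketch the default weights are the 0.6 / 0.2 / 0.2 values from the formula above; the function names and the representation of vulnerability types as strings are assumptions of the example.

```python
from typing import Tuple


def type_similarity(type_a: str, type_b: str) -> float:
    """1.0 when both texts have the same vulnerability type, else 0.0."""
    return 1.0 if type_a == type_b else 0.0


def final_score(text_sim: float,
                entity_sim: float,
                type_a: str,
                type_b: str,
                weights: Tuple[float, float, float] = (0.6, 0.2, 0.2)) -> float:
    """Weighted sum of text, subject-word, and type similarity."""
    w_text, w_entity, w_type = weights
    return (w_text * text_sim
            + w_entity * entity_sim
            + w_type * type_similarity(type_a, type_b))
```

For example, a pair with text similarity 0.5, no shared subject words, and different types scores 0.6 × 0.5 = 0.3, while a pair that matches fully in all three dimensions scores 1.0.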

The embodiment of the application further provides a vulnerability similarity comparison system based on the text pre-training model. The system comprises:

the processing module is used for processing the target vulnerability text based on vulnerability keyword regular-expression matching and the HMCN model to obtain the vulnerability type of the target vulnerability text;

and the calculation module is used for performing vulnerability similarity calculation separately on the obtained vulnerability text vector, subject words, and vulnerability type, and performing weighted summation on the resulting vulnerability similarity calculation results to obtain a vulnerability similarity comparison result.

In an optional embodiment of the present application, the preprocessing module is specifically used for filtering out of the vulnerability text data set texts whose descriptions are too short and/or too long.

In an optional embodiment of the present application, the vectorization module is specifically used for generating semantically meaningful sentence embedding vectors using a Siamese network model and a triplet network model.

The vulnerability similarity comparison system based on the text pre-training model provided by the embodiments of the application is used to implement the vulnerability similarity comparison method described above. For the specific limitations of the system, reference can be made to the limitations of the method, which are not repeated here. Each part of the system can be implemented wholly or partially in software, hardware, or a combination thereof. The modules can be embedded in, or be independent of, a processor in the device in hardware form, or can be stored in a memory of the device in software form, so that the processor can call them and execute the operations corresponding to each module.

The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.

The above-mentioned embodiments only express several implementation modes of the present application, and the description thereof is specific and detailed, but not construed as limiting the scope of the claims. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims

1. A vulnerability similarity comparison method based on a text pre-training model is characterized by comprising the following steps:

acquiring a vulnerability text data set of a vulnerability scanning product;

and performing vulnerability similarity calculation separately on the obtained vulnerability text vector, subject words, and vulnerability type, and performing weighted summation on the resulting vulnerability similarity calculation results to obtain a vulnerability similarity comparison result.

2. The method of claim 1, wherein preprocessing the vulnerability text data set comprises:

filtering out of the vulnerability text data set texts whose descriptions are too short and/or too long, and converting English to lowercase.

3. The method of claim 1, wherein vectorizing the target vulnerability text based on a pre-trained Sentence-BERT model to obtain a vulnerability text vector comprises:

4. The method according to claim 1, wherein performing text word segmentation and subject-word lexicon filtering on the target vulnerability text to extract the subject words of the target vulnerability text comprises:

5. The method according to claim 1, wherein the vulnerability similarity calculation of the obtained vulnerability text vectors comprises:

6. The method of claim 1, wherein performing vulnerability similarity calculation on the obtained subject words comprises:

acquiring a subject word list and a position weight list for each vulnerability text;

and obtaining a second vulnerability similarity calculation result according to the formula

wherein A represents the target vulnerability text, B represents the comparison vulnerability text, SPL_A(i) represents the position weight of subject word i in the target vulnerability text, SPL_B(i) represents the position weight of subject word i in the comparison vulnerability text, and n represents the length of the shared subject word list.

7. The method of claim 1, wherein performing vulnerability similarity calculation on the obtained vulnerability types comprises:

8. A vulnerability similarity comparison system based on a text pre-training model is characterized by comprising:

9. The system of claim 8, wherein the preprocessing module specifically comprises:

filtering out of the vulnerability text data set texts whose descriptions are too short and/or too long.

10. The system of claim 9, wherein the vectorization module specifically comprises: