Detailed Description
In order to make the objectives, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit it. All other embodiments that a person skilled in the art can derive from the embodiments given herein without creative effort shall fall within the protection scope of the present application.
To achieve accurate vulnerability matching, the present application provides a vulnerability matching method, a vulnerability matching apparatus, a computer device and a readable storage medium. In the vulnerability matching method, vulnerability information of two vulnerabilities to be matched is first obtained, and the similarity of the two vulnerabilities in multiple dimensions is then calculated according to the vulnerability information, that is, the similarity of the two vulnerabilities is measured from different dimensions. Next, the comprehensive similarity of the two vulnerabilities is calculated with reference to the similarities in the different dimensions, and finally whether the two vulnerabilities match is judged according to the comprehensive similarity, where a higher comprehensive similarity indicates a higher probability that the two vulnerabilities match. Because the comprehensive similarity combines the similarities in multiple dimensions, it reflects how similar the vulnerabilities are in each of those dimensions, so that accurate and effective vulnerability matching can be achieved.
Specific embodiments of the vulnerability matching method, apparatus, computer device and readable storage medium provided by the present application will be described in detail below.
Example one
This embodiment of the application provides a vulnerability matching method by which the accuracy of vulnerability matching can be improved. Specifically, Fig. 1 is a flowchart of the vulnerability matching method provided in the first embodiment of the present application. As shown in Fig. 1, the vulnerability matching method provided in this embodiment includes the following steps S101 to S104.
Step S101: acquiring vulnerability information of two vulnerabilities to be matched.
When two vulnerabilities are to be matched, the vulnerability information of the two vulnerabilities is obtained first; when a plurality of vulnerabilities are to be matched, the vulnerability information of the plurality of vulnerabilities can be obtained in this step. The vulnerability information includes information that describes and defines a vulnerability from different dimensions. Specifically, the vulnerability information may include a plurality of fields, each of which describes and defines the vulnerability from one dimension, or a single field may include information from a plurality of dimensions, which is not limited in the present application.
Step S102: calculating the similarity of the two vulnerabilities in multiple dimensions according to the vulnerability information.
In this step, for each dimension, the similarity of the two vulnerabilities is calculated using the part or all of the vulnerability information that the dimension requires. For example, a similarity may be calculated for the semantic dimension, for the dimension of language description similarity, and for characteristics of the vulnerability itself (including the source of the vulnerability, the time it was generated and/or its solution, etc.). Different calculation models can be preset for different dimensions and the calculation performed through those models. For example, a deep learning model such as a neural network can be used to calculate the confidence that two vulnerabilities belong to the same category in a certain dimension.
Optionally, when calculating the similarity of two vulnerabilities in multiple dimensions, the calculation is performed through the following partial or all steps: calculating semantic similarity of the two vulnerabilities according to description information of the vulnerability information and a preset machine learning model; calculating the description similarity of the two vulnerabilities according to the number of the same product names in the description information; calculating the version similarity of the two vulnerabilities according to the version number in the description information; calculating the URL similarity of the two vulnerabilities according to the relevance degree of URL information in the reference information of the vulnerability information; and calculating the title similarity of the two vulnerabilities according to the number of the same product names in the title information of the vulnerability information.
That is, in this step, some or all of the semantic similarity, description similarity, version similarity, URL similarity and title similarity may be calculated.
Step S103: calculating the comprehensive similarity of the two vulnerabilities according to the similarities in multiple dimensions.
In this step, when calculating the comprehensive similarity of two vulnerabilities, the weighted summation can be performed on the similarities in different dimensions, a larger weight can be set for the similarity with a larger influence, and a smaller weight can be set for the similarity with a smaller influence.
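The weighted summation can be illustrated with a minimal sketch. The dimension names and the weight values below are assumptions chosen for illustration only; the text does not prescribe specific weights.

```python
def comprehensive_similarity(similarities, weights):
    """Return the weighted sum of per-dimension similarities.

    Both arguments are dicts keyed by dimension name. The names and the
    weight values used below are illustrative assumptions.
    """
    missing = set(similarities) - set(weights)
    if missing:
        raise KeyError(f"no weight for dimensions: {sorted(missing)}")
    return sum(weights[name] * value for name, value in similarities.items())


# Hypothetical weights: semantic similarity is assumed to have the largest influence.
WEIGHTS = {"semantic": 0.5, "description": 0.2, "version": 0.1, "url": 0.1, "title": 0.1}

scores = {"semantic": 0.9, "description": 1.0, "version": 0.8, "url": 1.0, "title": 0.5}
total = comprehensive_similarity(scores, WEIGHTS)
```

Because the hypothetical weights sum to 1 and each similarity lies between 0 and 1, the result also lies between 0 and 1, which makes it convenient to compare against a fixed threshold.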
Step S104: judging whether the two vulnerabilities match according to the comprehensive similarity.
Optionally, a comprehensive similarity threshold may be set. When the calculated comprehensive similarity of the two vulnerabilities reaches the threshold, the two vulnerabilities are determined to be similar, that is, they match; when the calculated comprehensive similarity is smaller than the threshold, the two vulnerabilities are determined to be dissimilar, that is, they do not match. Alternatively, one vulnerability is matched against a plurality of vulnerabilities, the pair of vulnerabilities with the highest comprehensive similarity is selected as the target matching vulnerabilities, and the target matching vulnerabilities are further analyzed to determine whether they match.
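Both decision modes described above can be sketched as follows. The threshold value 0.8 and the candidate identifiers are hypothetical; the text leaves the threshold unspecified.

```python
def is_match(score, threshold=0.8):
    """Threshold mode: the pair matches when the score reaches the threshold.

    The default threshold of 0.8 is an assumed value, not one from the text.
    """
    return score >= threshold


def best_candidate(scores_by_id):
    """Ranking mode: return (candidate_id, score) with the highest score.

    The returned pair is only a target for further analysis, not a final
    verdict that the two vulnerabilities match.
    """
    return max(scores_by_id.items(), key=lambda item: item[1])


# Hypothetical comprehensive similarities of one vulnerability against three candidates.
candidates = {"candidate-A": 0.31, "candidate-B": 0.92, "candidate-C": 0.47}
target = best_candidate(candidates)
```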
In the vulnerability matching method provided in this embodiment, vulnerability information of two vulnerabilities to be matched is obtained, the similarities of the two vulnerabilities are calculated from different dimensions according to the vulnerability information, the comprehensive similarity of the two vulnerabilities is then calculated with reference to the similarities in the multiple dimensions, and finally whether the two vulnerabilities match is judged according to the comprehensive similarity. With the vulnerability matching method provided by this embodiment, the comprehensive similarity takes the similarities in multiple dimensions into account, and the final matching decision is made according to the comprehensive similarity, so that more accurate vulnerability matching can be achieved.
Example two
This embodiment of the application provides a vulnerability matching method by which automatic vulnerability matching can be achieved. Because the information contained in the vulnerability description information is fully utilized, and the comprehensive similarity of the vulnerabilities is calculated from the semantic similarity, the description similarity and the version similarity, the accuracy of vulnerability matching can be improved. Specifically, Fig. 2 is a flowchart of the vulnerability matching method provided in the second embodiment of the present application. As shown in Fig. 2, the vulnerability matching method provided in this embodiment includes the following steps S201 to S206.
Step S201: acquiring the description information of the vulnerabilities.
For a known vulnerability in the vulnerability database, the vulnerability information usually includes description information, which includes the related content of the entity affected by the vulnerability, the type of the vulnerability, the generation reason, the attack mode of the vulnerability, the generation effect, and the like.
Take the Microsoft Exchange Server Outlook Web App cross-site scripting vulnerability (CNNVD-201503-. The description information includes the following contents:
Microsoft Exchange Server is a suite of email service programs from Microsoft Corporation of the United States. It provides functions such as mail access, storage, forwarding, voice mail and mail filtering. Outlook Web App (OWA) is a web browser version of it that is used to access Exchange mailboxes. A cross-site scripting vulnerability exists in Microsoft Exchange Server, and the vulnerability results from the program failing to properly sanitize page content in OWA. By modifying attributes in OWA and then enticing a user to browse the target OWA site, an attacker can run scripts in the context of the current user. The following products and versions are affected: Microsoft Exchange Server 2013SP1, Cumulative Update 7.
In step S201, description information of each vulnerability to be matched is acquired.
Step S202: calculating the semantic similarity of the two vulnerabilities according to the description information of the two vulnerabilities and a preset machine learning model.
The preset machine learning model can be obtained by training on vulnerability samples with known matching results. Through this training, the machine learning model acquires the ability to judge, at the semantic level, how similar two vulnerabilities are from their description information, so that the semantic similarity of the two vulnerabilities to be matched can be calculated by the machine learning model from their description information. Optionally, the machine learning model is a classification model that outputs a number between 0 and 1 as the semantic similarity: the closer the output value is to 1, the greater the probability that the two vulnerabilities are similar, and the closer the output value is to 0, the smaller that probability.
Step S203: calculating the description similarity of the two vulnerabilities according to the number of identical product names in the description information of the two vulnerabilities.
The product name in the vulnerability description information generally refers to the name of the product affected by the vulnerability, for example, Microsoft Exchange Server referred to in the above example belongs to the product name.
Specifically, in step S203, the description similarity of the two vulnerabilities is calculated based on the number of identical product names in the description information, where the larger the number of identical product names, the larger the description similarity.
Optionally, when the description similarity of the two vulnerabilities is calculated according to the number of identical product names in the description information of the two vulnerabilities, the specifically executed steps include: obtaining the product names in the description information of each of the two vulnerabilities to obtain a first product name sequence corresponding to each vulnerability; calculating the number of identical product names in the two first product name sequences to obtain a first quantity value; determining the number of product names in the shorter of the two first product name sequences to obtain a second quantity value; and calculating the ratio of the first quantity value to the second quantity value to obtain the description similarity.
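The ratio described above can be sketched as follows. Extraction of product names from the text is assumed to have been done upstream, and treating each sequence as a set of distinct names is an assumption of this sketch rather than something the text specifies.

```python
def name_overlap_similarity(names_a, names_b):
    """Ratio of shared product names to the size of the smaller name list.

    Assumption: duplicates within one list are collapsed, so each list is
    treated as a set of distinct product names.
    """
    set_a, set_b = set(names_a), set(names_b)
    if not set_a or not set_b:
        return 0.0  # no product names on one side: nothing to compare
    shared = len(set_a & set_b)              # number of identical product names
    smaller = min(len(set_a), len(set_b))    # size of the shorter name list
    return shared / smaller


a = ["Microsoft Exchange Server", "Outlook Web App"]
b = ["Microsoft Exchange Server"]
description_similarity = name_overlap_similarity(a, b)
```

Dividing by the size of the smaller list means that a vulnerability listing one product can still reach a similarity of 1 against a vulnerability listing that product among several others.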
Step S204: calculating the version similarity of the two vulnerabilities according to the similarity of the version numbers in the description information of the two vulnerabilities.
The version number in the vulnerability description information generally refers to the version of the product affected by the vulnerability, for example, 2013SP1 referred to in the above example belongs to the version number.
Specifically, in this step S204, the version similarity of the two vulnerabilities is calculated based on the similarity of the version numbers in the description information, where the similarity of the version number strings may be used as the similarity of the version numbers, and the greater the similarity of the version numbers in the description information of the two vulnerabilities, the greater the calculated version similarity.
Optionally, when the version similarity of the two vulnerabilities is calculated according to the similarity of the version numbers in the description information of the two vulnerabilities, the specifically executed steps include: obtaining the version numbers in the description information of each of the two vulnerabilities to obtain a version number list corresponding to each vulnerability; calculating, in sequence, the similarity between each version number in one list and each version number in the other list; and selecting the maximum of these version number similarities as the version similarity.
Step S205: calculating the comprehensive similarity of the two vulnerabilities according to the semantic similarity, the description similarity and the version similarity.
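A minimal sketch of taking the maximum pairwise version number similarity is given below. The text only states that string similarity may be used; the choice of `difflib.SequenceMatcher` from the Python standard library as the string similarity measure is an assumption of this sketch.

```python
from difflib import SequenceMatcher


def version_similarity(versions_a, versions_b):
    """Maximum string similarity over all pairs of version numbers.

    SequenceMatcher.ratio() is an assumed choice of string similarity;
    any measure returning a value between 0 and 1 could be substituted.
    """
    if not versions_a or not versions_b:
        return 0.0  # no version numbers on one side
    return max(
        SequenceMatcher(None, a, b).ratio()
        for a in versions_a
        for b in versions_b
    )


same = version_similarity(["2013SP1"], ["2010", "2013SP1"])
different = version_similarity(["2013SP1"], ["2010"])
```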
Specifically, when calculating the comprehensive similarity of two vulnerabilities, simultaneously taking the semantic similarity, the description similarity and the version similarity as reference factors, wherein the larger the semantic similarity is, the larger the comprehensive similarity is; the greater the description similarity is, the greater the comprehensive similarity is; the greater the version similarity, the greater the overall similarity.
Optionally, different weights are assigned to the semantic similarity, the description similarity and the version similarity, and the weighted sum of the semantic similarity, the description similarity and the version similarity is used as the comprehensive similarity.
Step S206: judging whether the two vulnerabilities match according to the comprehensive similarity.
Optionally, a comprehensive similarity threshold may be set. When the calculated comprehensive similarity of the two vulnerabilities reaches the threshold, the two vulnerabilities are determined to be similar, that is, they match; when the calculated comprehensive similarity is smaller than the threshold, the two vulnerabilities are determined to be dissimilar, that is, they do not match. Alternatively, one vulnerability is matched against a plurality of vulnerabilities, the pair of vulnerabilities with the highest comprehensive similarity is selected as the target matching vulnerabilities, and the target matching vulnerabilities are further analyzed to determine whether they match.
In the vulnerability matching method provided by this embodiment, the description information of the vulnerabilities is obtained; the semantic similarity of the two vulnerabilities is calculated from their description information using a preset machine learning model; the description similarity is calculated according to the number of identical product names in the description information of the two vulnerabilities; and the version similarity is calculated according to the similarity of the version numbers in the description information of the two vulnerabilities. The comprehensive similarity of the two vulnerabilities is then calculated with reference to the semantic similarity, the description similarity and the version similarity together, and finally whether the two vulnerabilities match is judged according to the comprehensive similarity. With this method, the semantic similarity uses the semantics of the description information through the machine learning model, the description similarity uses the product names, the version similarity uses the version numbers, and the three are combined into the comprehensive similarity. The content of the vulnerability description information is thus fully utilized, vulnerabilities are matched automatically, and matching is efficient and fast while remaining accurate and effective. When the method is used to match vulnerabilities in different vulnerability libraries, matching efficiency can be improved and the time and labor costs of enterprises reduced.
Optionally, in an embodiment, the vulnerability matching method further includes: acquiring reference information of the vulnerability; calculating the URL similarity of the two vulnerabilities according to the relevance degree of URL information in the reference information of the two vulnerabilities; the step of calculating the comprehensive similarity of the two vulnerabilities according to the semantic similarity, the description similarity and the version similarity comprises the following steps: and calculating the comprehensive similarity of the two vulnerabilities according to the semantic similarity, the description similarity, the version similarity and the URL similarity.
Specifically, for known vulnerabilities in a vulnerability database, reference information of the vulnerabilities is stored in the database. The reference information includes URL information such as the source websites of the vulnerability description information and other related content, and the source websites of vulnerability solutions. When two vulnerabilities match, their URL information tends to overlap, so adding the URL information in the reference information as a reference factor when calculating the comprehensive similarity of the two vulnerabilities can further improve the accuracy of the comprehensive similarity. After the reference information of the vulnerabilities is obtained, the URL information can be extracted from it, and the URL similarity of the two vulnerabilities is then calculated according to the degree of association of the URL information in the reference information of the two vulnerabilities. When the reference information of the two vulnerabilities includes the same URL information, the URL information of the two vulnerabilities is associated and the URL similarity is high; when it does not include any identical URL information, the URL information of the two vulnerabilities is not associated and the URL similarity is low. When the URL similarity of the two vulnerabilities is high, some of the reference content of the two vulnerabilities may come from the same website, in which case the two vulnerabilities are more likely to be similar and the comprehensive similarity is accordingly higher.
Further optionally, in an embodiment, when the URL similarity of the two vulnerabilities is calculated according to the degree of association of the URL information in the reference information of the two vulnerabilities, the specifically executed steps include: extracting the URL information from the reference information with a regular expression to obtain a URL set corresponding to each vulnerability; if at least one identical URL exists in both URL sets, determining that the URL similarity is 1, and if no identical URL exists in the two URL sets, determining that the URL similarity is 0.
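The regular-expression extraction and the 0/1 decision can be sketched as below. The pattern is a deliberately simple assumption (the text does not give the regular expression), and the example URLs are placeholders.

```python
import re

# Assumed pattern: an http(s) URL runs until whitespace, a quote or an angle bracket.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def extract_urls(reference_text):
    """Return the set of URLs found in a reference-information string."""
    # Trailing punctuation that commonly follows a URL in prose is stripped.
    return {url.rstrip(".,;)") for url in URL_PATTERN.findall(reference_text)}


def url_similarity(reference_a, reference_b):
    """1.0 if the two URL sets share at least one URL, otherwise 0.0."""
    return 1.0 if extract_urls(reference_a) & extract_urls(reference_b) else 0.0


ref_a = "See https://example.com/advisory/1 and http://example.org/x."
ref_b = "Source: https://example.com/advisory/1"
ref_c = "Source: https://example.net/other"
```

Exact string comparison treats URLs that differ only in scheme, case or a trailing slash as different; normalizing URLs before comparison is a possible refinement that the text does not discuss.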
Optionally, in an embodiment, the vulnerability matching method further includes: acquiring title information of the vulnerability; calculating the title similarity of the two vulnerabilities according to the number of the same product names in the title information of the two vulnerabilities; the step of calculating the comprehensive similarity of the two vulnerabilities according to the semantic similarity, the description similarity, the version similarity and the URL similarity comprises the following steps: and calculating the comprehensive similarity of the two vulnerabilities according to the semantic similarity, the description similarity, the version similarity, the URL similarity and the title similarity.
Specifically, for known vulnerabilities in a vulnerability database, title information of the vulnerabilities is stored in the database. The title information includes product names, which generally refer to the names of the products affected by the vulnerability, and when two vulnerabilities match, identical product names are likely to appear in their title information. The title similarity of the two vulnerabilities is calculated based on the number of identical product names in the title information, where the larger the number of identical product names, the greater the title similarity and the greater the comprehensive similarity of the two vulnerabilities.
Further optionally, in an embodiment, when the title similarity of the two vulnerabilities is calculated according to the number of identical product names in the title information of the two vulnerabilities, the specifically executed steps include: obtaining the product names in the title information of each of the two vulnerabilities to obtain a second product name sequence corresponding to each vulnerability; calculating the number of identical product names in the two second product name sequences to obtain a third quantity value; determining the number of product names in the shorter of the two second product name sequences to obtain a fourth quantity value; and calculating the ratio of the third quantity value to the fourth quantity value to obtain the title similarity.
Optionally, in an embodiment, the machine learning model includes a matrix calculation layer, a convolution pooling layer and a fully connected layer, where the convolution pooling layer includes a plurality of groups of convolution layers and pooling layers, and the step of calculating the semantic similarity of the two vulnerabilities according to the description information of the two vulnerabilities and a preset machine learning model includes: calculating the word vectors corresponding to the words in the description information; calculating, through the matrix calculation layer, the vector similarity value between each word vector of one of the two vulnerabilities and every word vector of the other vulnerability in sequence to obtain a similarity matrix; sequentially performing convolution operations and max pooling operations on the similarity matrix through the convolution pooling layer to obtain a feature map; and inputting the feature map into the fully connected layer to obtain the semantic similarity of the two vulnerabilities.
Specifically, the description information is first segmented into words, and the words obtained after segmentation are filtered, for example by removing stop words and the like, to obtain a plurality of words corresponding to the description information. The word vector corresponding to each word is then calculated according to a lexicon dictionary to obtain the word vector group of the vulnerability, and the word vector groups are the input vectors of the machine learning model. As shown in Fig. 3, the machine learning model sequentially includes a matrix calculation layer, a convolution pooling layer and a fully connected layer. The word vector group T1 of one vulnerability (including word vectors w1, w2, w3 … w8) and the word vector group T2 of the other vulnerability (including word vectors v1, v2, v3 … v8) are input into the machine learning model. The matrix calculation layer calculates, in sequence, the vector similarity value between each word vector of one vulnerability and every word vector of the other vulnerability, for which cosine similarity or the dot product may be used, finally yielding a two-dimensional similarity matrix. Multiple layers of convolution operations and max pooling operations are then performed on the similarity matrix through the convolution layers and pooling layers to obtain a feature map. Finally, several fully connected layers are attached for binary classification; the output of the last fully connected layer is passed through an activation function, and the resulting value represents the semantic similarity between the vulnerabilities.
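The matrix calculation layer and one max pooling step can be sketched in plain Python as follows, using cosine similarity. The learned convolution kernels and fully connected layers are omitted because their parameters come from training; the two-dimensional toy vectors and the 2x2 pooling window are assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_matrix(t1, t2):
    """Entry [i][j] is the similarity of word vector i of T1 and word vector j of T2."""
    return [[cosine(w, v) for v in t2] for w in t1]


def max_pool(matrix, size=2):
    """Non-overlapping max pooling; windows at the border may be smaller."""
    rows, cols = len(matrix), len(matrix[0])
    return [
        [
            max(
                matrix[r][c]
                for r in range(i, min(i + size, rows))
                for c in range(j, min(j + size, cols))
            )
            for j in range(0, cols, size)
        ]
        for i in range(0, rows, size)
    ]


# Toy word vector groups; real groups would hold one vector per word of a description.
T1 = [[1.0, 0.0], [0.0, 1.0]]
T2 = [[1.0, 0.0], [1.0, 1.0]]
matrix = similarity_matrix(T1, T2)
pooled = max_pool(matrix)
```

With word vector groups of length m and n, the similarity matrix has shape m by n, so it can be treated like a single-channel image by the convolution and pooling layers.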
Further optionally, in an embodiment, the two vulnerabilities originate from different vulnerability libraries, the languages of the description information in the different vulnerability libraries are different, and the step of calculating the word vectors corresponding to the words in the description information includes: searching for the identifier corresponding to each word of the description information of the vulnerability in the lexicon dictionary corresponding to the source vulnerability library, where the description information of all the vulnerabilities in a vulnerability library is segmented by a word segmentation method corresponding to its language to obtain a word set, and the lexicon dictionary corresponding to the vulnerability library is constructed according to the word set; and converting the identifier corresponding to the word into the word vector of the word.
Specifically, when matching vulnerabilities from different vulnerability libraries, a lexicon dictionary corresponding to each vulnerability library is first constructed from the description information of the vulnerabilities in that library. The description information of all the vulnerabilities in the vulnerability library is segmented by a word segmentation method corresponding to its language to obtain a word set, and the lexicon dictionary corresponding to the vulnerability library is then constructed according to the word set. For example, if the language of the description information is Chinese, a Chinese word segmentation method is used, and if the language is English, an English word segmentation method is used. After the word set is obtained, it can be further filtered based on the characteristics of the language. For example, when the language of the description information is English, capital letters in the word set can be replaced by lowercase letters, stop words can be removed, and inflected words can be restored to their base forms, including restoring plural forms to singular forms and restoring inflected verb forms to their original forms. After filtering, duplicate words are removed, and the lexicon dictionary corresponding to each vulnerability library is finally obtained. In the lexicon dictionary, different words have different identifiers; for example, an identifier may be a string of numbers and/or characters, such as the unique serial number of the word in the lexicon dictionary.
When a word vector corresponding to a word in description information of a vulnerability is calculated, a word library dictionary corresponding to a source vulnerability library of the vulnerability is obtained, then identifiers of the words are searched in the word library dictionary, and then the identifiers are converted into vectors, so that the word vector of the words can be obtained.
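The construction of a per-library lexicon dictionary and the identifier-to-vector lookup can be sketched as follows for English text. The stop word list is a tiny placeholder, lemmatization is omitted, and the randomly initialized embedding table is an assumption: the text only states that identifiers are converted into vectors, not how.

```python
import random
import re

# Tiny placeholder stop word list; a real list would be far longer.
STOP_WORDS = {"the", "a", "an", "is", "in", "of", "and"}


def tokenize_english(text):
    """Lowercase, split on non-alphanumeric characters and drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [word for word in words if word not in STOP_WORDS]


def build_lexicon(descriptions, tokenize):
    """Map each distinct word to a unique serial number (its identifier)."""
    lexicon = {}
    for text in descriptions:
        for word in tokenize(text):
            lexicon.setdefault(word, len(lexicon))
    return lexicon


def make_embedding_table(lexicon, dim=4, seed=0):
    """Assumed conversion: one random vector per identifier (trained in practice)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in lexicon]


def word_vectors(text, tokenize, lexicon, table):
    """Look up each known word's identifier and return its vector."""
    return [table[lexicon[word]] for word in tokenize(text) if word in lexicon]


library = ["The attacker runs scripts", "Scripts in the browser"]
lexicon = build_lexicon(library, tokenize_english)
table = make_embedding_table(lexicon)
vectors = word_vectors("Scripts unknown", tokenize_english, lexicon, table)
```

A second vulnerability library in another language would get its own tokenizer, lexicon and table, built in the same way from its own description information.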
By adopting the vulnerability matching method provided by the embodiment, different lexicon dictionaries are established for different vulnerability libraries, so that when word vectors of vulnerability description information words in the vulnerability library are calculated, calculation is performed based on different lexicon dictionaries, and the accuracy of vulnerability matching can be further improved.
Optionally, in an embodiment, the mutually matched vulnerability samples are positive samples, the unmatched vulnerability samples are negative samples, and the machine learning model is obtained by the following steps: randomly acquiring a plurality of negative samples as first negative samples; constructing a first training set by a plurality of positive samples and a plurality of first negative samples; training through a first training set to obtain an intermediate machine learning model; calculating semantic similarity of a plurality of negative samples through an intermediate machine learning model to obtain a plurality of first similarity; selecting a negative sample as a second negative sample according to the first similarity, wherein the greater the first similarity of the negative sample is, the greater the probability of being selected as the second negative sample is; constructing a second training set by the plurality of positive samples, the plurality of first negative samples and the plurality of second negative samples; and training through the second training set to obtain the machine learning model.
Specifically, before vulnerability matching, a vulnerability sample is adopted to train an initial machine learning model, and after a trained machine learning model is obtained, semantic similarity is calculated through the trained machine learning model. When the initial machine learning model is trained, a training set (namely, a second training set) needs to be constructed, wherein the training set comprises positive samples and two types of negative samples, one type of negative samples is a negative sample which is randomly obtained, the other type of negative samples is a negative sample which is large in similarity and unmatched, and the later type of negative samples are screened through a trained intermediate machine learning model. Specifically, an initial learning model is trained through a first training set constructed by positive samples and first negative samples to obtain an intermediate machine learning model, then a plurality of negative samples are screened through the intermediate machine learning model, the negative samples with high similarity are screened out from the negative samples to serve as second negative samples, a second training set is constructed through the positive samples, the first negative samples and the second negative samples, and then the second training set is used for training the initial learning model to obtain the machine learning model. The initial learning model during training by the second training set may be the same as the initial learning model during training by the first training set, may be an intermediate learning model obtained by training by the first training set, or may be another learning model.
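The selection of second negative samples, in which a negative sample with a greater first similarity is more likely to be chosen, can be sketched with weighted sampling without replacement. The function names are hypothetical, and `score_fn` stands in for the intermediate machine learning model.

```python
import random


def select_second_negatives(negatives, score_fn, k, seed=None):
    """Pick up to k negatives, favoring those the intermediate model scores high.

    score_fn is a stand-in for the intermediate model: it returns the first
    similarity (a value between 0 and 1) of a negative sample.
    """
    rng = random.Random(seed)
    pool = [(sample, score_fn(sample)) for sample in negatives]
    chosen = []
    while pool and len(chosen) < k:
        weights = [score for _, score in pool]
        if sum(weights) <= 0:
            # All remaining scores are zero: fall back to uniform sampling.
            index = rng.randrange(len(pool))
        else:
            index = rng.choices(range(len(pool)), weights=weights)[0]
        chosen.append(pool.pop(index)[0])
    return chosen


# Hypothetical first similarities produced by the intermediate model.
first_similarity = {"neg-1": 0.0, "neg-2": 0.9, "neg-3": 0.0}
hard = select_second_negatives(list(first_similarity), first_similarity.get, k=1, seed=1)
```

Sampling in proportion to similarity, rather than simply taking the top k, keeps some variety among the second negative samples; taking the top k is a simpler alternative that also satisfies the description.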
By adopting the vulnerability matching method provided by the embodiment, when the training set of the machine learning model is constructed by adopting the negative samples, a part of the negative samples with high similarity are screened, and the training set is ensured to comprise the positive samples and the negative samples with high similarity, so that the machine learning model obtained by training has high accuracy, and the vulnerability matching accuracy is further improved.
Optionally, in an embodiment, in the second training set, the ratio of positive samples to negative samples is a preset ratio value, where the preset ratio value is smaller than 1.
In conventional machine learning practice, positive samples account for a large proportion of the training set and negative samples for a small proportion. However, the inventor found that a model trained with this conventional proportion cannot accurately identify whether vulnerabilities truly match. In this embodiment, the ratio of positive samples to negative samples is set to a preset ratio value smaller than 1, which can effectively improve the accuracy of the model's matching results. For example, a training set can be constructed with a ratio of positive samples to negative samples of 1:5. Departing from conventional practice in this way greatly increases the proportion of negative samples and therefore their diversity, which can improve the accuracy of the trained machine learning model and, in turn, the accuracy of vulnerability matching.
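The construction of the second training set described above can be sketched as follows. This is only an illustrative sketch under stated assumptions: the function name `build_second_training_set`, the parameter `hard_fraction` (the share of negatives taken from the high-similarity candidates) and the use of a simple top-ranked cut-off are hypothetical choices that the embodiment does not specify; `score_fn` stands in for the similarity output of the intermediate machine learning model.

```python
import random


def build_second_training_set(positives, unmatched_pairs, score_fn,
                              neg_per_pos=5, hard_fraction=0.5, seed=0):
    """Assemble labeled samples with a positive:negative ratio of 1:neg_per_pos.

    positives       -- list of matched vulnerability pairs (label 1)
    unmatched_pairs -- list of unmatched vulnerability pairs (candidates, label 0)
    score_fn        -- stand-in for the intermediate model: pair -> similarity
    hard_fraction   -- hypothetical share of negatives taken from the
                       highest-scoring (hardest) unmatched pairs
    """
    rng = random.Random(seed)
    n_neg = neg_per_pos * len(positives)
    n_hard = int(n_neg * hard_fraction)

    # Second negative samples: unmatched pairs the intermediate model rates
    # as most similar, i.e. the ones that are hardest to tell apart.
    ranked = sorted(unmatched_pairs, key=score_fn, reverse=True)
    hard_negatives = ranked[:n_hard]

    # First negative samples: drawn at random from the remaining pairs.
    remaining = ranked[n_hard:]
    n_random = min(n_neg - len(hard_negatives), len(remaining))
    random_negatives = rng.sample(remaining, n_random)

    samples = [(pair, 1) for pair in positives]
    samples += [(pair, 0) for pair in hard_negatives + random_negatives]
    rng.shuffle(samples)
    return samples
```

With two positive pairs and the default `neg_per_pos=5`, the function returns twelve labeled samples, ten of which are negatives.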
Example three
The third embodiment is a preferred embodiment based on the first and second embodiments, and implements efficient, fast and automatic vulnerability matching. With the vulnerability matching method provided by this embodiment, the similarity between vulnerabilities in different vulnerability libraries can be calculated automatically, and the truly matched vulnerabilities have the highest similarity values. By sorting according to the similarity value, the method can quickly select the truly matched vulnerability, or delimit a small set of candidate vulnerabilities to narrow the search range. The method comprises five steps: data acquisition, data preprocessing, model training, similarity calculation and vulnerability ranking. The following description takes only the matching and ranking of vulnerabilities in the CNNVD vulnerability database and the CVE vulnerability database as an example; the method may also be applied to matching and ranking single-language or multi-language vulnerabilities in other vulnerability databases. Fig. 4 is a flowchart of the vulnerability matching method provided in the third embodiment of the present application, and as shown in fig. 4, each step is described in detail as follows.
Step 1, data acquisition. This step builds the raw data set used for training and testing the machine learning model. The data and some examples are described first. The CNNVD vulnerability database and the CVE vulnerability database describe vulnerabilities in Chinese and English, respectively. Tables 1 to 3 give examples from the CNNVD vulnerability database and the CVE vulnerability database under several different conditions; only the fields relevant to this embodiment are shown.
Table 1 shows a sample of a mutually matched vulnerability pair, that is, a positive sample; the two entries can be seen to describe the same vulnerability. Table 2 shows a sample of a vulnerability pair that is semantically similar but unmatched: the descriptions resemble each other, but the entries are not actually the same vulnerability. Such pairs are relatively difficult for the machine learning model to distinguish and correspond to the second negative samples. Table 3 shows a sample of an ordinary unmatched vulnerability pair, which the machine learning model can distinguish relatively easily.
TABLE 1 Example of a mutually matched vulnerability pair
TABLE 2 Example of a similar but unmatched vulnerability pair
TABLE 3 Example of an unmatched vulnerability pair
During data acquisition, mutually matched vulnerability pairs are obtained from the CNNVD vulnerability database and the CVE vulnerability database as positive samples, and negative samples are then obtained by negative sampling, that is, unmatched vulnerability pairs are taken as negative samples. To increase the diversity of negative samples, the proportion of negative samples is increased appropriately; in this embodiment the ratio of positive samples to negative samples is set to 1:5. For negative sampling, this embodiment combines random negative sampling with ranking-based negative sampling. Random negative sampling selects pairs at random from the unmatched vulnerability pairs, whereas ranking-based negative sampling selects semantically similar pairs from the unmatched vulnerability pairs, so that more attention is paid to such pairs and the ability of the machine learning model to identify semantically similar but unmatched vulnerability pairs is enhanced, which is important for reducing mismatches by the machine learning model. Ranking-based negative sampling selects semantically similar vulnerability pairs as follows: first, an initial version of the machine learning model is trained using only the randomly sampled negative samples to obtain an intermediate machine learning model, which outputs high similarity values for semantically close vulnerability pairs; then, semantically similar negative samples are screened out according to the ranked output of the intermediate machine learning model.
Step 2, data preprocessing. An important part of each piece of vulnerability information is the vulnerability description, namely the description information of the vulnerability. The description information is a text string carrying semantic information that describes the technical details of the vulnerability, the related products and similar content. After the vulnerability samples are obtained in step 1, the text string of the description information of each vulnerability must undergo a series of preprocessing operations before the samples can be used for model training. As shown in fig. 5 and 6, the preprocessing measures include the following:
Word segmentation. The words in each sentence are segmented and stop words are filtered out. Because the vulnerability descriptions of the CNNVD standard and the CVE standard are written in Chinese and English respectively, a Chinese word segmentation method is applied to the Chinese description information of CNNVD vulnerabilities and Chinese stop words are filtered, while an English word segmentation method is applied to the English description information of CVE vulnerabilities and English stop words are filtered.
Conversion of uppercase letters to lowercase. For the English words obtained after segmenting the CVE vulnerability descriptions, whether the first letter of a word is uppercase or lowercase makes no difference to its meaning in this task, so all uppercase letters are converted into lowercase letters.
Lemmatization. For the English words obtained after segmenting the CVE vulnerability descriptions, the plural forms of nouns and the various inflected forms of verbs carry no additional meaning in this task, so all plural nouns are converted into the singular form and all verb forms are converted into the base form.
Construction of word bank dictionaries. All processed CNNVD words and CVE words are used to construct a CNNVD word bank dictionary and a CVE word bank dictionary, respectively. Through these dictionaries, each word can be converted into an integer id in the corresponding word bank, which is convenient for the subsequent machine learning model training step.
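The English-side preprocessing and the word bank dictionary can be sketched as follows. This is a simplified illustration, not the embodiment's implementation: the stop-word list is a tiny hypothetical one, the regular-expression tokenizer is an assumption, and lemmatization and Chinese word segmentation are omitted because the source does not name the tools used for them. The reserved `<unk>` entry with id 0 is likewise an assumption.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
EN_STOP_WORDS = {"a", "an", "the", "in", "of", "to", "and", "is", "via"}


def preprocess_english(text):
    """Tokenize an English description, lowercase it and drop stop words."""
    tokens = re.findall(r"[A-Za-z0-9_.\-]+", text)
    # Lowercase and trim punctuation that clings to the ends of a token,
    # while keeping inner dots so version strings such as "2.1" survive.
    tokens = [t.lower().strip(".-") for t in tokens]
    return [t for t in tokens if t and t not in EN_STOP_WORDS]


def build_vocab(token_lists):
    """Map every distinct word to an integer id; id 0 is kept for unknown words."""
    vocab = {"<unk>": 0}
    for tokens in token_lists:
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab)
    return vocab


def encode(tokens, vocab):
    """Convert a token list into the list of integer ids used by the model."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]
```

One dictionary would be built per vulnerability library, so that CNNVD words and CVE words receive ids from separate word banks.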
Step 3, model training. A neural network model constructed with existing deep learning methods is used to calculate the semantic similarity between the preprocessed CNNVD text sequence and the preprocessed CVE text sequence. The output value of the model lies in the range [0,1]. The training target is to make the similarity output for positive samples approach 1 as closely as possible and the similarity output for negative samples approach 0 as closely as possible. Before training, the data set from step 2 is further divided into a training set, a validation set and a test set. The training set is used to train the model; the validation set is used to check the effect of the model during training so as to prevent overfitting; and the test set is used to test and evaluate the actual effect of the model.
As shown in fig. 4, the model first constructs a similarity matrix between the two text sequences, and then performs binary classification on the similarity matrix using a multilayer convolutional neural network, where similar pairs are labeled 1 and dissimilar pairs are labeled 0. The specific flow of the model is as follows:
Word vectors for the two text sequences are calculated. In the neural network model, each word is represented by a word vector of fixed length. The unique id of each word of a text sequence in its respective word bank dictionary is converted into a word vector.
A similarity matrix is constructed. For the word vector of each word in the text sequence of one vulnerability, a vector similarity value with the word vectors of all words in the other sequence is calculated in turn, using cosine similarity or the dot product, which finally yields a two-dimensional similarity matrix.
Multilayer convolution operations and max pooling operations are applied to the similarity matrix in sequence to obtain a feature map, which is then fed into a two-layer fully connected network for binary classification. The output of the last fully connected layer is passed through a sigmoid activation function, and the resulting value represents the semantic similarity between the vulnerabilities.
The model is trained iteratively by backpropagation until the training result meets the target.
Step 4, similarity calculation. To make full use of the various kinds of information about a vulnerability and make the matching result more robust, this embodiment calculates five types of similarity for each pair of CNNVD vulnerability and CVE vulnerability, namely the model semantic similarity, the title similarity, the description similarity, the version number similarity and the URL similarity. The five similarity values are weighted and summed to obtain the final vulnerability similarity. The weighted sum formula is
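The similarity matrix construction in the flow above can be sketched as follows, using cosine similarity, which is one of the two options mentioned. The function names are hypothetical, and the word vectors are assumed to be plain lists of floats; the convolution, pooling and fully connected layers are not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(seq_a, seq_b):
    """Entry [i][j] is the similarity of word i in seq_a and word j in seq_b."""
    return [[cosine(u, v) for v in seq_b] for u in seq_a]
```

For a CNNVD sequence of m words and a CVE sequence of n words, the result is an m-by-n matrix, which is the input to the convolution and pooling layers.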
$$\text{Vulnerability similarity} = \alpha_1 S_{model} + \alpha_2 S_{title} + \alpha_3 S_{desc} + \alpha_4 S_{version} + \alpha_5 S_{url},$$
where the weights $\alpha_1$ to $\alpha_5$ are constants that represent the weight of each similarity value and are obtained through experiments. The value range of each similarity is [0,1]. The specific meanings and calculations are as follows:
(1) Model semantic similarity $S_{model}$ represents the similarity of the descriptions of the two vulnerabilities at the semantic level, and is obtained from the neural network model trained in step 3.
(2) Title similarity $S_{title}$ represents the word-statistics similarity between the title part of the CNNVD vulnerability and the description part of the CVE vulnerability (since the CVE vulnerability does not include a title field, this embodiment performs the statistics against the description part of the CVE vulnerability). The title part of the CNNVD vulnerability contains English words such as product names, and matching this information against the English product names in the description part of the CVE vulnerability makes it convenient to screen out vulnerabilities with high topic relevance (the same product or related products).
$S_{title}$ is calculated as follows: the English product names in the CNNVD title information and in the CVE description information are extracted to form respective product name lists; the number of product names common to the two lists, divided by the number of product names in the CNNVD title, is taken as the title similarity.
(3) Description similarity $S_{desc}$ represents the word-statistics similarity between the description part of the CNNVD vulnerability and the description part of the CVE vulnerability. The description part of the CNNVD vulnerability contains English words such as product names, and matching this information against the English product names in the description part of the CVE vulnerability makes it convenient to screen out vulnerabilities with high topic relevance (the same product or related products).
$S_{desc}$ is calculated as follows: the English product names in the CNNVD description information and in the CVE description information are extracted to form respective product name lists; the number of product names common to the two lists, divided by the number of product names in the CNNVD description, is taken as the description similarity.
(4) Version number similarity $S_{version}$ represents the degree of similarity between the version number information in the CNNVD vulnerability description and that in the CVE vulnerability description. The version numbers of two matching vulnerabilities are generally consistent, so this information can be used as a strong feature to assist matching.
$S_{version}$ is calculated as follows: the version number information in the CNNVD description information and in the CVE description information is extracted to form respective version number lists. The similarity of each pair of version number strings across the two lists is then calculated in turn, and the maximum of these pairwise similarities is taken as the final version similarity.
(5) URL similarity $S_{url}$ represents the degree of association between the URL information of the CNNVD vulnerability and the URL information of the CVE vulnerability. The reference fields of CNNVD vulnerabilities and CVE vulnerabilities carry many website addresses (URLs). If a vulnerability pair contains one or more identical URLs (such as the two URLs shown in bold in the example in Table 1), the two entries most likely describe the same vulnerability, so this information is used as a strong feature to assist matching.
$S_{url}$ is calculated as follows: the URL information of the two vulnerabilities is extracted by a regular expression; if the two vulnerabilities contain one or more identical URLs, $S_{url}$ is 1, otherwise it is 0.
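The rule-based similarities in items (2) to (5) can be sketched as follows. Several details are assumptions rather than part of the embodiment: product name extraction is not specified in the source, so the overlap function takes already extracted name lists and compares them case-insensitively as sets; the regular expressions `VERSION_RE` and `URL_RE` are hypothetical patterns; and `difflib.SequenceMatcher` is used as a stand-in for the unspecified string similarity between version numbers.

```python
import re
from difflib import SequenceMatcher

# Hypothetical patterns; a production system would need more careful ones.
VERSION_RE = re.compile(r"\b\d+(?:\.\d+)+\b")
URL_RE = re.compile(r"https?://[^\s,;\"'<>)]+")


def overlap_similarity(cnnvd_names, cve_names):
    """Title/description similarity: shared product names over CNNVD names."""
    cnnvd = {name.lower() for name in cnnvd_names}
    cve = {name.lower() for name in cve_names}
    if not cnnvd:
        return 0.0
    return len(cnnvd & cve) / len(cnnvd)


def version_similarity(cnnvd_text, cve_text):
    """Maximum pairwise string similarity between extracted version numbers."""
    versions_a = VERSION_RE.findall(cnnvd_text)
    versions_b = VERSION_RE.findall(cve_text)
    if not versions_a or not versions_b:
        return 0.0
    return max(SequenceMatcher(None, a, b).ratio()
               for a in versions_a for b in versions_b)


def url_similarity(cnnvd_refs, cve_refs):
    """1.0 if the two reference fields share at least one URL, otherwise 0.0."""
    urls_a = set(URL_RE.findall(cnnvd_refs))
    urls_b = set(URL_RE.findall(cve_refs))
    return 1.0 if urls_a & urls_b else 0.0
```

Each function returns a value in [0,1], as required of the individual similarities. Note that the simple URL pattern keeps a trailing period if a URL ends a sentence, so real reference fields may need normalization before comparison.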
Step 5, vulnerability ranking. For one CNNVD vulnerability or CVE vulnerability, the vulnerability similarity with every candidate vulnerability to be matched is calculated according to the method in step 4; all candidate vulnerabilities are then sorted in descending order of similarity, and the first k vulnerabilities are selected as candidate matching results. In this way, the truly matched vulnerability can be selected, or a small set of candidate vulnerabilities can be delimited to narrow the search range. Table 4 is an example of the results returned when a CNNVD vulnerability is used to search for CVE vulnerabilities: among all CVE candidate vulnerabilities, the ten CVE vulnerabilities with the highest similarity to CNNVD-202001- are listed.
TABLE 4 Example of ranked vulnerability matching results
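The ranking in step 5 amounts to scoring every candidate and keeping the k best. A minimal sketch follows; the function name `top_k_matches` is hypothetical, and `similarity` stands in for the weighted vulnerability similarity of step 4.

```python
def top_k_matches(query, candidates, similarity, k=10):
    """Return the k (score, candidate) pairs with the highest similarity to query."""
    scored = [(similarity(query, cand), cand) for cand in candidates]
    # Sort on the score only, in descending order; ties keep their input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

With `k=10`, the returned list corresponds to the top ten results on which the top10 hit rate is measured.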
The traditional method of matching vulnerabilities manually depends on professionals; the time cost and labor cost of the matching process are very high, and the efficiency is low. With the automatic, deep-learning-based vulnerability matching method provided by this embodiment, matching efficiency can be improved while maintaining high accuracy, labor cost and time cost are greatly reduced, and the maintenance efficiency of the vulnerability library is further improved. In a real usage scenario, the hit rate of the top ten ranked results (top10 hit rate) is higher than 90%. As the data volume increases, the accuracy of the vulnerability matching method provided by this embodiment increases.
In addition, this embodiment adds ranking-based negative sampling when negative samples are taken. That is, an initial model is trained using only randomly sampled negative samples; this model outputs high similarity values for semantically similar vulnerability pairs, and semantically similar negative samples are then screened out according to the ranked output of the initial model. Compared with random negative sampling, ranking-based negative sampling selects vulnerability pairs that are semantically similar but unmatched as negative samples, which increases the attention paid to such pairs and improves the model's ability to identify them, thereby reducing mismatches by the model.
In this embodiment, vulnerability similarity values are calculated between vulnerabilities of different vulnerability libraries, taking into account five components: model semantic similarity, title similarity, description similarity, version number similarity and URL similarity. The title similarity and the description similarity address the problem of inconsistent topics, and computing a weighted sum of multiple similarities makes the matching results more robust.
In this embodiment, a similarity matrix between the two text sequences is constructed, a multilayer convolutional neural network then performs binary classification on the similarity matrix, and the output is the value of the semantic similarity of the vulnerabilities.
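The weighted sum of the five similarities can be illustrated with a short worked example. The weight values below are purely hypothetical, since the embodiment states only that the weights are constants obtained through experiments; the function and key names are likewise illustrative.

```python
SIMILARITY_KEYS = ("model", "title", "desc", "version", "url")

# Hypothetical weights for illustration; the real values come from experiments.
EXAMPLE_WEIGHTS = {"model": 0.4, "title": 0.1, "desc": 0.1,
                   "version": 0.2, "url": 0.2}


def vulnerability_similarity(sims, weights):
    """Weighted sum of the five per-dimension similarities, each in [0, 1]."""
    for key in SIMILARITY_KEYS:
        if not 0.0 <= sims[key] <= 1.0:
            raise ValueError("similarity %r must lie in [0, 1]" % key)
    return sum(weights[key] * sims[key] for key in SIMILARITY_KEYS)


# Worked example: 0.4*0.9 + 0.1*1.0 + 0.1*0.5 + 0.2*1.0 + 0.2*1.0 = 0.91
example = vulnerability_similarity(
    {"model": 0.9, "title": 1.0, "desc": 0.5, "version": 1.0, "url": 1.0},
    EXAMPLE_WEIGHTS,
)
```

Because the hypothetical weights here sum to 1, the combined value also stays within [0,1]; the source does not state whether the experimentally obtained weights are normalized in this way.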
Example four
Corresponding to the first and second embodiments, a fourth embodiment of the present application provides a vulnerability matching device, and accordingly, reference may be made to the first and second embodiments for technical feature details and corresponding technical effects, which are not described in detail in this embodiment. Fig. 7 is a block diagram of a vulnerability matching apparatus provided in the fourth embodiment of the present application, and as shown in fig. 7, the vulnerability matching apparatus includes: an acquisition module 301, a first calculation module 302, a second calculation module 303, and a determination module 304.
The obtaining module 301 is configured to obtain vulnerability information of two vulnerabilities to be matched; the first calculation module 302 is configured to calculate similarities of the two vulnerabilities in multiple dimensions according to the vulnerability information; the second calculation module 303 is configured to calculate a comprehensive similarity of the two vulnerabilities according to the similarity in the plurality of dimensions; and the judging module 304 is configured to judge whether the two vulnerabilities are matched according to the comprehensive similarity.
Optionally, in an embodiment, when calculating the similarity of the two vulnerabilities in multiple dimensions, the first calculation module 302 performs calculation through some or all of the following steps: calculating semantic similarity of the two vulnerabilities according to the description information of the vulnerability information and a preset machine learning model; calculating the description similarity of the two vulnerabilities according to the number of the same product names in the description information; calculating the version similarity of the two vulnerabilities according to the version number in the description information; calculating the URL similarity of the two vulnerabilities according to the relevance degree of URL information in the reference information of the vulnerability information; and calculating the title similarity of the two vulnerabilities according to the number of the same product names in the title information of the vulnerability information.
Optionally, in an embodiment, the machine learning model includes a matrix calculation layer, a convolution pooling layer and a fully connected layer, where the convolution pooling layer includes several groups of convolution layers and pooling layers, and the step of calculating the semantic similarity of the two vulnerabilities includes: calculating word vectors corresponding to the words in the description information; calculating, through the matrix calculation layer, the vector similarity values between each word vector of one of the two vulnerabilities and all the word vectors of the other vulnerability in turn, to obtain a similarity matrix; sequentially performing convolution operations and max pooling operations on the similarity matrix through the convolution pooling layer to obtain a feature map; and inputting the feature map into the fully connected layer to obtain the semantic similarity of the two vulnerabilities.
Optionally, in an embodiment, the two vulnerabilities are derived from different vulnerability libraries, the languages of the description information in different vulnerability libraries are different, and the step of calculating a word vector corresponding to a word in the description information includes: searching for an identifier corresponding to the word in the word bank dictionary corresponding to the source vulnerability library, where the description information of the vulnerabilities in a vulnerability library is segmented using a word segmentation method corresponding to its language to obtain a word set, and the word bank dictionary corresponding to the vulnerability library is constructed from the word set; and converting the identifier corresponding to the word into the word vector of the word.
Optionally, in an embodiment, the mutually matched vulnerability samples are positive samples, the unmatched vulnerability samples are negative samples, and the machine learning model is obtained by the following steps: randomly acquiring a plurality of negative samples as first negative samples; constructing a first training set by a plurality of positive samples and a plurality of the first negative samples; training through the first training set to obtain an intermediate machine learning model; calculating semantic similarity of the negative samples through the intermediate machine learning model to obtain a plurality of first similarity; selecting a negative sample as a second negative sample according to the first similarity, wherein the greater the first similarity of the negative sample is, the greater the probability of being selected as the second negative sample is; constructing a second training set by a plurality of the positive samples, a plurality of the first negative samples and a plurality of the second negative samples; and training through the second training set to obtain the machine learning model.
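The selection rule "the greater the first similarity, the greater the probability of being selected" can be realized in several ways; one possible sketch is weighted sampling without replacement, shown below. The function name is hypothetical, and the key-based sampling scheme (each item receives the key u^(1/w) for a uniform random u and weight w, and the k largest keys are kept) is an assumption chosen for illustration, not a method named in the source.

```python
import random


def sample_second_negatives(negatives, first_similarities, k, seed=0):
    """Pick k distinct negatives, favoring those with a higher first similarity."""
    rng = random.Random(seed)
    eps = 1e-12  # keeps zero-similarity samples selectable, but very unlikely
    keyed = []
    for neg, sim in zip(negatives, first_similarities):
        weight = max(sim, eps)
        # Larger weights push the key toward 1, so the sample ranks higher.
        keyed.append((rng.random() ** (1.0 / weight), neg))
    keyed.sort(key=lambda pair: pair[0], reverse=True)
    return [neg for _, neg in keyed[:k]]
```

A deterministic alternative is to take the k negatives with the highest first similarity; the probabilistic variant keeps some lower-similarity negatives in the mix.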
Optionally, in an embodiment, in the second training set, the ratio of the positive and negative samples is a preset ratio value, where the preset ratio value is smaller than 1.
Optionally, in an embodiment, the step of calculating the description similarity of the two vulnerabilities includes: respectively obtaining the product names in the description information of the two vulnerabilities to obtain a first product name sequence corresponding to each vulnerability; calculating the number of identical product names in the two first product name sequences to obtain a first quantity value; determining the number of product names in the shorter of the two first product name sequences to obtain a second quantity value; and calculating the ratio of the first quantity value to the second quantity value to obtain the description similarity.
Optionally, in an embodiment, the step of calculating the version similarity of the two vulnerabilities includes: respectively obtaining version numbers in the description information of the two vulnerabilities to obtain a version number list corresponding to each vulnerability; sequentially calculating the similarity of each version number between the two version number lists; and selecting the maximum value from the similarity of the version numbers as the version similarity.
Optionally, in an embodiment, the step of calculating the URL similarity of the two vulnerabilities includes: extracting the URL information in the reference information through a regular expression to obtain a URL set corresponding to each vulnerability; and if at least one identical URL exists in the two URL sets, determining that the URL similarity is 1, and if no identical URL exists in the two URL sets, determining that the URL similarity is 0.
Optionally, in an embodiment, the step of calculating the title similarity of the two vulnerabilities includes: respectively obtaining product names in the title information of the two vulnerabilities to obtain a second product name sequence corresponding to each vulnerability; calculating the number of the same products in the two second product name sequences to obtain a third number value; determining the quantity of the product names which are fewer in the two second product name sequences to obtain a fourth quantity value; and calculating the ratio of the third quantity value to the fourth quantity value to obtain the title similarity.
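The description similarity and title similarity of this embodiment share one computation: the number of identical product names divided by the size of the smaller product name sequence. A minimal sketch follows, assuming the product names have already been extracted and treating each sequence as a case-insensitive set; both the function name and the set treatment are assumptions.

```python
def name_overlap_min(names_a, names_b):
    """Shared product names divided by the size of the smaller name collection."""
    set_a = {name.lower() for name in names_a}
    set_b = {name.lower() for name in names_b}
    smaller = min(len(set_a), len(set_b))
    if smaller == 0:
        return 0.0  # nothing to compare when either side has no product names
    return len(set_a & set_b) / smaller
```

Using the smaller count as the denominator means that a vulnerability naming a single product can still reach a similarity of 1 against an entry that lists several products.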
Example five
This fifth embodiment further provides a computer device capable of executing programs, such as a smart phone, a tablet computer, a notebook computer, a desktop computer, a rack server, a blade server, a tower server or a cabinet server (including an independent server or a server cluster formed by a plurality of servers). As shown in fig. 8, the computer device 01 of this embodiment includes at least, but is not limited to, a memory 012 and a processor 011 that can be communicatively connected to each other via a system bus. It is noted that fig. 8 only shows the computer device 01 having the memory 012 and the processor 011, but it is to be understood that not all of the shown components are required to be implemented, and more or fewer components may be implemented instead.
In this embodiment, the memory 012 (i.e., a readable storage medium) includes a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., an SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 012 may be an internal storage unit of the computer device 01, such as a hard disk or a memory of the computer device 01. In other embodiments, the memory 012 may also be an external storage device of the computer device 01, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the computer device 01. Of course, the memory 012 may also include both an internal storage unit and an external storage device of the computer device 01. In this embodiment, the memory 012 is generally used to store an operating system and various types of application software installed in the computer device 01, for example, the program code of the vulnerability matching device in the fourth embodiment. Further, the memory 012 may also be used to temporarily store various types of data that have been output or are to be output.
In some embodiments, the processor 011 can be a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 011 generally controls the overall operation of the computer device 01. In this embodiment, the processor 011 is used to run the program code stored in the memory 012 or to process data, for example, to execute the vulnerability matching method.
Example six
The sixth embodiment further provides a computer-readable storage medium, such as a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, a server, an App application store, etc., on which a computer program is stored that implements the corresponding functions when executed by a processor. The computer-readable storage medium of this embodiment is used for storing the vulnerability matching apparatus, which, when executed by a processor, implements the vulnerability matching method of the first embodiment.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner.
The above description is only a preferred embodiment of the present application, and not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings of the present application, or which are directly or indirectly applied to other related technical fields, are included in the scope of the present application.