CN117494132A - Intelligent vulnerability recurrence retrieval method and system - Google Patents
- Publication number
- CN117494132A (application CN202311414864.3A)
- Authority
- CN
- China
- Prior art keywords
- vulnerability
- model
- data
- exp
- task
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/57—Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
- G06F21/577—Assessing vulnerabilities and evaluating computer system security
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/90335—Query processing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/0985—Hyperparameter optimisation; Meta-learning; Learning-to-learn
Abstract
The invention discloses an intelligent vulnerability recurrence retrieval method and system, comprising the following steps: first, HTML webpages containing vulnerability EXPs are collected and sorted; the sorted vulnerability data are fed into a TextCNN framework for model learning and analysis to form an AI engine; vulnerability data are crawled with the Scrapy framework while tasks are issued to a task center; the AI engine fetches a task from the task service center, loads the corresponding webpage file from the file system for analysis, and places the analysis result into a calculation result pool; the main service obtains the calculation result, puts the vulnerability EXP into storage, and automatically associates the CVE and CNNVD numbers; after the associated vulnerability EXP is stored, manual auxiliary auditing is performed, and any vulnerability judged to be a trusted EXP has its state updated and is used for model training. Based on machine learning, the invention can parse and extract raw data from different sources and then score and judge them, which both greatly improves data accuracy and saves retrieval time.
Description
Technical Field
The invention relates to the technical field of automated vulnerability retrieval, and in particular to an AI-engine analysis, judgment, and self-learning method for retrieving vulnerability reproduction methods.
Background
The frameworks and storage modes in existing vulnerability management systems basically update, sort, and store vulnerabilities in real time in chronological order, with vulnerability information judged manually, which is inefficient. Moreover, some links in the vulnerability information base give no concrete reproduction or repair scheme, so the ability to ward off security threats is weak, as is the timeliness of repairing and preventing vulnerabilities.
Because manually searching for a vulnerability reproduction method is too cumbersome, a suitable reproduction method often cannot be found, a great deal of time is consumed, and efficiency is too low. To solve this problem effectively, an automated retrieval framework is introduced to perform vulnerability retrieval, combined with an AI engine for analysis and judgment.
Disclosure of Invention
The technical problem to be solved by the invention is how to accurately locate the reproduction process of a vulnerability and how to retrieve vulnerability reproduction methods without manual work.
The invention solves the technical problems by the following technical means:
an intelligent vulnerability recurrence retrieval method comprises the following steps:
a101, collecting and sorting HTML webpages containing vulnerability EXP;
A102, feeding the vulnerability EXP webpage HTML data sorted in step A101 into a TextCNN framework and performing model learning and analysis to form an AI engine;
A103, crawling vulnerability data (CVE and CNNVD) with the Scrapy framework, and simultaneously issuing a task to a task center;
a104, the AI engine goes to a task service center to acquire a task, loads a webpage file in a file system for analysis, and drops a result into a calculation result pool after the analysis is finished;
a105, the main service obtains a calculation result, carries out vulnerability EXP warehousing, and automatically associates CVE and CNNVD numbers;
and A106, after the associated vulnerability EXP is put in storage, performing manual auxiliary auditing and updating the state of any vulnerability judged to be a trusted EXP; after the state update, the EXP data judged feasible are automatically synchronized to the model through a synchronization center (a synchronization service between the data and the model) and fed back directly into model training.
The invention is based on the machine learning technology, can analyze and extract the original data from different sources and then carry out scoring judgment, thus not only greatly enhancing the accuracy of the data, but also saving the retrieval time. Compared with the traditional statistical analysis and retrieval method, the automatic AI engine-based retrieval method is more accurate and efficient.
Further, the method in step A101 is as follows: various HTML webpages containing vulnerability EXPs are collected, and whether the data meet the following requirements is judged manually: the required fields are the vulnerability number, EXP title, date, software link, version number, software platform, test system, and reproduction steps; other information such as the attack scenario, vulnerability description, and screenshots may also be included.
Furthermore, the sorted vulnerability EXP data are imported into the TextCNN framework for model training and learning, the vulnerability EXP key elements are extracted, and the text data are preprocessed into fixed-length word-vector or character-vector representations; the text data at least comprise: the vulnerability number, EXP title, date, software link, version number, software platform, test system, reproduction steps, attack scenario, vulnerability description, screenshots, and other information; the texts are labeled so that each text corresponds to one label;
the method comprises the following steps:
a1021 model structure definition: the model structure of the textCNN is constructed and comprises an input layer, a convolution layer, a pooling layer, a full connection layer and an output layer. The input layer receives text vectors as input, the convolution layer is used for extracting local features, the pooling layer is used for selecting the most obvious features, the full connection layer is used for learning the representation of the text, and the output layer is used for classifying the text;
a1022 loss function selection: selecting an appropriate loss function for TextCNN, typically in a text classification task, using a cross entropy loss function to measure the difference between model predictions and real labels;
a1023 model optimization: an optimization algorithm (such as a gradient descent method) is adopted to minimize a loss function, and parameters of the model are updated, so that a prediction result of the model is more similar to a real label;
A1024, model training: train the TextCNN with the labeled training data. During training, the model gradually learns the features of the text by iteratively updating its parameters, optimizing its performance;
A1025, model evaluation: evaluate the trained TextCNN with a verification set or test set, computing indexes such as accuracy, precision, and recall to measure the model's performance;
A1026, model tuning: according to the evaluation result, tune the model, possibly adjusting the hyperparameters, model structure, or optimization algorithm to further improve performance;
A1027, model application: once training and evaluation reach a satisfactory effect, the trained TextCNN model can be used in practical applications.
Further, the method in A103 is as follows: CVE and CNNVD basic data are acquired with the Scrapy framework, then cleaned and stored by a Pipeline; the stored data include the fields CVE number, vulnerability name, vulnerability cause, vulnerability level, vulnerability source, release time, vulnerability hazard, vulnerability description, repair measures, reference links, POC, Bugtraq number, affected software, security suggestions, vulnerability provider, CNNVD number, patch, threat type, vulnerability source, and update time; at the same time, a task is issued at the task center;
the method comprises the following steps:
a1031, the engine acquires initial URL of vulnerability basic data from the Spider and adds an initial request into a scheduler;
a1032, the scheduler selects the next request from the request queue according to the set scheduling strategy; the dispatcher sends the selected request to the downloader, and the downloader sends the request to the website server and obtains the response;
a1033, the downloader returns the response to the engine, and the engine gives the response to the Spider for analysis;
a1034, extracting data from the response by the Spider, generating a new request, and adding the new request to the scheduler;
A1035, steps A1032 to A1034 are repeated until all requests in the scheduler are processed; finally, the data are sent to the Pipeline for processing and storage;
and A1036, after the data storage is successful, simultaneously issuing a task in the task center.
Further, the method in A104 is as follows: the AI engine acquires a task from the task center, loads the HTML webpage files acquired through Scrapy from the file system, analyzes and scores them, and selects the best-matching HTML page; after analysis, the analysis data, containing the fields CVE_ID, vulnerability name, EXP link, link source, average score, update time, and state, are synchronized into a result pool;
the method comprises the following steps:
a1041, an AI engine acquires an analysis task from a task center, analyzes a webpage to be analyzed, judges the webpage, and scores and outputs the webpage;
a1042, selecting and setting links according to analysis scenes, and formulating various strategies to score credible EXP;
and A1043, synchronizing the data information to a result pool after the grading judgment of the AI engine.
Corresponding to the method, the invention also provides an intelligent vulnerability recurrence retrieval system, which comprises the following steps:
the data collection module is used for collecting and sorting the HTML webpage containing the vulnerability EXP;
the learning module is used for feeding the sorted vulnerability EXP webpage HTML data into the TextCNN framework and performing model learning and analysis to form an AI engine;
the data crawling module is used for crawling the vulnerability data CVE and CNNVD with the Scrapy framework while issuing a task to the task center;
the analysis module is used for the AI engine to fetch tasks from the task service center, load the webpage files from the file system for analysis, and, after the analysis is finished, place the results into a calculation result pool;
the association module is used for obtaining a calculation result by the main service, carrying out vulnerability EXP warehousing, and automatically associating CVE and CNNVD numbers;
and the updating module is used for carrying out manual auxiliary audit after the associated vulnerability EXP is put in storage, carrying out state updating on the vulnerability judged to be the trusted EXP, and automatically feeding back to the model training through the synchronization center again after updating.
Further, the method in the data collection module is as follows: various HTML webpages containing vulnerability EXPs are collected, and whether the data meet the following requirements is judged manually: the required fields are the vulnerability number, EXP title, date, software link, version number, software platform, test system, and reproduction steps; other information such as the attack scenario, vulnerability description, and screenshots may also be included.
Further, the method in the learning module is as follows: the sorted vulnerability EXP data are imported into the TextCNN framework for model training and learning, the vulnerability EXP key elements are extracted, and the text data are preprocessed into fixed-length word-vector or character-vector representations; the text data at least comprise: the vulnerability number, EXP title, date, software link, version number, software platform, test system, reproduction steps, attack scenario, vulnerability description, screenshots, and other information; the texts are labeled so that each text corresponds to one label;
the method comprises the following steps:
a1021 model structure definition: the model structure of the textCNN is constructed and comprises an input layer, a convolution layer, a pooling layer, a full connection layer and an output layer. The input layer receives text vectors as input, the convolution layer is used for extracting local features, the pooling layer is used for selecting the most obvious features, the full connection layer is used for learning the representation of the text, and the output layer is used for classifying the text;
a1022 loss function selection: selecting an appropriate loss function for TextCNN, typically in a text classification task, using a cross entropy loss function to measure the difference between model predictions and real labels;
a1023 model optimization: an optimization algorithm (such as a gradient descent method) is adopted to minimize a loss function, and parameters of the model are updated, so that a prediction result of the model is more similar to a real label;
A1024, model training: train the TextCNN with the labeled training data. During training, the model gradually learns the features of the text by iteratively updating its parameters, optimizing its performance;
A1025, model evaluation: evaluate the trained TextCNN with a verification set or test set, computing indexes such as accuracy, precision, and recall to measure the model's performance;
A1026, model tuning: according to the evaluation result, tune the model, possibly adjusting the hyperparameters, model structure, or optimization algorithm to further improve performance;
A1027, model application: once training and evaluation reach a satisfactory effect, the trained TextCNN model can be used in practical applications.
Further, the method in the data crawling module is as follows: CVE and CNNVD basic data are acquired with the Scrapy framework, then cleaned and stored by a Pipeline; the stored data include the fields CVE number, vulnerability name, vulnerability cause, vulnerability level, vulnerability source, release time, vulnerability hazard, vulnerability description, repair measures, reference links, POC, Bugtraq number, affected software, security suggestions, vulnerability provider, CNNVD number, patch, threat type, vulnerability source, and update time; at the same time, a task is issued at the task center;
the method comprises the following steps:
a1031, the engine acquires initial URL of vulnerability basic data from the Spider and adds an initial request into a scheduler;
a1032, the scheduler selects the next request from the request queue according to the set scheduling strategy; the dispatcher sends the selected request to the downloader, and the downloader sends the request to the website server and obtains the response;
a1033, the downloader returns the response to the engine, and the engine gives the response to the Spider for analysis;
a1034, extracting data from the response by the Spider, generating a new request, and adding the new request to the scheduler;
A1035, steps A1032 to A1034 are repeated until all requests in the scheduler are processed; finally, the data are sent to the Pipeline for processing and storage;
and A1036, after the data storage is successful, simultaneously issuing a task in the task center.
Further, the method in the analysis module is as follows: the AI engine acquires a task from the task center, loads the HTML webpage files acquired through Scrapy from the file system, analyzes and scores them, and selects the best-matching HTML page; after analysis, the analysis data, containing the fields CVE_ID, vulnerability name, EXP link, link source, average score, update time, and state, are synchronized into a result pool;
the method comprises the following steps:
a1041, an AI engine acquires an analysis task from a task center, analyzes a webpage to be analyzed, judges the webpage, and scores and outputs the webpage;
a1042, selecting and setting links according to analysis scenes, and formulating various strategies to score credible EXP;
and A1043, synchronizing the data information to a result pool after the grading judgment of the AI engine.
The invention has the advantages that:
the invention is based on the machine learning technology, can analyze and extract the original data from different sources and then carry out scoring judgment, thus not only greatly enhancing the accuracy of the data, but also saving the retrieval time. Compared with the traditional statistical analysis and retrieval method, the automatic AI engine-based retrieval method is more accurate and efficient.
In the combined automated/AI-engine vulnerability retrieval method, after the automated retrieval framework is adopted, the needed CVE and CNNVD vulnerabilities are put into ES (Elasticsearch) indexes; once stored, all the vulnerabilities are passed to the AI engine to be analyzed, judged, and scored; every link with a vulnerability is stored and displayed; and finally the results are packaged.
Drawings
Fig. 1 is a flow chart of an intelligent vulnerability retrieval method in an embodiment of the invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions in the embodiments of the present invention will be clearly and completely described in the following in conjunction with the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention. An intelligent vulnerability reproduction retrieval method, as shown in figure 1, comprises the following specific steps:
a101, collecting and sorting HTML webpages containing vulnerability EXP;
a102, throwing the HTML data of the loophole EXP webpage which is arranged in the step A101 into a TEXTCNN framework, and performing model learning analysis to form an AI engine;
A103, crawling vulnerability data (CVE, CNNVD) with the Scrapy framework, and simultaneously issuing a task to a task center (MQ);
A104, the AI engine fetches tasks from the task service center, loads the webpage files from the file system for analysis, and, after the analysis is finished, places the results into a calculation result pool (MQ);
a105, the main service obtains a calculation result, carries out vulnerability EXP warehousing, and automatically associates CVE and CNNVD numbers;
A106, after the associated vulnerability EXP is put in storage, manual auxiliary auditing is performed and the state of any vulnerability judged to be a genuine EXP is updated; after the update, the EXP is fed back directly into model training through the synchronization center to improve accuracy.
The contents of each step are specifically described below.
The method in A101 is as follows:
various HTML webpages containing the vulnerability EXP are collected, and whether the data need to meet the following contents is judged manually: fields that must be included, the number of the vulnerability (CVE, CNNVD number), EXP Title (explet Title), date (data), software Link (Software Link), version number (Version), software Platform (Platform), test system (test on), reproduction step (Steps To Reproduce) may include other information such as Attack scenario (Attack scene), vulnerability Description (Description), screenshot, etc.
The method in A102 is as follows:
The sorted vulnerability EXP data are imported into the TextCNN framework for model training and learning, the vulnerability EXP key elements are extracted, and the text data are preprocessed into fixed-length word-vector or character-vector representations. The text data include, but are not limited to: the vulnerability number, EXP title, date, software link, version number, software platform, test system, reproduction steps, attack scenario, vulnerability description, screenshots, and other information; the texts are labeled so that each text corresponds to one label.
The method can be divided into the following steps:
a1021 model structure definition: the model structure of the textCNN is constructed and comprises an input layer, a convolution layer, a pooling layer, a full connection layer and an output layer. The input layer receives text vectors as input, the convolution layer is used for extracting local features, the pooling layer is used for selecting the most obvious features, the full connection layer is used for learning the representation of the text, and the output layer is used for classifying the text.
A1022 loss function selection: the appropriate loss function is selected for TextCNN, typically in a text classification task, using a cross entropy loss function to measure the difference between model predictions and real labels.
A1023 model optimization: an optimization algorithm (such as a gradient descent method) is adopted to minimize the loss function, and parameters of the model are updated so that a prediction result of the model is closer to a real label.
A1024, model training: train the TextCNN with the labeled training data. During training, the model gradually learns the features of the text by iteratively updating its parameters, optimizing its performance.
A1025, model evaluation: evaluate the trained TextCNN with a verification set or test set, computing indexes such as accuracy, precision, and recall to measure the model's performance.
A1026, model tuning: according to the evaluation result, tune the model, possibly adjusting the hyperparameters, model structure, or optimization algorithm to further improve performance.
A1027, model application: once training and evaluation reach a satisfactory effect, the trained TextCNN model can be used in practical applications.
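The model-construction steps above can be illustrated with a toy forward pass. The sketch below implements, in plain Python, the layer order named in the structure definition (input, convolution, max-pooling, fully connected, output) and the cross-entropy loss used for text classification. The vocabulary, dimensions, and random weights are illustrative assumptions, not the trained AI engine.

```python
import math
import random

# Toy TextCNN forward pass: input layer -> convolution -> max-over-time
# pooling -> fully connected -> softmax output, plus cross-entropy loss.
# All sizes and weights below are made-up illustrative values.
random.seed(0)

VOCAB = {"<pad>": 0, "sql": 1, "injection": 2, "rce": 3, "poc": 4}
SEQ_LEN, EMB, KERNEL, FILTERS, CLASSES = 6, 4, 2, 3, 2

def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

embedding = rand_matrix(len(VOCAB), EMB)
conv_w = [rand_matrix(KERNEL, EMB) for _ in range(FILTERS)]  # one kernel per filter
fc_w = rand_matrix(FILTERS, CLASSES)

def forward(tokens):
    # Input layer: pad/truncate to a fixed length, then look up embeddings.
    ids = [VOCAB.get(t, 0) for t in tokens][:SEQ_LEN]
    ids += [0] * (SEQ_LEN - len(ids))
    x = [embedding[i] for i in ids]
    # Convolution layer: slide each kernel over the sequence (local features);
    # pooling layer: keep the strongest response per filter (max-over-time).
    pooled = []
    for w in conv_w:
        acts = []
        for pos in range(SEQ_LEN - KERNEL + 1):
            s = sum(w[k][e] * x[pos + k][e] for k in range(KERNEL) for e in range(EMB))
            acts.append(max(0.0, s))  # ReLU
        pooled.append(max(acts))
    # Fully connected + output layer: class logits, then softmax.
    logits = [sum(pooled[f] * fc_w[f][c] for f in range(FILTERS)) for c in range(CLASSES)]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    return [e / sum(exps) for e in exps]

def cross_entropy(probs, label):
    # A1022: difference between the model prediction and the true label.
    return -math.log(probs[label])

probs = forward(["sql", "injection", "poc"])
loss = cross_entropy(probs, label=1)
```

Training (A1023/A1024) would then adjust `embedding`, `conv_w`, and `fc_w` by gradient descent to minimize this loss over the labeled corpus.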
The method in A103 is as follows:
CVE and CNNVD basic data are acquired with the Scrapy framework and then cleaned and stored by a Pipeline. The stored data include the fields CVE number, vulnerability name, vulnerability cause, vulnerability level, vulnerability source, release time, vulnerability hazard, vulnerability description, repair measures, reference links, POC (proof of concept), Bugtraq number, affected software, security suggestions, vulnerability provider, CNNVD number, patch, threat type, vulnerability source, and update time. At the same time, a task is issued at the task center.
The method can be divided into the following steps:
a1031, the engine acquires initial URL of vulnerability basic data from the Spider and adds the initial request to the scheduler.
A1032, the scheduler selects the next request from the request queue according to the set scheduling policy. The dispatcher sends the selected request to the downloader, and the downloader sends the request to the website server and obtains the response.
And A1033, the downloader returns the response to the engine, and the engine gives the response to the Spider for analysis.
A1034, spider extracts the data from the response and generates a new request, which is added to the scheduler.
A1035, steps A1032 to A1034 are repeated until all requests in the scheduler are processed. Finally, the data are sent to the Pipeline for processing and storage.
And A1036, after the data storage is successful, simultaneously issuing a task in the task center.
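The crawl loop A1031–A1035 can be sketched with an in-memory scheduler, with canned responses standing in for Scrapy's real downloader and Spider; the URLs and page contents below are made up for illustration.

```python
from collections import deque

# Canned "website": each URL maps to extracted items and discovered links.
PAGES = {
    "https://example.org/cve-list": {
        "items": [], "links": ["https://example.org/CVE-2023-0001"]},
    "https://example.org/CVE-2023-0001": {
        "items": [{"cve": "CVE-2023-0001"}], "links": []},
}

def crawl(start_url):
    # A1031: the engine seeds the scheduler with the initial request.
    scheduler, seen, pipeline = deque([start_url]), {start_url}, []
    while scheduler:
        url = scheduler.popleft()                 # A1032: pick next request
        response = PAGES.get(url, {"items": [], "links": []})  # downloader fetch
        pipeline.extend(response["items"])        # A1033/A1034: spider extracts data
        for link in response["links"]:            # ...and generates new requests
            if link not in seen:
                seen.add(link)
                scheduler.append(link)
    return pipeline  # A1035: finally handed to the Pipeline for storage

items = crawl("https://example.org/cve-list")
```

In the real system A1036 would follow: once the Pipeline stores the items, a task is issued at the task center.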
The method in A104 is as follows:
The AI engine acquires a task from the task center, loads the HTML webpage files acquired through Scrapy from the file system, analyzes and scores them, and selects the best-matching HTML page. After analysis, the analysis data, containing the fields CVE_ID, vulnerability name, EXP link, link source, average score, update time, and state, are synchronized into a result pool (MQ).
The method can be divided into the following steps:
a1041, an AI engine acquires an analysis task from a task center, analyzes a webpage to be analyzed, judges the webpage, and scores and outputs the webpage;
A1042, 5 links are selected according to the analysis scene, and three strategies are formulated for scoring and output, as follows:
1. When one of the 5 link scores is greater than or equal to 0.9, the link is judged to be a trusted EXP;
2. When three of the 5 link scores are greater than or equal to 0.5, the links are judged to be trusted EXPs;
3. When the average score of the 5 links is greater than or equal to 0.6, the links are judged to be trusted EXPs;
a1043, after the grading judgment of the AI engine, synchronizing the data information to a result pool (MQ);
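The three trust strategies of A1042 can be written directly as a predicate over the five link scores. This is a literal transcription of the rules above; treating the strategies as alternatives tried in turn is an assumption.

```python
def is_trusted_exp(scores):
    """Apply the three A1042 strategies to 5 link scores in [0, 1]."""
    assert len(scores) == 5
    if max(scores) >= 0.9:                            # strategy 1
        return True
    if sum(1 for s in scores if s >= 0.5) >= 3:       # strategy 2
        return True
    return sum(scores) / len(scores) >= 0.6           # strategy 3
```

For example, `[0.95, 0.1, 0.1, 0.1, 0.1]` passes via strategy 1, `[0.5, 0.5, 0.5, 0.1, 0.1]` via strategy 2, and `[0.88, 0.88, 0.45, 0.4, 0.4]` only via the average-score rule.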
the method in A105 is as follows:
A main service is constructed, which acquires the calculation results from the result pool (MQ), continuously stores the qualifying data in the database, and associates the vulnerability numbers;
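The main service of A105 can be sketched as a consumer of the result pool; the field names follow the A104 result data, while the queue.Queue stand-in for the MQ and the dict stand-in for the database are illustrative:

```python
import queue

def run_main_service(result_pool, db):
    """Drain the result pool (MQ), store trusted records, link CVE numbers."""
    while True:
        try:
            record = result_pool.get_nowait()    # next A1043 result
        except queue.Empty:
            return db
        if record.get("status") == "trusted":    # only qualifying data is stored
            db[record["cve_id"]] = {             # keyed by vulnerability number
                "exp_link": record["exp_link"],
                "avg_score": record["avg_score"],
            }

pool = queue.Queue()
pool.put({"cve_id": "CVE-2023-0001", "exp_link": "http://e/1",
          "avg_score": 0.8, "status": "trusted"})
pool.put({"cve_id": "CVE-2023-0002", "exp_link": "http://e/2",
          "avg_score": 0.3, "status": "rejected"})
db = run_main_service(pool, {})
```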
the method in A106 is as follows:
after the data is associated and stored, links that match the vulnerability EXP are marked through manual auxiliary auditing; after marking succeeds, a task is sent to the task center, the webpage content is returned to the AI engine for repeated secondary training, and the model is finally consolidated.
Corresponding to the method, the invention also provides an intelligent vulnerability recurrence retrieval system, which comprises:
the data collection module is used for collecting and sorting HTML webpages containing vulnerability EXP;
the learning module is used for feeding the curated vulnerability EXP webpage HTML data into the TextCNN framework and performing model learning analysis to form an AI engine;
the data crawling module is used for crawling the vulnerability data CVE and CNNVD using the Scrapy framework and simultaneously issuing a task to the task center;
the analysis module is used for the AI engine to acquire tasks from the task service center, load the webpage files from the file system for analysis, and place the results into the calculation result pool after analysis;
the association module is used for the main service to obtain the calculation results, store the vulnerability EXP in the database, and automatically associate CVE and CNNVD numbers;
and the updating module is used for performing manual auxiliary audit after the associated vulnerability EXP is stored, updating the state of vulnerabilities judged to be trusted EXP, and, after updating, automatically feeding back to model training through the synchronization center.
Further, the method in the data collection module is as follows: various HTML webpages containing vulnerability EXP are collected, and whether the data meets the following fields is judged manually: vulnerability number, EXP title, date, software link, version number, software platform, test system and reproduction steps; the data may also contain other information such as attack scenario, vulnerability description and screenshots.
Further, the method in the learning module is as follows: the curated vulnerability EXP data is imported into the TextCNN framework for model training and learning, the key elements of the vulnerability EXP are extracted, and the text data is preprocessed into fixed-length word-vector or character-vector representations, wherein the text data at least comprises: vulnerability number, EXP title, date, software link, version number, software platform, test system, reproduction steps, attack scenario, vulnerability description, screenshots and other information; the texts are labeled so that each text corresponds to one label;
the method comprises the following steps:
A1021, model structure definition: the TextCNN model structure is constructed, comprising an input layer, a convolution layer, a pooling layer, a fully connected layer and an output layer. The input layer receives text vectors as input, the convolution layer extracts local features, the pooling layer selects the most salient features, the fully connected layer learns the representation of the text, and the output layer classifies the text;
A1022, loss function selection: an appropriate loss function is selected for TextCNN; in a text classification task, a cross-entropy loss function is typically used to measure the difference between model predictions and true labels;
A1023, model optimization: an optimization algorithm (such as gradient descent) is adopted to minimize the loss function and update the model parameters, so that the model's predictions move closer to the true labels;
A1024, model training: TextCNN is trained with the labeled training data. During training, the model gradually learns the features of the text by iteratively updating its parameters, optimizing model performance;
A1025, model evaluation: the trained TextCNN is evaluated with a validation set or test set, and metrics such as accuracy, precision and recall are computed to measure model performance;
A1026, model tuning: the model is optimized according to the evaluation results; hyperparameters, the model structure or the optimization algorithm may be adjusted to further improve performance;
A1027, model application: after model training and evaluation reach a satisfactory result, the trained TextCNN model can be used in practical applications.
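The layer stack of A1021 (input, convolution, pooling, fully connected and output layers) can be illustrated with a dependency-free forward pass; the vocabulary, layer sizes and randomly initialised parameters are illustrative, and the output probabilities are meaningless until the training of A1024 has been applied:

```python
import math
import random

random.seed(0)

EMB, SEQ, FILTERS, KSIZE, CLASSES = 8, 16, 4, 3, 2
vocab = {"<pad>": 0, "cve": 1, "exp": 2, "poc": 3, "overflow": 4}

# Randomly initialised parameters; training (A1024) is not shown.
embed = [[random.uniform(-1, 1) for _ in range(EMB)] for _ in range(len(vocab))]
conv_w = [[[random.uniform(-1, 1) for _ in range(EMB)] for _ in range(KSIZE)]
          for _ in range(FILTERS)]
fc_w = [[random.uniform(-1, 1) for _ in range(FILTERS)] for _ in range(CLASSES)]

def forward(tokens):
    # Input layer: map tokens to a fixed-length id sequence (pad/truncate).
    ids = [vocab.get(t, 0) for t in tokens][:SEQ]
    ids += [0] * (SEQ - len(ids))
    x = [embed[i] for i in ids]                        # SEQ x EMB matrix
    feats = []
    for f in conv_w:                                   # convolution layer
        col = []
        for i in range(SEQ - KSIZE + 1):               # slide filter over text
            s = sum(f[k][e] * x[i + k][e]
                    for k in range(KSIZE) for e in range(EMB))
            col.append(max(0.0, s))                    # ReLU activation
        feats.append(max(col))                         # max-over-time pooling
    # Fully connected layer + softmax output layer for classification.
    logits = [sum(w[j] * feats[j] for j in range(FILTERS)) for w in fc_w]
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    return [e / sum(exps) for e in exps]

probs = forward(["cve", "exp", "overflow"])
```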
Further, the method in the data crawling module is as follows: the Scrapy framework is used to acquire CVE and CNNVD basic data; after acquisition, the data is cleaned and stored by a Pipeline, wherein the stored data comprises the fields CVE number, vulnerability name, vulnerability cause, vulnerability level, vulnerability source, release time, vulnerability hazard, vulnerability description, remediation measures, reference links, POC, Bugtraq number, affected software, security advisory, vulnerability vendor, CNNVD number, patch, threat type and update time; a task is issued in the task center at the same time;
the method comprises the following steps:
A1031, the engine acquires the initial URL of the vulnerability basic data from the Spider and adds the initial request to the scheduler;
A1032, the scheduler selects the next request from the request queue according to the set scheduling policy; the scheduler sends the selected request to the downloader, and the downloader sends the request to the website server and obtains the response;
A1033, the downloader returns the response to the engine, and the engine hands the response to the Spider for parsing;
A1034, the Spider extracts data from the response, generates new requests, and adds them to the scheduler;
A1035, steps A1032-A1034 are looped until all requests in the scheduler are processed; finally, the data is sent to the Pipeline for processing and storage;
and A1036, after the data is stored successfully, a task is simultaneously issued in the task center.
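The Pipeline cleaning and storage step of the data crawling module can be sketched as a normalisation pass before warehousing; the required fields are a subset of those listed above, while the function name and the cleaning rules are illustrative:

```python
REQUIRED = ("cve_number", "vulnerability_name", "release_time")

def clean_item(raw):
    """Normalise one crawled record before storage (Pipeline stage).

    Strips whitespace, uppercases the CVE number, and rejects records
    missing required fields by returning None (dropped by the Pipeline).
    """
    item = {k: v.strip() if isinstance(v, str) else v for k, v in raw.items()}
    if not all(item.get(f) for f in REQUIRED):
        return None
    item["cve_number"] = item["cve_number"].upper()
    return item

store = {}
for raw in [
    {"cve_number": " cve-2023-1234 ", "vulnerability_name": "demo overflow",
     "release_time": "2023-10-27"},
    {"cve_number": "CVE-2023-9999", "vulnerability_name": "",  # incomplete
     "release_time": "2023-10-27"},
]:
    item = clean_item(raw)
    if item:
        store[item["cve_number"]] = item   # warehousing keyed by CVE number
```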
Further, the method in the analysis module is as follows: the AI engine acquires a task from the task center, loads the HTML webpage file collected by Scrapy from the file system, analyzes and scores it, and selects the best-matching HTML page; after analysis, the result data, containing fields such as CVE_ID, vulnerability name, EXP link, link source, average score, update time and state, is synchronized into the result pool;
the method comprises the following steps:
A1041, the AI engine acquires an analysis task from the task center, parses the webpage to be analyzed, judges it, and outputs a score;
A1042, a set number of links is selected according to the analysis scenario, and multiple strategies are formulated to score trusted EXP;
and A1043, after the AI engine's scoring judgment, the data information is synchronized to the result pool.
The above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (10)
1. An intelligent vulnerability recurrence retrieval method, characterized by comprising the following steps:
A101, collecting and sorting HTML webpages containing vulnerability EXP;
A102, feeding the vulnerability EXP webpage HTML data sorted in step A101 into the TextCNN framework and performing model learning analysis to form an AI engine;
A103, crawling the vulnerability data CVE and CNNVD using the Scrapy framework, and issuing a task to the task center;
A104, the AI engine acquires a task from the task service center, loads the webpage file from the file system for analysis, and places the result into the calculation result pool after analysis;
A105, the main service obtains the calculation result, stores the vulnerability EXP in the database, and automatically associates CVE and CNNVD numbers;
and A106, after the associated vulnerability EXP is stored, manual auxiliary audit is performed, the state of vulnerabilities judged to be trusted EXP is updated, and after the state update, the result is automatically fed back to model training through the synchronization center.
2. The intelligent vulnerability recurrence retrieval method according to claim 1, wherein the method in step A101 is as follows: various HTML webpages containing vulnerability EXP are collected, and whether the data meets the following fields is judged manually: vulnerability number, EXP title, date, software link, version number, software platform, test system and reproduction steps; the data may also contain other information such as attack scenario, vulnerability description and screenshots.
3. The intelligent vulnerability recurrence retrieval method according to claim 1, wherein the method in A102 is as follows: the curated vulnerability EXP data is imported into the TextCNN framework for model training and learning, the key elements of the vulnerability EXP are extracted, and the text data is preprocessed into fixed-length word-vector or character-vector representations, wherein the text data at least comprises: vulnerability number, EXP title, date, software link, version number, software platform, test system, reproduction steps, attack scenario, vulnerability description, screenshots and other information; the texts are labeled so that each text corresponds to one label;
the method comprises the following steps:
A1021, model structure definition: the TextCNN model structure is constructed, comprising an input layer, a convolution layer, a pooling layer, a fully connected layer and an output layer. The input layer receives text vectors as input, the convolution layer extracts local features, the pooling layer selects the most salient features, the fully connected layer learns the representation of the text, and the output layer classifies the text;
A1022, loss function selection: an appropriate loss function is selected for TextCNN; in a text classification task, a cross-entropy loss function is typically used to measure the difference between model predictions and true labels;
A1023, model optimization: an optimization algorithm (such as gradient descent) is adopted to minimize the loss function and update the model parameters, so that the model's predictions move closer to the true labels;
A1024, model training: TextCNN is trained with the labeled training data. During training, the model gradually learns the features of the text by iteratively updating its parameters, optimizing model performance;
A1025, model evaluation: the trained TextCNN is evaluated with a validation set or test set, and metrics such as accuracy, precision and recall are computed to measure model performance;
A1026, model tuning: the model is optimized according to the evaluation results; hyperparameters, the model structure or the optimization algorithm may be adjusted to further improve performance;
A1027, model application: after model training and evaluation reach a satisfactory result, the trained TextCNN model can be used in practical applications.
4. The intelligent vulnerability recurrence retrieval method according to any one of claims 1-3, wherein the method in A103 is as follows: the Scrapy framework is used to acquire CVE and CNNVD basic data; after acquisition, the data is cleaned and stored by a Pipeline, wherein the stored data comprises the fields CVE number, vulnerability name, vulnerability cause, vulnerability level, vulnerability source, release time, vulnerability hazard, vulnerability description, remediation measures, reference links, POC, Bugtraq number, affected software, security advisory, vulnerability vendor, CNNVD number, patch, threat type and update time; a task is issued in the task center at the same time;
the method comprises the following steps:
A1031, the engine acquires the initial URL of the vulnerability basic data from the Spider and adds the initial request to the scheduler;
A1032, the scheduler selects the next request from the request queue according to the set scheduling policy; the scheduler sends the selected request to the downloader, and the downloader sends the request to the website server and obtains the response;
A1033, the downloader returns the response to the engine, and the engine hands the response to the Spider for parsing;
A1034, the Spider extracts data from the response, generates new requests, and adds them to the scheduler;
A1035, steps A1032-A1034 are looped until all requests in the scheduler are processed; finally, the data is sent to the Pipeline for processing and storage;
and A1036, after the data is stored successfully, a task is simultaneously issued in the task center.
5. The intelligent vulnerability recurrence retrieval method according to any one of claims 1-3, wherein the method in A104 is: the AI engine acquires a task from the task center, loads the HTML webpage file collected by Scrapy from the file system, analyzes and scores it, and selects the best-matching HTML page; after analysis, the result data, containing fields such as CVE_ID, vulnerability name, EXP link, link source, average score, update time and state, is synchronized into the result pool;
the method comprises the following steps:
A1041, the AI engine acquires an analysis task from the task center, parses the webpage to be analyzed, judges it, and outputs a score;
A1042, a set number of links is selected according to the analysis scenario, and multiple strategies are formulated to score trusted EXP;
and A1043, after the AI engine's scoring judgment, the data information is synchronized to the result pool.
6. An intelligent vulnerability recurrence retrieval system, comprising:
the data collection module is used for collecting and sorting HTML webpages containing vulnerability EXP;
the learning module is used for feeding the curated vulnerability EXP webpage HTML data into the TextCNN framework and performing model learning analysis to form an AI engine;
the data crawling module is used for crawling the vulnerability data CVE and CNNVD using the Scrapy framework and simultaneously issuing a task to the task center;
the analysis module is used for the AI engine to acquire tasks from the task service center, load the webpage files from the file system for analysis, and place the results into the calculation result pool after analysis;
the association module is used for the main service to obtain the calculation results, store the vulnerability EXP in the database, and automatically associate CVE and CNNVD numbers;
and the updating module is used for performing manual auxiliary audit after the associated vulnerability EXP is stored, updating the state of vulnerabilities judged to be trusted EXP, and, after updating, automatically feeding back to model training through the synchronization center.
7. The intelligent vulnerability recurrence retrieval system of claim 6, wherein the method in the data collection module is: various HTML webpages containing vulnerability EXP are collected, and whether the data meets the following fields is judged manually: vulnerability number, EXP title, date, software link, version number, software platform, test system and reproduction steps; the data may also contain other information such as attack scenario, vulnerability description and screenshots.
8. The intelligent vulnerability recurrence retrieval system of claim 6, wherein the method in the learning module is: the curated vulnerability EXP data is imported into the TextCNN framework for model training and learning, the key elements of the vulnerability EXP are extracted, and the text data is preprocessed into fixed-length word-vector or character-vector representations, wherein the text data at least comprises: vulnerability number, EXP title, date, software link, version number, software platform, test system, reproduction steps, attack scenario, vulnerability description, screenshots and other information; the texts are labeled so that each text corresponds to one label;
the method comprises the following steps:
A1021, model structure definition: the TextCNN model structure is constructed, comprising an input layer, a convolution layer, a pooling layer, a fully connected layer and an output layer. The input layer receives text vectors as input, the convolution layer extracts local features, the pooling layer selects the most salient features, the fully connected layer learns the representation of the text, and the output layer classifies the text;
A1022, loss function selection: an appropriate loss function is selected for TextCNN; in a text classification task, a cross-entropy loss function is typically used to measure the difference between model predictions and true labels;
A1023, model optimization: an optimization algorithm (such as gradient descent) is adopted to minimize the loss function and update the model parameters, so that the model's predictions move closer to the true labels;
A1024, model training: TextCNN is trained with the labeled training data. During training, the model gradually learns the features of the text by iteratively updating its parameters, optimizing model performance;
A1025, model evaluation: the trained TextCNN is evaluated with a validation set or test set, and metrics such as accuracy, precision and recall are computed to measure model performance;
A1026, model tuning: the model is optimized according to the evaluation results; hyperparameters, the model structure or the optimization algorithm may be adjusted to further improve performance;
A1027, model application: after model training and evaluation reach a satisfactory result, the trained TextCNN model can be used in practical applications.
9. The intelligent vulnerability recurrence retrieval system according to any one of claims 6-8, wherein the method in the data crawling module is as follows: the Scrapy framework is used to acquire CVE and CNNVD basic data; after acquisition, the data is cleaned and stored by a Pipeline, wherein the stored data comprises the fields CVE number, vulnerability name, vulnerability cause, vulnerability level, vulnerability source, release time, vulnerability hazard, vulnerability description, remediation measures, reference links, POC, Bugtraq number, affected software, security advisory, vulnerability vendor, CNNVD number, patch, threat type and update time; a task is issued in the task center at the same time;
the method comprises the following steps:
A1031, the engine acquires the initial URL of the vulnerability basic data from the Spider and adds the initial request to the scheduler;
A1032, the scheduler selects the next request from the request queue according to the set scheduling policy; the scheduler sends the selected request to the downloader, and the downloader sends the request to the website server and obtains the response;
A1033, the downloader returns the response to the engine, and the engine hands the response to the Spider for parsing;
A1034, the Spider extracts data from the response, generates new requests, and adds them to the scheduler;
A1035, steps A1032-A1034 are looped until all requests in the scheduler are processed; finally, the data is sent to the Pipeline for processing and storage;
and A1036, after the data is stored successfully, a task is simultaneously issued in the task center.
10. The intelligent vulnerability recurrence retrieval system according to any one of claims 6-8, wherein the method in the analysis module is: the AI engine acquires a task from the task center, loads the HTML webpage file collected by Scrapy from the file system, analyzes and scores it, and selects the best-matching HTML page; after analysis, the result data, containing fields such as CVE_ID, vulnerability name, EXP link, link source, average score, update time and state, is synchronized into the result pool;
the method comprises the following steps:
A1041, the AI engine acquires an analysis task from the task center, parses the webpage to be analyzed, judges it, and outputs a score;
A1042, a set number of links is selected according to the analysis scenario, and multiple strategies are formulated to score trusted EXP;
and A1043, after the AI engine's scoring judgment, the data information is synchronized to the result pool.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311414864.3A CN117494132A (en) | 2023-10-27 | 2023-10-27 | Intelligent vulnerability recurrence retrieval method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311414864.3A CN117494132A (en) | 2023-10-27 | 2023-10-27 | Intelligent vulnerability recurrence retrieval method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117494132A true CN117494132A (en) | 2024-02-02 |
Family
ID=89671848
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311414864.3A Pending CN117494132A (en) | 2023-10-27 | 2023-10-27 | Intelligent vulnerability recurrence retrieval method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117494132A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117744632A (en) * | 2024-02-20 | 2024-03-22 | 深圳融安网络科技有限公司 | Method, device, equipment and medium for constructing vulnerability information keyword extraction model |
CN117744632B (en) * | 2024-02-20 | 2024-05-10 | 深圳融安网络科技有限公司 | Method, device, equipment and medium for constructing vulnerability information keyword extraction model |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN102508859B (en) | Advertisement classification method and device based on webpage characteristic | |
CN103491205B (en) | The method for pushing of a kind of correlated resources address based on video search and device | |
CN107566376A (en) | One kind threatens information generation method, apparatus and system | |
CN101118554A (en) | Intelligent interactive request-answering system and processing method thereof | |
CN103049532A (en) | Method for creating knowledge base engine on basis of sudden event emergency management and method for inquiring knowledge base engine | |
CN108984667A (en) | A kind of public sentiment monitoring system | |
CN101782998A (en) | Intelligent judging method for illegal on-line product information and system | |
CN117494132A (en) | Intelligent vulnerability recurrence retrieval method and system | |
CN105718533A (en) | Information pushing method and device | |
CN101630315B (en) | Quick retrieval method and system | |
US20150294005A1 (en) | Method and device for acquiring information | |
CN102663060A (en) | Method and device for identifying tampered webpage | |
CN116384889A (en) | Intelligent analysis method for information big data based on natural language processing technology | |
CN113297457A (en) | High-precision intelligent information resource pushing system and pushing method | |
CN107741960A (en) | URL sorting technique and device | |
CN110310012B (en) | Data analysis method, device, equipment and computer readable storage medium | |
CN101957860A (en) | Method and device for releasing and searching information | |
CN102902790A (en) | Web page classification system and method | |
CN102902794A (en) | Web page classification system and method | |
US8463799B2 (en) | System and method for consolidating search engine results | |
CN115982429B (en) | Knowledge management method and system based on flow control | |
KR20140056402A (en) | Method, system, and apparatus for targeted searching of multi-sectional documents within an electronic document collection | |
CN103823847A (en) | Keyword extension method and device | |
CN116108955A (en) | Method, device, equipment and storage medium for upgrading and early warning of social contradiction disputes | |
CN102222067A (en) | Searching method for accurately querying information according to IP (Internet Protocol) address of keyword |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||