CN117494132A - Intelligent vulnerability recurrence retrieval method and system - Google Patents
- Publication number
- CN117494132A (application CN202311414864.3A)
- Authority
- CN
- China
- Prior art keywords
- vulnerability
- model
- data
- exp
- task
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/57—Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
- G06F21/577—Assessing vulnerabilities and evaluating computer system security
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/90335—Query processing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/0985—Hyperparameter optimisation; Meta-learning; Learning-to-learn
Abstract
The invention discloses an intelligent vulnerability recurrence retrieval method and system, comprising the following steps: first, HTML webpages containing vulnerability EXPs are collected and sorted; the sorted vulnerability data are fed into a TextCNN framework for model learning and analysis to form an AI engine; vulnerability data are crawled with the Scrapy framework while tasks are issued to a task center; the AI engine fetches a task from the task service center, loads the corresponding webpage file from the file system for analysis, and places the analysis result into a calculation result pool; the main service obtains the calculation result, puts the vulnerability EXP into storage, and automatically associates the CVE and CNNVD numbers; after the associated vulnerability EXP is stored, manual auxiliary auditing is performed, and any vulnerability judged to be a trusted EXP has its state updated and is used for model training. Based on machine learning, the invention can parse and extract raw data from different sources and then score and judge them, which both greatly improves data accuracy and saves retrieval time.
Description
Technical Field
The invention relates to the technical field of automated vulnerability retrieval, and in particular to an AI-engine analysis, judgment, and self-learning method for retrieving vulnerability reproduction methods.
Background
The frameworks and storage modes in existing vulnerability management systems basically update, sort, and store vulnerabilities in real time in chronological order, with vulnerability information judged manually, which is inefficient. Moreover, some links in the vulnerability information base give no concrete reproduction or repair scheme, so the ability to ward off security threats is weak, as is the timeliness of repairing and preventing vulnerabilities.
Because manually searching for a vulnerability reproduction method is too cumbersome, a suitable reproduction method often cannot be found, a great deal of time is consumed, and efficiency is too low. To solve this problem effectively, an automated retrieval framework is introduced to perform vulnerability retrieval, combined with an AI engine for analysis and judgment.
Disclosure of Invention
The technical problem to be solved by the invention is how to accurately locate the reproduction process of a vulnerability and how to retrieve vulnerability reproduction methods without manual work.
The invention solves the technical problems by the following technical means:
an intelligent vulnerability recurrence retrieval method comprises the following steps:
a101, collecting and sorting HTML webpages containing vulnerability EXP;
A102, feeding the vulnerability EXP webpage HTML data sorted in step A101 into a TextCNN framework and performing model learning and analysis to form an AI engine;
A103, crawling vulnerability data (CVE and CNNVD) with the Scrapy framework, and simultaneously issuing a task to a task center;
a104, the AI engine goes to a task service center to acquire a task, loads a webpage file in a file system for analysis, and drops a result into a calculation result pool after the analysis is finished;
a105, the main service obtains a calculation result, carries out vulnerability EXP warehousing, and automatically associates CVE and CNNVD numbers;
and A106, after the associated vulnerability EXP is put in storage, performing manual auxiliary auditing and updating the state of any vulnerability judged to be a trusted EXP; after the state update, the EXP data judged feasible are automatically synchronized to the model through a synchronization center (a synchronization service between the data and the model) and fed back directly into model training.
The invention is based on the machine learning technology, can analyze and extract the original data from different sources and then carry out scoring judgment, thus not only greatly enhancing the accuracy of the data, but also saving the retrieval time. Compared with the traditional statistical analysis and retrieval method, the automatic AI engine-based retrieval method is more accurate and efficient.
Further, the method in step A101 is as follows: various HTML webpages containing vulnerability EXPs are collected, and whether the data meet the following requirements is judged manually: the required fields are the vulnerability number, EXP title, date, software link, version number, software platform, test system, and reproduction steps; other information such as the attack scenario, vulnerability description, and screenshots may also be included.
Furthermore, the sorted vulnerability EXP data are imported into the TextCNN framework for model training and learning, the vulnerability EXP key elements are extracted, and the text data are preprocessed into fixed-length word-vector or character-vector representations; the text data at least comprise: the vulnerability number, EXP title, date, software link, version number, software platform, test system, reproduction steps, attack scenario, vulnerability description, screenshots, and other information; the texts are labeled so that each text corresponds to one label;
the method comprises the following steps:
a1021 model structure definition: the model structure of the textCNN is constructed and comprises an input layer, a convolution layer, a pooling layer, a full connection layer and an output layer. The input layer receives text vectors as input, the convolution layer is used for extracting local features, the pooling layer is used for selecting the most obvious features, the full connection layer is used for learning the representation of the text, and the output layer is used for classifying the text;
a1022 loss function selection: selecting an appropriate loss function for TextCNN, typically in a text classification task, using a cross entropy loss function to measure the difference between model predictions and real labels;
a1023 model optimization: an optimization algorithm (such as a gradient descent method) is adopted to minimize a loss function, and parameters of the model are updated, so that a prediction result of the model is more similar to a real label;
A1024, model training: train the TextCNN with the labeled training data. During training, the model gradually learns the features of the text by iteratively updating its parameters, optimizing its performance;
A1025, model evaluation: evaluate the trained TextCNN with a verification set or test set, computing indexes such as accuracy, precision, and recall to measure the model's performance;
A1026, model tuning: according to the evaluation result, tune the model, possibly adjusting the hyperparameters, model structure, or optimization algorithm to further improve performance;
A1027, model application: once training and evaluation reach a satisfactory effect, the trained TextCNN model can be used in practical applications.
Further, the method in A103 is as follows: CVE and CNNVD basic data are acquired with the Scrapy framework, then cleaned and stored by a Pipeline; the stored data include the fields CVE number, vulnerability name, vulnerability cause, vulnerability level, vulnerability source, release time, vulnerability hazard, vulnerability description, repair measures, reference links, POC, Bugtraq number, affected software, security suggestions, vulnerability provider, CNNVD number, patch, threat type, vulnerability source, and update time; at the same time, a task is issued at the task center;
the method comprises the following steps:
a1031, the engine acquires initial URL of vulnerability basic data from the Spider and adds an initial request into a scheduler;
a1032, the scheduler selects the next request from the request queue according to the set scheduling strategy; the dispatcher sends the selected request to the downloader, and the downloader sends the request to the website server and obtains the response;
a1033, the downloader returns the response to the engine, and the engine gives the response to the Spider for analysis;
a1034, extracting data from the response by the Spider, generating a new request, and adding the new request to the scheduler;
A1035, steps A1032 to A1034 are repeated until all requests in the scheduler are processed; finally, the data are sent to the Pipeline for processing and storage;
and A1036, after the data storage is successful, simultaneously issuing a task in the task center.
Further, the method in A104 is as follows: the AI engine acquires a task from the task center, loads the HTML webpage files acquired through Scrapy from the file system, analyzes and scores them, and selects the best-matching HTML page; after analysis, the analysis data, containing the fields CVE_ID, vulnerability name, EXP link, link source, average score, update time, and state, are synchronized into a result pool;
the method comprises the following steps:
a1041, an AI engine acquires an analysis task from a task center, analyzes a webpage to be analyzed, judges the webpage, and scores and outputs the webpage;
a1042, selecting and setting links according to analysis scenes, and formulating various strategies to score credible EXP;
and A1043, synchronizing the data information to a result pool after the grading judgment of the AI engine.
Corresponding to the method, the invention also provides an intelligent vulnerability recurrence retrieval system, which comprises the following steps:
the data collection module is used for collecting and sorting the HTML webpage containing the vulnerability EXP;
the learning module is used for feeding the sorted vulnerability EXP webpage HTML data into the TextCNN framework and performing model learning and analysis to form an AI engine;
the data crawling module is used for crawling the vulnerability data CVE and CNNVD with the Scrapy framework while issuing a task to the task center;
the analysis module is used for the AI engine to fetch tasks from the task service center, load the webpage files from the file system for analysis, and, after the analysis is finished, place the results into a calculation result pool;
the association module is used for obtaining a calculation result by the main service, carrying out vulnerability EXP warehousing, and automatically associating CVE and CNNVD numbers;
and the updating module is used for carrying out manual auxiliary audit after the associated vulnerability EXP is put in storage, carrying out state updating on the vulnerability judged to be the trusted EXP, and automatically feeding back to the model training through the synchronization center again after updating.
Further, the method in the data collection module is as follows: various HTML webpages containing vulnerability EXPs are collected, and whether the data meet the following requirements is judged manually: the required fields are the vulnerability number, EXP title, date, software link, version number, software platform, test system, and reproduction steps; other information such as the attack scenario, vulnerability description, and screenshots may also be included.
Further, the method in the learning module is as follows: the sorted vulnerability EXP data are imported into the TextCNN framework for model training and learning, the vulnerability EXP key elements are extracted, and the text data are preprocessed into fixed-length word-vector or character-vector representations; the text data at least comprise: the vulnerability number, EXP title, date, software link, version number, software platform, test system, reproduction steps, attack scenario, vulnerability description, screenshots, and other information; the texts are labeled so that each text corresponds to one label;
the method comprises the following steps:
a1021 model structure definition: the model structure of the textCNN is constructed and comprises an input layer, a convolution layer, a pooling layer, a full connection layer and an output layer. The input layer receives text vectors as input, the convolution layer is used for extracting local features, the pooling layer is used for selecting the most obvious features, the full connection layer is used for learning the representation of the text, and the output layer is used for classifying the text;
a1022 loss function selection: selecting an appropriate loss function for TextCNN, typically in a text classification task, using a cross entropy loss function to measure the difference between model predictions and real labels;
a1023 model optimization: an optimization algorithm (such as a gradient descent method) is adopted to minimize a loss function, and parameters of the model are updated, so that a prediction result of the model is more similar to a real label;
A1024, model training: train the TextCNN with the labeled training data. During training, the model gradually learns the features of the text by iteratively updating its parameters, optimizing its performance;
A1025, model evaluation: evaluate the trained TextCNN with a verification set or test set, computing indexes such as accuracy, precision, and recall to measure the model's performance;
A1026, model tuning: according to the evaluation result, tune the model, possibly adjusting the hyperparameters, model structure, or optimization algorithm to further improve performance;
A1027, model application: once training and evaluation reach a satisfactory effect, the trained TextCNN model can be used in practical applications.
Further, the method in the data crawling module is as follows: CVE and CNNVD basic data are acquired with the Scrapy framework, then cleaned and stored by a Pipeline; the stored data include the fields CVE number, vulnerability name, vulnerability cause, vulnerability level, vulnerability source, release time, vulnerability hazard, vulnerability description, repair measures, reference links, POC, Bugtraq number, affected software, security suggestions, vulnerability provider, CNNVD number, patch, threat type, vulnerability source, and update time; at the same time, a task is issued at the task center;
the method comprises the following steps:
a1031, the engine acquires initial URL of vulnerability basic data from the Spider and adds an initial request into a scheduler;
a1032, the scheduler selects the next request from the request queue according to the set scheduling strategy; the dispatcher sends the selected request to the downloader, and the downloader sends the request to the website server and obtains the response;
a1033, the downloader returns the response to the engine, and the engine gives the response to the Spider for analysis;
a1034, extracting data from the response by the Spider, generating a new request, and adding the new request to the scheduler;
A1035, steps A1032 to A1034 are repeated until all requests in the scheduler are processed; finally, the data are sent to the Pipeline for processing and storage;
and A1036, after the data storage is successful, simultaneously issuing a task in the task center.
Further, the method in the analysis module is as follows: the AI engine acquires a task from the task center, loads the HTML webpage files acquired through Scrapy from the file system, analyzes and scores them, and selects the best-matching HTML page; after analysis, the analysis data, containing the fields CVE_ID, vulnerability name, EXP link, link source, average score, update time, and state, are synchronized into a result pool;
the method comprises the following steps:
a1041, an AI engine acquires an analysis task from a task center, analyzes a webpage to be analyzed, judges the webpage, and scores and outputs the webpage;
a1042, selecting and setting links according to analysis scenes, and formulating various strategies to score credible EXP;
and A1043, synchronizing the data information to a result pool after the grading judgment of the AI engine.
The invention has the advantages that:
the invention is based on the machine learning technology, can analyze and extract the original data from different sources and then carry out scoring judgment, thus not only greatly enhancing the accuracy of the data, but also saving the retrieval time. Compared with the traditional statistical analysis and retrieval method, the automatic AI engine-based retrieval method is more accurate and efficient.
In the combined automated/AI-engine vulnerability retrieval method, after the automated retrieval framework is adopted, the needed CVE and CNNVD vulnerabilities are put into ES (Elasticsearch) indexes; once stored, all the vulnerabilities are passed to the AI engine to be analyzed, judged, and scored; every link with a vulnerability is stored and displayed; and finally the results are packaged.
Drawings
Fig. 1 is a flow chart of an intelligent vulnerability retrieval method in an embodiment of the invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions in the embodiments of the present invention will be clearly and completely described in the following in conjunction with the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention. An intelligent vulnerability reproduction retrieval method, as shown in figure 1, comprises the following specific steps:
a101, collecting and sorting HTML webpages containing vulnerability EXP;
a102, throwing the HTML data of the loophole EXP webpage which is arranged in the step A101 into a TEXTCNN framework, and performing model learning analysis to form an AI engine;
A103, crawling vulnerability data (CVE, CNNVD) with the Scrapy framework, and simultaneously issuing a task to a task center (MQ);
A104, the AI engine fetches tasks from the task service center, loads the webpage files from the file system for analysis, and, after the analysis is finished, places the results into a calculation result pool (MQ);
a105, the main service obtains a calculation result, carries out vulnerability EXP warehousing, and automatically associates CVE and CNNVD numbers;
A106, after the associated vulnerability EXP is put in storage, manual auxiliary auditing is performed and the state of any vulnerability judged to be a genuine EXP is updated; after the update, the EXP is fed back directly into model training through the synchronization center to improve accuracy.
The contents of each step are specifically described below.
The method in A101 is as follows:
various HTML webpages containing the vulnerability EXP are collected, and whether the data need to meet the following contents is judged manually: fields that must be included, the number of the vulnerability (CVE, CNNVD number), EXP Title (explet Title), date (data), software Link (Software Link), version number (Version), software Platform (Platform), test system (test on), reproduction step (Steps To Reproduce) may include other information such as Attack scenario (Attack scene), vulnerability Description (Description), screenshot, etc.
The method in A102 is as follows:
The sorted vulnerability EXP data are imported into the TextCNN framework for model training and learning, the vulnerability EXP key elements are extracted, and the text data are preprocessed into fixed-length word-vector or character-vector representations. The text data include, but are not limited to: the vulnerability number, EXP title, date, software link, version number, software platform, test system, reproduction steps, attack scenario, vulnerability description, screenshots, and other information; the texts are labeled so that each text corresponds to one label.
The method can be divided into the following steps:
a1021 model structure definition: the model structure of the textCNN is constructed and comprises an input layer, a convolution layer, a pooling layer, a full connection layer and an output layer. The input layer receives text vectors as input, the convolution layer is used for extracting local features, the pooling layer is used for selecting the most obvious features, the full connection layer is used for learning the representation of the text, and the output layer is used for classifying the text.
A1022 loss function selection: the appropriate loss function is selected for TextCNN, typically in a text classification task, using a cross entropy loss function to measure the difference between model predictions and real labels.
A1023 model optimization: an optimization algorithm (such as a gradient descent method) is adopted to minimize the loss function, and parameters of the model are updated so that a prediction result of the model is closer to a real label.
A1024, model training: train the TextCNN with the labeled training data. During training, the model gradually learns the features of the text by iteratively updating its parameters, optimizing its performance.
A1025, model evaluation: evaluate the trained TextCNN with a verification set or test set, computing indexes such as accuracy, precision, and recall to measure the model's performance.
A1026, model tuning: according to the evaluation result, tune the model, possibly adjusting the hyperparameters, model structure, or optimization algorithm to further improve performance.
A1027, model application: once training and evaluation reach a satisfactory effect, the trained TextCNN model can be used in practical applications.
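The model-construction steps above can be illustrated with a toy forward pass. The sketch below implements, in plain Python, the layer order named in the structure definition (input, convolution, max-pooling, fully connected, output) and the cross-entropy loss used for text classification. The vocabulary, dimensions, and random weights are illustrative assumptions, not the trained AI engine.

```python
import math
import random

# Toy TextCNN forward pass: input layer -> convolution -> max-over-time
# pooling -> fully connected -> softmax output, plus cross-entropy loss.
# All sizes and weights below are made-up illustrative values.
random.seed(0)

VOCAB = {"<pad>": 0, "sql": 1, "injection": 2, "rce": 3, "poc": 4}
SEQ_LEN, EMB, KERNEL, FILTERS, CLASSES = 6, 4, 2, 3, 2

def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

embedding = rand_matrix(len(VOCAB), EMB)
conv_w = [rand_matrix(KERNEL, EMB) for _ in range(FILTERS)]  # one kernel per filter
fc_w = rand_matrix(FILTERS, CLASSES)

def forward(tokens):
    # Input layer: pad/truncate to a fixed length, then look up embeddings.
    ids = [VOCAB.get(t, 0) for t in tokens][:SEQ_LEN]
    ids += [0] * (SEQ_LEN - len(ids))
    x = [embedding[i] for i in ids]
    # Convolution layer: slide each kernel over the sequence (local features);
    # pooling layer: keep the strongest response per filter (max-over-time).
    pooled = []
    for w in conv_w:
        acts = []
        for pos in range(SEQ_LEN - KERNEL + 1):
            s = sum(w[k][e] * x[pos + k][e] for k in range(KERNEL) for e in range(EMB))
            acts.append(max(0.0, s))  # ReLU
        pooled.append(max(acts))
    # Fully connected + output layer: class logits, then softmax.
    logits = [sum(pooled[f] * fc_w[f][c] for f in range(FILTERS)) for c in range(CLASSES)]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    return [e / sum(exps) for e in exps]

def cross_entropy(probs, label):
    # A1022: difference between the model prediction and the true label.
    return -math.log(probs[label])

probs = forward(["sql", "injection", "poc"])
loss = cross_entropy(probs, label=1)
```

Training (A1023/A1024) would then adjust `embedding`, `conv_w`, and `fc_w` by gradient descent to minimize this loss over the labeled corpus.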
The method in A103 is as follows:
CVE and CNNVD basic data are acquired with the Scrapy framework and then cleaned and stored by a Pipeline. The stored data include the fields CVE number, vulnerability name, vulnerability cause, vulnerability level, vulnerability source, release time, vulnerability hazard, vulnerability description, repair measures, reference links, POC (proof of concept), Bugtraq number, affected software, security suggestions, vulnerability provider, CNNVD number, patch, threat type, vulnerability source, and update time. At the same time, a task is issued at the task center.
The method can be divided into the following steps:
a1031, the engine acquires initial URL of vulnerability basic data from the Spider and adds the initial request to the scheduler.
A1032, the scheduler selects the next request from the request queue according to the set scheduling policy. The dispatcher sends the selected request to the downloader, and the downloader sends the request to the website server and obtains the response.
And A1033, the downloader returns the response to the engine, and the engine gives the response to the Spider for analysis.
A1034, spider extracts the data from the response and generates a new request, which is added to the scheduler.
A1035, steps A1032 to A1034 are repeated until all requests in the scheduler are processed. Finally, the data are sent to the Pipeline for processing and storage.
And A1036, after the data storage is successful, simultaneously issuing a task in the task center.
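The crawl loop A1031–A1035 can be sketched with an in-memory scheduler, with canned responses standing in for Scrapy's real downloader and Spider; the URLs and page contents below are made up for illustration.

```python
from collections import deque

# Canned "website": each URL maps to extracted items and discovered links.
PAGES = {
    "https://example.org/cve-list": {
        "items": [], "links": ["https://example.org/CVE-2023-0001"]},
    "https://example.org/CVE-2023-0001": {
        "items": [{"cve": "CVE-2023-0001"}], "links": []},
}

def crawl(start_url):
    # A1031: the engine seeds the scheduler with the initial request.
    scheduler, seen, pipeline = deque([start_url]), {start_url}, []
    while scheduler:
        url = scheduler.popleft()                 # A1032: pick next request
        response = PAGES.get(url, {"items": [], "links": []})  # downloader fetch
        pipeline.extend(response["items"])        # A1033/A1034: spider extracts data
        for link in response["links"]:            # ...and generates new requests
            if link not in seen:
                seen.add(link)
                scheduler.append(link)
    return pipeline  # A1035: finally handed to the Pipeline for storage

items = crawl("https://example.org/cve-list")
```

In the real system A1036 would follow: once the Pipeline stores the items, a task is issued at the task center.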
The method in A104 is as follows:
The AI engine acquires a task from the task center, loads the HTML webpage files acquired through Scrapy from the file system, analyzes and scores them, and selects the best-matching HTML page. After analysis, the analysis data, containing the fields CVE_ID, vulnerability name, EXP link, link source, average score, update time, and state, are synchronized into a result pool (MQ).
The method can be divided into the following steps:
a1041, an AI engine acquires an analysis task from a task center, analyzes a webpage to be analyzed, judges the webpage, and scores and outputs the webpage;
A1042, 5 links are selected according to the analysis scene, and three strategies are formulated for scoring and output, as follows:
1. When one of the 5 link scores is greater than or equal to 0.9, the link is judged to be a trusted EXP;
2. When three of the 5 link scores are greater than or equal to 0.5, the links are judged to be trusted EXPs;
3. When the average score of the 5 links is greater than or equal to 0.6, the links are judged to be trusted EXPs;
a1043, after the grading judgment of the AI engine, synchronizing the data information to a result pool (MQ);
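The three trust strategies of A1042 can be written directly as a predicate over the five link scores. This is a literal transcription of the rules above; treating the strategies as alternatives tried in turn is an assumption.

```python
def is_trusted_exp(scores):
    """Apply the three A1042 strategies to 5 link scores in [0, 1]."""
    assert len(scores) == 5
    if max(scores) >= 0.9:                            # strategy 1
        return True
    if sum(1 for s in scores if s >= 0.5) >= 3:       # strategy 2
        return True
    return sum(scores) / len(scores) >= 0.6           # strategy 3
```

For example, `[0.95, 0.1, 0.1, 0.1, 0.1]` passes via strategy 1, `[0.5, 0.5, 0.5, 0.1, 0.1]` via strategy 2, and `[0.88, 0.88, 0.45, 0.4, 0.4]` only via the average-score rule.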
the method in A105 is as follows:
A main service is constructed, which acquires the calculation results from the result pool (MQ), continuously stores the qualifying data in the database, and associates the vulnerability numbers;
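The main service of A105 can be sketched as a consumer of the result pool; the field names follow the A104 result data, while the queue.Queue stand-in for the MQ and the dict stand-in for the database are illustrative:

```python
import queue

def run_main_service(result_pool, db):
    """Drain the result pool (MQ), store trusted records, link CVE numbers."""
    while True:
        try:
            record = result_pool.get_nowait()    # next A1043 result
        except queue.Empty:
            return db
        if record.get("status") == "trusted":    # only qualifying data is stored
            db[record["cve_id"]] = {             # keyed by vulnerability number
                "exp_link": record["exp_link"],
                "avg_score": record["avg_score"],
            }

pool = queue.Queue()
pool.put({"cve_id": "CVE-2023-0001", "exp_link": "http://e/1",
          "avg_score": 0.8, "status": "trusted"})
pool.put({"cve_id": "CVE-2023-0002", "exp_link": "http://e/2",
          "avg_score": 0.3, "status": "rejected"})
db = run_main_service(pool, {})
```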
the method in A106 is as follows:
after the data is associated and stored, links that match the vulnerability EXP are marked through manual auxiliary auditing; after marking succeeds, a task is sent to the task center, the webpage content is returned to the AI engine for repeated secondary training, and the model is finally consolidated.
Corresponding to the method, the invention also provides an intelligent vulnerability recurrence retrieval system, which comprises:
the data collection module is used for collecting and sorting HTML webpages containing vulnerability EXP;
the learning module is used for feeding the curated vulnerability EXP webpage HTML data into the TextCNN framework and performing model learning analysis to form an AI engine;
the data crawling module is used for crawling the vulnerability data CVE and CNNVD using the Scrapy framework and simultaneously issuing a task to the task center;
the analysis module is used for the AI engine to acquire tasks from the task service center, load the webpage files from the file system for analysis, and place the results into the calculation result pool after analysis;
the association module is used for the main service to obtain the calculation results, store the vulnerability EXP in the database, and automatically associate CVE and CNNVD numbers;
and the updating module is used for performing manual auxiliary audit after the associated vulnerability EXP is stored, updating the state of vulnerabilities judged to be trusted EXP, and, after updating, automatically feeding back to model training through the synchronization center.
Further, the method in the data collection module is as follows: various HTML webpages containing vulnerability EXP are collected, and whether the data meets the following fields is judged manually: vulnerability number, EXP title, date, software link, version number, software platform, test system and reproduction steps; the data may also contain other information such as attack scenario, vulnerability description and screenshots.
Further, the method in the learning module is as follows: the curated vulnerability EXP data is imported into the TextCNN framework for model training and learning, the key elements of the vulnerability EXP are extracted, and the text data is preprocessed into fixed-length word-vector or character-vector representations, wherein the text data at least comprises: vulnerability number, EXP title, date, software link, version number, software platform, test system, reproduction steps, attack scenario, vulnerability description, screenshots and other information; the texts are labeled so that each text corresponds to one label;
the method comprises the following steps:
A1021, model structure definition: the TextCNN model structure is constructed, comprising an input layer, a convolution layer, a pooling layer, a fully connected layer and an output layer. The input layer receives text vectors as input, the convolution layer extracts local features, the pooling layer selects the most salient features, the fully connected layer learns the representation of the text, and the output layer classifies the text;
A1022, loss function selection: an appropriate loss function is selected for TextCNN; in a text classification task, a cross-entropy loss function is typically used to measure the difference between model predictions and true labels;
A1023, model optimization: an optimization algorithm (such as gradient descent) is adopted to minimize the loss function and update the model parameters, so that the model's predictions move closer to the true labels;
A1024, model training: TextCNN is trained with the labeled training data. During training, the model gradually learns the features of the text by iteratively updating its parameters, optimizing model performance;
A1025, model evaluation: the trained TextCNN is evaluated with a validation set or test set, and metrics such as accuracy, precision and recall are computed to measure model performance;
A1026, model tuning: the model is optimized according to the evaluation results; hyperparameters, the model structure or the optimization algorithm may be adjusted to further improve performance;
A1027, model application: after model training and evaluation reach a satisfactory result, the trained TextCNN model can be used in practical applications.
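The layer stack of A1021 (input, convolution, pooling, fully connected and output layers) can be illustrated with a dependency-free forward pass; the vocabulary, layer sizes and randomly initialised parameters are illustrative, and the output probabilities are meaningless until the training of A1024 has been applied:

```python
import math
import random

random.seed(0)

EMB, SEQ, FILTERS, KSIZE, CLASSES = 8, 16, 4, 3, 2
vocab = {"<pad>": 0, "cve": 1, "exp": 2, "poc": 3, "overflow": 4}

# Randomly initialised parameters; training (A1024) is not shown.
embed = [[random.uniform(-1, 1) for _ in range(EMB)] for _ in range(len(vocab))]
conv_w = [[[random.uniform(-1, 1) for _ in range(EMB)] for _ in range(KSIZE)]
          for _ in range(FILTERS)]
fc_w = [[random.uniform(-1, 1) for _ in range(FILTERS)] for _ in range(CLASSES)]

def forward(tokens):
    # Input layer: map tokens to a fixed-length id sequence (pad/truncate).
    ids = [vocab.get(t, 0) for t in tokens][:SEQ]
    ids += [0] * (SEQ - len(ids))
    x = [embed[i] for i in ids]                        # SEQ x EMB matrix
    feats = []
    for f in conv_w:                                   # convolution layer
        col = []
        for i in range(SEQ - KSIZE + 1):               # slide filter over text
            s = sum(f[k][e] * x[i + k][e]
                    for k in range(KSIZE) for e in range(EMB))
            col.append(max(0.0, s))                    # ReLU activation
        feats.append(max(col))                         # max-over-time pooling
    # Fully connected layer + softmax output layer for classification.
    logits = [sum(w[j] * feats[j] for j in range(FILTERS)) for w in fc_w]
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    return [e / sum(exps) for e in exps]

probs = forward(["cve", "exp", "overflow"])
```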
Further, the method in the data crawling module is as follows: the Scrapy framework is used to acquire CVE and CNNVD basic data; after acquisition, the data is cleaned and stored by a Pipeline, wherein the stored data comprises the fields CVE number, vulnerability name, vulnerability cause, vulnerability level, vulnerability source, release time, vulnerability hazard, vulnerability description, remediation measures, reference links, POC, Bugtraq number, affected software, security advisory, vulnerability vendor, CNNVD number, patch, threat type and update time; a task is issued in the task center at the same time;
the method comprises the following steps:
A1031, the engine acquires the initial URL of the vulnerability basic data from the Spider and adds the initial request to the scheduler;
A1032, the scheduler selects the next request from the request queue according to the set scheduling policy; the scheduler sends the selected request to the downloader, and the downloader sends the request to the website server and obtains the response;
A1033, the downloader returns the response to the engine, and the engine hands the response to the Spider for parsing;
A1034, the Spider extracts data from the response, generates new requests, and adds them to the scheduler;
A1035, steps A1032-A1034 are looped until all requests in the scheduler are processed; finally, the data is sent to the Pipeline for processing and storage;
and A1036, after the data is stored successfully, a task is simultaneously issued in the task center.
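The Pipeline cleaning and storage step of the data crawling module can be sketched as a normalisation pass before warehousing; the required fields are a subset of those listed above, while the function name and the cleaning rules are illustrative:

```python
REQUIRED = ("cve_number", "vulnerability_name", "release_time")

def clean_item(raw):
    """Normalise one crawled record before storage (Pipeline stage).

    Strips whitespace, uppercases the CVE number, and rejects records
    missing required fields by returning None (dropped by the Pipeline).
    """
    item = {k: v.strip() if isinstance(v, str) else v for k, v in raw.items()}
    if not all(item.get(f) for f in REQUIRED):
        return None
    item["cve_number"] = item["cve_number"].upper()
    return item

store = {}
for raw in [
    {"cve_number": " cve-2023-1234 ", "vulnerability_name": "demo overflow",
     "release_time": "2023-10-27"},
    {"cve_number": "CVE-2023-9999", "vulnerability_name": "",  # incomplete
     "release_time": "2023-10-27"},
]:
    item = clean_item(raw)
    if item:
        store[item["cve_number"]] = item   # warehousing keyed by CVE number
```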
Further, the method in the analysis module is as follows: the AI engine acquires a task from the task center, loads the HTML webpage file collected by Scrapy from the file system, analyzes and scores it, and selects the best-matching HTML page; after analysis, the result data, containing fields such as CVE_ID, vulnerability name, EXP link, link source, average score, update time and state, is synchronized into the result pool;
the method comprises the following steps:
A1041, the AI engine acquires an analysis task from the task center, parses the webpage to be analyzed, judges it, and outputs a score;
A1042, a set number of links is selected according to the analysis scenario, and multiple strategies are formulated to score trusted EXP;
and A1043, after the AI engine's scoring judgment, the data information is synchronized to the result pool.
The above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (10)
1. An intelligent vulnerability recurrence retrieval method, characterized by comprising the following steps:
A101, collecting and sorting HTML webpages containing vulnerability EXP;
A102, feeding the vulnerability EXP webpage HTML data sorted in step A101 into the TextCNN framework and performing model learning analysis to form an AI engine;
A103, crawling the vulnerability data CVE and CNNVD using the Scrapy framework, and issuing a task to the task center;
A104, the AI engine acquires a task from the task service center, loads the webpage file from the file system for analysis, and places the result into the calculation result pool after analysis;
A105, the main service obtains the calculation result, stores the vulnerability EXP in the database, and automatically associates CVE and CNNVD numbers;
and A106, after the associated vulnerability EXP is stored, manual auxiliary audit is performed, the state of vulnerabilities judged to be trusted EXP is updated, and after the state update, the result is automatically fed back to model training through the synchronization center.
2. The intelligent vulnerability recurrence retrieval method according to claim 1, wherein the method in step A101 is as follows: various HTML webpages containing vulnerability EXP are collected, and whether the data meets the following fields is judged manually: vulnerability number, EXP title, date, software link, version number, software platform, test system and reproduction steps; the data may also contain other information such as attack scenario, vulnerability description and screenshots.
3. The intelligent vulnerability recurrence retrieval method according to claim 1, wherein the method in A102 is as follows: the curated vulnerability EXP data is imported into the TextCNN framework for model training and learning, the key elements of the vulnerability EXP are extracted, and the text data is preprocessed into fixed-length word-vector or character-vector representations, wherein the text data at least comprises: vulnerability number, EXP title, date, software link, version number, software platform, test system, reproduction steps, attack scenario, vulnerability description, screenshots and other information; the texts are labeled so that each text corresponds to one label;
the method comprises the following steps:
A1021, model structure definition: the TextCNN model structure is constructed, comprising an input layer, a convolution layer, a pooling layer, a fully connected layer and an output layer. The input layer receives text vectors as input, the convolution layer extracts local features, the pooling layer selects the most salient features, the fully connected layer learns the representation of the text, and the output layer classifies the text;
A1022, loss function selection: an appropriate loss function is selected for TextCNN; in a text classification task, a cross-entropy loss function is typically used to measure the difference between model predictions and true labels;
A1023, model optimization: an optimization algorithm (such as gradient descent) is adopted to minimize the loss function and update the model parameters, so that the model's predictions move closer to the true labels;
A1024, model training: TextCNN is trained with the labeled training data. During training, the model gradually learns the features of the text by iteratively updating its parameters, optimizing model performance;
A1025, model evaluation: the trained TextCNN is evaluated with a validation set or test set, and metrics such as accuracy, precision and recall are computed to measure model performance;
A1026, model tuning: the model is optimized according to the evaluation results; hyperparameters, the model structure or the optimization algorithm may be adjusted to further improve performance;
A1027, model application: after model training and evaluation reach a satisfactory result, the trained TextCNN model can be used in practical applications.
4. The intelligent vulnerability recurrence retrieval method according to any one of claims 1-3, wherein the method in A103 is as follows: the Scrapy framework is used to acquire CVE and CNNVD basic data; after acquisition, the data is cleaned and stored by a Pipeline, wherein the stored data comprises the fields CVE number, vulnerability name, vulnerability cause, vulnerability level, vulnerability source, release time, vulnerability hazard, vulnerability description, remediation measures, reference links, POC, Bugtraq number, affected software, security advisory, vulnerability vendor, CNNVD number, patch, threat type and update time; a task is issued in the task center at the same time;
the method comprises the following steps:
A1031, the engine acquires the initial URL of the vulnerability basic data from the Spider and adds the initial request to the scheduler;
A1032, the scheduler selects the next request from the request queue according to the set scheduling policy; the scheduler sends the selected request to the downloader, and the downloader sends the request to the website server and obtains the response;
A1033, the downloader returns the response to the engine, and the engine hands the response to the Spider for parsing;
A1034, the Spider extracts data from the response, generates new requests, and adds them to the scheduler;
A1035, steps A1032-A1034 are looped until all requests in the scheduler are processed; finally, the data is sent to the Pipeline for processing and storage;
and A1036, after the data is stored successfully, a task is simultaneously issued in the task center.
5. The intelligent vulnerability recurrence retrieval method according to any one of claims 1-3, wherein the method in A104 is: the AI engine acquires a task from the task center, loads the HTML webpage file collected by Scrapy from the file system, analyzes and scores it, and selects the best-matching HTML page; after analysis, the result data, containing fields such as CVE_ID, vulnerability name, EXP link, link source, average score, update time and state, is synchronized into the result pool;
the method comprises the following steps:
A1041, the AI engine acquires an analysis task from the task center, parses the webpage to be analyzed, judges it, and outputs a score;
A1042, a set number of links is selected according to the analysis scenario, and multiple strategies are formulated to score trusted EXP;
and A1043, after the AI engine's scoring judgment, the data information is synchronized to the result pool.
6. An intelligent vulnerability recurrence retrieval system, comprising:
the data collection module is used for collecting and sorting HTML webpages containing vulnerability EXP;
the learning module is used for feeding the curated vulnerability EXP webpage HTML data into the TextCNN framework and performing model learning analysis to form an AI engine;
the data crawling module is used for crawling the vulnerability data CVE and CNNVD using the Scrapy framework and simultaneously issuing a task to the task center;
the analysis module is used for the AI engine to acquire tasks from the task service center, load the webpage files from the file system for analysis, and place the results into the calculation result pool after analysis;
the association module is used for the main service to obtain the calculation results, store the vulnerability EXP in the database, and automatically associate CVE and CNNVD numbers;
and the updating module is used for performing manual auxiliary audit after the associated vulnerability EXP is stored, updating the state of vulnerabilities judged to be trusted EXP, and, after updating, automatically feeding back to model training through the synchronization center.
7. The intelligent vulnerability recurrence retrieval system of claim 6, wherein the method in the data collection module is: various HTML webpages containing vulnerability EXP are collected, and whether the data meets the following fields is judged manually: vulnerability number, EXP title, date, software link, version number, software platform, test system and reproduction steps; the data may also contain other information such as attack scenario, vulnerability description and screenshots.
8. The intelligent vulnerability recurrence retrieval system of claim 6, wherein the method in the learning module is: the curated vulnerability EXP data is imported into the TextCNN framework for model training and learning, the key elements of the vulnerability EXP are extracted, and the text data is preprocessed into fixed-length word-vector or character-vector representations, wherein the text data at least comprises: vulnerability number, EXP title, date, software link, version number, software platform, test system, reproduction steps, attack scenario, vulnerability description, screenshots and other information; the texts are labeled so that each text corresponds to one label;
the method comprises the following steps:
A1021, model structure definition: the TextCNN model structure is constructed, comprising an input layer, a convolution layer, a pooling layer, a fully connected layer and an output layer. The input layer receives text vectors as input, the convolution layer extracts local features, the pooling layer selects the most salient features, the fully connected layer learns the representation of the text, and the output layer classifies the text;
A1022, loss function selection: an appropriate loss function is selected for TextCNN; in a text classification task, a cross-entropy loss function is typically used to measure the difference between model predictions and true labels;
A1023, model optimization: an optimization algorithm (such as gradient descent) is adopted to minimize the loss function and update the model parameters, so that the model's predictions move closer to the true labels;
A1024, model training: TextCNN is trained with the labeled training data. During training, the model gradually learns the features of the text by iteratively updating its parameters, optimizing model performance;
A1025, model evaluation: the trained TextCNN is evaluated with a validation set or test set, and metrics such as accuracy, precision and recall are computed to measure model performance;
A1026, model tuning: the model is optimized according to the evaluation results; hyperparameters, the model structure or the optimization algorithm may be adjusted to further improve performance;
A1027, model application: after model training and evaluation reach a satisfactory result, the trained TextCNN model can be used in practical applications.
9. The intelligent vulnerability recurrence retrieval system according to any one of claims 6-8, wherein the method in the data crawling module is as follows: the Scrapy framework is used to acquire CVE and CNNVD basic data; after acquisition, the data is cleaned and stored by a Pipeline, wherein the stored data comprises the fields CVE number, vulnerability name, vulnerability cause, vulnerability level, vulnerability source, release time, vulnerability hazard, vulnerability description, remediation measures, reference links, POC, Bugtraq number, affected software, security advisory, vulnerability vendor, CNNVD number, patch, threat type and update time; a task is issued in the task center at the same time;
the method comprises the following steps:
A1031, the engine acquires the initial URL of the vulnerability basic data from the Spider and adds the initial request to the scheduler;
A1032, the scheduler selects the next request from the request queue according to the set scheduling policy; the scheduler sends the selected request to the downloader, and the downloader sends the request to the website server and obtains the response;
A1033, the downloader returns the response to the engine, and the engine hands the response to the Spider for parsing;
A1034, the Spider extracts data from the response, generates new requests, and adds them to the scheduler;
A1035, steps A1032-A1034 are looped until all requests in the scheduler are processed; finally, the data is sent to the Pipeline for processing and storage;
and A1036, after the data is stored successfully, a task is simultaneously issued in the task center.
10. The intelligent vulnerability recurrence retrieval system according to any one of claims 6-8, wherein the method in the analysis module is: the AI engine acquires a task from the task center, loads the HTML webpage file collected by Scrapy from the file system, analyzes and scores it, and selects the best-matching HTML page; after analysis, the result data, containing fields such as CVE_ID, vulnerability name, EXP link, link source, average score, update time and state, is synchronized into the result pool;
the method comprises the following steps:
A1041, the AI engine acquires an analysis task from the task center, parses the webpage to be analyzed, judges it, and outputs a score;
A1042, a set number of links is selected according to the analysis scenario, and multiple strategies are formulated to score trusted EXP;
and A1043, after the AI engine's scoring judgment, the data information is synchronized to the result pool.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311414864.3A CN117494132A (en) | 2023-10-27 | 2023-10-27 | Intelligent vulnerability recurrence retrieval method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311414864.3A CN117494132A (en) | 2023-10-27 | 2023-10-27 | Intelligent vulnerability recurrence retrieval method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117494132A true CN117494132A (en) | 2024-02-02 |
Family
ID=89671848
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311414864.3A Pending CN117494132A (en) | 2023-10-27 | 2023-10-27 | Intelligent vulnerability recurrence retrieval method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117494132A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117744632A (en) * | 2024-02-20 | 2024-03-22 | 深圳融安网络科技有限公司 | Method, device, equipment and medium for constructing vulnerability information keyword extraction model |
CN117744632B (en) * | 2024-02-20 | 2024-05-10 | 深圳融安网络科技有限公司 | Method, device, equipment and medium for constructing vulnerability information keyword extraction model |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN102508859B (en) | Advertisement classification method and device based on webpage characteristic | |
CN103491205B (en) | The method for pushing of a kind of correlated resources address based on video search and device | |
CN107566376A (en) | One kind threatens information generation method, apparatus and system | |
CN101118554A (en) | Intelligent interactive request-answering system and processing method thereof | |
CN103049532A (en) | Method for creating knowledge base engine on basis of sudden event emergency management and method for inquiring knowledge base engine | |
CN108984667A (en) | A kind of public sentiment monitoring system | |
CN101782998A (en) | Intelligent judging method for illegal on-line product information and system | |
CN117494132A (en) | Intelligent vulnerability recurrence retrieval method and system | |
CN105718533A (en) | Information pushing method and device | |
CN101630315B (en) | Quick retrieval method and system | |
US20150294005A1 (en) | Method and device for acquiring information | |
CN102663060A (en) | Method and device for identifying tampered webpage | |
CN116384889A (en) | Intelligent analysis method for information big data based on natural language processing technology | |
CN113297457A (en) | High-precision intelligent information resource pushing system and pushing method | |
CN107741960A (en) | URL sorting technique and device | |
CN110310012B (en) | Data analysis method, device, equipment and computer readable storage medium | |
CN101957860A (en) | Method and device for releasing and searching information | |
CN102902790A (en) | Web page classification system and method | |
CN102902794A (en) | Web page classification system and method | |
US8463799B2 (en) | System and method for consolidating search engine results | |
CN115982429B (en) | Knowledge management method and system based on flow control | |
KR20140056402A (en) | Method, system, and apparatus for targeted searching of multi-sectional documents within an electronic document collection | |
CN103823847A (en) | Keyword extension method and device | |
CN116108955A (en) | Method, device, equipment and storage medium for upgrading and early warning of social contradiction disputes | |
CN102222067A (en) | Searching method for accurately querying information according to IP (Internet Protocol) address of keyword |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||