CN117494132A - Intelligent vulnerability recurrence retrieval method and system - Google Patents

Intelligent vulnerability recurrence retrieval method and system Download PDF

Info

Publication number
CN117494132A
CN117494132A (application number CN202311414864.3A)
Authority
CN
China
Prior art keywords
vulnerability
model
data
exp
task
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311414864.3A
Other languages
Chinese (zh)
Inventor
刘阳
杨晶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Information and Data Security Solutions Co Ltd
Original Assignee
Information and Data Security Solutions Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Information and Data Security Solutions Co Ltd filed Critical Information and Data Security Solutions Co Ltd
Priority to CN202311414864.3A priority Critical patent/CN117494132A/en
Publication of CN117494132A publication Critical patent/CN117494132A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50: Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/57: Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
    • G06F21/577: Assessing vulnerabilities and evaluating computer system security
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G06F16/353: Clustering; Classification into predefined classes
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/903: Querying
    • G06F16/90335: Query processing
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/0464: Convolutional networks [CNN, ConvNet]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/0985: Hyperparameter optimisation; Meta-learning; Learning-to-learn

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Databases & Information Systems (AREA)
  • Computer Hardware Design (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses an intelligent vulnerability recurrence retrieval method and system, comprising the following steps: first, HTML web pages containing vulnerability EXP are collected and sorted; the sorted vulnerability data are fed into a TextCNN framework for model learning and analysis to form an AI engine; vulnerability data are crawled with the Scrapy framework while tasks are issued to a task center; the AI engine fetches a task from the task service center, loads the corresponding web page file from the file system for analysis, and places the analysis result into a calculation result pool; the main service obtains the calculation result, stores the vulnerability EXP in the database, and automatically associates the CVE and CNNVD numbers; after the associated vulnerability EXP is stored, manual auxiliary auditing is performed, and the state of vulnerabilities judged to be trusted EXP is updated and they are used for model training. Based on machine learning technology, the invention can parse and extract raw data from different sources and then score and judge it, which not only greatly improves the accuracy of the data but also saves retrieval time.

Description

Intelligent vulnerability recurrence retrieval method and system
Technical Field
The invention relates to the technical field of automated vulnerability retrieval, and in particular to an AI engine analysis, judgment and self-learning method for retrieving vulnerability reproduction methods.
Background
The frameworks and storage modes in existing vulnerability management systems basically update, sort and store vulnerabilities in real time in chronological order, and vulnerability information is judged manually, which is inefficient; in addition, some entries in the vulnerability information base give no specific reproduction scheme or repair scheme, so the ability to prevent security threats is weak and the timeliness of repairing and preventing vulnerabilities is also relatively poor.
Because manually searching for a vulnerability reproduction method is too cumbersome, a suitable reproduction method often cannot be found, a large amount of time is consumed, and efficiency is low. To solve this problem effectively, an automated crawling framework is introduced for vulnerability retrieval, combined with an AI engine for analysis and judgment.
Disclosure of Invention
The technical problem to be solved by the invention is how to accurately locate the process by which a vulnerability occurs and how to retrieve vulnerability reproduction methods without manual work, which existing schemes do not address.
The invention solves the technical problems by the following technical means:
an intelligent vulnerability recurrence retrieval method comprises the following steps:
a101, collecting and sorting HTML webpages containing vulnerability EXP;
a102, feeding the vulnerability EXP web page HTML data sorted in step A101 into a TextCNN framework, and performing model learning analysis to form an AI engine;
a103, crawling of vulnerability data CVE and CNNVD is carried out by utilizing the Scrapy framework, and a task is issued to a task center;
a104, the AI engine goes to a task service center to acquire a task, loads a webpage file in a file system for analysis, and drops a result into a calculation result pool after the analysis is finished;
a105, the main service obtains a calculation result, carries out vulnerability EXP warehousing, and automatically associates CVE and CNNVD numbers;
and A106, after the associated vulnerability EXP is put in storage, performing manual auxiliary auditing and updating the state of vulnerabilities judged to be trusted EXP; after the state update, the EXP data judged to be valid are automatically synchronized to the model through a synchronization center (a synchronization service between the data and the model) and fed directly back into model training.
The invention is based on machine learning technology and can parse and extract raw data from different sources and then score and judge it, which not only greatly improves the accuracy of the data but also saves retrieval time. Compared with traditional statistical analysis and retrieval methods, the automated AI-engine-based retrieval is more accurate and efficient.
Further, the method in step a101 is as follows: various HTML web pages containing vulnerability EXP are collected, and whether the data contain the following mandatory fields is judged manually: the vulnerability number, EXP title, date, software link, version number, software platform, test system and reproduction steps; other information such as the attack scenario, vulnerability description and screenshots may also be included.
Furthermore, the sorted vulnerability EXP data are imported into the TextCNN framework, model training and learning are carried out, the key elements of the vulnerability EXP are extracted, and the text data are preprocessed into fixed-length word vector or character vector representations, the text data at least comprising: the vulnerability number, EXP title, date, software link, version number, software platform, test system, reproduction steps, attack scenario, vulnerability description, screenshots and other information; the texts are labeled so that each text corresponds to one label;
the method comprises the following steps:
a1021 model structure definition: the model structure of the textCNN is constructed and comprises an input layer, a convolution layer, a pooling layer, a full connection layer and an output layer. The input layer receives text vectors as input, the convolution layer is used for extracting local features, the pooling layer is used for selecting the most obvious features, the full connection layer is used for learning the representation of the text, and the output layer is used for classifying the text;
a1022 loss function selection: selecting an appropriate loss function for TextCNN, typically in a text classification task, using a cross entropy loss function to measure the difference between model predictions and real labels;
a1023 model optimization: an optimization algorithm (such as a gradient descent method) is adopted to minimize a loss function, and parameters of the model are updated, so that a prediction result of the model is more similar to a real label;
a1024 model training: and training the textCNN by using the labeled training data. In the training process, the model gradually learns the characteristics of the text by continuously and iteratively updating parameters, and optimizes the performance of the model;
a1025 model evaluation: evaluating the trained TextCNN by using a verification set or a test set, and calculating indexes such as the accuracy, precision and recall of the model to measure its performance;
a1026 model tuning: according to the evaluation result, the model is tuned; the hyperparameters, the model structure or the optimization algorithm may be adjusted to further improve the performance of the model;
a1027 model application: after the model training and evaluation reach a satisfactory effect, the trained TextCNN model can be used for practical applications.
Further, the method in a103 is as follows: using a Scrapy framework to acquire CVE and CNNVD basic data, cleaning and warehousing the data by using a Pipeline after acquiring, wherein the warehousing data comprises a field CVE number, a vulnerability name, a vulnerability cause, a vulnerability level, a vulnerability source release time, a vulnerability hazard, a vulnerability description, a repairing measure, a reference link, POC, bugtraq number, affected software, a security suggestion, a vulnerability provider, CNNVD number, a patch, a threat type, a vulnerability source and an updating time, and simultaneously issuing a task in a task center;
the method comprises the following steps:
a1031, the engine acquires initial URL of vulnerability basic data from the Spider and adds an initial request into a scheduler;
a1032, the scheduler selects the next request from the request queue according to the set scheduling strategy; the dispatcher sends the selected request to the downloader, and the downloader sends the request to the website server and obtains the response;
a1033, the downloader returns the response to the engine, and the engine gives the response to the Spider for analysis;
a1034, extracting data from the response by the Spider, generating a new request, and adding the new request to the scheduler;
a1035, steps A1032 to A1034 are repeated until all requests in the scheduler are processed; finally, the data are sent to the Pipeline for processing and storage;
and A1036, after the data storage is successful, simultaneously issuing a task in the task center.
Further, the method in a104 is as follows: the AI engine acquires a task from the task center, loads the HTML web page file acquired through Scrapy from the file system, analyzes and scores it, and selects the best-matching HTML page; after analysis, the analysis data, containing the fields CVE_ID, vulnerability name, exp link, link source, average score, update time and state, are synchronized into the result pool;
the method comprises the following steps:
a1041, an AI engine acquires an analysis task from a task center, analyzes a webpage to be analyzed, judges the webpage, and scores and outputs the webpage;
a1042, selecting a set number of links according to the analysis scenario, and formulating several strategies to judge trusted EXP by score;
and A1043, synchronizing the data information to a result pool after the grading judgment of the AI engine.
Corresponding to the method, the invention also provides an intelligent vulnerability recurrence retrieval system, which comprises the following modules:
the data collection module is used for collecting and sorting the HTML webpage containing the vulnerability EXP;
the learning module is used for feeding the sorted vulnerability EXP web page HTML data into the TextCNN framework and performing model learning analysis to form an AI engine;
the data crawling module is used for crawling the vulnerability data CVE and CNNVD by utilizing the Scrapy framework and simultaneously issuing a task to the task center;
the analysis module is used for the AI engine to go to a task service center to acquire tasks, load the webpage files in the file system for analysis, and after the analysis is finished, throw the results into a calculation result pool;
the association module is used for obtaining a calculation result by the main service, carrying out vulnerability EXP warehousing, and automatically associating CVE and CNNVD numbers;
and the updating module is used for carrying out manual auxiliary audit after the associated vulnerability EXP is put in storage, carrying out state updating on the vulnerability judged to be the trusted EXP, and automatically feeding back to the model training through the synchronization center again after updating.
Further, the method in the data collection module is as follows: various HTML web pages containing vulnerability EXP are collected, and whether the data contain the following mandatory fields is judged manually: the vulnerability number, EXP title, date, software link, version number, software platform, test system and reproduction steps; other information such as the attack scenario, vulnerability description and screenshots may also be included.
Further, the method in the learning module is as follows: importing the sorted vulnerability EXP data into the TextCNN framework, performing model training and learning, extracting the key elements of the vulnerability EXP, and preprocessing the text data into fixed-length word vector or character vector representations, wherein the text data at least comprise: the vulnerability number, EXP title, date, software link, version number, software platform, test system, reproduction steps, attack scenario, vulnerability description, screenshots and other information; the texts are labeled so that each text corresponds to one label;
the method comprises the following steps:
a1021 model structure definition: the model structure of the textCNN is constructed and comprises an input layer, a convolution layer, a pooling layer, a full connection layer and an output layer. The input layer receives text vectors as input, the convolution layer is used for extracting local features, the pooling layer is used for selecting the most obvious features, the full connection layer is used for learning the representation of the text, and the output layer is used for classifying the text;
a1022 loss function selection: selecting an appropriate loss function for TextCNN, typically in a text classification task, using a cross entropy loss function to measure the difference between model predictions and real labels;
a1023 model optimization: an optimization algorithm (such as a gradient descent method) is adopted to minimize a loss function, and parameters of the model are updated, so that a prediction result of the model is more similar to a real label;
a1024 model training: and training the textCNN by using the labeled training data. In the training process, the model gradually learns the characteristics of the text by continuously and iteratively updating parameters, and optimizes the performance of the model;
a1025 model evaluation: evaluating the trained TextCNN by using a verification set or a test set, and calculating indexes such as the accuracy, precision and recall of the model to measure its performance;
a1026 model tuning: according to the evaluation result, the model is tuned; the hyperparameters, the model structure or the optimization algorithm may be adjusted to further improve the performance of the model;
a1027 model application: after the model training and evaluation reach a satisfactory effect, the trained TextCNN model can be used for practical applications.
Further, the method in the data crawling module is as follows: using a Scrapy framework to acquire CVE and CNNVD basic data, cleaning and warehousing the data by using a Pipeline after acquiring, wherein the warehousing data comprises a field CVE number, a vulnerability name, a vulnerability cause, a vulnerability level, a vulnerability source release time, a vulnerability hazard, a vulnerability description, a repairing measure, a reference link, POC, bugtraq number, affected software, a security suggestion, a vulnerability provider, CNNVD number, a patch, a threat type, a vulnerability source and an updating time, and simultaneously issuing a task in a task center;
the method comprises the following steps:
a1031, the engine acquires initial URL of vulnerability basic data from the Spider and adds an initial request into a scheduler;
a1032, the scheduler selects the next request from the request queue according to the set scheduling strategy; the dispatcher sends the selected request to the downloader, and the downloader sends the request to the website server and obtains the response;
a1033, the downloader returns the response to the engine, and the engine gives the response to the Spider for analysis;
a1034, extracting data from the response by the Spider, generating a new request, and adding the new request to the scheduler;
a1035, steps A1032 to A1034 are repeated until all requests in the scheduler are processed; finally, the data are sent to the Pipeline for processing and storage;
and A1036, after the data storage is successful, simultaneously issuing a task in the task center.
Further, the method in the analysis module is as follows: the AI engine acquires a task from the task center, loads the HTML web page file acquired through Scrapy from the file system, analyzes and scores it, and selects the best-matching HTML page; after analysis, the analysis data, containing the fields CVE_ID, vulnerability name, exp link, link source, average score, update time and state, are synchronized into the result pool;
the method comprises the following steps:
a1041, an AI engine acquires an analysis task from a task center, analyzes a webpage to be analyzed, judges the webpage, and scores and outputs the webpage;
a1042, selecting a set number of links according to the analysis scenario, and formulating several strategies to judge trusted EXP by score;
and A1043, synchronizing the data information to a result pool after the grading judgment of the AI engine.
The invention has the advantages that:
the invention is based on the machine learning technology, can analyze and extract the original data from different sources and then carry out scoring judgment, thus not only greatly enhancing the accuracy of the data, but also saving the retrieval time. Compared with the traditional statistical analysis and retrieval method, the automatic AI engine-based retrieval method is more accurate and efficient.
In the vulnerability retrieval method combining automation with the AI engine, after the automated retrieval framework is adopted, the required CVE and CNNVD vulnerabilities are stored in ES (ElasticSearch) indexes; once stored, all vulnerabilities are passed to the AI engine to be analyzed, judged and scored, all links associated with a vulnerability are stored and displayed, and the results are finally packaged.
Drawings
Fig. 1 is a flow chart of an intelligent vulnerability retrieval method in an embodiment of the invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions in the embodiments of the present invention will be clearly and completely described in the following in conjunction with the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention. An intelligent vulnerability reproduction retrieval method, as shown in figure 1, comprises the following specific steps:
a101, collecting and sorting HTML webpages containing vulnerability EXP;
a102, feeding the vulnerability EXP web page HTML data sorted in step A101 into a TextCNN framework, and performing model learning analysis to form an AI engine;
a103, utilizing the Scrapy framework to perform vulnerability (CVE, CNNVD) data crawling, and simultaneously issuing a task to a task center (MQ);
a104, the AI engine goes to a task service center to acquire tasks, loads webpage files in a file system for analysis, and places the results into a calculation result pool (MQ) after the analysis is finished;
a105, the main service obtains a calculation result, carries out vulnerability EXP warehousing, and automatically associates CVE and CNNVD numbers;
the contents of each step are specifically described below:
and A106, after the associated vulnerability EXP is put in storage, performing manual auxiliary audit, and performing state update on the vulnerability judged to be the true EXP, and after the vulnerability is updated, directly feeding back to the model training through the synchronization center again. To improve accuracy; the method in A101 is as follows:
various HTML webpages containing the vulnerability EXP are collected, and whether the data need to meet the following contents is judged manually: fields that must be included, the number of the vulnerability (CVE, CNNVD number), EXP Title (explet Title), date (data), software Link (Software Link), version number (Version), software Platform (Platform), test system (test on), reproduction step (Steps To Reproduce) may include other information such as Attack scenario (Attack scene), vulnerability Description (Description), screenshot, etc.
The method in A102 is as follows:
Importing the sorted vulnerability EXP data into the TextCNN framework, performing model training and learning, extracting the key elements of the vulnerability EXP, and preprocessing the text data into fixed-length word vector or character vector representations, the text data including but not limited to the following contents: the vulnerability number, EXP title, date, software link, version number, software platform, test system, reproduction steps, attack scenario, vulnerability description, screenshots and other information; the texts are labeled so that each text corresponds to one label. A non-limiting code sketch of this model is given after the steps below.
The method can be divided into the following steps:
a1021 model structure definition: the model structure of the textCNN is constructed and comprises an input layer, a convolution layer, a pooling layer, a full connection layer and an output layer. The input layer receives text vectors as input, the convolution layer is used for extracting local features, the pooling layer is used for selecting the most obvious features, the full connection layer is used for learning the representation of the text, and the output layer is used for classifying the text.
A1022 loss function selection: the appropriate loss function is selected for TextCNN, typically in a text classification task, using a cross entropy loss function to measure the difference between model predictions and real labels.
A1023 model optimization: an optimization algorithm (such as a gradient descent method) is adopted to minimize the loss function, and parameters of the model are updated so that a prediction result of the model is closer to a real label.
A1024 model training: and training the textCNN by using the labeled training data. In the training process, the model gradually learns the characteristics of the text by continuously and iteratively updating parameters, and optimizes the performance of the model.
A1025 model evaluation: evaluating the trained TextCNN by using a verification set or a test set, and calculating indexes such as the accuracy, precision and recall of the model to measure its performance.
A1026 model tuning: according to the evaluation result, the model is tuned; the hyperparameters, the model structure or the optimization algorithm may be adjusted to further improve the performance of the model.
A1027 model application: after the model training and evaluation reach a satisfactory effect, the trained TextCNN model can be used for practical applications.
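A minimal, non-limiting sketch of the model structure and training loop described in steps A1021 to A1024 is given below, written with PyTorch; the layer sizes, kernel widths, class count and learning rate are illustrative assumptions, not values fixed by the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    """Input (embedding) -> convolution -> max pooling -> fully connected -> output."""
    def __init__(self, vocab_size, embed_dim=128, num_classes=2,
                 kernel_sizes=(3, 4, 5), num_filters=100, dropout=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)            # input layer
        self.convs = nn.ModuleList(                                      # convolution layer
            [nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes])
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)  # FC + output

    def forward(self, token_ids):                # token_ids: (batch, seq_len) word indices
        x = self.embedding(token_ids).transpose(1, 2)   # (batch, embed_dim, seq_len)
        feats = []
        for conv in self.convs:
            h = F.relu(conv(x))                          # local features
            feats.append(F.max_pool1d(h, h.size(2)).squeeze(2))  # most salient feature per filter
        return self.fc(self.dropout(torch.cat(feats, dim=1)))   # class scores

def train_textcnn(model, loader, epochs=5, lr=0.01):
    """Cross-entropy loss minimised by gradient descent, as in A1022 to A1024."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for tokens, labels in loader:    # tokens: LongTensor of ids, labels: LongTensor of classes
            optimizer.zero_grad()
            loss = criterion(model(tokens), labels)
            loss.backward()
            optimizer.step()
```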
The method in A103 is as follows:
Performing CVE and CNNVD basic data acquisition by utilizing the Scrapy framework, and cleaning and warehousing the acquired data by utilizing a Pipeline, wherein the warehoused data comprise the fields CVE number, vulnerability name, vulnerability cause, vulnerability level, release time, vulnerability hazard, vulnerability description, repair measures, reference links, POC (proof of concept), Bugtraq number, affected software, security suggestion, vulnerability provider, CNNVD number, patch, threat type, vulnerability source and update time. At the same time, a task is issued at the task center. A non-limiting spider and pipeline sketch is given after the steps below.
The method can be divided into the following steps:
a1031, the engine acquires initial URL of vulnerability basic data from the Spider and adds the initial request to the scheduler.
A1032, the scheduler selects the next request from the request queue according to the set scheduling policy. The dispatcher sends the selected request to the downloader, and the downloader sends the request to the website server and obtains the response.
And A1033, the downloader returns the response to the engine, and the engine gives the response to the Spider for analysis.
A1034, spider extracts the data from the response and generates a new request, which is added to the scheduler.
A1035, steps A1032 to A1034 are repeated until all requests in the scheduler are processed; finally, the data are sent to the Pipeline for processing and storage.
And A1036, after the data storage is successful, simultaneously issuing a task in the task center.
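The following sketch shows how the crawl of steps A1031 to A1036 could look with Scrapy: a Spider that follows detail pages and yields items, and a Pipeline that cleans them before storage and task publication. The start URL, the CSS selectors and the storage/task-center clients are assumptions for illustration, not endpoints or APIs named by the patent.

```python
import scrapy

class CvePageSpider(scrapy.Spider):
    """Crawl a vulnerability index page and yield one item per detail page."""
    name = "cve_pages"
    # Placeholder start URL; the real CVE / CNNVD sources are not fixed here.
    start_urls = ["https://example.org/vulnerability/index.html"]

    def parse(self, response):
        for row in response.css("table.vuln-list tr"):        # selector is assumed
            detail_url = row.css("a::attr(href)").get()
            if detail_url:
                yield response.follow(detail_url, callback=self.parse_detail)

    def parse_detail(self, response):
        yield {
            "cve_id": response.css("#cve::text").get(),        # selectors are assumed
            "name": response.css("h1::text").get(),
            "description": response.css("#desc::text").get(),
            "source_url": response.url,
        }

class CleanAndStorePipeline:
    """Clean each item, store it, then publish an analysis task (enable via ITEM_PIPELINES)."""
    def process_item(self, item, spider):
        item = {k: (v.strip() if isinstance(v, str) else v) for k, v in item.items()}
        # Storage backend and task-center client are placeholders; any database and
        # message queue (e.g. the MQ-based task center of the patent) could be used here:
        # db.insert("vulnerabilities", item)
        # task_center.publish({"cve_id": item["cve_id"], "url": item["source_url"]})
        return item
```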
The method in A104 is as follows:
The AI engine acquires the task from the task center, loads the HTML web page file acquired through Scrapy from the file system, analyzes and scores it, and selects the best-matching HTML page. After analysis, the analysis data, containing the fields CVE_ID, vulnerability name, exp link, link source, average score, update time and state, are synchronized into the result pool (MQ).
The method can be divided into the following steps:
a1041, an AI engine acquires an analysis task from a task center, analyzes a webpage to be analyzed, judges the webpage, and scores and outputs the webpage;
a1042, selecting 5 links according to the analysis scenario and formulating three strategies to score and judge them (the rules are also shown in the code sketch after step A1043 below); the three strategies are as follows:
1. when one of the 5 link scores is greater than or equal to 0.9, the corresponding link is judged to be trusted EXP;
2. when three of the 5 link scores are greater than or equal to 0.5, the links are judged to be trusted EXP;
3. when the average score of the 5 links is greater than or equal to 0.6, the links are judged to be trusted EXP;
a1043, after the grading judgment of the AI engine, synchronizing the data information to a result pool (MQ);
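One reading of the three judging strategies of step A1042 is sketched below as a single trust decision over the five candidate links; only the thresholds (0.9, three scores of at least 0.5, an average of at least 0.6) come from the description, while the function name and the OR-combination of the strategies are assumptions.

```python
def is_trusted_exp(scores):
    """Apply the three A1042 strategies to the 5 candidate-link scores (floats in [0, 1])."""
    if len(scores) != 5:
        raise ValueError("expected exactly 5 link scores")
    strategy_1 = any(s >= 0.9 for s in scores)             # one score >= 0.9
    strategy_2 = sum(1 for s in scores if s >= 0.5) >= 3   # at least three scores >= 0.5
    strategy_3 = sum(scores) / len(scores) >= 0.6          # average score >= 0.6
    return strategy_1 or strategy_2 or strategy_3

# Example: strategies 2 and 3 both fire here, so the candidate is judged trusted.
print(is_trusted_exp([0.7, 0.6, 0.4, 0.8, 0.55]))  # True
```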
the method in A105 is as follows:
constructing a main service, acquiring the calculation result from the result pool (MQ), storing the data that meet the judgment criteria in the database, and associating the vulnerability numbers;
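A non-limiting sketch of this main service is given below; the result-pool message layout follows the fields listed in A104, while the `db` interface and its methods are placeholder assumptions rather than components defined by the patent.

```python
import json

def handle_result(message: bytes, db) -> None:
    """Consume one AI-engine result from the result pool (MQ), store the EXP and
    associate it with its CVE / CNNVD numbers. `db` is an assumed storage client."""
    result = json.loads(message)
    record = {
        "cve_id": result["cve_id"],
        "name": result["name"],
        "exp_link": result["exp_link"],
        "link_source": result["link_source"],
        "avg_score": result["avg_score"],
        "updated_at": result["updated_at"],
        "state": result["state"],
    }
    db.insert("vulnerability_exp", record)            # EXP warehousing
    # Automatically associate the stored EXP with the previously crawled CVE / CNNVD entry.
    db.update("vulnerabilities",
              where={"cve_id": record["cve_id"]},
              values={"exp_link": record["exp_link"]})
```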
the method in A106 is as follows:
after the data are associated and stored, the links that constitute valid vulnerability EXP are marked through manual auxiliary auditing; after the marking succeeds, a task is sent to the task center, the web page content is returned to the AI engine for further rounds of training, and the model is thereby consolidated.
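The manual-audit feedback loop of A106 could be wired up as in the sketch below; the `db` and `task_center` clients, and the `raw_html_path` field, are assumed interfaces standing in for the storage, synchronization center and task center, not APIs named by the patent.

```python
def on_manual_audit(exp_id: str, approved: bool, db, task_center) -> None:
    """Update the state of an audited EXP and, if trusted, feed the page back for retraining."""
    if not approved:
        return
    # State update: mark the EXP record as trusted / manually verified.
    db.update("vulnerability_exp", where={"id": exp_id}, values={"state": "trusted"})
    # Re-publish a training task so the web page content is returned to the
    # TextCNN model for another round of training.
    record = db.get("vulnerability_exp", exp_id)
    task_center.publish({"type": "retrain", "exp_id": exp_id,
                         "html_path": record["raw_html_path"]})
```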
Corresponding to the method, the invention also provides an intelligent vulnerability recurrence retrieval system, which comprises the following modules:
the data collection module is used for collecting and sorting the HTML webpage containing the vulnerability EXP;
the learning module is used for feeding the sorted vulnerability EXP web page HTML data into the TextCNN framework and performing model learning analysis to form an AI engine;
the data crawling module is used for crawling the vulnerability data CVE and CNNVD by utilizing the Scrapy framework and simultaneously issuing a task to the task center;
the analysis module is used for the AI engine to go to a task service center to acquire tasks, load the webpage files in the file system for analysis, and after the analysis is finished, throw the results into a calculation result pool;
the association module is used for obtaining a calculation result by the main service, carrying out vulnerability EXP warehousing, and automatically associating CVE and CNNVD numbers;
and the updating module is used for carrying out manual auxiliary audit after the associated vulnerability EXP is put in storage, carrying out state updating on the vulnerability judged to be the trusted EXP, and automatically feeding back to the model training through the synchronization center again after updating.
Further, the method in the data collection module is as follows: various HTML web pages containing vulnerability EXP are collected, and whether the data contain the following mandatory fields is judged manually: the vulnerability number, EXP title, date, software link, version number, software platform, test system and reproduction steps; other information such as the attack scenario, vulnerability description and screenshots may also be included.
Further, the method in the learning module is as follows: importing the sorted vulnerability EXP data into the TextCNN framework, performing model training and learning, extracting the key elements of the vulnerability EXP, and preprocessing the text data into fixed-length word vector or character vector representations, wherein the text data at least comprise: the vulnerability number, EXP title, date, software link, version number, software platform, test system, reproduction steps, attack scenario, vulnerability description, screenshots and other information; the texts are labeled so that each text corresponds to one label;
the method comprises the following steps:
a1021 model structure definition: the model structure of the textCNN is constructed and comprises an input layer, a convolution layer, a pooling layer, a full connection layer and an output layer. The input layer receives text vectors as input, the convolution layer is used for extracting local features, the pooling layer is used for selecting the most obvious features, the full connection layer is used for learning the representation of the text, and the output layer is used for classifying the text;
a1022 loss function selection: selecting an appropriate loss function for TextCNN, typically in a text classification task, using a cross entropy loss function to measure the difference between model predictions and real labels;
a1023 model optimization: an optimization algorithm (such as a gradient descent method) is adopted to minimize a loss function, and parameters of the model are updated, so that a prediction result of the model is more similar to a real label;
a1024 model training: and training the textCNN by using the labeled training data. In the training process, the model gradually learns the characteristics of the text by continuously and iteratively updating parameters, and optimizes the performance of the model;
a1025 model evaluation: evaluating the trained TextCNN by using a verification set or a test set, and calculating indexes such as the accuracy, precision and recall of the model to measure its performance;
a1026 model tuning: according to the evaluation result, the model is tuned; the hyperparameters, the model structure or the optimization algorithm may be adjusted to further improve the performance of the model;
a1027 model application: after the model training and evaluation reach a satisfactory effect, the trained TextCNN model can be used for practical applications.
Further, the method in the data crawling module is as follows: using a Scrapy framework to acquire CVE and CNNVD basic data, cleaning and warehousing the data by using a Pipeline after acquiring, wherein the warehousing data comprises a field CVE number, a vulnerability name, a vulnerability cause, a vulnerability level, a vulnerability source release time, a vulnerability hazard, a vulnerability description, a repairing measure, a reference link, POC, bugtraq number, affected software, a security suggestion, a vulnerability provider, CNNVD number, a patch, a threat type, a vulnerability source and an updating time, and simultaneously issuing a task in a task center;
the method comprises the following steps:
a1031, the engine acquires initial URL of vulnerability basic data from the Spider and adds an initial request into a scheduler;
a1032, the scheduler selects the next request from the request queue according to the set scheduling strategy; the dispatcher sends the selected request to the downloader, and the downloader sends the request to the website server and obtains the response;
a1033, the downloader returns the response to the engine, and the engine gives the response to the Spider for analysis;
a1034, extracting data from the response by the Spider, generating a new request, and adding the new request to the scheduler;
a1035, steps A1032 to A1034 are repeated until all requests in the scheduler are processed; finally, the data are sent to the Pipeline for processing and storage;
and A1036, after the data storage is successful, simultaneously issuing a task in the task center.
Further, the method in the analysis module is as follows: the AI engine acquires a task from the task center, loads the HTML web page file acquired through Scrapy from the file system, analyzes and scores it, and selects the best-matching HTML page; after analysis, the analysis data, containing the fields CVE_ID, vulnerability name, exp link, link source, average score, update time and state, are synchronized into the result pool;
the method comprises the following steps:
a1041, an AI engine acquires an analysis task from a task center, analyzes a webpage to be analyzed, judges the webpage, and scores and outputs the webpage;
a1042, selecting a set number of links according to the analysis scenario, and formulating several strategies to judge trusted EXP by score;
and A1043, synchronizing the data information to a result pool after the grading judgment of the AI engine.
The above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. An intelligent vulnerability reproduction retrieval method is characterized by comprising the following steps:
a101, collecting and sorting HTML webpages containing vulnerability EXP;
a102, feeding the vulnerability EXP web page HTML data sorted in step A101 into a TextCNN framework, and performing model learning analysis to form an AI engine;
a103, crawling of vulnerability data CVE and CNNVD is carried out by utilizing the Scrapy framework, and a task is issued to a task center;
a104, the AI engine goes to a task service center to acquire a task, loads a webpage file in a file system for analysis, and drops a result into a calculation result pool after the analysis is finished;
a105, the main service obtains a calculation result, carries out vulnerability EXP warehousing, and automatically associates CVE and CNNVD numbers;
and A106, after the associated vulnerability EXP is put in storage, performing manual auxiliary audit, and performing state update on the vulnerability judged to be the trusted EXP, and after the state update, automatically feeding back to the model training through the synchronization center again.
2. The method for retrieving intelligent vulnerability discovery according to claim 1, wherein the method in step a101 is as follows: various HTML web pages containing vulnerability EXP are collected, and whether the data contain the following mandatory fields is judged manually: the vulnerability number, EXP title, date, software link, version number, software platform, test system and reproduction steps; other information such as the attack scenario, vulnerability description and screenshots may also be included.
3. The method for retrieving intelligent vulnerability discovery according to claim 1, wherein the method in a102 is as follows: importing the sorted vulnerability EXP data into the TextCNN framework, performing model training and learning, extracting the key elements of the vulnerability EXP, and preprocessing the text data into fixed-length word vector or character vector representations, wherein the text data at least comprise: the vulnerability number, EXP title, date, software link, version number, software platform, test system, reproduction steps, attack scenario, vulnerability description, screenshots and other information; the texts are labeled so that each text corresponds to one label;
the method comprises the following steps:
a1021 model structure definition: the model structure of the textCNN is constructed and comprises an input layer, a convolution layer, a pooling layer, a full connection layer and an output layer. The input layer receives text vectors as input, the convolution layer is used for extracting local features, the pooling layer is used for selecting the most obvious features, the full connection layer is used for learning the representation of the text, and the output layer is used for classifying the text;
a1022 loss function selection: selecting an appropriate loss function for TextCNN, typically in a text classification task, using a cross entropy loss function to measure the difference between model predictions and real labels;
a1023 model optimization: an optimization algorithm (such as a gradient descent method) is adopted to minimize a loss function, and parameters of the model are updated, so that a prediction result of the model is more similar to a real label;
a1024 model training: and training the textCNN by using the labeled training data. In the training process, the model gradually learns the characteristics of the text by continuously and iteratively updating parameters, and optimizes the performance of the model;
a1025 model evaluation: evaluating the trained TextCNN by using a verification set or a test set, and calculating indexes such as the accuracy, precision and recall of the model to measure its performance;
a1026 model tuning: according to the evaluation result, the model is tuned; the hyperparameters, the model structure or the optimization algorithm may be adjusted to further improve the performance of the model;
a1027 model application: after the model training and evaluation reach a satisfactory effect, the trained TextCNN model can be used for practical applications.
4. A method for retrieving an intelligent vulnerability discovery according to any one of claims 1-3, wherein the method in a103 is as follows: using a Scrapy framework to acquire CVE and CNNVD basic data, cleaning and warehousing the data by using a Pipeline after acquiring, wherein the warehousing data comprises a field CVE number, a vulnerability name, a vulnerability cause, a vulnerability level, a vulnerability source release time, a vulnerability hazard, a vulnerability description, a repairing measure, a reference link, POC, bugtraq number, affected software, a security suggestion, a vulnerability provider, CNNVD number, a patch, a threat type, a vulnerability source and an updating time, and simultaneously issuing a task in a task center;
the method comprises the following steps:
a1031, the engine acquires initial URL of vulnerability basic data from the Spider and adds an initial request into a scheduler;
a1032, the scheduler selects the next request from the request queue according to the set scheduling strategy; the dispatcher sends the selected request to the downloader, and the downloader sends the request to the website server and obtains the response;
a1033, the downloader returns the response to the engine, and the engine gives the response to the Spider for analysis;
a1034, extracting data from the response by the Spider, generating a new request, and adding the new request to the scheduler;
a1035, steps A1032 to A1034 are repeated until all requests in the scheduler are processed; finally, the data are sent to the Pipeline for processing and storage;
and A1036, after the data storage is successful, simultaneously issuing a task in the task center.
5. A method for retrieving an intelligent vulnerability discovery according to any one of claims 1-3, wherein the method in a104 is: the AI engine acquires a task from the task center, loads the HTML web page file acquired through Scrapy from the file system, analyzes and scores it, and selects the best-matching HTML page; after analysis, the analysis data, containing the fields CVE_ID, vulnerability name, exp link, link source, average score, update time and state, are synchronized into the result pool;
the method comprises the following steps:
a1041, an AI engine acquires an analysis task from a task center, analyzes a webpage to be analyzed, judges the webpage, and scores and outputs the webpage;
a1042, selecting a set number of links according to the analysis scenario, and formulating several strategies to judge trusted EXP by score;
and A1043, synchronizing the data information to a result pool after the grading judgment of the AI engine.
6. An intelligent vulnerability retrieval system, comprising:
the data collection module is used for collecting and sorting the HTML webpage containing the vulnerability EXP;
the learning module is used for feeding the sorted vulnerability EXP web page HTML data into the TextCNN framework and performing model learning analysis to form an AI engine;
the data crawling module is used for crawling the vulnerability data CVE and CNNVD by utilizing the Scrapy framework and simultaneously issuing a task to the task center;
the analysis module is used for the AI engine to go to a task service center to acquire tasks, load the webpage files in the file system for analysis, and after the analysis is finished, throw the results into a calculation result pool;
the association module is used for obtaining a calculation result by the main service, carrying out vulnerability EXP warehousing, and automatically associating CVE and CNNVD numbers;
and the updating module is used for carrying out manual auxiliary audit after the associated vulnerability EXP is put in storage, carrying out state updating on the vulnerability judged to be the trusted EXP, and automatically feeding back to the model training through the synchronization center again after updating.
7. The intelligent vulnerability retrieval system of claim 6, wherein the method in the data collection module is: various HTML web pages containing vulnerability EXP are collected, and whether the data contain the following mandatory fields is judged manually: the vulnerability number, EXP title, date, software link, version number, software platform, test system and reproduction steps; other information such as the attack scenario, vulnerability description and screenshots may also be included.
8. The intelligent vulnerability retrieval system of claim 6, wherein the learning module comprises: importing the sorted vulnerability EXP data into the TextCNN framework, performing model training and learning, extracting the key elements of the vulnerability EXP, and preprocessing the text data into fixed-length word vector or character vector representations, wherein the text data at least comprise: the vulnerability number, EXP title, date, software link, version number, software platform, test system, reproduction steps, attack scenario, vulnerability description, screenshots and other information; the texts are labeled so that each text corresponds to one label;
the method comprises the following steps:
a1021 model structure definition: the model structure of the textCNN is constructed and comprises an input layer, a convolution layer, a pooling layer, a full connection layer and an output layer. The input layer receives text vectors as input, the convolution layer is used for extracting local features, the pooling layer is used for selecting the most obvious features, the full connection layer is used for learning the representation of the text, and the output layer is used for classifying the text;
a1022 loss function selection: selecting an appropriate loss function for TextCNN, typically in a text classification task, using a cross entropy loss function to measure the difference between model predictions and real labels;
a1023 model optimization: an optimization algorithm (such as a gradient descent method) is adopted to minimize a loss function, and parameters of the model are updated, so that a prediction result of the model is more similar to a real label;
a1024 model training: and training the textCNN by using the labeled training data. In the training process, the model gradually learns the characteristics of the text by continuously and iteratively updating parameters, and optimizes the performance of the model;
a1025 model evaluation: evaluating the trained TextCNN by using a verification set or a test set, and calculating indexes such as the accuracy, precision and recall of the model to measure its performance;
a1026 model tuning: according to the evaluation result, the model is tuned; the hyperparameters, the model structure or the optimization algorithm may be adjusted to further improve the performance of the model;
a1027 model application: after the model training and evaluation reach a satisfactory effect, the trained TextCNN model can be used for practical applications.
9. An intelligent vulnerability retrieval system according to any one of claims 6-8, wherein the method in the data crawling module is as follows: using a Scrapy framework to acquire CVE and CNNVD basic data, cleaning and warehousing the data by using a Pipeline after acquiring, wherein the warehousing data comprises a field CVE number, a vulnerability name, a vulnerability cause, a vulnerability level, a vulnerability source release time, a vulnerability hazard, a vulnerability description, a repairing measure, a reference link, POC, bugtraq number, affected software, a security suggestion, a vulnerability provider, CNNVD number, a patch, a threat type, a vulnerability source and an updating time, and simultaneously issuing a task in a task center;
the method comprises the following steps:
a1031, the engine acquires initial URL of vulnerability basic data from the Spider and adds an initial request into a scheduler;
a1032, the scheduler selects the next request from the request queue according to the set scheduling strategy; the dispatcher sends the selected request to the downloader, and the downloader sends the request to the website server and obtains the response;
a1033, the downloader returns the response to the engine, and the engine gives the response to the Spider for analysis;
a1034, extracting data from the response by the Spider, generating a new request, and adding the new request to the scheduler;
a1035, steps A1032 to A1034 are repeated until all requests in the scheduler are processed; finally, the data are sent to the Pipeline for processing and storage;
and A1036, after the data storage is successful, simultaneously issuing a task in the task center.
10. An intelligent vulnerability retrieval system according to any one of claims 6-8, wherein the method in the analysis module is: the AI engine acquires a task from the task center, loads the HTML web page file acquired through Scrapy from the file system, analyzes and scores it, and selects the best-matching HTML page; after analysis, the analysis data, containing the fields CVE_ID, vulnerability name, exp link, link source, average score, update time and state, are synchronized into the result pool;
the method comprises the following steps:
a1041, an AI engine acquires an analysis task from a task center, analyzes a webpage to be analyzed, judges the webpage, and scores and outputs the webpage;
a1042, selecting a set number of links according to the analysis scenario, and formulating several strategies to judge trusted EXP by score;
and A1043, synchronizing the data information to a result pool after the grading judgment of the AI engine.
CN202311414864.3A 2023-10-27 2023-10-27 Intelligent vulnerability recurrence retrieval method and system Pending CN117494132A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311414864.3A CN117494132A (en) 2023-10-27 2023-10-27 Intelligent vulnerability recurrence retrieval method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311414864.3A CN117494132A (en) 2023-10-27 2023-10-27 Intelligent vulnerability recurrence retrieval method and system

Publications (1)

Publication Number Publication Date
CN117494132A true CN117494132A (en) 2024-02-02

Family

ID=89671848

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311414864.3A Pending CN117494132A (en) 2023-10-27 2023-10-27 Intelligent vulnerability recurrence retrieval method and system

Country Status (1)

Country Link
CN (1) CN117494132A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117744632A (en) * 2024-02-20 2024-03-22 深圳融安网络科技有限公司 Method, device, equipment and medium for constructing vulnerability information keyword extraction model
CN117744632B (en) * 2024-02-20 2024-05-10 深圳融安网络科技有限公司 Method, device, equipment and medium for constructing vulnerability information keyword extraction model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination