CN116775910B

CN116775910B - Automatic vulnerability reproduction knowledge base construction method and medium based on information collection

Info

Publication number: CN116775910B
Application number: CN202311041050.XA
Authority: CN
Inventors: 李季; 汪晓慧; 梁露露
Original assignee: Beijing Yuanbao Technology Co ltd
Current assignee: Beijing Yuanbao Technology Co ltd
Priority date: 2023-08-18
Filing date: 2023-08-18
Publication date: 2023-11-24
Anticipated expiration: 2043-08-18
Also published as: CN116775910A

Abstract

The application discloses an automated vulnerability reproduction knowledge base construction method and medium based on information collection. The method may include: establishing a vulnerability reproduction knowledge base, determining CVE numbers, parsing the CVE numbers, and constructing feature vectors; based on the feature vectors, collecting vulnerability intelligence, extracting vulnerability information from it, and reproducing a vulnerability environment; collecting vulnerability proving scheme information based on the vulnerability information; analyzing the vulnerability proving scheme information, associating and evaluating it, and generating a vulnerability proving scheme report; generating a vulnerability exploitation script according to the vulnerability proving scheme report; and starting the vulnerability environment, executing the vulnerability exploitation script to simulate an attack, and storing the vulnerability environment and the vulnerability exploitation script in the vulnerability reproduction knowledge base. Starting from a predetermined CVE number, the application automatically collects information, builds the vulnerability environment, generates the vulnerability exploitation script, and performs the simulated attack. This improves the efficiency of vulnerability reproduction and lets researchers carry out tasks such as vulnerability verification and exploitation conveniently and quickly.

Description

Automatic vulnerability reproduction knowledge base construction method and medium based on information collection

Technical Field

The application relates to the field of vulnerability reproduction, in particular to an automatic vulnerability reproduction knowledge base construction method and medium based on information collection.

Background

According to data published by vulnerability organizations and platforms, the number of recorded vulnerabilities has risen steadily year over year, and their scope of impact keeps expanding. Studying known vulnerabilities and understanding how they arise and how they are exploited can effectively prevent most security problems.

At present, reproducing the vulnerability attack used in a security incident mainly involves configuring a vulnerability environment and then reproducing the vulnerability with a vulnerability proving scheme. This approach has three problems. First, configuring the vulnerability environment is complex: vendors release security patches for public vulnerabilities over time, and most vulnerabilities can be exploited only under specific environments or conditions, so the related software and hardware environment must be configured and specific system and software versions installed, which consumes manpower. Second, information collection is ineffective: the main current schemes for collecting vulnerability information are manual search, web crawling, API calls, and community integration, but these methods suffer from low efficiency, narrow coverage, and poor timeliness, and struggle to meet application requirements. Third, the technical difficulty is high: the work places demands on the skills of security practitioners, and for some vulnerabilities few details are public and no detailed exploitation manual exists, which increases the operational complexity for testers and requires a large investment of time and effort.

Existing methods for building a vulnerability reproduction environment can construct the environment quickly and automatically from vulnerability information: they can obtain, without manual effort, the vulnerability meta-information needed to build a cloud-native application vulnerability, write a vulnerability information data package from that meta-information, and construct the reproduction environment from the package. However, the necessary meta-information must still be collected by security practitioners and filled in according to the structural standard of the vulnerability information file, and the software installation configuration file must be written specially. These methods therefore still require manual work and do not noticeably improve efficiency. In addition, they do not collect vulnerability proving scheme information; they only build the vulnerability environment and do not generate vulnerability exploitation scripts, so simulating the vulnerability attack is difficult.

Therefore, it is necessary to develop an automated vulnerability reproduction knowledge base construction method and medium based on information collection.

The information disclosed in the background section of the application is only for enhancement of understanding of the general background of the application and should not be taken as an acknowledgement or any form of suggestion that this information forms the prior art already known to a person skilled in the art.

Disclosure of Invention

The application provides an automated vulnerability reproduction knowledge base construction method and medium based on information collection. Starting from a predetermined CVE number, the method constructs a feature vector, automatically collects and analyzes information based on that vector to generate the vulnerability environment and the vulnerability proving scheme, and finally builds the vulnerability environment automatically and executes a vulnerability exploitation script to reproduce the vulnerability. This reduces the operational complexity for researchers and makes vulnerability reproduction more convenient.

In a first aspect, an embodiment of the present disclosure provides an automated vulnerability reproduction knowledge base construction method based on information collection, including:

step 1: establishing a vulnerability reproduction knowledge base, determining CVE numbers of vulnerability reproduction, analyzing the CVE numbers, and constructing feature vectors;

step 2: collecting vulnerability information based on the feature vector, extracting vulnerability information and reproducing a vulnerability environment;

step 3: collecting vulnerability proving scheme information based on the vulnerability information;

step 4: analyzing the vulnerability proving scheme information, carrying out information association through a knowledge graph, carrying out information evaluation through an information evaluation mechanism, and generating a vulnerability proving scheme report;

step 5: generating a vulnerability exploitation script according to the vulnerability proving scheme report;

step 6: starting the vulnerability environment, executing the vulnerability exploitation script to perform a simulated attack, and storing the vulnerability environment and the vulnerability exploitation script into the vulnerability reproduction knowledge base.

Preferably, the vulnerability proving scheme information is analyzed through NLP technology.

Preferably, analyzing the vulnerability proving scheme intelligence includes:

cleaning the vulnerability proving scheme information, and extracting keywords from the cleaned information through a TF-IDF algorithm;

analyzing semantic information of the information text through a dependency syntax analysis technology;

generating abstract information of the information text through an LSTM algorithm;

judging the type of the information according to the keywords, the semantic information and the abstract content, and realizing classification and induction of the information.

Preferably, the keywords include vulnerability names, affected systems/devices, and exploitation tools.

Preferably, the information association by the knowledge graph includes:

extracting key information from different pieces of information as nodes, and taking the nodes as basic elements for constructing a knowledge graph;

according to the internal association between nodes, defining different types of edges to represent the relationship, linking the related nodes, and realizing association and semantic expression between nodes;

combining the nodes and the edges to construct a knowledge triplet with semantics and storing the knowledge triplet in a knowledge graph;

when new information is collected, extracting nodes, inquiring similar or related nodes in the knowledge graph, and judging the category and the related scheme of the new information;

if the new information is related to the knowledge graph but is incomplete, extracting new information content from the new information for supplementing;

if the nodes in the new information are not directly related to the knowledge graph, generating a new knowledge triplet, and adding the new knowledge triplet to the knowledge graph.

Preferably, the information evaluation by the information evaluation mechanism includes:

determining multi-angle information evaluation indexes;

determining the weight of each evaluation index according to the characteristics of different types of information;

judging the information performance of each evaluation index according to the evaluation method, and determining the score of each index;

and calculating the total score of the information aiming at each index score and the corresponding weight thereof, thereby realizing the comprehensive evaluation of the information quality.

Preferably, the multi-angle information evaluation index comprises source authority, information integrity, consistency test and information timeliness.

Preferably, if the score of an index is too high or too low, the evaluation result of that index is singled out for key judgment, that is, a focused review to determine whether it significantly affects the overall evaluation of the information, thereby avoiding misjudgment caused by an extreme score on a single index.

Preferably, after the vulnerability environment executes the exploit script to perform the simulated attack, the step 6 further includes:

judging whether the simulated attack result matches the real result; if so, directly storing the current vulnerability environment and vulnerability exploitation script; if not, raising an alarm and cyclically executing steps 3-6 until the results match or the number of simulated attacks reaches the set number, and then exiting the task.

In a second aspect, the embodiments of the present disclosure further provide a computer readable storage medium storing a computer program, where the computer program, when executed by a processor, implements the automated vulnerability reproduction knowledge base construction method based on information collection.

The method and apparatus of the present application have other features and advantages which will be apparent from or are set forth in detail in the accompanying drawings and the following detailed description, which are incorporated herein, and which together serve to explain certain principles of the present application.

Drawings

The foregoing and other objects, features and advantages of the application will be apparent from the following more particular descriptions of exemplary embodiments of the application as illustrated in the accompanying drawings wherein like reference numbers generally represent like parts throughout the exemplary embodiments of the application.

Fig. 1 shows a flowchart of the steps of an automated vulnerability reproduction knowledge base construction method based on information collection, according to an embodiment of the application.

Fig. 2 shows a flowchart of the steps of information analysis using NLP technology, according to an embodiment of the application.

Fig. 3 shows a flowchart of the steps of establishing a knowledge graph for information association, according to an embodiment of the application.

Fig. 4 shows a flowchart of the steps of information evaluation by an information evaluation mechanism, according to an embodiment of the application.

Detailed Description

Preferred embodiments of the present application will be described in more detail below. While the preferred embodiments of the present application are described below, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein.

In order to facilitate understanding of the solution and the effects of the embodiments of the present application, two specific application examples are given below. It will be understood by those of ordinary skill in the art that the examples are for ease of understanding only and that any particular details thereof are not intended to limit the present application in any way.

Example 1

As shown in fig. 1, the automated vulnerability reproduction knowledge base construction method based on information collection includes:

step 1: establishing a vulnerability reproduction knowledge base, determining CVE numbers of vulnerability reproduction, analyzing the CVE numbers, and constructing feature vectors;

step 2: based on the feature vector, collecting vulnerability information, extracting vulnerability information and reproducing a vulnerability environment;

step 3: collecting vulnerability proving scheme information based on vulnerability information;

step 4: analyzing the vulnerability proving scheme information, carrying out information association through a knowledge graph, carrying out information evaluation through an information evaluation mechanism, and generating a vulnerability proving scheme report;

step 5: generating a vulnerability exploitation script according to the vulnerability proving scheme report;

step 6: starting the vulnerability environment, executing the vulnerability exploitation script to perform a simulated attack, and storing the vulnerability environment and the vulnerability exploitation script into the vulnerability reproduction knowledge base.

In one example, the vulnerability proving scheme information is analyzed by NLP technology.

In one example, analyzing the vulnerability proving scheme information includes:

cleaning the vulnerability proving scheme information, and extracting keywords from the cleaned information through a TF-IDF algorithm;

analyzing semantic information of the information text through a dependency syntax analysis technology;

generating abstract information of the information text through an LSTM algorithm;

judging the type of the information according to the keywords, the semantic information and the abstract content, and realizing classification and induction of the information.

In one example, the keywords include vulnerability names, affected systems/devices, and exploitation tools.

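In one example, the information association by the knowledge graph includes:

extracting key information from different pieces of information as nodes, and taking the nodes as basic elements for constructing a knowledge graph;

according to the internal association between nodes, defining different types of edges to represent the relationship, linking the related nodes, and realizing association and semantic expression between nodes;

combining the nodes and the edges to construct a knowledge triplet with semantics and storing the knowledge triplet in the knowledge graph;

when new information is collected, extracting nodes, inquiring similar or related nodes in the knowledge graph, and judging the category and the related scheme of the new information;

if the new information is related to the knowledge graph but is incomplete, extracting new information content from the new information for supplementing;

if the nodes in the new information are not directly related to the knowledge graph, generating a new knowledge triplet, and adding the new knowledge triplet to the knowledge graph.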

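In one example, the information evaluation by the information evaluation mechanism includes:

determining multi-angle information evaluation indexes;

determining the weight of each evaluation index according to the characteristics of different types of information;

judging the information performance of each evaluation index according to the evaluation method, and determining the score of each index;

and calculating the total score of the information from each index score and its corresponding weight, thereby realizing the comprehensive evaluation of the information quality.

In one example, the multi-angle information evaluation index includes source authority, information integrity, consistency test, and information timeliness.

In one example, if the score of an index is too high or too low, the evaluation result of that index is singled out for key judgment, that is, a focused review to determine whether it significantly affects the overall evaluation of the information, thereby avoiding misjudgment caused by an extreme score on a single index.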

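In one example, after the vulnerability environment executes the vulnerability exploitation script to perform the simulated attack, step 6 further includes:

judging whether the simulated attack result matches the real result; if so, directly storing the current vulnerability environment and vulnerability exploitation script; if not, raising an alarm and cyclically executing steps 3-6 until the results match or the number of simulated attacks reaches the set number, and then exiting the task.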

Specifically, step 1: a security researcher prepares a vulnerability reproduction knowledge base whose content is the CVE numbers of the vulnerabilities to be reproduced. The CVE number of a vulnerability is parsed to obtain the vulnerability name and vulnerability description for that CVE number, the name and description are segmented into words, and a feature vector is constructed.
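The word segmentation and feature vector construction of step 1 can be sketched as follows. This is a minimal illustration that assumes a simple bag-of-words count is an acceptable feature vector; the application does not specify the vector form, and the names tokenize, build_feature_vector, and STOP_WORDS are hypothetical.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "the", "in", "of", "to", "and", "is", "via", "allows"}


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens.

    Dots, underscores, and hyphens are kept only inside a token, so version
    strings such as "2.14.1" survive while trailing punctuation is dropped.
    """
    tokens = re.findall(r"[a-z0-9]+(?:[._-][a-z0-9]+)*", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


def build_feature_vector(name: str, description: str) -> dict[str, int]:
    """Count how often each token occurs in the vulnerability name and description."""
    return dict(Counter(tokenize(name) + tokenize(description)))


vector = build_feature_vector(
    "Apache Log4j2 remote code execution",
    "Log4j2 2.14.1 JNDI lookup allows remote code execution.",
)
```

The resulting term counts could then drive the crawler queries of step 2, for example by searching for the highest-count terms first.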

Step 2: based on the feature vector, public vulnerability platforms, the vulnerability information platforms of relevant Internet companies, and similar sources are crawled to collect vulnerability intelligence. Vulnerability information is extracted from the collected intelligence. A Dockerfile configuration file is then customized according to the extracted vulnerability information so that the vulnerability environment can be reproduced quickly.

Step 3: collecting vulnerability proving scheme information: based on the extracted vulnerability information, a crawler program monitors security research sources such as emergency response platforms, hacker forums, public vulnerability platforms, papers by security researchers, and blogs of relevant Internet companies, and collects vulnerability proving scheme information, expanding the range of information sources as far as possible.

Step 4: extracting a vulnerability proving scheme: the collected information is analyzed using NLP technology to understand its key points and to identify keywords such as affected systems and the exploitation process. Each piece of information is then associated through the knowledge graph to find potential associations and supplement missing information. In addition, an information evaluation mechanism is established to evaluate the accuracy of each piece of information and select high-quality information. Finally, a vulnerability proving scheme report is generated from the high-quality information.
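Customizing a Dockerfile from the extracted vulnerability information might look like the sketch below. The VulnInfo fields and the render_dockerfile function are hypothetical, the package name and version are placeholders, and the sketch assumes a Debian-based base image so that packages can be pinned with apt-get.

```python
import json
from dataclasses import dataclass, field


@dataclass
class VulnInfo:
    """Hypothetical subset of the vulnerability information extracted in step 2."""
    base_image: str
    packages: dict[str, str] = field(default_factory=dict)  # package -> affected version
    exposed_port: int = 80
    start_command: list[str] = field(default_factory=list)


def render_dockerfile(info: VulnInfo) -> str:
    """Render Dockerfile text that pins the affected software versions."""
    lines = [f"FROM {info.base_image}"]
    if info.packages:
        pins = " ".join(f"{name}={version}" for name, version in sorted(info.packages.items()))
        # Assumption: apt-based image; other distributions need another installer.
        lines.append("RUN apt-get update && apt-get install -y --no-install-recommends " + pins)
    lines.append(f"EXPOSE {info.exposed_port}")
    if info.start_command:
        # The exec form of CMD is a JSON array, so json.dumps produces valid syntax.
        lines.append("CMD " + json.dumps(info.start_command))
    return "\n".join(lines) + "\n"


dockerfile = render_dockerfile(
    VulnInfo(
        base_image="debian:bullseye",
        packages={"example-server": "1.2.3"},  # placeholder package and version
        exposed_port=8080,
        start_command=["example-server", "--foreground"],
    )
)
```

Pinning exact versions matters here because, as the background notes, most vulnerabilities can be exploited only in specific software versions.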

As shown in fig. 2, the information analysis using NLP technology includes:

1) Information preprocessing. The collected information text is cleaned, for example by unifying formats, correcting spelling, and filtering stop words, so that it has a tidy structure and a uniform format that is convenient for subsequent processing;

2) Keyword extraction. Keywords such as vulnerability names, affected systems/devices, and exploitation tools are extracted from the information text using the TF-IDF algorithm and are used to identify the main content and features of the information;

3) Semantic parsing. The semantic structure and elements of the information text are parsed using dependency syntax analysis, identifying key information in the text such as vulnerability names, scope of impact, and exploitation process, and capturing the logical relationships and meaning among them;

4) Information summarization. Abstract information for the information text is generated using an LSTM algorithm. The abstract allows the main content and key points of the information to be grasped quickly and also facilitates the downstream information association and evaluation processes;

5) Information classification. The information type and the vulnerability proving scheme it describes are determined from the keywords, the semantic information, and the abstract content, so that the information is classified and summarized. This facilitates targeted processing and management of different types of information.
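The keyword extraction of item 2) can be illustrated with a small pure-Python TF-IDF ranking. This is a sketch under stated assumptions rather than the implementation of the application: the tokenizer is a simple regular expression, the inverse document frequency is the unsmoothed logarithm of (number of documents / documents containing the term), and the function name tfidf_keywords is hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Split lowercase text into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_keywords(documents: list[str], doc_index: int, top_k: int = 3) -> list[str]:
    """Return the top_k terms of one document ranked by TF-IDF within the collection."""
    tokenized = [tokenize(document) for document in documents]
    # Document frequency: in how many documents each term appears at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(tokenized)
    tokens = tokenized[doc_index]
    counts = Counter(tokens)
    scores = {
        term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
    }
    # Sort by descending score, breaking ties alphabetically for stable output.
    return sorted(scores, key=lambda term: (-scores[term], term))[:top_k]


reports = [
    "sql injection in login form allows sql commands",
    "buffer overflow in image parser",
    "login form has weak password policy",
]
top_keyword = tfidf_keywords(reports, 0, top_k=1)
```

Terms that occur in every document receive a score of zero under this weighting, which is why corpus-wide filler words drop out of the keyword list.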

As shown in fig. 3, the analysis results obtained with NLP technology are organized into structured knowledge and stored in a knowledge graph, giving a standardized representation and management of the information content. Establishing the knowledge graph for information association includes:

1) Node definition: key information from different pieces of information is extracted as nodes, which serve as the basic elements of the knowledge graph and represent the main features of the information;

2) Edge definition: according to the inherent associations among nodes, different types of edges are defined to represent relationships, such as "is", "exploits", and "affects", linking related nodes and expressing the association and semantics between them;

3) Knowledge construction: nodes and edges are combined into a "subject-predicate-object" structure to build semantic knowledge triples, which are stored in the knowledge graph. This gives structured, standardized management of the collected information content;

4) Knowledge query: when new information is collected, keywords and elements are first extracted from the text as nodes, and similar or related nodes are then queried in the knowledge graph to determine which existing knowledge the new information relates to or duplicates, and from that to infer the category of the new information and the scheme it concerns;

5) Knowledge supplementation: if the new information is related to some knowledge in the knowledge graph but that knowledge item is incomplete, more detailed content is extracted from the new information to supplement it;

6) Knowledge expansion: if some key nodes in the new information are not directly related to any knowledge in the knowledge graph, the content of the new information is analyzed to understand the relationships and semantics between the nodes, new knowledge triples are generated, and they are added to the knowledge graph, realizing the discovery and expansion of knowledge.
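The six items above can be sketched as a small in-memory triple store. The KnowledgeGraph class, its method names, and the predicate labels are hypothetical; the sketch assumes exact string matching of nodes, whereas the application also speaks of similar nodes, which would need a fuzzier lookup.

```python
from collections import defaultdict

Triple = tuple[str, str, str]  # (subject, predicate, object)


class KnowledgeGraph:
    """Stores knowledge triples and indexes them by the nodes they touch."""

    def __init__(self) -> None:
        self.triples: set[Triple] = set()
        self._by_node: dict[str, set[Triple]] = defaultdict(set)

    def add(self, subject: str, predicate: str, obj: str) -> bool:
        """Store one triple; return False if it was already present."""
        triple = (subject, predicate, obj)
        if triple in self.triples:
            return False
        self.triples.add(triple)
        self._by_node[subject].add(triple)
        self._by_node[obj].add(triple)
        return True

    def related(self, node: str) -> set[Triple]:
        """Knowledge query: every stored triple that mentions the node."""
        return set(self._by_node.get(node, set()))

    def integrate(self, new_triples: list[Triple]) -> dict[str, list[Triple]]:
        """Classify each incoming triple as duplicate, supplement, or new, then store it."""
        outcome: dict[str, list[Triple]] = {"duplicate": [], "supplement": [], "new": []}
        for triple in new_triples:
            subject, predicate, obj = triple
            if triple in self.triples:
                outcome["duplicate"].append(triple)
                continue
            # A triple touching a known node supplements existing knowledge;
            # otherwise it expands the graph with new knowledge.
            known = subject in self._by_node or obj in self._by_node
            outcome["supplement" if known else "new"].append(triple)
            self.add(subject, predicate, obj)
        return outcome


graph = KnowledgeGraph()
graph.add("CVE-2021-44228", "affects", "Apache Log4j2")
outcome = graph.integrate([
    ("CVE-2021-44228", "affects", "Apache Log4j2"),
    ("CVE-2021-44228", "exploited_with", "JNDI lookup"),
    ("CVE-2014-0160", "affects", "OpenSSL"),
])
```

Because triples are stored as they are classified, a later triple in the same batch can count as a supplement to an earlier one; a production system might prefer to classify the whole batch against the graph first.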

As shown in fig. 4, the information evaluation by the information evaluation mechanism includes:

1) Index determination. The information is evaluated from multiple angles, including:

(1) source authority: the authority and public credibility of the publishing source are evaluated, for example government institutions, well-known enterprises, or individual researchers. Information from an authoritative source is generally of higher quality and accuracy, so source authority can serve as an important reference factor when judging information;

(2) information integrity: whether the information is complete and detailed is evaluated, that is, whether it contains the elements required by the vulnerability proving scheme report, such as vulnerability name, scope of impact, exploitation conditions, and technical details. Information that is complete is generally of higher quality;

(3) consistency test: information from different sources is compared to judge whether it describes the same vulnerability proving scheme and whether the extracted key information is consistent, for example whether the vulnerability names and affected system versions are the same and whether the exploitation processes are similar. Information with higher consistency is more reliable;

(4) information timeliness: the time at which the information was produced is evaluated. Recently released information is more timely and may be more detailed and accurate; conversely, information released long ago is less timely and may already be out of date, which affects its reliability.

2) Weight determination. The weight of each evaluation index is determined according to the characteristics of the different types of information.

3) Index scoring. The performance of the information on each evaluation index is judged according to the evaluation method, with a high score assigned where the information matches the information type and a low score where it does not, and the score of each index is finally determined.

4) Total score calculation. Each index score is multiplied by its weight and the products are summed to obtain the total score of the information, which represents the comprehensive evaluation of its quality. The higher the total score, the higher the quality and importance of the information.

5) Key judgment. When the score of an index is too high or too low, the evaluation result of that index is reviewed specifically to determine whether it significantly affects the overall evaluation of the information, avoiding misjudgment caused by an extreme score on a single index.
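The weighted total of item 4) and the key judgment of item 5) can be sketched as follows. The weights, the score range of 0 to 1, and the thresholds that count as "too low" or "too high" are all assumptions chosen for illustration; the application leaves them to be set per information type.

```python
# Assumed weights for one information type; they sum to 1.
WEIGHTS = {
    "source_authority": 0.35,
    "information_integrity": 0.25,
    "consistency": 0.25,
    "timeliness": 0.15,
}


def evaluate(
    scores: dict[str, float],
    weights: dict[str, float] = WEIGHTS,
    low: float = 0.2,
    high: float = 0.9,
) -> tuple[float, list[str]]:
    """Return the weighted total score and the indexes that need key judgment.

    Scores are assumed to lie in [0, 1]. An index scoring at or below `low`
    or at or above `high` is flagged for review instead of being trusted blindly.
    """
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same indexes")
    total = sum(scores[name] * weights[name] for name in weights)
    flagged = sorted(name for name, value in scores.items() if value <= low or value >= high)
    return total, flagged


total, flagged = evaluate({
    "source_authority": 0.8,
    "information_integrity": 0.6,
    "consistency": 0.7,
    "timeliness": 0.1,
})
```

In this example the total is about 0.62 and only the timeliness index is flagged, so a reviewer would check whether stale information should still be trusted despite its otherwise good scores.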

Step 5: generating a vulnerability exploitation script: based on the generated vulnerability proving scheme report, the vulnerability exploitation script is generated from a predefined template.

Step 6: the vulnerability environment is started, and the vulnerability exploitation script is executed in it to perform a simulated attack. If the attack result matches the real result, automatic reproduction of the vulnerability with the current number has succeeded and the test ends. If the attack result does not match the real result, an alarm is raised and steps 3-6 are executed cyclically until the simulated attack result matches the real result or the number of simulated attacks reaches 10, at which point the task exits. A security researcher then handles the simulated attack feedback, and the vulnerability environment and vulnerability exploitation script are recorded in the vulnerability reproduction knowledge base.
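Template-based script generation can be illustrated with Python's string.Template. The template text, the report field names (cve_id, vuln_name, target, steps), and the render_script function are hypothetical, and the generated script is only a harmless skeleton that prints the planned steps; the application does not disclose its actual template. The identifier CVE-0000-0000 is a placeholder, not a real entry.

```python
from string import Template

# Assumed skeleton; the report fields are substituted into it as Python literals.
SCRIPT_TEMPLATE = Template('''\
#!/usr/bin/env python3
# Reproduction script skeleton for $cve_id ($vuln_name).

TARGET = $target
STEPS = $steps


def main() -> None:
    for number, step in enumerate(STEPS, start=1):
        print(f"[{number}] {step} against {TARGET}")


if __name__ == "__main__":
    main()
''')


def render_script(report: dict) -> str:
    """Fill the template from a report; cve_id and vuln_name must be single-line text."""
    return SCRIPT_TEMPLATE.substitute(
        cve_id=report["cve_id"],
        vuln_name=report["vuln_name"],
        # repr() emits quoted Python literals, so quotes in the data cannot break the script.
        target=repr(report["target"]),
        steps=repr(list(report["steps"])),
    )


script = render_script({
    "cve_id": "CVE-0000-0000",
    "vuln_name": "example vulnerability",
    "target": "http://127.0.0.1:8080",
    "steps": ["start the vulnerability environment", "send the verification request"],
})
```

Keeping the report data separate from the template means a new vulnerability only needs a new report, not a new script written by hand.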
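The retry logic of step 6 amounts to a bounded loop. In the sketch below, run_attempt stands in for one pass through steps 3-6 and returns the observed attack result, while alert stands in for the alarm; both callables and the function name reproduce are hypothetical. The limit of 10 attempts follows the embodiment.

```python
from typing import Callable

MAX_ATTEMPTS = 10  # the embodiment exits after 10 simulated attacks


def reproduce(
    run_attempt: Callable[[int], str],
    expected: str,
    alert: Callable[[int, str], None],
    max_attempts: int = MAX_ATTEMPTS,
) -> tuple[bool, int]:
    """Repeat the simulated attack until its result matches or attempts run out.

    Returns (succeeded, attempts_used).
    """
    for attempt in range(1, max_attempts + 1):
        observed = run_attempt(attempt)
        if observed == expected:
            return True, attempt
        alert(attempt, observed)  # mismatch: raise the alarm, then try again
    return False, max_attempts


# Usage with stand-in results: the third attempt produces the expected response.
results = iter(["error", "error", "expected response"])
alerts: list[int] = []
outcome = reproduce(
    lambda attempt: next(results),
    "expected response",
    lambda attempt, observed: alerts.append(attempt),
)
```

On success the caller would store the environment and script in the knowledge base; on failure the collected alerts give the security researcher the feedback to process.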

CVE: Common Vulnerabilities and Exposures (CVE) is a database related to information security that collects information security vulnerabilities and assigns them numbers for public reference. CVE assigns a unique number to each vulnerability in the format CVE-YYYY-NNNN, where CVE is a fixed prefix, YYYY is the Gregorian calendar year, and NNNN is a sequence number.
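Parsing a CVE number can be sketched with a regular expression. The names CVE_PATTERN, CveId, and parse_cve_id are hypothetical. Note that the sequence part of a real CVE number has at least four digits and may have more, so the pattern accepts four or more.

```python
import re
from typing import NamedTuple

# Fixed prefix, four-digit year, then a sequence number of four or more digits.
CVE_PATTERN = re.compile(r"^CVE-(\d{4})-(\d{4,})$")


class CveId(NamedTuple):
    year: int
    sequence: int


def parse_cve_id(text: str) -> CveId:
    """Split a CVE number into year and sequence number, or raise ValueError."""
    match = CVE_PATTERN.match(text.strip().upper())
    if match is None:
        raise ValueError(f"not a valid CVE number: {text!r}")
    return CveId(year=int(match.group(1)), sequence=int(match.group(2)))


parsed = parse_cve_id("cve-2021-44228")
```

Validating the number before any crawling starts avoids wasting collection effort on a mistyped knowledge base entry.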

NLP: Natural language processing (NLP) is a branch of artificial intelligence and linguistics that studies how to process and use natural language. It covers many aspects and steps, chiefly cognition, understanding, and generation. Natural language cognition and understanding let a computer turn input language into meaningful symbols and relationships and then process them according to the purpose at hand. A natural language generation system converts computer data into natural language.

Vulnerability information: the CVE number is parsed to obtain the vulnerability name and vulnerability description; vulnerability intelligence is collected from public vulnerability platforms, the vulnerability information platforms of relevant Internet companies, and similar sources according to the name and description; and the intelligence is processed to extract the vulnerability information, which includes the vulnerability source, affected versions, hazard level, solutions, and so on. This information is used to build the vulnerability environment and to collect vulnerability proving scheme information.

Vulnerability proving scheme: a proof of concept (PoC) is a short, incomplete implementation of an idea whose purpose is to verify a concept or theory. It can be used to verify that a vulnerability or class of vulnerabilities actually exists and to demonstrate its principle.

Vulnerability exploitation script: an exploit (Exp) is a program that takes advantage of a vulnerability. A vulnerability exploitation script is written from the collected vulnerability proving scheme to exploit the vulnerability. For example, when a system has an SQL injection vulnerability, a vulnerability exploitation script can be written to extract the database version information and the like.

Dockerfile: Docker is used to develop, deliver, and run applications. It allows users to separate applications from the infrastructure into smaller containers, thereby speeding up software delivery. A Dockerfile is a text file used by Docker that contains all the commands needed to build an image, defining the content of a single container and its behavior at startup.

Simulation attack: after the vulnerability environment has been built and the vulnerability exploitation script generated, the script is executed in the environment and its result is checked against the expected vulnerability response; if they are consistent, the simulated attack is considered successful.

The TF-IDF algorithm (Term Frequency-Inverse Document Frequency) is a common statistical method in information retrieval and text mining for evaluating how important a word is to one document in a document collection or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases with the frequency with which it appears across the corpus.
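One common formulation of this weighting is given below. The symbols are assumptions introduced here for illustration: n_{t,d} is the number of occurrences of term t in document d, N is the number of documents in the corpus D, and the logarithm is unsmoothed. Variants with smoothing or different normalization are also widely used, and the application does not state which one it adopts.

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t),
\qquad
\mathrm{tf}(t, d) = \frac{n_{t,d}}{\sum_{k} n_{k,d}},
\qquad
\mathrm{idf}(t) = \log \frac{N}{\lvert \{\, d' \in D : t \in d' \,\} \rvert}
```

For example, a term that appears twice in an eight-word document and in only one of three documents has tf = 0.25 and idf = log 3, while a term that appears in every document has idf = log 1 = 0 and therefore no weight.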

Dependency syntax analysis: in natural language processing, a framework that describes language structure in terms of dependencies between words is called a dependency grammar. Syntactic analysis using dependency grammar is one of the important techniques for natural language understanding. It parses a sentence into a dependency syntax tree that describes the dependency relationships between words, that is, the syntactic collocations between words, which are associated with semantics.

LSTM: Long Short-Term Memory (LSTM) is a recurrent neural network suited to processing long text sequences. It has a gating structure and a memory cell and can selectively forget some information and update other information, which makes it particularly suitable for understanding the semantics of long text sequences.

Creating the automated vulnerability reproduction knowledge base lets researchers carry out work such as vulnerability verification and exploitation conveniently and quickly, saves the time and effort they would spend collecting information and reproducing vulnerabilities, and lets them concentrate on studying vulnerability techniques.

Example 2

The embodiment of the disclosure provides a computer readable storage medium, which stores a computer program, and the computer program realizes the automated vulnerability reproduction knowledge base construction method based on information collection when being executed by a processor.

A computer-readable storage medium according to an embodiment of the present disclosure has stored thereon non-transitory computer-readable instructions. When executed by a processor, the instructions perform all or part of the steps of the methods of embodiments of the present disclosure described above.

The computer-readable storage medium described above includes, but is not limited to: optical storage media (e.g., CD-ROM and DVD), magneto-optical storage media (e.g., MO), magnetic storage media (e.g., magnetic tape or removable hard disk), media with built-in rewritable non-volatile memory (e.g., memory card), and media with built-in ROM (e.g., ROM cartridge).

It will be appreciated by persons skilled in the art that the above description of embodiments of the application has been given for the purpose of illustrating the benefits of embodiments of the application only and is not intended to limit embodiments of the application to any examples given.

The foregoing description of embodiments of the application has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described.

Claims

1. An automated vulnerability reproduction knowledge base construction method based on information collection, characterized by comprising the following steps:

step 2: collecting vulnerability information based on the feature vector, extracting vulnerability information and reproducing a vulnerability environment; customizing a Dockerfile configuration file according to the extracted vulnerability information, and rapidly reproducing a vulnerability environment;

step 6: starting the vulnerability environment, executing the vulnerability exploitation script to perform simulation attack, and storing the vulnerability environment and the vulnerability exploitation script into the vulnerability reproduction knowledge base;

after the vulnerability environment executes the vulnerability exploitation script to perform the simulation attack, the step 6 further includes:

2. The automated vulnerability reproduction knowledge base construction method based on information collection of claim 1, wherein the vulnerability proving scheme information is analyzed by NLP technology.

3. The automated vulnerability reproduction knowledge base construction method based on information collection of claim 2, wherein analyzing the vulnerability proving scheme information comprises:

4. The automated vulnerability reproduction knowledge base construction method based on information collection of claim 3, wherein the keywords comprise vulnerability names, affected systems/devices, and exploitation tools.

5. The automated vulnerability reproduction knowledge base construction method based on information collection of claim 4, wherein the information association by the knowledge graph comprises:

6. The automated vulnerability reproduction knowledge base construction method based on information collection of claim 5, wherein the information evaluation by the information evaluation mechanism comprises:

determining multi-angle information evaluation indexes;

7. The automated vulnerability reproduction knowledge base construction method based on information collection of claim 6, wherein the multi-angle information evaluation index comprises source authority, information integrity, consistency test, and information timeliness.

8. The method for constructing an automated vulnerability reproduction knowledge base based on intelligence collection according to claim 7, wherein if the score of a certain index is too high or too low, the evaluation result of the index is subjected to key judgment to judge whether the overall evaluation of intelligence is significantly affected, so as to avoid erroneous judgment caused by extreme scoring of the certain index.

9. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program which, when executed by a processor, implements the automated vulnerability reproduction knowledge base construction method based on information collection according to any one of claims 1-8.