CN115455945A - Entity-relationship-based vulnerability data error correction method and system

Info

Publication number
CN115455945A
Authority
CN
China
Prior art keywords: vulnerability, information, data, version, software package
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210917536.4A
Other languages
Chinese (zh)
Inventor
杨牧天
刘梅
吴敬征
罗天悦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhongke Weilan Technology Co ltd
Original Assignee
Beijing Zhongke Weilan Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongke Weilan Technology Co ltd filed Critical Beijing Zhongke Weilan Technology Co ltd
Priority to CN202210917536.4A priority Critical patent/CN115455945A/en
Publication of CN115455945A publication Critical patent/CN115455945A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/232 Orthographic correction, e.g. spell checking or vowelisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/57 Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
    • G06F21/577 Assessing vulnerabilities and evaluating computer system security
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Stored Programmes (AREA)

Abstract

The invention discloses an entity-relationship-based vulnerability data error correction method, which comprises the following steps: acquiring vulnerability description information from a vulnerability database, and performing word segmentation processing on the vulnerability description information to obtain data slices; cleaning and formatting the data slices to generate representation information; performing BERT model training by using the representation information to obtain a vector representation, wherein the vector representation carries rich semantic information and context information; extracting the name and version of the software package affected by the vulnerability based on the vector representation; comparing the extracted software package name and version with the corresponding information in the CPE file; if the comparison is consistent, the vulnerability data is considered to have no error; otherwise, the vulnerability data is judged to contain errors and is corrected according to the extracted software package name and version.

Description

Entity-relationship-based vulnerability data error correction method and system
Technical Field
The invention relates to the technical field of network security, and in particular to an entity-relationship-based vulnerability data error correction method and system.
Background
With the rapid development of information networks, network attack techniques emerge endlessly, and attacks are generally directed at vulnerabilities in system software or application software, so discovering software vulnerabilities in time and patching them promptly are important technical means for maintaining network security. Various network security platforms and enterprises regularly publish newly discovered vulnerabilities. The NVD (National Vulnerability Database) is the United States national vulnerability database; it contains vulnerability data from 2000 to 2017 (a total of 5 million vulnerabilities across 23 vulnerability types), stored in XML format for use by software security researchers. Much security detection software uses NVD vulnerability data, but in actual software development it has been found that some vulnerability-related data in the NVD contains errors. To improve the accuracy and comprehensiveness of the security detection of the developed software, it is therefore necessary to correct the public vulnerability data acquired from the NVD.
Disclosure of Invention
In view of the above, the present invention has been developed to provide a solution that overcomes, or at least partially solves, the above-mentioned problems.
The invention provides an entity-relationship based vulnerability data error correction method, which comprises the following steps:
acquiring vulnerability description information from a vulnerability database, and performing word segmentation processing on the vulnerability description information to obtain a data slice;
cleaning and formatting the data slices to generate representation information;
performing BERT model training by using the representation information to obtain a vector representation, wherein the vector representation carries rich semantic information and context information;
extracting the name and version of the software package affected by the vulnerability based on the vector representation;
comparing the extracted software package name and version with the corresponding information in a CPE file respectively;
if the comparison is consistent, the vulnerability data is considered to have no error; otherwise, the vulnerability data is judged to contain errors and is corrected according to the extracted software package name and version.
Optionally, extracting the name and version of the software package affected by the vulnerability based on the vector representation includes: performing entity extraction and relationship extraction on the vector representation by using an LSTM neural network model; and determining the name and version of the software package affected by the vulnerability based on the extracted entity features and relationship features.
Optionally, according to the corrected vulnerability data, entity and relationship data related to the vulnerability are generated based on a regular expression structure, and a knowledge graph is constructed.
The invention also provides an entity-relationship-based vulnerability data error correction system, which comprises:
the vulnerability description information acquisition module is used for acquiring vulnerability description information from a vulnerability database and performing word segmentation processing on the vulnerability description information to obtain data slices;
the preprocessing module is used for cleaning and formatting the data slices to generate representation information;
the BERT training module is used for performing BERT model training by using the representation information to obtain a vector representation, wherein the vector representation carries rich semantic information and context information;
the target information extraction module is used for extracting the name and version of the software package affected by the vulnerability based on the vector representation;
the information comparison module is used for comparing the extracted software package name and version with the corresponding information in a CPE file respectively;
the vulnerability data correction module is used for determining that the vulnerability data has no error if the comparison is consistent; otherwise, judging that the vulnerability data contains errors, and correcting the vulnerability data according to the extracted software package name and version.
Optionally, the target information extraction module includes: an entity/relationship extraction sub-module, which is used for performing entity extraction and relationship extraction on the vector representation by using an LSTM neural network model; and a software package name/version determination sub-module, which determines the name and version of the software package affected by the vulnerability based on the extracted entity features and relationship features.
Optionally, the system further comprises: a knowledge graph construction module, which is used for generating, according to the corrected vulnerability data, entity and relationship data related to the vulnerability based on a regular expression structure, and for constructing a knowledge graph.
With the method and system of the invention, publicly disclosed vulnerability data can be corrected and supplemented, so that more comprehensive and accurate vulnerability data can be provided, laying a data foundation for improving the accuracy and comprehensiveness of subsequent software security detection.
The above description is only an overview of the technical solution of the present invention. In order that the technical means of the present invention may be understood more clearly, and that its objects, features, and advantages may become more readily apparent, embodiments of the invention are described below.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
Fig. 1 shows a screenshot of the detailed description information of a CVE vulnerability;
Fig. 2 shows a screenshot of the CPE information corresponding to Fig. 1;
Fig. 3 shows a flowchart of the entity-relationship-based vulnerability data error correction method proposed by the present invention.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
The full English name of CVE is "Common Vulnerabilities & Exposures". CVE acts as a dictionary, giving a common name to each widely recognized or already exposed information security vulnerability. Using a common name helps users share data among their own separate vulnerability databases and vulnerability assessment tools, even though these tools are not easily integrated with one another. This makes CVE a "key" to sharing security information. If a vulnerability report indicates a vulnerability that has a CVE name, the corresponding fix information can quickly be found in any other CVE-compatible database. A complete CVE entry contains six parts: metadata, information on the software affected by the vulnerability, the vulnerability problem type, references and vulnerability introduction, configurations, and vulnerability impact and score. At present, platforms that publish vulnerability information include CVE and NVD in the United States, and CNNVD and CNVD in China.
For each disclosed vulnerability, each database records the common CVE vulnerability number and its own specific fields. For example, the main fields of the NVD database include the vulnerability name, type, description (desc), discovery date, publication date, modification date, severity, solution, affected software, and so on. The description field contains a more detailed vulnerability description; analysis shows that this description is generally accurate, is in plain text form, and is well suited to word segmentation for generating representation information.
Information about the software affected by a vulnerability is typically expressed through CPE files. CPE (Common Platform Enumeration) is a structured naming scheme for information technology systems and software packages proposed by the NVD (National Vulnerability Database). The list of resources affected by a vulnerability in the vulnerability database is typically given in CPE format. CPE defines 11 attributes, namely: Part (type), Vendor, Product, Version, Update (update version, such as an update patch version of the product), Edition (legacy edition field retained for backward compatibility), SW_Edition (product edition for a specific market or class of users, such as professional or standard), Target_SW (software environment in which the product runs, such as Android), Target_HW (hardware architecture on which the product runs, such as x64), Language (such as en-us or ja-jp), and Other (other properties). Because CPE is a formatted file, information such as the name and version of the software affected by the vulnerability can be extracted from the CPE file directly. After repeated study and verification of vulnerability database data, it was found that some software affected by a vulnerability is described in the vulnerability description field but is not mentioned in the CPE file, and the description field has in practice proved to be more accurate. The present invention therefore seeks to correct the affected-software information of a vulnerability according to its vulnerability description.
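To make the CPE attribute layout concrete, the following minimal sketch (illustrative only, not part of the original text) splits a CPE 2.3 formatted string into its named attributes; the example string and helper function are assumptions for illustration, and the naive split ignores colons escaped inside attribute values.

```python
# Minimal sketch: split a CPE 2.3 formatted string into its named attributes.
# Field order follows the CPE 2.3 specification; the example string is illustrative.
CPE_FIELDS = [
    "cpe_version", "part", "vendor", "product", "version", "update",
    "edition", "language", "sw_edition", "target_sw", "target_hw", "other",
]

def parse_cpe(cpe_string: str) -> dict:
    """Parse 'cpe:2.3:a:vendor:product:version:...' into an attribute dictionary."""
    fields = cpe_string.split(":")          # naive split; ignores escaped ':' in values
    if fields[0] != "cpe" or fields[1] != "2.3":
        raise ValueError(f"not a CPE 2.3 formatted string: {cpe_string}")
    return dict(zip(CPE_FIELDS, fields[1:]))

example = "cpe:2.3:a:w1.fi:wpa_supplicant:2.8:*:*:*:*:*:*:*"
attrs = parse_cpe(example)
print(attrs["product"], attrs["version"])   # wpa_supplicant 2.8
```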
As a specific example, the description part of the vulnerability details for vulnerability number CVE-2019-13377 states that the affected software, hostapd and wpa_supplicant versions 2.x up to and including 2.8, is vulnerable to side-channel attacks when Brainpool curves are used, resulting in observable timing differences and cache access patterns; a screenshot of the vulnerability details is shown in Fig. 1. However, in the corresponding CPE data, see Fig. 2, no data relating to hostapd or the wpa_supplicant 2.x versions is listed at all. When publicly disclosed vulnerability information is used to construct a knowledge graph and software security detection is then performed based on that knowledge graph, the underlying vulnerability information directly affects the security detection result; likewise, when publicly disclosed vulnerability information is used directly as the data basis for developing security detection software, such errors also have an impact. The invention therefore aims to correct the vulnerability information.
The invention provides an entity-relationship-based vulnerability data error correction method, which comprises the following steps (a brief illustrative sketch of steps S1 and S2 is given after the list):
s1, acquiring vulnerability description information from a vulnerability database, and performing word segmentation processing on the vulnerability description information to obtain data slices;
s2, cleaning and formatting the data slices to generate representation information;
s3, performing BERT model training by using the representation information to obtain vector representation, wherein the vector representation has abundant semantic information and context information;
s4, determining the name and version of the software package influenced by the vulnerability based on the vector characterization;
s5, comparing the determined software package name and version with corresponding information in CPE data respectively;
s61, if the comparison is consistent, the vulnerability data is considered to have no error;
and S62, if not, judging that the bug data has errors, and correcting the bug data according to the determined software package name and version.
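For concreteness, the following minimal sketch illustrates steps S1 and S2 (word segmentation and cleaning/formatting); the tokenization rule and stop-word list are assumptions for illustration only, and a practical implementation would normally reuse the tokenizer of the chosen BERT model.

```python
import re

def segment_description(description: str) -> list:
    """Step S1 sketch: split a vulnerability description into word-level data slices."""
    # Keep characters common in package names ('_', '-', '.') inside a single token.
    return re.findall(r"[A-Za-z0-9_.\-]+", description)

def clean_and_format(tokens: list) -> list:
    """Step S2 sketch: lowercase, strip stray punctuation, drop trivial stop words."""
    stop_words = {"the", "a", "an", "of", "in", "and", "are", "to"}   # illustrative only
    cleaned = [t.lower().strip(".") for t in tokens]
    return [t for t in cleaned if t and t not in stop_words]

description = ("The implementations of SAE and EAP-pwd in hostapd and wpa_supplicant "
               "2.x through 2.8 are vulnerable to side-channel attacks.")
print(clean_and_format(segment_description(description)))
```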
With this method and system, the disclosed vulnerability data can be corrected and supplemented, so that more comprehensive and accurate vulnerability data can be provided, laying a data foundation for improving the accuracy and comprehensiveness of subsequent software security detection.
BERT, the pre-trained model for natural language understanding produced by Google, far outperforms earlier models on natural language processing tasks.
The BERT algorithm consists of two stages. First, a representation is learned by unsupervised pre-training on a large amount of unlabeled corpus data. Second, the pre-trained model is fine-tuned in a supervised manner using a small amount of labeled training data to perform various supervised tasks. Pre-trained machine learning models have been successful in a variety of fields, including image processing and natural language processing (NLP). Since BERT is a pre-trained model, it uses only an encoder to learn latent representations of the input text.
The invention selects the BERT model for natural language pre-training because BERT has excellent natural language training performance. First, it introduced the pre-training tasks Masked Language Model (MLM) and Next Sentence Prediction (NSP); second, BERT is trained with a large amount of data and computing power. MLM enables BERT to learn from text bidirectionally, that is, the model learns the context of a word from the words before and after it. The MLM pre-training task converts the text into tokens and uses the token representations as input and output for training; 15% of the tokens are randomly selected and masked, i.e., hidden in the training input, and an objective function is used to predict the correct content of the masked tokens. By contrast, traditional training either uses unidirectional prediction as the objective or approximates bidirectionality with two unidirectional passes, one left-to-right and one right-to-left. The NSP task allows BERT to learn relationships between sentences by predicting whether a subsequent sentence actually follows the previous one; the training data consists of 50% correctly ordered sentence pairs plus 50% randomly paired sentences. BERT is trained on the MLM and NSP objectives simultaneously.
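The patent does not name a particular checkpoint or library for this step; purely as an illustration, the following minimal sketch assumes the HuggingFace transformers library and the public bert-base-uncased checkpoint to obtain one contextual vector per token of a vulnerability description.

```python
import torch
from transformers import BertModel, BertTokenizer

# Assumption for illustration: the public bert-base-uncased checkpoint; the method
# only requires a BERT model trained with the representation information.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

text = "hostapd and wpa_supplicant 2.x through 2.8 are vulnerable to side-channel attacks"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per sub-word token, shape (1, sequence_length, 768).
token_vectors = outputs.last_hidden_state
print(token_vectors.shape)
```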
The representation information is used for BERT model training to obtain a vector representation, which carries rich semantic information and context information. Analysis of the context information can be done with an LSTM (Long Short-Term Memory) neural network. An LSTM is a special type of RNN that can learn long-term dependency information. All RNNs take the form of a chain of repeated neural network modules; in a standard RNN, this repeated module has a very simple structure, such as a single tanh layer. The key to the LSTM is the cell state, which is analogous to a conveyor belt: it runs directly along the entire chain with only a few linear interactions, so it is easy for information to flow along it unchanged. The LSTM can remove information from or add information to the cell state through carefully designed structures called "gates". A gate is a way of selectively letting information through; it consists of a sigmoid neural network layer and a pointwise multiplication operation. The sigmoid layer outputs a value between 0 and 1 describing how much of each component should be let through: 0 means "let nothing through" and 1 means "let everything through".
An LSTM has three gates that protect and control the cell state.
Forget gate:
Acts on: the cell state.
Function: selectively forgets information in the cell state.
As a language model, the next word is predicted based on what has already been seen. In this setting, the cell state may contain the category of the current subject so that the correct pronoun can be selected; when a new subject appears, we want to forget the old subject.
For example, in a sentence such as "He is here today, so I ...", when the word "I" is processed the model should selectively forget the earlier subject "he", or at least reduce its influence on the following words.
Input gate:
Acts on: the cell state.
Function: selectively records new information into the cell state.
In the language model example, we wish to add the category of the new subject to the cell state, replacing the old subject that needs to be forgotten.
For example, when the word "I" is processed, the new subject "I" is written into the cell state.
Output gate:
Acts on: the hidden state h_t.
In the language model example, having just seen a pronoun, the model may need to output information relevant to a verb; for instance, it can output whether the pronoun is singular or plural, so that if a verb follows we know what form the verb should take.
For example, when processing the word "I", it can be predicted that the next word is likely to be a verb in the first person.
The relevant information is saved to the hidden state.
The first step in the LSTM is to decide what information to discard from the cell state. This decision is made by the sigmoid layer of the forget gate: the gate reads h_{t-1} and x_t and outputs, for each number in the cell state C_{t-1}, a value between 0 and 1, where 1 means "keep this completely" and 0 means "discard this completely".
For a language model predicting the next word from what has been seen, the cell state may include the gender of the current subject so that the correct pronouns can be selected; when a new subject appears, we want to forget the old one.
The next step is to decide what new information to store in the cell state. This has two parts. First, a sigmoid layer called the "input gate layer" decides which values will be updated. Then a tanh layer creates a vector of new candidate values, C̃_t, to be added to the state.
Next, the two pieces of information are combined to produce the update to the state. In the language model, this is where we add the gender of the new subject to the cell state, replacing the old subject that needs to be forgotten.
It is now time to update the old cell state: C_{t-1} is updated to C_t. The previous steps have already decided what to do; now we actually do it.
We multiply the old state by f_t, discarding the information we decided to forget, and then add i_t * C̃_t, the new candidate values scaled by how much we decided to update each state component.
In the language model example, this is where the gender information of the old subject is actually dropped and the new information is added, according to the decisions made in the previous steps.
Finally, we decide what to output. The output is based on the cell state, but is a filtered version of it. First, a sigmoid layer decides which parts of the cell state will be output; then the cell state is passed through tanh (pushing the values between -1 and 1) and multiplied by the output of the sigmoid gate, so that only the chosen parts are output.
In the language model example, having just seen a pronoun, the model may need to output information relevant to a verb; for instance, it can output whether the pronoun is singular or plural, so that if a verb follows we know what form the verb should take.
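For reference, the gate computations described above can be written compactly in the standard LSTM form (this is the usual textbook formulation rather than an equation set given in the original text; W and b denote learned weights and biases, σ the sigmoid function, and ⊙ element-wise multiplication):

```latex
\begin{aligned}
f_t &= \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right) && \text{(forget gate)}\\
i_t &= \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right) && \text{(input gate)}\\
\tilde{C}_t &= \tanh\left(W_C\,[h_{t-1}, x_t] + b_C\right) && \text{(candidate values)}\\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t && \text{(cell state update)}\\
o_t &= \sigma\left(W_o\,[h_{t-1}, x_t] + b_o\right) && \text{(output gate)}\\
h_t &= o_t \odot \tanh(C_t) && \text{(hidden state)}
\end{aligned}
```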
Following the above description of the LSTM neural network, the word segments (data slices) encoded by the BERT model are input into the LSTM network, which analyzes and judges the entities and the relationships between them to obtain a vector representation; the name and version of the software package affected by the vulnerability are then determined based on that vector representation.
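The patent specifies an LSTM neural network model for entity and relationship extraction but gives no architectural details; the following minimal sketch shows one plausible shape of that step, a bidirectional LSTM tagger over the BERT token vectors with an assumed BIO label set (B-SOFTWARE, B-VERSION, etc.). The layer sizes, labels, and class name are illustrative assumptions, not the patent's implementation.

```python
import torch
import torch.nn as nn

LABELS = ["O", "B-SOFTWARE", "I-SOFTWARE", "B-VERSION", "I-VERSION"]   # assumed label set

class EntityTagger(nn.Module):
    """BiLSTM tagger over BERT token vectors (illustrative sketch)."""
    def __init__(self, bert_dim: int = 768, hidden: int = 256):
        super().__init__()
        self.lstm = nn.LSTM(bert_dim, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, len(LABELS))

    def forward(self, token_vectors: torch.Tensor) -> torch.Tensor:
        # token_vectors: (batch, seq_len, bert_dim) produced by the BERT encoder.
        lstm_out, _ = self.lstm(token_vectors)
        return self.classifier(lstm_out)            # (batch, seq_len, num_labels)

tagger = EntityTagger()
dummy_bert_vectors = torch.randn(1, 12, 768)        # stands in for real BERT output
logits = tagger(dummy_bert_vectors)
predicted = [LABELS[i] for i in logits.argmax(dim=-1)[0].tolist()]
print(predicted)
```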
Then, the determined software package name and version are compared with the corresponding information in the CPE data. If they are consistent, the vulnerability data is considered to have no error; otherwise, the vulnerability data is judged to contain errors and is corrected according to the determined software package name and version, so that corrected vulnerability-related data based on the NVD database is obtained.
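A minimal sketch of this comparison and correction step (steps S5, S61 and S62) is given below; the function names, data shapes, and the choice to append missing package/version pairs as corrections are illustrative assumptions rather than the patent's implementation.

```python
def cpe_product_version(cpe_string: str) -> tuple:
    # CPE 2.3 formatted string: cpe:2.3:part:vendor:product:version:...
    fields = cpe_string.split(":")
    return fields[4], fields[5]

def check_and_correct(extracted_pairs: list, cpe_strings: list) -> dict:
    """Compare (package, version) pairs from the description with the CPE data.

    If every extracted pair already appears in the CPE data, the vulnerability
    record is treated as error-free; otherwise the missing pairs are returned
    as corrections to be added to the affected-software information.
    """
    cpe_pairs = {cpe_product_version(s) for s in cpe_strings}
    missing = [pair for pair in extracted_pairs if pair not in cpe_pairs]
    return {
        "consistent": not missing,
        "corrections": missing,
        "affected_software": sorted(cpe_pairs | set(missing)),
    }

extracted = [("hostapd", "2.8"), ("wpa_supplicant", "2.8")]
cpes = ["cpe:2.3:a:w1.fi:wpa_supplicant:2.8:*:*:*:*:*:*:*"]
print(check_and_correct(extracted, cpes))
```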
From the corrected vulnerability-related data in the vulnerability database, entity and relationship data related to each vulnerability are then generated based on regular expressions, and a knowledge graph is constructed.
A knowledge graph typically consists of a number of nodes and edges with independent meaning. A node represents an entity, such as a software project ID, a vulnerability ID, or any user-defined entity type; an edge represents a relationship between two different entities. Both nodes and edges carry attributes describing their internal features, such as the ID, name, release date, and corresponding CWE ID of a CVE-type node. A knowledge graph constructed from the corrected vulnerability-related entity and relationship data is more accurate and comprehensive, which facilitates subsequent security detection of network system software or application software based on the knowledge graph.
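A minimal sketch of generating vulnerability-related entities and relations with a regular expression and loading them into a graph follows; it assumes the networkx library, and the regular expression, node attributes, and the relation name "affects" are illustrative choices rather than the patent's schema.

```python
import re
import networkx as nx

# A corrected vulnerability record (illustrative content).
corrected_record = {
    "cve_id": "CVE-2019-13377",
    "affected": "hostapd 2.8; wpa_supplicant 2.8",
}

graph = nx.DiGraph()
graph.add_node(corrected_record["cve_id"], type="CVE")

# Extract (package, version) pairs such as "wpa_supplicant 2.8" with a regular expression.
for package, version in re.findall(r"([A-Za-z][\w.\-]*)\s+(\d+(?:\.\d+)*)",
                                   corrected_record["affected"]):
    graph.add_node(package, type="software", version=version)
    graph.add_edge(corrected_record["cve_id"], package, relation="affects")

print(list(graph.edges(data=True)))
```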
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding the understanding of one or more of the various inventive aspects. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim.

Claims (6)

1. An entity-relationship-based vulnerability data error correction method is characterized by comprising the following steps:
acquiring vulnerability description information from a vulnerability database, and performing word segmentation processing on the vulnerability description information to obtain data slices;
cleaning and formatting the data slices to generate representation information;
performing BERT model training by using the representation information to obtain vector representation, wherein the vector representation has abundant semantic information and context information;
extracting the name and version of the software package affected by the vulnerability based on the vector representation;
comparing the extracted software package name and version with the corresponding information in a CPE file respectively;
if the comparison is consistent, the vulnerability data is considered to have no error; otherwise, the vulnerability data is judged to contain errors and is corrected according to the extracted software package name and version.
2. The method of claim 1, wherein extracting the name and version of the software package affected by the vulnerability based on the vector representation comprises:
performing entity extraction and relationship extraction on the vector representation by using an LSTM neural network model;
and determining the name and version of the software package affected by the vulnerability based on the extracted entity features and relationship features.
3. The method of claim 1, wherein the method further comprises: generating, according to the corrected vulnerability data, entity and relationship data related to the vulnerability based on a regular expression structure, and constructing a knowledge graph.
4. An entity-relationship based vulnerability data error correction system, the system comprising:
the vulnerability description information acquisition module is used for acquiring vulnerability description information from a vulnerability database and performing word segmentation processing on the vulnerability description information to obtain data slices;
the preprocessing module is used for cleaning and formatting the data slices to generate representation information;
the BERT training module is used for carrying out BERT model training by utilizing the representation information to obtain vector representation, and the vector representation has rich semantic information and context information;
the target information extraction module is used for extracting the name and version of the software package affected by the vulnerability based on the vector representation;
the information comparison module is used for comparing the extracted software package name and version with the corresponding information in the CPE file respectively;
the vulnerability data correction module is used for determining that the vulnerability data has no error if the comparison is consistent; otherwise, judging that the vulnerability data contains errors, and correcting the vulnerability data according to the extracted software package name and version.
5. The system of claim 4, wherein the target information extraction module comprises:
an entity/relationship extraction sub-module, which is used for performing entity extraction and relationship extraction on the vector representation by using an LSTM neural network model;
and a software package name/version determination sub-module, which is used for determining the name and version of the software package affected by the vulnerability based on the extracted entity features and relationship features.
6. The system of claim 4, wherein the system further comprises: a knowledge graph construction module, which is used for generating, according to the corrected vulnerability data, entity and relationship data related to the vulnerability based on a regular expression structure, and for constructing a knowledge graph.
CN202210917536.4A 2022-08-01 2022-08-01 Entity-relationship-based vulnerability data error correction method and system Pending CN115455945A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210917536.4A CN115455945A (en) 2022-08-01 2022-08-01 Entity-relationship-based vulnerability data error correction method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210917536.4A CN115455945A (en) 2022-08-01 2022-08-01 Entity-relationship-based vulnerability data error correction method and system

Publications (1)

Publication Number Publication Date
CN115455945A true CN115455945A (en) 2022-12-09

Family

ID=84296428

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210917536.4A Pending CN115455945A (en) 2022-08-01 2022-08-01 Entity-relationship-based vulnerability data error correction method and system

Country Status (1)

Country Link
CN (1) CN115455945A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116089964A (en) * 2023-03-06 2023-05-09 天翼云科技有限公司 Software package processing method, device, electronic equipment and readable storage medium


Similar Documents

Publication Publication Date Title
Zhang et al. Dependency sensitive convolutional neural networks for modeling sentences and documents
Liu et al. DeepBalance: Deep-learning and fuzzy oversampling for vulnerability detection
US20220405592A1 (en) Multi-feature log anomaly detection method and system based on log full semantics
US20210271822A1 (en) Encoder, system and method for metaphor detection in natural language processing
Geiger et al. Posing fair generalization tasks for natural language inference
Liu et al. Neural code completion
CN113672931B (en) Software vulnerability automatic detection method and device based on pre-training
Nazar et al. Feature-based software design pattern detection
CN113434858B (en) Malicious software family classification method based on disassembly code structure and semantic features
Chaturvedi et al. Lyapunov filtering of objectivity for Spanish sentiment model
CN115329088B (en) Robustness analysis method of graph neural network event detection model
He et al. You only prompt once: On the capabilities of prompt learning on large language models to tackle toxic content
Katinskaia et al. Assessing grammatical correctness in language learning
CN115455945A (en) Entity-relationship-based vulnerability data error correction method and system
Sedova et al. Knodle: modular weakly supervised learning with PyTorch
Wang et al. Know What I don’t Know: Handling Ambiguous and Unknown Questions for Text-to-SQL
CN115185920A (en) Method, device and equipment for detecting log type
Wang et al. Know What I don't Know: Handling Ambiguous and Unanswerable Questions for Text-to-SQL
Chandra et al. An Enhanced Deep Learning Model for Duplicate Question Detection on Quora Question pairs using Siamese LSTM
CN111723301A (en) Attention relation identification and labeling method based on hierarchical theme preference semantic matrix
Lobanov et al. Predicting tags for programming tasks by combining textual and source code data
Kumar et al. Recurrent Neural Network Architecture for Communication Log Analysis
Romanov et al. Prediction of types in python with pre-trained graph neural networks
Qin et al. Scg_fbs: a code grading model for students’ program in programming education
Rehbein et al. Sprucing up the trees–error detection in treebanks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination