CN112131882B

CN112131882B - Multi-source heterogeneous network security knowledge graph construction method and device

Info

Publication number: CN112131882B
Application number: CN202011059788.5A
Authority: CN
Inventors: 章瑞康; 袁军; 周娟; 李文瑾
Original assignee: Nsfocus Technologies Inc; Nsfocus Technologies Group Co Ltd
Current assignee: Nsfocus Technologies Inc; Nsfocus Technologies Group Co Ltd
Priority date: 2020-09-30
Filing date: 2020-09-30
Publication date: 2024-02-13
Anticipated expiration: 2040-09-30
Also published as: CN112131882A

Abstract

The invention discloses a method and a device for constructing a multi-source heterogeneous network security knowledge graph. The method comprises the following steps: in response to a trigger request for constructing a network security knowledge graph, extracting matching entities and inter-entity relations from a semi-structured data set and a structured data set according to the entities and inter-entity relations defined by a preset network security knowledge ontology, and generating triplets; identifying, from an unstructured data set, entities that match the entities defined by the network security knowledge ontology, using a preset identification mode chosen according to the category of each entity, wherein the data in the unstructured data set are text data; inputting the text data into a word vector recognition model to obtain the word vector of each entity; inputting entity pairs selected according to preset rules, together with their word vectors, into a relation extraction model to obtain the relation between each entity pair, and generating triplets fused with the entities' word vectors from the entity pairs, their word vectors, and the relations between them; and constructing the network security knowledge graph from all of the triplets.

Description

Multi-source heterogeneous network security knowledge graph construction method and device

Technical Field

The invention relates to the technical field of information security, in particular to a method and a device for constructing a multi-source heterogeneous network security knowledge graph.

Background

With the arrival of the big data era and the growing complexity of the network security environment, network attacks have become frequent. To protect cyberspace, enterprises deploy service systems such as firewalls, intrusion detection, and intrusion prevention to monitor security threats in the network in real time at multiple levels, including viruses, attacks, vulnerabilities, and weaknesses, which generates a large amount of network security event data such as alarm information and monitoring logs. Meanwhile, a great deal of information and knowledge related to network security is available on the network, such as the vulnerability data sets CVE (Common Vulnerabilities and Exposures) and CNNVD (China National Vulnerability Database of Information Security), the attack type data set CAPEC (Common Attack Pattern Enumeration and Classification), the attack technique data set ATT&CK (Adversarial Tactics, Techniques, and Common Knowledge, which reflects attack behavior at each stage of the attack lifecycle), the network asset data set CPE (Common Platform Enumeration), and threat intelligence text data such as security event reports and security community blogs published by network security vendors and security analysts. These massive, fragmented, heterogeneous network security data lack links to one another, and cyber threat intelligence analysts find them difficult to acquire and integrate, so comprehensive and accurate security analysis cannot be performed.

A knowledge graph is essentially a semantic network that reveals relationships between entities. It stores knowledge in the form of a graph and aims to identify, discover, and infer complex relationships between things and concepts from data. It is a computable model of the relationships between things: it organizes and reasons over logical relationships in a way similar to human cognitive learning, and displays data relationships visually. Knowledge graphs are mainly constructed from semi-structured data and unstructured data. Knowledge graph technology can integrate network security information with threat intelligence, which eases the difficulty of sharing and reusing multi-source heterogeneous data and helps network security analysts perform security analysis comprehensively and intuitively.

In the existing construction of multi-source heterogeneous network security knowledge graphs, the selection of data sets is not comprehensive enough. For example, some network security knowledge graphs are constructed only from semi-structured data, extracting information from semi-structured data sets such as security vulnerability data sets and network attack type data sets. Here, information extraction means knowledge extraction, which comprises entity identification and relation extraction. In reality, however, most threat intelligence does not first appear as structured data: security vendors or security researchers publish the latest threat intelligence through unstructured data such as reports and blog articles. A network security knowledge graph built only from semi-structured data sets therefore often cannot incorporate the latest threat intelligence in time and cannot support more comprehensive and timely association analysis, so the knowledge graph should be constructed by extracting information from semi-structured and unstructured data sets together. In the prior art, however, entities are mainly extracted from unstructured data sets by rule matching, and relations are mainly extracted by methods based on CRF (Conditional Random Field). For network security entities that are complex, newly emerging, or written in mixed Chinese and English, the extracted entity features are insufficient, so identification accuracy is low, which in turn reduces the credibility of the constructed network security knowledge graph.

Disclosure of Invention

To solve the problem that, in the prior art, inaccurate and insufficient knowledge extraction makes the constructed multi-source heterogeneous network security knowledge graph unreliable, embodiments of the invention provide a multi-source heterogeneous network security knowledge graph construction method and device.

In a first aspect, an embodiment of the present invention provides a method for constructing a multi-source heterogeneous network security knowledge graph, including:

responding to a trigger request for constructing a network security knowledge graph, and extracting matching entities and inter-entity relations from a semi-structured data set and a structured data set in a collected network security domain data set according to the entities and inter-entity relations defined by a preset network security knowledge ontology, to generate structured Resource Description Framework (RDF) triplets, wherein the network security knowledge ontology is constructed according to the relevant network security standards of the network security domain, and defines the entities of the network security domain and the relations between them;

identifying, from an unstructured data set in the network security domain data set, entities that match the entities defined by the network security knowledge ontology, using a preset identification mode chosen according to the category of each entity, wherein the data in the unstructured data set are text data;

inputting the text data into a word vector recognition model to obtain word vectors of the entities identified from the unstructured data set;

selecting entity pairs according to the character interval between every two adjacent entities identified from the unstructured data set, inputting the selected entity pairs and their word vectors into a relation extraction model to obtain the relation between each entity pair, and generating structured RDF triplets fused with the entities' word vectors from the entity pairs, their word vectors, and the relations between the entity pairs;

and constructing a network security knowledge graph according to the structured RDF triplets and the structured RDF triplets fused with the entities' word vectors.

In the method for constructing the multi-source heterogeneous network security knowledge graph provided by the embodiment of the invention, a computer device, in response to a trigger request for constructing a network security knowledge graph, extracts from the semi-structured data set and the structured data set in the collected network security domain data set the entities and inter-entity relations that match those defined by the network security knowledge ontology, and generates structured RDF (Resource Description Framework) triplets. The network security knowledge ontology is constructed according to the relevant network security standards of the network security domain and defines the entities of that domain and the relations between them. The computer device then identifies, from the unstructured data set in the network security domain data set, entities that match the entities defined by the ontology, using a preset identification mode chosen according to the category of each entity; the data in the unstructured data set are text data. It inputs the text data into a word vector recognition model to obtain the word vectors of the entities identified in the unstructured data set, selects entity pairs according to the character interval between every two adjacent identified entities, and inputs the selected entity pairs and their word vectors into a relation extraction model to obtain the relation between each entity pair. From the entity pairs, their word vectors, and the relations between them, it generates structured RDF triplets fused with the entities' word vectors, and finally constructs the network security knowledge graph from the structured RDF triplets and the structured RDF triplets fused with the entities' word vectors. Compared with the prior art, entities in the unstructured data set are identified with different identification modes for different entity categories, which makes entity identification more accurate and comprehensive. The word vectors of the entities are also obtained and associated with the entities, and the effect of the character interval between adjacent entities is taken into account when selecting entity pairs, so that more effective entity pairs are selected. The selected entity pairs and their word vectors are then used as the input of the relation extraction model to extract the relations between the entity pairs.
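As a rough illustration of the first step only, the sketch below turns one semi-structured record into triplets, keeping only relations that an ontology allows. The record layout, the field names (`id`, `cwe`, `cpe`), the relation names (`hasWeakness`, `affects`), the function `extract_triples`, and all identifiers in the sample record are hypothetical; the specification does not fix any of them.

```python
from typing import Any, Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)

# Hypothetical ontology fragment: (subject category, relation, object category).
ONTOLOGY_RELATIONS = {
    ("CVE", "hasWeakness", "CWE"),
    ("CVE", "affects", "CPE"),
}

# Hypothetical mapping from a record field to (relation, object category).
FIELD_MAP = {
    "cwe": ("hasWeakness", "CWE"),
    "cpe": ("affects", "CPE"),
}


def extract_triples(record: Dict[str, Any]) -> List[Triple]:
    """Emit triplets only for fields whose relation the ontology defines."""
    subject = str(record["id"])
    triples: List[Triple] = []
    for field, (relation, obj_category) in FIELD_MAP.items():
        if ("CVE", relation, obj_category) not in ONTOLOGY_RELATIONS:
            continue  # relation not defined by the ontology, so skip it
        for value in record.get(field, []):
            triples.append((subject, relation, str(value)))
    return triples


# Made-up record; the identifiers do not refer to real entries.
record = {
    "id": "CVE-2099-0001",
    "cwe": ["CWE-9999"],
    "cpe": ["cpe:2.3:a:example:app:1.0:*:*:*:*:*:*:*"],
    "note": "fields the ontology does not cover are ignored",
}
triples = extract_triples(record)
```

Keeping the ontology as an explicit set of allowed (category, relation, category) tuples means the same check can be reused when screening relations extracted from text.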

Preferably, identifying, from the unstructured data set, the entities that match the entities of the network security domain, using a preset identification mode chosen according to the category of each entity, specifically includes:

identifying matched entities of a preset category from the text data according to a preset regular expression;

and identifying the entities except the entities of the preset category from the text data by using an entity identification model, wherein the entity identification model is obtained by performing sequence labeling on the entities in the network security field in a sample set and then training according to a preset training model.

In the above preferred embodiment, when the data in the unstructured data set are text data, entities that match the entities of the network security domain defined by the network security knowledge ontology are identified as follows: entities of the preset categories are identified from the text data by regular expression matching, and the remaining matching entities, those outside the preset categories, are identified from the text data by the entity identification model. This makes entity identification more accurate and comprehensive.
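The two-mode identification described above might be sketched as follows. This is a minimal sketch under stated assumptions: the choice of preset categories (`CVE`, `IPv4`, `MD5`), their patterns, and the function names are illustrative, and `toy_model` is a dictionary lookup standing in for the trained sequence-labeling model, which is not reproduced here. The sample text and names are made up.

```python
import re
from typing import Callable, List, Tuple

Entity = Tuple[str, str, int, int]  # (category, text, start, end)

# Assumed preset categories with a regular surface form (illustrative patterns).
PRESET_PATTERNS = {
    "CVE": re.compile(r"CVE-\d{4}-\d{4,}"),
    "IPv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "MD5": re.compile(r"\b[a-fA-F0-9]{32}\b"),
}


def regex_entities(text: str) -> List[Entity]:
    """Identify entities of the preset categories by regular expression."""
    found: List[Entity] = []
    for category, pattern in PRESET_PATTERNS.items():
        for m in pattern.finditer(text):
            found.append((category, m.group(), m.start(), m.end()))
    return found


def recognize(text: str, model: Callable[[str], List[Entity]]) -> List[Entity]:
    """Merge regex hits with model hits, dropping model hits that overlap a regex hit."""
    by_regex = regex_entities(text)

    def overlaps(e: Entity) -> bool:
        return any(e[2] < r[3] and r[2] < e[3] for r in by_regex)

    by_model = [e for e in model(text) if not overlaps(e)]
    return sorted(by_regex + by_model, key=lambda e: e[2])


def toy_model(text: str) -> List[Entity]:
    """Stand-in for the entity identification model: a tiny dictionary lookup."""
    hits: List[Entity] = []
    for name, category in {"ExampleBot": "Malware", "GroupX": "ThreatActor"}.items():
        start = text.find(name)
        if start >= 0:
            hits.append((category, name, start, start + len(name)))
    return hits


text = "GroupX used ExampleBot, which exploits CVE-2099-0001 from 10.0.0.5."
entities = recognize(text, toy_model)
```

Passing the model in as a callable keeps the regex path testable on its own and lets any sequence labeler be substituted later.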

Preferably, selecting entity pairs according to the character interval between every two adjacent entities identified from the unstructured data set specifically includes:

determining the number of characters between every two adjacent entities identified from the unstructured data set;

deleting the entity pairs for which the number of characters in between is greater than a preset threshold and which do not conform to the relations defined by the network security knowledge ontology;

determining the remaining entity pairs as the selected entity pairs.

In the above preferred embodiment, the entities identified from the unstructured data set need to be screened. When the number of characters between two adjacent entities is greater than a preset threshold and the relation between the two does not conform to the relations defined by the network security knowledge ontology, the entity pair is deleted, and the remaining entity pairs are determined as the selected entity pairs. Because entities separated by more than a certain character interval are considered to have no definite semantic relation, only adjacent entities within a certain distance of each other are paired; that is, entity pairs are selected within a certain distance range using a sliding window mechanism, which makes the selected entity pairs more effective.
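The screening rule can be sketched as below. The sketch follows the literal wording, deleting a pair only when the gap exceeds the threshold and the category pair is not allowed by the ontology; if the intended rule is that either condition alone removes a pair, the `and` would become `or`. The allowed category pairs, the threshold value, and the sample entities are hypothetical.

```python
from typing import List, Set, Tuple

Entity = Tuple[str, str, int, int]  # (category, text, start, end)

# Hypothetical ontology fragment: category pairs between which a relation is defined.
ALLOWED_CATEGORY_PAIRS: Set[Tuple[str, str]] = {
    ("ThreatActor", "Malware"),
    ("Malware", "CVE"),
}


def select_pairs(entities: List[Entity], max_gap: int) -> List[Tuple[Entity, Entity]]:
    """Pair each entity with its right-hand neighbour and screen the pairs."""
    ordered = sorted(entities, key=lambda e: e[2])
    selected: List[Tuple[Entity, Entity]] = []
    for left, right in zip(ordered, ordered[1:]):  # every two adjacent entities
        gap = right[2] - left[3]  # number of characters between the two
        allowed = (left[0], right[0]) in ALLOWED_CATEGORY_PAIRS
        if gap > max_gap and not allowed:
            continue  # too far apart and no ontology relation: delete the pair
        selected.append((left, right))
    return selected


entities = [
    ("ThreatActor", "GroupX", 0, 6),
    ("Malware", "ExampleBot", 12, 22),
    ("IPv4", "10.0.0.5", 90, 98),
]
pairs = select_pairs(entities, max_gap=20)
```

Here the first pair is kept (gap of 6 characters), while the second is deleted because its gap of 68 characters exceeds the threshold and no relation between the two categories is defined.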

Preferably, the network security knowledge ontology comprises N layers: the first layer is the network security knowledge top-layer ontology, and each subsequent layer is a further subdivision of the layer above it. The network security knowledge top-layer ontology comprises at least the following entities: Common Vulnerabilities and Exposures (CVE), Common Weakness Enumeration (CWE), China National Vulnerability Database of Information Security (CNNVD), Common Platform Enumeration (CPE), Common Attack Pattern Enumeration and Classification (CAPEC), Adversarial Tactics, Techniques, and Common Knowledge (ATT&CK), Structured Threat Information Expression (STIX), and malware (MALWARE).
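One way to picture the layered ontology is as a tree in which each layer refines the one above. In the sketch below, the top-layer entries come from the text, while the root name `NetworkSecurityKnowledge`, the second-layer children of `STIX` and `MALWARE`, and the function `layer_of` are purely illustrative assumptions.

```python
from typing import Dict, List

ROOT = "NetworkSecurityKnowledge"  # hypothetical name for the virtual root

ONTOLOGY: Dict[str, List[str]] = {
    # Layer 1: the top-layer ontology entities named in the text.
    ROOT: ["CVE", "CWE", "CNNVD", "CPE", "CAPEC", "ATT&CK", "STIX", "MALWARE"],
    # Layer 2: illustrative subdivisions only, not taken from the specification.
    "STIX": ["ThreatActor", "Campaign"],
    "MALWARE": ["Ransomware"],
}


def layer_of(concept: str) -> int:
    """Return the layer number of a concept (top-layer entities are layer 1)."""
    frontier = [(child, 1) for child in ONTOLOGY[ROOT]]
    while frontier:
        name, depth = frontier.pop(0)  # breadth-first walk down the layers
        if name == concept:
            return depth
        frontier.extend((child, depth + 1) for child in ONTOLOGY.get(name, []))
    raise KeyError(concept)


stix_layer = layer_of("STIX")
campaign_layer = layer_of("Campaign")
```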

Optionally, the method further comprises:

inquiring an entity matched with the entity to be inquired in the network security knowledge graph according to the word vector of the entity to be inquired.

In the above optional embodiment, an entity that matches the entity to be queried can be found in the network security knowledge graph fused with the entities' word vectors according to the word vector of the entity to be queried. Incorporating the entities' word vectors into the network security knowledge graph provides the basis for calculating the cosine similarity between the word vectors of two entities in the vector space, which yields better matching.

Preferably, querying, in the network security knowledge graph, the entity that matches the entity to be queried according to the word vector of the entity to be queried specifically includes:

determining the category to which the entity to be queried belongs;

calculating the cosine similarity between the word vector of the entity to be queried and the word vector of each entity in the network security knowledge graph of the same category as the entity to be queried, and collecting the entities whose cosine similarity is greater than or equal to a preset threshold into a first entity set;

searching the network security knowledge graph for first entities associated with a target entity, and forming a second entity set from the first entities of the same category as the entity to be queried;

comparing the entities in the first entity set and the second entity set, and, when the same entity exists in both sets, determining that entity as the entity that matches the entity to be queried.

In the above preferred embodiment, the entity that matches the entity to be queried is found in the network security knowledge graph as follows. The cosine similarity between the word vector of the entity to be queried and the word vector of each entity in the network security knowledge graph of the same category is calculated, and the entities whose cosine similarity is greater than or equal to a preset threshold form a first entity set. The first entities associated with the target entity are then searched for in the network security knowledge graph, and those of the same category as the entity to be queried form a second entity set. Entities present in both the first entity set and the second entity set are determined to be the entities that match the entity to be queried. Querying by cosine similarity between entity vectors achieves efficient matching in the network security knowledge graph and improves query efficiency.
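The query procedure can be sketched with plain Python data structures. This is a minimal sketch, assuming the target entity is supplied by the caller and that "associated" means directly connected by an edge in either direction; the function names, the two-dimensional vectors, the threshold, and all entity names are made up for illustration.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple

Vector = Sequence[float]
Edge = Tuple[str, str, str]  # (head, relation, tail)


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_entities(query_vec: Vector, query_category: str, target: str,
                   vectors: Dict[str, Vector], categories: Dict[str, str],
                   edges: List[Edge], threshold: float) -> Set[str]:
    # First entity set: same category and cosine similarity >= threshold.
    first = {name for name, vec in vectors.items()
             if categories[name] == query_category
             and cosine(query_vec, vec) >= threshold}
    # Second entity set: neighbours of the target entity with the same category.
    neighbours = ({t for h, _, t in edges if h == target}
                  | {h for h, _, t in edges if t == target})
    second = {n for n in neighbours if categories[n] == query_category}
    # Entities present in both sets are the matches.
    return first & second


vectors = {"ExampleBot": [1.0, 0.0], "OtherBot": [0.9, 0.1],
           "ToolZ": [0.0, 1.0], "GroupX": [1.0, 1.0]}
categories = {"ExampleBot": "Malware", "OtherBot": "Malware",
              "ToolZ": "Malware", "GroupX": "ThreatActor"}
edges = [("GroupX", "uses", "ExampleBot"), ("GroupX", "uses", "ToolZ")]

matches = match_entities([1.0, 0.05], "Malware", "GroupX",
                         vectors, categories, edges, threshold=0.9)
```

In this toy graph the similarity filter keeps `ExampleBot` and `OtherBot`, the neighbourhood filter keeps `ExampleBot` and `ToolZ`, and only `ExampleBot` survives the intersection.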

Optionally, the method further comprises:

obtaining a graph vector of each entity on the network security knowledge graph according to a graph representation learning model, wherein the graph vector of each entity represents a vector representation of the entity on the network security knowledge graph;

for each entity, splicing the graph vector of the entity and the word vector of the entity, and determining the spliced vector as a target entity vector of the entity;

and predicting the relation between any two entities according to the target entity vectors of the two entities and a knowledge prediction model.

In the above optional embodiment, a graph representation learning model is used to obtain the graph vector of each entity on the network security knowledge graph, and each entity's graph vector is spliced with its word vector to generate the entity's target entity vector. The relation between any two entities can then be predicted from the target entity vectors of the two entities and the knowledge prediction model. Because the graph vector and the word vector are combined, the target entity vector carries rich information, including both the connectivity structure of the network security knowledge graph and the semantic information of the entity. Predicting relations between entities with a neural network model on this richer entity vector information makes the prediction result more accurate.
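The splicing step and the shape of the prediction input can be illustrated as follows. The linear scorer with hand-picked weights is only a stand-in for the knowledge prediction model, whose architecture is not described in this part of the text; the vector sizes, weights, and relation names are hypothetical.

```python
from typing import Dict, List, Sequence


def target_vector(graph_vec: Sequence[float], word_vec: Sequence[float]) -> List[float]:
    """Splice (concatenate) an entity's graph vector and word vector."""
    return list(graph_vec) + list(word_vec)


def predict_relation(head: List[float], tail: List[float],
                     weights: Dict[str, List[float]]) -> str:
    """Score each candidate relation on the concatenated pair and return the best one."""
    features = head + tail
    scores = {rel: sum(w * x for w, x in zip(ws, features))
              for rel, ws in weights.items()}
    return max(scores, key=lambda rel: scores[rel])


# Two-dimensional graph and word vectors, so each target vector has 4 dimensions
# and each entity pair yields 8 features.
head = target_vector([1.0, 0.0], [0.0, 1.0])
tail = target_vector([0.0, 1.0], [1.0, 0.0])

# Hand-picked weights standing in for a trained model.
weights = {
    "uses": [1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0],
    "exploits": [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
}
relation = predict_relation(head, tail, weights)
```

Concatenation keeps the structural and semantic components in separate dimensions, so a downstream model can weight them independently.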

In a second aspect, an embodiment of the present invention provides a device for constructing a multi-source heterogeneous network security knowledge graph, including:

the first generation unit is used for responding to a trigger request for constructing a network security knowledge graph, extracting the matched entity and entity relation from the semi-structured data set and the structured data set in the acquired network security domain data set according to the entity and entity relation defined by a preset network security knowledge ontology, and generating a structured resource description framework RDF triplet, wherein the network security knowledge ontology is constructed according to the related network security standard of the network security domain, and the network security knowledge ontology defines the entity and entity relation of the network security domain;

the entity identification unit is used for identifying the entity matched with the entity defined by the network security ontology from the unstructured data set in the network security domain data set according to different categories of the entity in a preset identification mode, wherein the data in the unstructured data set is text data;

an obtaining unit, configured to input the text data into a word vector recognition model, and obtain word vectors of entities recognized from the unstructured dataset;

a second generating unit, configured to select entity pairs according to the character interval between every two adjacent entities identified from the unstructured data set, input the selected entity pairs and their word vectors into a relation extraction model to obtain the relation between each entity pair, and generate structured RDF triplets fused with the entities' word vectors from the entity pairs, their word vectors, and the relations between the entity pairs;

and the knowledge graph construction unit is used for constructing a network security knowledge graph according to the structured RDF triples and the structured RDF triples of the word vectors of the fusion entity.

Preferably, the entity identifying unit is specifically configured to identify, from the text data, a matched entity of a preset category according to a preset regular expression; and identifying the entities except the entities of the preset category from the text data by using an entity identification model, wherein the entity identification model is obtained by performing sequence labeling on the entities in the network security field in a sample set and then training according to a preset training model.

Preferably, the second generating unit is specifically configured to: determine the number of characters between every two adjacent entities identified from the unstructured data set; delete the entity pairs for which the number of characters in between is greater than a preset threshold and which do not conform to the relations defined by the network security knowledge ontology; and determine the remaining entity pairs as the selected entity pairs.

Optionally, the apparatus further comprises:

and the query unit is used for querying the entity matched with the entity to be queried in the network security knowledge graph according to the word vector of the entity to be queried.

Preferably, the query unit is specifically configured to: determine the category to which the entity to be queried belongs; calculate the cosine similarity between the word vector of the entity to be queried and the word vector of each entity in the network security knowledge graph of the same category as the entity to be queried, and collect the entities whose cosine similarity is greater than or equal to a preset threshold into a first entity set; search the network security knowledge graph for first entities associated with a target entity, and form a second entity set from the first entities of the same category as the entity to be queried; and compare the entities in the first entity set and the second entity set and, when the same entity exists in both sets, determine that entity as the entity that matches the entity to be queried.

Optionally, the apparatus further comprises:

the acquisition unit is used for acquiring the graph vectors of the entities on the network security knowledge graph according to the graph representation learning model, wherein the graph vectors of the entities represent the vector representation of the entities on the network security knowledge graph;

the determining unit is used for splicing the graph vector of each entity with the word vector of the entity, and determining the spliced vector as a target entity vector of the entity;

and the prediction unit is used for predicting the relation between any two entities according to the target entity vectors of the two entities and a knowledge prediction model.

For the technical effects of the multi-source heterogeneous network security knowledge graph construction device provided by the invention, reference may be made to the technical effects of the first aspect or of each implementation of the first aspect; they are not repeated here.

In a third aspect, an embodiment of the present invention provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the program, implements the method for constructing a multi-source heterogeneous network security knowledge graph according to the present invention.

In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium on which a computer program is stored, where the program, when executed by a processor, implements the steps of the method for constructing a multi-source heterogeneous network security knowledge graph according to the present invention.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and do not constitute a limitation on the invention. In the drawings:

FIG. 1 is a schematic diagram of an application scenario of a method for constructing a multi-source heterogeneous network security knowledge graph according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of an implementation flow of a method for constructing a multi-source heterogeneous network security knowledge graph according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of the structure of a network security knowledge top-layer ontology according to an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of a STIX ontology according to an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of a MALWARE ontology according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of an implementation flow for identifying, in an unstructured data set, entities that match entities of the network security domain according to an embodiment of the present invention;

FIG. 7 is a schematic structural diagram of an entity recognition model according to an embodiment of the present invention;

FIG. 8 is a schematic flow chart of selecting entity pairs from the entities identified in an unstructured data set according to an embodiment of the present invention;

FIG. 9 is a schematic structural diagram of a relation extraction model according to an embodiment of the present invention;

FIG. 10 is a schematic diagram of part of a network security knowledge graph constructed according to an embodiment of the present invention;

FIG. 11 is a schematic diagram of an implementation flow of entity query according to an embodiment of the present invention;

FIG. 12 is a schematic flow chart of predicting the relation between entities according to an embodiment of the present invention;

FIG. 13 is a schematic structural diagram of a knowledge prediction model according to an embodiment of the present invention;

FIG. 14 is a schematic structural diagram of a multi-source heterogeneous network security knowledge graph construction device according to an embodiment of the present invention;

FIG. 15 is a schematic structural diagram of a computer device according to an embodiment of the present invention.

Detailed Description

Referring to FIG. 1, which is a schematic diagram of an application scenario of a multi-source heterogeneous network security knowledge graph construction method according to an embodiment of the present invention, the scenario may include a terminal 110 and a computer device 120. When a network security knowledge graph needs to be constructed, for example when the computer device 120 receives a request to construct one sent by the terminal 110, the computer device 120 constructs the network security knowledge graph according to the request and may return it to the terminal 110. In another application scenario, the computer device 120 may automatically trigger construction of the network security knowledge graph, or an administrator may trigger construction on the computer device 120; in either case the computer device 120 executes the steps of the network security knowledge graph construction method provided by the embodiment of the present invention.

The computer device 120 may be a stand-alone physical server or terminal, or a cloud server that provides basic cloud computing services such as cloud servers, cloud databases, and cloud storage. The terminal 110 may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, or a desktop computer. The computer device 120 and the terminal 110 may be connected through a network, which is not limited by the embodiment of the present invention.

Based on the above application scenario, exemplary embodiments of the present invention are described in more detail below with reference to FIGS. 2 to 13. It should be noted that the above application scenario is shown only to aid understanding of the spirit and principle of the present invention, and the embodiments of the present invention are not limited in this respect. Rather, embodiments of the invention may be applied to any applicable scenario.

The preferred embodiments of the present invention will be described below with reference to the accompanying drawings of the specification, it being understood that the preferred embodiments described herein are for illustration and explanation only, and not for limitation of the present invention, and embodiments of the present invention and features of the embodiments may be combined with each other without conflict.

In this context, it is to be understood that the technical terms referred to in the present invention are:

1. Network security knowledge graph: a knowledge graph is a semantic network that formally describes things in the real world and the relations between them. It is generally represented by triplets, namely D = (E, R, E), where D denotes the knowledge base; E = {e₁, e₂, …, e_|E|} denotes the entity set in D, which contains |E| different entities; and R = {r₁, r₂, …, r_|R|} denotes the relation set in D, which contains |R| different relations. The basic forms of a triplet mainly include (concept, attribute, attribute value) and (entity 1, relationship, entity 2).

The network security knowledge graph targets knowledge in the network security domain. The entities and relationships it describes are those of that domain, for example entities such as vulnerabilities and hacker organizations, and relationships such as attack, download, and exploitation.
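The triple form described above can be illustrated with a minimal sketch. The `Triple` type and the three sample facts are hypothetical and serve only to show how the entity set E and the relation set R follow from a list of <entity 1, relationship, entity 2> triples.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    head: str      # entity 1
    relation: str  # relationship
    tail: str      # entity 2


# Hypothetical facts, chosen only to illustrate the triple form.
triples = [
    Triple("APT33", "attack", "electric company"),
    Triple("APT33", "utilization", "Password-Spraying"),
    Triple("CVE-2009-2213", "has_cve_cwe", "CWE-863"),
]

# E: the entity set of the knowledge base D; R: its relation set.
E = {t.head for t in triples} | {t.tail for t in triples}
R = {t.relation for t in triples}

print(sorted(E))
print(sorted(R))
```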

2. UCO (Unified Cybersecurity Ontology) model: an ontology is a model that defines which noun concepts become entity nodes and which relationships hold between entities. As a unified network security ontology, UCO integrates heterogeneous data and knowledge schemas from different network security systems together with the most commonly used network security standards for information sharing and exchange.

3. STIX (Structured Threat Information Expression): a language defined and developed by The MITRE Corporation for expressing structured cyber threat information, intended to convey the relevance and coverage of events quickly. STIX aims to cover the full range of threat information while being as expressive, flexible, extensible, automatable, and readable as possible. Its purpose is to standardize cyber threat information, including its collection, characterization, and communication. Representing the information in a structured way supports more effective cyber threat management and enables automation. In short, STIX is a structured language for describing cyber threat information so that it can be shared, stored, and analyzed in a consistent manner.

4. CVE (Common Vulnerabilities and Exposures): CVE assigns a common name to each widely recognized information security vulnerability or exposed weakness. Using a common name helps users share data across separate vulnerability databases and vulnerability assessment tools, which makes CVE a "key" for sharing security information. If a vulnerability named in a vulnerability report has a CVE name, the corresponding patch information can be found quickly in any other CVE-compatible database to resolve the security problem.

5. CNNVD (China National Vulnerability Database of Information Security): provides detailed information on information security vulnerabilities.

6. CWE (Common Weakness Enumeration): a unified, measurable software weakness description system that is freely available internationally. It enumerates the types of software weaknesses.

7. CAPEC (Common Attack Pattern Enumeration and Classification): provides a publicly available catalog of common attack patterns, together with a comprehensive schema and classification method.

8. ATT&CK (Adversarial Tactics, Techniques, and Common Knowledge): a knowledge base that reflects adversary behavior across the stages of the attack lifecycle.

9. CPE (Common Platform Enumeration): a structured naming scheme, i.e. a unified naming specification, for information technology systems, platforms, software packages, and the like, used to describe the classes of applications, operating systems, and hardware devices present in network assets.

As shown in fig. 2, which is a schematic implementation flow chart of a multi-source heterogeneous network security knowledge graph construction method according to an embodiment of the present invention, the multi-source heterogeneous network security knowledge graph construction method may be applied to the above-mentioned computer device 120, and may include the following steps:

S11, in response to a triggering request for constructing a network security knowledge graph, extracting matched entities and the relationships between them from the semi-structured data set and the structured data set in the acquired network security domain data set, according to the entities and inter-entity relationships defined by a preset network security knowledge ontology, to generate structured RDF triples.

In the embodiment of the invention, the computer device constructs the network security knowledge ontology in advance according to the relevant standards of the network security domain. The ontology defines the entities of the network security domain and the relationships between them, and also includes the attribute types of the entities (i.e. the nodes).

Specifically, the network security standards mainly referred to may include the UCO model and STIX. Core term definitions are collected from them, and an ontology development tool (such as Protégé) is used to design a hierarchical ontology structure applicable to the fields of network security vulnerabilities, network attacks, and threat intelligence. That is, the network security knowledge ontology comprises N layers, where N is an integer greater than or equal to 2; the first layer is the network security knowledge top-level ontology, and each layer further classifies the layer above it. The top-level ontology may include, but is not limited to, the following concept entities: CVE, CWE, CNNVD, CPE, CAPEC, ATT&CK, STIX, and MALWARE (malicious software). The general network security knowledge classes are CVE, CWE, CNNVD, CPE, CAPEC, and ATT&CK; the threat intelligence information classes are STIX and MALWARE.

The attributes of CVE include the vulnerability ID (i.e. the vulnerability number), vulnerability score, vulnerability release time, vulnerability severity, and other information. The attributes of CWE include the applicable platform, status, detection methods, mitigation measures, and other information. The attributes of CNNVD include the vulnerability ID, vulnerability severity, vulnerability solution, and other information. The attributes of CAPEC include the likelihood of exploitation, attack steps, attack countermeasures, attack preconditions, attack chain stages, and other information. The attributes of ATT&CK include the tactic type, technique identifier and description, technique permission level, and other information. The attributes of CPE include the platform description, operating system, and other information.

The sub-class entities of STIX mainly comprise security research reports (Report), which contain entities such as observed data, campaigns, attack tools, attack patterns, threats, and intrusion sets. The sub-class entities of observed data include IP (Internet Protocol) addresses, domain names (Domain), URLs (Uniform Resource Locators), files (File), and so on. The sub-class entities of MALWARE mainly include MALWARE instances, MALWARE families, MALWARE behaviors, MALWARE IPs, MALWARE domain names, MALWARE URLs, MALWARE rules, MALWARE signatures, and the like.

Fig. 3 is a schematic diagram of the top-level ontology structure of the network security knowledge. A relationship between two entities is represented by "has" concatenated with the class names of the two entities. For example, the relationship between CNNVD and CVE may be represented by "has_cnnvd_cve", and that between CVE and CNNVD by "has_cve_cnnvd"; the relationship between CNNVD and CPE by "has_cnnvd_cpe"; between CVE and CPE by "has_cve_cpe"; between CAPEC and ATT&CK by "has_capec_attck", and between ATT&CK and CAPEC by "has_attck_capec"; between CAPEC and CWE by "has_capec_cwe"; between CWE and CVE by "has_cwe_cve", and between CVE and CWE by "has_cve_cwe"; and between MALWARE and STIX by "has_malware_stix".

Relationships also exist between the sub-class entities of MALWARE and the next-layer sub-class entities of the STIX observed data. For example, the relationship between a MALWARE domain name and a STIX domain name may be represented by "has_malwaredomain_stixdomain", between a MALWARE IP and a STIX IP by "has_malwareip_stixip", between a MALWARE instance and a STIX file by "has_malwareinstance_stixfile", and between a MALWARE URL and a STIX URL by "has_malwareurl_stixurl". The relationship between the MALWARE sub-class entity "MALWARE instance" and CVE may be represented by "has_malwareinstance_cve". The embodiment of the present invention is not limited thereto.
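The naming convention for relationships ("has" followed by the two entity class names) can be sketched as a small helper. The function name `relation_name` and the normalization rule (lowercasing and dropping non-alphanumeric characters, so that "ATT&CK" becomes "attck") are assumptions made for illustration.

```python
import re


def relation_name(source_class: str, target_class: str) -> str:
    """Build a relation label of the form has_<source>_<target>."""

    def normalize(name: str) -> str:
        # Lowercase and drop every character that is not a letter or digit.
        return re.sub(r"[^a-z0-9]", "", name.lower())

    return f"has_{normalize(source_class)}_{normalize(target_class)}"


print(relation_name("CNNVD", "CVE"))
print(relation_name("CAPEC", "ATT&CK"))
```

Generating the labels from the class names keeps both directions of a relationship (for example CVE to CNNVD and CNNVD to CVE) consistent without maintaining a hand-written list.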

Fig. 4 is a schematic structure of the STIX ontology, and Fig. 5 is a schematic structure of the MALWARE ontology. The relationships between entities in Figs. 4 and 5 are represented in the same way as in the top-level ontology structure of Fig. 3 and are not described again here.

A network security knowledge graph may need to be constructed, for example, when a user sends a construction request to the computer device through the terminal, or when the computer device triggers the construction automatically. In response to the triggering request, the computer device extracts matched entities and the relationships between them from the semi-structured data set and the structured data set in the acquired network security domain data set, according to the entities and inter-entity relationships defined by the preset network security knowledge ontology, and generates structured RDF triples.

In specific implementation, the computer device may acquire the network security domain data set in advance using crawler technology and preprocess the data to generate a network security knowledge corpus. The network security domain data set includes a semi-structured data set, a structured data set, and an unstructured data set. The semi-structured and structured data sets may include, but are not limited to, the CVE, CWE, CNNVD, CPE, CAPEC, and ATT&CK data sets; the unstructured data set may include, but is not limited to, the STIX and MALWARE data sets. The data in the unstructured data set is text data, mainly network security research reports written by security researchers, such as APT (Advanced Persistent Threat) reports and MALWARE topic reports, which describe attack process information more accurately and in richer detail. Because unstructured text data comes in inconsistent formats and contains special symbols (such as HTML (HyperText Markup Language) tags, XML (Extensible Markup Language) tags, and illegal characters), the text data in the unstructured data set needs to be preprocessed (i.e. cleaned); the preprocessing includes removing special symbols and code segments.
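A minimal sketch of the cleaning step is given below. The function name `clean_report_text` is hypothetical, and two points are assumptions rather than part of the described method: code segments are taken to be `<pre>` or `<code>` blocks, and illegal characters are taken to be ASCII control characters.

```python
import html
import re


def clean_report_text(raw: str) -> str:
    """Remove markup, code segments, and control characters from report text."""
    # Assumption: code segments appear as <pre>...</pre> or <code>...</code>.
    text = re.sub(r"<(pre|code)\b[^>]*>.*?</\1>", " ", raw, flags=re.S | re.I)
    # Remove any remaining HTML/XML tags.
    text = re.sub(r"<[^>]+>", " ", text)
    # Decode character references such as &amp; left over from the markup.
    text = html.unescape(text)
    # Assumption: "illegal characters" are ASCII control characters.
    text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]", "", text)
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()


sample = "<p>APT33 uses Password-Spraying &amp; phishing.</p><pre>int x = 1;</pre>\x00"
print(clean_report_text(sample))
```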

In the implementation, according to the relationship between the entities in the network security domain defined by the built network security ontology, the computer equipment extracts all the matched relationships between the entities from the semi-structured data set and the structured data set, and generates a corresponding structured RDF triplet.

Specifically, a parsing and conversion program may be written using Java OWL, and the different data sets are fused through the correspondences between the entity IDs published in the semi-structured and structured data sets; for example, CVE-2009-2213 belongs to CWE-863.
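The ID-based fusion can be sketched as follows. This is an illustration in Python rather than the Java OWL program mentioned above; the record layout (`id`, `cwe_ids`), the function name `link_cve_to_cwe`, and the relation label are assumptions that follow the "has" naming pattern.

```python
from typing import Dict, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]


def link_cve_to_cwe(cve_records: Iterable[Dict], known_cwe_ids: Set[str]) -> List[Triple]:
    """Emit one triple per CVE-to-CWE reference whose CWE ID exists in the CWE data set."""
    triples: List[Triple] = []
    for record in cve_records:
        for cwe_id in record.get("cwe_ids", []):
            # Only link IDs that are actually present in the other data set.
            if cwe_id in known_cwe_ids:
                triples.append((record["id"], "has_cve_cwe", cwe_id))
    return triples


# Hypothetical parsed input; the field names are not those of any real feed.
cve_records = [{"id": "CVE-2009-2213", "cwe_ids": ["CWE-863"]}]
print(link_cve_to_cwe(cve_records, {"CWE-863"}))
```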

S12, identifying the entity matched with the entity defined by the network security ontology from the unstructured data set in the network security domain data set according to different categories of the entity in a preset identification mode, wherein the data in the unstructured data set is text data.

In specific implementation, knowledge extraction is performed on text data in the unstructured dataset after preprocessing, wherein the knowledge extraction comprises entity identification and relation extraction.

Specifically, the entities in the unstructured data set that match the entities of the network security domain defined in the constructed network security ontology may be identified according to the flow shown in Fig. 6, which comprises the following steps:

S21, identifying the matched entities of the preset categories from the text data according to the preset regular expression.

In the embodiment of the invention, the entities are divided into two main classes according to their composition characteristics. The first class may include, but is not limited to: email addresses, URLs, IP addresses, domain names, CVE vulnerability numbers (IDs), CNNVD vulnerability numbers (IDs), file hashes, and so on. The first class is designated as the entities of the preset categories, and matching entities of these categories are identified from the text data according to preset regular expressions. The second class comprises all entities other than the first class, for example attack organizations and attack methods.

Specifically, the regular expressions may be configured as required, which is not limited in the embodiment of the invention.

In the embodiment of the invention, each entity in the first category can be identified by adopting a regular expression shown in the table 1:

TABLE 1

Table 1 shows the correspondence between the regular expressions and the entities they match.
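Regular-expression recognition of the preset categories can be sketched as follows. The patterns below are illustrative assumptions and are not taken from Table 1; the CNNVD number format in particular is assumed to be "CNNVD-" followed by six digits, a hyphen, and a serial number. Domain names are omitted because a robust pattern for them needs extra care to avoid overlapping with URLs and email addresses.

```python
import re
from typing import List, Tuple

# Hypothetical patterns for some of the preset entity categories.
PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "url": re.compile(r'\bhttps?://[^\s<>"]+'),
    "ip": re.compile(
        r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"
    ),
    "cve_id": re.compile(r"\bCVE-\d{4}-\d{4,}\b"),
    "cnnvd_id": re.compile(r"\bCNNVD-\d{6}-\d+\b"),
    # MD5 (32), SHA-1 (40), or SHA-256 (64) hexadecimal digests.
    "file_hash": re.compile(r"\b(?:[a-fA-F0-9]{64}|[a-fA-F0-9]{40}|[a-fA-F0-9]{32})\b"),
}


def find_entities(text: str) -> List[Tuple[str, str, int]]:
    """Return (category, matched text, start offset) tuples in reading order."""
    found = []
    for category, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((category, match.group(0), match.start()))
    return sorted(found, key=lambda item: item[2])


text = (
    "Mail admin@example.com about CVE-2009-2213 (CNNVD-200906-394), seen at "
    "192.168.1.10 via http://example.com/a.exe with hash "
    "d41d8cd98f00b204e9800998ecf8427e"
)
for category, value, start in find_entities(text):
    print(category, value, start)
```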

S22, identifying the entities outside the entities of the preset category from the text data by using an entity identification model.

In the implementation, entities except the entities of the preset category are identified from the text data by using a pre-trained entity identification model, wherein the entity identification model is obtained by performing sequence labeling on the entities in the network security field in a sample set and training according to a preset training model.

Specifically, the entity recognition model may be trained on the text data obtained from the unstructured data set: 20% of the text data may be selected as the training sample set, a further 10% may be selected from the remaining text data as the cross-validation set, and the remaining 70% serves as the test set, so as to ensure the accuracy of the finally trained entity recognition model. The text data in the training sample set is sequence-labeled by security researchers, who annotate the entities and the relationships between them according to the entities and inter-entity relationships of the network security domain defined by the network security ontology constructed in this application; for example, the entity "OceanLotus" in the text data is labeled as an "attack organization". The labeled text data is then used to train a preset training model to obtain the entity recognition model. The preset training model may be, but is not limited to, a BERT (Bidirectional Encoder Representations from Transformers) + BiLSTM (Bi-directional Long Short-Term Memory) + CRF (Conditional Random Field) model, which is not limited in the embodiment of the present invention. In the embodiment of the invention, the BERT+BiLSTM+CRF model is taken as an example, and the entity recognition model is obtained through the following training process:

For each sample sentence in the training sample set, sequence labeling is performed word by word, so that each word of the sample sentence is annotated with a label.

The following operations are then performed iteratively until a preset convergence condition is met:

Based on the sample sentence and the BERT model in the BERT+BiLSTM+CRF model, the word vector of each word of the sample sentence is obtained.

The word vector of each word is input into the BiLSTM+CRF part of the BERT+BiLSTM+CRF model to predict the label of each word.

For each word of the sample sentence, a label error is determined from the predicted label and the annotated label.

The parameters of the BERT+BiLSTM+CRF model are adjusted according to the label error until the model converges, yielding the trained entity recognition model. The initial values of the parameters of the BERT+BiLSTM+CRF model may be specified in advance.
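The comparison between predicted and annotated labels can be illustrated with a short sketch. The mismatch rate used here is a simplified stand-in chosen for illustration; the text does not specify the loss function, and a BERT+BiLSTM+CRF model is typically trained with a differentiable sequence-level loss rather than this count. The function name `label_error` and the sample labels are hypothetical.

```python
from typing import List


def label_error(predicted: List[str], annotated: List[str]) -> float:
    """Fraction of words whose predicted label differs from the annotated label."""
    if len(predicted) != len(annotated):
        raise ValueError("label sequences must have the same length")
    if not annotated:
        return 0.0
    wrong = sum(p != a for p, a in zip(predicted, annotated))
    return wrong / len(annotated)


# One label per word of a five-word sample sentence ("O" marks a non-entity).
annotated = ["O", "O", "Product", "O", "Vulnerability"]
predicted = ["O", "O", "Product", "O", "O"]
error = label_error(predicted, annotated)
print(error)
```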

Fig. 7 shows the structure of the entity recognition model. First, text data is input into the BERT model to obtain the word vector of each word. Specifically, the Encoder layers of the BERT model encode and learn from text in the network security domain to obtain multi-level feature representations of the related terms; compared with word2vec word vectors, the word vectors produced by the BERT model improve the expressive power of the entity representation. The generated word vectors are then input into the BiLSTM model to learn deep structural features of the words in the text, where the BiLSTM model combines a forward LSTM (Long Short-Term Memory) model and a backward LSTM model. Finally, a CRF layer is added after the BiLSTM model so that the conditional random field constraints ensure the validity of the prediction result, and the identified entities are output.

For example, suppose a sentence in the text data input to the entity recognition model is "Li Hua found that an XSS vulnerability exists in Chrome". After the sentence is input into the BERT+BiLSTM+CRF model, the output for each word is: Li Hua: O (Other, i.e. not a labeled entity), found: O, Chrome: Product, exists: O, XSS vulnerability: Vulnerability. That is, the entities finally identified are Chrome and XSS vulnerability, which belong to the categories product and vulnerability, respectively (XSS stands for Cross-Site Scripting).

As another example, suppose another sentence in the text data input to the entity recognition model is "A foreign security vendor revealed that APT33, together with the Parisite organization, attacked a US electric power company using the Password-Spraying technique, and that the Parisite organization was linked to the attack on a Bahrain oil company". The entities of the network security domain identified in this sentence include: APT33 (attack organization), Parisite (attack organization), Password-Spraying (attack technique), US electric power company (infrastructure), and the attack on the Bahrain oil company (attack event).
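Turning a per-word label sequence into a list of entities, as in the examples above, can be sketched as follows. The function name `collect_entities` is hypothetical, and the sketch assumes plain category labels with "O" for non-entities; with this scheme two adjacent entities of the same category would be merged, which a B-/I- prefixed tagging scheme would avoid.

```python
from typing import List, Tuple


def collect_entities(words: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Group consecutive words that share a non-"O" label into one entity."""
    entities: List[Tuple[str, str]] = []
    current_words: List[str] = []
    current_label = "O"
    for word, label in zip(words, labels):
        if label != current_label:
            # The label changed: close the entity collected so far, if any.
            if current_label != "O":
                entities.append((" ".join(current_words), current_label))
            current_words, current_label = [], label
        if label != "O":
            current_words.append(word)
    if current_label != "O":
        entities.append((" ".join(current_words), current_label))
    return entities


words = ["Li Hua", "found", "Chrome", "exists", "XSS", "vulnerability"]
labels = ["O", "O", "Product", "O", "Vulnerability", "Vulnerability"]
print(collect_entities(words, labels))
```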

S13, inputting the text data into a word vector recognition model to obtain word vectors of the entities recognized from the unstructured data set.

In specific implementation, text data in an unstructured data set is input into a word vector recognition model to obtain word vectors of all entities recognized from the text data in the unstructured data set, wherein the word vector recognition model can use a BERT model, and the embodiment of the invention is not limited to the method.

S14, selecting entity pairs according to character intervals between every two adjacent entities identified from the unstructured dataset, inputting the selected entity pairs and corresponding word vectors into a pre-trained relation extraction model to obtain relations among the entity pairs, and generating structured RDF triples of word vectors of fusion entities according to the entity pairs and the corresponding word vectors and the relations among the entity pairs.

In implementation, selecting the entity pair from the entities identified in the unstructured dataset according to the flow shown in fig. 8 may include the following steps:

S31, determining the number of characters separating every two adjacent entities identified from the unstructured data set.

Specifically, for every two adjacent entities identified from the text data of the unstructured data set, the number of characters between them is determined.

S32, deleting the entity pairs whose separating character count is greater than a preset threshold and which do not conform to a relationship defined by the network security knowledge ontology.

In specific implementation, the threshold may be preset empirically; for example, it may be set to 60 Chinese characters when the text data is Chinese, or to 30 English words when the text data is English, which is not limited in the embodiment of the present invention.

Specifically, the entity pairs whose separating character count is greater than the preset threshold and which do not conform to a relationship defined by the network security ontology are deleted. An example of an entity pair that does not conform to any relationship defined by the network security ontology is "Campaign and Version".

S33, determining the remaining entity pairs as the selected entity pairs.
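Steps S31 to S33 can be sketched as follows. Several points are assumptions: the `Mention` type and its offsets are hypothetical, the ontology check is modeled as a set of allowed (category, category) pairs, and the two deletion conditions are applied as independent filters, which is one possible reading of step S32.

```python
from typing import List, NamedTuple, Set, Tuple


class Mention(NamedTuple):
    text: str
    category: str
    start: int  # offset of the first character
    end: int    # offset one past the last character


def select_entity_pairs(
    mentions: List[Mention],
    allowed: Set[Tuple[str, str]],
    max_gap: int,
) -> List[Tuple[Mention, Mention]]:
    ordered = sorted(mentions, key=lambda m: m.start)
    selected = []
    for left, right in zip(ordered, ordered[1:]):
        gap = right.start - left.end  # S31: characters between adjacent entities
        if gap > max_gap:             # S32: too far apart
            continue
        if (left.category, right.category) not in allowed:  # S32: not in ontology
            continue
        selected.append((left, right))  # S33: the remaining pairs are selected
    return selected


mentions = [
    Mention("APT33", "AttackOrganization", 0, 5),
    Mention("Password-Spraying", "AttackTechnique", 10, 27),
    Mention("Campaign X", "Campaign", 200, 210),
    Mention("v1", "Version", 212, 214),
]
allowed = {("AttackOrganization", "AttackTechnique")}
pairs = select_entity_pairs(mentions, allowed, max_gap=30)
print([(a.text, b.text) for a, b in pairs])
```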

Further, the selected entity pairs and the word vectors of the entities in those pairs are input into the pre-trained relation extraction model to obtain the relationship between each entity pair, and structured RDF triples fused with the entity word vectors are generated from the entity pairs, the word vectors of their entities, and the relationships between the pairs. The relation extraction model is obtained by training a neural network model on the entity pairs in the sample set, the word vectors of their entities, and the relationships between the entity pairs.

Specifically, the relation extraction model is trained on the same training sample set used for the entity recognition model. The neural network model consists of an input layer, three hidden layers (hidden layer 1, hidden layer 2, hidden layer 3), a softmax classification layer, and an output layer. In specific implementation, the neural network model may be a DNN (Deep Neural Network) model, which is not limited in the embodiment of the invention.

When the relation extraction model is trained, each entity pair extracted from the training sample set and the corresponding word vector of each entity pair are used as the input of the neural network model, the output is the relation between the entity pairs, and the model parameters are adjusted according to the error of the output relation and the actual relation until the model converges, so that the trained relation extraction model is obtained.

Fig. 9 shows the structure of the relation extraction model (h1, h2, and h3 abbreviate "hidden layer 1", "hidden layer 2", and "hidden layer 3"). The input is the selected entity pairs and the word vectors of their entities, and the output is the relationship between each entity pair. As shown in the figure, the input is: Chrome + word vector, Spygen + word vector, ……, SSL + word vector, Hackers + word vector; and the output is: the relationship between Hackers and Chrome is targets, the relationship between Spygen and SSL3.0 is exploits, ……, the relationship between SSL3.0 and MIM attack is has Vulnerability (a vulnerability exists), and so on.
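The layer structure described above (input layer, three hidden layers, softmax classification) can be sketched as a forward pass in plain Python. This is an untrained illustration: the ReLU activation, the concatenation of the two word vectors as the input, the layer sizes, and the random initial weights are all assumptions, and the relation list contains only the three labels named in the example.

```python
import math
import random
from typing import Dict, List, Sequence, Tuple

RELATIONS = ["targets", "exploits", "has Vulnerability"]
Layer = Tuple[List[List[float]], List[float]]  # (weights, bias)


def dense(x: Sequence[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x: Sequence[float]) -> List[float]:
    return [max(0.0, v) for v in x]


def softmax(x: Sequence[float]) -> List[float]:
    top = max(x)  # subtract the maximum for numerical stability
    exps = [math.exp(v - top) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def build_model(input_dim: int, hidden_dim: int, seed: int = 0) -> List[Layer]:
    rng = random.Random(seed)
    sizes = [input_dim, hidden_dim, hidden_dim, hidden_dim, len(RELATIONS)]
    model = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
        model.append((weights, [0.0] * n_out))
    return model


def predict_relation(model: List[Layer], head_vec: Sequence[float],
                     tail_vec: Sequence[float]) -> Dict[str, float]:
    x = list(head_vec) + list(tail_vec)  # input layer: both entity word vectors
    for weights, bias in model[:-1]:     # hidden layers 1 to 3
        x = relu(dense(x, weights, bias))
    logits = dense(x, *model[-1])        # classification layer
    return dict(zip(RELATIONS, softmax(logits)))


model = build_model(input_dim=8, hidden_dim=5)
probs = predict_relation(model, [0.1] * 4, [0.2] * 4)
print(max(probs, key=probs.get))
```

In training, the weights would be adjusted from the error between the predicted and the actual relationship; that step is omitted here.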

S15, constructing a network security knowledge graph according to the structured RDF triples and the structured RDF triples of the word vectors of the fusion entity.

In the specific implementation, knowledge fusion is carried out according to the structured RDF triples generated based on the semi-structured data set and the structured RDF triples of the word vectors of the fusion entity generated based on the unstructured data set, so as to construct the network security knowledge graph.

Specifically, identical entities across all RDF triples are merged to establish the nodes of the network security knowledge graph, the edges of the graph are established according to the relationships between the entities in the triples, and a relationship is established between each entity that has a word vector representation and its word vector; this relationship may be denoted "has Vector". Word vectors capture semantic similarity between entities. For example, in the network security domain, the entities Google Chrome and Internet Explorer (both web browsers) are close at the vector level because they belong to the same browser product class.

In implementation, knowledge fusion may link entities according to their IDs or names. In addition, an entity matching script may be written to match entities that are the same but appear in different forms (such as different letter case or abbreviations), and the successfully matched entities are merged into a single entity.
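The merging of same-entity variants can be sketched as follows. The functions `canonical` and `fuse`, the alias table, and the sample triples are hypothetical; the sketch normalizes letter case and whitespace and resolves abbreviations through a hand-written alias map, which is only one way such a matching script might work.

```python
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]


def canonical(name: str, aliases: Dict[str, str]) -> str:
    """Map case variants and known abbreviations to one canonical name."""
    key = " ".join(name.lower().split())
    return aliases.get(key, key)


def fuse(triples: Iterable[Triple], aliases: Dict[str, str]) -> Tuple[Set[str], Set[Triple]]:
    """Build graph nodes and edges, merging entities that share a canonical name."""
    nodes: Set[str] = set()
    edges: Set[Triple] = set()
    for head, relation, tail in triples:
        h, t = canonical(head, aliases), canonical(tail, aliases)
        nodes.update((h, t))
        edges.add((h, relation, t))
    return nodes, edges


aliases = {"ie": "internet explorer"}  # hypothetical abbreviation table
triples = [
    ("Google Chrome", "isA", "Product"),
    ("google chrome", "has Vulnerability", "denial of service"),
    ("IE", "isA", "Product"),
    ("Internet Explorer", "isA", "Product"),
]
nodes, edges = fuse(triples, aliases)
print(sorted(nodes))
```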

Fig. 10 is a schematic diagram of part of the constructed network security knowledge graph. execute arbitrary code, Google Chrome (a web browser), and denial of service are all entities extracted from the threat intelligence item intel28876293, and the relationship between each of them and intel28876293 is isA (class membership). remote attackers belongs to the entity Attacker. execute arbitrary code, Google Chrome, denial of service, and remote attackers are all entities extracted from text data in the unstructured data set and therefore have word vector attributes (i.e. text vector attributes); the relationship between each of them and its word vector is has Vector. Google Chrome belongs to (isA) the entity Product. The relationship between intel28876293 and execute arbitrary code is has Vulnerability (a vulnerability exists), meaning that intel28876293 has the vulnerability execute arbitrary code; the relationship between intel28876293 and denial of service is likewise has Vulnerability, meaning that intel28876293 has the vulnerability denial of service. The relationship between execute arbitrary code and the entity Vulnerability is isA, meaning that execute arbitrary code belongs to the vulnerability category; the relationship between denial of service and the entity Vulnerability is likewise isA. The relationship between execute arbitrary code and remote attackers is has Attacker, meaning that execute arbitrary code has the attacker remote attackers; the relationship between denial of service and remote attackers is likewise has Attacker.

In an alternative embodiment, after the network security knowledge graph is established, the entity matched with the entity to be queried can be queried in the network security knowledge graph according to the word vector of the entity to be queried.

In specific implementation, the embodiment of the invention uses the cosine similarity between the word vectors of entities in the network security domain to find other entities that match a given entity. The constructed network security knowledge graph is stored in HugeGraph (a graph database), and an entity-vector-based query encapsulation layer is designed on top of its built-in query service, Gremlin Server, to improve query efficiency. Three basic query modes are designed: C = {search, list, refer}, i.e. query, enumeration, and inference, where C represents the query result.

Specifically, the entity query performed according to the flow shown in fig. 11 may include the following steps:

S41, determining the category of the entity to be queried.

Specifically, taking the network security knowledge graph shown in Fig. 10 as an example, suppose the query is for vulnerabilities of the entity "Google Chrome" that are similar to "denial of service" (i.e. vulnerabilities similar to "denial of service" among the entities associated with "Google Chrome"). Here the category of the entity to be queried, "denial of service", is Vulnerability, and "Google Chrome" is the target entity.

S42, respectively calculating cosine similarity between the word vector of the entity to be queried and the word vector of each entity with the same category as the entity to be queried in the network security knowledge graph, and obtaining the corresponding entity with the same category as the entity to be queried when the cosine similarity is greater than or equal to a preset threshold value to form a first entity set.

Specifically, the entity shown in Fig. 10 that has the same category as the entity to be queried, "denial of service", is execute arbitrary code, whose category is also Vulnerability. The cosine similarity between the word vector of denial of service and the word vector of execute arbitrary code is calculated, and the entity is taken if the similarity is greater than or equal to a preset threshold. The threshold may be set empirically, for example to 0.7, which is not limited in the embodiment of the present invention. Assuming the cosine similarity between the two is greater than 0.7, the entity execute arbitrary code is obtained. Because Fig. 10 shows only part of the constructed network security knowledge graph, the whole graph contains further entities of the same category as "denial of service" that are not shown; the entities in the whole graph that satisfy the above condition are obtained in the same way and form the first entity set. Steps S41 and S42 correspond to "search" in the above query modes.

S43, searching a first entity associated with the target entity in the network security knowledge graph, and forming a second entity set by the first entity with the same category as the entity to be queried.

In specific implementation, continuing the above example, the first entities associated with the target entity Google Chrome are searched for in Fig. 10. According to the association relationships in Fig. 10, the first entities associated with Google Chrome include denial of service and execute arbitrary code, i.e. Google Chrome has the vulnerabilities denial of service and execute arbitrary code; these two entities form the second entity set. Step S43 corresponds to "list" in the above query modes.

S44, comparing each entity in the first entity set and the second entity set, and determining the same entity as the entity matched with the entity to be queried when determining that the same entity exists in the first entity set and the second entity set.

In specific implementation, the entity common to the first entity set and the second entity set is execute arbitrary code, so execute arbitrary code is the entity matched with the entity to be queried, denial of service; further, an early-warning message may be sent to the client. Step S44 corresponds to "refer" in the above query modes.
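Steps S41 to S44 can be sketched on a small in-memory stand-in for the graph of Fig. 10. This does not use HugeGraph or Gremlin; the dictionaries, the three-dimensional word vectors, and the function names (`search`, `list_related`, `refer`) are hypothetical, and the entity to be queried is excluded from the first entity set so that the result matches the example.

```python
import math
from typing import Dict, List, Sequence, Set


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical categories, word vectors, and associations.
category: Dict[str, str] = {
    "denial of service": "Vulnerability",
    "execute arbitrary code": "Vulnerability",
    "remote attackers": "Attacker",
    "Google Chrome": "Product",
}
word_vector: Dict[str, List[float]] = {
    "denial of service": [0.9, 0.1, 0.3],
    "execute arbitrary code": [0.8, 0.2, 0.4],
    "remote attackers": [0.1, 0.9, 0.2],
    "Google Chrome": [0.2, 0.3, 0.9],
}
associated: Dict[str, Set[str]] = {
    "Google Chrome": {"denial of service", "execute arbitrary code"},
}


def search(query: str, threshold: float = 0.7) -> Set[str]:
    """S41 and S42: same-category entities whose word vector is close to the query."""
    return {
        entity
        for entity, cat in category.items()
        if entity != query
        and cat == category[query]
        and cosine(word_vector[query], word_vector[entity]) >= threshold
    }


def list_related(target: str, query: str) -> Set[str]:
    """S43: entities associated with the target that share the query's category."""
    return {e for e in associated.get(target, set()) if category[e] == category[query]}


def refer(query: str, target: str, threshold: float = 0.7) -> Set[str]:
    """S44: entities present in both the first and the second entity set."""
    return search(query, threshold) & list_related(target, query)


print(refer("denial of service", "Google Chrome"))
```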

In an alternative embodiment, after the network security knowledge graph is constructed, knowledge prediction may be further performed according to the word vector of the entity and the graph vector of the network security knowledge graph, for example, to predict the relationship between any two entities.

In specific implementation, predicting the relationship between any two entities according to the flow shown in fig. 12 may include the following steps:

S51, obtaining the graph vectors of all the entities on the network security knowledge graph according to the graph representation learning model.

In a specific implementation, the graph representation learning model may be, but is not limited to, any model capable of extracting vectors of the nodes of a graph, such as a GCN (Graph Convolutional Network) model or an RDF2VEC model, which is not limited in this embodiment of the present invention.

Specifically, a vector representation of each entity on a network security knowledge graph is obtained according to a graph representation learning model and is recorded as a graph vector of the entity, and the graph vector of the entity represents the vector representation of the entity on the network security knowledge graph.

S52, for each entity, splicing the graph vector of the entity and the word vector of the entity, and determining the spliced vector as a target entity vector of the entity.

In a specific implementation, since the entities extracted from the text data in the unstructured data set have the word vector attribute, for each entity in the network security knowledge graph that has a word vector, the word vector and the graph vector of the entity are spliced and the spliced vector is determined to be the target entity vector of the entity; for each entity in the network security knowledge graph that has no word vector, the graph vector of the entity is determined to be the target entity vector of the entity.
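The splicing rule of step S52 can be illustrated with a short sketch. The function name, the order of concatenation (graph vector first), and the toy values are assumptions; the embodiment only states that the two vectors are spliced when a word vector exists and that the graph vector alone is used otherwise.

```python
def target_entity_vector(graph_vector, word_vector=None):
    """Splice graph and word vectors; fall back to the graph vector alone."""
    if word_vector is None:
        # Entity did not come from unstructured text, so it has no word vector.
        return list(graph_vector)
    # Assumption: graph vector first, word vector second.
    return list(graph_vector) + list(word_vector)

# Hypothetical toy vectors.
with_word = target_entity_vector([0.1, 0.2], [0.7, 0.8, 0.9])
without_word = target_entity_vector([0.3, 0.4])
print(with_word, without_word)
```

In this sketch the two cases yield vectors of different lengths; how a downstream model handles that difference is not described in this passage.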

S53, predicting the relation between any two entities according to the target entity vector and the knowledge prediction model of the any two entities.

In a specific implementation, the knowledge prediction model is obtained by training a preset neural network model on the target entity vectors of the entities in the sample set (the target entity vector of an entity is the vector representation of the entity obtained in step S52) and the relationships between the entities.

Specifically, when the knowledge prediction model is trained, the training sample set used for training the entity recognition model is still used. The preset neural network model is composed of a convolution layer (Convolutional layer), an excitation layer (ReLU), a max-pooling layer (Max-Pooling layer), a fully connected layer (Fully connected layer) and an output layer (Output layer); in the embodiment of the invention, the preset neural network model may be, but is not limited to, a CNN (Convolutional Neural Network) model.

When the knowledge prediction model is trained, the target entity vectors of any two entities extracted from the training sample set are taken as the input of the preset neural network model, the relationship between the two entities is taken as the output, and the model parameters are adjusted according to the error between the output relationship and the actual relationship until the model converges, so as to obtain the trained knowledge prediction model.

Fig. 13 is a schematic structural diagram of the knowledge prediction model: the target entity vectors Vn,k and Vn,l of two entities are input and passed sequentially through the convolution layer, the excitation layer, the max-pooling layer and the fully connected layer of the knowledge prediction model, and the relationship (relationship) between the two entities is output.
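The layer sequence described above (convolution, ReLU, max pooling, fully connected, output) can be sketched as a forward pass in plain Python. This is an untrained illustration under several assumptions: the two target entity vectors are simply concatenated into one sequence, the convolution is one-dimensional, max pooling is taken over each whole feature map, the output layer is a softmax, the weights are random, and the relation labels are hypothetical. None of these details is specified in the embodiment.

```python
import math
import random

def conv1d(seq, kernel, bias):
    """Valid 1-D convolution (cross-correlation) of a sequence with one kernel."""
    k = len(kernel)
    return [sum(seq[i + j] * kernel[j] for j in range(k)) + bias
            for i in range(len(seq) - k + 1)]

def relu(xs):
    """Excitation layer: element-wise max(0, x)."""
    return [max(0.0, x) for x in xs]

def softmax(xs):
    """Numerically stable softmax for the output layer."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def predict_relation(vec_a, vec_b, params, relations):
    seq = list(vec_a) + list(vec_b)  # assumption: concatenate the two inputs
    pooled = []
    for kernel, bias in params["filters"]:
        feature_map = relu(conv1d(seq, kernel, bias))  # convolution + ReLU
        pooled.append(max(feature_map))                # max pooling per filter
    # Fully connected layer: one score per candidate relation.
    logits = [sum(w * p for w, p in zip(row, pooled)) + b
              for row, b in zip(params["fc_weights"], params["fc_bias"])]
    probs = softmax(logits)
    best = max(range(len(relations)), key=lambda i: probs[i])
    return relations[best], probs

rng = random.Random(0)  # fixed seed so the untrained weights are reproducible
relations = ["has_vulnerability", "affects", "no_relation"]  # hypothetical labels
n_filters, kernel_size = 4, 3
params = {
    "filters": [([rng.uniform(-1, 1) for _ in range(kernel_size)], 0.0)
                for _ in range(n_filters)],
    "fc_weights": [[rng.uniform(-1, 1) for _ in range(n_filters)]
                   for _ in relations],
    "fc_bias": [0.0] * len(relations),
}

label, probs = predict_relation([0.1, 0.4, -0.2, 0.7],
                                [0.3, -0.5, 0.9, 0.2], params, relations)
print(label, probs)
```

Because the weights are random, the predicted label carries no meaning here; in training, the parameters would be adjusted from the error between the output relation and the actual relation, as the preceding paragraphs describe.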

In the multi-source heterogeneous network security knowledge graph construction method provided by the embodiment of the invention, the computer device responds to a trigger request for constructing a network security knowledge graph and, according to the relationships between entities defined by a preset network security ontology, extracts the matching entities and the relationships between them from the semi-structured data set and the structured data set in the collected network security domain data set to generate structured RDF triples. The network security ontology is constructed according to the relevant network security standards of the network security domain and defines the entities in the network security domain and the relationships between them. The computer device further identifies, from the unstructured data set in the network security domain data set, the entities that match the entities defined by the network security ontology, using a preset identification manner for each category of entity, where the data in the unstructured data set is text data. The text data is input into a word vector recognition model to obtain the word vectors of the entities identified from the unstructured data set. Entity pairs are then selected according to the character interval between every two adjacent entities identified from the unstructured data set, each selected entity pair and the corresponding word vectors are input into a relation extraction model to obtain the relationship between each entity pair, and structured RDF triples fusing the word vectors of the entities are generated according to each entity pair, the corresponding word vectors, and the relationship between each entity pair. Finally, the network security knowledge graph is constructed according to the structured RDF triples fusing the word vectors of the entities and the structured RDF triples.

Compared with the prior art, in the embodiment of the invention, when entities are identified from the unstructured data set, different identification manners are adopted for different categories of entities to identify the entities matching the network security domain entities defined by the network security ontology, so that entities are identified more accurately and comprehensively. The word vectors of the entities are further obtained, and the correspondence between entities and word vectors is established. The influence of the character interval between adjacent entities on the selection of entity pairs is also considered, so that more effective entity pairs are selected, and the selected entity pairs and their word vectors are used as the input of the relation extraction model to extract the relationships between the entity pairs.

Based on the same inventive concept, the embodiment of the invention also provides a multi-source heterogeneous network security knowledge graph construction device, and because the principle of solving the problem of the multi-source heterogeneous network security knowledge graph construction device is similar to that of the multi-source heterogeneous network security knowledge graph construction method, the implementation of the device can be referred to the implementation of the method, and the repetition is omitted.

Fig. 14 is a schematic structural diagram of a multi-source heterogeneous network security knowledge graph construction device according to an embodiment of the present invention, which may include:

a first generating unit 61, configured to respond to a trigger request for constructing a network security knowledge graph and, according to the relationships between entities defined by a preset network security ontology, extract the matching entities and the relationships between them from the semi-structured data set and the structured data set in a collected network security domain data set to generate structured Resource Description Framework (RDF) triples, where the network security ontology is constructed according to the relevant network security standards of the network security domain and defines the entities in the network security domain and the relationships between them;

the entity identifying unit 62 is configured to identify, according to different types of entities, entities that match with the entities defined by the network security ontology from unstructured data sets in the network security domain data set according to a preset identification manner, where data in the unstructured data sets are text data;

an obtaining unit 63, configured to input the text data into a word vector recognition model, and obtain word vectors of entities recognized from the unstructured dataset;

a second generating unit 64, configured to select entity pairs according to the character interval between every two adjacent entities identified from the unstructured dataset, input each selected entity pair and the corresponding word vectors into a relation extraction model to obtain the relationship between each entity pair, and generate structured RDF triples fusing the word vectors of the entities according to each entity pair, the corresponding word vectors, and the relationship between each entity pair;

a knowledge graph construction unit 65, configured to construct the network security knowledge graph according to the structured RDF triples fusing the word vectors of the entities and the structured RDF triples.

Preferably, the entity identifying unit 62 is specifically configured to identify, from the text data, a matched entity of a preset category according to a preset regular expression; and identifying the entities except the entities of the preset category from the text data by using an entity identification model, wherein the entity identification model is obtained by performing sequence labeling on the entities in the network security field in a sample set and then training according to a preset training model.
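The two-track recognition described above, regular expressions for preset categories and an entity recognition model for everything else, can be sketched as follows. The choice of CVE identifiers and IPv4 addresses as the preset categories, the tuple layout, the example sentence, and the dictionary-lookup `toy_model` standing in for a trained sequence-labeling model are all hypothetical; the passage above does not name the preset categories or the model interface.

```python
import re

# Assumption: two categories whose surface form is regular enough for a regex.
PRESET_PATTERNS = {
    "CVE": re.compile(r"\bCVE-\d{4}-\d{4,}\b"),
    "IPv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}

def recognise_entities(text, model=None):
    """Return (name, category, start, end) tuples sorted by position."""
    entities = []
    for category, pattern in PRESET_PATTERNS.items():
        for m in pattern.finditer(text):
            entities.append((m.group(), category, m.start(), m.end()))
    if model is not None:
        covered = [(s, e) for _, _, s, e in entities]
        for name, category, start, end in model(text):
            # Keep model output only where no regex match already covers the span.
            if not any(start < ce and cs < end for cs, ce in covered):
                entities.append((name, category, start, end))
    return sorted(entities, key=lambda t: t[2])

def toy_model(text):
    """Hypothetical stand-in for a trained entity recognition model."""
    found = []
    for name, category in [("denial of service", "vulnerability"),
                           ("Google Chrome", "software")]:
        idx = text.find(name)
        if idx != -1:
            found.append((name, category, idx, idx + len(name)))
    return found

# Made-up sentence for illustration only.
text = ("CVE-2020-12345 lets an attacker at 10.0.0.5 cause "
        "denial of service in Google Chrome.")
entities = recognise_entities(text, toy_model)
names = [e[0] for e in entities]
print(names)
```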

Preferably, the second generating unit 64 is specifically configured to determine the number of characters between every two adjacent entities identified from the unstructured dataset; delete the entity pairs whose number of interval characters is larger than a preset threshold value and which do not accord with the relation defined by the network security ontology; and determine the remaining entity pairs to be the selected entity pairs.
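The entity-pair selection can be sketched as a filter over adjacent entities. The wording above leaves open whether the two deletion conditions must hold together or are applied independently; this sketch assumes that a pair is dropped if either condition holds. The tuple layout, the `allowed_category_pairs` set standing in for the ontology's relation definitions, the threshold of 50 characters, and the character offsets are all hypothetical.

```python
def select_entity_pairs(entities, allowed_category_pairs, max_gap=50):
    """Select adjacent entity pairs.

    entities: list of (name, category, start, end), sorted by start offset.
    allowed_category_pairs: category pairs for which the ontology defines
        a relation (hypothetical stand-in for the ontology lookup).
    """
    selected = []
    for left, right in zip(entities, entities[1:]):
        gap = right[2] - left[3]  # characters between the two mentions
        cats = (left[1], right[1])
        allowed = (cats in allowed_category_pairs
                   or cats[::-1] in allowed_category_pairs)
        # Assumed reading: drop the pair if it is too far apart OR not allowed.
        if gap > max_gap or not allowed:
            continue
        selected.append((left[0], right[0]))
    return selected

# Hypothetical entities with made-up character offsets.
entities = [
    ("Google Chrome", "software", 0, 13),
    ("denial of service", "vulnerability", 30, 47),
    ("10.0.0.5", "ip", 200, 208),
]
allowed = {("software", "vulnerability")}

pairs = select_entity_pairs(entities, allowed, max_gap=50)
print(pairs)
```

Here the first pair is kept (gap of 17 characters, relation defined), while the second is dropped because its gap of 153 characters exceeds the threshold.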

Optionally, the apparatus further comprises:

Based on the same technical concept, the embodiment of the present invention further provides a computer device 700, referring to fig. 15, where the computer device 700 is configured to implement the method for constructing a multi-source heterogeneous network security knowledge graph described in the embodiment of the method, and the computer device 700 of this embodiment may include: memory 701, processor 702, and a computer program stored in the memory and executable on the processor, such as a multi-source heterogeneous network security knowledge graph construction program. The steps in the above embodiments of the method for constructing a secure knowledge graph of a multi-source heterogeneous network are implemented when the processor executes the computer program, for example, step S11 shown in fig. 2. Alternatively, the processor, when executing the computer program, performs the functions of the modules/units of the apparatus embodiments described above, e.g. 61.

The specific connection medium between the memory 701 and the processor 702 is not limited in the embodiment of the present invention. In the embodiment of the present application, the memory 701 and the processor 702 are connected by the bus 703 in fig. 15, the bus 703 is shown by a thick line in fig. 15, and the connection manner between other components is only schematically illustrated, but not limited thereto. The bus 703 may be classified into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in fig. 15, but not only one bus or one type of bus.

The memory 701 may be a volatile memory, such as a random-access memory (RAM); the memory 701 may also be a non-volatile memory, such as a read-only memory, a flash memory, a hard disk drive (HDD) or a solid state drive (SSD), or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 701 may also be a combination of the above memories.

The processor 702 is configured to implement a method for constructing a secure knowledge graph of a multi-source heterogeneous network as shown in fig. 2, including:

the processor 702 is configured to call a computer program stored in the memory 701 to execute steps S11 to S15 shown in fig. 1.

The embodiment of the application also provides a computer-readable storage medium, which stores the computer-executable instructions to be executed by the above processor and contains the program to be executed by the processor.

In some possible embodiments, aspects of the method for constructing a multi-source heterogeneous network security knowledge graph provided by the present invention may also be implemented in the form of a program product, which includes a program code for causing a computer device to perform the steps in the method for constructing a multi-source heterogeneous network security knowledge graph according to the various exemplary embodiments of the present invention described above when the program product is run on the computer device, for example, the computer device may perform the steps S11 to S15 as shown in fig. 1.

It will be apparent to those skilled in the art that embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (devices), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the invention.

It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims

1. A multi-source heterogeneous network security knowledge graph construction method, characterized by comprising the following steps:

2. The method according to claim 1, wherein identifying the entity matching the entity of the network security domain from the unstructured dataset according to different categories of entities in a preset identification manner, specifically comprises:

3. The method according to claim 1 or 2, wherein selecting an entity pair based on a character spacing between every two adjacent entities identified from the unstructured dataset, in particular comprises:

The remaining entity pairs are determined to be the selected entity pairs.

4. The method of claim 1, wherein the network security ontology comprises N layers, the first-layer ontology being a network security top-level ontology and each subsequent layer of ontology being a further classification of the layer above it, the network security top-level ontology comprising at least the following entities: Common Vulnerabilities and Exposures (CVE), Common Weakness Enumeration (CWE), China National Vulnerability Database of Information Security (CNNVD), Common Platform Enumeration (CPE), Common Attack Pattern Enumeration and Classification (CAPEC), Adversarial Tactics, Techniques, and Common Knowledge (ATT&CK), Structured Threat Information Expression (STIX), and malware (MALWARE).

5. The method as recited in claim 1, further comprising:

6. The method of claim 5, wherein querying the entity matching the entity to be queried in the network security knowledge-graph according to the word vector of the entity to be queried, specifically comprising:

determining the category of the entity to be queried;

7. The method as recited in claim 1, further comprising:

8. A multi-source heterogeneous network security knowledge graph construction device, characterized by comprising:

9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the multi-source heterogeneous network security knowledge graph construction method of any of claims 1-7 when the program is executed by the processor.

10. A computer-readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the steps in the multi-source heterogeneous network security knowledge graph construction method according to any one of claims 1 to 7.