CN112364173B

CN112364173B - IP address mechanism tracing method based on knowledge graph

Info

Publication number: CN112364173B
Application number: CN202011130373.2A
Authority: CN
Inventors: 周玉金; 孙治; 张志勇; 刘方; 陈剑锋
Original assignee: China Electronic Technology Cyber Security Co Ltd
Current assignee: China Electronic Technology Cyber Security Co Ltd
Priority date: 2020-10-21
Filing date: 2020-10-21
Publication date: 2022-03-18
Anticipated expiration: 2040-10-21
Also published as: CN112364173A

Abstract

The invention relates to the technical field of information, and in particular to a knowledge-graph-based method for tracing the organization to which an IP address belongs. Security information in cyberspace is isolated and dispersed, and IP-to-domain-name mappings carry no information about the owning organization. Starting from practical application, the invention closes the gap between IP addresses, domain names, and organizations, so that more comprehensive security information can be collected for network defense, monitoring, and similar tasks. Candidate organization names with higher probability are screened effectively from disordered search results, the best inference of the owning organization is obtained efficiently from the IP address, and the accuracy of inferring the owning organization from an IP address in the cyberspace security field is ensured.

Description

IP address mechanism tracing method based on knowledge graph

Technical Field

The invention relates to the technical field of network information, in particular to an IP address mechanism tracing method based on a knowledge graph.

Background

Information technology is developing ever more rapidly, and the threat sources and attack methods in network security change constantly. Network security personnel must therefore improve their ability to perceive security information, sense it from all directions, and discover and master more of it in time, so that they can seize the initiative in cyberspace attack and defense and "know both themselves and the adversary".

At present, in the field of network security, most network security information is isolated and dispersed, so that security personnel can hardly utilize the security information effectively to solve practical problems.

Prior patents related to the field of cyberspace security, including methods that solve practical security problems based on a knowledge graph, include the following.

A malicious domain name detection method based on a knowledge graph (publication number: CN110290116A) detects malicious domain names using a knowledge graph and offers a new perspective on malicious domain name detection. However, it considers only the domain name dimension of cyberspace and does not comprehensively consider the correlations among domain name information, IP addresses, network assets, and other dimensions of security information. IP addresses, domain names, and other security information in cyberspace usually contain rich hidden information and knowledge; mining this information more deeply allows practical security problems to be solved better.

A malicious domain name matching method based on DNS-mapped IP addresses (publication number: CN108737385A) maps known malicious domain names to their IP addresses in real time through DNS and matches access to the malicious domain names against full IP traffic. However, the scheme only resolves domain names through DNS and matches access behavior on the traffic of the resulting IP addresses; solving other security problems from the IP-to-domain-name mapping requires further mining of the hidden information between IP addresses and domain names.

A distributed security event correlation analysis method based on a knowledge graph (publication number: CN108270785A) constructs a network security knowledge graph with a basic dimension, a vulnerability dimension, a threat dimension, an alarm event dimension, and an attack rule dimension. It correlates security information across the dimensions of cyberspace security resources and analyzes security threat events from a macroscopic view, but it does not address the solution of concrete security problems; in practical application, different algorithm flows must still be designed on the network security knowledge graph to solve them.

The following disadvantages exist in the current network space security field:

(1) IP address information in a network space cannot be directly mapped to an organization to which the IP address information belongs;

(2) the search results are complicated and disordered, and the optimal inference result is difficult to obtain;

(3) the clustering result is not accurate, and a great clustering error exists.

Therefore, the existing field of network information security lacks a practical and effective method for analyzing and obtaining useful security information from the network and applying it to practical problems. As a result, real-world problems are solved less efficiently, and reliability is not effectively improved.

Disclosure of Invention

To overcome the above defects of the prior art, the invention provides a knowledge-graph-based method for tracing the organization behind an IP address. Starting from practical needs, it takes an IP address detected in cyberspace and infers the organization to which the address belongs by combining search, clustering, and a knowledge graph, thereby mining the deeper information hidden in the IP address. This is of great significance for attack and defense, monitoring, and similar tasks in cyberspace security.

To achieve this purpose, the invention adopts the following technical scheme:

A knowledge-graph-based method for tracing the organization behind an IP address comprises the following steps:

acquiring domain name information: for a valid IP address whose organization is to be inferred, obtaining the domain name information corresponding to the IP address through reverse DNS resolution;

acquiring domain name key information: truncating and screening the domain name information to obtain the key information in it;

obtaining analysis samples: searching with the domain name key information to obtain a plurality of web pages, ranking and screening the web pages, and retaining the selected web pages as analysis samples;

sample processing: parsing the analysis samples and obtaining the text content in them;

entity extraction: performing entity extraction on the text content with a named entity recognition model, and identifying all organization names in the text content;

weight calculation: calculating and assigning weights to the organization names according to the ranking of the analysis samples and the tag elements of the web page in which the text content is located;

entity clustering: clustering the entities by the edit distance between organization names, while using a prior knowledge graph to constrain and guide the clustering, to obtain organization name sets of multiple categories;

organization name selection: calculating the weight of each category's organization name set as a weighted average of the weights of its organization names, selecting the set with the largest weight, and taking the organization name with the largest weight in that set as the tracing inference result.

By reverse-resolving the IP address and further processing the obtained domain name information, this tracing method extracts useful information from the tags and text of the retrieved web pages for in-depth analysis, screens out possible target organizations, and, after the calculations above, selects the most likely target organization as the inference result, completing the tracing.

Further, the domain name information may be processed in various ways; one feasible and preferred way is as follows: in the step of acquiring domain name key information, the domain name information is processed with a regular expression that removes its prefix and suffix, leaving the key information in the domain name.

Furthermore, the search usually returns a large number of web pages containing a great deal of noise, so about 20 of the most relevant web pages can be selected as analysis samples. When the analysis samples are processed, the web page tags are removed with regular expressions and web page document processing rules to obtain the text content.

Furthermore, every analysis sample influences the final inference result, but not to the same degree, so the influence of each analysis sample is differentiated during inference. One feasible approach is to calculate and assign weights to the organization names according to the ranking order of the analysis samples: one weight factor is calculated from the ranking of the web pages as follows,

where ω_i is the weight of the analysis sample ranked i-th.

Furthermore, the weight of an analysis sample comprises two weight factors. One is the weight of the web page among all analysis samples, which depends directly on its rank among them. The other depends on the tag elements in each web page: the organization names are weighted according to the tag element in which the text content is located, and each web page tag element is assigned a semantic weight ω_j ∈ [0, 10] according to its contribution to the topic of the web page document, so that the semantic weight of the tag element reflects the importance of the text content to the search result.

Further, from the weight of the web page and the semantic weight of the web page tag element, the weight of an organization name is calculated as follows

where W_ij is the weight of organization name j in web page i; ω_i is the weight factor of web page i; ω_j is the semantic weight of organization name j; h_j ∈ [0, 4] is the level, in the tag element tree constructed from the element tags, at which organization name j is located; tf_ij is the term frequency of organization name j in web page i; and idf_j is the inverse document frequency of organization name j,

where N_i is the total number of organization name terms found in web page i, and n_ij is the number of terms in web page i that contain organization name j.

Further, before entity clustering, at least the alias relations and the superior-subordinate affiliation relations of the candidate organization names are looked up in the prior knowledge graph to form candidate organization entity pairs and triples, and the head entity and tail entity of each triple are taken to form an entity set {(h_i, t_i)}.

Further, for two entities (h_i, t_i) in the knowledge graph that satisfy the triple, each path between them is taken as a feature and its feature value is calculated, forming the feature vector of the triple.

Specifically, the feature value of a path may be calculated as follows

where P = R₁…R_l denotes a path, s is the start node, e is the end node, e' is an intermediate node, and h_{s,P'}(e') is the probability that the entity pair (s, e') can be connected through the path P' under relation type R_l, where

denotes the probability of a random walk from node e' to the end node e under relation type R_l, and indicates whether the relation R_l holds between the entity pair (e', e).
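The symbols defined here match the recursion used by the standard Path Ranking Algorithm (PRA), which the detailed description names. As a hedged reconstruction, assuming the formulas follow that standard form, with P' = R₁…R_{l-1} denoting the path P without its last relation and R_l(e', e) equal to 1 when the relation holds and 0 otherwise:

```latex
h_{s,P}(e) = \sum_{e' \in \operatorname{range}(P')} h_{s,P'}(e') \cdot P(e \mid e'; R_l),
\qquad
P(e \mid e'; R_l) = \frac{R_l(e', e)}{\lvert R_l(e', \cdot) \rvert}
```

Under the same assumption, the base case for the empty path is h_{s,P}(e) = 1 if e = s and 0 otherwise, and |R_l(e', ·)| is the number of nodes reachable from e' through relation R_l.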

After the path feature values of each feature vector have been calculated, a logistic regression classifier is trained on the feature vectors, and whether a relation of the specified type exists between two entities is judged from the path-predicted probability between them. When the predicted probability is high, the two entities are likely to have a specified relation such as alias or superior-subordinate organization, and they are placed in the same organization name set as an entity pair; when the predicted probability is low, such a relation is unlikely, and the two entities should be placed in different organization name sets.
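The path features and the classifier score can be illustrated with a minimal sketch. It assumes the standard PRA random walk; the toy triples, the relation names, the inverse-edge convention (`^-1`), and the helper names are all hypothetical, and the logistic weights would in practice come from training rather than being supplied by hand.

```python
import math
from collections import defaultdict

# Hypothetical toy knowledge graph as (head, relation, tail) triples.
TRIPLES = [
    ("Acme", "located_in", "Springfield"),
    ("Acme Corporation", "located_in", "Springfield"),
    ("Other Ltd", "located_in", "Springfield"),
    ("Acme", "has_website", "acme.example"),
    ("Acme Corporation", "has_website", "acme.example"),
]
# Each path is a sequence of relation types; "^-1" marks an inverse edge.
PATHS = [
    ["located_in", "located_in^-1"],
    ["has_website", "has_website^-1"],
]


def build_index(triples):
    """Index triples by node and relation, adding inverse edges."""
    index = defaultdict(lambda: defaultdict(list))
    for head, rel, tail in triples:
        index[head][rel].append(tail)
        index[tail][rel + "^-1"].append(head)
    return index


def path_distribution(index, start, path):
    """Return {e: probability of reaching e from start along path}."""
    dist = {start: 1.0}
    for rel in path:
        nxt = defaultdict(float)
        for node, prob in dist.items():
            targets = index.get(node, {}).get(rel, [])
            for target in targets:
                # Uniform random walk over the neighbours under this relation.
                nxt[target] += prob / len(targets)
        dist = dict(nxt)
    return dist


def feature_vector(index, head, tail, paths):
    """One feature per path: probability that the walk from head ends at tail."""
    return [path_distribution(index, head, p).get(tail, 0.0) for p in paths]


def relation_probability(features, weights, bias):
    """Logistic score for the pair; weights and bias are assumed to be trained."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))
```

With this toy graph, `feature_vector(build_index(TRIPLES), "Acme", "Acme Corporation", PATHS)` yields one value per path, and a trained classifier would turn that vector into the probability that the two names are aliases.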

Compared with the prior art, the invention has the beneficial effects that:

(1) Addressing the isolation and dispersion of security information in cyberspace, and starting from practical application, the method reasonably infers the owning organization from a valid IP address. This solves the problem that IP-to-domain-name mappings in cyberspace security carry no information about the owning organization, closes the gap between IP addresses, domain names, organizations, and other network security information, and collects more comprehensive security information for network attack and defense, monitoring, and similar tasks.

(2) By setting reasonable weight factors for web pages and their tag elements, the weight of a target term in the network information is calculated systematically, candidate organization names with higher probability are screened effectively from disordered search results, and the best inference of the owning organization is obtained efficiently from the IP address.

(3) By exploiting the relational connectivity of the network security knowledge graph, the clustering precision of the search results is improved, and the accuracy of inferring the owning organization from an IP address in the cyberspace security field is ensured.

Drawings

To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly described below. The following drawings show only some embodiments of the present invention and should therefore not be regarded as limiting its scope; those skilled in the art can obtain other related drawings from them without inventive effort.

FIG. 1 is a schematic flow chart of the practice of the present invention according to an embodiment.

FIG. 2 is a tree diagram of web page tag elements.

Detailed Description

The invention is further explained below with reference to the drawings and the specific embodiments.

It should be noted that the description of the embodiments is provided to help understanding of the present invention, but the present invention is not limited thereto. Specific structural and functional details disclosed herein are merely illustrative of example embodiments of the invention. This invention may, however, be embodied in many alternate forms and should not be construed as limited to the embodiments set forth herein.

Examples

To address the information security problems in cyberspace found in the prior art, in particular the difficulty of tracing the organization to which security information belongs when that information is isolated and dispersed, this embodiment provides a tracing method that finds usable information from the hidden information contained in an IP address and calculates and infers the most likely organization.

Specifically, as shown in FIG. 1, this embodiment discloses a knowledge-graph-based method for tracing the organization behind an IP address, which comprises:

S01, acquiring domain name information: for a valid IP address whose organization is to be inferred, obtaining the domain name information corresponding to the IP address through reverse DNS resolution;

S02, acquiring domain name key information: truncating and screening the domain name information to obtain the key information in it;

S03, obtaining analysis samples: searching with the domain name key information to obtain a plurality of web pages, crawling the web pages with a web crawler, ranking and screening them, and retaining the selected web pages as analysis samples;

S04, sample processing: parsing the analysis samples and obtaining the text content in them;

S05, entity extraction: performing entity extraction on the text content with a named entity recognition model, and identifying all organization names in the text content;

S06, weight calculation: calculating and assigning weights to the organization names according to the ranking of the analysis samples and the tag elements of the web page in which the text content is located;

S07, entity clustering: clustering the entities by the edit distance between organization names, while using a prior knowledge graph to constrain and guide the clustering, to obtain organization name sets of multiple categories;

S08, organization name selection: calculating the weight of each category's organization name set as a weighted average of the weights of its organization names, selecting the set with the largest weight, and taking the organization name with the largest weight in that set as the tracing inference result.
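Step S01 can be illustrated with a short sketch of a reverse DNS lookup. It assumes Python's standard `socket.gethostbyaddr` as the resolver; the function name `reverse_lookup` and the injectable `resolver` parameter are hypothetical conveniences that allow the lookup to be tried without network access.

```python
import socket
from typing import Callable, Optional


def reverse_lookup(
    ip: str,
    resolver: Callable[[str], tuple] = socket.gethostbyaddr,
) -> Optional[str]:
    """Return the primary host name for ip, or None if it does not resolve."""
    try:
        # gethostbyaddr returns (hostname, alias_list, ip_address_list).
        hostname, _aliases, _addresses = resolver(ip)
    except OSError:
        # socket.herror and socket.gaierror are both subclasses of OSError.
        return None
    # Normalize: drop a trailing root dot and lower-case the name.
    return hostname.rstrip(".").lower() or None
```

An address with no PTR record simply yields `None`, in which case the later steps have no domain name to work from.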

In practice, the domain name information may be processed in various ways; this embodiment uses one feasible and preferred way: in the step of acquiring domain name key information, the domain name information is processed with a regular expression that removes its prefix and suffix, leaving the key information in the domain name.
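A minimal sketch of such prefix and suffix removal follows. The prefix list, the suffix pattern, and the function name `domain_key` are assumptions made for illustration; the text does not specify the actual regular expressions, and a real deployment would need a fuller suffix list.

```python
import re

# Hypothetical host prefixes such as "www." or "mail." at the start of the name.
_PREFIX = re.compile(r"^(?:www\d*|mail|ftp|ns\d*|mx\d*)\.", re.IGNORECASE)
# Hypothetical suffix pattern: an optional second-level label (com, edu, ...)
# followed by a final top-level label of two or more letters.
_SUFFIX = re.compile(
    r"(?:\.(?:com|net|org|edu|gov|ac|co))?\.[a-z]{2,}$", re.IGNORECASE
)


def domain_key(hostname: str) -> str:
    """Strip the assumed prefix and suffix and return the remaining key label."""
    name = hostname.strip().rstrip(".").lower()
    name = _PREFIX.sub("", name)
    name = _SUFFIX.sub("", name)
    # If several labels remain, keep the right-most one as the key.
    return name.rsplit(".", 1)[-1]
```

For a host name such as `www.example.com` the sketch keeps `example`, which then serves as the search term for the next step.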

When analysis samples are obtained with this method, the search usually returns a large number of web pages containing a great deal of noise, so about 20 of the most relevant web pages can be selected as analysis samples. When the analysis samples are processed, the web page tags are removed with regular expressions and web page document processing rules to obtain the text content.
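As a hedged illustration of the tag removal, the sketch below uses Python's standard `html.parser` in place of the regular expressions and document processing rules, which the text does not detail. It keeps the enclosing tag of each text fragment, since the later weighting depends on it; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect (tag, text) pairs, skipping script and style content."""

    SKIP = {"script", "style"}
    # Void elements have no closing tag, so they are never pushed on the stack.
    VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
            "link", "meta", "source", "track", "wbr"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._stack = []
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self._stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed elements.
        if tag in self._stack:
            while self._stack and self._stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and not (self.SKIP & set(self._stack)):
            self.chunks.append((self._stack[-1] if self._stack else "", text))


def extract_text(html):
    """Return the visible text of an HTML document as (tag, text) pairs."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

Keeping the tag alongside each fragment is a design choice for this sketch: it lets the same pass feed both the entity extraction and the tag-based semantic weighting.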

After the samples are obtained, every analysis sample influences the final inference result, but not to the same degree, so the influence of each analysis sample is differentiated during inference. One feasible approach is to calculate and assign weights to the organization names according to the ranking order of the analysis samples: one weight factor is calculated from the ranking of the web pages as follows

where ω_i is the weight of the analysis sample ranked i-th. The selected analysis samples are ranked, and the rank determines each sample's weight. In general, web page documents are ranked from first to last by the relevance of their content to the search terms, and a document from a well-known site such as a wiki or Baidu is moved up appropriately. The higher a web page ranks, the greater the analytical value of its content and the larger its document weight ω_i.

In the method of this embodiment, the weight of an analysis sample comprises two weight factors. One is the weight of the web page itself among all analysis samples, which depends directly on its rank among them and has been described above. The other depends on the tag elements in each web page: the organization names are weighted according to the tag element in which the text content is located, and each web page tag element is assigned a semantic weight ω_j ∈ [0, 10] according to its contribution to the topic of the web page document, so that the semantic weight of the tag element reflects the importance of the text content to the search result.

Terms contained in tag elements that contribute strongly to the document topic better reflect the topic of the web page. For example, the text in the <title> tag element inside the <head> tag of a web page document reflects the document topic better than terms appearing in <h1> or <p> inside <body>, and terms in the <h1> and <h2> tags inside <body> contribute more to the document topic than terms in <p>, <li>, and similar tags. By analogy, a web page tag element tree is constructed according to the importance of each tag element to the document, as shown in FIG. 2. Following the division of the tag element tree, the semantic weights of the different web page tag elements are assigned as follows

According to the semi-structured definition format of XML documents, the HTML tag elements are modeled as a tree: each tag element is a node, the levels are set by the element/sub-element containment relation, and levels are then raised or lowered as appropriate according to the importance of each HTML tag element.

In actual calculation, if a candidate organization name A appears in the <title> tag element of the <head>, the page is more likely to be the home page of organization A, and h_j then takes the value 2.

From the weight of the web page and the semantic weight of the web page tag element, for each term found in web page i that relates to an organization name, the weight of the organization name can be calculated by the following formula according to how often the term appears in the different tag elements and in the web page as a whole

where W_ij is the weight of organization name j in web page i; ω_i is the weight factor of web page i; ω_j is the semantic weight of organization name j; h_j ∈ [0, 4] is the level of the tag element tree at which organization name j is located; tf_ij is the term frequency of organization name j in web page i; and idf_j is the inverse document frequency of organization name j.
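The quantities defined above can be combined in many ways, and the exact formula is not given in this text. Purely as an illustration, the sketch below assumes a simple multiplicative combination W_ij = ω_i · ω_j · tf_ij · idf_j, with tf_ij = n_ij / N_i and a smoothed inverse document frequency. The tree level h_j is left out because the way it enters the formula is not stated, and all function names are hypothetical.

```python
import math
from collections import Counter


def term_frequency(names_in_page):
    """tf_ij = n_ij / N_i for every organization name found in one page."""
    counts = Counter(names_in_page)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def inverse_document_frequency(pages):
    """Assumed smoothed idf_j = log((1 + D) / (1 + d_j)) + 1 over D pages."""
    doc_freq = Counter(name for page in pages for name in set(page))
    n_pages = len(pages)
    return {
        name: math.log((1 + n_pages) / (1 + df)) + 1
        for name, df in doc_freq.items()
    }


def name_weights(pages, page_weights, semantic_weights):
    """Assumed combination: W_ij = w_i * w_j * tf_ij * idf_j.

    pages: one list of extracted organization names per web page.
    page_weights: the rank-based weight of each page, in the same order.
    semantic_weights: tag-based semantic weight per name (default 1.0).
    """
    idf = inverse_document_frequency(pages)
    result = []
    for page, w_i in zip(pages, page_weights):
        tf = term_frequency(page)
        result.append({
            name: w_i * semantic_weights.get(name, 1.0) * tf[name] * idf[name]
            for name in tf
        })
    return result
```

A name that appears often in a highly ranked page, inside a tag with a large semantic weight, ends up with the largest weight under this assumed combination.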

Entity clustering helps determine accurately the organization to which the security information belongs. Before entity clustering, at least the alias relations and superior-subordinate affiliation relations of the candidate organization names are looked up in the prior knowledge graph to form candidate organization entity pairs and triples, and the head entity and tail entity of each triple are taken to form an entity set {(h_i, t_i)}. In this embodiment, the Path Ranking Algorithm (PRA) is used to search for entities that have a specified relation. The main idea of PRA is to judge whether a certain type of relation exists from the different paths that connect entities in the knowledge graph.

Specifically, for two entities (h_i, t_i) in the knowledge graph that satisfy the triple, each path between them is taken as a feature and its feature value is calculated, forming the feature vector of the triple.

Specifically, the feature value of a path may be calculated as follows

where P = R₁…R_l denotes a path, s is the start node, e is the end node, e' is an intermediate node, and h_{s,P'}(e') is the probability that the entity pair (s, e') can be connected through the path P' under relation type R_l.

Integrating the prior knowledge of the network security knowledge graph to infer the similarity between candidate organization entities solves the problem that candidate entities related by alias or by a superior-subordinate affiliation are mistakenly clustered into different categories. For example, NBA and National Basketball Association denote the same organization, but clustering by the edit distance between organization names alone would place them in different categories and make the result ambiguous. Here PRA is used to reason over the prior network security knowledge graph: it infers that NBA and National Basketball Association are aliases with high similarity, and this alias relation is used as prior knowledge to guide the clustering so that the two names fall into the same category.
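The guided clustering can be sketched as follows. The sketch assumes a normalized Levenshtein distance with a hypothetical threshold of 0.3 and a union-find merge, and it treats the knowledge-graph inference simply as a given set of linked name pairs; the text does not specify the clustering algorithm or threshold, so all of these are illustrative choices.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def cluster(names, linked_pairs, threshold=0.3):
    """Group unique names that are close in edit distance or linked by prior knowledge."""
    parent = {name: name for name in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            ratio = edit_distance(a.lower(), b.lower()) / max(len(a), len(b), 1)
            linked = (a, b) in linked_pairs or (b, a) in linked_pairs
            if ratio <= threshold or linked:
                parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return list(groups.values())
```

With `linked_pairs` containing `("NBA", "National Basketball Association")`, the two names land in one cluster even though their edit distance is large; with an empty set they stay apart.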

After the entity clustering of step S07, several categories of candidate organization name sets are obtained, each containing several candidate organization names, and the most reliable and most probable term must be selected as the organization name finally output. The weights of the candidate organization names calculated in step S06 are retained after clustering, and the weight of each category set is obtained as a weighted average of the weights of its candidate names. The set with the largest weight is selected, and the organization name with the largest weight in that set is output as the final inference result.
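The final selection can be sketched in a few lines. The averaging coefficients are not specified in the text, so this sketch assumes a plain arithmetic mean of the name weights in each set; the function name and the example names are hypothetical.

```python
def select_organization(clusters, weights):
    """Pick the heaviest cluster, then the heaviest name inside it.

    clusters: list of non-empty lists of organization names.
    weights: accumulated weight per organization name.
    """
    def mean_weight(group):
        # Assumption: every name contributes equally to the set weight.
        return sum(weights[name] for name in group) / len(group)

    best_cluster = max(clusters, key=mean_weight)
    return max(best_cluster, key=lambda name: weights[name])
```

For clusters `[["Acme", "Acme Corporation"], ["Other Ltd"]]` with weights 6.0, 9.0, and 7.0 respectively, the first set wins on its mean and `Acme Corporation` is returned.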

The present invention is not limited to the alternative embodiments described above. Those skilled in the art can derive various other embodiments from them, in any combination and in various forms, while remaining within the spirit of the present invention. The above detailed description should not be taken as limiting the scope of protection of the invention, which is defined by the claims; the description may be used to interpret the claims.

Claims

1. A knowledge-graph-based method for tracing the organization behind an IP address, characterized by comprising the following steps:

entity clustering: clustering the entities by the edit distance between organization names, while using a prior knowledge graph to constrain and guide the clustering, to obtain organization name sets of multiple categories; comprising: S1, before entity clustering, at least looking up in the prior knowledge graph the alias relations and superior-subordinate affiliation relations of the candidate organization names to form candidate organization entity pairs and triples, and taking the head entity and tail entity of each triple to form an entity set {(h_i, t_i)}; S2, for two entities (h_i, t_i) in the knowledge graph that satisfy the triple, taking each path between them as a feature and calculating its feature value, so as to form the feature vector of the triple; S3, calculating the feature value of a path as follows

where P = R₁…R_l denotes a path, s is the start node, e is the end node, e' is an intermediate node, and h_{s,P'}(e') is the probability that the entity pair (s, e') can be connected through the path P' under relation type R_l, where

denotes the probability of a random walk from node e' to the end node e under relation type R_l, and indicates whether the relation R_l holds between the entity pair (e', e);

2. The knowledge-graph-based IP address organization tracing method of claim 1, wherein: in the step of acquiring domain name key information, the domain name information is processed with a regular expression that removes its prefix and suffix, leaving the key information in the domain name.

3. The knowledge-graph-based IP address organization tracing method of claim 1, wherein: when the analysis samples are processed, the web page tags are removed with regular expressions and web page document processing rules to obtain the text content.

4. The knowledge-graph-based IP address organization tracing method of claim 1, wherein the organization names are weighted according to the ranking order of the analysis samples, characterized in that: the weight is calculated from the ranking of the web pages, one of the weight factors being calculated as follows,

where ω_i is the weight of the analysis sample ranked i-th.

5. The knowledge-graph-based IP address organization tracing method of claim 4, wherein the organization names are weighted according to the tag element of the web page in which the text content is located, characterized in that: each web page tag element is assigned a semantic weight ω_j ∈ [0, 10] according to its contribution to the topic of the web page document, and the semantic weight of the tag element reflects the importance of the text content to the search result.

6. The knowledge-graph-based IP address organization tracing method of claim 5, wherein: from the weight of the web page and the semantic weight of the web page tag element, the weight of an organization name is calculated as follows

7. The knowledge-graph-based IP address organization tracing method of claim 1, wherein: a logistic regression classifier is trained on the feature vectors to judge, from the path-predicted probability between two entities, whether a relation of the specified type exists between them.