CN112364173B - IP address mechanism tracing method based on knowledge graph - Google Patents

IP address mechanism tracing method based on knowledge graph

Info

Publication number
CN112364173B
Authority
CN
China
Prior art keywords
organization
name
weight
domain name
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011130373.2A
Other languages
Chinese (zh)
Other versions
CN112364173A (en)
Inventor
周玉金
孙治
张志勇
刘方
陈剑锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Electronic Technology Cyber Security Co Ltd
Original Assignee
China Electronic Technology Cyber Security Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Electronic Technology Cyber Security Co Ltd
Priority to CN202011130373.2A
Publication of CN112364173A
Application granted
Publication of CN112364173B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L61/00 Network arrangements, protocols or services for addressing or naming
    • H04L61/45 Network directories; Name-to-address mapping
    • H04L61/4505 Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols
    • H04L61/4511 Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols using domain name system [DNS]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/12 Applying verification of the received information
    • H04L63/126 Applying verification of the received information the source of the received data

Abstract

The invention relates to the field of information technology, and in particular to a knowledge-graph-based method for tracing an IP address to the organization it belongs to. Security information in cyberspace is isolated and scattered. Starting from practical application, the invention addresses the problem that IP-to-domain-name mappings in cyberspace security carry no information about the owning organization, closes the gap between IP addresses, domain names, organizations and other network security information, and gathers more comprehensive security information for network defense, monitoring and similar tasks. Candidate organization names with a higher probability are effectively screened out of disordered search results, and the best inference of the owning organization is obtained efficiently from the IP address. The accuracy of inferring the owning organization from an IP address in the field of cyberspace security is thereby ensured.

Description

IP address mechanism tracing method based on knowledge graph
Technical Field
The invention relates to the technical field of network information, and in particular to a knowledge-graph-based method for tracing an IP address to the organization it belongs to.
Background
Information technology is developing ever more rapidly, and the threat sources and attack methods in network security change constantly. Network security personnel therefore need to improve their ability to perceive network security information: to sense security information from all directions and to discover and master more of it in time, so that they can take the initiative in cyberspace attack-and-defense contests and "know both themselves and the adversary".
At present, most information in the field of network security is isolated and scattered, so security personnel can hardly use it effectively to solve practical problems.
Existing patents in the field of cyberspace security that address practical security problems, several of them based on knowledge graphs, include the following.
A knowledge-graph-based malicious domain name detection method (publication number: CN110290116A) detects malicious domain names using a knowledge graph and offers a new perspective on such detection. However, it considers only the domain name dimension of cyberspace and does not comprehensively consider the correlations among domain name information, IP addresses, network assets and other dimensions of security information. IP addresses, domain names and other security information in cyberspace usually contain rich hidden information and knowledge, and mining this information more deeply would help solve practical security problems.
A malicious domain name matching method based on DNS-mapped IP (publication number: CN108737385A) maps known malicious domain names to their corresponding IPs in real time through DNS and matches access to malicious domain names against the full IP traffic. However, the scheme only uses DNS to resolve domain names and matches access behavior by the traffic of the IP address; solving other security problems based on the mapping between IP and domain name would require further mining of the hidden information between them.
A knowledge-graph-based distributed security event correlation analysis method (publication number: CN108270785A) constructs a network security knowledge graph comprising a basic dimension, a vulnerability dimension, a threat dimension, an alarm event dimension and an attack rule dimension. It correlates security information from each dimension of cyberspace security resources and analyzes security threat events from a macroscopic view. However, it does not provide a solution to a concrete security problem; in practical application, different algorithm flows still have to be designed on the network security knowledge graph to solve actual security problems.
The following disadvantages exist in the current network space security field:
(1) IP address information in a network space cannot be directly mapped to an organization to which the IP address information belongs;
(2) the search results are complicated and disordered, and the optimal inference result is difficult to obtain;
(3) the clustering result is not accurate, and a great clustering error exists.
Consequently, the existing field of network information security lacks a practical and effective method for analyzing and obtaining useful security information from the network and applying it to practical problems. This lowers the efficiency with which real-world problems are solved, and reliability is not effectively improved.
Disclosure of Invention
To overcome the shortcomings of the prior art described above, the invention provides a knowledge-graph-based method for tracing the organization behind an IP address. Starting from practical needs, it takes an IP address detected in cyberspace, infers the organization to which the address belongs by combining search, clustering and a knowledge graph, and mines the deep hidden information of value in the IP address, which is of great significance for cyberspace security attack and defense, monitoring and the like.
To achieve this purpose, the invention adopts the following technical solution:
A method for tracing the organization behind an IP address based on a knowledge graph comprises the following steps:
acquiring domain name information: for a valid IP address whose organization is to be inferred, obtaining the domain name information corresponding to the IP address through reverse DNS resolution;
acquiring domain name key information: truncating and filtering the domain name information to obtain the key information in it;
obtaining analysis samples: searching with the domain name key information to obtain a plurality of web pages corresponding to it, ranking and filtering the web pages, and retaining the selected web pages as analysis samples;
sample processing: parsing the analysis samples and obtaining the text content in them;
entity extraction: performing entity extraction on the text content with a named entity recognition model to identify all organization names in the text content;
weight calculation: calculating and assigning a weight to each organization name according to the ranking of the analysis samples and the tag element of the web page in which the text content is located;
entity clustering: clustering the entities by the edit distance between organization names, while using a prior knowledge graph to constrain and guide the clustering result, to obtain organization name sets of multiple categories;
organization name selection: computing the weight of the organization name set of each category by weighted averaging over the weights of its organization names, selecting the organization name set with the largest weight, and taking the organization name with the largest weight in that set as the tracing inference result.
With this tracing method, the IP address is reverse-resolved and the resulting domain name information is further processed, so that useful information can be extracted from the retrieved web page tags and web page text for in-depth analysis. Possible target organizations are thereby screened, and after the calculations described the most likely target organization is selected as the inference result, completing the trace.
Further, the domain name information may be processed in several ways; one feasible and preferred way is as follows: in the step of acquiring the domain name key information, the domain name information is processed with a regular expression that removes its prefix and suffix, yielding the key information in the domain name.
Further, the search usually returns a large number of candidate pages containing a great deal of noise, so about 20 of the most relevant web pages may be selected as analysis samples. When the analysis samples are processed, the web page tags are removed using regular expressions and web page document processing rules to obtain the text content.
Further, every analysis sample influences the final inference result, but not to the same degree, so the influence of each analysis sample is differentiated during inference. One feasible approach is to calculate and assign weights to organization names according to the rank order of the analysis samples: the weight is computed from the ranking of the web pages, with one weight factor calculated as follows,
Figure BDA0002734951210000041
where $\omega_i$ is the weight of the analysis sample ranked $i$.
Further, the weight comprises two factors. One is the weight of the web page itself among all analysis samples, which depends directly on its rank among them. The other relates to the tag elements contained in each web page: weights are calculated and assigned to organization names according to the tag element in which the text content is located. Each web page tag element is given a semantic weight $\omega_j \in [0, 10]$ according to its contribution to the topic of the web page document, and this semantic weight reflects how important the text content is to the search result.
Further, from the weight of the web page and the semantic weight of the web page tag element, the weight of an organization name is calculated as follows
Figure BDA0002734951210000042
where $W_{ij}$ is the weight of organization name $j$ in web page $i$; $\omega_i$ is the weight factor of web page $i$; $\omega_j$ is the semantic weight of organization name $j$; a tree diagram of the tag elements is constructed, and $h_j$ is the level of that tree at which organization name $j$ is located, with $h_j \in [0, 4]$; $tf_{ij}$ is the term frequency of organization name $j$ in web page $i$; and $idf_j$ is the inverse document frequency of organization name $j$,
Figure BDA0002734951210000051
where $N_i$ is the number of all organization name terms found in web page $i$, and $n_{ij}$ is the number of terms in web page $i$ that contain organization name $j$.
Further, before entity clustering, at least the alias relations and the superior-subordinate affiliation relations of the candidate organization names are looked up in the prior knowledge graph to form candidate organization entity pairs and triples; the head and tail entities of the triples form the entity set $\{(h_i, t_i)\}$.
Further, the paths connecting the two entities $(h_i, t_i)$ of a triple are selected from the knowledge graph; each path is taken as a feature and its feature value is calculated, forming the feature vector of the triple.
Specifically, the feature value of a path may be calculated as follows
$$h_{s,P}(e) = \sum_{e'} h_{s,P'}(e') \cdot P(e \mid e'; R_l)$$
where $P = R_1 \ldots R_l$ denotes a path, $P' = R_1 \ldots R_{l-1}$ is the same path without its last relation, $s$ is the start node, $e$ is the tail node, $e'$ is an intermediate node, and $h_{s,P'}(e')$ is the probability that the entity pair $(s, e')$ is connected through path $P'$. Here
$$P(e \mid e'; R_l) = \frac{R_l(e', e)}{\lvert R_l(e', \cdot) \rvert}$$
is the probability that a random walk from node $e'$ reaches end node $e$ in one step under relation type $R_l$, where $R_l(e', e)$ indicates whether the relation $R_l$ holds between the entity pair $(e', e)$.
After the feature values of each feature vector have been computed, a logistic regression classifier is trained on the feature vectors, and whether a relation of the specified type exists between two entities is judged from the probability predicted from the paths between them. When the predicted probability is high, the two entities are likely to have a specified relation such as alias or superior-subordinate organization, and they are placed as an entity pair in the same organization name set. When the predicted probability is low, such a relation is unlikely, and the two entities are placed in different sets.
Compared with the prior art, the invention has the beneficial effects that:
(1) To address the isolation and scattering of security information in cyberspace, and starting from practical application, the organization that owns a valid IP address is reasonably inferred from that address. This solves the problem that IP-to-domain-name mappings in cyberspace security carry no information about the owning organization, closes the gap between IP addresses, domain names, organizations and other network security information, and gathers more comprehensive security information for network attack and defense, monitoring and the like.
(2) By setting reasonable weight factors for web pages and their tag elements, the weight of each target term in the network information is calculated in a principled way, candidate organization names with a higher probability are effectively screened out of disordered search results, and the best inference of the owning organization is obtained efficiently from the IP address.
(3) By exploiting the relational connectivity of the network security knowledge graph, the clustering precision of the search results is improved, which ensures the accuracy of inferring the owning organization from an IP address in the field of cyberspace security.
Drawings
To illustrate the technical solutions of the embodiments more clearly, the drawings used in the embodiments are briefly described below. The following drawings show only some embodiments of the invention and should not be regarded as limiting its scope; those skilled in the art can derive other related drawings from them without inventive effort.
FIG. 1 is a schematic flow chart of the practice of the present invention according to an embodiment.
Fig. 2 is a tree diagram of web page tag elements.
Detailed Description
The invention is further explained below with reference to the drawings and the specific embodiments.
It should be noted that the description of the embodiments is provided to help understanding of the present invention, but the present invention is not limited thereto. Specific structural and functional details disclosed herein are merely illustrative of example embodiments of the invention. This invention may, however, be embodied in many alternate forms and should not be construed as limited to the embodiments set forth herein.
Examples
This embodiment provides a tracing approach for several information security problems in cyberspace in the prior art, in particular the difficulty of tracing security information back to the organization it belongs to when that information is isolated and scattered. It finds usable information in the hidden information carried by an IP address and calculates and infers the most likely organization.
Specifically, as shown in FIG. 1, this embodiment discloses a knowledge-graph-based method for tracing the organization behind an IP address, comprising:
S01, acquiring domain name information: for a valid IP address whose organization is to be inferred, obtaining the domain name information corresponding to the IP address through reverse DNS resolution;
S02, acquiring domain name key information: truncating and filtering the domain name information to obtain the key information in it;
S03, obtaining analysis samples: searching with the domain name key information to obtain a plurality of web pages corresponding to it, crawling the web pages with a web crawler, ranking and filtering them, and retaining the selected web pages as analysis samples;
S04, sample processing: parsing the analysis samples and obtaining the text content in them;
S05, entity extraction: performing entity extraction on the text content with a named entity recognition model to identify all organization names in the text content;
S06, weight calculation: calculating and assigning a weight to each organization name according to the ranking of the analysis samples and the tag element of the web page in which the text content is located;
S07, entity clustering: clustering the entities by the edit distance between organization names, while using a prior knowledge graph to constrain and guide the clustering result, to obtain organization name sets of multiple categories;
S08, organization name selection: computing the weight of the organization name set of each category by weighted averaging over the weights of its organization names, selecting the organization name set with the largest weight, and taking the organization name with the largest weight in that set as the tracing inference result.
With this tracing method, the IP address is reverse-resolved and the resulting domain name information is further processed, so that useful information can be extracted from the retrieved web page tags and web page text for in-depth analysis. Possible target organizations are thereby screened, and after the calculations described the most likely target organization is selected as the inference result, completing the trace.
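Step S01 can be illustrated with a minimal Python sketch. It assumes that the standard-library call `socket.gethostbyaddr` is an acceptable way to perform the reverse lookup; the function name `reverse_dns` and the injectable `resolver` parameter are illustrative and do not come from the patent.

```python
import socket
from typing import Callable, Optional


def reverse_dns(ip: str, resolver: Callable = socket.gethostbyaddr) -> Optional[str]:
    """Return the primary host name for `ip` from a reverse (PTR) lookup.

    `resolver` defaults to socket.gethostbyaddr; it is a parameter only so
    that the lookup can be replaced in tests (an assumption of this sketch).
    Returns None when the address has no reverse record.
    """
    try:
        hostname, _aliases, _addresses = resolver(ip)
    except OSError:  # socket.herror and socket.gaierror are subclasses of OSError
        return None
    # Normalize: drop a trailing root dot and lower-case the name.
    return hostname.rstrip(".").lower() or None
```

An address with no PTR record yields `None`, in which case the later steps have no domain name to work from.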
In practice, the domain name information may be processed in several ways. This embodiment uses a feasible and preferred one: in the step of acquiring the domain name key information, the domain name information is processed with a regular expression that removes its prefix and suffix, yielding the key information in the domain name.
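The prefix and suffix removal can be sketched with two regular expressions. The patent does not list the prefixes or suffixes it removes, so the patterns below (`www`, `mail`, `ftp`, `m` as prefixes; a top-level domain optionally preceded by a label such as `com` or `edu` as suffix) and the name `domain_key` are assumptions chosen for illustration.

```python
import re

# Hypothetical prefix list: common service labels at the start of a host name.
_PREFIX = re.compile(r"^(?:www\d*|mail|ftp|m)\.", re.IGNORECASE)
# Hypothetical suffix rule: a final TLD, optionally preceded by a
# second-level label such as ".com" or ".edu" (as in ".edu.cn", ".co.uk").
_SUFFIX = re.compile(r"(?:\.(?:com|net|org|edu|gov|ac|co))?\.[a-z]{2,}$", re.IGNORECASE)


def domain_key(hostname: str) -> str:
    """Strip the assumed prefix and suffix from a host name."""
    name = hostname.strip().rstrip(".").lower()
    name = _PREFIX.sub("", name)
    name = _SUFFIX.sub("", name)
    return name


print(domain_key("www.example.com"))      # example
print(domain_key("mail.example.edu.cn"))  # example
```

A production system would more likely consult a maintained public-suffix list than a hand-written pattern; the regular expression here only mirrors the approach named in the text.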
When analysis samples are obtained, the search usually returns a large number of candidate pages containing a great deal of noise, so about 20 of the most relevant web pages may be selected as analysis samples. When the analysis samples are processed, the web page tags are removed using regular expressions and web page document processing rules to obtain the text content.
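As a rough illustration of sample processing, the sketch below extracts text from a page while remembering the tag that encloses each fragment, since the later weight calculation depends on the tag element. It uses Python's standard `html.parser.HTMLParser` in place of the regular expressions mentioned in the text; the class name `TextByTag` and the function `extract_text` are hypothetical.

```python
from html.parser import HTMLParser


class TextByTag(HTMLParser):
    """Collect (enclosing_tag, text) pairs, ignoring script and style content."""

    SKIP = {"script", "style"}
    VOID = {"br", "hr", "img", "input", "link", "meta"}  # tags without a close tag

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []       # currently open tags, outermost first
        self.fragments = []   # collected (tag, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate unclosed inner tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if text and not (self.SKIP & set(self.stack)):
            self.fragments.append((self.stack[-1] if self.stack else "", text))


def extract_text(html: str):
    """Return the list of (tag, text) fragments found in an HTML document."""
    parser = TextByTag()
    parser.feed(html)
    parser.close()
    return parser.fragments
```

Keeping the enclosing tag alongside each fragment lets a later step look up the semantic weight of the tag element in which an organization name was found.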
Every analysis sample obtained influences the final inference result, but not to the same degree, so the influence of each analysis sample is differentiated during inference. One feasible approach is to calculate and assign weights to organization names according to the rank order of the analysis samples: the weight is computed from the ranking of the web pages, with one weight factor calculated as follows
Figure BDA0002734951210000081
where $\omega_i$ is the weight of the analysis sample ranked $i$. The selected analysis samples are ranked, and the rank determines the weight. In general, pages are ranked from first to last by the relevance of the web page document content to the search terms, and if a page comes from a well-known site such as wiki or baidu, its rank is moved up appropriately. The higher a web page ranks, the more analytical value its content has and the larger its weight $\omega_i$.
In the method adopted in this embodiment, the weight comprises two factors. One is the weight of the web page itself among all analysis samples, which depends directly on its rank among them and has been described in detail above. The other relates to the tag elements contained in each web page: weights are calculated and assigned to organization names according to the tag element in which the text content is located. Each web page tag element is given a semantic weight $\omega_j \in [0, 10]$ according to its contribution to the topic of the web page document, and this semantic weight reflects how important the text content is to the search result.
Terms contained in tag elements that contribute heavily to the document topic better reflect the topic of the web page. For example, the text in the <title> element inside the <head> of a web page document reflects the document topic better than terms appearing in <h1> or <p> inside <body>, and terms in the <h1> and <h2> tags inside <body> contribute more to the document topic than terms in <p>, <li> and similar tags. By analogy, a web page tag element tree diagram is constructed according to the importance of the tag elements to the document, as shown in FIG. 2. Following the division of this tree diagram, the semantic weights of the different web page tag elements are assigned as follows
Figure BDA0002734951210000101
Following the semi-structured format of an XML document, the HTML tag elements are modeled as a tree: each tag element is a node, levels are set by the element/sub-element containment relation, and elements are then moved up or down a level as appropriate according to their importance.
In actual calculation, if a candidate organization name A appears in the <title> tag element of <head>, the page is more likely to be the home page of organization A, and $h_j$ takes the value 2.
From the weight of the web page and the semantic weight of the web page tag element, for each organization-name term found in web page $i$, the weight of the organization name can be calculated with the following formula from the frequency and number of times the term appears in the different tag elements and in the web page as a whole
Figure BDA0002734951210000102
where $W_{ij}$ is the weight of organization name $j$ in web page $i$; $\omega_i$ is the weight factor of web page $i$; $\omega_j$ is the semantic weight of organization name $j$; $h_j$ is the level of the tag element tree diagram at which organization name $j$ is located, with $h_j \in [0, 4]$; $tf_{ij}$ is the term frequency of organization name $j$ in web page $i$; and $idf_j$ is the inverse document frequency of organization name $j$,
Figure BDA0002734951210000111
where $N_i$ is the number of all organization name terms found in web page $i$, and $n_{ij}$ is the number of terms in web page $i$ that contain organization name $j$.
Entity clustering helps determine accurately the organization to which the security information belongs. Before entity clustering, at least the alias relations and the superior-subordinate affiliation relations of the candidate organization names are looked up in the prior knowledge graph to form candidate organization entity pairs and triples; the head and tail entities of the triples form the entity set $\{(h_i, t_i)\}$. In this embodiment, the Path Ranking Algorithm (PRA) is used to find entities with a specified relation. The main idea of PRA is to judge whether a certain type of relation exists from the different paths connecting the entities in the knowledge graph.
Specifically, the paths connecting the two entities $(h_i, t_i)$ of a triple are selected from the knowledge graph; each path is taken as a feature and its feature value is calculated, forming the feature vector of the triple.
Specifically, the feature value of a path may be calculated as follows
$$h_{s,P}(e) = \sum_{e'} h_{s,P'}(e') \cdot P(e \mid e'; R_l)$$
where $P = R_1 \ldots R_l$ denotes a path, $P' = R_1 \ldots R_{l-1}$ is the same path without its last relation, $s$ is the start node, $e$ is the tail node, $e'$ is an intermediate node, and $h_{s,P'}(e')$ is the probability that the entity pair $(s, e')$ is connected through path $P'$. Here
$$P(e \mid e'; R_l) = \frac{R_l(e', e)}{\lvert R_l(e', \cdot) \rvert}$$
is the probability that a random walk from node $e'$ reaches end node $e$ in one step under relation type $R_l$, where $R_l(e', e)$ indicates whether the relation $R_l$ holds between the entity pair $(e', e)$.
After the feature values of each feature vector have been computed, a logistic regression classifier is trained on the feature vectors, and whether a relation of the specified type exists between two entities is judged from the probability predicted from the paths between them. When the predicted probability is high, the two entities are likely to have a specified relation such as alias or superior-subordinate organization, and they are placed as an entity pair in the same organization name set. When the predicted probability is low, such a relation is unlikely, and the two entities are placed in different sets.
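The random-walk path feature used by the Path Ranking Algorithm can be sketched as follows. The sketch follows the usual PRA recursion, in which each step of the walk chooses uniformly among the neighbors reachable by the current relation; the graph layout (`{node: {relation: [targets]}}`), the function names and the toy data are assumptions made for illustration, and the logistic regression classifier is not included.

```python
from collections import defaultdict


def path_feature(graph, start, path):
    """Return {end_node: probability} for a random walk from `start` along `path`.

    `graph` maps node -> relation -> list of target nodes (an assumed layout).
    Each step distributes a node's probability evenly over its targets.
    """
    dist = {start: 1.0}
    for relation in path:
        nxt = defaultdict(float)
        for node, prob in dist.items():
            targets = graph.get(node, {}).get(relation, [])
            for target in targets:
                nxt[target] += prob / len(targets)
        dist = dict(nxt)
    return dist


def pair_features(graph, head, tail, paths):
    """Feature vector of an entity pair: one random-walk probability per path."""
    return [path_feature(graph, head, p).get(tail, 0.0) for p in paths]


# Toy knowledge graph (invented data, for illustration only).
toy_graph = {
    "NBA": {"alias": ["National Basketball Association"],
            "member": ["Lakers", "Celtics"]},
    "Lakers": {"member_of": ["National Basketball Association"]},
    "Celtics": {"member_of": ["National Basketball Association",
                              "Eastern Conference"]},
}
toy_paths = [["alias"], ["member", "member_of"]]
print(pair_features(toy_graph, "NBA", "National Basketball Association", toy_paths))
```

Feature vectors of this kind, computed for entity pairs with and without the relation of interest, would then serve as training data for the classifier described above.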
By incorporating the prior knowledge of the network security knowledge graph and inferring the similarity between candidate organization entities, the method avoids wrongly clustering candidate organization entities that are aliases of one another, or that stand in a superior-subordinate relation, into different categories. For example, "NBA" and "American basketball association" denote the same organization, but clustering by the edit distance between names alone would put them in different categories and blur the result. In this case PRA is used to infer over the prior network security knowledge graph that NBA and American basketball association stand in an alias relation and are highly similar; this alias relation serves as prior knowledge that guides the clustering process so that the two are clustered into the same category.
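One simple way to combine edit-distance clustering with knowledge-graph guidance is sketched below. Names are merged when their edit distance is small relative to their length, and pairs that the knowledge graph reports as related are merged regardless of distance. The threshold `max_ratio`, the union-find grouping and the function names are assumptions of this sketch; the patent does not specify the clustering procedure at this level of detail.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def cluster_names(names, related_pairs, max_ratio=0.3):
    """Group names by edit distance, then merge pairs related in the knowledge graph.

    `related_pairs` holds (name_a, name_b) pairs judged to be aliases or
    superior/subordinate organizations; `max_ratio` is a hypothetical threshold.
    """
    parent = {n: n for n in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if edit_distance(a.lower(), b.lower()) <= max_ratio * max(len(a), len(b)):
                union(a, b)
    for a, b in related_pairs:  # prior knowledge overrides string distance
        if a in parent and b in parent:
            union(a, b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return list(groups.values())
```

With the names "NBA", "American Basketball Association", "American Basketball Assoc." and "Acme Corp", edit distance alone leaves "NBA" in its own group; passing the alias pair ("NBA", "American Basketball Association") merges it with the two longer spellings.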
After the entity clustering of the seventh step, candidate organization name sets of multiple categories are obtained, and each set still contains several candidate organization names, so the most reliable and most probable term must be selected as the organization name in the final output. After entity clustering, the weight factors of the candidate organization names calculated in the sixth step are retained, and the weight of each category set is obtained as a weighted average of the weights of its candidate organization names. The set with the largest weight is selected, and the organization name with the largest weight in that set is output as the final inference result.
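The selection step can be sketched as follows. The text does not state which coefficients the weighted average uses, so this sketch takes optional per-name coefficients and falls back to equal coefficients (a plain mean); that default, the function name `select_name` and the sample weights are assumptions.

```python
def select_name(clusters, weights, coeffs=None):
    """Pick the best name from the cluster with the largest averaged weight.

    clusters: list of lists of organization names
    weights:  {name: weight from the weight-calculation step}
    coeffs:   optional {name: averaging coefficient}; defaults to 1.0 each
    """
    coeffs = coeffs or {}

    def set_weight(cluster):
        total = sum(coeffs.get(n, 1.0) for n in cluster)
        return sum(coeffs.get(n, 1.0) * weights[n] for n in cluster) / total

    best_cluster = max(clusters, key=set_weight)
    return max(best_cluster, key=lambda n: weights[n])


# Made-up clusters and weights for illustration.
sample_clusters = [["NBA", "American basketball association"], ["Example Corp"]]
sample_weights = {"NBA": 8.0,
                  "American basketball association": 6.0,
                  "Example Corp": 5.0}
result = select_name(sample_clusters, sample_weights)
```

Here the first cluster averages 7.0 against 5.0 for the second, so the first cluster wins and its heaviest member, NBA, is returned.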
The present invention is not limited to the alternative embodiments described above. Those skilled in the art can obtain various other embodiments by combining the above embodiments in any manner, and such embodiments, whatever their form, remain within the spirit of the present invention. The above detailed description should not be taken as limiting the scope of protection of the invention, which is defined in the claims; the description may be used to interpret the claims.

Claims (7)

1. A knowledge-graph-based IP address organization tracing method, characterized by comprising the following steps:
acquiring domain name information: for a valid IP address whose organization is to be inferred, obtaining the domain name information corresponding to the IP address through reverse DNS resolution;
acquiring domain name key information: truncating and filtering the domain name information to obtain the key information in the domain name information;
obtaining analysis samples: searching with the domain name key information to obtain a plurality of web pages corresponding to the domain name key information, ranking and filtering the web pages, and retaining the remaining web pages as analysis samples;
sample processing: parsing the analysis samples and obtaining the text content they contain;
entity extraction: performing entity extraction on the text content with a named entity recognition model, and identifying all organization names in the text content;
weight calculation: calculating and assigning weights to the organization names according to the ranking of the analysis samples and the tag elements of the web page tags in which the text content is located;
entity clustering: clustering the entities by the edit distance between the organization names, while using a prior knowledge graph to constrain and guide the clustering result, to obtain organization name sets of multiple categories; this step comprises: S1, before entity clustering, finding in the prior knowledge graph at least the alias relationships and the superior-subordinate affiliation relationships of the candidate organization names, forming candidate organization entity pairs and triples, and taking the head entity and the tail entity of each triple to form an entity set {(h_i, t_i)}; S2, selecting two entities (h_i, t_i) that satisfy the triple in the knowledge graph, taking each path between them as a feature and calculating the feature value of the path, thereby forming a feature vector of the triple; S3, calculating the feature value of the path as follows:
Figure FDA0003395453750000011
where P = R_1 ... R_l denotes a path, s is the start node, e is the end node, e' is an intermediate node, and h_{s,P'}(e') denotes, for relation type R_l, the probability that the entity pair (s, e') can be connected through the path P', and where
Figure FDA0003395453750000021
denotes the probability of reaching the end node e by a one-step random walk from node e' under relation type R_l, which reflects whether relation R_l holds between the entity pair (e', e);
selecting the organization name: calculating the weight of the organization name set of each category as a weighted average of the weights of its organization names, selecting the organization name set with the largest weight, and taking the organization name with the largest weight in that set as the tracing inference result.
2. The knowledge-graph-based IP address organization tracing method according to claim 1, characterized in that: in the step of acquiring domain name key information, the domain name information is processed with a regular expression that removes its prefix and suffix, thereby obtaining the key information in the domain name.
3. The knowledge-graph-based IP address organization tracing method according to claim 1, characterized in that: when the analysis samples are processed, the web page tags are removed by means of regular expressions and web page document processing rules, and the text content is obtained.
4. The knowledge-graph-based IP address organization tracing method according to claim 1, wherein the organization names are weighted according to the ranking of the analysis samples, characterized in that: a weight is calculated from the ranking of the web pages as one of the weight factors,
Figure FDA0003395453750000022
where ω_i is the weight of the analysis sample ranked i.
5. The knowledge-graph-based IP address organization tracing method according to claim 4, wherein the organization names are weighted according to the tag elements of the web page tags in which the text content is located, characterized in that: each web page tag element is assigned a semantic weight ω_j ∈ [0, 10] according to its contribution to the topic of the web page document, and the semantic weight of the web page tag element reflects how important the text content is to the search result.
6. The knowledge-graph-based IP address organization tracing method according to claim 5, characterized in that: the weight of an organization name is calculated from the weight of the web page and the semantic weight of the corresponding web page tag element as follows:
Figure FDA0003395453750000031
where W_ij is the weight of organization name j in web page i, ω_i is the weight factor of web page i, and ω_j is the semantic weight of organization name j; a tree diagram is constructed from the tag elements, and h_j is the level of the tag element tree diagram at which organization name j is located, with h_j ∈ [0, 4]; tf_ij is the term frequency of organization name j in web page i, and idf_j is the inverse document frequency of organization name term j,
Figure FDA0003395453750000032
where N_i is the number of all organization name terms found in web page i, and n_ij is the number of terms in web page i that contain organization name j.
7. The knowledge-graph-based IP address organization tracing method according to claim 1, characterized in that: a logistic regression classifier is trained on the feature vectors, and whether a relationship of the specified type exists between two entities is judged from the path-predicted probability between the two entities.
CN202011130373.2A 2020-10-21 2020-10-21 IP address mechanism tracing method based on knowledge graph Active CN112364173B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011130373.2A CN112364173B (en) 2020-10-21 2020-10-21 IP address mechanism tracing method based on knowledge graph

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011130373.2A CN112364173B (en) 2020-10-21 2020-10-21 IP address mechanism tracing method based on knowledge graph

Publications (2)

Publication Number Publication Date
CN112364173A CN112364173A (en) 2021-02-12
CN112364173B true CN112364173B (en) 2022-03-18

Family

ID=74511364

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011130373.2A Active CN112364173B (en) 2020-10-21 2020-10-21 IP address mechanism tracing method based on knowledge graph

Country Status (1)

Country Link
CN (1) CN112364173B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113141378B (en) * 2021-05-18 2022-12-02 中国互联网络信息中心 Bad domain name identification method and device
CN115840863A (en) * 2021-09-18 2023-03-24 华为技术有限公司 Webpage content tracing method, knowledge graph construction method and related equipment
CN114422170B (en) * 2021-12-08 2023-01-17 中国科学院信息工程研究所 Method and system for reversely acquiring domain name from IP address
CN114328937A (en) * 2022-03-10 2022-04-12 中国医学科学院医学信息研究所 Scientific research institution information processing method and device
CN117235200A (en) * 2023-09-12 2023-12-15 杭州湘云信息技术有限公司 Data integration method and device based on AI technology, computer equipment and storage medium

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109617728A (en) * 2018-12-14 2019-04-12 中国电子科技网络信息安全有限公司 A kind of distributed IP grade network topology probe method based on multi-protocols
CN109885692A (en) * 2019-01-11 2019-06-14 平安科技(深圳)有限公司 Knowledge data storage method, device, computer equipment and storage medium
CN110113314A (en) * 2019-04-12 2019-08-09 中国人民解放军战略支援部队信息工程大学 Network safety filed knowledge mapping construction method and device for dynamic threats analysis
CN110188191A (en) * 2019-04-08 2019-08-30 北京邮电大学 A kind of entity relationship map construction method and system for Web Community's text
CN110362660A (en) * 2019-07-23 2019-10-22 重庆邮电大学 A kind of Quality of electronic products automatic testing method of knowledge based map
CN110674310A (en) * 2019-09-04 2020-01-10 东华大学 Knowledge graph-based industrial Internet of things identification method
US10630715B1 (en) * 2019-07-25 2020-04-21 Confluera, Inc. Methods and system for characterizing infrastructure security-related events
CN111177591A (en) * 2019-12-10 2020-05-19 浙江工业大学 Knowledge graph-based Web data optimization method facing visualization demand
CN111193749A (en) * 2020-01-03 2020-05-22 北京明略软件系统有限公司 Attack tracing method and device, electronic equipment and storage medium
CN111247773A (en) * 2017-04-03 2020-06-05 力士塔有限公司 Method and apparatus for ultra-secure last-in-the-road communication
CN111581397A (en) * 2020-05-07 2020-08-25 南方电网科学研究院有限责任公司 Network attack tracing method, device and equipment based on knowledge graph

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110121747A (en) * 2016-10-28 2019-08-13 伊鲁米那股份有限公司 For executing the bioinformatics system, apparatus and method of second level and/or tertiary treatment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
A Two-Phase Approach for Stance Classification in Twitter Using Name Entity Recognition and Term Frequency Feature;Yin Min Tun 等;《2019 IEEE/ACIS 18th International Conference on Computer and Information Science (ICIS)》;20190619;77-81 *
基于知识实体的突发公共卫生事件数据平台构建研究;冯鑫 等;《知识管理论坛》;20200630;第5卷(第3期);175-190 *
科技大数据知识图谱构建方法及应用研究综述;周园春 等;《中国科学:信息科学》;20200715;第50卷(第7期);957-987 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant