CN111224941B - Threat type identification method and device - Google Patents
- Publication number
- CN111224941B (CN201911136708.9A)
- Authority
- CN
- China
- Prior art keywords
- domain name
- matrix
- identified
- threat
- sample
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1408—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
- H04L63/1416—Event detection, e.g. attack signature detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The embodiment of the invention provides a threat type identification method and device, relates to the technical field of network security, and can determine the threat type of a domain name whose threat type is unknown. The scheme of the embodiment of the invention comprises: constructing a matrix to be identified, wherein the matrix to be identified comprises an attribute matrix and a relation matrix, the attribute matrix comprises attribute information of the domain name to be identified and attribute information of a target domain name of a known threat type, and the relation matrix comprises the similarity between the domain name to be identified and the target domain name; then inputting the matrix to be identified into a threat type identification model, acquiring the probability, output by the threat type identification model, that the domain name to be identified belongs to each threat type, and taking the threat type with the maximum probability as the threat type of the domain name to be identified. The threat type identification model determines the probability that the domain name to be identified belongs to each threat type according to the attribute information of the domain name to be identified, the attribute information of the target domain name and the similarity between the domain name to be identified and the target domain name.
Description
Technical Field
The invention relates to the technical field of network security, in particular to a threat type identification method and device.
Background
Network attackers typically employ a large amount of network infrastructure to perform network attacks, which typically involve domain names and Internet Protocol (IP) addresses. For example, an attacker may employ various propagation means to infect a large number of hosts on the internet with bot viruses, and the infected hosts receive instructions from the attacker through a control channel, thereby stealing private data from other devices, sending spam to other devices, and the like. The networks formed by these infected hosts are called botnets.
To cope with the complexity of network attacks, more and more organizations are beginning to share network threat intelligence with open networks. Cyber threat intelligence describes existing cyber threats and describes countermeasures that can be taken in the face of cyber threats.
However, domain names of unknown threat type still exist in the network, so determining the threat type of such domain names is very important.
Disclosure of Invention
The embodiment of the invention aims to provide a threat type identification method and a threat type identification device so as to determine a threat type of a domain name of an unknown threat type. The specific technical scheme is as follows:
in a first aspect, an embodiment of the present invention provides a threat type identification method, where the method includes:
constructing a matrix to be identified, wherein the matrix to be identified comprises an attribute matrix and a relation matrix, the attribute matrix comprises attribute information of a domain name to be identified and attribute information of a target domain name of a known threat type, the relation matrix comprises the similarity between the domain name to be identified and the target domain name, the similarity between the domain name to be identified and the target domain name is determined based on the association relationship between the domain name to be identified and the target domain name, the association relationship comprises an indirect relationship in which the domain name to be identified and the target domain name each have an association relationship with the same element or the same element graph, association relationships exist among the elements in the element graph, and the elements comprise Internet Protocol (IP) addresses, mailbox addresses and/or malicious software hash values;
inputting the matrix to be identified into a threat type identification model;
and obtaining the probability that the domain name to be identified output by the threat type identification model belongs to each threat type, and taking the threat type with the highest probability as the threat type of the domain name to be identified, wherein the threat type identification model is used for determining the probability that the domain name to be identified belongs to each threat type according to the attribute information of the domain name to be identified, the attribute information of the target domain name and the similarity between the domain name to be identified and the target domain name.
Optionally, the threat type identification model is obtained by training through the following steps:
acquiring a sample matrix and actual threat types of a plurality of sample domain names corresponding to the sample matrix, wherein the sample matrix comprises an attribute matrix formed by attribute information of the plurality of sample domain names and a relation matrix used for representing the similarity between every two sample domain names in the plurality of sample domain names;
inputting the sample matrix into a neural network model, and acquiring the probability that the plurality of sample domain names identified by the neural network model belong to each threat type;
calculating a loss function value according to the probability that the plurality of sample domain names identified by the neural network model belong to each threat type and the actual threat types of the plurality of sample domain names;
judging whether the neural network model converges according to the loss function value; if the neural network model is converged, obtaining the threat type identification model; and if the neural network model is not converged, adjusting the model parameters of the neural network model according to the loss function value, and carrying out next training.
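The train, check convergence, adjust parameters cycle described above can be sketched as follows. This is only an illustration under stated assumptions: a single-layer softmax classifier stands in for the neural network model, convergence is taken to mean that the loss changes by less than a tolerance, and the names `train`, `lr`, `tol` and `max_epochs` as well as the toy data are hypothetical and do not come from the patent.

```python
import numpy as np

def softmax_rows(x):
    # Row-wise softmax with the usual max-shift for numerical stability.
    x = x - x.max(axis=1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=1, keepdims=True)

def train(features, one_hot_labels, lr=0.5, tol=1e-6, max_epochs=1000):
    """Gradient-descent loop with a loss-based convergence check.

    Assumption: a one-layer softmax model replaces the patent's neural
    network, and "converged" means the loss changed by less than `tol`.
    """
    n, m = features.shape
    k = one_hot_labels.shape[1]
    W = np.zeros((m, k))
    prev_loss = np.inf
    loss = np.inf
    for _ in range(max_epochs):
        Z = softmax_rows(features @ W)              # predicted probabilities
        loss = -np.sum(one_hot_labels * np.log(Z + 1e-12)) / n
        if abs(prev_loss - loss) < tol:             # converged: stop training
            break
        grad = features.T @ (Z - one_hot_labels) / n
        W -= lr * grad                              # adjust model parameters
        prev_loss = loss
    return W, loss

# Toy sample data: 4 sample domain names, 2 features, 2 threat types.
features = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels = np.eye(2)[[0, 0, 1, 1]]
W, final_loss = train(features, labels)
pred = softmax_rows(features @ W).argmax(axis=1)
```

The loop stops either when the loss stabilizes or after `max_epochs` iterations, whichever comes first.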
Optionally, the similarity between every two sample domain names in the plurality of sample domain names is obtained by the following formula:
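where $\Phi_k$ is the $k$-th meta-path or meta-graph, $\Phi = \{\Phi_k \mid k = 1, 2, \ldots, n\}$, $n$ is the total number of categories of meta-paths and meta-graphs, the path-count term is the number of paths between two domain names under $\Phi_k$, $\beta_k$ is the weight of $\Phi_k$ and satisfies $\beta_k > 0$ and $\sum_{k=1}^{n} \beta_k = 1$, and $\mathrm{MIS}(v_i, v_j)$ is the similarity between the $i$-th domain name and the $j$-th domain name.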
Optionally, the neural network model identifies the probability that each sample domain name belongs to each threat type by the following formula:
where $Z$ is the probability that the sample domain name belongs to each threat type, $X$ is the attribute matrix, $B$ is the relation matrix, a normalization of $x_i$ is applied to the rows of the matrix, $\mathrm{ReLU}(\cdot) = \max(0, \cdot)$, $I_N$ is an identity matrix, and $W^{(0)}$ and $W^{(1)}$ are the weights of the neural network model.
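The formula itself is not legible in this text. The symbols that survive ($Z$, $X$, $B$, $I_N$, ReLU, a row-wise normalization, and two weight matrices) match the usual two-layer graph convolutional network, so the sketch below assumes $Z = \mathrm{softmax}(\hat{B}\,\mathrm{ReLU}(\hat{B} X W^{(0)})\,W^{(1)})$ with $\hat{B} = D^{-1/2}(B + I_N)D^{-1/2}$, where $D$ is the degree matrix of $B + I_N$. That form, the function names and the random toy inputs are assumptions, not the patent's stated method.

```python
import numpy as np

def normalize_adjacency(B):
    # Assumed renormalization: B_hat = D^{-1/2} (B + I_N) D^{-1/2}.
    B_tilde = B + np.eye(B.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(B_tilde.sum(axis=1))
    return B_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def softmax_rows(x):
    # Normalization applied to each row, so every row sums to 1.
    x = x - x.max(axis=1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=1, keepdims=True)

def gcn_forward(X, B, W0, W1):
    """Two-layer graph convolution over the relation matrix B (assumed form)."""
    B_hat = normalize_adjacency(B)
    hidden = np.maximum(0.0, B_hat @ X @ W0)   # ReLU(.) = max(0, .)
    return softmax_rows(B_hat @ hidden @ W1)

# Toy inputs: 4 domain names, 3 attributes, 5 hidden units, 2 threat types.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
B = np.array([[0.0, 0.5, 0.0, 0.0],
              [0.5, 0.0, 0.2, 0.0],
              [0.0, 0.2, 0.0, 0.9],
              [0.0, 0.0, 0.9, 0.0]])
W0 = rng.normal(size=(3, 5))
W1 = rng.normal(size=(5, 2))
Z = gcn_forward(X, B, W0, W1)
```

Each row of `Z` is a probability distribution over threat types for one domain name.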
Optionally, the loss function value is calculated by the following formula:
where $H$ is the loss function value, $N'$ is the number of domain names with corresponding threat types, $K$ is the number of threat types, $l_k(v_i)$ indicates whether the $i$-th node $v_i$ belongs to threat type $k$, and $Z_k(v_i)$ is the probability, predicted by the neural network model, that the $i$-th node $v_i$ belongs to threat type $k$.
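The loss formula is likewise missing from this text. Given the symbols defined above, a cross-entropy over the labeled domain names is a natural reading, so the sketch below assumes $H = -\sum_{i=1}^{N'} \sum_{k=1}^{K} l_k(v_i) \ln Z_k(v_i)$. Whether the original formula averages over $N'$ is not known; this version sums. The function name and sample values are hypothetical.

```python
import numpy as np

def cross_entropy(Z, labeled_idx, labels):
    """Summed cross-entropy over the N' labeled nodes (assumed form).

    Z           : (N, K) predicted probabilities for all domain names
    labeled_idx : indices of the N' domain names with a known threat type
    labels      : threat type index (0..K-1) of each labeled domain name
    """
    K = Z.shape[1]
    one_hot = np.eye(K)[labels]            # l_k(v_i) as 0/1 indicators
    eps = 1e-12                            # guards against log(0)
    return -np.sum(one_hot * np.log(Z[labeled_idx] + eps))

# Three domain names, two threat types; only the last two are labeled.
Z = np.array([[0.5, 0.5],
              [0.9, 0.1],
              [0.2, 0.8]])
H = cross_entropy(Z, labeled_idx=[1, 2], labels=[0, 1])
```

Only rows with a known threat type contribute, which is why unlabeled domain names can still take part in the forward pass without affecting the loss.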
Optionally, the element further includes a domain name; before the constructing a matrix to be identified, the method further comprises:
acquiring incidence relations among nodes of different types from a first data source, and constructing a threat intelligence heterogeneous graph, wherein each node in the threat intelligence heterogeneous graph corresponds to one element, and the type of each node is the type of the element corresponding to the node;
the constructing of the matrix to be identified comprises the following steps:
acquiring attribute information of the domain name to be identified and attribute information of the target domain name, and constructing an attribute matrix;
determining the similarity between the domain name to be identified and the target domain name according to the threat intelligence heterogeneous graph;
and constructing the relation matrix according to the similarity between the domain name to be identified and the target domain name.
Optionally, the constructing a threat intelligence heterogeneous map includes:
and acquiring the incidence relation among the nodes of the same type from a second data source, and constructing a threat intelligence heterogeneous graph based on the incidence relation among the nodes of different types and the incidence relation among the nodes of the same type.
In a second aspect, an embodiment of the present invention provides a threat type identification apparatus, where the apparatus includes:
a construction module, configured to construct a matrix to be identified, wherein the matrix to be identified comprises an attribute matrix and a relation matrix, the attribute matrix comprises attribute information of a domain name to be identified and attribute information of a target domain name of a known threat type, the relation matrix comprises the similarity between the domain name to be identified and the target domain name, the similarity between the domain name to be identified and the target domain name is determined based on the association relationship between the domain name to be identified and the target domain name, the association relationship comprises an indirect relationship in which the domain name to be identified and the target domain name each have an association relationship with the same element or the same element graph, association relationships exist among the elements in the element graph, and the elements comprise Internet Protocol (IP) addresses, mailbox addresses and/or malicious software hash values;
an input module, configured to input the matrix to be identified constructed by the construction module into a threat type identification model;
and an obtaining module, configured to obtain the probability, output by the threat type identification model, that the domain name to be identified belongs to each threat type, and to take the threat type with the highest probability as the threat type of the domain name to be identified, wherein the threat type identification model is used for determining the probability that the domain name to be identified belongs to each threat type according to the attribute information of the domain name to be identified, the attribute information of the target domain name and the similarity between the domain name to be identified and the target domain name.
Optionally, the apparatus further comprises a training module, wherein the training module is configured to:
acquiring a sample matrix and actual threat types of a plurality of sample domain names corresponding to the sample matrix, wherein the sample matrix comprises an attribute matrix formed by attribute information of the plurality of sample domain names and a relation matrix used for representing the similarity between every two sample domain names in the plurality of sample domain names;
inputting the sample matrix into a neural network model, and acquiring the probability that the plurality of sample domain names identified by the neural network model belong to each threat type;
calculating a loss function value according to the probability that the plurality of sample domain names identified by the neural network model belong to each threat type and the actual threat types of the plurality of sample domain names;
judging whether the neural network model converges according to the loss function value; if the neural network model is converged, obtaining the threat type identification model; and if the neural network model is not converged, adjusting the model parameters of the neural network model according to the loss function value, and carrying out next training.
Optionally, the similarity between every two sample domain names in the plurality of sample domain names is obtained by the following formula:
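where $\Phi_k$ is the $k$-th meta-path or meta-graph, $\Phi = \{\Phi_k \mid k = 1, 2, \ldots, n\}$, $n$ is the total number of categories of meta-paths and meta-graphs, the path-count term is the number of paths between two domain names under $\Phi_k$, $\beta_k$ is the weight of $\Phi_k$ and satisfies $\beta_k > 0$ and $\sum_{k=1}^{n} \beta_k = 1$, and $\mathrm{MIS}(v_i, v_j)$ is the similarity between the $i$-th domain name and the $j$-th domain name.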
Optionally, the neural network model identifies the probability that each sample domain name belongs to each threat type by the following formula:
where $Z$ is the probability that the sample domain name belongs to each threat type, $X$ is the attribute matrix, $B$ is the relation matrix, a normalization of $x_i$ is applied to the rows of the matrix, $\mathrm{ReLU}(\cdot) = \max(0, \cdot)$, $I_N$ is an identity matrix, and $W^{(0)}$ and $W^{(1)}$ are the weights of the neural network model.
Optionally, the loss function value is calculated by the following formula:
where $H$ is the loss function value, $N'$ is the number of domain names with corresponding threat types, $K$ is the number of threat types, $l_k(v_i)$ indicates whether the $i$-th node $v_i$ belongs to threat type $k$, and $Z_k(v_i)$ is the probability, predicted by the neural network model, that the $i$-th node $v_i$ belongs to threat type $k$.
Optionally, the element further includes a domain name; the building module is further configured to, before the matrix to be identified is built, obtain association relationships between nodes of different types from a first data source, and build a threat intelligence heterogeneous graph, where each node in the threat intelligence heterogeneous graph corresponds to one element, and a type to which each node belongs is a type of the element corresponding to the node;
the building module is specifically configured to:
acquiring attribute information of the domain name to be identified and attribute information of the target domain name, and constructing an attribute matrix;
determining the similarity between the domain name to be identified and the target domain name according to the threat intelligence heterogeneous graph;
and constructing the relation matrix according to the similarity between the domain name to be identified and the target domain name.
Optionally, the building module is specifically configured to:
and acquiring the incidence relation among the nodes of the same type from a second data source, and constructing a threat intelligence heterogeneous graph based on the incidence relation among the nodes of different types and the incidence relation among the nodes of the same type.
In a third aspect, an embodiment of the present invention provides an electronic device, which is characterized by including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete mutual communication through the communication bus;
a memory for storing a computer program;
and the processor is used for realizing the steps of any threat type identification method when executing the program stored in the memory.
In a fourth aspect, the embodiment of the present invention further provides a computer-readable storage medium, in which a computer program is stored, and the computer program, when executed by a processor, implements the steps of any of the above threat type identification methods.
In a fifth aspect, embodiments of the present invention also provide a computer program product containing instructions, which when run on a computer, cause the computer to perform any of the above threat type identification methods.
The embodiment of the invention at least comprises the following beneficial effects: the threat type identification model in the embodiment of the invention can determine the threat type of the domain name to be identified based on the attribute information of the domain name to be identified, the attribute information of the target domain name and the association relationship between the domain name to be identified and the target domain name, wherein the association relationship comprises an indirect relationship in which the domain name to be identified and the target domain name each have an association relationship with the same element or the same element graph. Therefore, the embodiment of the invention determines the threat type of the domain name to be identified through the indirect relationship between the domain name to be identified and the target domain name, that is, by extracting the high-order semantics between the domain name to be identified and the target domain name.
Of course, not all of the advantages described above need to be achieved at the same time in the practice of any one product or method of the invention.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a flowchart of a threat type identification method according to an embodiment of the present invention;
FIG. 2 is an exemplary diagram of a threat intelligence heterogeneous graph provided by an embodiment of the present invention;
FIG. 3 is an exemplary diagram of various meta-paths and meta-graphs provided by an embodiment of the present invention;
FIG. 4 is an exemplary diagram illustrating a hierarchical relationship between threat types according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a threat type identification apparatus according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, an embodiment of the present invention provides a threat type identification method, which may be applied to an electronic device, where the electronic device may be a mobile phone, a computer, a tablet computer, or the like, and the method includes the following steps:
Step 101, constructing a matrix to be identified.
The matrix to be identified comprises an attribute matrix and a relation matrix, wherein the attribute matrix comprises attribute information of a domain name to be identified and attribute information of each target domain name of a known threat type, the relation matrix comprises the similarity between the domain name to be identified and each target domain name, the similarity between the domain name to be identified and each target domain name is determined based on the association relationship between the domain name to be identified and each target domain name, the association relationship comprises an indirect relationship in which the domain name to be identified and the target domain name each have an association relationship with the same element or the same element graph, association relationships exist among the elements in the element graph, and the elements comprise Internet Protocol (IP) addresses, mailbox addresses and/or malicious software hash values.
Step 102, inputting the matrix to be identified into a threat type identification model.
Step 103, acquiring the probability, output by the threat type identification model, that the domain name to be identified belongs to each threat type, and taking the threat type with the maximum probability as the threat type of the domain name to be identified.
The threat type identification model is used for determining the probability that the domain name to be identified belongs to each threat type according to the attribute information of the domain name to be identified, the attribute information of the target domain name and the similarity between the domain name to be identified and the target domain name.
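The selection in step 103 amounts to taking the arg-max over the model's output probabilities. A minimal sketch follows; the threat type names and probability values are made up for illustration and are not taken from the patent.

```python
def pick_threat_type(probabilities):
    """Return the threat type whose probability is highest."""
    return max(probabilities, key=probabilities.get)

# Hypothetical model output for one domain name to be identified.
probs = {"botnet": 0.15, "phishing": 0.70, "spam": 0.15}
predicted = pick_threat_type(probs)
```

If two threat types tie, `max` returns the first one encountered; the patent does not say how ties should be handled.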
The embodiment of the invention at least comprises the following beneficial effects: the threat type identification model in the embodiment of the invention can determine the threat type of the domain name to be identified based on the attribute information of the domain name to be identified, the attribute information of the target domain name and the association relationship between the domain name to be identified and the target domain name, wherein the association relationship comprises an indirect relationship in which the domain name to be identified and the target domain name each have an association relationship with the same element or the same element graph. Therefore, the embodiment of the invention determines the threat type of the domain name to be identified through the indirect relationship between the domain name to be identified and the target domain name, that is, by extracting the high-order semantics between the domain name to be identified and the target domain name.
Optionally, before the step 101, a threat intelligence heterogeneous graph may be further constructed, so that the number of paths of each domain name in the threat intelligence heterogeneous graph is obtained according to the intelligence heterogeneous graph, and further the similarity between the domain names is calculated.
The threat intelligence heterogeneous graph is constructed as follows: association relationships among nodes of different types are acquired from the first data source, and the threat intelligence heterogeneous graph is constructed from them. Each node in the threat intelligence heterogeneous graph corresponds to one element, and the type of each node is the type of the element corresponding to the node.
Optionally, the threat intelligence heterogeneous graph may be enriched by determining the association relationships between nodes of the same type, including: acquiring the association relationships among nodes of the same type from a second data source, and constructing the threat intelligence heterogeneous graph based on the association relationships among nodes of different types and the association relationships among nodes of the same type.
For example, if two Domain names have a common owner (owner) and/or a common Domain Name System (DNS) location, an association exists between the two Domain Name nodes. If two IP addresses have the same Internet Service Provider (ISP), an association relationship exists between the two IP address nodes. If two pieces of software attack the electronic equipment by using the same vulnerability, an association relationship exists between the two software hash value nodes.
For example, the constructed threat intelligence heterogeneous graph is shown in fig. 2, and the threat intelligence heterogeneous graph in fig. 2 includes four types of nodes, which are respectively: malware (Malware) hash value node (node M in fig. 2), IP address node (node I in fig. 2), Domain name (Domain) node (node D in fig. 2), and mailbox (Email) address node (node E in fig. 2). The arrows in fig. 2 represent the association between the nodes.
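A heterogeneous graph of this kind can be held as an adjacency map keyed by (type, identifier) pairs. The sketch below assumes undirected edges for simplicity, although the arrows in fig. 2 may carry a direction; the node identifiers and helper names are hypothetical.

```python
from collections import defaultdict

# Node types as in fig. 2: D = domain name, I = IP address,
# M = malware hash value, E = mailbox address.
# All identifiers below are made-up illustration values.
edges = [
    (("D", "example-a.test"), ("I", "192.0.2.1")),
    (("D", "example-b.test"), ("I", "192.0.2.1")),
    (("D", "example-a.test"), ("E", "owner@example.test")),
    (("M", "hash-1"), ("D", "example-b.test")),
]

def build_graph(edge_list):
    """Build an undirected typed graph (assumption: direction is ignored)."""
    graph = defaultdict(set)
    for u, v in edge_list:
        graph[u].add(v)
        graph[v].add(u)
    return graph

def neighbors_of_type(graph, node, node_type):
    """Neighbors of `node` whose type tag equals `node_type`."""
    return sorted(n for n in graph[node] if n[0] == node_type)

graph = build_graph(edges)
shared = neighbors_of_type(graph, ("I", "192.0.2.1"), "D")
```

Here `shared` lists the two domain names that resolve to the same IP address, which is exactly the kind of indirect relationship the method relies on.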
Threat intelligence may be represented by associations between nodes in a set of threat intelligence heterographs, and thus a piece of threat intelligence may be a subgraph of a threat intelligence heterograph. The meta-path defined based on the type of the nodes in the threat intelligence heterogeneous graph can reflect the incidence relation among the nodes in the meta-path and reflect the similarity among the nodes. For example: the meta path Domain-Malware-Domain may indicate that two Domain names are accessed by the same Malware, and the meta path Domain-Email-Domain may indicate that two Domain names are registered by the same mailbox address.
The embodiment of the invention can also bring the following beneficial effects: the threat intelligence heterogeneous graph can effectively and compactly express the network attacks with the incidence relation by using different semantics, and the threat intelligence heterogeneous graph has great potential for knowledge discovery (for example, capturing complex relations among different types of nodes, distinguishing different network attacks based on the difference of network behaviors, exploring how attackers organize the attacks and adjusting attack technologies, and the like). The embodiment of the invention can better mine the intelligence by utilizing the incidence relation among the nodes included in the threat intelligence heterogeneous graph, thereby greatly reducing the workload of a security analyst.
Optionally, the method for constructing the matrix to be identified in step 101 includes:
Step one, acquiring attribute information of the domain name to be identified and attribute information of the target domain name, and constructing the attribute matrix.
Step two, determining the similarity between the domain name to be identified and the target domain name according to the threat intelligence heterogeneous graph.
It can be understood that the number of paths between the domain name to be identified and the target domain name under each meta-path or meta-graph can be obtained from the association relationships between the domain name to be identified and the target domain name in the threat intelligence heterogeneous graph, and the similarity between the domain name to be identified and the target domain name is then calculated from these path counts.
Step three, constructing the relation matrix according to the similarity between the domain name to be identified and the target domain name.
Optionally, the attribute information in the attribute matrix in step 101 may include: the length of the domain name, the entropy of character distribution information of the domain name, the survival time of the domain name, the updating frequency of the domain name and the like.
Optionally, the attribute matrix may include N × m attributes, where N is the number of domain names and m is the number of attributes included in each domain name.
For example: attribute information matrix $X = (x_{ij}) \in \mathbb{R}^{N \times m}$, wherein each row is the attribute information of one domain name.
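One row of the attribute matrix could be assembled as sketched below from the attributes listed above (domain name length, character-distribution entropy, survival time, update frequency). The use of Shannon entropy in bits, the units of the last two attributes, and the sample values are assumptions for illustration only.

```python
import math
from collections import Counter

def char_entropy(name):
    """Shannon entropy (bits) of the character distribution of a name."""
    counts = Counter(name)
    total = len(name)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def attribute_row(name, lifetime_days, updates_per_year):
    # Assumed column order: length, entropy, survival time, update frequency.
    return [float(len(name)), char_entropy(name),
            float(lifetime_days), float(updates_per_year)]

# Hypothetical domain names with made-up lifetime and update values.
domains = [("aaaa.test", 30, 2), ("abcd.test", 400, 12)]
X = [attribute_row(*d) for d in domains]   # N = 2 rows, m = 4 columns
```

In practice the columns would likely be scaled to comparable ranges before being fed to a model, which the patent text does not discuss.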
Optionally, the relationship-based adjacency matrices include: matrix $R$, matrix $S$, matrix $G$, matrix $C$, and matrix $N$. Element $r_{ij}$ in matrix $R$ indicates whether the $i$-th domain name is resolved to the $j$-th IP address; if so, $r_{ij} = 1$; if not, $r_{ij} = 0$. Element $s_{ij}$ in matrix $S$ indicates whether the $i$-th domain name is accessed by the $j$-th malware; if so, $s_{ij} = 1$; if not, $s_{ij} = 0$. Element $g_{ij}$ in matrix $G$ indicates whether the $i$-th domain name is registered by the $j$-th mailbox address; if so, $g_{ij} = 1$; if not, $g_{ij} = 0$. Element $c_{ij}$ in matrix $C$ indicates whether the $i$-th IP address communicates with the $j$-th malware; if so, $c_{ij} = 1$; if not, $c_{ij} = 0$. Element $n_{ij}$ in matrix $N$ indicates whether the $i$-th IP address communicates with the $j$-th mailbox address; if so, $n_{ij} = 1$; if not, $n_{ij} = 0$.
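These 0/1 matrices can be filled directly from lists of observed (row, column) pairs. In the sketch below the helper name `adjacency` and the index pairs are hypothetical.

```python
import numpy as np

def adjacency(pairs, n_rows, n_cols):
    """Binary matrix with a 1 at every observed (row, column) pair."""
    A = np.zeros((n_rows, n_cols), dtype=int)
    for i, j in pairs:
        A[i, j] = 1
    return A

# Toy data: 3 domain names, 2 IP addresses, 2 malware hashes.
R = adjacency([(0, 0), (1, 0), (2, 1)], 3, 2)  # domain i resolves to IP j
S = adjacency([(0, 0), (1, 0), (1, 1)], 3, 2)  # domain i accessed by malware j
```

The matrices $G$, $C$ and $N$ would be built the same way from their own relation lists.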
Optionally, the matrix element $b_{ij}$ in the relation matrix $B$ in step 101 indicates the similarity between the $i$-th domain name $v_i$ and the $j$-th domain name $v_j$.
Optionally, the similarity between $v_i$ and $v_j$ can be obtained by the following formula (1):
where $\Phi_k$ is the $k$-th meta-path or meta-graph, $\Phi = \{\Phi_k \mid k = 1, 2, \ldots, n\}$, $n$ is the total number of categories of meta-paths and meta-graphs, the path-count term is the number of paths between two domain names under $\Phi_k$, $\beta_k$ is the weight of $\Phi_k$ and satisfies $\beta_k > 0$ and $\sum_{k=1}^{n} \beta_k = 1$, and $\mathrm{MIS}(v_i, v_j)$ is the similarity between the $i$-th domain name and the $j$-th domain name.
It can be understood that, as shown in formula (1), the similarity between two domain names depends on the number of paths between the two nodes on the one hand and on the paths of each domain name itself on the other hand. The weights $\beta_k$ can be obtained through automatic learning by the threat type identification model.
Optionally, as shown in fig. 3, D represents a domain name node, M represents a malware hash value node, I represents an IP address node, and E represents a mailbox address node. The embodiment of the invention defines 7 types of meta-paths, $\Phi_1$ to $\Phi_7$, and 5 types of meta-graphs, $\Phi_8$ to $\Phi_{12}$.
The number of paths between node $v_i$ and node $v_j$ under $\Phi_k$ is the entry $\mathrm{CM}_{\Phi_k}(i, j)$, where $\mathrm{CM}_{\Phi_k}$ is the commuting matrix between domain name node $v_i$ and domain name node $v_j$ under $\Phi_k$. The commuting matrix is a kind of adjacency matrix.
When $\Phi_k = (A_1, A_2, \ldots, A_{d+1})$ is a meta-path, the commuting matrix between domain name node $A_1$ and domain name node $A_{d+1}$ is:
$$\mathrm{CM}_{\Phi_k} = Q_{A_1 A_2} \cdot Q_{A_2 A_3} \cdot \ldots \cdot Q_{A_d A_{d+1}}$$
where $Q_{A_i A_{i+1}}$ is the adjacency matrix between node $A_i$ and node $A_{i+1}$, and the symbol "$\cdot$" represents matrix multiplication.
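For a meta-path, the commuting matrix is the chained product of the adjacency matrices along the path, and its $(i, j)$ entry counts the path instances between nodes $i$ and $j$. The sketch below works this out for the meta-path Domain-IP-Domain with a toy domain-to-IP matrix; the helper name and the data are illustrative.

```python
import numpy as np
from functools import reduce

def commuting_matrix(adjacency_chain):
    """Multiply the adjacency matrices along a meta-path, left to right."""
    return reduce(np.matmul, adjacency_chain)

# R: 3 domain names x 2 IP addresses (domain i resolves to IP j).
R = np.array([[1, 0],
              [1, 0],
              [0, 1]])

# Meta-path Domain-IP-Domain: domain -> IP uses R, IP -> domain uses R^T.
cm_did = commuting_matrix([R, R.T])
```

Domain names 0 and 1 share one IP address, so `cm_did[0, 1]` is 1, while domain name 2 shares none with the others.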
When $\Phi_k$ is a meta-graph, taking $\Phi_{10}$ as an example, its commuting matrix is calculated by the following steps:
Step 1, calculating $\mathrm{CM}_{P_1} = Q_{IE} \cdot Q_{IE}^{T} = N \cdot N^{T}$, where $P_1$ represents the sub-path (IP-Email-IP), $\mathrm{CM}_{P_1}$ is the commuting matrix between two IP address nodes under sub-path $P_1$, $Q_{IE}$ represents the commuting matrix between IP addresses and mailbox addresses, element $n_{ij}$ of matrix $N$ indicates whether the $i$-th IP address node communicates with the $j$-th mailbox address node, and $\cdot^{T}$ denotes the transpose of a matrix.
Step 2, calculating $\mathrm{CM}_{P_2} = Q_{IM} \cdot Q_{IM}^{T} = C \cdot C^{T}$, where $P_2$ represents the sub-path (IP-Malware-IP), $\mathrm{CM}_{P_2}$ is the commuting matrix between two IP address nodes under sub-path $P_2$, $Q_{IM}$ represents the commuting matrix between IP addresses and malware, element $c_{ij}$ of matrix $C$ indicates whether the $i$-th IP address is accessed by the $j$-th malware, and $\cdot^{T}$ denotes the transpose of a matrix.
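Steps 1 and 2 can be sketched as two matrix products. How the two sub-path results are then combined is not stated in this part of the text; the sketch assumes the common meta-graph treatment in which parallel branches are merged with an element-wise (Hadamard) product and the result is sandwiched between the domain-to-IP matrix $R$ and its transpose. That combination step, and the toy matrices, are assumptions.

```python
import numpy as np

# Toy adjacency matrices (values are made up):
N = np.array([[1, 0],
              [1, 1]])        # 2 IP addresses x 2 mailbox addresses
C = np.array([[1],
              [1]])           # 2 IP addresses x 1 malware hash
R = np.array([[1, 0],
              [0, 1]])        # 2 domain names x 2 IP addresses

cm_p1 = N @ N.T               # step 1: sub-path IP-Email-IP
cm_p2 = C @ C.T               # step 2: sub-path IP-Malware-IP

# ASSUMPTION (not in the text above): merge the two parallel branches
# element-wise, then extend both ends from IP addresses to domain names.
cm_branches = cm_p1 * cm_p2
cm_phi10 = R @ cm_branches @ R.T
```

The element-wise product keeps only IP pairs connected through both a shared mailbox address and a shared malware sample, which is what distinguishes a meta-graph from a single meta-path.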
For example, the relation matrix in step 101 above is $B = (b_{ij})$, wherein each matrix element represents the similarity between the corresponding two domain names. For example, $b_{1,2}$ is the similarity between the 1st domain name and the 2nd domain name.
It can be understood that different meta-paths and meta-graphs have different meanings, so the similarity between two domain names that each of them represents is also different. Therefore, in order to distinguish the importance of different meta-paths and meta-graphs, a corresponding weight $\beta_k$ may be set for each meta-path and meta-graph.
For example, domain name D1 and domain name D2 are registered with the same mailbox address E1, so the meta-path (Domain1-Email1-Domain2) exists between domain name D1 and domain name D2. Domain name D1 and domain name D2 are also both accessed by malware M1, so the meta-path (Domain1-Malware1-Domain2) exists between domain name D1 and domain name D2. When the threat source is more important, the weight corresponding to the meta-path Domain-Email-Domain can be set higher than the weight corresponding to the meta-path Domain-Malware-Domain. When the threat behavior is more important, the weight corresponding to the meta-path Domain-Malware-Domain may be set higher than the weight corresponding to the meta-path Domain-Email-Domain.
It will be appreciated that, to exploit the complementarity of the different meta-paths and meta-graphs and of the similarities between two domain names that they represent, the similarity between two domain names may be calculated based on the weighted adjacency matrices of the meta-paths and the meta-graphs.
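As a rough illustration of such a weighted combination, the sketch below assumes a PathSim-style normalization (twice the path count between two domain names, divided by the sum of their self-path counts) and weights that are positive and sum to 1. The function name, the normalization, and the toy matrices are assumptions made for the example, not a statement of the exact formula used in the embodiment.

```python
import numpy as np


def weighted_similarity(commuting_matrices, weights):
    """Combine per-meta-path similarities into one relation matrix."""
    weights = np.asarray(weights, dtype=float)
    if np.any(weights <= 0) or not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights must be positive and sum to 1")
    n = np.asarray(commuting_matrices[0]).shape[0]
    b = np.zeros((n, n))
    for m, beta in zip(commuting_matrices, weights):
        m = np.asarray(m, dtype=float)
        diag = np.diag(m)
        denom = diag[:, None] + diag[None, :]
        # Pairs without any self-paths contribute 0 instead of dividing by 0.
        sim = np.divide(2.0 * m, denom, out=np.zeros_like(m), where=denom > 0)
        b += beta * sim
    return b


# Hypothetical commuting matrices for two meta-paths over 3 domain names.
m_email = np.array([[1, 1, 0], [1, 1, 0], [0, 0, 1]])
m_malware = np.array([[1, 0, 1], [0, 1, 0], [1, 0, 1]])

# Weighting the Domain-Email-Domain path higher emphasises the threat source.
b = weighted_similarity([m_email, m_malware], [0.6, 0.4])
```

Swapping the two weights would instead emphasise shared malware behavior, which changes which domain pairs appear most similar.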
Optionally, the threat type identification model in fig. 1 is obtained by training through the following steps:
Step one, acquiring a sample matrix and actual threat types of a plurality of sample domain names corresponding to the sample matrix.
The sample matrix comprises an attribute matrix formed by attribute information of a plurality of sample domain names and a relation matrix used for representing the similarity between every two sample domain names in the plurality of sample domain names.
Step two, inputting the sample matrix into the neural network model, and acquiring threat types of a plurality of sample domain names identified by the neural network model.
In one embodiment, the neural network model can identify the probability that each sample domain name belongs to each threat type by the following equation (4):
$$Z = \mathrm{softmax}\left(\hat{B}\,\mathrm{ReLU}\left(\hat{B} X W^{(0)}\right) W^{(1)}\right) \quad (4)$$
where $Z$ is the probability that the sample domain name belongs to each threat type, $X$ is the attribute matrix of a plurality of sample domain names, and $B$ is the relation matrix of a plurality of sample domain names; $\mathrm{softmax}(x_i) = \exp(x_i) / \sum_j \exp(x_j)$ represents normalization applied to the rows of the matrix; $\mathrm{ReLU}(\cdot) = \max(0, \cdot)$ represents selecting the larger of 0 and $\cdot$ as the calculation result; $\hat{B} = \tilde{D}^{-1/2} \tilde{B} \tilde{D}^{-1/2}$, $\tilde{B} = B + I_N$, $\tilde{D}_{ii} = \sum_j \tilde{B}_{ij}$, and $I_N$ is an identity matrix; $W^{(0)}$ and $W^{(1)}$ are the weights of the neural network model, $W^{(0)}$ being the weight matrix from the input layer to the hidden layer of the neural network model, and $W^{(1)}$ being the weight matrix from the hidden layer to the output layer. $\tilde{D}_{ii}$ represents the element in the $i$-th row and $i$-th column of the matrix $\tilde{D}$, and $\tilde{B}_{ij}$ represents the element in the $i$-th row and $j$-th column of the matrix $\tilde{B}$.
It will be appreciated that the neural network model can be abstracted as three network layers: an input layer, a hidden layer, and an output layer, with a weight matrix stored between each pair of adjacent layers.
Optionally, randomness may be introduced into training by means of random inactivation (dropout).
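A forward pass through such a three-layer model can be sketched as below, assuming the common two-layer graph convolution form with a symmetrically normalized relation matrix, a ReLU hidden layer, and a row-wise softmax output. The function names, matrix sizes, and random inputs are hypothetical, and dropout is left out for brevity.

```python
import numpy as np


def normalize_relation(b):
    """Add self-connections, then scale rows and columns by degree^-1/2."""
    b_tilde = b + np.eye(b.shape[0])
    degree = b_tilde.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    return b_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def softmax_rows(x):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = x - x.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)


def forward(x, b, w0, w1):
    """Input layer -> hidden layer (ReLU) -> output layer (softmax)."""
    b_hat = normalize_relation(b)
    hidden = np.maximum(0.0, b_hat @ x @ w0)
    return softmax_rows(b_hat @ hidden @ w1)


# Hypothetical sizes: 4 domain names, 3 attributes, 5 hidden units, 2 types.
rng = np.random.default_rng(0)
x = rng.random((4, 3))
raw = rng.random((4, 4))
b = (raw + raw.T) / 2          # symmetric, non-negative similarities
w0 = rng.normal(size=(3, 5))
w1 = rng.normal(size=(5, 2))
z = forward(x, b, w0, w1)      # z[i, k]: probability of type k for domain i
```

Each row of `z` sums to 1, so the row can be read directly as the per-type probabilities of one domain name.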
Step three, calculating a loss function value according to the threat types of the plurality of sample domain names identified by the neural network model and the actual threat types of the plurality of sample domain names.
In one embodiment, the loss function value may be calculated by the following equation (5):
$$H = -\sum_{i=1}^{N'} \sum_{k=1}^{K} l_k(v_i) \ln Z_k(v_i) \quad (5)$$
where $H$ is the loss function value, $N'$ is the number of domain names with corresponding threat types, $K$ is the number of threat types, $l_k(v_i)$ indicates whether the $i$-th node $v_i$ belongs to threat type $k$, and $Z_k(v_i)$ is the probability, predicted by the neural network model, that the $i$-th node $v_i$ belongs to threat type $k$.
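The loss over the labeled domain names can be sketched as a summed cross-entropy, assuming the indicator for each node is one-hot so that only the predicted probability of the actual threat type contributes. The function name, the small `eps` guard, and the toy values are assumptions for the example.

```python
import numpy as np


def cross_entropy(z, labels, labeled_idx, eps=1e-12):
    """Sum of -ln(probability of the actual type) over labeled nodes.

    z           : (N, K) predicted probabilities
    labels      : (N,) integer threat type per node (ignored if unlabeled)
    labeled_idx : indices of the N' nodes that have a known threat type
    """
    z = np.asarray(z, dtype=float)
    labels = np.asarray(labels)
    labeled_idx = np.asarray(labeled_idx)
    picked = z[labeled_idx, labels[labeled_idx]]
    return float(-np.sum(np.log(picked + eps)))


# Hypothetical predictions for 3 domain names and 2 threat types;
# only the first two domain names are labeled.
z = np.array([[0.50, 0.50],
              [0.25, 0.75],
              [0.90, 0.10]])
h = cross_entropy(z, labels=[0, 1, 0], labeled_idx=[0, 1])
```

Unlabeled nodes still take part in the forward pass through the relation matrix, but they do not contribute to this loss.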
Step four, judging whether the neural network model has converged according to the loss function value. If the neural network model has converged, the threat type identification model is obtained; if the neural network model has not converged, the model parameters of the neural network model are adjusted according to the loss function value, and the next round of training is carried out.
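The train-until-converged loop of steps two to four can be sketched as follows. This is a simplified stand-in: a single-layer softmax classifier replaces the neural network model, and convergence is assumed to mean that the loss changes by less than a tolerance between rounds, which the description does not specify. All names and hyperparameters are hypothetical.

```python
import numpy as np


def train_until_converged(x, y_onehot, lr=0.5, tol=1e-6, max_epochs=5000):
    """Gradient descent that stops once the loss change falls below tol."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.01, size=(x.shape[1], y_onehot.shape[1]))
    prev_loss = np.inf
    loss = np.inf
    for epoch in range(1, max_epochs + 1):
        # Forward pass: row-wise softmax over the class scores.
        logits = x @ w
        logits -= logits.max(axis=1, keepdims=True)
        z = np.exp(logits)
        z /= z.sum(axis=1, keepdims=True)
        loss = -np.sum(y_onehot * np.log(z + 1e-12)) / len(x)
        # Convergence check on the loss function value.
        if abs(prev_loss - loss) < tol:
            return w, loss, epoch
        # Not converged: adjust the parameters and train another round.
        w -= lr * (x.T @ (z - y_onehot) / len(x))
        prev_loss = loss
    return w, loss, max_epochs


# Hypothetical, linearly separable toy samples with two threat types.
x = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
y = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
weights, final_loss, epochs = train_until_converged(x, y)
```

The `max_epochs` cap is a safeguard so that the loop also terminates when the tolerance is never reached.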
The embodiment of the invention also has the following beneficial effects: the neural network model can be trained according to the actual threat types of the sample domain names, so that the predicted values of the neural network model approach the true values, and the threat type of the domain name to be identified determined by the trained threat type identification model is therefore more accurate.
Optionally, there may be a hierarchical relationship between the threat types, as shown in fig. 4, where each node in fig. 4 represents a threat type, and a threat type of a parent type may include one or more threat types of subtypes. For example: the trojan horse in fig. 4 (threat type of parent type) includes a backdoor (threat type of subtype).
It will be appreciated that introducing a hierarchy between threat types may improve the accuracy of threat type identification. For example, when the sample domain names used to train the threat type identification model include only a small number of sample domain names of a given subtype, the model parameters of that subtype can be regularized by those of its parent type, which helps the model when it identifies the threat type of an input domain name. The parameters of threat types with a hierarchical relationship tend to be similar.
Optionally, $L = \{l_i \mid i = 1, 2, \ldots, K\}$ may be used to represent the set of threat types, where $K$ is the number of threat types. To represent the hierarchical relationship between threat types, $\{l_i^{(j)} \mid j = 1, 2, \ldots, K_i\}$ can be used to denote the threat types that are subtypes of $l_i$, where $K_i$ is the number of threat types included in $l_i$ as subtypes. $W = \{w_{l_i} \mid l_i \in L\}$ denotes the output-layer network parameters of the threat type identification model, where $w_{l_i}$ is the network parameter of the $i$-th threat type $l_i$ at the output layer of the threat type identification model.
When there is a hierarchical relationship between threat types, the model parameters of the output layer of the neural network model may be regularized by equation (6):
$$\lambda(W) = \sum_{i=1}^{K} \sum_{j=1}^{K_i} \frac{1}{2} \left\| w_{l_i} - w_{l_i^{(j)}} \right\|^2 \quad (6)$$
where $\lambda(W)$ is the regularization term on the model parameters of the output layer, $K$ is the number of threat types, $l_i$ is the $i$-th threat type, $w_{l_i}$ is the model parameter of the $i$-th threat type $l_i$ at the output layer, and $w_{l_i^{(j)}}$ is the model parameter, at the output layer, of the $j$-th subtype $l_i^{(j)}$ included in the $i$-th threat type.
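A minimal sketch of such a hierarchy-based penalty is shown below, assuming it sums half the squared distance between the output-layer parameters of each parent type and each of its subtypes. The function name, the row-per-type parameter layout, and the 1/2 factor are assumptions for illustration.

```python
import numpy as np


def hierarchy_penalty(w, children):
    """Penalise differences between parent and subtype parameters.

    w        : (K, d) array, row i holds the output-layer parameters of type i
    children : dict mapping a parent type index to its subtype indices
    """
    total = 0.0
    for parent, subtypes in children.items():
        for child in subtypes:
            diff = w[parent] - w[child]
            total += 0.5 * float(diff @ diff)
    return total


# Hypothetical hierarchy: type 0 (e.g. trojan) has subtypes 1 and 2.
w = np.array([[1.0, 0.0],
              [1.0, 2.0],
              [0.0, 0.0]])
penalty = hierarchy_penalty(w, {0: [1, 2]})
```

The penalty is zero only when every subtype shares its parent's parameters, so minimizing it pulls the parameters of related threat types toward each other.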
Optionally, the model parameters of the neural network model may be adjusted according to the loss function values by equation (7):
J=H+Cλ(W) (7)
where J is the adjusted loss function value, H is the loss function value, C is the preset penalty parameter, and λ(W) is the regularization term computed from the model parameters of the output layer.
Optionally, the preset penalty parameter C may be set empirically.
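The effect of the penalty parameter in equation (7) can be seen with a few lines of arithmetic. The values of H, λ(W), and C below are hypothetical, and the non-negativity check on C is an assumption of this sketch.

```python
def regularized_loss(h, penalty, c):
    """Return J = H + C * lambda(W)."""
    if c < 0:
        raise ValueError("penalty parameter C is assumed to be non-negative")
    return h + c * penalty


# Hypothetical values: H = 0.98 and lambda(W) = 2.5.
j = regularized_loss(h=0.98, penalty=2.5, c=0.01)

# A larger C makes the hierarchy term weigh more heavily in training.
j_by_c = {c: regularized_loss(0.98, 2.5, c) for c in (0.0, 0.01, 0.1)}
```

With C set to 0 the adjusted loss reduces to H, which gives a simple baseline when tuning C empirically.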
The embodiment of the invention also has the following beneficial effects: the embodiment of the invention can regularize the model parameters of the output layer of the neural network model, and relieve the overfitting problem of the neural network model.
Corresponding to the above method embodiment, referring to fig. 5, an embodiment of the present invention further provides a threat type identification apparatus, including: a construction module 501, an input module 502 and an acquisition module 503;
the construction module 501 is configured to construct a to-be-identified matrix, where the to-be-identified matrix includes an attribute matrix and a relationship matrix, the attribute matrix includes attribute information of a to-be-identified domain name and attribute information of a target domain name of a known threat type, the relationship matrix includes a similarity between the to-be-identified domain name and the target domain name, the similarity between the to-be-identified domain name and the target domain name is determined based on an association relationship between the to-be-identified domain name and the target domain name, the association relationship includes an indirect relationship in which the to-be-identified domain name and the target domain name have an association relationship with a same element or a same element graph, an association relationship exists between elements included in the element graph, and the elements include an Internet Protocol (IP) address, a mailbox address, and/or a malware hash value;
an input module 502, configured to input the matrix to be identified, which is constructed by the construction module 501, into the threat type identification model;
the obtaining module 503 is configured to obtain probabilities that the domain name to be identified output by the threat type identification model belongs to each threat type, and use the threat type with the highest probability as the threat type of the domain name to be identified, where the threat type identification model is configured to determine the probability that the domain name to be identified belongs to each threat type according to the attribute information of the domain name to be identified, the attribute information of the target domain name, and the similarity between the domain name to be identified and the target domain name.
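The final selection performed by the obtaining module can be sketched in a few lines: given the per-type probabilities output by the model, the type with the highest probability is returned. The function name and the threat type labels below are hypothetical.

```python
def pick_threat_type(probabilities, threat_types):
    """Return the threat type with the highest probability and that value."""
    if len(probabilities) != len(threat_types):
        raise ValueError("one probability is required per threat type")
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return threat_types[best], probabilities[best]


# Hypothetical model output for one domain name to be identified.
label, confidence = pick_threat_type(
    [0.2, 0.7, 0.1], ["trojan", "backdoor", "other"]
)
```

Returning the probability along with the label lets a caller apply its own confidence threshold, which is a design choice of this sketch rather than something the description requires.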
Optionally, the apparatus may further comprise a training module, and the training module may be configured to:
acquiring a sample matrix and actual threat types of a plurality of sample domain names corresponding to the sample matrix, wherein the sample matrix comprises an attribute matrix formed by attribute information of the plurality of sample domain names and a relation matrix used for representing the similarity between every two sample domain names in the plurality of sample domain names;
inputting the sample matrix into a neural network model, and acquiring the probability that a plurality of sample domain names identified by the neural network model belong to each threat type;
calculating a loss function value according to the probability that a plurality of sample domain names identified by the neural network model belong to each threat type and the actual threat types of the plurality of sample domain names;
judging whether the neural network model converges according to the loss function value; if the neural network model is converged, obtaining a threat type identification model; and if the neural network model is not converged, adjusting the model parameters of the neural network model according to the loss function value, and carrying out next training.
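Optionally, the similarity between each two sample domain names in the multiple sample domain names may be obtained by the following formula:
$$b_{i,j} = \sum_{k=1}^{n} \beta_k \cdot \frac{2 \cdot PC_{\Phi_k}(v_i, v_j)}{PC_{\Phi_k}(v_i, v_i) + PC_{\Phi_k}(v_j, v_j)}$$
where $\Phi_k$ is the $k$-th meta-path or meta-graph, $k = 1, 2, \ldots, n$, $n$ is the total number of categories of meta-paths and meta-graphs, $PC_{\Phi_k}$ is the number of paths between two domain names under $\Phi_k$, $\beta_k$ is the weight of $\Phi_k$ and satisfies $\beta_k > 0$ and $\sum_{k=1}^{n} \beta_k = 1$, and $b_{i,j}$ is the similarity between the $i$-th domain name and the $j$-th domain name.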
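Optionally, the neural network model may identify the probability that each sample domain name belongs to each threat type by the following formula:
$$Z = \mathrm{softmax}\left(\hat{B}\,\mathrm{ReLU}\left(\hat{B} X W^{(0)}\right) W^{(1)}\right)$$
where $Z$ is the probability that the sample domain name belongs to each threat type, $X$ is an attribute matrix, $B$ is a relationship matrix, $\mathrm{softmax}(x_i) = \exp(x_i) / \sum_j \exp(x_j)$ represents normalization applied to the rows of the matrix, $\mathrm{ReLU}(\cdot) = \max(0, \cdot)$, $\hat{B} = \tilde{D}^{-1/2} \tilde{B} \tilde{D}^{-1/2}$, $\tilde{B} = B + I_N$, $\tilde{D}_{ii} = \sum_j \tilde{B}_{ij}$, $I_N$ is an identity matrix, and $W^{(0)}$ and $W^{(1)}$ are the weights of the neural network model.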
Optionally, the loss function value may be calculated by the following formula:
$$H = -\sum_{i=1}^{N'} \sum_{k=1}^{K} l_k(v_i) \ln Z_k(v_i)$$
where $H$ is the loss function value, $N'$ is the number of domain names with corresponding threat types, $K$ is the number of threat types, $l_k(v_i)$ indicates whether the $i$-th node $v_i$ belongs to threat type $k$, and $Z_k(v_i)$ is the probability, predicted by the neural network model, that the $i$-th node $v_i$ belongs to threat type $k$.
Optionally, the element may also include a domain name;
the construction module 501 may also be configured to, before constructing the matrix to be identified, obtain association relationships between nodes of different types from a first data source, and construct a threat intelligence heterogeneous graph, where each node in the threat intelligence heterogeneous graph corresponds to an element, and the type to which each node belongs is the type of the element corresponding to the node;
the construction module 501 may be specifically configured to:
acquiring attribute information of a domain name to be identified and attribute information of the target domain name, and constructing an attribute matrix;
determining the similarity between the domain name to be identified and the target domain name according to the threat intelligence heterogeneous graph;
and constructing a relation matrix according to the similarity between the domain name to be identified and the target domain name.
Optionally, the construction module 501 may be specifically configured to:
acquiring the association relationships among nodes of the same type from a second data source, and constructing the threat intelligence heterogeneous graph based on the association relationships among nodes of different types and the association relationships among nodes of the same type.
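Building the heterogeneous graph from the two data sources can be sketched with plain Python containers, assuming each record names two nodes and their element types. The class name, record layout, and all domain names and addresses below are hypothetical placeholders.

```python
from collections import defaultdict


class ThreatIntelGraph:
    """Undirected graph whose nodes each carry an element type."""

    def __init__(self):
        self.node_type = {}             # node id -> element type
        self.edges = defaultdict(set)   # node id -> neighbouring node ids

    def add_relation(self, src, src_type, dst, dst_type):
        for node, ntype in ((src, src_type), (dst, dst_type)):
            known = self.node_type.setdefault(node, ntype)
            if known != ntype:
                raise ValueError(f"{node} already has type {known}")
        self.edges[src].add(dst)
        self.edges[dst].add(src)

    def neighbors(self, node, ntype):
        """Neighbours of node that have the requested element type."""
        return sorted(n for n in self.edges.get(node, ())
                      if self.node_type[n] == ntype)


# First data source: relations between nodes of different types.
cross_type = [
    ("a.example", "domain", "198.51.100.7", "ip"),
    ("b.example", "domain", "198.51.100.7", "ip"),
    ("a.example", "domain", "admin@a.example", "email"),
]
# Second data source: relations between nodes of the same type.
same_type = [
    ("a.example", "domain", "b.example", "domain"),
]

graph = ThreatIntelGraph()
for record in cross_type + same_type:
    graph.add_relation(*record)
```

Keeping the element type on every node makes it straightforward to extract a typed adjacency matrix later, for example all domain-to-IP edges.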
An embodiment of the present invention further provides an electronic device, as shown in fig. 6, including a processor 601, a communication interface 602, a memory 603, and a communication bus 604, where the processor 601, the communication interface 602, and the memory 603 complete mutual communication through the communication bus 604,
a memory 603 for storing a computer program;
the processor 601 is configured to implement the steps executed by the electronic device in the above method embodiments when executing the program stored in the memory 603.
The communication bus mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the electronic equipment and other equipment.
The Memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The Processor may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; but also Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components.
In a further embodiment of the present invention, a computer-readable storage medium is also provided, in which a computer program is stored, which, when being executed by a processor, carries out the steps of any of the above-mentioned threat type identification methods.
In a further embodiment provided by the present invention, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform any of the threat type identification methods of the above embodiments.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, the embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center via wired means (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless means (e.g., infrared, radio, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, as for the apparatus embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.
Claims (6)
1. A method for threat type identification, the method comprising:
constructing a matrix to be identified, wherein the matrix to be identified comprises an attribute matrix and a relation matrix, the attribute matrix comprises attribute information of a domain name to be identified and attribute information of a target domain name of a known threat type, the relation matrix comprises similarity between the domain name to be identified and the target domain name, the similarity between the domain name to be identified and the target domain name is determined based on incidence relation between the domain name to be identified and the target domain name, the incidence relation comprises indirect relation of incidence relation between the domain name to be identified and the target domain name and the same element or the same element graph, incidence relation exists among elements in the element graph, and the elements comprise Internet Protocol (IP) addresses, mailbox addresses and/or malicious software hash values;
inputting the matrix to be recognized into a threat type recognition model;
acquiring the probability that the domain name to be identified output by the threat type identification model belongs to each threat type, and taking the threat type with the highest probability as the threat type of the domain name to be identified, wherein the threat type identification model is used for determining the probability that the domain name to be identified belongs to each threat type according to the attribute information of the domain name to be identified, the attribute information of the target domain name and the similarity between the domain name to be identified and the target domain name;
the threat type recognition model is obtained by training through the following steps:
acquiring a sample matrix and actual threat types of a plurality of sample domain names corresponding to the sample matrix, wherein the sample matrix comprises an attribute matrix formed by attribute information of the plurality of sample domain names and a relation matrix used for representing the similarity between every two sample domain names in the plurality of sample domain names;
inputting the sample matrix into a neural network model, and acquiring the probability that the plurality of sample domain names identified by the neural network model belong to each threat type;
calculating a loss function value according to the probability that the plurality of sample domain names identified by the neural network model belong to each threat type and the actual threat types of the plurality of sample domain names;
judging whether the neural network model converges according to the loss function value; if the neural network model is converged, obtaining the threat type identification model; if the neural network model is not converged, adjusting the model parameters of the neural network model according to the loss function value, and carrying out next training;
the element further includes a domain name; before the constructing a matrix to be identified, the method further comprises:
acquiring incidence relations among nodes of different types from a first data source, and constructing a threat intelligence heterogeneous graph, wherein each node in the threat intelligence heterogeneous graph corresponds to one element, and the type of each node is the type of the element corresponding to the node;
the constructing of the matrix to be identified comprises the following steps:
acquiring attribute information of the domain name to be identified and attribute information of the target domain name, and constructing an attribute matrix;
determining the similarity between the domain name to be identified and the target domain name according to the threat intelligence heterogeneous graph;
constructing the relation matrix according to the similarity between the domain name to be identified and the target domain name;
the threat intelligence heterogeneous map construction method comprises the following steps:
and acquiring the incidence relation among the nodes of the same type from a second data source, and constructing a threat intelligence heterogeneous graph based on the incidence relation among the nodes of different types and the incidence relation among the nodes of the same type.
2. The method of claim 1, wherein the similarity between each two sample domain names in the plurality of sample domain names is obtained by the following formula:
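3. The method of claim 1, wherein the neural network model identifies the probability that each sample domain name belongs to each threat type by the following formula:
$$Z = \mathrm{softmax}\left(\hat{B}\,\mathrm{ReLU}\left(\hat{B} X W^{(0)}\right) W^{(1)}\right)$$
wherein $Z$ is the probability that the sample domain name belongs to each threat type, $X$ is the attribute matrix, $B$ is the relationship matrix, $\mathrm{softmax}(x_i) = \exp(x_i) / \sum_j \exp(x_j)$ represents normalization applied to the rows of the matrix, $\mathrm{ReLU}(\cdot) = \max(0, \cdot)$, $\hat{B} = \tilde{D}^{-1/2} \tilde{B} \tilde{D}^{-1/2}$, $\tilde{B} = B + I_N$, $\tilde{D}_{ii} = \sum_j \tilde{B}_{ij}$, $I_N$ is an identity matrix, and $W^{(0)}$ and $W^{(1)}$ are the weights of the neural network model.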
4. The method of claim 1, wherein the loss function value is calculated by the formula:
$$H = -\sum_{i=1}^{N'} \sum_{k=1}^{K} l_k(v_i) \ln Z_k(v_i)$$
wherein $H$ is the loss function value, $N'$ is the number of domain names with corresponding threat types, $K$ is the number of threat types, $l_k(v_i)$ indicates whether the $i$-th node $v_i$ belongs to threat type $k$, and $Z_k(v_i)$ is the probability, predicted by the neural network model, that the $i$-th node $v_i$ belongs to threat type $k$.
5. A threat type identification apparatus, the apparatus comprising:
the device comprises a construction module and a processing module, wherein the construction module is used for constructing a matrix to be identified, the matrix to be identified comprises an attribute matrix and a relation matrix, the attribute matrix comprises attribute information of a domain name to be identified and attribute information of a target domain name of a known threat type, the relation matrix comprises similarity between the domain name to be identified and the target domain name, the similarity between the domain name to be identified and the target domain name is determined based on an incidence relation between the domain name to be identified and the target domain name, the incidence relation comprises an indirect relation that the domain name to be identified and the target domain name have an incidence relation with the same element or the same element graph, the incidence relation exists among the elements in the element graph, and the elements comprise an Internet Protocol (IP) address, a mailbox address and/or a malicious software hash value;
the input module is used for inputting the matrix to be recognized constructed by the construction module into a threat type recognition model;
an obtaining module, configured to obtain probabilities that the domain name to be identified output by the threat type identification model belongs to each threat type, and use the threat type with the highest probability as the threat type of the domain name to be identified, where the threat type identification model is configured to determine, according to attribute information of the domain name to be identified, attribute information of the target domain name, and a similarity between the domain name to be identified and the target domain name, the probability that the domain name to be identified belongs to each threat type;
the apparatus further comprises a training module to:
acquiring a sample matrix and actual threat types of a plurality of sample domain names corresponding to the sample matrix, wherein the sample matrix comprises an attribute matrix formed by attribute information of the plurality of sample domain names and a relation matrix used for representing the similarity between every two sample domain names in the plurality of sample domain names;
inputting the sample matrix into a neural network model, and acquiring the probability that the plurality of sample domain names identified by the neural network model belong to each threat type;
calculating a loss function value according to the probability that the plurality of sample domain names identified by the neural network model belong to each threat type and the actual threat types of the plurality of sample domain names;
judging whether the neural network model converges according to the loss function value; if the neural network model is converged, obtaining the threat type identification model; if the neural network model is not converged, adjusting the model parameters of the neural network model according to the loss function value, and carrying out next training;
the element further includes a domain name; the building module is further configured to, before the matrix to be identified is built, obtain association relationships between nodes of different types from a first data source, and build a threat intelligence heterogeneous graph, where each node in the threat intelligence heterogeneous graph corresponds to one element, and a type to which each node belongs is a type of the element corresponding to the node;
the building module is specifically configured to:
acquiring attribute information of the domain name to be identified and attribute information of the target domain name, and constructing an attribute matrix;
determining the similarity between the domain name to be identified and the target domain name according to the threat intelligence heterogeneous graph;
constructing the relation matrix according to the similarity between the domain name to be identified and the target domain name;
the building module is specifically configured to:
and acquiring the incidence relation among the nodes of the same type from a second data source, and constructing a threat intelligence heterogeneous graph based on the incidence relation among the nodes of different types and the incidence relation among the nodes of the same type.
6. The apparatus of claim 5, wherein the similarity between each two of the plurality of sample domain names is obtained by the following formula:
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911136708.9A CN111224941B (en) | 2019-11-19 | 2019-11-19 | Threat type identification method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111224941A (en) | 2020-06-02 |
CN111224941B (en) | 2020-12-04 |
Family
ID=70827679
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911136708.9A Active CN111224941B (en) | 2019-11-19 | 2019-11-19 | Threat type identification method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111224941B (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112019521B (en) * | 2020-08-07 | 2023-04-07 | 杭州安恒信息技术股份有限公司 | Asset scoring method and device, computer equipment and storage medium |
CN112131259B (en) * | 2020-09-28 | 2024-03-15 | 绿盟科技集团股份有限公司 | Similar malicious software recommendation method, device, medium and equipment |
CN112257066B (en) * | 2020-10-30 | 2021-09-07 | 广州大学 | Malicious behavior identification method and system for weighted heterogeneous graph and storage medium |
CN113259199B (en) * | 2021-05-18 | 2022-08-12 | 中国互联网络信息中心 | Domain name credit monitoring method and device |
CN113141378B (en) * | 2021-05-18 | 2022-12-02 | 中国互联网络信息中心 | Bad domain name identification method and device |
CN115225308B (en) * | 2022-05-17 | 2024-03-12 | 国家计算机网络与信息安全管理中心 | Attack partner identification method for large-scale group attack flow and related equipment |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108540979A (en) * | 2018-04-04 | 2018-09-14 | 北京邮电大学 | Pseudo- AP detection method and device based on fingerprint characteristic |
CN108650260A (en) * | 2018-05-09 | 2018-10-12 | 北京邮电大学 | A kind of recognition methods of malicious websites and device |
CN109241989A (en) * | 2018-07-17 | 2019-01-18 | 中国电力科学研究院有限公司 | A kind of method and system of the intelligent substation intrusion scenario reduction based on space-time similarity mode |
CN110460605A (en) * | 2019-08-16 | 2019-11-15 | 南京邮电大学 | A kind of Abnormal network traffic detection method based on autocoding |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10216938B2 (en) * | 2014-12-05 | 2019-02-26 | T-Mobile Usa, Inc. | Recombinant threat modeling |
CN110198292B (en) * | 2018-03-30 | 2021-12-07 | 腾讯科技(深圳)有限公司 | Domain name recognition method and device, storage medium and electronic device |
US10931703B2 (en) * | 2018-04-24 | 2021-02-23 | ProSOC, Inc. | Threat coverage score and recommendations |
Also Published As
Publication number | Publication date |
---|---|
CN111224941A (en) | 2020-06-02 |
Similar Documents
Publication | Title |
---|---|
CN111224941B (en) | Threat type identification method and device |
US11856021B2 (en) | Detecting and mitigating poison attacks using data provenance |
Min et al. | TR-IDS: Anomaly-based intrusion detection through text-convolutional neural network and random forest |
Aljawarneh et al. | Anomaly-based intrusion detection system through feature selection analysis and building hybrid efficient model |
US20210019674A1 (en) | Risk profiling and rating of extended relationships using ontological databases |
Vinayakumar et al. | Scalable framework for cyber threat situational awareness based on domain name systems data analysis |
CN112235264B (en) | Network traffic identification method and device based on deep migration learning |
US8438386B2 (en) | System and method for developing a risk profile for an internet service |
Jiang et al. | A deep learning based online malicious URL and DNS detection scheme |
US9032527B2 (en) | Inferring a state of behavior through marginal probability estimation |
Song et al. | Advanced evasion attacks and mitigations on practical ML‐based phishing website classifiers |
CN113315742B (en) | Attack behavior detection method and device and attack detection equipment |
Biswas et al. | Botnet traffic identification using neural networks |
CN110602137A (en) | Malicious IP and malicious URL intercepting method, device, equipment and medium |
Seth et al. | MIDS: Metaheuristic based intrusion detection system for cloud using k-NN and MGWO |
Nowroozi et al. | An adversarial attack analysis on malicious advertisement url detection framework |
Li et al. | Deep learning algorithms for cyber security applications: A survey |
Elekar | Combination of data mining techniques for intrusion detection system |
Niveditha et al. | Detection of Malware attacks in smart phones using Machine Learning |
Aiello et al. | Unsupervised learning and rule extraction for Domain Name Server tunneling detection |
CN113905016A (en) | DGA domain name detection method, detection device and computer storage medium |
Peng et al. | MalShoot: shooting malicious domains through graph embedding on passive DNS data |
CN112583827A (en) | Data leakage detection method and device |
Amar et al. | Weighted LSTM for intrusion detection and data mining to prevent attacks |
Shahriar et al. | Towards an attack signature generation framework for intrusion detection systems |
Legal Events
Code | Title |
---|---|
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |
GR01 | Patent grant |