CN111224941B - Threat type identification method and device

Info

Publication number: CN111224941B
Application number: CN201911136708.9A
Authority: CN
Inventors: 李小勇; 高雅丽; 彭浩; 郭宁; 常超舜; 苑洁
Original assignee: Beijing University of Posts and Telecommunications
Current assignee: Beijing University of Posts and Telecommunications
Priority date: 2019-11-19
Filing date: 2019-11-19
Publication date: 2020-12-04
Anticipated expiration: 2039-11-19
Also published as: CN111224941A

Abstract

An embodiment of the invention provides a threat type identification method and device, relating to the technical field of network security, which can determine the threat type of a domain name whose threat type is unknown. The scheme comprises the following steps: constructing a matrix to be identified, wherein the matrix to be identified comprises an attribute matrix and a relation matrix, the attribute matrix comprises attribute information of the domain name to be identified and attribute information of a target domain name of a known threat type, and the relation matrix comprises the similarity between the domain name to be identified and the target domain name; then inputting the matrix to be identified into a threat type identification model, acquiring the probability, output by the threat type identification model, that the domain name to be identified belongs to each threat type, and taking the threat type with the maximum probability as the threat type of the domain name to be identified. The threat type identification model determines the probability that the domain name to be identified belongs to each threat type according to the attribute information of the domain name to be identified, the attribute information of the target domain name, and the similarity between the domain name to be identified and the target domain name.

Description

Threat type identification method and device

Technical Field

The invention relates to the technical field of network security, in particular to a threat type identification method and device.

Background

Network attackers typically employ a large amount of network infrastructure to carry out network attacks, and this infrastructure typically involves domain names and Internet Protocol (IP) addresses. For example, an attacker may use various propagation means to infect a large number of hosts on the Internet with bot viruses; the infected hosts receive the attacker's instructions through a control channel and then steal private data from other devices, send spam to other devices, and the like. A network formed by hosts infected with bot viruses is called a botnet.

To cope with the complexity of network attacks, more and more organizations are beginning to share network threat intelligence with open networks. Cyber threat intelligence describes existing cyber threats and describes countermeasures that can be taken in the face of cyber threats.

However, domain names of unknown threat type still exist in the network, so determining the threat type of such domain names is very important.

Disclosure of Invention

The embodiment of the invention aims to provide a threat type identification method and a threat type identification device so as to determine a threat type of a domain name of an unknown threat type. The specific technical scheme is as follows:

in a first aspect, an embodiment of the present invention provides a threat type identification method, where the method includes:

constructing a matrix to be identified, wherein the matrix to be identified comprises an attribute matrix and a relation matrix, the attribute matrix comprises attribute information of a domain name to be identified and attribute information of a target domain name of a known threat type, the relation matrix comprises the similarity between the domain name to be identified and the target domain name, the similarity between the domain name to be identified and the target domain name is determined based on the incidence relation between the domain name to be identified and the target domain name, the incidence relation comprises an indirect relation in which the domain name to be identified and the target domain name each have an incidence relation with the same element or the same element graph, incidence relations exist among the elements in the element graph, and the elements comprise Internet Protocol (IP) addresses, mailbox addresses and/or malicious software hash values;

inputting the matrix to be recognized into a threat type recognition model;

and obtaining the probability that the domain name to be identified output by the threat type identification model belongs to each threat type, and taking the threat type with the highest probability as the threat type of the domain name to be identified, wherein the threat type identification model is used for determining the probability that the domain name to be identified belongs to each threat type according to the attribute information of the domain name to be identified, the attribute information of the target domain name and the similarity between the domain name to be identified and the target domain name.

Optionally, the threat type identification model is obtained by training through the following steps:

acquiring a sample matrix and actual threat types of a plurality of sample domain names corresponding to the sample matrix, wherein the sample matrix comprises an attribute matrix formed by attribute information of the plurality of sample domain names and a relation matrix used for representing the similarity between every two sample domain names in the plurality of sample domain names;

inputting the sample matrix into a neural network model, and acquiring the probability that the plurality of sample domain names identified by the neural network model belong to each threat type;

calculating a loss function value according to the probability that the plurality of sample domain names identified by the neural network model belong to each threat type and the actual threat types of the plurality of sample domain names;

judging whether the neural network model converges according to the loss function value; if the neural network model is converged, obtaining the threat type identification model; and if the neural network model is not converged, adjusting the model parameters of the neural network model according to the loss function value, and carrying out next training.

Optionally, the similarity between every two sample domain names in the plurality of sample domain names is obtained by the following formula:

wherein Φ_k is the k-th meta-path or meta-graph, k = 1, 2, …, n, and n is the total number of categories of meta-paths and meta-graphs; the path-count term in the formula is the number of paths between two domain names under Φ_k; β_k is the weight of Φ_k and satisfies β_k > 0 and the further constraint given in the formula; and MIS(v_i, v_j) is the similarity between the i-th domain name and the j-th domain name.

Optionally, the neural network model identifies the probability that each sample domain name belongs to each threat type by the following formula:

wherein Z is the probability that the sample domain name belongs to each threat type, X is the attribute matrix, B is the relation matrix, the normalization function in the formula denotes normalizing x_i and is applied to the rows of the matrix, ReLU(·) = max(0, ·), I_N is an identity matrix, and W⁽⁰⁾ and W⁽¹⁾ are the weights of the neural network model.

Optionally, the loss function value is calculated by the following formula:

wherein H is the loss function value, N' is the number of domain names with a corresponding threat type, K is the number of threat types, l_k(v_i) indicates whether the i-th node v_i belongs to threat type k, and Z_k(v_i) is the probability, predicted by the neural network model, that the i-th node v_i belongs to threat type k.

Optionally, the element further includes a domain name; before the constructing a matrix to be identified, the method further comprises:

acquiring incidence relations among nodes of different types from a first data source, and constructing a threat intelligence heterogeneous graph, wherein each node in the threat intelligence heterogeneous graph corresponds to one element, and the type of each node is the type of the element corresponding to the node;

the constructing of the matrix to be identified comprises the following steps:

acquiring attribute information of the domain name to be identified and attribute information of the target domain name, and constructing an attribute matrix;

determining the similarity between the domain name to be identified and the target domain name according to the threat intelligence heterogeneous graph;

and constructing the relation matrix according to the similarity between the domain name to be identified and the target domain name.

Optionally, the constructing a threat intelligence heterogeneous map includes:

and acquiring the incidence relation among the nodes of the same type from a second data source, and constructing a threat intelligence heterogeneous graph based on the incidence relation among the nodes of different types and the incidence relation among the nodes of the same type.

In a second aspect, an embodiment of the present invention provides a threat type identification apparatus, where the apparatus includes:

the device comprises a construction module and a processing module, wherein the construction module is used for constructing a matrix to be identified, the matrix to be identified comprises an attribute matrix and a relation matrix, the attribute matrix comprises attribute information of a domain name to be identified and attribute information of a target domain name of a known threat type, the relation matrix comprises similarity between the domain name to be identified and the target domain name, the similarity between the domain name to be identified and the target domain name is determined based on an incidence relation between the domain name to be identified and the target domain name, the incidence relation comprises an indirect relation that the domain name to be identified and the target domain name have an incidence relation with the same element or the same element graph, the incidence relation exists among the elements in the element graph, and the elements comprise an Internet Protocol (IP) address, a mailbox address and/or a malicious software hash value;

the input module is used for inputting the matrix to be recognized constructed by the construction module into a threat type recognition model;

and the obtaining module is used for obtaining the probability that the domain name to be identified, which is output by the threat type identification model, belongs to each threat type, and taking the threat type with the highest probability as the threat type of the domain name to be identified, wherein the threat type identification model is used for determining the probability that the domain name to be identified belongs to each threat type according to the attribute information of the domain name to be identified, the attribute information of the target domain name and the similarity between the domain name to be identified and the target domain name.

Optionally, the apparatus further comprises a training module, wherein the training module is configured to:

is the similarity between the ith domain name and the jth domain name.

Optionally, the loss function value is calculated by the following formula:

Optionally, the element further includes a domain name; the building module is further configured to, before the matrix to be identified is built, obtain association relationships between nodes of different types from a first data source, and build a threat intelligence heterogeneous graph, where each node in the threat intelligence heterogeneous graph corresponds to one element, and a type to which each node belongs is a type of the element corresponding to the node;

the building module is specifically configured to:

Optionally, the building module is specifically configured to:

In a third aspect, an embodiment of the present invention provides an electronic device, which is characterized by including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete mutual communication through the communication bus;

a memory for storing a computer program;

and the processor is used for realizing the steps of any threat type identification method when executing the program stored in the memory.

In a fourth aspect, the embodiment of the present invention further provides a computer-readable storage medium, in which a computer program is stored, and the computer program, when executed by a processor, implements the steps of any of the above threat type identification methods.

In a fifth aspect, embodiments of the present invention also provide a computer program product containing instructions, which when run on a computer, cause the computer to perform any of the above threat type identification methods.

The embodiment of the invention has at least the following beneficial effects: the threat type identification model in the embodiment of the invention can determine the threat type of the domain name to be identified based on the attribute information of the domain name to be identified, the attribute information of the target domain name, and the incidence relation between the domain name to be identified and the target domain name, wherein the incidence relation comprises an indirect relation in which the domain name to be identified and the target domain name each have an incidence relation with the same element or the same element graph. The embodiment of the invention therefore determines the threat type of the domain name to be identified through the indirect relation between the domain name to be identified and the target domain name, that is, by extracting the high-order semantics between the two.

Of course, not all of the advantages described above need to be achieved at the same time in the practice of any one product or method of the invention.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

Fig. 1 is a flowchart of a threat type identification method according to an embodiment of the present invention;

FIG. 2 is an exemplary diagram of a threat intelligence heterogeneous graph provided by an embodiment of the present invention;

FIG. 3 is an exemplary diagram of various meta-paths/graphs provided by an embodiment of the present invention;

FIG. 4 is an exemplary diagram illustrating a hierarchical relationship between threat types according to an embodiment of the present invention;

fig. 5 is a schematic structural diagram of a threat type identification apparatus according to an embodiment of the present invention;

fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to fig. 1, an embodiment of the present invention provides a threat type identification method, which may be applied to an electronic device, where the electronic device may be a mobile phone, a computer, a tablet computer, or the like, and the method includes the following steps:

Step 101, constructing a matrix to be identified.

The matrix to be identified comprises an attribute matrix and a relation matrix, wherein the attribute matrix comprises attribute information of a domain name to be identified and attribute information of each target domain name of a known threat type, the relation matrix comprises the similarity between the domain name to be identified and each target domain name, the similarity between the domain name to be identified and each target domain name is determined based on the incidence relation between the domain name to be identified and that target domain name, the incidence relation comprises an indirect relation in which the domain name to be identified and the target domain name each have an incidence relation with the same element or the same element graph, incidence relations exist among the elements in the element graph, and the elements comprise Internet Protocol (IP) addresses, mailbox addresses and/or malicious software hash values.

Step 102, inputting the matrix to be identified into a threat type identification model.

Step 103, acquiring the probability, output by the threat type identification model, that the domain name to be identified belongs to each threat type, and taking the threat type with the maximum probability as the threat type of the domain name to be identified.

The threat type identification model is used for determining the probability that the domain name to be identified belongs to each threat type according to the attribute information of the domain name to be identified, the attribute information of the target domain name and the similarity between the domain name to be identified and the target domain name.
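Step 103 amounts to taking the arg-max over the model's output probabilities. The following is a minimal sketch under stated assumptions: the function name `pick_threat_type`, the threat-type labels, and the probability values are hypothetical and are not taken from the patent.

```python
def pick_threat_type(probabilities):
    """Return the threat type with the highest probability.

    `probabilities` maps a threat-type label to the probability output
    by the identification model for one domain name to be identified.
    """
    if not probabilities:
        raise ValueError("no probabilities supplied")
    # max() with key=dict.get returns the label whose value is largest.
    return max(probabilities, key=probabilities.get)


# Hypothetical model output for a single domain name.
example_output = {"botnet": 0.12, "phishing": 0.71, "trojan": 0.17}
predicted = pick_threat_type(example_output)
```

If two threat types tie, `max` returns the first one encountered; the patent does not say how ties are resolved.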

Optionally, before step 101, a threat intelligence heterogeneous graph may be constructed, so that the number of paths between domain names can be obtained from the threat intelligence heterogeneous graph and the similarity between the domain names can then be calculated.

The threat intelligence heterogeneous graph is constructed as follows: incidence relations among nodes of different types are acquired from the first data source, and the threat intelligence heterogeneous graph is constructed from them. Each node in the threat intelligence heterogeneous graph corresponds to one element, and the type of each node is the type of the element corresponding to the node.

Optionally, the threat intelligence heterogeneous graph may be enriched by determining the incidence relations among nodes of the same type, including: acquiring the incidence relations among nodes of the same type from a second data source, and constructing the threat intelligence heterogeneous graph based on the incidence relations among nodes of different types and the incidence relations among nodes of the same type.

For example, if two Domain names have a common owner (owner) and/or a common Domain Name System (DNS) location, an association exists between the two Domain Name nodes. If two IP addresses have the same Internet Service Provider (ISP), an association relationship exists between the two IP address nodes. If two pieces of software attack the electronic equipment by using the same vulnerability, an association relationship exists between the two software hash value nodes.
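The same-type associations described above can be derived by grouping nodes on a shared attribute. The sketch below is illustrative only: the function name `same_type_edges`, the IP addresses (documentation ranges), and the ISP labels are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def same_type_edges(attribute_by_node):
    """Link every pair of nodes that share the same attribute value.

    `attribute_by_node` maps a node id (e.g. an IP address) to the value
    it is grouped on (e.g. its Internet Service Provider).
    """
    groups = defaultdict(list)
    for node, value in attribute_by_node.items():
        groups[value].append(node)
    edges = set()
    for nodes in groups.values():
        # Every unordered pair inside a group becomes one association.
        edges.update(combinations(sorted(nodes), 2))
    return edges


# Hypothetical IP-to-ISP lookup results.
isp_by_ip = {
    "192.0.2.1": "ISP-A",
    "192.0.2.2": "ISP-A",
    "198.51.100.7": "ISP-B",
}
ip_edges = same_type_edges(isp_by_ip)
```

The same helper could be applied to domain names keyed by owner, or to malware hashes keyed by the exploited vulnerability.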

For example, the constructed threat intelligence heterogeneous graph is shown in fig. 2, and the threat intelligence heterogeneous graph in fig. 2 includes four types of nodes, which are respectively: malware (Malware) hash value node (node M in fig. 2), IP address node (node I in fig. 2), Domain name (Domain) node (node D in fig. 2), and mailbox (Email) address node (node E in fig. 2). The arrows in fig. 2 represent the association between the nodes.

A piece of threat intelligence may be represented by the incidence relations among a set of nodes in the threat intelligence heterogeneous graph, and thus may be a subgraph of the threat intelligence heterogeneous graph. A meta-path defined over the node types of the threat intelligence heterogeneous graph reflects the incidence relations among the nodes on the meta-path and the similarity among those nodes. For example, the meta-path Domain-Malware-Domain may indicate that two domain names are accessed by the same malware, and the meta-path Domain-Email-Domain may indicate that two domain names are registered by the same mailbox address.
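To make the meta-path idea concrete, the sketch below enumerates domain pairs joined by a Domain-X-Domain meta-path from a list of typed edges. The edge format, the function name `domains_linked_via`, and all node identifiers are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical typed edges: (source type, source id, target type, target id).
edges = [
    ("Domain", "d1.example", "Email", "a@example.org"),
    ("Domain", "d2.example", "Email", "a@example.org"),
    ("Domain", "d3.example", "Email", "b@example.org"),
    ("Domain", "d1.example", "Malware", "hash-01"),
    ("Domain", "d3.example", "Malware", "hash-01"),
]


def domains_linked_via(edges, middle_type):
    """Return domain pairs that share a neighbour of type `middle_type`,
    i.e. instances of the meta-path Domain-<middle_type>-Domain."""
    by_middle = defaultdict(set)
    for src_type, src, dst_type, dst in edges:
        if src_type == "Domain" and dst_type == middle_type:
            by_middle[dst].add(src)
    pairs = set()
    for domains in by_middle.values():
        pairs.update(combinations(sorted(domains), 2))
    return pairs


email_pairs = domains_linked_via(edges, "Email")      # Domain-Email-Domain
malware_pairs = domains_linked_via(edges, "Malware")  # Domain-Malware-Domain
```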

The embodiment of the invention can also bring the following beneficial effects: the threat intelligence heterogeneous graph can effectively and compactly express the network attacks with the incidence relation by using different semantics, and the threat intelligence heterogeneous graph has great potential for knowledge discovery (for example, capturing complex relations among different types of nodes, distinguishing different network attacks based on the difference of network behaviors, exploring how attackers organize the attacks and adjusting attack technologies, and the like). The embodiment of the invention can better mine the intelligence by utilizing the incidence relation among the nodes included in the threat intelligence heterogeneous graph, thereby greatly reducing the workload of a security analyst.

Optionally, the method for constructing the matrix to be identified in step 101 includes:

step one, acquiring attribute information of a domain name to be identified and attribute information of a target domain name, and constructing an attribute matrix.

And step two, determining the similarity between the domain name to be identified and the target domain name according to the threat intelligence heterogeneous graph.

It can be understood that the number of paths of the domain name to be identified and the target domain name under each meta-path or meta-graph can be obtained according to the incidence relation between the domain name to be identified and the target domain name in the threat intelligence heterogeneous graph, and then the similarity between the domain name to be identified and the target domain name is calculated.

And step three, constructing a relation matrix according to the similarity between the domain name to be identified and the target domain name.

Optionally, the attribute information in the attribute matrix in step 101 may include: the length of the domain name, the entropy of character distribution information of the domain name, the survival time of the domain name, the updating frequency of the domain name and the like.

Optionally, the attribute matrix may include N × m attributes, where N is the number of domain names and m is the number of attributes included in each domain name.

For example: attribute information matrix

Wherein each row is attribute information of a domain name.
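One row of the attribute matrix could be assembled as sketched below. The feature order, the units (days, updates per year), and the function names are assumptions; the patent lists the attributes but does not specify how they are computed.

```python
import math
from collections import Counter


def char_entropy(name):
    """Shannon entropy (in bits) of the character distribution of `name`."""
    if not name:
        return 0.0
    total = len(name)
    counts = Counter(name)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def attribute_row(name, lifetime_days, updates_per_year):
    """Build one row: [length, character entropy, survival time, update frequency]."""
    return [
        float(len(name)),
        char_entropy(name),
        float(lifetime_days),
        float(updates_per_year),
    ]


# An N x m attribute matrix is then a list of N such rows (m = 4 here).
X = [
    attribute_row("ab", 10, 2),
    attribute_row("aaaa", 365, 0),
]
```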

Optionally, the relation-based adjacency matrices include: matrix R, matrix S, matrix G, matrix C, and matrix N. The element r_ij of matrix R indicates whether the i-th domain name is resolved to the j-th IP address; if so, r_ij = 1; if not, r_ij = 0. The element s_ij of matrix S indicates whether the i-th domain name is accessed by the j-th malware; if so, s_ij = 1; if not, s_ij = 0. The element g_ij of matrix G indicates whether the i-th domain name is registered by the j-th mailbox address; if so, g_ij = 1; if not, g_ij = 0. The element c_ij of matrix C indicates whether the i-th IP address communicates with the j-th malware; if so, c_ij = 1; if not, c_ij = 0. The element n_ij of matrix N indicates whether the i-th IP address communicates with the j-th mailbox address; if so, n_ij = 1; if not, n_ij = 0.
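Each of these 0/1 matrices can be built from a list of observed relations. The sketch below builds matrix R as an example; the helper name `build_adjacency` and the sample identifiers are hypothetical.

```python
def build_adjacency(row_ids, col_ids, relations):
    """Return a 0/1 matrix whose entry [i][j] is 1 when
    (row_ids[i], col_ids[j]) appears in `relations`."""
    row_index = {r: i for i, r in enumerate(row_ids)}
    col_index = {c: j for j, c in enumerate(col_ids)}
    matrix = [[0] * len(col_ids) for _ in row_ids]
    for row_id, col_id in relations:
        matrix[row_index[row_id]][col_index[col_id]] = 1
    return matrix


# Hypothetical observations: which domain name resolves to which IP address.
domains = ["d1.example", "d2.example"]
ips = ["192.0.2.1", "192.0.2.2", "198.51.100.7"]
resolves_to = [
    ("d1.example", "192.0.2.1"),
    ("d2.example", "192.0.2.1"),
    ("d2.example", "198.51.100.7"),
]
R = build_adjacency(domains, ips, resolves_to)
```

Matrices S, G, C, and N would be built the same way from their own relation lists.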

Optionally, the matrix element b_ij of the relation matrix B in step 101 indicates the similarity between the i-th domain name v_i and the j-th domain name v_j.

Optionally, the similarity between v_i and v_j can be obtained by the following formula (1):

where MIS(v_i, v_j) is the similarity between the i-th domain name and the j-th domain name.

It can be understood from formula (1) that the similarity between two domain names depends on the number of paths between the two nodes on the one hand, and on the paths from each domain name node back to itself on the other. The weight vector β_k can be learned automatically by the threat type identification model.
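Formula (1) itself is not reproduced in this text. As an illustration only, the sketch below assumes a weighted PathSim-style form, MIS(v_i, v_j) = Σ_k β_k · 2·M_k[i][j] / (M_k[i][i] + M_k[j][j]), where M_k is the commuting matrix under Φ_k. This assumption agrees with the remark that the similarity depends both on the paths between the two nodes and on the paths of each node to itself, but the patent's exact formula may differ. The function name and the sample matrices are hypothetical, and the weights are assumed to sum to 1.

```python
def weighted_similarity(commuting_matrices, weights, i, j):
    """Assumed PathSim-style similarity between domain names i and j.

    commuting_matrices[k][a][b] is the number of paths between domain
    names a and b under the k-th meta-path or meta-graph; weights[k] is
    the corresponding beta_k.
    """
    total = 0.0
    for matrix, beta in zip(commuting_matrices, weights):
        denominator = matrix[i][i] + matrix[j][j]
        if denominator > 0:  # skip meta-paths on which neither node has paths
            total += beta * 2.0 * matrix[i][j] / denominator
    return total


# Two hypothetical commuting matrices over two domain names.
M1 = [[2, 1], [1, 1]]
M2 = [[1, 0], [0, 1]]
betas = [0.5, 0.5]  # assumed to be positive and to sum to 1

b_01 = weighted_similarity([M1, M2], betas, 0, 1)
b_00 = weighted_similarity([M1, M2], betas, 0, 0)
```

Under this assumed form, a domain name is maximally similar to itself and the similarity is symmetric in i and j.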

Optionally, as shown in fig. 3, D represents a domain name node, M represents a malware hash value node, I represents an IP address node, and E represents a mailbox address node. The embodiment of the invention defines 7 types of meta-paths (Φ₁ to Φ₇) and 5 types of meta-graphs (Φ₈ to Φ₁₂).

The number of paths between node v_i and node v_j under Φ_k is:

where the matrix in the formula is the commuting matrix between domain name node v_i and domain name node v_j under Φ_k. The commuting matrix is a kind of adjacency matrix.

When Φ_k = (A₁, A₂, …, A_d+1) is a meta-path, the commuting matrix between domain name node A₁ and domain name node A_d+1 is:

where each factor in the formula is the matrix between node A_i and node A_i+1, and the symbol "·" denotes matrix multiplication.
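The commuting matrix of a meta-path is conventionally the product of the adjacency matrices along the path. The sketch below applies that convention to Domain-IP-Domain, using a hypothetical domain-to-IP matrix R and its transpose; the helper names are assumptions and the display formula in the patent is not reproduced here.

```python
def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(a):
    """Swap rows and columns."""
    return [list(column) for column in zip(*a)]


def metapath_commuting(adjacency_chain):
    """Multiply the adjacency matrices along a meta-path, left to right."""
    result = adjacency_chain[0]
    for matrix in adjacency_chain[1:]:
        result = matmul(result, matrix)
    return result


# Hypothetical R: 3 domain names x 2 IP addresses.
R = [[1, 0], [1, 1], [0, 1]]
# Domain-IP-Domain: entry [i][j] counts the IP addresses shared by domains i and j.
M_did = metapath_commuting([R, transpose(R)])
```

The diagonal entry [i][i] counts the paths from domain name i back to itself, which here equals the number of IP addresses it resolves to.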

When Φ_k is a meta-graph, taking Φ₁₀ as an example, the commuting matrix under Φ₁₀ is calculated by the following steps:

Step 1, calculating the commuting matrix of sub-path P₁,

where P₁ represents the sub-path (IP-Email-IP), the calculated matrix is the commuting matrix between two IP address nodes under sub-path P₁, Q_IE represents the commuting matrix between IP addresses and mailbox addresses, the element n_ij of matrix N indicates whether the i-th IP address node communicates with the j-th mailbox address node, and ·^T denotes the transpose of a matrix.

Step 2, calculating the commuting matrix of sub-path P₂,

where P₂ represents the sub-path (IP-Malware-IP), the calculated matrix is the commuting matrix between two IP address nodes under sub-path P₂, Q_IM represents the commuting matrix between IP addresses and malware, the element c_ij of matrix C indicates whether the i-th IP address is accessed by the j-th malware, and ·^T denotes the transpose of a matrix.

Step 3, calculating the commuting matrix of P_r,

where P_r represents the directed acyclic graph formed by path P₁ and path P₂, the calculated matrix is the commuting matrix between two IP address nodes under P_r, and the operator in the formula denotes a multiplication of matrices.

Step 4, calculating the commuting matrix between two domain name nodes under Φ₁₀.
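The four steps above can be sketched as follows. The formulas are not reproduced in this text, so two points are assumptions: the two parallel sub-paths are combined in step 3 with an element-wise (Hadamard) product, which is common practice for meta-graphs although the text describes the operator only as a multiplication, and step 4 wraps the result with the domain-to-IP matrix R on both sides. All matrices and helper names are hypothetical.

```python
def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(a):
    return [list(column) for column in zip(*a)]


def hadamard(a, b):
    """Element-wise product of two equally sized matrices."""
    return [[x * y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


N_ie = [[1, 0], [1, 1]]          # 2 IP addresses x 2 mailbox addresses
C_im = [[1], [1]]                # 2 IP addresses x 1 malware
R_di = [[1, 0], [0, 1], [1, 1]]  # 3 domain names x 2 IP addresses

M_p1 = matmul(N_ie, transpose(N_ie))  # step 1: IP-Email-IP
M_p2 = matmul(C_im, transpose(C_im))  # step 2: IP-Malware-IP
M_pr = hadamard(M_p1, M_p2)           # step 3: assumed element-wise combination
# step 4 (assumed): Domain-IP-(P_r)-IP-Domain
M_phi10 = matmul(matmul(R_di, M_pr), transpose(R_di))
```

With an element-wise product, an IP pair contributes only if it is connected through both a shared mailbox address and a shared malware, which matches the intent of a meta-graph that requires both branches.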

For example, the relation matrix in step 101 above may be written as:

where each matrix element represents the similarity between the corresponding two domain names. For example, b_1,2 is the similarity between the 1st domain name and the 2nd domain name.

It can be understood that, since different meta-paths and meta-graphs have different meanings, the similarity between two domain names that they represent also differs. Therefore, to distinguish the importance of the different meta-paths and meta-graphs, a corresponding weight β_k may be set for each meta-path and meta-graph.

For example, domain name D₁ and domain name D₂ are registered by the same mailbox address E₁, so the meta-path (Domain₁-Email₁-Domain₂) exists between domain name D₁ and domain name D₂. Domain name D₁ and domain name D₂ are also both accessed by malware M₁, so the meta-path (Domain₁-Malware₁-Domain₂) also exists between them. Where the threat source matters more, the weight of the meta-path Domain-Email-Domain can be set higher than the weight of the meta-path Domain-Malware-Domain. Where the threat behavior matters more, the weight of the meta-path Domain-Malware-Domain can be set higher than the weight of the meta-path Domain-Email-Domain.

It will be appreciated that, to exploit the complementarity of the different meta-paths and meta-graphs and the similarity between two domain names that each of them represents, the similarity between two domain names may be calculated based on the weighted adjacency matrices of the meta-paths and meta-graphs.

Optionally, the threat type identification model in fig. 1 is obtained by training through the following steps:

step one, acquiring a sample matrix and actual threat types of a plurality of sample domain names corresponding to the sample matrix.

The sample matrix comprises an attribute matrix formed by attribute information of a plurality of sample domain names and a relation matrix used for representing the similarity between every two sample domain names in the plurality of sample domain names.

And step two, inputting the sample matrix into the neural network model, and acquiring threat types of a plurality of sample domain names identified by the neural network model.

In one embodiment, the neural network model can identify the probability that each sample domain name belongs to each threat type by the following equation (4):

wherein Z is the probability that the sample domain name belongs to each threat type, X is the attribute matrix of the plurality of sample domain names, and B is the relation matrix of the plurality of sample domain names; the normalization function in the formula denotes normalizing x_i and is applied to the rows of the matrix; ReLU(·) = max(0, ·) denotes selecting the larger of 0 and its argument as the calculation result; I_N is an identity matrix; W⁽⁰⁾ and W⁽¹⁾ are the weights of the neural network model, W⁽⁰⁾ being the weight matrix from the input layer to the hidden layer and W⁽¹⁾ being the weight matrix from the hidden layer to the output layer. The two remaining indexed symbols in equation (4) each denote the element in the i-th row and j-th column of the corresponding matrix.

It will be appreciated that the neural network model can be abstracted to include three network layers, respectively: the device comprises an input layer, a hidden layer and an output layer, wherein weight matrixes are stored among network layers.
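Equation (4) is not reproduced in this text. The sketch below assumes the widely used two-layer graph convolutional form Z = softmax(B̂ · ReLU(B̂ · X · W⁽⁰⁾) · W⁽¹⁾), where B̂ is obtained by adding I_N to B and normalizing symmetrically by the row sums. This is consistent with the symbols named above (X, B, I_N, ReLU, W⁽⁰⁾, W⁽¹⁾, row-wise normalization), but it is an assumption and not a confirmed transcription. Dropout is omitted, and all values and function names are hypothetical.

```python
import math


def matmul(a, b):
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def normalize_relation(B):
    """Assumed normalization: D^(-1/2) (B + I_N) D^(-1/2), D = row sums of B + I_N."""
    n = len(B)
    b_tilde = [[B[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    degree = [sum(row) for row in b_tilde]
    return [
        [b_tilde[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
        for i in range(n)
    ]


def relu(m):
    return [[max(0.0, value) for value in row] for row in m]


def softmax_rows(m):
    out = []
    for row in m:
        peak = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp(value - peak) for value in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def forward(X, B, W0, W1):
    """Input layer -> hidden layer -> output layer, one row of Z per domain name."""
    b_hat = normalize_relation(B)
    hidden = relu(matmul(matmul(b_hat, X), W0))
    return softmax_rows(matmul(matmul(b_hat, hidden), W1))


X = [[1.0, 0.0], [0.0, 1.0]]   # 2 domain names x 2 attributes
B = [[0.0, 0.5], [0.5, 0.0]]   # pairwise similarities
W0 = [[1.0, 0.0], [0.0, 1.0]]
W1 = [[1.0, 0.0], [0.0, 1.0]]
Z = forward(X, B, W0, W1)
```

Each row of `Z` sums to 1, so the row for the domain name to be identified can be read directly as its per-type probabilities.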

Optionally, randomness may be introduced during training by means of random inactivation (dropout).

And step three, calculating a loss function value according to the threat types of the plurality of sample domain names identified by the neural network model and the actual threat types of the plurality of sample domain names.

In one embodiment, the loss function value may be calculated by the following equation (5):

And step four, judging whether the neural network model is converged or not according to the loss function value. If the neural network model is converged, obtaining a threat type identification model; and if the neural network model is not converged, adjusting the model parameters of the neural network model according to the loss function value, and carrying out next training.
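Steps three and four can be sketched as a loss function plus a convergence loop. Equation (5) is not reproduced in this text; the sketch assumes the usual cross-entropy summed over labeled domain names, H = −Σ_i Σ_k l_k(v_i) · ln Z_k(v_i), which matches the symbols defined for H. The convergence criterion (change in loss below a tolerance), the `step` callback, and all numbers are hypothetical.

```python
import math


def cross_entropy(probabilities, labels):
    """Assumed loss: sum of -ln(probability of the actual threat type).

    probabilities[i][k] is the predicted probability that domain name i
    has threat type k; labels[i] is the index of its actual threat type,
    or None when the domain name has no known threat type.
    """
    total = 0.0
    for probs, label in zip(probabilities, labels):
        if label is None:
            continue  # only the N' labeled domain names contribute
        total -= math.log(max(probs[label], 1e-12))  # guard against log(0)
    return total


def train_until_converged(step, tolerance=1e-4, max_rounds=1000):
    """Call step() until the loss changes by less than `tolerance`.

    step() is assumed to perform one parameter update and return the
    loss function value for that round.
    """
    previous = None
    for round_no in range(1, max_rounds + 1):
        loss = step()
        if previous is not None and abs(previous - loss) < tolerance:
            return round_no, loss
        previous = loss
    return max_rounds, previous


H = cross_entropy([[0.5, 0.5], [0.9, 0.1]], [0, None])

# Stand-in for a real update step: replay a fixed sequence of loss values.
fake_losses = iter([1.0, 0.5, 0.49995])
rounds, final_loss = train_until_converged(lambda: next(fake_losses))
```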

The embodiment of the invention also has the following beneficial effects: according to the embodiment of the invention, the neural network model can be trained according to the actual threat type of the sample domain name, so that the predicted value of the neural network model is closer to the true value, and the determined threat type of the domain name to be identified is accurate through the threat type identification model obtained by training.

Optionally, there may be a hierarchical relationship between the threat types, as shown in fig. 4, where each node in fig. 4 represents a threat type, and a threat type of a parent type may include one or more threat types of subtypes. For example: the trojan horse in fig. 4 (threat type of parent type) includes a backdoor (threat type of subtype).

It will be appreciated that introducing a hierarchy between threat types may improve the accuracy of threat type identification. For example, when the sample domain names used to train the threat type identification model include only a small number of sample domain names for the threat type of a subtype, the identification of that subtype may be regularized by the threat type of its parent type. The parameters of threat types that have a hierarchical relationship tend to be similar.

Optionally, L = {l_i | i = 1, 2, …, K} may be used to represent the set of threat types, K being the number of threat types. To represent the hierarchical relationship between threat types, the following may be used:

K_i represents the number of threat types that l_i includes as subtypes.

W denotes the output-layer network parameters of the threat type identification model, which include, for each i-th threat type l_i, the network parameters of l_i at the output layer of the threat type identification model.

When there is a hierarchical relationship between threat types, the model parameters of the output layer of the neural network model may be regularized by equation (6):

where λ(W) is the regularization term computed from the model parameters of the output layer, K is the number of threat types, and l_i is the i-th threat type. The remaining symbols denote the model parameters of the i-th threat type l_i at the output layer, and the model parameters at the output layer of the j-th subtype included in the i-th threat type.

Optionally, the model parameters of the neural network model may be adjusted according to the loss function values by equation (7):

J = H + Cλ(W) (7)

wherein J is the adjusted loss function value, H is the loss function value, C is the preset penalty parameter, and λ(W) is the regularization term computed from the model parameters of the output layer.

Optionally, the preset penalty parameter may be set empirically.
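As an illustration of equation (7), the sketch below computes J from a loss value H, a penalty parameter C and a hierarchy penalty λ(W). The exact form of equation (6) is not reproduced in this text, so the penalty used here is an assumption: half the squared distance between the output-layer parameters of each parent type and those of each of its subtypes, which matches the stated goal that hierarchically related types have similar parameters. The function names, the parameter vectors and the "phishing" type are hypothetical; trojan and backdoor follow the example of fig. 4.

```python
def hierarchy_penalty(weights, children):
    """Assumed form of lambda(W): sum over parent/subtype pairs of
    0.5 * squared distance between their output-layer parameters."""
    total = 0.0
    for parent, subtypes in children.items():
        for child in subtypes:
            diffs = zip(weights[parent], weights[child])
            total += 0.5 * sum((p - c) ** 2 for p, c in diffs)
    return total

def regularized_loss(h, weights, children, c):
    # Equation (7): J = H + C * lambda(W)
    return h + c * hierarchy_penalty(weights, children)

# Hypothetical output-layer parameters, one vector per threat type.
weights = {
    "trojan": [1.0, 2.0],
    "backdoor": [1.0, 0.0],
    "phishing": [0.0, 0.0],
}
# Parent type -> subtypes, as in the trojan/backdoor example of fig. 4.
children = {"trojan": ["backdoor"]}

penalty = hierarchy_penalty(weights, children)
adjusted = regularized_loss(0.5, weights, children, c=0.25)
```

Under this assumed penalty, a larger C pulls the parameters of a subtype more strongly toward those of its parent type, which is useful when the subtype has few training samples.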

The embodiment of the invention also has the following beneficial effects: the embodiment of the invention can regularize the model parameters of the output layer of the neural network model, and relieve the overfitting problem of the neural network model.

Corresponding to the above method embodiment, referring to fig. 5, an embodiment of the present invention further provides a threat type identification apparatus, including: a construction module 501, an input module 502 and an acquisition module 503;

the construction module 501 is configured to construct a to-be-identified matrix, where the to-be-identified matrix includes an attribute matrix and a relationship matrix, the attribute matrix includes attribute information of a to-be-identified domain name and attribute information of a target domain name of a known threat type, the relationship matrix includes a similarity between the to-be-identified domain name and the target domain name, the similarity between the to-be-identified domain name and the target domain name is determined based on an association relationship between the to-be-identified domain name and the target domain name, the association relationship includes an indirect relationship in which the to-be-identified domain name and the target domain name have an association relationship with a same element or a same element map, an association relationship exists between elements included in the element map, and the elements include an internet protocol IP address, a;

an input module 502, configured to input the matrix to be identified, which is constructed by the construction module 501, into the threat type identification model;

the obtaining module 503 is configured to obtain probabilities that the domain name to be identified output by the threat type identification model belongs to each threat type, and use the threat type with the highest probability as the threat type of the domain name to be identified, where the threat type identification model is configured to determine the probability that the domain name to be identified belongs to each threat type according to the attribute information of the domain name to be identified, the attribute information of the target domain name, and the similarity between the domain name to be identified and the target domain name.
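The behaviour of the obtaining module can be sketched as follows. This is a minimal illustration under assumptions: the model is represented by any callable that takes the attribute matrix and the relation matrix and returns one probability per threat type; `obtain_threat_type` and `stub_model` are hypothetical names, and the probabilities and threat type labels are made up for the example.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Matrix = List[List[float]]

def obtain_threat_type(
    model: Callable[[Matrix, Matrix], Sequence[float]],
    attribute_matrix: Matrix,
    relation_matrix: Matrix,
    threat_types: Sequence[str],
) -> Tuple[str, Dict[str, float]]:
    # Ask the model for the probability of each threat type.
    probabilities = model(attribute_matrix, relation_matrix)
    by_type = dict(zip(threat_types, probabilities))
    # The threat type with the highest probability is taken as the result.
    best = max(by_type, key=by_type.get)
    return best, by_type

def stub_model(attribute_matrix: Matrix, relation_matrix: Matrix) -> List[float]:
    # Stand-in for a trained threat type identification model.
    return [0.2, 0.7, 0.1]

best, by_type = obtain_threat_type(
    stub_model, [[0.0]], [[1.0]], ["trojan", "phishing", "botnet"]
)
```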

Optionally, the apparatus may further comprise a training module, and the training module may be configured to:

inputting the sample matrix into a neural network model, and acquiring the probability that a plurality of sample domain names identified by the neural network model belong to each threat type;

calculating a loss function value according to the probability that a plurality of sample domain names identified by the neural network model belong to each threat type and the actual threat types of the plurality of sample domain names;

judging whether the neural network model converges according to the loss function value; if the neural network model is converged, obtaining a threat type identification model; and if the neural network model is not converged, adjusting the model parameters of the neural network model according to the loss function value, and carrying out next training.

Optionally, the similarity between each two sample domain names in the multiple sample domain names may be obtained by the following formula:

is the similarity between the ith domain name and the jth domain name.

Optionally, the neural network model may identify the probability that each sample domain name belongs to each threat type by the following formula:

wherein Z is the probability that the sample domain name belongs to each threat type, X is an attribute matrix, B is a relationship matrix,

Optionally, the loss function value may be calculated by the following formula:

Optionally, the element may also include a domain name;

the construction module 501 may also be configured to, before constructing the matrix to be identified, obtain the association relationships between nodes of different types from a first data source and construct a threat intelligence heterogeneous graph, where each node in the threat intelligence heterogeneous graph corresponds to one element, and the type to which each node belongs is the type of the element corresponding to the node;

the construction module 501 may be specifically configured to:

acquiring attribute information of a domain name to be identified and attribute information of the target domain name, and constructing an attribute matrix;

and constructing a relation matrix according to the similarity between the domain name to be identified and the target domain name.
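The two construction steps above can be sketched as follows. This is an illustration under assumptions, not the embodiment's procedure: attribute information is taken to be a numeric vector per domain name, pairwise similarities are taken as already computed, the similarity is treated as symmetric, a domain name's similarity to itself is set to 1.0, and pairs without a known similarity default to 0.0. The function name and the `.example` domain names are hypothetical.

```python
def build_matrices(attributes, similarity, unknown, targets):
    """Return (row order, attribute matrix, relation matrix).

    attributes: domain name -> list of numeric attribute values
    similarity: (domain a, domain b) -> similarity value
    unknown:    the domain name to be identified
    targets:    target domain names of known threat type
    """
    order = [unknown] + list(targets)
    # Attribute matrix: one row of attribute information per domain name.
    attribute_matrix = [list(attributes[d]) for d in order]
    # Relation matrix: pairwise similarity between the domain names.
    n = len(order)
    relation_matrix = [[0.0] * n for _ in range(n)]
    for i, a in enumerate(order):
        for j, b in enumerate(order):
            if i == j:
                relation_matrix[i][j] = 1.0  # assumption: self-similarity is 1
            else:
                # Assumption: symmetric lookup, 0.0 when no similarity is known.
                relation_matrix[i][j] = similarity.get(
                    (a, b), similarity.get((b, a), 0.0)
                )
    return order, attribute_matrix, relation_matrix

attributes = {
    "unknown.example": [3.0, 1.0],
    "bad-a.example": [2.0, 0.0],
    "bad-b.example": [5.0, 1.0],
}
similarity = {
    ("unknown.example", "bad-a.example"): 0.8,
    ("bad-b.example", "bad-a.example"): 0.3,
}
order, X, B = build_matrices(
    attributes, similarity, "unknown.example", ["bad-a.example", "bad-b.example"]
)
```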

Optionally, the construction module 501 may be specifically configured to:

An embodiment of the present invention further provides an electronic device, as shown in fig. 6, including a processor 601, a communication interface 602, a memory 603, and a communication bus 604, where the processor 601, the communication interface 602, and the memory 603 complete mutual communication through the communication bus 604,

a memory 603 for storing a computer program;

the processor 601 is configured to implement the steps executed by the electronic device in the above method embodiments when executing the program stored in the memory 603.

The communication bus mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.

The communication interface is used for communication between the electronic equipment and other equipment.

The memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk memory. Optionally, the memory may also be at least one memory device located remotely from the processor.

The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

In a further embodiment of the present invention, a computer-readable storage medium is also provided, in which a computer program is stored, which, when being executed by a processor, carries out the steps of any of the above-mentioned threat type identification methods.

In a further embodiment provided by the present invention, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform any of the threat type identification methods of the above embodiments.

The above embodiments may be implemented wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave) means. The computer-readable storage medium may be any available medium that a computer can access, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.

It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, as for the apparatus embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment.

The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims

1. A method for threat type identification, the method comprising:

inputting the matrix to be recognized into a threat type recognition model;

acquiring the probability that the domain name to be identified output by the threat type identification model belongs to each threat type, and taking the threat type with the highest probability as the threat type of the domain name to be identified, wherein the threat type identification model is used for determining the probability that the domain name to be identified belongs to each threat type according to the attribute information of the domain name to be identified, the attribute information of the target domain name and the similarity between the domain name to be identified and the target domain name;

the threat type recognition model is obtained by training through the following steps:

judging whether the neural network model converges according to the loss function value; if the neural network model is converged, obtaining the threat type identification model; if the neural network model is not converged, adjusting the model parameters of the neural network model according to the loss function value, and carrying out next training;

the element further includes a domain name; before the constructing a matrix to be identified, the method further comprises:

the constructing of the matrix to be identified comprises the following steps:

constructing the relation matrix according to the similarity between the domain name to be identified and the target domain name;

the threat intelligence heterogeneous map construction method comprises the following steps:

2. The method of claim 1, wherein the similarity between each two sample domain names in the plurality of sample domain names is obtained by the following formula:

3. The method of claim 1, wherein the neural network model identifies the probability that each sample domain name belongs to each threat type by the following formula:

4. The method of claim 1, wherein the loss function value is calculated by the formula:

5. A threat type identification apparatus, the apparatus comprising:

an obtaining module, configured to obtain probabilities that the domain name to be identified output by the threat type identification model belongs to each threat type, and use the threat type with the highest probability as the threat type of the domain name to be identified, where the threat type identification model is configured to determine, according to attribute information of the domain name to be identified, attribute information of the target domain name, and a similarity between the domain name to be identified and the target domain name, the probability that the domain name to be identified belongs to each threat type;

the apparatus further comprises a training module to:

the element further includes a domain name; the building module is further configured to, before the matrix to be identified is built, obtain association relationships between nodes of different types from a first data source, and build a threat intelligence heterogeneous graph, where each node in the threat intelligence heterogeneous graph corresponds to one element, and a type to which each node belongs is a type of the element corresponding to the node;

the building module is specifically configured to:

6. The apparatus of claim 5, wherein the similarity between each two of the plurality of sample domain names is obtained by the following formula: