CN111163057B - User identification system and method based on heterogeneous information network embedding algorithm - Google Patents


Info

Publication number
CN111163057B
Authority
CN
China
Prior art keywords
user
host
embedding
node
behavior
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911246787.9A
Other languages
Chinese (zh)
Other versions
CN111163057A (en)
Inventor
于爱民
李梦
蔡利君
马建刚
孟丹
于海波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS
Priority to CN201911246787.9A
Publication of CN111163057A
Application granted
Publication of CN111163057B
Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1425 Traffic logging, e.g. anomaly detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14 Network analysis or design
    • H04L41/147 Network analysis or design for predicting network behaviour

Abstract

The invention relates to a user identification system and method based on heterogeneous information network embedding algorithms. The system comprises a data processing module, a joint embedding module and an evaluation analysis module. Following a behavior-analysis approach, the method builds a normal behavior model from multi-source heterogeneous user behavior data. When behavior data for a new time period arrives, user identification is performed by comparing the similarity between the current behavior and the normal behavior model; when identification fails, a ranking of suspicious behaviors is produced using dot-product similarity. The method can be applied to detecting potential insider threats in an enterprise intranet. Combining two heterogeneous information network embedding algorithms yields a more comprehensive and accurate behavior model and can improve user identification accuracy by about 10%; in addition, the method provides event-level traceability clues for further analysis by security monitoring personnel.

Description

User identification system and method based on heterogeneous information network embedding algorithm
Technical Field
The invention relates to a user identification system and method based on a heterogeneous information network embedding algorithm, belongs to the technical field of information security detection, and is used in an enterprise intranet environment.
Background
Today the most devastating security threats come not from external attackers or malware but from trusted insiders. Members of an organization are granted access rights according to their responsibilities, and effective identity authentication is an important defense against insider attacks. However, authentication mechanisms such as account passwords and fingerprint recognition are effective only at login, and many security risks remain afterwards. Existing research therefore generally builds a normal behavior model of a user through behavior analysis, so that user identity can be monitored continuously and effectively after login. Because any form of insider attack exhibits some degree of behavioral deviation, the user's identity can be identified by comparing the current behavior with the historical normal model, and abnormal operations can then be discovered.
User identification based on behavior analysis falls into two categories: single-domain behavior analysis and multi-domain behavior analysis. Single-domain methods model normal behavior with a single type of behavior data, such as file behavior or mail behavior. Because they rely on a single data source, they struggle to depict a comprehensive normal behavior model; they usually adopt simple machine learning classifiers, and their user recognition rate is low. Multi-domain methods apply multi-source data fusion, combining several behavior types to build a comprehensive behavior model. However, whether they compare workflow similarity or extract multi-source behavior features through feature engineering, these methods do not consider the correlations among the different kinds of behavior data.
Against this background, the invention converts multi-source heterogeneous behavior data into a heterogeneous information network, which creates the conditions for analyzing behavior associations. Local features and global features are extracted with a local embedding algorithm and a global embedding algorithm respectively, so that a comprehensive behavior model can be built and the association information among behaviors can be captured. When the model misidentifies a user, a ranking of suspicious behaviors is further produced by similarity calculation, for security personnel to analyze and trace.
Disclosure of Invention
The problem solved by the invention: the invention provides a user identification system and method based on heterogeneous information network embedding algorithms, which can build a more comprehensive behavior model, greatly improve user identification accuracy, analyze suspicious situations, and provide an event-level ranking of suspicious operations.
The technical scheme of the invention is as follows: a user identification system based on heterogeneous information network embedding algorithms, characterized in that the heterogeneous information network embedding algorithms are a local embedding algorithm implemented with a neural network and a global embedding algorithm implemented with meta-paths; user identification means identifying potential operating users based on multi-source heterogeneous audit log data collected from each host in an enterprise intranet; and the system comprises a data processing module, a joint embedding module and an evaluation analysis module, wherein:
a data processing module, which has two functions. The first function is to extract standardized audit log data from a historical behavior database; this log data serves as the training set used to construct a heterogeneous information network G. The second function is to preprocess the raw multi-source heterogeneous audit log data newly collected from intranet hosts. Both the standardized audit log data in the historical behavior database and the newly collected raw audit log data comprise five types of multi-source heterogeneous audit log data: login log data, file log data, mail log data, HTTP log data and device log data, which record a user's login behaviors, file behaviors, mail behaviors, web behaviors and external device connection behaviors respectively. Preprocessing the raw data means standardizing each log record: a log parser extracts key information according to predefined fields, namely a subject, a device, an object and a timestamp. The subject is a user identifier; the device is a host identifier; the object depends on the log data type and identifies the specific behavior recorded by that type (in file log data, for example, the object is the combination of file path and file name); and the timestamp is the time at which the log record occurred. The parsed, newly collected log data serves as the test set. The heterogeneous information network G uses the information extracted from the predefined fields of the standardized log data as node identifiers, with the host identifier as the central node and the user identifiers and behavior identifiers as neighbor nodes of the host identifier; the constructed heterogeneous information network follows the network schema shown in FIG. 2;
a joint embedding module, which takes the heterogeneous information network G constructed in the data processing module as input and trains a model reflecting the operation pattern of each host. This model is called the user predictor; it performs user prediction on the test set and finally yields a ranking of potential operating users for the log data in the test set. Training the user predictor means learning the vector representations of the nodes in the heterogeneous information network G together with the parameters of the model. So that the learned node vectors preserve both the network structure and the similarity between nodes, the joint embedding module adopts two heterogeneous information network embedding algorithms, a local embedding algorithm and a global embedding algorithm. The local embedding algorithm learns the interaction between each host and its neighbor nodes and embeds normal behavior pattern information; the global embedding algorithm uses the semantics defined by meta-paths to embed the association information between nodes of different types. Finally, the two embedding algorithms are combined through a joint objective function for iterative training;
an evaluation analysis module, which evaluates the prediction result obtained in the joint embedding module and judges whether the real operating user of the host is consistent with that result. For the log data in the test set, the joint embedding module outputs a prediction result in the form of a sequence, in which the rank of each user reflects the probability that the behavior in the test set belongs to that user. If the real operating user corresponding to the behavior in the test set appears among the first K operating users of the predicted sequence, the identification is considered correct; otherwise the user behavior in the test set deviates from the normal behavior pattern in the training set, and this is called a suspicious situation. For a suspicious situation, similarity-based anomaly analysis produces a ranking of the suspicious behaviors that caused the identification error, so that a security analyst or other relevant staff can perform traceability verification according to the clues given by the system.
In the data processing module, the heterogeneous information network is constructed as follows: based on the standardized log data in the historical behavior database, the extracted host, user and behavior identifiers are used as nodes of the heterogeneous information network G, with the host identifier as the central node and the user identifiers and behavior identifiers as its neighbor nodes.
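The parsing and graph-construction steps described above can be sketched as follows. This is a minimal illustration only: it assumes raw logs arrive as dictionaries, and the raw keys (`user`, `pc`, `type`, `object`, `time`) as well as the helpers `normalize` and `build_hin` are hypothetical names that do not appear in the patent.

```python
from collections import defaultdict

# The five audit log types named in the text.
BEHAVIOR_TYPES = ("login", "file", "email", "http", "device")

def normalize(raw):
    """Map one raw log record to the predefined fields (raw keys are hypothetical)."""
    if raw["type"] not in BEHAVIOR_TYPES:
        raise ValueError(f"unknown log type: {raw['type']}")
    return {
        "subject": raw["user"],     # user identifier
        "device": raw["pc"],        # host identifier
        "type": raw["type"],        # one of the five log types
        "object": raw["object"],    # e.g. file path + file name for file logs
        "timestamp": raw["time"],   # time the record occurred
    }

def build_hin(records):
    """Build a host-centred adjacency: host -> users, host -> behavior nodes per type."""
    users = defaultdict(set)
    behaviors = defaultdict(lambda: defaultdict(set))
    for r in records:
        users[r["device"]].add(r["subject"])
        # A behavior node is identified by (type, object); it may attach to several hosts.
        behaviors[r["device"]][r["type"]].add((r["type"], r["object"]))
    return users, behaviors

raw_logs = [
    {"user": "u1", "pc": "pc1", "type": "file", "object": "C:/docs/a.txt", "time": "2019-01-01T09:00:00"},
    {"user": "u1", "pc": "pc1", "type": "email", "object": "bob@example.com", "time": "2019-01-01T09:05:00"},
    {"user": "u2", "pc": "pc2", "type": "email", "object": "bob@example.com", "time": "2019-01-01T09:07:00"},
]
records = [normalize(r) for r in raw_logs]
users, behaviors = build_hin(records)
```

In this toy input the mail entity `bob@example.com` becomes a neighbor of both `pc1` and `pc2`, which is how a single behavior node can link several hosts.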
In the joint embedding module, the local embedding algorithm is implemented with a neural network; the specific process is as follows:
(1) first, all nodes in the heterogeneous information network G are mapped to a latent space, that is, the vector representations of all nodes are randomly initialized to form an embedding vector table V;
(2) given a host p, its host vector $V_p$ is obtained by two-step aggregation. In the first step, a node type vector $v_p^{t}$ is computed for each type t of behavior-identifier neighbor node of host p, by averaging the vectors $v_n$ of all behavior-identifier neighbor nodes of that type:
$$v_p^{t} = \frac{1}{|N_p^{t}|} \sum_{n \in N_p^{t}} v_n$$
where $N_p^{t}$ denotes the set of behavior-identifier neighbor nodes of type t of host p;
in the second step, the node type vectors $v_p^{t}$ are combined by weighting to obtain the host vector $V_p$:
$$V_p = \sum_{t=1}^{5} w_t \, v_p^{t}$$
where $w_t$ is the weight of the type-t node type vector. There are five behavior-identifier neighbor node types in total, so t ranges from 1 to 5, denoting the login, file, mail, HTTP and device node types respectively;
(3) based on the host vector $V_p$, the dot-product similarity between the host and each user is computed and used to rank the potential operating users, where $v_u$ denotes a user vector:
$$f(p, u) = V_p \cdot v_u$$
(4) the embedding vector table V is updated by stochastic gradient descent (SGD), and the weight $w_t$ of each node type vector is learned, with a max-margin objective as the loss function:
max(0, f(p,u′) − f(p,u) + ε)
where u is the real operating user of host p, i.e. a positive sample, u′ is a negative sample, and ε is the margin. A loss penalty is incurred whenever f(p,u) exceeds f(p,u′) by less than ε.
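The forward pass and loss of the local embedding algorithm can be illustrated with a short sketch. It is not the patented implementation: the toy vectors, the unit weights, and the helper names `host_vector` and `margin_loss` are assumptions, and the SGD update itself is omitted.

```python
TYPES = ("login", "file", "email", "http", "device")

def mean(vectors):
    """Element-wise average of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def host_vector(neigh_by_type, V, w, dim):
    """Two-step aggregation: average the neighbors of each type, then weight and sum."""
    vp = [0.0] * dim
    for t in TYPES:
        nodes = neigh_by_type.get(t, [])
        if not nodes:  # assumption: a type with no neighbors contributes nothing
            continue
        type_vec = mean([V[n] for n in nodes])
        vp = [a + w[t] * b for a, b in zip(vp, type_vec)]
    return vp

def margin_loss(vp, v_pos, v_neg, eps=1.0):
    """max(0, f(p,u') - f(p,u) + eps), with f the dot product."""
    return max(0.0, dot(vp, v_neg) - dot(vp, v_pos) + eps)

# Toy embedding table: two file nodes, one mail node, two users.
V = {"f1": [1.0, 0.0], "f2": [0.0, 1.0], "m1": [1.0, 1.0],
     "alice": [1.0, 1.0], "bob": [-1.0, -1.0]}
w = {t: 1.0 for t in TYPES}
vp = host_vector({"file": ["f1", "f2"], "email": ["m1"]}, V, w, dim=2)
# Rank candidate users by descending dot-product similarity with the host vector.
ranking = sorted(["alice", "bob"], key=lambda u: dot(vp, V[u]), reverse=True)
```

With `alice` as the positive sample and `bob` as the negative sample the margin is already satisfied and the loss is zero; swapping the two roles yields a positive penalty.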
In the joint embedding module, the global embedding algorithm is implemented with meta-paths; the process is as follows:
(1) a meta-path defines high-order semantic associations among nodes of different types, that is, association information that cannot be captured by the edges of the original network. Given a meta-path set R, the meta-path-based global embedding algorithm first models the conditional neighbor distribution of nodes. In the heterogeneous information network G, various meta-paths start from a node i, so the neighbor distribution of a node depends on both the node i and the given meta-path r. The conditional neighbor distribution function is defined as follows:
[Equation image BDA0002307872930000041]
where $v_i$ and $v_j$ are the vector representations of nodes i and j, and DST(r) denotes the set of all possible nodes on the target side of meta-path r;
(2) the set DST(r) contains a very large number of nodes, so to reduce the computational burden a negative sampling strategy is used to obtain an approximate solution from the following formula, whose left-hand side denotes the approximation of the previous formula:
[Equation image BDA0002307872930000042]
Here j′ is a negative node drawn from a noise distribution predefined for meta-path r, k negative nodes are sampled for each node i, and the bias term $b_r$ adjusts for the different densities of the meta-paths;
(3) the embedding vector table V and the parameters $b_r$ are learned by stochastic gradient descent (SGD), with the goal of maximizing the likelihood function.
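The negative-sampling approximation in step (2) can be illustrated with a short sketch. It assumes the common skip-gram-style form, in which a node pair is scored by the dot product of its vectors plus the meta-path bias $b_r$ and passed through a sigmoid. That form, the uniform noise distribution, and the names `neg_sampling_log_likelihood` and `sample_negatives` are assumptions made for illustration and may differ from the formula in the patent figures.

```python
import math
import random

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def neg_sampling_log_likelihood(v_i, v_j, negatives, b_r):
    """Assumed objective: log s(v_i.v_j + b_r) + sum over negatives of log s(-(v_i.v_n + b_r))."""
    pos = log_sigmoid(dot(v_i, v_j) + b_r)
    neg = sum(log_sigmoid(-(dot(v_i, v_n) + b_r)) for v_n in negatives)
    return pos + neg

def sample_negatives(candidates, V, k, rng):
    """Draw k noise nodes; the uniform noise distribution is an assumption."""
    return [V[rng.choice(candidates)] for _ in range(k)]

# Toy vectors: u1 is close to pc1, while u2 and u3 point away from it.
V = {"pc1": [1.0, 0.5], "u1": [1.0, 0.5], "u2": [-1.0, -0.5], "u3": [-0.5, -1.0]}
rng = random.Random(0)
negs = sample_negatives(["u2", "u3"], V, k=3, rng=rng)
good = neg_sampling_log_likelihood(V["pc1"], V["u1"], negs, b_r=0.0)
bad = neg_sampling_log_likelihood(V["pc1"], V["u2"], negs, b_r=0.0)
```

Maximizing this quantity by SGD pushes the vectors of nodes linked by a meta-path together and pushes sampled noise nodes apart, without summing over all of DST(r).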
In the joint embedding module, the joint objective function effectively combines the local features captured by the local embedding algorithm with the global features captured by the global embedding algorithm, and is defined as follows:
[Equation image BDA0002307872930000045]
where ω ∈ [0, 1] is a predefined parameter that balances the importance of the two models being optimized, and a regularization term is added to prevent overfitting; $Z_{united}$ is the objective function of the joint embedding model, $Z_{global}$ is the objective function of the global embedding model, $Z_{local}$ is the objective function of the local embedding model, and λ is the regularization parameter;
the iterative training process using the joint objective function is as follows:
(1) one of the local embedding algorithm and the global embedding algorithm is sampled from a Bernoulli distribution with parameter ω;
(2) if the local embedding algorithm is sampled, the embedding vector table V is trained and the weight $w_t$ of each node type vector is learned according to the steps of the local embedding algorithm; likewise, if the global embedding algorithm is sampled, the embedding vector table V is trained and the parameters $b_r$ are learned according to the steps of the global embedding algorithm. The embedding vector table V is shared by both embedding algorithms;
(3) steps (1) and (2) are repeated until the model converges, yielding the user predictor.
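The alternating schedule in steps (1) to (3) can be sketched as below. The text does not state which algorithm is chosen when the Bernoulli draw succeeds, so treating ω as the probability of the local step is an assumption; the fixed iteration budget stands in for a convergence test, and `train_jointly` and the two toy step functions are hypothetical names.

```python
import random

def train_jointly(local_step, global_step, omega, iterations, rng):
    """Run `iterations` updates, picking the algorithm by a Bernoulli(omega) draw."""
    counts = {"local": 0, "global": 0}
    for _ in range(iterations):
        # Assumption: omega is the probability of choosing the local algorithm.
        if rng.random() < omega:
            local_step()
            counts["local"] += 1
        else:
            global_step()
            counts["global"] += 1
    return counts

# Toy stand-ins for the two update routines; both write to the same shared table V.
V = {"pc1": [0.0, 0.0]}

def local_step():
    V["pc1"][0] += 1.0  # placeholder for one SGD step on the max-margin loss

def global_step():
    V["pc1"][1] += 1.0  # placeholder for one SGD step on the meta-path likelihood

counts = train_jointly(local_step, global_step, omega=0.5,
                       iterations=100, rng=random.Random(42))
```

Because both step functions mutate the same `V`, each algorithm sees the updates made by the other, which is the sense in which the embedding table is shared.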
Suspicious situations are analyzed in the evaluation analysis module as follows:
for a suspicious situation, the dot product between each behavior-identifier neighbor node of the host and the node of the host's real operating user is computed in turn as the anomaly reference. The lower the dot-product score, the lower the similarity between the two entities and the higher the anomaly risk. Finally, the suspicious behaviors are ranked from high to low anomaly risk:
$$L_p = \operatorname{argsort}_{i \in E_p} \left( v_i \cdot u_p \right)$$
where $L_p$ is the resulting sequence of suspicious behaviors, sorted in ascending order of dot-product score, $E_p$ is the set of behavior-identifier neighbor nodes of host p, $v_i$ is the vector representation of node i, and $u_p$ is the vector representation of the real operating user of host p.
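As a small worked example of this ranking, the sketch below sorts a host's behavior nodes by ascending dot product with the real user's vector, so the least similar (most suspicious) behavior comes first. The node names and vectors are made up for illustration, and `rank_suspicious` is a hypothetical helper.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def rank_suspicious(behavior_nodes, V, user_vec):
    """Order behavior nodes from highest to lowest anomaly risk (lowest score first)."""
    return sorted(behavior_nodes, key=lambda n: dot(V[n], user_vec))

# Toy embeddings for three behavior nodes of one host.
V = {"file:a.txt": [1.0, 1.0], "http:x.example": [-1.0, 0.0], "email:y": [0.2, 0.1]}
u_p = [1.0, 1.0]  # vector of the host's real operating user
L_p = rank_suspicious(list(V), V, u_p)
```

Here the HTTP node scores -1.0, the mail node about 0.3 and the file node 2.0, so the HTTP access is listed as the most suspicious event.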
The meta-path set R is determined through a meta-path selection process, as follows:
(1) each meta-path is added on its own and the recognition accuracy it achieves is computed; the meta-paths are then ranked by this accuracy, which gives the effect of each meta-path on recognition when used alone;
(2) the meta-paths are added step by step in the resulting order, and, following the change in recognition accuracy, the combination that achieves the highest user identification accuracy is greedily selected as the optimal meta-path set R.
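One possible reading of this selection procedure is sketched below: rank the candidate meta-paths by their accuracy when used alone, add them in that order, and keep the prefix with the highest accuracy. The `evaluate` callback, which would train and score a model for a given set of meta-paths, is hypothetical; the toy version here uses invented integer gains purely so the example runs.

```python
def select_meta_paths(candidates, evaluate):
    """Rank paths by solo accuracy, add them in that order, keep the best prefix."""
    ranked = sorted(candidates, key=lambda r: evaluate([r]), reverse=True)
    best_set, best_acc = [], evaluate([])
    current = []
    for r in ranked:
        current = current + [r]
        acc = evaluate(current)
        if acc > best_acc:
            best_set, best_acc = list(current), acc
    return best_set, best_acc

# Toy evaluator: accuracy in percentage points, with invented per-path gains.
GAINS = {"r1": 5, "r2": 3, "r3": -2}

def toy_evaluate(paths):
    return 50 + sum(GAINS[r] for r in paths)

best_set, best_acc = select_meta_paths(["r3", "r1", "r2"], toy_evaluate)
```

In practice each call to the evaluator means retraining the joint model, so the number of candidate meta-paths directly determines the cost of this search.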
The invention discloses a user identification method based on a heterogeneous information network embedding algorithm, which comprises the following steps:
Step (1), data processing: audit log data of a given host in the intranet is collected over a time interval; the audit log types comprise login logs, file logs, mail logs, HTTP logs and device logs. A log parser parses each type of log in turn and extracts the predefined key fields, namely subject, object, device and timestamp. For a file log, for example, the extracted subject is the user account, the object is the combination of file path and file name, the device is the host number, and the timestamp is the access time of the log record. The parsed log data serves as the test set; in addition, the standardized log data within a time window of the historical behavior database serves as the training set used to construct the heterogeneous information network G;
Step (2), heterogeneous information network construction: the heterogeneous information network G is constructed from the training set, using the information extracted from the predefined fields of the standardized log data in the historical behavior database as node identifiers, with the host identifier as the central node and the user identifiers and behavior identifiers as neighbor nodes of the host. For each host p, all behavior-identifier neighbor nodes related to host p form a set $E_p$, and its real operating user is denoted $u_p$. Each individual behavior identifier can be associated with several hosts: if two hosts p and q both have log records involving a mail entity e, then e is a neighbor node of both p and q;
Step (3), joint embedding: once the heterogeneous information network G is obtained, the vector representation of each node is learned iteratively. The embedding vector table V is first initialized randomly, and then one of the local embedding algorithm and the global embedding algorithm is sampled according to the parameter ω of the joint objective function. If the local embedding algorithm is sampled, the embedding vector table V is trained and the weight $w_t$ of each node type vector is learned according to the steps of the local embedding algorithm; if the global embedding algorithm is sampled, the embedding vector table V is trained and the parameters $b_r$ are learned according to the steps of the global embedding algorithm. This iterative training process is repeated until the model converges; the resulting trained model is called the user predictor;
Step (4), user prediction: the user predictor comprises the trained node embedding vector table V and the respective parameters of the local and global embedding algorithms. It performs the user prediction task on the test set: given a host p to be predicted, it predicts which operating user the log data on host p belongs to. The prediction result is a sequence in which the rank of each user reflects the probability that the log data in the test set belongs to that user; the ranking is based on the dot-product similarity score between the host vector and the user vector;
Step (5), evaluation and analysis: for the prediction result obtained in step (4), if the real operating user corresponding to the behavior in the test set appears among the first K operating users of the predicted sequence, the identification is considered correct; otherwise the user behavior in the test set deviates from the normal behavior pattern in the training set, and this is called a suspicious situation.
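The top-K check in the evaluation step can be written in a few lines. The sketch below assumes predictions are stored as a mapping from host to a ranked list of users; the function name `evaluate_top_k` and the sample data are illustrative only.

```python
def evaluate_top_k(predictions, truth, k):
    """Return top-K accuracy and the hosts whose real user is missing from the top K."""
    suspicious = [host for host, ranking in predictions.items()
                  if truth[host] not in ranking[:k]]
    accuracy = 1 - len(suspicious) / len(predictions)
    return accuracy, suspicious

# Ranked user lists per host (most likely user first) and the real operating users.
predictions = {"pc1": ["u1", "u2", "u3"], "pc2": ["u3", "u1", "u2"]}
truth = {"pc1": "u1", "pc2": "u2"}
accuracy, suspicious = evaluate_top_k(predictions, truth, k=2)
```

With K = 2, host `pc1` is identified correctly, while `pc2` is flagged as a suspicious situation because its real user `u2` ranks third.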
Compared with the prior art, the invention has the advantages that:
(1) the key to defending against insider attacks lies in user authority management, and an effective approach to it is continuous monitoring of user identity based on behavior analysis. Traditional user identification methods do not make full use of multi-source heterogeneous behavior data and have difficulty modeling the complex associations among the data. The invention uses a heterogeneous information network to represent structured audit log data as a graph, thereby creating the conditions for analyzing data associations;
(2) the method combines two heterogeneous information network embedding algorithms to learn the vector representations of nodes automatically. This is an innovative first attempt to apply heterogeneous information network embedding in the security field, and it removes the traditional methods' reliance on manual expert knowledge for feature extraction. The two embedding algorithms focus on local behavior pattern features and global network association features respectively, so that user behavior patterns can be depicted comprehensively and user identification accuracy is greatly improved;
(3) for suspicious situations in which the prediction is wrong, the invention can also rank potentially abnormal operations according to the similarity between entities and provide event-level suspicious behavior clues, on the basis of which a security analyst can perform traceability verification;
(4) in summary, the invention provides a user identification system based on heterogeneous information network embedding algorithms whose core advantages are that it models comprehensive user behavior characteristics, improves user identification accuracy, and provides fine-grained anomaly analysis.
Drawings
FIG. 1 is a block diagram of an implementation of the system of the present invention;
FIG. 2 illustrates the network schema of the heterogeneous information network according to the present invention;
FIG. 3 is a framework of the local embedding algorithm of the present invention.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and examples.
The method mainly solves the problem of how to identify potential operating users based on multi-source heterogeneous host audit logs, and provides practically useful anomaly analysis for suspicious situations in which identification fails.
As shown in FIG. 1, the system of the present invention comprises a data processing module, a joint embedding module and an evaluation analysis module. The data processing module parses and processes the raw multi-source heterogeneous host audit log data, retains the predefined key fields, and constructs a heterogeneous information network from standardized historical log data. The joint embedding module uses two heterogeneous information network embedding algorithms, called the local embedding algorithm and the global embedding algorithm, to learn the operation pattern of each individual host and to capture global network associations; the two algorithms are combined through a joint objective function for iterative training, yielding the user predictor. The evaluation analysis module evaluates the user identification result and, for suspicious situations in which identification fails, produces a ranking of suspicious behaviors through similarity-based anomaly analysis.
The data processing module is specifically realized as follows:
(1) data processing: audit log data of a given host in the intranet is collected over a time interval; the audit log types comprise login logs, file logs, mail logs, HTTP logs and device logs. A log parser parses each type of log in turn and extracts the predefined key fields, namely subject, object, device and timestamp. For a file log, for example, the extracted subject is the user account, the object is the combination of file path and file name, the device is the host number, and the timestamp is the access time of the log record. The parsed log data serves as the test set. In addition, the standardized log data within a time window of the historical behavior database serves as the training set used to construct the heterogeneous information network G in the next step.
(2) heterogeneous information network construction: the heterogeneous information network G is constructed from the training set, using the information extracted from the predefined fields of the standardized log data in the historical behavior database as node identifiers, with the host identifier as the central node and the user identifiers and behavior identifiers as neighbor nodes of the host; the constructed network follows the network schema shown in FIG. 2. The schema involves seven node types, namely PC, user, login, file, mail, HTTP and device, where PC is a supernode connected to the other six node types; the edge types include "PC accesses file" and "PC sends mail", among others. For each host p, all behavior-identifier neighbor nodes related to host p form a set $E_p$, and its real operating user is denoted $u_p$. Each individual behavior identifier can be associated with several hosts: if two hosts p and q both have log records involving a mail entity e, then e is a neighbor node of both p and q;
the joint embedding module is specifically realized as follows:
Traditional methods usually extract high-dimensional features manually through feature engineering, which requires expert knowledge. The invention automatically extracts feature vectors that contain rich structural and semantic associations to represent users and entities, and predicts the operating users of the log data in the test set with the trained user predictor. The steps are as follows:
(1) initialization: the embedding vector table V, which holds the vector representations of all nodes in the constructed heterogeneous information network G, is first initialized randomly;
(2) iterative training: an iterative training process is then executed, in which one of the local embedding algorithm and the global embedding algorithm is sampled according to the parameter ω of the joint objective function. If the local embedding algorithm is sampled, training follows the steps of the local embedding model and updates the embedding vector table V and the weight $w_t$ of each node type vector; if the global embedding algorithm is sampled, training follows the steps of the global embedding model and updates the embedding vector table V and the parameters $b_r$. This process is repeated until the model converges; the resulting trained model is called the user predictor;
(3) user prediction: the test set is input into the trained user predictor, which predicts the potential operating users from the standardized log data in the test set. The prediction result is a sequence in which the rank of each user reflects the probability that the log data in the test set belongs to that user; the ranking is based on the dot-product similarity score between the host vector and the user vector;
according to fig. 3, the local embedding algorithm is performed as follows:
(1) given a host p, its host vector $V_p$ is obtained by two-step aggregation. In the first step, a node type vector $v_p^{t}$ is computed for each type t of behavior-identifier neighbor node of host p, by averaging the vectors $v_n$ of all behavior-identifier neighbor nodes of that type:
$$v_p^{t} = \frac{1}{|N_p^{t}|} \sum_{n \in N_p^{t}} v_n$$
where $N_p^{t}$ denotes the set of behavior-identifier neighbor nodes of type t of host p;
in the second step, the node type vectors $v_p^{t}$ are combined by weighting to obtain the host vector $V_p$:
$$V_p = \sum_{t=1}^{5} w_t \, v_p^{t}$$
where $w_t$ is the weight of the type-t node type vector. There are five behavior-identifier neighbor node types in total, so t ranges from 1 to 5, denoting the login, file, mail, HTTP and device node types respectively;
(2) based on the host vector $V_p$, the dot-product similarity between the host and each user is computed and used to rank the potential operating users, where $v_u$ denotes a user vector:
$$f(p, u) = V_p \cdot v_u$$
(3) updating an embedded vector table V by adopting random gradient descent (SGD), and learning the weight w of each type of node type vectortUsing a max-margin objective function as the loss function, the loss function is defined as:
max(0,f(p,u′)-f(p,u)+ε)
where u is the true operating user of the host p, i.e., a positive case sample, u 'is a negative case sample, ε is a boundary value, and if the difference between f (p, u) and f (p, u') is less than ε, a loss penalty is generated.
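The forward pass of the local embedding algorithm (two-step aggregation, dot-product scoring, and the max-margin loss) can be sketched with NumPy as below. The function names and the row-index representation of nodes are illustrative assumptions; treating a behavior type with no neighbors as a zero vector is also an assumption, since the text does not say how empty types are handled. Gradient updates are omitted.

```python
import numpy as np

def host_vector(type_neighbors, V, w):
    """Two-step aggregation of a host vector.

    type_neighbors: one list of row indices into V per behavior type
    (login, file, mail, HTTP, device).
    Assumption: a type with no neighbors contributes a zero vector.
    """
    type_vecs = []
    for idx in type_neighbors:
        if idx:
            type_vecs.append(V[idx].mean(axis=0))   # step 1: average per type
        else:
            type_vecs.append(np.zeros(V.shape[1]))
    return np.asarray(w) @ np.stack(type_vecs)      # step 2: weighted sum

def score(Vp, v_u):
    """Dot-product similarity f(p, u) between host and user vectors."""
    return float(Vp @ v_u)

def max_margin_loss(Vp, v_pos, v_neg, eps=1.0):
    """max(0, f(p, u') - f(p, u) + eps) for one positive/negative pair."""
    return max(0.0, score(Vp, v_neg) - score(Vp, v_pos) + eps)

def rank_users(Vp, V, user_ids):
    """Rank candidate users by descending dot-product similarity."""
    scores = V[user_ids] @ Vp
    order = np.argsort(-scores, kind="stable")
    return [user_ids[k] for k in order]
```

The loss is zero only when the real user outscores the negative sample by at least the margin ε, which is what pushes the real operating user toward the top of the ranking.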
The global embedding algorithm is executed as follows:
(1) A meta-path defines high-order semantic associations among nodes of different types, where high-order semantic associations are association information that cannot be captured by the edges of the original network. Given a meta-path set R, the meta-path-based global embedding algorithm first models the conditional neighbor distribution of the nodes. In the heterogeneous information network G, multiple meta-paths start from a node i, so the neighbor distribution of a node depends on both the node i and the given meta-path r. The conditional neighbor distribution is defined as:

$$P(j \mid i; r) = \frac{\exp(v_i \cdot v_j)}{\sum_{j' \in DST(r)} \exp(v_i \cdot v_{j'})}$$

where $v_i$ and $v_j$ are the vector representations of nodes i and j, and DST(r) denotes the set of all possible nodes on the target side of the meta-path r;

(2) The set DST(r) of all possible nodes on the target side of the meta-path r contains a very large number of nodes; to reduce the computational burden, a negative sampling strategy is used to obtain an approximate solution from the following formula, whose left-hand side denotes the approximation of the previous formula:

$$\log \hat{P}(j \mid i; r) \approx \log \sigma(v_i \cdot v_j + b_r) + \sum_{l=1}^{k} \mathbb{E}_{j' \sim P_n^r}\left[\log \sigma(-v_i \cdot v_{j'} - b_r)\right]$$

where $\hat{P}(j \mid i; r)$ is the approximate solution, σ is the sigmoid function, j′ is a negative node sampled from the noise distribution $P_n^r$ predefined for the meta-path r, k negative nodes are sampled for each node i, and the bias term $b_r$ adjusts for the differing densities of the meta-paths;

(3) The embedding vector table V and the parameter $b_r$ are learned by stochastic gradient descent (SGD), with the goal of maximizing the likelihood function.
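The two quantities used by the global embedding algorithm can be sketched as follows: the exact softmax over DST(r), and a negative-sampling approximation of its logarithm. The approximation below assumes the standard negative-sampling form with the bias $b_r$ added inside the sigmoid; that form, the function names, and the index-based node representation are assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conditional_neighbor_prob(i, j, dst, V):
    """Exact softmax P(j | i; r) over the target-side node list dst."""
    scores = V[dst] @ V[i]
    scores = scores - scores.max()          # shift for numerical stability
    p = np.exp(scores) / np.exp(scores).sum()
    return float(p[dst.index(j)])

def neg_sampling_log_likelihood(v_i, v_j, neg_vecs, b_r):
    """Approximate log P(j | i; r) using k sampled negative nodes.

    neg_vecs: array of shape (k, d) holding vectors of negative nodes j',
    assumed to be drawn from the meta-path's noise distribution.
    """
    pos = np.log(sigmoid(v_i @ v_j + b_r))
    neg = np.sum(np.log(sigmoid(-(neg_vecs @ v_i) - b_r)))
    return float(pos + neg)
```

The exact form costs one dot product per node in DST(r), while the sampled form costs only k + 1 dot products per training pair, which is the reduction in computation that the negative sampling strategy is meant to provide.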
The evaluation analysis module is implemented as follows:
(1) Evaluation: the prediction result obtained in the joint embedding module is evaluated to judge whether the real operating user of the host is consistent with the prediction. The joint embedding module yields the prediction result A that the model gives for the log data in the test set; the result is a ranked sequence in which the rank reflects the probability that the behavior in the test set belongs to a given user. If the real operating user corresponding to the behavior in the test set appears among the top K users of the predicted sequence, the identification is considered correct; otherwise the case is called a "suspicious situation".
(2) Analysis: the model attributes a "suspicious situation" to a deviation of the user behavior in the test set from the normal behavior pattern in the training set. For each suspicious situation, the evaluation analysis module calculates the dot product between each behavior identification neighbor node of the host and the real operating user of the host as an anomaly reference; the lower the dot-product score, the lower the similarity between the two entities and the higher the anomaly risk. Finally, the suspicious behaviors are sorted from high to low anomaly risk, that is, in ascending order of dot-product score:

$$L_p = \operatorname*{argsort}_{i \in E_p}\ (v_i \cdot u_p)$$

where $L_p$ is the resulting sequence of suspicious behaviors, $E_p$ is the set of behavior identification neighbor nodes of the host p, $v_i$ is the vector representation of node i, and $u_p$ is the vector representation of the real operating user of the host p.
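The top-K check and the similarity-based anomaly ranking can be sketched as below. The function names and the use of row indices to identify behavior nodes are illustrative assumptions.

```python
import numpy as np

def is_identified(ranked_users, true_user, K):
    """True if the real operating user is among the top K predictions."""
    return true_user in ranked_users[:K]

def rank_suspicious_behaviors(neighbor_ids, V, u_p):
    """Order a host's behavior nodes from highest to lowest anomaly risk.

    A lower dot product with the real user's vector u_p means lower
    similarity, hence higher risk, so nodes are sorted by ascending score.
    """
    scores = V[neighbor_ids] @ u_p
    order = np.argsort(scores, kind="stable")
    return [neighbor_ids[k] for k in order]
```

In this sketch the ranking would be computed only for hosts where `is_identified` returns False, and the head of the returned list gives the behaviors an analyst would inspect first.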

Claims (8)

1. A user identification system based on a heterogeneous information network embedding algorithm, characterized in that: the heterogeneous information network embedding algorithm comprises a local embedding algorithm implemented based on a neural network and a global embedding algorithm implemented based on meta-paths; the user identification identifies potential operating users based on multi-source heterogeneous audit log data collected by each host in an enterprise intranet; the system comprises a data processing module, a joint embedding module, and an evaluation analysis module, wherein:
the data processing module has two functions: the first function is to extract standardized audit log data from a historical behavior database, the log data being used as a training set for constructing a heterogeneous information network G; the second function is to preprocess the original multi-source heterogeneous audit log data newly collected from the intranet hosts; both the standardized audit log data in the historical behavior database and the newly collected original audit log data comprise five types of multi-source heterogeneous audit log data, namely login log data, file log data, mail log data, HTTP log data, and device log data, which respectively record the login behaviors, file behaviors, mail behaviors, web behaviors, and external device connection behaviors of a user; preprocessing the original multi-source heterogeneous audit log data means standardizing each log record, in which a log parser extracts key information based on predefined fields; the predefined fields comprise a subject, a device, an object, and a timestamp, where the subject is a user identifier, the device is a host identifier, and the object depends on the log data type and identifies the specific behavior within that log data type; in file-type log data, the object is the combination of a file path and a file name; the timestamp is the time at which the log record occurred; the parsed newly collected log data is used as a test set; the heterogeneous information network G takes the information extracted based on the predefined fields in the standardized log data as node identifiers, where the host identifier is the central node and the user identifiers and behavior identifiers are neighbor nodes of the host identifier;
the joint embedding module takes the heterogeneous information network G constructed in the data processing module as input and trains a model reflecting the operation pattern of each host, the model being called a user predictor; the user predictor performs user prediction on the test set and finally yields a ranking of the potential operating users corresponding to the log data in the test set; training the user predictor means learning the vector representations of the nodes in the heterogeneous information network G and the parameters of the model; so that the learned node vectors preserve the network structure information and the similarity information between nodes, the joint embedding module adopts two heterogeneous information network embedding algorithms, namely a local embedding algorithm and a global embedding algorithm, where the local embedding algorithm learns the interactions between each host and its neighbor nodes and embeds normal behavior pattern information, and the global embedding algorithm uses the semantics defined by meta-paths to embed the association information between nodes of different types; finally, the two embedding algorithms are combined through a joint objective function for iterative training;
the evaluation analysis module evaluates the prediction result obtained in the joint embedding module and judges whether the real operating user of the host is consistent with the prediction; the joint embedding module yields the prediction result A that the model gives for the log data in the test set, the result being a ranked sequence in which the rank reflects the probability that the behavior in the test set belongs to a given user; if the real operating user corresponding to the behavior in the test set appears among the top K users of the predicted sequence, the identification is considered correct; otherwise the user behavior in the test set deviates from the normal behavior pattern in the training set, and the case is called a suspicious situation; for such suspicious situations, a similarity-based anomaly analysis is performed, and the final result is a ranking of the suspicious behaviors that caused the error in the user identification result, so that a security analyst or other relevant staff can trace and verify them using the clues given by the system.
2. The heterogeneous information network embedding algorithm-based user identification system according to claim 1, wherein: in the data processing module, the process of constructing the heterogeneous information network comprises the following steps: based on the log data which is subjected to standardized processing in the historical database, the heterogeneous information network G is constructed by using the extracted host, user and behavior identifiers as nodes, wherein the host identifier is used as a central node, and the user identifier and the behavior identifier are used as neighbor nodes of the host identifier.
3. The heterogeneous information network embedding algorithm-based user identification system according to claim 1, wherein: in the joint embedding module, the local embedding algorithm is implemented based on a neural network, and the specific process is as follows:
(1) all nodes in the heterogeneous information network G are first mapped to a latent space, that is, the vector representations of all nodes are randomly initialized to form an embedding vector table V;
(2) given a host p, the host vector $V_p$ is obtained by a two-step aggregation; in the first step, the node type vector $V_p^{(t)}$ of each type of behavior identification neighbor node of the host p is calculated by averaging the vectors $v_n$ of all behavior identification neighbor nodes of that type:

$$V_p^{(t)} = \frac{1}{|E_p^{(t)}|} \sum_{n \in E_p^{(t)}} v_n$$

where $E_p^{(t)}$ denotes the set of behavior identification neighbor nodes of the t-th type contained in the host p;

in the second step, the node type vectors $V_p^{(t)}$ are combined by weighting to obtain the host vector $V_p$:

$$V_p = \sum_{t=1}^{5} w_t V_p^{(t)}$$

where $w_t$ denotes the weight of the node type vector of type t; there are 5 behavior identification neighbor node types in total, so t takes values from 1 to 5, representing the login node type, file node type, mail node type, HTTP node type, and device node type, respectively;

(3) based on the host vector $V_p$, the dot-product similarity between the host and each user is calculated and the potential operating users are ranked, where $v_u$ denotes a user vector:

$$f(p, u) = V_p \cdot v_u$$

(4) the embedding vector table V is updated by stochastic gradient descent (SGD), and the weight $w_t$ of each node type vector is learned, using a max-margin objective as the loss function, defined as:

$$\max(0,\ f(p, u') - f(p, u) + \varepsilon)$$

where u is the real operating user of the host p (the positive sample), u′ is a negative sample, and ε is the margin; a loss is incurred if f(p, u) exceeds f(p, u′) by less than ε.
4. The heterogeneous information network embedding algorithm-based user identification system according to claim 1, wherein: in the joint embedding module, the global embedding algorithm is implemented based on meta-paths, and the implementation process is as follows:
(1) a meta-path defines high-order semantic associations among nodes of different types, where high-order semantic associations are association information that cannot be captured by the edges of the original network; given a meta-path set R, the meta-path-based global embedding algorithm first models the conditional neighbor distribution of the nodes; in the heterogeneous information network G, multiple meta-paths start from a node i, so the neighbor distribution of a node depends on both the node i and the given meta-path r, and the conditional neighbor distribution is defined as:

$$P(j \mid i; r) = \frac{\exp(v_i \cdot v_j)}{\sum_{j' \in DST(r)} \exp(v_i \cdot v_{j'})}$$

where $v_i$ and $v_j$ are the vector representations of nodes i and j, and DST(r) denotes the set of all possible nodes on the target side of the meta-path r;

(2) the set DST(r) of all possible nodes on the target side of the meta-path r contains a very large number of nodes; to reduce the computational burden, a negative sampling strategy is used to obtain an approximate solution from the following formula, whose left-hand side denotes the approximation of the previous formula:

$$\log \hat{P}(j \mid i; r) \approx \log \sigma(v_i \cdot v_j + b_r) + \sum_{l=1}^{k} \mathbb{E}_{j' \sim P_n^r}\left[\log \sigma(-v_i \cdot v_{j'} - b_r)\right]$$

where $\hat{P}(j \mid i; r)$ is the approximate solution, σ is the sigmoid function, j′ is a negative node sampled from the noise distribution $P_n^r$ predefined for the meta-path r, k negative nodes are sampled for each node i, and the bias term $b_r$ is a bias term of the neural network training that adjusts for the differing densities of the meta-paths;

(3) the embedding vector table V and the bias term $b_r$ are learned by stochastic gradient descent (SGD), with the goal of maximizing the likelihood function.
5. The heterogeneous information network embedding algorithm-based user identification system according to claim 1, wherein: in the specific implementation of the joint embedding module, the joint objective function is to effectively combine the local features captured by the local embedding algorithm and the global features captured by the global embedding algorithm, and is defined as follows:
Figure FDA0002839917240000035
where ω ∈ [0, 1] is a predefined parameter that balances the importance of the two models during optimization, and a regularization term is added to prevent overfitting; $Z_{united}$ denotes the objective function of the joint embedding model, $Z_{global}$ the objective function of the global embedding model, $Z_{local}$ the objective function of the local embedding model, and λ is the regularization parameter;
the iterative training process using the joint objective function is as follows:
(1) one of the local embedding algorithm and the global embedding algorithm is sampled from a Bernoulli distribution with parameter ω;
(2) if the local embedding algorithm is sampled, the embedding vector table V is trained and the weight $w_t$ of each node type vector is learned according to the steps of the local embedding algorithm; likewise, if the global embedding algorithm is sampled, the embedding vector table V is trained and the parameter $b_r$ is learned according to the steps of the global embedding algorithm, where the parameter $b_r$ adjusts for the differing densities of the meta-paths;
the embedding vector table V is shared by the two embedding algorithms;
(3) steps (1) and (2) are repeated until the model converges, yielding the user predictor.
6. The heterogeneous information network embedding algorithm-based user identification system according to claim 1, wherein: the analysis process for the suspicious situation in the evaluation analysis module is as follows:
in the evaluation analysis module, for each suspicious situation, the dot product between each behavior identification neighbor node of the host and the real operating user node of the host is calculated as an anomaly reference; the lower the dot-product score, the lower the similarity between the two entities and the higher the anomaly risk; finally, the suspicious behaviors are sorted from high to low anomaly risk, that is, in ascending order of dot-product score:

$$L_p = \operatorname*{argsort}_{i \in E_p}\ (v_i \cdot u_p)$$

where $L_p$ is the resulting sequence of suspicious behaviors, $E_p$ is the set of behavior identification neighbor nodes of the host p, $v_i$ is the vector representation of node i, and $u_p$ is the vector representation of the real operating user of the host p.
7. The heterogeneous information network embedding algorithm-based user identification system of claim 4, wherein: the determination of the meta-path set R needs to pass through a meta-path selection process, which is specifically as follows:
(1) the identification accuracy achieved when each meta-path is added on its own is calculated, and the meta-paths are sorted accordingly, giving the influence of each meta-path on the identification result when used individually;
(2) the meta-paths are then added step by step in the obtained order, and, according to the change in identification accuracy, the combination that yields the highest user identification accuracy is greedily selected as the optimal meta-path set R.
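One possible reading of this selection process is sketched below. `evaluate` is a hypothetical callable that trains with a given list of meta-paths and returns the identification accuracy. Keeping a meta-path only when it strictly improves the accuracy is an assumption, since the claim does not fix the exact acceptance rule.

```python
def select_meta_paths(candidates, evaluate):
    """Greedy meta-path selection.

    candidates: list of meta-path labels.
    evaluate: hypothetical callable, evaluate(paths) -> accuracy.
    Assumption: a path is kept only if adding it raises the accuracy.
    """
    # Step (1): rank meta-paths by the accuracy each achieves alone.
    ranked = sorted(candidates, key=lambda r: evaluate([r]), reverse=True)

    # Step (2): add them in that order, keeping only improvements.
    selected, best = [], float("-inf")
    for r in ranked:
        acc = evaluate(selected + [r])
        if acc > best:
            selected, best = selected + [r], acc
    return selected, best
```

Each call to `evaluate` implies a full training run, so this sketch performs on the order of 2·|candidates| training runs rather than searching all subsets.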
8. A user identification method based on a heterogeneous information network embedding algorithm is characterized by comprising the following steps:
step (1), data processing: audit log data of a host in the intranet is collected over a time interval, the audit log types comprising login logs, file logs, mail logs, HTTP logs, and device logs; each type of log is parsed by a log parser to extract predefined key fields, the key fields comprising a subject, an object, a device, and a timestamp; for a file log, the extracted subject is the user account, the object is the combination of a file path and a file name, the device is the host number, and the timestamp is the access time of the log record; the parsed log data is used as a test set; in addition, the standardized log data within a time window in a historical behavior database is used as a training set for constructing a heterogeneous information network G;
step (2), heterogeneous information network construction: the heterogeneous information network G is constructed from the training set, taking the information extracted based on the predefined fields in the standardized log data of the historical behavior database as node identifiers, where the host identifier is the central node and the user identifiers and behavior identifiers are neighbor nodes of the host; for each host p, all behavior identification neighbor nodes related to the host p form a set $E_p$, and its real operating user is denoted $u_p$; each behavior identifier can be associated with multiple hosts: if two hosts p and q both have log records involving a mail entity e, the mail entity e is a neighbor node of both hosts p and q;
step (3), joint embedding: after the heterogeneous information network G is obtained, the vector representation of each node is learned iteratively; an embedding vector table V is first randomly initialized, and then one of the local embedding algorithm and the global embedding algorithm is sampled based on the parameter ω in the joint objective function; if the local embedding algorithm is sampled, the embedding vector table V is trained and the weight $w_t$ of each node type vector is learned according to the steps of the local embedding algorithm; if the global embedding algorithm is sampled, the embedding vector table V is trained and the parameter $b_r$ is learned according to the steps of the global embedding algorithm; the iterative training process is repeated until the model converges, and the resulting trained model is called the user predictor;
step (4), user prediction: the user predictor comprises the trained node embedding vector table V and the respective parameters of the local embedding algorithm and the global embedding algorithm; a user prediction task is then performed on the test set, that is, given a host p to be predicted, the operating user to whom the log data on the host p belongs is predicted; the prediction result is a ranked sequence in which the rank reflects the probability that the log data in the test set belongs to a given user, and the ranking is based on the dot-product similarity score between the host vector and the user vector;
step (5), evaluation and analysis: for the prediction result obtained in step (4), if the real operating user corresponding to the behavior in the test set appears among the top K users of the predicted sequence, the identification is considered correct; otherwise the user behavior in the test set deviates from the normal behavior pattern in the training set, and the case is called a suspicious situation.
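Steps (1) and (2) above, turning standardized log records into a host-centered heterogeneous network, can be sketched as follows. The record keys `subject`, `device`, `object`, and `timestamp` follow the predefined fields named in the claim; the `log_type` key, the tuple representation of behavior nodes, and the function name are assumptions made for illustration.

```python
from collections import defaultdict

def build_network(records):
    """Group standardized log records around their host (central node).

    records: iterable of dicts with keys subject, device, object,
    timestamp, plus an assumed log_type key ("login", "file", "mail",
    "http", "device").
    Returns (behaviors, users):
      behaviors[host] -> set of (log_type, object) behavior nodes (E_p)
      users[host]     -> set of user identifiers seen on that host
    """
    behaviors = defaultdict(set)
    users = defaultdict(set)
    for rec in records:
        host = rec["device"]
        users[host].add(rec["subject"])
        behaviors[host].add((rec["log_type"], rec["object"]))
    return behaviors, users
```

Because a behavior node is keyed by its type and object rather than by host, the same mail entity e appears in the neighbor sets of both hosts p and q when both have log records involving it.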
CN201911246787.9A 2019-12-09 2019-12-09 User identification system and method based on heterogeneous information network embedding algorithm Active CN111163057B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911246787.9A CN111163057B (en) 2019-12-09 2019-12-09 User identification system and method based on heterogeneous information network embedding algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911246787.9A CN111163057B (en) 2019-12-09 2019-12-09 User identification system and method based on heterogeneous information network embedding algorithm

Publications (2)

Publication Number Publication Date
CN111163057A CN111163057A (en) 2020-05-15
CN111163057B true CN111163057B (en) 2021-04-02

Family

ID=70555734

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911246787.9A Active CN111163057B (en) 2019-12-09 2019-12-09 User identification system and method based on heterogeneous information network embedding algorithm

Country Status (1)

Country Link
CN (1) CN111163057B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant