CN111163057B - User identification system and method based on heterogeneous information network embedding algorithm

Info

Publication number: CN111163057B
Application number: CN201911246787.9A
Authority: CN
Inventors: 于爱民; 李梦; 蔡利君; 马建刚; 孟丹; 于海波
Original assignee: Institute of Information Engineering of CAS
Current assignee: Institute of Information Engineering of CAS
Priority date: 2019-12-09
Filing date: 2019-12-09
Publication date: 2021-04-02
Anticipated expiration: 2039-12-09
Also published as: CN111163057A

Abstract

The invention relates to a user identification system and method based on heterogeneous information network embedding algorithms. The system comprises a data processing module, a joint embedding module and an evaluation analysis module. Following the idea of behavior analysis, the method constructs a normal behavior model from multi-source heterogeneous user behavior data; when behavior data for a new time period arrives, user identification is performed by comparing the similarity between the current behavior and the normal behavior model, and, where identification fails, a ranking of suspicious behaviors is given based on dot product similarity. The method can be applied to detecting potential insider threats in an enterprise intranet. By combining two heterogeneous information network embedding algorithms, a more comprehensive and accurate behavior model is obtained and user identification accuracy can be improved by about 10%; in addition, event-level traceability clues are provided for further analysis by security monitoring personnel.

Description

User identification system and method based on heterogeneous information network embedding algorithm

Technical Field

The invention relates to a user identification system and method based on a heterogeneous information network embedding algorithm, belongs to the technical field of information security detection, and is used in an enterprise intranet environment.

Background

Today the most devastating security threats come not from malicious outsiders or malware but from trusted insiders. Members of an organization are granted access rights according to their responsibilities, and effective identity authentication is an important means of defending against insider attacks. However, existing identity authentication mechanisms, mainly account passwords, fingerprint recognition and the like, are effective only at login, so many potential security risks remain. Existing research generally builds a normal behavior model of a user based on behavior analysis in order to monitor user identity continuously and effectively after login. Because any form of insider attack exhibits some degree of behavioral deviation, the identity of the user can be identified by comparing the similarity between the current behavior and the historical normal model, and abnormal operations can thereby be discovered.

User identification based on behavior analysis can be divided into two categories: single-domain behavior analysis and multi-domain behavior analysis. User identification based on single-domain behavior analysis models normal behavior with a single type of behavior data, such as file behavior or mail behavior. Its drawbacks are that the data source is limited, so a comprehensive normal behavior model is difficult to build, and that a simple machine learning classifier is usually adopted, so the user identification rate is not high. User identification based on multi-domain behavior analysis applies the idea of multi-source data fusion and tries to combine multiple behavior types into a comprehensive behavior model. However, whether these methods compare workflow similarity or use feature engineering to extract multi-source behavior features, they do not consider the correlation among the different kinds of behavior data.

Against this background, the invention converts multi-source heterogeneous behavior data into a heterogeneous information network, which creates the conditions for analyzing behavior associations. Local features and global features are extracted by a local embedding algorithm and a global embedding algorithm respectively, so that a comprehensive behavior model can be constructed and the association information among behaviors can be captured. Where the model's identification is wrong, a ranking of suspicious behaviors can further be given based on similarity calculation for security personnel to analyze and trace.

Disclosure of Invention

Problem solved by the invention: the invention provides a user identification system and method based on heterogeneous information network embedding algorithms, which can construct a more comprehensive behavior model, substantially improve user identification accuracy, analyze suspicious situations, and provide an event-level ranking of suspicious operations.

The technical scheme of the invention is as follows: a user identification system based on heterogeneous information network embedding algorithm is characterized in that: the heterogeneous information network embedding algorithm is a local embedding algorithm realized based on a neural network and a global embedding algorithm realized based on a meta-path, the user identification is to identify potential operating users based on multi-source heterogeneous audit log data collected by each host in an enterprise intranet, the system comprises a data processing module, a combined embedding module and an evaluation analysis module, wherein:

a data processing module, which has two functions: the first function is to extract standardized audit log data from a historical behavior database, this log data being used as the training set for constructing a heterogeneous information network G; the second function is to preprocess the original multi-source heterogeneous audit log data newly collected from the intranet hosts; both the standardized audit log data in the historical behavior database and the newly collected original audit log data comprise five types of multi-source heterogeneous audit log data, namely login log data, file log data, mail log data, HTTP log data and equipment log data, which respectively record the login behavior, file behavior, mail behavior, web behavior and external equipment connection behavior of a user; preprocessing the original multi-source heterogeneous audit log data means standardizing each log record, using a log analyzer to extract key information based on predefined fields; the predefined fields comprise a subject, equipment, an object and a timestamp, wherein the subject is a user identifier, the equipment is a host identifier, and the object depends on the log data type and identifies the specific behavior recorded by that type; in log data of the file type, the object is the combination of a file path and a file name; the timestamp is the time at which the log record occurred; the parsed newly collected log data is used as the test set; the heterogeneous information network G takes the information extracted based on the predefined fields in the standardized log data as node identifiers, wherein the host identifier is the central node and the user identifier and behavior identifiers are neighbor nodes of the host identifier, and the constructed heterogeneous information network follows the network schema shown in FIG. 2;

a joint embedding module: training a model reflecting the operation mode of each host by taking a heterogeneous information network G constructed in a data processing module as input, wherein the model is called a user predictor, and the user predictor is used for executing user prediction on the test set to finally obtain potential operation user sequencing corresponding to log data in the test set; the process of training the user predictor refers to learning vector representations of nodes in the heterogeneous information network G and parameters of a model. In order to enable the learned node vector to keep network structure information and similarity information between nodes, the joint embedding module adopts two heterogeneous information network embedding algorithms, namely a local embedding algorithm and a global embedding algorithm, wherein the local embedding algorithm is used for learning the interaction between each host and a neighbor node thereof and embedding normal behavior pattern information; the global embedding algorithm utilizes the semantics defined by the meta path to embed the associated information between the nodes of different types; finally, combining the two embedding algorithms through a combined objective function for iterative training;

an evaluation analysis module, which is used for evaluating the prediction result obtained in the joint embedding module and judging whether the real operating user of the host is consistent with the prediction result; the prediction result A given by the model for the log data in the test set is obtained from the joint embedding module; the result is a sequence in which the rank reflects, in descending order, the probability that the behavior in the test set belongs to a given user; if the real operating user corresponding to the behavior in the test set appears among the first K operating users in the prediction sequence, the identification is considered correct; otherwise, the behavior of the user in the test set deviates from the normal behavior pattern in the training set, and this is called a suspicious situation; for such suspicious situations, a similarity-based anomaly analysis is performed, whose final result is a ranking of the suspicious behaviors that caused the error in the user identification result, so that a security analyst or related staff can perform traceability verification according to the clues given by the system.

In the data processing module, the process of constructing the heterogeneous information network comprises the following steps: based on the log data which is subjected to standardized processing in the historical database, the heterogeneous information network G is constructed by using the extracted host, user and behavior identifiers as nodes, wherein the host identifier is used as a central node, and the user identifier and the behavior identifier are used as neighbor nodes of the host identifier.
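As a minimal sketch of this construction (not the patented implementation), the following Python snippet builds the host-centered neighbor sets from standardized records. The record layout, the sample values and the helper name `build_hin` are assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical standardized records: (subject, device, object, log_type, timestamp).
records = [
    ("user_a", "pc_1", "/docs/report.doc", "file", "2019-12-01T09:00:00"),
    ("user_a", "pc_1", "alice@example.com", "mail", "2019-12-01T09:05:00"),
    ("user_b", "pc_2", "alice@example.com", "mail", "2019-12-01T09:07:00"),
]


def build_hin(records):
    """Return host -> user set and host -> {behavior type -> object set}."""
    host_users = defaultdict(set)
    host_behaviors = defaultdict(lambda: defaultdict(set))
    for subject, device, obj, log_type, _timestamp in records:
        # The host is the central node; users and behavior objects are its neighbors.
        host_users[device].add(subject)
        host_behaviors[device][log_type].add(obj)
    return host_users, host_behaviors


host_users, host_behaviors = build_hin(records)

# A single behavior identifier (here a mail entity) can be a neighbor of several hosts.
shared = [host for host, by_type in host_behaviors.items()
          if "alice@example.com" in by_type["mail"]]
```

In this toy data the mail entity is attached to both hosts, which is the kind of shared neighbor that later lets meta-paths relate different hosts.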

In the combined embedding module, a local embedding algorithm is realized based on a neural network, and the specific process is as follows:

(1) firstly, mapping all nodes in the heterogeneous information network G to a latent space, namely randomly initializing the vector representations of all nodes to form an embedding vector table V;

(2) given a host p, obtaining a host vector $V_p$ by two-step aggregation. In the first step, a node type vector $V_p^{(t)}$ is calculated for each type of behavior identifier neighbor node of the host p by averaging the vectors $v_n$ of all behavior identifier neighbor nodes of that type:

$$V_p^{(t)} = \frac{1}{|E_p^{(t)}|} \sum_{n \in E_p^{(t)}} v_n$$

wherein $E_p^{(t)}$ represents the set of t-th type behavior identifier neighbor nodes of the host p;

in the second step, the node type vectors $V_p^{(t)}$ are combined by weighting to obtain the host vector $V_p$:

$$V_p = \sum_{t=1}^{5} w_t V_p^{(t)}$$

wherein $w_t$ represents the weight of the t-th node type vector; there are 5 behavior identifier neighbor node types in total, so t takes the values 1 to 5, which respectively represent the login node type, the file node type, the mail node type, the HTTP node type and the equipment node type;

(3) based on the host vector $V_p$, calculating the dot product similarity between the host and each user and ranking the potential operating users:

$$f(p, u) = v_u^{\top} V_p$$

wherein $v_u$ represents a user vector;

(4) updating the embedding vector table V by stochastic gradient descent (SGD) and learning the weight $w_t$ of each node type vector, using a max-margin objective as the loss function, which is defined as:

$$\max(0,\; f(p, u') - f(p, u) + \varepsilon)$$

where u is the true operating user of the host p, i.e., a positive sample, u' is a negative sample, and ε is the margin; a loss penalty is generated if f(p, u) exceeds f(p, u') by less than ε.
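The two-step aggregation, the dot product score f(p, u) and the max-margin loss can be illustrated with a small pure-Python sketch. The two-dimensional vectors, the node names and the weights below are made-up values, and the gradient update itself is omitted; this is an illustration of the formulas under those assumptions rather than the invention's actual code.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def host_vector(type_neighbors, embeddings, weights, dim):
    """Two-step aggregation: average per node type, then weighted sum over types."""
    v_p = [0.0] * dim
    for node_type, nodes in type_neighbors.items():
        if not nodes:
            continue  # a host may have no neighbors of some type
        type_vec = [sum(embeddings[n][d] for n in nodes) / len(nodes)
                    for d in range(dim)]
        v_p = [v_p[d] + weights[node_type] * type_vec[d] for d in range(dim)]
    return v_p


def margin_loss(score_pos, score_neg, eps):
    """max(0, f(p, u') - f(p, u) + eps)."""
    return max(0.0, score_neg - score_pos + eps)


# Hypothetical embedding table and weights (illustrative values only).
embeddings = {
    "file_1": [1.0, 0.0], "file_2": [0.0, 1.0], "mail_1": [1.0, 1.0],
    "user_a": [1.0, 1.0], "user_b": [-1.0, 0.0],
}
type_neighbors = {"file": ["file_1", "file_2"], "mail": ["mail_1"]}
weights = {"file": 0.5, "mail": 0.5}

v_p = host_vector(type_neighbors, embeddings, weights, dim=2)
score_a = dot(v_p, embeddings["user_a"])  # score of the true user
score_b = dot(v_p, embeddings["user_b"])  # score of a negative sample
```

With these values the true user already outscores the negative sample by more than the margin, so `margin_loss(score_a, score_b, 1.0)` is zero and no update would be needed for this pair.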

In the specific implementation of the joint embedding module, the global embedding algorithm is implemented based on the meta path, and the implementation process is as follows:

(1) the meta-path defines high-order semantic associations among different types of nodes, where a high-order semantic association is association information that cannot be captured by the edges of the original network; given a meta-path set R, the meta-path-based global embedding algorithm first models the conditional neighbor distribution of nodes; in the heterogeneous information network G there are various meta-paths starting from a node i, so the neighbor distribution of a node depends on both the node i and the given meta-path r, and the conditional neighbor distribution function is defined as follows:

$$P(j \mid i; r) = \frac{\exp(v_i^{\top} v_j)}{\sum_{j' \in \mathrm{DST}(r)} \exp(v_i^{\top} v_{j'})}$$

wherein $v_i$ and $v_j$ are the vector representations of nodes i and j, and DST(r) is the set of all possible nodes on the target side of the meta-path r;

(2) the number of nodes contained in DST(r) is huge, so in order to reduce the computational burden a negative sampling strategy is used to obtain an approximate solution from the following formula, whose left side denotes the approximation of the previous formula:

$$\log \hat{P}(j \mid i; r) \approx \log \sigma(v_i^{\top} v_j + b_r) + \sum_{l=1}^{k} \mathbb{E}_{j' \sim P_n^{r}} \left[ \log \sigma(-v_i^{\top} v_{j'} - b_r) \right]$$

wherein σ is the sigmoid function, j' is a negative node sampled from the noise distribution $P_n^{r}$ predefined for the meta-path r, each node i samples k negative nodes, and the bias term $b_r$ adjusts the density of the different meta-paths;

(3) learning the embedding vector table V and the parameter $b_r$ by stochastic gradient descent (SGD), the goal being to maximize the likelihood function.
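To make the negative sampling approximation concrete, the sketch below evaluates the approximate log-likelihood for one node pair on a meta-path, with the expectation replaced by k already-sampled negative nodes. The function name `neg_sampling_objective`, the sigmoid form and the sample vectors are assumptions for illustration; how negatives are drawn from the noise distribution is left out.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def neg_sampling_objective(v_i, v_j, negative_vectors, b_r):
    """Approximate log P(j | i; r) using the given negative samples.

    negative_vectors holds the k vectors of nodes assumed to have been
    drawn from the noise distribution of meta-path r.
    """
    positive_term = math.log(sigmoid(dot(v_i, v_j) + b_r))
    negative_term = sum(math.log(sigmoid(-dot(v_i, v_n) - b_r))
                        for v_n in negative_vectors)
    return positive_term + negative_term


v_i = [1.0, 0.0]
negatives = [[-1.0, 0.0], [0.0, 1.0]]  # k = 2 hypothetical negative nodes

aligned = neg_sampling_objective(v_i, [1.0, 0.0], negatives, b_r=0.0)
opposed = neg_sampling_objective(v_i, [-1.0, 0.0], negatives, b_r=0.0)
```

A target node whose vector points in the same direction as `v_i` yields a higher objective than one pointing the opposite way, which is the quantity SGD would push upward during training.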

In the specific implementation of the joint embedding module, the joint objective function is to effectively combine the local features captured by the local embedding algorithm and the global features captured by the global embedding algorithm, and is defined as follows:

wherein ω ∈ [0, 1] is a predefined parameter that balances the importance of the two models being optimized, and a regularization term is added to prevent overfitting; $Z_{united}$ represents the objective function of the joint embedding model, $Z_{global}$ represents the objective function of the global embedding model, $Z_{local}$ represents the objective function of the local embedding model, and λ is the regularization parameter;

the iterative training process using the joint objective function is as follows:

(1) sampling one of the local embedding algorithm and the global embedding algorithm based on a Bernoulli distribution with parameter ω;

(2) if the local embedding algorithm is sampled, training the embedding vector table V and learning the weight $w_t$ of each node type vector according to the operation steps of the local embedding algorithm; similarly, if the global embedding algorithm is sampled, training the embedding vector table V and learning the parameter $b_r$ according to the operation steps of the global embedding algorithm; the embedding vector table V is shared by the two embedding algorithms;

(3) repeating steps (1) and (2) until the model converges, thereby obtaining the user predictor.
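The alternating schedule can be sketched as a simple loop. The text does not state which algorithm corresponds to a Bernoulli success, so the choice below (success selects the local step) is an assumption, as are the function name `train_jointly`, the placeholder step functions and the fixed iteration count used in place of a convergence test.

```python
import random


def train_jointly(omega, local_step, global_step, num_iterations, seed=0):
    """Alternate between two update steps by Bernoulli(omega) draws.

    Assumption: a draw below omega runs the local step, otherwise the
    global step. A fixed iteration count stands in for a convergence check.
    """
    rng = random.Random(seed)
    counts = {"local": 0, "global": 0}
    for _ in range(num_iterations):
        if rng.random() < omega:
            local_step()
            counts["local"] += 1
        else:
            global_step()
            counts["global"] += 1
    return counts


# Placeholder steps that both touch one shared structure, mirroring the
# shared embedding vector table V (real steps would apply SGD updates).
shared_table = {"updates": 0}


def local_step():
    shared_table["updates"] += 1


def global_step():
    shared_table["updates"] += 1


counts = train_jointly(0.5, local_step, global_step, num_iterations=100)
```

Because both steps write to the same table, whatever one algorithm learns about a node is immediately visible to the other, which is the point of sharing V.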

The analysis process for the suspicious situation in the evaluation analysis module is as follows:

in the evaluation analysis module, for a suspicious situation, the dot product between each behavior identifier neighbor node of the host and the real operating user node of the host is calculated in turn as an anomaly reference; the lower the dot product score, the lower the similarity between the two entities and the higher the anomaly risk; finally the suspicious behaviors are ranked from high to low anomaly risk, that is, in ascending order of dot product score:

$$L_p = \underset{i \in E_p}{\operatorname{argsort}} \left( v_i^{\top} u_p \right)$$

wherein $L_p$ represents the resulting sequence of suspicious behaviors, $E_p$ represents the set of behavior identifier neighbor nodes of the host p, $v_i$ is the vector representation of node i, and $u_p$ is the vector representation of the real operating user of the host p.
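A short sketch of this ranking follows: each behavior neighbor of the host is scored by its dot product with the real operating user's vector, and behaviors are listed from lowest score (highest anomaly risk) to highest. The behavior identifiers and vectors are invented for illustration, and `rank_suspicious_behaviors` is a hypothetical helper name.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank_suspicious_behaviors(behavior_vectors, user_vector):
    """Order behavior identifiers by ascending similarity to the user."""
    scores = {name: dot(vec, user_vector)
              for name, vec in behavior_vectors.items()}
    return sorted(scores, key=scores.get)  # lowest dot product first


# Hypothetical behavior neighbors of one host and its real user's vector.
behavior_vectors = {
    "file:/docs/report.doc": [1.0, 1.0],
    "http:unknown-site.example": [-1.0, 0.0],
    "mail:alice@example.com": [0.5, 0.0],
}
u_p = [1.0, 1.0]

ranking = rank_suspicious_behaviors(behavior_vectors, u_p)
```

Here the HTTP entry comes first because it is least similar to the user's learned vector, so an analyst would inspect it before the mail and file entries.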

The determination of the meta-path set R needs to pass through a meta-path selection process, which is specifically as follows:

(1) calculating the identification accuracy achieved when each meta-path is added individually, and ranking the meta-paths accordingly to obtain the influence of each meta-path on the identification effect when used alone;

(2) adding the meta-paths step by step in the order obtained and, according to the change in identification accuracy, greedily selecting the combination that gives the highest user identification accuracy as the optimal meta-path set R.
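One possible reading of this selection procedure is sketched below: meta-paths are ranked by their stand-alone accuracy, then added in that order, and the prefix with the best accuracy is kept. The `evaluate` callback, the meta-path labels and the toy accuracy figures are all hypothetical; in practice each evaluation would involve training and testing the user predictor.

```python
def select_meta_paths(candidates, evaluate):
    """Greedy prefix selection of meta-paths.

    evaluate(paths) is assumed to return the identification accuracy
    obtained when training with exactly those meta-paths.
    """
    # Step 1: rank meta-paths by the accuracy each achieves on its own.
    ranked = sorted(candidates, key=lambda r: evaluate([r]), reverse=True)

    # Step 2: add them one by one and remember the best-performing prefix.
    best_set, best_accuracy = [], evaluate([])
    current = []
    for meta_path in ranked:
        current.append(meta_path)
        accuracy = evaluate(current)
        if accuracy > best_accuracy:
            best_set, best_accuracy = list(current), accuracy
    return best_set, best_accuracy


# Toy stand-in for training and testing: accuracy in whole percent.
GAINS = {"PC-mail-PC": 5, "PC-file-PC": 3, "PC-HTTP-PC": -2}


def toy_evaluate(paths):
    return 70 + sum(GAINS[r] for r in paths)


selected, accuracy = select_meta_paths(list(GAINS), toy_evaluate)
```

In the toy numbers the third meta-path lowers accuracy, so the selected set stops after the first two.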

The invention discloses a user identification method based on a heterogeneous information network embedding algorithm, which comprises the following steps:

step (1), data processing: collecting audit log data of a certain host in an intranet in a period of time interval, wherein the types of the audit log comprise a login log, a file log, a mail log, an HTTP log and an equipment log; analyzing each type of log one by using a log analyzer, extracting predefined key fields, wherein the key fields comprise a subject, an object, equipment and a timestamp, for a file log, the extracted subject is a user account, the object is a combination of a file path and a file name, the equipment is a host number, the timestamp is access time of a log record, the analyzed log data is used as a test set, and in addition, the standardized log data in a time window in a historical behavior database is used as a training set for constructing a heterogeneous information network G;

step (2), heterogeneous information network construction: constructing the heterogeneous information network G from the training set, taking the information extracted based on the predefined fields in the standardized log data of the historical behavior database as node identifiers, wherein the host identifier is the central node and the user identifier and behavior identifiers are neighbor nodes of the host; for each host p, all behavior identifier neighbor nodes related to the host p form a set $E_p$, and its real operating user is denoted $u_p$; each individual behavior identifier can be associated with several hosts: if two hosts p and q both have log records involving a mail entity e, the mail entity e is a neighbor node of both p and q;

step (3), joint embedding: after the heterogeneous information network G is obtained, the vector representation of each node is learned iteratively; the embedding vector table V is first initialized randomly, and then one of the local embedding algorithm and the global embedding algorithm is sampled based on the parameter ω in the joint objective function; if the local embedding algorithm is sampled, the embedding vector table V is trained and the weight $w_t$ of each node type vector is learned according to the operation steps of the local embedding algorithm; if the global embedding algorithm is sampled, the embedding vector table V is trained and the parameter $b_r$ is learned according to the operation steps of the global embedding algorithm; this iterative training process is repeated until the model converges, and the trained model obtained at that point is called the user predictor;

step (4), user prediction: the user predictor comprises a trained node embedded vector table V and respective parameters of a local embedding algorithm and a global embedding algorithm, then a user prediction task is executed on a test set, namely a host p to be predicted is given, the log data on the host p is predicted to belong to which operation user, the prediction result is a sequence, the ranking in the sequence sequentially represents the probability that the log data in the test set belong to a certain user, and the basis of the ranking is the dot product similarity score of the host vector and the user vector;

step (5), evaluation and analysis: for the prediction result obtained in step (4), if the real operating user corresponding to the behavior in the test set appears among the first K operating users in the prediction sequence, the identification is considered correct; otherwise, the user behavior in the test set deviates from the normal behavior pattern in the training set, and this is called a suspicious situation.
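Steps (4) and (5) amount to ranking users by dot product with the host vector and checking whether the real user falls within the top K. The following sketch shows that check with invented vectors; `rank_users` and `is_suspicious` are hypothetical helper names, and the value of K is chosen arbitrarily.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank_users(host_vector, user_vectors):
    """Return user identifiers sorted by descending dot product score."""
    return sorted(user_vectors,
                  key=lambda u: dot(host_vector, user_vectors[u]),
                  reverse=True)


def is_suspicious(ranking, true_user, k):
    """A case is suspicious when the real user is not among the top K."""
    return true_user not in ranking[:k]


# Hypothetical host vector and user vectors.
user_vectors = {
    "user_a": [1.0, 1.0],
    "user_b": [-1.0, 0.0],
    "user_c": [0.5, 0.0],
}
ranking = rank_users([0.75, 0.75], user_vectors)

correct_case = is_suspicious(ranking, "user_a", k=1)     # real user ranked first
suspicious_case = is_suspicious(ranking, "user_b", k=2)  # real user ranked last
```

A larger K makes the check more tolerant, so the choice of K trades missed anomalies against the number of cases handed to analysts.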

Compared with the prior art, the invention has the advantages that:

(1) the key of defending internal attacks lies in user authority management, an effective approach of the user authority management is to continuously monitor the user identity based on behavior analysis, and the traditional user identification methods do not fully utilize multi-source heterogeneous behavior data and are difficult to model complex association among data. The invention skillfully utilizes a heterogeneous information network to represent the structured audit log data into a graph structure, thereby creating conditions for analyzing data association;

(2) the invention combines two heterogeneous information network embedding algorithms to learn the vector representations of the nodes automatically, which is a first attempt at applying heterogeneous information network embedding in the security field and removes the traditional methods' reliance on manual expert knowledge for feature extraction; the two embedding algorithms focus on local behavior pattern features and network-wide association features respectively, so that user behavior patterns can be characterized comprehensively and user identification accuracy is greatly improved;

(3) for the suspicious situation of prediction error, the invention can also sequence the potential abnormal operation according to the similarity between the entities, and provide suspicious behavior clues of event level. The security analyst can perform traceability verification based on the effective clues of the event levels;

(4) in general, the invention provides a user identification system based on a heterogeneous information network embedding algorithm, which has the core advantages of being capable of modeling comprehensive user behavior characteristics, improving user identification accuracy and providing fine-grained anomaly analysis.

Drawings

FIG. 1 is a block diagram of an implementation of the system of the present invention;

FIG. 2 illustrates the network schema of the heterogeneous information network of the present invention;

FIG. 3 is a framework of the local embedding algorithm of the present invention.

Detailed Description

The present invention will be described in detail below with reference to the accompanying drawings and examples.

The method mainly solves the problem of how to identify potential operation users based on the multi-source heterogeneous host audit logs, and provides the anomaly analysis with guiding significance aiming at the suspicious situation of identification errors.

As shown in FIG. 1, the system of the present invention comprises a data processing module, a joint embedding module, and an evaluation analysis module. The data processing module is used for analyzing and processing original multi-source heterogeneous host audit log data, reserving predefined key fields and constructing a heterogeneous information network by using standardized historical log data; the combined embedding module is used for learning the operation mode of a single host and capturing network global association by using two heterogeneous information network embedding algorithms, the two heterogeneous information network embedding algorithms are called as a local embedding algorithm and a global embedding algorithm, and the local embedding algorithm and the global embedding algorithm are combined through a combined function for iterative training to obtain a user predictor; and the evaluation analysis module is used for evaluating the user identification effect and giving out suspicious behavior sequencing through similarity-based anomaly analysis aiming at the suspicious situation of the identification error.

The data processing module is specifically realized as follows:

(1) data processing: collecting audit log data of a certain host in an intranet in a period of time interval, wherein the types of the audit log comprise a login log, a file log, a mail log, an HTTP log and an equipment log; the method comprises the steps of analyzing each type of log one by using a log analyzer, extracting predefined key fields, wherein the key fields comprise a subject, an object, equipment and a timestamp, for one file log, the extracted subject is a user account, the object is a combination of a file path and a file name, the equipment is a host number, the timestamp is access time of log records, and the analyzed log data is used as a test set. In addition, log data standardized in a time window in the historical behavior database is used as a training set for constructing the heterogeneous information network G in the next step.

(2) heterogeneous information network construction: the heterogeneous information network G is constructed from the training set, taking the information extracted based on the predefined fields in the standardized log data of the historical behavior database as node identifiers, wherein the host identifier is the central node and the user identifier and behavior identifiers are neighbor nodes of the host; the constructed heterogeneous information network follows the network schema shown in FIG. 2. The network schema involves seven node types, namely PC, user, login, file, mail, HTTP and equipment, wherein the PC is a supernode connected to the other six node types; the edge types involved include "PC accesses file", "PC sends mail", and the like. For each host p, all behavior identifier neighbor nodes related to the host p form a set $E_p$, and its real operating user is denoted $u_p$; each individual behavior identifier can be associated with several hosts: if two hosts p and q both have log records involving a mail entity e, the mail entity e is a neighbor node of both p and q.
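The field extraction described in step (1) of the data processing module can be sketched for a file log as follows. The raw column layout (`timestamp,user,pc,path,filename`) is an assumption, since the text does not specify the raw log format, and `parse_file_log` is a hypothetical helper name.

```python
import csv
import io

# Hypothetical raw file-log text; the column layout is assumed for illustration.
RAW_FILE_LOG = """timestamp,user,pc,path,filename
2019-12-01 09:00:00,user_a,pc_1,/docs,report.doc
"""


def parse_file_log(text):
    """Map raw file-log rows onto the predefined fields."""
    parsed = []
    for row in csv.DictReader(io.StringIO(text)):
        parsed.append({
            "subject": row["user"],    # user account
            "device": row["pc"],       # host identifier
            # For file logs the object combines the file path and file name.
            "object": row["path"].rstrip("/") + "/" + row["filename"],
            "timestamp": row["timestamp"],
        })
    return parsed


parsed = parse_file_log(RAW_FILE_LOG)
```

Parsers for the other four log types would differ only in how the object field is derived, for example a mail address for mail logs or a URL for HTTP logs.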

the joint embedding module is specifically realized as follows:

Traditional methods usually rely on feature engineering to extract high-dimensional features manually, which requires expert knowledge. The invention automatically extracts feature vectors that carry rich structural and semantic associations to represent users and entities, and predicts the operating users of the log data in the test set with the trained user predictor. The steps are as follows:

(1) initialization: firstly, randomly initializing an embedded vector table V, wherein the V represents vector representation of all nodes in a constructed heterogeneous information network G;

(2) iterative training: an iterative training process is then executed, in which one of the local embedding algorithm and the global embedding algorithm is sampled based on the parameter ω in the joint objective function; if the local embedding algorithm is sampled, training is carried out according to the execution steps of the local embedding model, and the embedding vector table V and the weight $w_t$ of each node type vector are updated; if the global embedding algorithm is sampled, training is carried out according to the execution steps of the global embedding model, and the embedding vector table V and the parameter $b_r$ are updated; this iterative training process is repeated until the model converges, and the trained model obtained at that point is called the user predictor;

(3) user prediction: the test set is input into the trained user predictor, which predicts the potential operating users based on the standardized log data in the test set; the prediction result is a sequence in which the rank reflects, in descending order, the probability that the log data in the test set belongs to a given user, the ranking being based on the dot product similarity score between the host vector and the user vector;

according to fig. 3, the local embedding algorithm is performed as follows:

(1) given a host p, a host vector $V_p$ is obtained by two-step aggregation. In the first step, a node type vector $V_p^{(t)}$ is calculated for each type of behavior identifier neighbor node of the host p:

$$V_p^{(t)} = \frac{1}{|E_p^{(t)}|} \sum_{n \in E_p^{(t)}} v_n$$

wherein $E_p^{(t)}$ represents the set of t-th type behavior identifier neighbor nodes of the host p and $v_n$ is the vector of neighbor node n;

in the second step, the node type vectors $V_p^{(t)}$ are combined by weighting to obtain the host vector $V_p$:

$$V_p = \sum_{t=1}^{5} w_t V_p^{(t)}$$

wherein $w_t$ represents the weight of the t-th node type vector; there are 5 behavior identifier neighbor node types in total, so t takes the values 1 to 5, which respectively represent the login node type, the file node type, the mail node type, the HTTP node type and the equipment node type;

(2) based on the host vector $V_p$, the dot product similarity between the host and each user is calculated and the potential operating users are ranked:

$$f(p, u) = v_u^{\top} V_p$$

wherein $v_u$ represents a user vector;

(3) the embedding vector table V is updated by stochastic gradient descent (SGD) and the weight $w_t$ of each node type vector is learned, using a max-margin objective as the loss function, which is defined as:

$$\max(0,\; f(p, u') - f(p, u) + \varepsilon)$$

The global embedding algorithm is executed as follows:

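In the negative sampling approximation, j' is a negative node sampled from the noise distribution predefined for the meta-path r; each node i samples k negative nodes, and the bias term $b_r$ is used to adjust the density of the different meta-paths;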

The evaluation and analysis module is specifically realized as follows:

(1) evaluation: the prediction result obtained in the joint embedding module is evaluated to judge whether the real operating user of the host is consistent with the prediction result. The prediction result A given by the model for the log data in the test set is obtained from the joint embedding module; the result is a sequence in which the rank reflects, in descending order, the probability that the behavior in the test set belongs to a given user. If the real operating user corresponding to the behavior in the test set appears among the first K in the prediction sequence, the identification is considered correct; otherwise the situation is called a suspicious situation.

(2) analysis: the model attributes a suspicious situation to a deviation between the user behavior in the test set and the normal behavior pattern in the training set; in the evaluation analysis module, for a suspicious situation, the dot product between each behavior identifier neighbor node of the host and the real operating user of the host is calculated in turn as an anomaly reference; the lower the dot product score, the lower the similarity between the two entities and the higher the anomaly risk; finally, the suspicious behaviors are ranked from high to low anomaly risk.

Claims

1. A user identification system based on heterogeneous information network embedding algorithm is characterized in that: the heterogeneous information network embedding algorithm is a local embedding algorithm realized based on a neural network and a global embedding algorithm realized based on a meta-path, the user identification is to identify potential operating users based on multi-source heterogeneous audit log data collected by each host in an enterprise intranet, the system comprises a data processing module, a combined embedding module and an evaluation analysis module, wherein:

a data processing module, having two functions: the first function is to extract standardized audit log data from a historical behavior database, this log data serving as the training set used to construct a heterogeneous information network G; the second function is to preprocess the original multi-source heterogeneous audit log data newly collected from intranet hosts; both the standardized audit log data in the historical behavior database and the newly collected original audit log data comprise five types of multi-source heterogeneous audit log data, namely login log data, file log data, mail log data, HTTP log data and device log data, which record the user's login behaviors, file behaviors, mail behaviors, web behaviors and external device connection behaviors respectively; preprocessing the original multi-source heterogeneous audit log data means standardizing each log record, using a log parser to extract key information based on predefined fields; the predefined fields comprise a subject, a device, an object and a timestamp, where the subject is the user identifier, the device is the host identifier, and the object depends on the log data type and identifies the specific behavior recorded by that type; in log data of the file type, the object is the combination of the file path and the file name; the timestamp is the time at which the log record was generated; the parsed newly collected log data is used as the test set; the heterogeneous information network G takes the information extracted from the standardized log data based on the predefined fields as node identifiers, with the host identifier as the central node and the user identifiers and behavior identifiers as neighbor nodes of the host identifier;

a joint embedding module: taking the heterogeneous information network G constructed in the data processing module as input, a model reflecting the operation pattern of each host is trained; this model is called the user predictor and is used to perform user prediction on the test set, finally yielding a ranking of potential operating users for the log data in the test set; training the user predictor means learning the vector representations of the nodes in the heterogeneous information network G and the parameters of the model; so that the learned node vectors preserve both the network structure information and the similarity information between nodes, the joint embedding module adopts two heterogeneous information network embedding algorithms, a local embedding algorithm and a global embedding algorithm; the local embedding algorithm learns the interactions between each host and its neighbor nodes and embeds normal behavior pattern information; the global embedding algorithm uses the semantics defined by meta-paths to embed the association information between nodes of different types; finally, the two embedding algorithms are combined through a joint objective function for iterative training;

2. The heterogeneous information network embedding algorithm-based user identification system according to claim 1, wherein: in the data processing module, the heterogeneous information network is constructed as follows: based on the standardized log data in the historical behavior database, the extracted host, user and behavior identifiers are used as nodes to construct the heterogeneous information network G, with the host identifier as the central node and the user identifiers and behavior identifiers as its neighbor nodes.
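As an illustration of the standardization and network construction described in claims 1 and 2, the following Python sketch normalizes a file log entry into the four predefined fields and builds a host-centered network. The raw field names (`user`, `host`, `path`, `filename`, `time`), the `LogRecord` type and the nested-dictionary layout of G are assumptions and are not specified in the patent.

```python
import posixpath
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Set


class LogRecord(NamedTuple):
    subject: str    # user identifier
    device: str     # host identifier
    obj: str        # behavior identifier, depends on the log data type
    timestamp: str  # time at which the log record was generated


def normalize_file_event(raw: Dict[str, str]) -> LogRecord:
    # Hypothetical raw field names; for file logs the object is path + file name.
    obj = posixpath.join(raw["path"], raw["filename"])
    return LogRecord(raw["user"], raw["host"], obj, raw["time"])


def build_network(records: Iterable[LogRecord]) -> Dict[str, Dict[str, Set[str]]]:
    # Each host is a central node with typed sets of neighbor nodes.
    network: Dict[str, Dict[str, Set[str]]] = defaultdict(
        lambda: {"user": set(), "behavior": set()}
    )
    for rec in records:
        network[rec.device]["user"].add(rec.subject)
        network[rec.device]["behavior"].add(rec.obj)
    return dict(network)
```

Keeping the neighbor sets separated by node type is one possible design choice; it makes it straightforward to treat user nodes and behavior nodes differently in later steps.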

3. The heterogeneous information network embedding algorithm-based user identification system according to claim 1, wherein: in the joint embedding module, the local embedding algorithm is implemented based on a neural network, and the specific process is as follows:

wherein,

in the second step, the node type vectors are calculated and combined by weighting to obtain the host vector V_p;

max(0, f(p, u′) - f(p, u) + ε)
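The expression max(0, f(p, u′) - f(p, u) + ε) has the form of a margin-based ranking (hinge) loss. The sketch below evaluates it for given scores; reading f(p, u) as the score of the real user, f(p, u′) as the score of a sampled negative user and ε as the margin is an interpretation, and the function name is hypothetical.

```python
def margin_ranking_loss(score_pos: float, score_neg: float, eps: float) -> float:
    """Compute max(0, f(p, u') - f(p, u) + eps).

    score_pos stands for f(p, u), score_neg for f(p, u'), eps for the margin.
    The loss is zero once the positive score exceeds the negative score
    by at least the margin.
    """
    return max(0.0, score_neg - score_pos + eps)


satisfied = margin_ranking_loss(score_pos=2.0, score_neg=0.5, eps=1.0)
violated = margin_ranking_loss(score_pos=0.5, score_neg=1.0, eps=1.0)
```

In the first call the margin is already satisfied and the loss is 0.0; in the second the negative user outscores the real user and the loss is 1.5.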

4. The heterogeneous information network embedding algorithm-based user identification system according to claim 1, wherein: in the joint embedding module, the global embedding algorithm is implemented based on meta-paths, and the implementation process is as follows:

where the negative nodes are obtained by sampling, each node i samples k negative nodes, and the bias term b_r is a bias term of the neural network training that is used to adjust for the density of the different meta-paths;

(3) the embedding vector table V and the bias term b_r are learned using stochastic gradient descent (SGD), with the goal of maximizing the likelihood function.
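The likelihood function itself is not reproduced in the text above. Purely as an illustration, the sketch below evaluates a common skip-gram-style negative-sampling log-likelihood with a per-meta-path bias; the sigmoid form, the way b_r enters the score and all names are assumptions, not the patent's stated formula.

```python
import math
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def log_sigmoid(x: float) -> float:
    # Numerically stable log(1 / (1 + exp(-x))).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def pair_log_likelihood(v_i: Sequence[float], v_j: Sequence[float],
                        negatives: Sequence[Sequence[float]], b_r: float) -> float:
    """Assumed objective for one positive pair (i, j) and k sampled negatives.

    The positive pair is pushed toward a high score and each negative pair
    toward a low score; b_r shifts the scores of all pairs on meta-path r.
    """
    ll = log_sigmoid(dot(v_i, v_j) + b_r)
    for v_n in negatives:
        ll += log_sigmoid(-(dot(v_i, v_n) + b_r))
    return ll
```

Under this assumed form, SGD would update the rows of V and the scalar b_r in the direction that increases the returned value.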

5. The heterogeneous information network embedding algorithm-based user identification system according to claim 1, wherein: in the joint embedding module, the joint objective function combines the local features captured by the local embedding algorithm with the global features captured by the global embedding algorithm, and is defined as follows:

(2) if the local embedding algorithm is sampled, the embedding vector table V is trained and the weight w_t of each node type vector is learned according to the steps of the local embedding algorithm; similarly, if the global embedding algorithm is sampled, the embedding vector table V is trained and the parameter b_r is learned according to the steps of the global embedding algorithm, where the parameter b_r is used to adjust for the density of the different meta-paths;

the embedding vector table V is shared by both embedding algorithms;
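The alternating training scheme with a shared embedding table can be sketched as below. The sampling rule (a fixed probability `p_local`), the callback signatures and all names are assumptions; the text only states that one of the two algorithms is sampled at each iteration and that V is shared.

```python
import random
from typing import Callable, Dict, List

Embedding = Dict[str, List[float]]


def joint_train(V: Embedding,
                local_step: Callable[[Embedding], None],
                global_step: Callable[[Embedding], None],
                iterations: int,
                p_local: float = 0.5,
                seed: int = 0) -> Dict[str, int]:
    """Alternate between two update rules that modify the same table V in place."""
    rng = random.Random(seed)
    counts = {"local": 0, "global": 0}
    for _ in range(iterations):
        if rng.random() < p_local:
            local_step(V)   # would also update the type weights w_t
            counts["local"] += 1
        else:
            global_step(V)  # would also update the meta-path biases b_r
            counts["global"] += 1
    return counts
```

Because both callbacks receive the same dictionary object, an update made by one algorithm is immediately visible to the other, which is what sharing V amounts to in this sketch.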

6. The heterogeneous information network embedding algorithm-based user identification system according to claim 1, wherein: the analysis process for the suspicious situation in the evaluation analysis module is as follows:

7. The heterogeneous information network embedding algorithm-based user identification system according to claim 4, wherein: the meta-path set R is determined through a meta-path selection process, which is specifically as follows:

8. A user identification method based on a heterogeneous information network embedding algorithm is characterized by comprising the following steps: