CN114553497B

CN114553497B - Internal threat detection method based on feature fusion

Info

Publication number: CN114553497B
Application number: CN202210105573.5A
Authority: CN
Inventors: 卢志刚; 肖海涛; 刘玉岭; 张辰; 刘松; 姜波
Original assignee: Institute of Information Engineering of CAS
Current assignee: Institute of Information Engineering of CAS
Priority date: 2022-01-28
Filing date: 2022-01-28
Publication date: 2022-11-15
Anticipated expiration: 2042-01-28
Also published as: CN114553497A

Abstract

The invention provides an internal threat detection method based on feature fusion, and relates to the field of cyberspace security.

Description

Internal threat detection method based on feature fusion

Technical Field

The invention relates to the field of cyberspace security. It fuses the statistical features and structural features of users and detects internal threats using an anomaly detection method based on deep learning, so that potentially threatening users in an intranet are discovered in time.

Background

While external network threats grow more rampant by the day, the impact of internal threats cannot be ignored. An internal threat generally refers to a user within an organization who has access rights to the internal network, systems or data, and who abuses those rights or violates the organization's security policy, negatively affecting the confidentiality, integrity and availability of internal information. According to the 2020 security report from Cybersecurity Insiders, about two thirds of organizations (68%) indicated that threats from insiders had become more frequent in the past year, and 70% of organizations had suffered malicious behavior by insiders at least once in the past year. Threats from within the internal network are becoming increasingly prominent and are now a problem that urgently needs solving; detecting internal threats accurately and promptly is crucial to the stable operation and healthy development of an organization.

Internal threats are often associated with malicious employees who deliberately steal data, sabotage systems and so on, causing losses to enterprises and organizations. In practice, employee negligence and partner errors also lead to many security vulnerabilities and unexpected data leaks, which likewise cause considerable losses. Apart from objective factors such as system vulnerabilities and improper permission assignment, people are the main cause of such losses. Internal threats are usually carried out by privileged users with legitimate permissions. Unlike external threats, which exploit system vulnerabilities to perform unauthorized operations, internal users have legitimate identities and are familiar with the internal architecture, so their malicious behavior is difficult to discover and poses a great threat to the security of the organization.

Today's research on internal threat detection techniques can be divided into rule-based internal threat detection, traditional machine learning-based internal threat detection, and deep learning-based internal threat detection, depending on the method used.

The rule-based internal threat detection generally has higher accuracy and lower false alarm rate on the determined threat behaviors, but is difficult to detect unknown abnormal behaviors which are not in a knowledge base, cannot adapt to new internal attack behaviors, and is not suitable for the current complex network environment.

Internal threat detection based on traditional machine learning generally preprocesses the data, extracts and selects features, trains a model on the training data using the chosen traditional machine learning algorithm, applies the trained model to the test data to make predictions, and evaluates the predictions against the true values.

Internal threat detection based on deep learning mostly takes a user behavior sequence as input, builds a model of the user's normal behavior, and detects anomalies through changes in the user behavior sequence in order to judge whether the user is abnormal. However, such methods can learn only one-sided information, either user behavior sequence information or user behavior statistical information, and lack any judgment based on user attributes and the association information between users.

In summary, existing internal threat detection methods generally suffer from low detection accuracy and a high false alarm rate, because they do not make full use of the available feature types and do not consider the association information among users, so the detection results are often unsatisfactory.

Disclosure of Invention

In order to solve the problems, the invention provides an internal threat detection method based on feature fusion, which is used for extracting statistical features and structural features of user behaviors from multi-source logs respectively, and combining an anomaly detection method based on deep learning to realize detection of internal threats and improve the security of an internal network organization.

In order to achieve the purpose, the invention adopts the specific technical scheme that:

an internal threat detection method based on feature fusion comprises the following steps:

collecting multi-source user behavior logs in an internal network, analyzing by taking users as units, and forming an independent multi-source user behavior log record for each user;

counting behavior information of users from a multi-source user behavior log record corresponding to each user, and extracting statistical characteristics of user behaviors;

constructing a user login behavior association graph by using login logs in multi-source user behavior log records of a user, randomly walking neighbor nodes of each user node to generate a plurality of random walking sequences with fixed lengths, wherein each sequence is formed by arranging the walking nodes in a front-back sequence, and the structural characteristics of the user behavior are extracted aiming at each node in the sequence;

fusing the extracted statistical characteristics and structural characteristics of the user behaviors to form a characteristic matrix;

processing training data containing internal threat labels through the steps to obtain a feature matrix, inputting the feature matrix into a Capsule neural network for training to obtain an internal threat detection model based on feature fusion; when the threat in the internal network is formally detected, the multi-source user behavior log in the internal network is obtained, the characteristic matrix is obtained through the processing of the steps, and the characteristic matrix is input into the internal threat detection model to detect the threat in the internal network.

Further, the multi-source user behavior log includes a mobile device usage record, a file operation log, an email log, and a web browsing record in addition to a login log.

Further, the behavior information of the user comprises frequency-based user behavior statistical characteristics and content-based user behavior statistical characteristics; the frequency-based user behavior statistical characteristics are average counts of different types of user behaviors, and the content-based user behavior statistical characteristics are content generated based on user behavior operation.

Further, the frequency-based user behavior statistical characteristics comprise one or more of: the number of user logins and logouts, the number of user logins and logouts during off-duty hours, the number of computers logged in to by a user, the number of device connections, the number of transmissions of different files, the total number of file transmissions, the number of files transmitted during off-duty hours, the number of executable file transmissions, the number of computers involved in file transmission, the number of e-mails sent, the number of e-mails sent inside the organization, the number of e-mails sent outside the organization, the average e-mail size, the number of e-mail attachments, the number of e-mail recipients, the number of e-mails sent during off-duty hours, the number of computers used for receiving e-mails, the number of web page views, and the number of web page views during off-duty hours;

the content-based user behavior statistical characteristics comprise one or more of: the number of e-mails associated with emotional tendency, the number of web pages browsed associated with decryption, the number of web pages browsed associated with job recruitment, the number of web pages browsed associated with hackers, the number of web pages browsed associated with cloud storage, the number of web pages browsed associated with social networking, and the number of web pages browsed associated with emotional tendency.

Further, the user login behavior association graph is a homogeneous graph composed of different users and the association relationships between them, and is denoted by G = (V, E), where V represents the set of nodes in the graph, each node representing one user, and E represents the set of edges in the graph, each edge representing an association relationship between the two corresponding users.

Further, the structural features of the user behavior are extracted for each node in the sequence as follows: a plurality of fixed-length random walk sequences are generated to form a node sequence set S; for each node u, a node embedding vector is learned with a Skip-gram model by maximizing the loss function f over the set S, and this node embedding vector is the structural feature of the user behavior.

Further, the statistical features and structural features of the user behaviors are fused as follows: the statistical features and structural features are concatenated, the feature values are normalized to the range 0-1 using Min-Max normalization, and the result is converted into a two-dimensional feature matrix.

Further, during training of the Capsule neural network, the reconstruction error loss and the margin loss are summed to obtain the final loss function, which is used to adjust the parameters of the Capsule neural network and obtain the internal threat detection model. The reconstruction error loss is the Euclidean distance between the feature matrix and the output of the Sigmoid layer in the reconstruction. The reconstruction uses a decoder structure built on the DigitCaps digit capsule layer of the Capsule neural network: a masking method retains only the activation vector of the correct digit capsule, and that activation vector is then used for reconstruction to obtain the reconstruction error loss.

Compared with the existing internal threat detection method, the method has the following advantages:

The invention mines latent features of user behavior from both the statistical level and the structural level of the multi-source user behavior logs, makes effective use of multiple feature types, and incorporates the association information between users. It thereby detects internal threats more effectively, overcomes the limitation that existing internal threat detection methods learn only one-sided user behavior information and ignore the associations between users, and further improves the accuracy of internal threat detection.

Drawings

Fig. 1 is a general flow chart of an internal threat detection method based on feature fusion according to an embodiment of the present invention.

Fig. 2 is a schematic diagram illustrating the extraction of statistical features of user behaviors according to an embodiment of the present invention.

Fig. 3 is a schematic diagram illustrating extraction of structural features of user behaviors according to an embodiment of the present invention.

Detailed Description

In order to make the technical solutions in the embodiments of the present invention better understood and make the objects, features and advantages of the present invention more obvious and understandable by those skilled in the art, the technical cores of the present invention are further described in detail with reference to the accompanying drawings and examples.

The embodiment of the invention provides an internal threat detection method based on feature fusion, which is specifically explained as follows:

Fig. 1 shows the general flow of internal threat detection based on feature fusion. The method comprises five steps. First, the user's login log R_L, mobile device usage record R_D, file operation log R_F, e-mail log R_M and web browsing record R_W are collected and parsed to form the multi-source user behavior log R. Next, the statistical features F_stat and the structural features F_stru of user behavior are extracted separately from the parsed multi-source user behavior log R. The two types of features are then fused and converted, so that each internal user has a unique feature matrix F_matrix. Finally, the feature matrices are input into a Capsule neural network for training to obtain the internal threat detection model M based on feature fusion, and the model M is used to detect internal threat users.

As shown in Fig. 2, user behavior statistics are computed from the parsed multi-source user behavior logs in two respects: frequency-based and content-based. The frequency of a user's operations often reflects typical characteristics of that user's behavior; the frequency-based user behavior statistical features are average counts of different types of user behavior, such as the daily number of user logins and logouts and the daily number of user logins and logouts during off-duty hours. The content-based user behavior statistical features, on the other hand, are extracted from the content generated by the user's operations, and are mainly text features arising in user communication, such as the emotional tendency of e-mails and the access frequency of specific web pages. In total, 33 user behavior statistical features are extracted here.

Specifically, the user behavior statistical features are extracted per user. The frequency-based user behavior statistical features comprise: the number of user logins and logouts, the number of user logins and logouts during off-duty hours, and the number of computers the user logs in to; the number of device connections, the number of device connections during off-duty hours, and the number of computers to which devices are connected; the number of transmissions of different files, the total number of file transmissions, the number of files transmitted during off-duty hours, the number of executable file transmissions, and the number of computers involved in file transmission; the number of e-mails sent, the number of e-mails sent inside the organization, the number of e-mails sent outside the organization, the average e-mail size, the number of e-mail attachments, the number of e-mail recipients, the number of e-mails sent during off-duty hours, and the number of computers used for receiving e-mails; and the number of web page views and the number of web page views during off-duty hours. The content-based user behavior statistical features comprise: the number of e-mails related to emotional tendency; and the number of web pages viewed related to decryption, the number of web pages viewed related to job recruitment, the number of web pages viewed related to hackers, the number of web pages viewed related to cloud storage, the number of web pages viewed related to social networking, and the number of web pages viewed related to emotional tendency.
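The frequency-based counting described above can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: the record layout `(user, timestamp, activity, pc)`, the activity names `Logon`/`Logoff`, and the working hours of 08:00 to 18:00 are all assumptions, since the text does not specify them. Only three of the login-related features are computed, as totals; dividing by the number of observed days would give the average counts the text refers to.

```python
from collections import defaultdict
from datetime import datetime

# Assumed working hours; the text does not define "off-duty hours".
WORK_START_HOUR = 8
WORK_END_HOUR = 18


def is_off_duty(ts: datetime) -> bool:
    """Return True if the timestamp falls outside the assumed working hours."""
    return not (WORK_START_HOUR <= ts.hour < WORK_END_HOUR)


def login_features(records):
    """Count per-user login features from (user, timestamp, activity, pc) tuples.

    Returns {user: {"logon_logoff": int, "off_duty_logon_logoff": int, "pcs": int}}.
    """
    counts = defaultdict(lambda: {"logon_logoff": 0, "off_duty_logon_logoff": 0})
    pcs = defaultdict(set)
    for user, ts, activity, pc in records:
        if activity not in ("Logon", "Logoff"):
            continue  # ignore records that are not login events
        counts[user]["logon_logoff"] += 1
        if is_off_duty(ts):
            counts[user]["off_duty_logon_logoff"] += 1
        pcs[user].add(pc)
    return {u: {**c, "pcs": len(pcs[u])} for u, c in counts.items()}


# Hypothetical sample records, for illustration only.
sample = [
    ("alice", datetime(2010, 1, 4, 9, 0), "Logon", "PC-1"),
    ("alice", datetime(2010, 1, 4, 22, 30), "Logoff", "PC-1"),
    ("alice", datetime(2010, 1, 5, 23, 0), "Logon", "PC-2"),
    ("bob", datetime(2010, 1, 4, 8, 30), "Logon", "PC-3"),
]
features = login_features(sample)
```

The same pattern (filter by record type, count, count again under an off-duty predicate, collect distinct computers) would extend to the device, file, e-mail and web features listed above.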

As shown in Fig. 3, a user login behavior association graph G = (V, E) is constructed from the users' login logs. The constructed graph is a homogeneous graph, where V represents the set of nodes in the graph and E represents the set of edges in the graph; V has only one attribute, namely the user, and E has only one attribute, namely whether an association relationship exists between users. If user_i and user_j have logged in to the same device, an association relationship is considered to exist between user_i and user_j. The user login behavior association graph is constructed in this way. Then, a random walk method is used to simulate fixed-length random walk sequences over the surrounding neighbor nodes: for any node u ∈ V in the graph G, the random walk length is set to l, which yields a node sequence set S = {s_1, s_2, …, s_m}, where s_i denotes the i-th random walk sequence and m denotes the total number of random walk sequences. Next, for each node u, a d-dimensional node vector representation is learned with a Skip-gram model. The Skip-gram model learns the node embedding vectors by maximizing a loss function f over the node sequence set S; the loss function f is given by formula (1):

$$f = \sum_{u \in W} \log P(N(u) \mid u) \qquad (1)$$

where W is the vocabulary containing each distinct node, N(u) is the set of neighbor nodes of node u, and the logarithm is taken to base 2; P(N(u) | u) is the transition probability from a given source node u to the neighbor nodes of u, defined as shown in equation (2):
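The formula for equation (2) is not reproduced in this text. Under the standard Skip-gram softmax formulation, and assuming the neighbors of u are treated as conditionally independent, it would presumably take a form like the one below. This is a reconstruction from the symbols defined in the next paragraph (v, v', T, n_i, W, N), not the original equation.

```latex
P(N(u) \mid u) = \prod_{n_i \in N(u)} P(n_i \mid u),
\qquad
P(n_i \mid u) = \frac{\exp\left( {v'_{n_i}}^{T} \, v_u \right)}
                     {\sum_{w \in W} \exp\left( {v'_{w}}^{T} \, v_u \right)}
```

Here v_u is the embedding of the source node u and v'_{n_i} is the second ("context") vector of the neighbor n_i; the denominator normalizes over every node in the vocabulary W.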

where v and v' are the two vector representations of a node, and v is the final low-dimensional embedding vector of node u; user nodes with similar device login behaviors tend to have similar low-dimensional embedding vectors. T denotes the transpose, n_i is the i-th neighbor node of node u, W is the vocabulary containing each distinct node, and N is the set of neighbor nodes of the node.
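The graph construction and random walk steps described above can be sketched as follows. This is a minimal illustration under stated assumptions: login records are reduced to hypothetical `(user, device)` pairs, walks choose the next neighbor uniformly at random, and the function names and the `walks_per_node` parameter are invented for the example. A walk that starts on an isolated node stops early and is therefore shorter than the fixed length.

```python
import random
from collections import defaultdict
from itertools import combinations


def build_login_graph(logins):
    """Build an undirected user graph from a list of (user, device) login pairs.

    Two users are linked when they have logged in to the same device.
    """
    users_by_device = defaultdict(set)
    for user, device in logins:
        users_by_device[device].add(user)
    graph = defaultdict(set)
    for users in users_by_device.values():
        for a, b in combinations(sorted(users), 2):
            graph[a].add(b)
            graph[b].add(a)
    # Keep users with no shared device as nodes without neighbors.
    for user, _ in logins:
        graph.setdefault(user, set())
    return graph


def random_walks(graph, walk_length, walks_per_node, seed=0):
    """Generate uniform random walks of a fixed length starting from every node."""
    rng = random.Random(seed)
    walks = []
    for start in sorted(graph):
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                neighbors = sorted(graph[walk[-1]])
                if not neighbors:
                    break  # isolated node: the walk cannot continue
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks


# Hypothetical login pairs, for illustration only.
logins = [("u1", "PC-1"), ("u2", "PC-1"), ("u2", "PC-2"), ("u3", "PC-2"), ("u4", "PC-9")]
graph = build_login_graph(logins)
walks = random_walks(graph, walk_length=5, walks_per_node=2)
```

The resulting `walks` play the role of the node sequence set S. They could then be passed to any Skip-gram implementation (for example a Word2Vec trainer configured in Skip-gram mode) to obtain the d-dimensional node embeddings; that training step is omitted here.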

In the feature fusion process, the extracted statistical features F_stat and structural features F_stru of user behavior are first concatenated, and the feature values are then normalized to the range 0-1 using Min-Max normalization. Min-Max normalization removes the differences in scale between features, preventing features of different scales from having a disproportionate influence on the model. After normalization, the features are converted into a two-dimensional feature matrix F_matrix for input into the subsequent deep neural network model.
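The fusion step can be sketched as follows. The text does not say how the normalized vector is laid out as a two-dimensional matrix, so the row width and the zero-padding of the last row are assumptions made for this example, as are the function names. Normalization is assumed to be applied per feature column across all users, with constant columns mapped to 0.

```python
def min_max_normalize(rows):
    """Scale each feature column to [0, 1]; constant columns map to 0.0."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(x - lo) / (hi - lo) if hi > lo else 0.0 for x, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def to_matrix(vector, width):
    """Reshape a flat feature vector into rows of `width`, zero-padding the tail."""
    padded = list(vector) + [0.0] * (-len(vector) % width)
    return [padded[i:i + width] for i in range(0, len(padded), width)]


def fuse(stat_features, struct_features, width):
    """Concatenate per-user features, normalize per column, reshape per user."""
    users = sorted(stat_features)
    rows = [list(stat_features[u]) + list(struct_features[u]) for u in users]
    normalized = min_max_normalize(rows)
    return {u: to_matrix(row, width) for u, row in zip(users, normalized)}


# Hypothetical features: two statistical values and a 2-dimensional embedding per user.
f_stat = {"u1": [10, 0], "u2": [20, 4], "u3": [30, 2]}
f_stru = {"u1": [-1.0, 0.5], "u2": [0.0, 0.5], "u3": [1.0, 0.5]}
f_matrix = fuse(f_stat, f_stru, width=2)
```

With these inputs, user `u2` ends up with the matrix `[[0.5, 1.0], [0.5, 0.0]]`: each value is its position between the column minimum and maximum, and the last column is constant across users.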

In the neural network training stage, the labeled training data are preprocessed as described above to obtain the feature matrix F_matrix, which is input into the Capsule neural network. After passing in turn through the Conv convolutional layer, the PrimaryCaps primary capsule layer and the DigitCaps digit capsule layer, the network outputs whether the user corresponding to the feature matrix is abnormal. Compared with traditional neural networks, the Capsule neural network has stronger learning and representation capabilities and can better adapt to the learned features.

In the training process, in order to adjust the parameters of the Capsule neural network, the embodiment of the invention uses a margin loss function; for each capsule k, the loss function L_k is given by equation (3):

$$L_k = T_k \max\left(0,\; m^{+} - \lvert v_k \rvert\right)^2 + \lambda \,(1 - T_k)\, \max\left(0,\; \lvert v_k \rvert - m^{-}\right)^2 \qquad (3)$$

where v_k is the output vector of capsule k of the Capsule neural network, and |v_k| is the modulus (length) of that output vector, which represents the magnitude of the class probability. T_k = 1 when class k is the correct class, and T_k = 0 otherwise. The two hyperparameters m⁺ and m⁻ control the upper and lower bounds of the losses for correct and incorrect classification respectively; for example, if the probability of the correct class is greater than m⁺, the loss for the correct class is 0. Here m⁺ and m⁻ are set to 0.9 and 0.1 respectively, and λ is the regularization parameter, set here to 0.5.
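A small worked example of equation (3) may make the roles of m⁺, m⁻ and λ concrete. The sketch below evaluates the margin loss for each capsule using the hyperparameter values stated in the text; the capsule output vectors and the function name are hypothetical.

```python
import math

M_PLUS = 0.9   # m+ as stated in the text
M_MINUS = 0.1  # m- as stated in the text
LAMBDA = 0.5   # lambda as stated in the text


def margin_loss(capsule_outputs, true_class):
    """Return the list of per-capsule margin losses L_k from equation (3)."""
    losses = []
    for k, v_k in enumerate(capsule_outputs):
        norm = math.sqrt(sum(x * x for x in v_k))  # |v_k|
        t_k = 1.0 if k == true_class else 0.0
        # Penalize the correct capsule when its length is below m+.
        present = t_k * max(0.0, M_PLUS - norm) ** 2
        # Penalize every other capsule when its length exceeds m-.
        absent = LAMBDA * (1.0 - t_k) * max(0.0, norm - M_MINUS) ** 2
        losses.append(present + absent)
    return losses


# Hypothetical outputs of two capsules (class 0 and class 1); class 0 is correct.
outputs = [[0.6, 0.0], [0.0, 0.3]]
per_capsule = margin_loss(outputs, true_class=0)
```

For these vectors the correct capsule has length 0.6, giving a loss of about (0.9 - 0.6)² = 0.09, and the other capsule has length 0.3, giving about 0.5 × (0.3 - 0.1)² = 0.02. A correct capsule longer than 0.9, or an incorrect capsule shorter than 0.1, would contribute no loss.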

Meanwhile, the embodiment of the invention sums the reconstruction error loss and the margin loss as the final loss function in order to obtain a more accurate detection model. For the reconstruction error loss L_R, a decoder structure is built on the DigitCaps digit capsule layer; a masking method retains only the activation vector of the correct digit capsule, and this activation vector is then used for reconstruction. The reconstruction error loss L_R is the Euclidean distance between the input feature matrix F_matrix and the output of the Sigmoid layer in the reconstruction. The Sigmoid layer is an additional layer used for computing the reconstruction error: it compares the matrix reconstructed from the output vector against the input matrix. The final loss function L is given by equation (4):
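The formula for equation (4) is not reproduced in this text. Since the description says the two losses are accumulated and names no weighting factor, an unweighted sum is the simplest reading, sketched below as an assumption rather than as the original equation. The symbol for the Sigmoid layer output is introduced here for illustration only.

```latex
L = \sum_{k=1}^{C_k} L_k + L_R,
\qquad
L_R = \left\lVert F_{matrix} - \hat{F}_{matrix} \right\rVert_2
```

Here \hat{F}_{matrix} stands for the output of the Sigmoid layer in the reconstruction. Some capsule network implementations scale the reconstruction term by a small factor so that it does not dominate the margin loss; the text does not mention such a factor.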

where C_k is the number of classes (that is, the two classes of malicious and non-malicious users), here C_k = 2; L_k is the margin loss of each capsule k; and L_R is the reconstruction error. The neural network is trained with the objective of minimizing the loss function L.

After training is finished, the internal threat detection model based on feature fusion is obtained and put into use for detecting internal threats: at detection time, the obtained feature matrix is input into the model, which outputs the detection result.

The internal threat detection model is evaluated as follows. The data set is the insider threat test data set released by the CERT (Computer Emergency Response Team) division of Carnegie Mellon University. It contains the login logs, mobile device usage records, file operation logs, e-mail logs and web browsing records of 1000 users collected over 17 months, including 70 internal threat users.

The training of the model mainly learns the parameters of the Capsule neural network through a training data set, and when the loss function obtains the minimum value, the parameters of the Capsule neural network are the optimal parameters, so that the internal threat detection model based on feature fusion is obtained.

The model is evaluated using four metrics: Accuracy, F-measure, AUC and Recall. Accuracy directly reflects the overall performance of the model. The F-measure is the harmonic mean of precision and recall. Precision and recall are usually in tension: in general, the higher the precision, the lower the recall, and when the recall is high, the precision is often low. The F-measure balances the two: if one of precision and recall is high and the other is low, the F-measure is also low, and it is high only when both are high, which avoids extreme cases and allows the model to be evaluated more accurately. AUC is the area under the receiver operating characteristic curve and is commonly used to evaluate binary classification models. Recall is the ratio of the number of correctly predicted positive samples to the total number of actual positive samples, and reflects the model's ability to detect internal threat users.
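The four metrics can be computed as in the sketch below, where label 1 denotes an internal threat user. The labels, scores and the 0.5 decision threshold are hypothetical and are not taken from the experiment. AUC is computed here by the rank-based definition (the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counted as one half), which equals the area under the ROC curve; the function assumes both classes are present.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F-measure for binary labels (1 = threat)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure}


def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and model scores for ten users.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1, 0.1, 0.05, 0.2, 0.3]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

metrics = classification_metrics(y_true, y_pred)
area = auc(y_true, scores)
```

In this toy example the accuracy is 0.8 while precision, recall and F-measure are each 2/3, which illustrates why accuracy alone can look favorable on imbalanced data such as a user population with few threat users.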

In the evaluation experiment, the data set is divided into a training set and a test set according to the proportion of 6: 0.980, 0.958, 0.933 and 0.867, which are all higher than those of internal threat detection methods based on machine learning, such as logistic regression, support vector machine (SVM) and random forest, and on deep learning, such as convolutional neural network (CNN) and graph convolutional network (GCN). This demonstrates the effectiveness of the internal threat detection method based on feature fusion in the field of internal threat detection.

Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail by using examples, it should be understood by those skilled in the art that modifications or equivalent substitutions can be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention, which should be covered in the claims of the present invention.

Claims

1. An internal threat detection method based on feature fusion is characterized by comprising the following steps:

collecting multi-source user behavior logs in an internal network, analyzing by taking a user as a unit, and forming an independent multi-source user behavior log record for each user;

constructing a user login behavior association graph by using login logs in a multi-source user behavior log record of a user, randomly walking neighbor nodes of each user node to generate a plurality of random walk sequences with fixed lengths, wherein each sequence is formed by arranging the walk nodes in a front-back sequence, and the structural characteristics of the user behavior are extracted aiming at each node in the sequence;

fusing the extracted statistical characteristics of the user behaviors with the structural characteristics to form a characteristic matrix;

processing the training data containing the internal threat labels through the steps to obtain a feature matrix, inputting the feature matrix into a Capsule neural network for training to obtain an internal threat detection model based on feature fusion; when the threat in the internal network is formally detected, a multi-source user behavior log in the internal network is obtained, a feature matrix is obtained through the steps, and the feature matrix is input into the internal threat detection model to detect the threat in the internal network.

2. The method of claim 1, wherein the multi-source user behavior log comprises mobile device usage records, file operation logs, email logs, and web browsing records in addition to a log of logins.

3. The method of claim 1, wherein the behavior information of the user comprises frequency-based user behavior statistics and content-based user behavior statistics; the frequency-based user behavior statistical characteristics are average counts of different types of user behaviors, and the content-based user behavior statistical characteristics are content generated based on user behavior operation.

4. The method of claim 3, wherein the frequency-based user behavior statistical characteristics include one or more of a number of user logins and logouts, a number of user logins and logouts during off-duty hours, a number of computers that the user logs in to, a number of device connections during off-duty hours, a number of computers to which the device is connected, a number of transmissions of different files, a total number of file transmissions, a number of files transmitted during off-duty hours, a number of executable file transmissions, a number of computers involved in file transmissions, a number of outgoing e-mail messages, a number of e-mail messages sent to the interior of the organization, a number of e-mail messages sent to the exterior of the organization, an average size of e-mail messages, a number of e-mail recipients, a number of e-mail senders, a number of computers used to receive e-mails, a number of web pages viewed, and a number of web pages viewed during off-duty hours;

5. The method of claim 1, wherein the user login behavior association graph is a homogeneous graph composed of different users and the associations between different users, and is denoted as G = (V, E), where V represents a set of nodes in the graph, and each node represents a user; and E represents a set of edges in the graph, each edge represents an association relationship existing between two corresponding users, and the association relationship comprises the fact that the two users log in to the same device.

6. The method of claim 1, wherein the structural features of user behavior are extracted for each node in the sequence as follows: a plurality of fixed-length random walk sequences are generated to form a node sequence set S; for each node u, a node embedding vector is learned with a Skip-gram model by maximizing the loss function f over the set S, wherein the node embedding vector is the structural feature of the user behavior.

7. The method of claim 1, wherein the statistical and structural features of user behavior are fused as follows: the statistical features and structural features of the user behaviors are concatenated, the feature values are normalized to the range 0-1 using Min-Max normalization, and the result is converted into a two-dimensional feature matrix.

8. The method of claim 1, wherein during training of the Capsule neural network, the reconstruction error loss and the margin loss are summed to obtain a final loss function, and the final loss function is used to adjust parameters of the Capsule neural network to obtain the internal threat detection model; the reconstruction error loss is the Euclidean distance between the feature matrix and the output of a Sigmoid layer in the reconstruction, wherein the reconstruction uses a decoder structure built on the DigitCaps digit capsule layer of the Capsule neural network, a masking method retains only the activation vector of the correct digit capsule, and the activation vector is then used for reconstruction to obtain the reconstruction error loss.