CN106778259B - Abnormal behavior discovery method and system based on big data machine learning

Info

Publication number: CN106778259B
Application number: CN201611232408.7A
Authority: CN
Inventors: 李学进; 王志海; 魏力; 喻波; 何晋昊; 蒲鹏飞
Original assignee: Beijing Wondersoft Technology Co Ltd
Current assignee: Beijing Wondersoft Technology Co Ltd
Priority date: 2016-12-28
Filing date: 2016-12-28
Publication date: 2020-01-10
Anticipated expiration: 2036-12-28
Also published as: CN106778259A

Abstract

The invention discloses an abnormal behavior discovery method and system based on big data machine learning. The method comprises the following steps: preprocessing original security log data; extracting feature data from the preprocessed results; clustering the feature data and determining an abnormal behavior library and a normal behavior library; acquiring new behavior sample data from a new security log, comparing the sample data with the normal behavior library and the abnormal behavior library to determine whether the sample data is normal behavior or abnormal behavior, and updating the normal behavior library or the abnormal behavior library with the new behavior sample data; and repeating the previous step until the normal behavior library and the abnormal behavior library hold enough sample data of normal and abnormal behaviors, then training a random forest model on the sample data in the two libraries and judging abnormal behaviors with the trained random forest model. This scheme solves the problem that too few labeled samples are available in the initial stage, improves judgment accuracy, and effectively prevents misjudgment.

Description

Abnormal behavior discovery method and system based on big data machine learning

Technical Field

The invention relates to the field of data security, in particular to an abnormal behavior discovery method and system based on big data machine learning.

Background

Traditional network security and data security technologies, such as software and hardware firewalls, generally adopt a 'fence-type' protection strategy: they impose many artificial restrictions on the network and application systems, and every data access must be filtered through all preset rules, which degrades the user experience and increases the operating load of the system. In addition, generating the built-in rules of existing security software generally requires multiple stages, including vulnerability discovery, attack simulation, message analysis, feature extraction, and rule generation. As attack methods are continuously updated, this rule generation process must be repeated again and again, consuming a large amount of labor. More importantly, traditional protection cannot handle big data. On this basis, an abnormal behavior discovery method based on big data machine learning is proposed, which changes passive defense into active inspection, relaxes user access, strengthens behavior monitoring, and replaces manual work with machines.

Fig. 1 shows a prior-art process for discovering abnormal behaviors of a management user based on big data log analysis, which specifically includes:

(1) Storing the logs to be analyzed in a log pool.

(2) Connecting the log pool to the preprocessing module through the interface module.

(3) Connecting the preprocessing module to an analysis module, manually carrying out statistical analysis, and establishing rules.

(4) Judging the behavior logs according to the established rules, and storing the logs judged to be abnormal behavior in a knowledge base.

(5) Connecting the visualization module to the service module, with the visualization module displaying the abnormal behavior tracks found by log analysis on a user interface.

The prior art has the following defects:

(1) The data source is single: only logs are analyzed.

(2) Abnormal behaviors and users cannot be determined in real time.

(3) Everything relies on manual statistical analysis, which is costly and prone to misjudging behaviors.

Therefore, the following technical problems need to be solved:

(1) Receiving, storing, processing, and mining structured, semi-structured, and unstructured data.

(2) Replacing manual work with machine learning modeling, so that judgment accuracy is improved and labor cost is saved. In addition, the trained model can be used for both batch offline behavior judgment and online quasi-real-time behavior judgment.

(3) Identifying abnormal behaviors without relying on a strong security rule base preset by the system, and instead improving continuously in a self-adaptive manner.

Disclosure of Invention

In order to solve the above technical problems, the invention provides an abnormal behavior discovery method based on big data machine learning, which comprises the following steps:

1) preprocessing original security log data;

2) extracting feature data from the preprocessed results;

3) clustering the feature data, determining each behavior sample in the original security log data to be an abnormal behavior sample or a normal behavior sample, and putting it into an abnormal behavior library or a normal behavior library accordingly;

4) acquiring new behavior sample data from a new security log, and comparing the sample data with the normal behavior library and the abnormal behavior library to determine whether the sample data is normal behavior or abnormal behavior;

5) updating the normal behavior library or the abnormal behavior library with the new behavior sample data;

6) when the normal behavior library and the abnormal behavior library hold enough sample data of normal and abnormal behaviors, jumping to step 7); otherwise, jumping to step 4);

7) training a random forest model with the sample data in the normal behavior library and the abnormal behavior library, deploying the trained random forest model in both a real-time processing module and an offline processing module to judge abnormal behavior in subsequent new behavior sample data, and jumping to step 5).
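The control flow of steps 4) to 7) can be sketched as follows. This is only an illustration under stated assumptions, not the patented implementation: the names `BehaviorLibraries`, `run_bootstrap`, and `MIN_SAMPLES_PER_CLASS` are hypothetical, the source gives no number for "enough" samples, and the comparison and training routines are passed in as plain callables.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional, Sequence, Tuple

Vector = Sequence[float]

# Hypothetical threshold for "enough" samples; the source gives no number.
MIN_SAMPLES_PER_CLASS = 3


@dataclass
class BehaviorLibraries:
    normal: List[Vector] = field(default_factory=list)
    abnormal: List[Vector] = field(default_factory=list)

    def update(self, sample: Vector, is_abnormal: bool) -> None:
        # Step 5): file the judged sample into the matching library.
        (self.abnormal if is_abnormal else self.normal).append(sample)

    def has_enough(self) -> bool:
        # Step 6): both libraries must hold enough labeled samples.
        return (len(self.normal) >= MIN_SAMPLES_PER_CLASS
                and len(self.abnormal) >= MIN_SAMPLES_PER_CLASS)


def run_bootstrap(
    stream: Iterable[Vector],
    libs: BehaviorLibraries,
    compare: Callable[[Vector, BehaviorLibraries], bool],
    train: Callable[[BehaviorLibraries], Callable[[Vector], bool]],
) -> List[Tuple[Vector, bool, str]]:
    """Judge by library comparison until a model can be trained, then by model."""
    model: Optional[Callable[[Vector], bool]] = None
    results = []
    for sample in stream:
        if model is None:
            is_abnormal = compare(sample, libs)   # step 4)
            judged_by = "library"
        else:
            is_abnormal = model(sample)           # judgment by the trained model
            judged_by = "model"
        libs.update(sample, is_abnormal)          # step 5)
        results.append((sample, is_abnormal, judged_by))
        if model is None and libs.has_enough():   # step 6)
            model = train(libs)                   # step 7)
    return results
```

Because every judged sample is written back to a library, the libraries keep growing after the model takes over, which matches the jump from step 7) back to step 5).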

Preferably, the feature data extracted in step 2) includes: the time at which the user uses the terminal, the operation behavior category, and the operation file type; the extracted feature data is then vectorized.

Preferably, step 3) includes clustering the feature data using MLlib, specifically: determining K cluster centers using the Canopy algorithm and then carrying out K-Means clustering; after clustering, marking any class that contains fewer instances than a certain threshold, or clearly fewer instances than the other classes, as an abnormal class and marking the instances in it as abnormal behaviors; and marking the other classes as normal classes, with their instances marked as normal behaviors.
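The Canopy pre-pass that fixes K can be sketched in plain Python as below. This is a minimal single-machine sketch, not the MLlib-based implementation the text refers to; the function name `canopy` and the thresholds `t1` (loose) and `t2` (tight) are assumptions, since the source does not state how the thresholds are chosen.

```python
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]


def euclidean(a: Point, b: Point) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy(points: List[Point], t1: float, t2: float) -> List[Tuple[Point, List[Point]]]:
    """Return (center, members) pairs; the number of pairs is the K for K-Means.

    t1 is the loose threshold (canopy membership) and t2 the tight threshold
    (points this close to a center can no longer become centers themselves).
    """
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)
        # Every point within t1 of the center belongs to this canopy.
        members = [p for p in points if euclidean(center, p) <= t1]
        canopies.append((tuple(center), members))
        # Points within t2 are dropped from the candidate-center list.
        remaining = [p for p in remaining if euclidean(center, p) > t2]
    return canopies
```

The centers returned here can be handed to K-Means as initial centers, so that K does not have to be guessed in advance.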

Preferably, step 4) includes: randomly extracting a part of the sample data from the normal behavior library for use by a KNN algorithm to find abnormal behaviors, wherein if the distances between the new behavior sample data and the randomly extracted sample data are all greater than a set threshold, the behavior of the new behavior sample data is abnormal behavior, and otherwise it is normal behavior; if behavior flagged as abnormal is manually judged to be normal, it is treated as normal behavior; and updating the normal behavior library or the abnormal behavior library with the sample data corresponding to the normal behavior or the abnormal behavior, respectively.
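The distance test described above can be sketched as follows. The function names, the `manual_review` callback, and the use of Euclidean distance at this point are assumptions made for illustration; the threshold and sample size are free parameters that the source does not specify.

```python
import math
import random
from typing import Callable, List, Optional, Sequence

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def is_abnormal(new_sample: Vector, normal_library: List[Vector],
                threshold: float, sample_size: int, rng: random.Random) -> bool:
    """Abnormal if the new sample is farther than `threshold` from every
    randomly drawn reference sample of the normal behavior library."""
    reference = rng.sample(normal_library, min(sample_size, len(normal_library)))
    # Note: with an empty library this returns True (nothing is known to be normal).
    return all(euclidean(new_sample, ref) > threshold for ref in reference)


def judge_and_update(new_sample: Vector, normal_library: List[Vector],
                     abnormal_library: List[Vector], threshold: float,
                     sample_size: int, rng: random.Random,
                     manual_review: Optional[Callable[[Vector], bool]] = None) -> bool:
    """Judge one sample, let a reviewer overrule an abnormal verdict, update libraries."""
    abnormal = is_abnormal(new_sample, normal_library, threshold, sample_size, rng)
    if abnormal and manual_review is not None and manual_review(new_sample):
        abnormal = False  # the reviewer judged the flagged behavior to be normal
    (abnormal_library if abnormal else normal_library).append(new_sample)
    return abnormal
```

Sampling only part of the normal library keeps each comparison cheap, at the cost of occasionally flagging a normal sample whose nearest neighbors were not drawn.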

Preferably, the real-time processing module provides streaming computing capability, performs user behavior judgment in a quasi-real-time manner, and stores the judgment results in a high-performance database that provides real-time data service to users;

the batch processing module provides batch processing capability for massive data and is used for model training and batch offline judgment; the batch processing module comprises a plurality of scheduled tasks, processes data sets in full or incremental mode, and stores the judgment results in the high-performance database.

In order to solve the above technical problem, the present invention provides an abnormal behavior discovery system based on big data machine learning, including:

the preprocessing module is used for preprocessing the original security log data;

the feature data extraction module is used for extracting feature data from the preprocessed result;

the clustering module is used for clustering the feature data, determining each behavior sample in the original security log data to be an abnormal behavior sample or a normal behavior sample, and putting it into an abnormal behavior library or a normal behavior library accordingly;

the behavior library generation module is used for acquiring new behavior sample data from a new security log, and comparing the sample data with the normal behavior library and the abnormal behavior library to determine whether the sample data is normal behavior or abnormal behavior;

the updating module is used for updating the normal behavior library or the abnormal behavior library with the new behavior sample data;

and the behavior judgment module is used for training a random forest model with the sample data in the normal behavior library and the abnormal behavior library, deploying the trained random forest model in both the real-time processing module and the offline processing module, and judging abnormal behavior in subsequent new behavior sample data.

Preferably, the extracted feature data includes: the time at which the user uses the terminal, the operation type, and the operation file type; the extracted feature data is then vectorized.

Preferably, the clustering module clusters the feature data using MLlib, specifically: determining K cluster centers using the Canopy algorithm and then carrying out K-Means clustering; after clustering, marking any class that contains fewer instances than a certain threshold, or clearly fewer instances than the other classes, as an abnormal class and marking the instances in it as abnormal behaviors; and marking the other classes as normal classes, with their instances marked as normal behaviors.
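The K-Means stage and the size-based labeling performed by the clustering module could look roughly like this. It is a plain-Python sketch rather than the MLlib call the text refers to; `kmeans`, `label_by_cluster_size`, and the `min_size` threshold are hypothetical names, and the initial centers are assumed to come from an earlier step such as a Canopy pass.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Point = Sequence[float]


def squared_distance(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[Point], initial_centers: List[Point],
           max_iter: int = 50) -> Tuple[List[int], List[List[float]]]:
    """Lloyd's algorithm: returns the cluster index of each point and the centers."""
    centers = [list(c) for c in initial_centers]
    assignment: List[int] = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_assignment = [
            min(range(len(centers)), key=lambda k: squared_distance(p, centers[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each center to the mean of its members.
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # an empty cluster keeps its previous center
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers


def label_by_cluster_size(assignment: List[int], min_size: int) -> List[str]:
    """Instances in clusters smaller than min_size are labeled abnormal."""
    sizes = Counter(assignment)
    return ["abnormal" if sizes[a] < min_size else "normal" for a in assignment]
```

The labels produced this way are only a starting point: they are what the later manual review and classification stages refine.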

Preferably, the behavior library generation module randomly extracts a part of the sample data in the normal behavior library for use by the KNN algorithm to find abnormal behavior; if the distances between the new behavior sample data and the randomly extracted sample data are all greater than a set threshold, the behavior of the new behavior sample data is abnormal behavior, and otherwise it is normal behavior; if behavior flagged as abnormal is manually judged to be normal, it is treated as normal behavior; and the normal behavior library or the abnormal behavior library is updated with the sample data corresponding to the normal behavior or the abnormal behavior, respectively.

In order to solve the above technical problem, the present invention provides an abnormal behavior processing system based on big data machine learning, which includes: the system comprises a data service module, a real-time processing module and a batch processing module;

the data service module forms a normal behavior library and an abnormal behavior library based on the method;

training a random forest model by using sample data in the normal behavior library and the abnormal behavior library, and respectively deploying the random forest model obtained by training in a real-time processing module and an offline processing module;

after new sample data is input into the system, it is duplicated into two identical copies, which are input into the real-time processing module and the offline processing module respectively so as to judge abnormal behavior in the sample data;

the real-time processing module provides streaming computing capability, judges user behaviors in a quasi-real-time mode, and stores a judgment result into a high-performance database providing real-time data service for a user;

The technical scheme of the invention achieves the following technical effects:

1. The problem that too few labeled samples are available in the initial stage is solved.

2. A machine learning algorithm replaces manual work, which saves labor and time, improves judgment accuracy, and effectively prevents misjudgment.

3. The operation of the platform is not only a process of discovering abnormal behavior but also a process of self-adjustment and continuous improvement: the identification of abnormal behavior no longer depends on a strong security rule base preset by the system, but improves continuously in a self-adaptive manner.

Drawings

FIG. 1 is a flow chart of user abnormal behavior discovery in the prior art

FIG. 2 is a general flow chart of the present invention

FIG. 3 is a general architecture diagram of the system of the present invention

FIG. 4 is a flow chart of an embodiment of the present invention

Detailed Description

Explanation of terms:

Hadoop: a distributed system infrastructure whose core components are HDFS and MapReduce. HDFS provides storage for massive data, and MapReduce provides computation over massive data.

Spark: a general-purpose parallel computing framework similar to Hadoop MapReduce. Unlike MapReduce, the intermediate output of a job can be kept in memory, which makes computation faster and better suited to iterative algorithms such as those used in data mining and machine learning.

Lambda architecture: a real-time big data processing architecture proposed by Nathan Marz. It combines offline computation and real-time computation, incorporates architectural principles such as immutability, read/write separation, and complexity isolation, and can integrate various big data components such as Hadoop and Spark.

Sqoop: a big data component used for transferring data between the big data platform and traditional relational databases.

MLlib: Spark's machine learning library.

Canopy: an unsupervised clustering algorithm, mainly used to determine the number of clusters.

KMeans: the K-Means algorithm, an unsupervised clustering algorithm.

KNN: the K-nearest neighbor algorithm, a supervised classification algorithm.

Random Forest: an algorithm that trains on and predicts samples using multiple decision trees; it is a supervised classification algorithm.

Fig. 2 shows the abnormal behavior discovery flow chart of the present invention.

(1) Preprocessing raw data

The original data is cleaned, converted, and extracted.

(2) Feature engineering

Features that are representative of the pre-processed raw data are derived from experience and analysis.

(3) Clustering with MLlib to obtain samples

First, K cluster centers are determined using the Canopy algorithm, and then K-Means clustering is carried out. After clustering, any class that contains too few instances, or clearly fewer than the other classes, is marked as an abnormal class; the instances in it are marked as abnormal behaviors, and the instances in the other classes are marked as normal behaviors.

(4) Manually studying, judging and updating behavior library

Clustering attaches a label to each instance. The instances labeled as abnormal behavior are then manually studied and judged; data manually confirmed to be abnormal behavior is stored in the abnormal behavior library, and the rest is put into the normal behavior library. In the early stage the number of samples is small, so manual study and judgment is used to improve sample quality; once a certain number of samples has accumulated, manual study and judgment is no longer performed.

(5) Classification using MLlib, updating of behavior library and training of models

The samples are first given a preliminary classification using the KNN algorithm and the behavior library is updated; a RandomForest model is then trained on the samples in the behavior library. Once trained, the model is combined with a manually formulated rule base and used for quasi-real-time behavior judgment and batch offline behavior judgment.

The clustering in step (3) is unsupervised learning and needs no labeled sample data, whereas classification is supervised learning and needs labeled samples; the output of the clustering is used as the input of the classification, which improves judgment accuracy.

(6) The results of the real-time behavior judgment and the batch offline behavior judgment are stored in the behavior library, so the behavior library is continuously updated and improved.

Fig. 3 is a system architecture diagram of the present invention.

The system borrows from the Lambda architecture and is divided into a real-time processing layer, a batch processing layer, and a data service layer. After the original data is ingested into the platform, it is duplicated into two copies, which enter the real-time processing layer and the batch processing layer respectively.

The real-time processing layer provides streaming computing capability, performs user behavior judgment in a quasi-real-time manner, and stores the judgment results in a high-performance database that provides real-time data service to users.

The batch processing layer provides batch processing capability for massive data and is used for model training and batch offline judgment. It comprises a plurality of scheduled tasks, processes data sets in full or incremental mode, and stores the judgment results in the database.
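The duplication of each incoming record into a real-time path and a batch path can be illustrated with a small in-process sketch. Everything here is a stand-in chosen for illustration: the class `LambdaDispatcher` is hypothetical, a plain dictionary plays the role of the high-performance database, a deque plays the role of the batch layer's storage, and `run_batch_job` represents one incremental scheduled task.

```python
from collections import deque
from typing import Callable, Deque, Dict, Sequence, Tuple

Vector = Sequence[float]


class LambdaDispatcher:
    """Toy model of the two processing layers sharing one serving store."""

    def __init__(self, model: Callable[[Vector], bool],
                 serving_db: Dict[str, Dict[str, bool]]) -> None:
        self.model = model              # trained classifier, True means abnormal
        self.serving_db = serving_db    # stands in for the high-performance database
        self.batch_buffer: Deque[Tuple[str, Vector]] = deque()

    def ingest(self, record_id: str, features: Vector) -> None:
        # Duplicate the record: one copy per layer.
        realtime_copy = tuple(features)
        batch_copy = tuple(features)
        # Real-time layer: judge immediately and store the verdict.
        self.serving_db.setdefault(record_id, {})["realtime"] = self.model(realtime_copy)
        # Batch layer: only buffer the copy; a scheduled task judges it later.
        self.batch_buffer.append((record_id, batch_copy))

    def run_batch_job(self) -> int:
        """Incremental task: judge everything buffered since the last run."""
        processed = 0
        while self.batch_buffer:
            record_id, features = self.batch_buffer.popleft()
            self.serving_db.setdefault(record_id, {})["batch"] = self.model(features)
            processed += 1
        return processed
```

In a real deployment the two layers would run on separate streaming and batch engines; the sketch only shows that both receive the same data and write to the same serving store.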

Fig. 4 is an abnormal behavior discovery embodiment of the present invention.

1. Data preprocessing

The security control terminal logs are stored in a traditional database and have fields such as a unique device identifier, a unique user identifier, and the operation behavior. The data is imported into the data warehouse of the big data platform using Sqoop and then cleaned and converted: meaningless fields are removed and missing values are filled.
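The cleaning step can be sketched on a list of dictionaries as follows. The field names, the set `MEANINGLESS_FIELDS`, and the choice to fill gaps with the most frequent value of each column are all assumptions; the source does not say which fields are dropped or how missing values are filled.

```python
from collections import Counter
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]

# Hypothetical example of a field that carries no signal for behavior analysis.
MEANINGLESS_FIELDS = {"remark"}


def clean_logs(rows: List[Record]) -> List[Record]:
    """Drop meaningless fields and fill missing values with the column mode."""
    kept = [{k: v for k, v in row.items() if k not in MEANINGLESS_FIELDS}
            for row in rows]
    columns = sorted({k for row in kept for k in row})
    fill = {}
    for col in columns:
        present = [row[col] for row in kept if row.get(col) not in (None, "")]
        # Most common observed value, or a sentinel when the column is empty.
        fill[col] = Counter(present).most_common(1)[0][0] if present else "unknown"
    return [
        {col: (row[col] if row.get(col) not in (None, "") else fill[col])
         for col in columns}
        for row in kept
    ]
```

Filling with the mode is only one option; numeric columns might instead be filled with a median or a constant.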

2. Feature engineering

(1) For the security control terminal logs, the following features are extracted from the original data according to experience and statistical analysis:

① The time at which the user uses the security control terminal, expressed as the time period of the operation: morning, noon, or evening.

② The operation type, including supervision and reporting, sending mail externally, working off-site, and external communication.

③ The type of file operated on: office documents, compressed files, pictures.

④ The data traffic accessed by the operation.

⑤ The number of different terminals used, the number of IP changes, and the number of log-ins and log-outs.

(2) The features are vectorized to obtain data that can be processed by the machine learning model.
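One common way to vectorize features like those listed above is one-hot encoding for the categorical fields followed by the numeric counts. The sketch below assumes this encoding; the vocabularies, field names, and the function `vectorize` are hypothetical and only loosely follow the feature list.

```python
from typing import Dict, List, Union

# Hypothetical vocabularies derived from the feature list.
TIME_PERIODS = ["morning", "noon", "evening"]
OPERATION_TYPES = ["report", "send_mail", "work_offsite", "external_comm"]
FILE_TYPES = ["office_document", "compressed_file", "picture"]
# Hypothetical numeric fields: traffic and the various counts.
NUMERIC_FIELDS = ["traffic_bytes", "terminal_count", "ip_changes", "login_count"]


def one_hot(value: str, vocabulary: List[str]) -> List[float]:
    """1.0 at the position of `value`, 0.0 elsewhere (all zeros if unknown)."""
    return [1.0 if value == item else 0.0 for item in vocabulary]


def vectorize(record: Dict[str, Union[str, int, float]]) -> List[float]:
    """Turn one preprocessed log record into a fixed-length numeric vector."""
    vec: List[float] = []
    vec += one_hot(str(record["time_period"]), TIME_PERIODS)
    vec += one_hot(str(record["operation_type"]), OPERATION_TYPES)
    vec += one_hot(str(record["file_type"]), FILE_TYPES)
    vec += [float(record[name]) for name in NUMERIC_FIELDS]
    return vec
```

Because clustering and the distance test both rely on distances, unscaled numeric fields such as traffic would dominate the one-hot columns, so some form of scaling would likely be needed before these vectors are used.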

3. Modeling and decision

(1) Coarse clustering is carried out using the Canopy algorithm to obtain the number of clusters in the data set.

(2) High-precision clustering is then carried out using the K-Means method. After clustering, any class that contains too few instances, or clearly fewer than the other classes, is marked as an abnormal class; the instances in it are marked as abnormal behaviors, and the instances in the other classes are marked as normal behaviors. In the K-Means clustering result graph, classes that deviate clearly from the rest and contain few instances are marked as abnormal classes; once labeled, they are used for classification.

(3) A small-scale normal behavior library is generated by means of manual judgment.

The specific method is as follows: each clustered instance marked as abnormal is manually checked for abnormal operations; if any are found, the instance is marked as abnormal behavior, and all instances so marked form the abnormal behavior library.

(4) A part of the samples is randomly extracted from the normal behavior library for use by the KNN algorithm to find abnormal behaviors: if the Euclidean distance between a new behavior and every extracted sample instance is greater than a set threshold, the behavior is abnormal behavior; if that behavior is then manually judged to be normal, the normal behavior library is updated with it. In the KNN classification result graph, abnormal users are marked, but some users who are not abnormal are also marked as abnormal.

(5) When enough normal behavior data and abnormal behavior data are available, a random forest model is trained using the data as samples, and the trained model is deployed in both the real-time processing module and the offline processing module to judge abnormal behavior. In the RandomForest classification result graph, the number of wrongly marked users is clearly reduced.
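To make the random forest step concrete, the following is a deliberately small pure-Python forest: bootstrap sampling, a random feature subset per split, Gini-based thresholds, and majority voting. It is a teaching sketch under simplifying assumptions (thresholds are taken from observed values, trees are shallow, hyperparameters such as `n_trees` are arbitrary) and not the Spark-based model the text describes.

```python
import random
from collections import Counter
from typing import List, Optional, Sequence, Tuple

Row = Sequence[float]


def gini(labels: List[str]) -> float:
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows: List[Row], labels: List[str],
               features: List[int]) -> Optional[Tuple[int, float]]:
    """Return the (feature, threshold) that lowers weighted Gini most, or None."""
    best, best_score = None, gini(labels)
    for f in features:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, t), score
    return best


def build_tree(rows, labels, rng, max_depth, n_features):
    """A leaf is a label string; an inner node is (feature, threshold, left, right)."""
    majority = Counter(labels).most_common(1)[0][0]
    if max_depth == 0 or len(set(labels)) == 1:
        return majority
    features = rng.sample(range(len(rows[0])), n_features)  # random feature subset
    split = best_split(rows, labels, features)
    if split is None:
        return majority
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t,
            build_tree([r for r, _ in left], [y for _, y in left],
                       rng, max_depth - 1, n_features),
            build_tree([r for r, _ in right], [y for _, y in right],
                       rng, max_depth - 1, n_features))


def predict_tree(node, row: Row) -> str:
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node


def train_forest(rows: List[Row], labels: List[str], n_trees: int = 15,
                 max_depth: int = 4, seed: int = 0) -> list:
    rng = random.Random(seed)
    n_features = max(1, int(len(rows[0]) ** 0.5))
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # bootstrap sample
        forest.append(build_tree([rows[i] for i in idx], [labels[i] for i in idx],
                                 rng, max_depth, n_features))
    return forest


def predict_forest(forest: list, row: Row) -> str:
    """Majority vote over all trees."""
    return Counter(predict_tree(tree, row) for tree in forest).most_common(1)[0][0]
```

The rows passed to `train_forest` would be the vectors in the normal and abnormal behavior libraries, with the library a vector came from serving as its label.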

The terms "example", "instance", and "sample" used above all have the same meaning: each denotes one security management and control terminal log.

By the method, the problem that the number of samples containing the labels is too small in the initial stage is solved; the machine learning algorithm is used for replacing manpower, so that the labor cost and the time cost are saved, the judgment accuracy is improved, and the occurrence of misjudgment is effectively prevented; the operation process of the platform is not only an abnormal behavior discovery process, but also a self-adjustment and continuous improvement process, and the identification of the abnormal behavior does not depend on a strong safety rule base preset by the system any more, but is continuously self-perfected in a self-adaptive mode.

The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be protected within the protection scope of the present invention.

Claims

1. An abnormal behavior discovery method based on big data machine learning comprises the following steps:

1) preprocessing original safety log data;

2) extracting feature data from the preprocessed results;

4) acquiring new behavior sample data in the new security log, and comparing the sample data with the normal behavior library and the abnormal behavior library to determine that the sample data is a normal behavior or an abnormal behavior;

6) when the normal behavior library and the abnormal behavior library have sample data of normal behaviors and abnormal behaviors with the same quantity, jumping to the step 7), otherwise, jumping to the step 4);

7) training a random forest model by using sample data in the normal behavior library and the abnormal behavior library, respectively deploying the random forest models obtained through training in a real-time processing module and an offline processing module to judge the abnormal behavior of subsequent new behavior sample data, copying the sample data into two same sample data after inputting the new behavior sample data, and respectively inputting the sample data into the real-time processing module and the offline processing module;

the real-time processing module provides streaming computing capability, judges user behaviors in a quasi-real-time mode, and stores a judgment result into a high-performance database providing real-time data service for a user; the offline processing module provides batch processing capability of mass data and is used for training a model and performing batch offline judgment, the offline processing module comprises a plurality of timing tasks, data sets are processed in a full or incremental mode, and judgment results are stored in the high-performance database;

8) updating the normal behavior library or the abnormal behavior library by using the new sample behavior data in the step 7).

2. The method of claim 1, wherein the feature data extracted in step 2) comprises: the time of the user using the terminal, the operation behavior category and the operation file type; vectorizing the extracted feature data.

3. The method of claim 1, wherein step 3) comprises clustering the feature data using MLlib, specifically: determining K cluster centers using the Canopy algorithm and then carrying out K-Means clustering; after clustering, marking any class that contains fewer instances than a certain threshold as an abnormal class and marking the instances in it as abnormal behaviors; and marking the other classes as normal classes, with their instances marked as normal behaviors.

4. The method of claim 1, wherein step 4) comprises: randomly extracting a part of the sample data from the normal behavior library for use by a KNN algorithm to find abnormal behaviors, wherein if the distances between the new behavior sample data and the randomly extracted sample data are all greater than a set threshold, the behavior of the new behavior sample data is abnormal behavior, and otherwise it is normal behavior; if behavior flagged as abnormal is manually judged to be normal, it is treated as normal behavior; and updating the normal behavior library or the abnormal behavior library with the sample data corresponding to the normal behavior or the abnormal behavior, respectively.

5. An abnormal behavior discovery system based on big data machine learning, comprising:

the behavior library generation module is used for acquiring new behavior sample data in the new security log, and comparing the sample data with the normal behavior library and the abnormal behavior library to determine that the sample data is normal behavior or abnormal behavior;

the behavior judging module is used for training a random forest model by using sample data in the normal behavior library and the abnormal behavior library, deploying the random forest model obtained through training in the real-time processing module and the offline processing module respectively, judging abnormal behaviors by using subsequent new behavior sample data, copying the new behavior sample data into two same sample data after inputting the new behavior sample data, and inputting the two same sample data into the real-time processing module and the offline processing module respectively; the real-time processing module provides streaming computing capability, judges user behaviors in a quasi-real-time mode, and stores a judgment result into a high-performance database providing real-time data service for a user; the offline processing module provides batch processing capability of mass data and is used for training a model and performing batch offline judgment, the offline processing module comprises a plurality of timing tasks, data sets are processed in a full or incremental mode, and judgment results are stored in the high-performance database;

and the updating module updates the normal behavior library or the abnormal behavior library by using the new sample behavior data processed by the behavior judging module.

6. The system of claim 5, the extracted feature data comprising: time, operation type and operation file type of the user using the terminal; vectorizing the extracted feature data.

7. The system of claim 5, wherein the clustering module clusters the feature data using MLlib, specifically: determining K cluster centers using the Canopy algorithm and then carrying out K-Means clustering; after clustering, marking any class that contains fewer instances than a certain threshold as an abnormal class and marking the instances in it as abnormal behaviors; and marking the other classes as normal classes, with their instances marked as normal behaviors.

8. The system according to claim 5, wherein the behavior library generation module randomly extracts a part of the sample data in the normal behavior library for use by the KNN algorithm to find abnormal behavior; if the distances between the new behavior sample data and the randomly extracted sample data are all greater than a set threshold, the behavior of the new behavior sample data is abnormal behavior, and otherwise it is normal behavior; if behavior flagged as abnormal is manually judged to be normal, it is treated as normal behavior; and the normal behavior library or the abnormal behavior library is updated with the sample data corresponding to the normal behavior or the abnormal behavior, respectively.

9. An abnormal behavior processing system based on big data machine learning, the system comprising: the system comprises a data service module, a real-time processing module and an offline processing module;

the data service module forms a normal behavior library and an abnormal behavior library based on the method of any one of claims 1-4;

the offline processing module provides the batch processing capability of mass data and is used for training a model and batch offline judgment, the offline processing module comprises a plurality of timing tasks, data sets are processed in a full or incremental mode, and judgment results are stored in the high-performance database.