CN106778259B - Abnormal behavior discovery method and system based on big data machine learning - Google Patents

Abnormal behavior discovery method and system based on big data machine learning

Info

Publication number
CN106778259B
Authority
CN
China
Prior art keywords
behavior
data
abnormal
library
normal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201611232408.7A
Other languages
Chinese (zh)
Other versions
CN106778259A (en)
Inventor
李学进
王志海
魏力
喻波
何晋昊
蒲鹏飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Wondersoft Technology Co Ltd
Original Assignee
Beijing Wondersoft Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Wondersoft Technology Co Ltd
Priority to CN201611232408.7A
Publication of CN106778259A
Application granted
Publication of CN106778259B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/552 Detecting local intrusion or implementing counter-measures involving long-term monitoring or reporting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24147 Distances to closest patterns, e.g. nearest neighbour classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/57 Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
    • G06F21/577 Assessing vulnerabilities and evaluating computer system security
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/16 Threshold monitoring
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416 Event detection, e.g. attack signature detection
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1425 Traffic logging, e.g. anomaly detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Probability & Statistics with Applications (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses an abnormal behavior discovery method and system based on big data machine learning. The method comprises the following steps: preprocessing original security log data; extracting feature data from the preprocessed result; clustering the feature data to determine an abnormal behavior library and a normal behavior library; acquiring new behavior sample data from a new security log, comparing the sample data with the normal behavior library and the abnormal behavior library to determine whether it represents normal or abnormal behavior, and updating the normal behavior library or the abnormal behavior library with the new behavior sample data; and repeating the previous step until the normal behavior library and the abnormal behavior library contain enough sample data of normal and abnormal behaviors, then training a random forest model on the sample data in the two libraries and using the trained model to judge abnormal behaviors. This scheme solves the problem of having too few labeled samples in the initial stage, improves judgment accuracy, and effectively prevents misjudgment.

Description

Abnormal behavior discovery method and system based on big data machine learning
Technical Field
The invention relates to the field of data security, in particular to an abnormal behavior discovery method and system based on big data machine learning.
Background
Traditional network security and data security technologies, such as the various software and hardware firewalls, generally adopt a 'fence-type' protection strategy that artificially imposes many restrictions on the network and the application system. Every data access action must be filtered through all of the preset rules, which degrades the user experience and increases the operating load of the system. In addition, generating a built-in rule for existing security software generally requires multiple stages, such as vulnerability discovery, attack simulation, message analysis, feature extraction and rule generation. As attack methods are continuously updated, the rule generation process must be repeated again and again, consuming a large amount of labor. More importantly, traditional protection cannot handle big data. On this basis, an abnormal behavior discovery method based on big data machine learning is proposed, which changes passive defense into active examination, relaxes user access while strengthening behavior monitoring, and replaces manual work with machines.
Fig. 1 shows a prior-art process for discovering abnormal behaviors of management users based on big data log analysis, which specifically includes:
(1) Storing the logs to be analyzed in a log pool.
(2) Connecting the log pool to the preprocessing module through the interface module.
(3) Connecting the preprocessing module to an analysis module, where statistical analysis is carried out manually and rules are established.
(4) Judging the behavior logs according to the established rules, and storing the logs judged to be abnormal behavior in a knowledge base.
(5) Connecting the visualization module to the service module, so that the visualization module displays the abnormal behavior tracks found by the log analysis on a user interface.
The prior art has the following defects:
(1) The data source is single: only logs are analyzed.
(2) Abnormal behaviors and users cannot be determined in real time.
(3) Everything relies on manual statistical analysis, which is costly and prone to misjudging behaviors.
Therefore, the following technical problems need to be solved:
(1) Receiving, storing, processing and mining structured, semi-structured and unstructured data.
(2) Replacing manual work with machine learning modeling, which improves judgment accuracy and saves labor cost. In addition, the trained model can be used both for batch offline behavior judgment and for online quasi-real-time behavior judgment.
(3) Identifying abnormal behaviors without relying on a strong security rule base preset in the system, and instead improving continuously in a self-adaptive manner.
Disclosure of Invention
In order to solve the above technical problems, the invention provides an abnormal behavior discovery method based on big data machine learning, which comprises the following steps:
1) preprocessing original security log data;
2) extracting feature data from the preprocessed result;
3) clustering the feature data, determining each behavior sample in the original security log data to be an abnormal behavior sample or a normal behavior sample, and putting it into an abnormal behavior library or a normal behavior library accordingly;
4) acquiring new behavior sample data from a new security log, and comparing the sample data with the normal behavior library and the abnormal behavior library to determine whether it is normal behavior or abnormal behavior;
5) updating the normal behavior library or the abnormal behavior library with the new behavior sample data;
6) when the normal behavior library and the abnormal behavior library contain enough normal and abnormal behavior sample data, jumping to step 7); otherwise, jumping to step 4);
7) training a random forest model with the sample data in the normal behavior library and the abnormal behavior library, deploying the trained random forest model in a real-time processing module and an offline processing module respectively to judge abnormal behavior in subsequent new behavior sample data, and jumping to step 5).
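The control flow of steps 4) to 6) can be illustrated with a minimal Python sketch. It is not the patented implementation: the names BehaviorLibraries, bootstrap, judge and min_per_class are hypothetical, and "enough sample data" is assumed here to mean a fixed minimum count in each library.

```python
from dataclasses import dataclass, field


@dataclass
class BehaviorLibraries:
    """Hypothetical container for the normal and abnormal behavior libraries."""
    normal: list = field(default_factory=list)
    abnormal: list = field(default_factory=list)

    def add(self, sample, is_abnormal):
        # Step 5): update whichever library matches the judgment.
        (self.abnormal if is_abnormal else self.normal).append(sample)

    def ready_for_training(self, min_per_class):
        # Step 6): assumed criterion, a minimum count in both libraries.
        return (len(self.normal) >= min_per_class
                and len(self.abnormal) >= min_per_class)


def bootstrap(libs, new_samples, judge, min_per_class):
    """Repeat steps 4) to 6) until both libraries are large enough.

    judge(sample, libs) returns True for abnormal behavior (step 4).
    Returns the number of samples consumed before training may start.
    """
    consumed = 0
    for sample in new_samples:
        if libs.ready_for_training(min_per_class):
            break
        libs.add(sample, judge(sample, libs))
        consumed += 1
    return consumed
```

Once ready_for_training returns True, the caller would proceed to step 7) and train the classifier on the contents of the two libraries.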
Preferably, the feature data extracted in step 2) includes: the time at which the user uses the terminal, the operation behavior category and the operation file type; the extracted feature data is vectorized.
Preferably, step 3) includes clustering the feature data using MLlib, specifically: determining K cluster centers with the Canopy algorithm and then performing K-Means clustering; after clustering, marking any class that contains fewer instances than a certain threshold, or clearly fewer instances than the other classes, as an abnormal class and marking its instances as abnormal behaviors; and marking the other classes as normal classes and their instances as normal behaviors.
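As an illustration of this clustering step, the following is a minimal sketch in plain Python rather than Spark MLlib. It assumes a simplified Canopy pass that uses only the tight threshold (called t2 here) to pick initial centers, Euclidean distance, and a fixed min_size threshold for "too few instances"; all function and parameter names are hypothetical.

```python
import math
from collections import Counter


def canopy_centers(points, t2):
    """Simplified Canopy pass: take a point as a center, discard every
    point within distance t2 of it, and repeat until none remain."""
    remaining = list(points)
    centers = []
    while remaining:
        center = remaining[0]
        centers.append(center)
        remaining = [p for p in remaining if math.dist(p, center) > t2]
    return centers


def assign(points, centers):
    """Index of the nearest center for each point."""
    return [min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            for p in points]


def kmeans(points, centers, iterations=10):
    """K-Means seeded with the Canopy centers; returns cluster labels."""
    centers = list(centers)
    for _ in range(iterations):
        labels = assign(points, centers)
        for i in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == i]
            if members:  # keep the old center if a cluster is empty
                centers[i] = tuple(sum(d) / len(members) for d in zip(*members))
    return assign(points, centers)


def flag_small_clusters(labels, min_size):
    """True (abnormal) for every instance whose cluster has < min_size members."""
    sizes = Counter(labels)
    return [sizes[l] < min_size for l in labels]
```

The number of centers returned by canopy_centers plays the role of K, so K does not have to be chosen by hand before K-Means runs.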
Preferably, step 4) includes: randomly extracting a part of the sample data from the normal behavior library for use by a KNN algorithm to find abnormal behaviors, wherein if the distances between the new behavior sample data and the randomly extracted sample data are all greater than a set threshold, the new behavior sample data is an abnormal behavior, and otherwise it is a normal behavior; if a behavior judged abnormal is manually judged to be normal, it is treated as a normal behavior; and updating the normal behavior library or the abnormal behavior library with the sample data corresponding to the normal behavior or the abnormal behavior respectively.
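The distance test described here can be sketched as follows. This is a minimal illustration that assumes Euclidean distance; sample_size, threshold, rng and reviewer_says_normal are hypothetical parameters, and the manual review is reduced to a boolean flag.

```python
import math
import random


def is_abnormal(new_sample, normal_library, sample_size, threshold, rng=None):
    """Flag new_sample as abnormal when it is farther than `threshold`
    from every randomly drawn reference sample of the normal library."""
    rng = rng or random.Random()
    k = min(sample_size, len(normal_library))
    reference = rng.sample(normal_library, k)
    # Note: with an empty library, all() is True and the sample is flagged.
    return all(math.dist(new_sample, ref) > threshold for ref in reference)


def final_label(auto_abnormal, reviewer_says_normal=False):
    """Manual review can overturn an automatic 'abnormal' judgment."""
    return auto_abnormal and not reviewer_says_normal
```

Drawing only part of the library keeps each comparison cheap, at the price of occasionally missing a nearby normal sample; this may be one reason a manual correction step is kept for this stage.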
Preferably, the real-time processing module provides streaming computing capability, judges user behavior in a quasi-real-time manner, and stores the judgment result in a high-performance database that provides real-time data service to users;
the batch processing module provides batch processing capability for massive data and is used for model training and batch offline judgment; it comprises a plurality of scheduled tasks, processes the data set in full or incrementally, and stores the judgment results in the high-performance database.
In order to solve the above technical problems, the present invention provides an abnormal behavior discovery system based on big data machine learning, including:
a preprocessing module, used for preprocessing the original security log data;
a feature data extraction module, used for extracting feature data from the preprocessed result;
a clustering module, used for clustering the feature data, determining each behavior sample in the original security log data to be an abnormal behavior sample or a normal behavior sample, and putting it into an abnormal behavior library or a normal behavior library accordingly;
a behavior library generation module, used for acquiring new behavior sample data from a new security log, and comparing the sample data with the normal behavior library and the abnormal behavior library to determine whether it is normal behavior or abnormal behavior;
an updating module, used for updating the normal behavior library or the abnormal behavior library with the new behavior sample data;
and a behavior judgment module, used for training a random forest model with the sample data in the normal behavior library and the abnormal behavior library, deploying the trained random forest model in the real-time processing module and the offline processing module respectively, and judging abnormal behavior in subsequent new behavior sample data.
Preferably, the extracted feature data includes: the time at which the user uses the terminal, the operation type and the operation file type; the extracted feature data is vectorized.
Preferably, the clustering module clusters the feature data using MLlib, specifically: determining K cluster centers with the Canopy algorithm and then performing K-Means clustering; after clustering, marking any class that contains fewer instances than a certain threshold, or clearly fewer instances than the other classes, as an abnormal class and marking its instances as abnormal behaviors; and marking the other classes as normal classes and their instances as normal behaviors.
Preferably, the behavior library generation module randomly extracts a part of the sample data from the normal behavior library for use by the KNN algorithm to find abnormal behaviors; if the distances between the new behavior sample data and the randomly extracted sample data are all greater than a set threshold, the new behavior sample data is an abnormal behavior, and otherwise it is a normal behavior; if a behavior judged abnormal is manually judged to be normal, it is treated as a normal behavior; and the normal behavior library or the abnormal behavior library is updated with the sample data corresponding to the normal behavior or the abnormal behavior respectively.
In order to solve the above technical problems, the present invention provides an abnormal behavior processing system based on big data machine learning, which includes: a data service module, a real-time processing module and a batch processing module;
the data service module forms a normal behavior library and an abnormal behavior library based on the above method;
a random forest model is trained with the sample data in the normal behavior library and the abnormal behavior library, and the trained random forest model is deployed in the real-time processing module and the offline processing module respectively;
after new sample data is input into the system, it is duplicated into two identical copies, which are input into the real-time processing module and the offline processing module respectively so that abnormal behavior in the sample data can be judged;
the real-time processing module provides streaming computing capability, judges user behavior in a quasi-real-time manner, and stores the judgment result in a high-performance database that provides real-time data service to users;
the batch processing module provides batch processing capability for massive data and is used for model training and batch offline judgment; it comprises a plurality of scheduled tasks, processes the data set in full or incrementally, and stores the judgment results in the high-performance database.
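The duplication of each new sample into a real-time path and a batch path can be sketched as below. This is only an in-memory illustration: the class LambdaDispatcher, its method names, the plain dict standing in for the high-performance database, and the callable model are all assumptions, not components named by the invention.

```python
class LambdaDispatcher:
    """Hypothetical fan-out of incoming samples to two processing paths."""

    def __init__(self, model):
        self.model = model        # callable: features -> True if abnormal
        self.batch_buffer = []    # samples waiting for the scheduled batch task
        self.results = {}         # stand-in for the high-performance database

    def ingest(self, sample_id, features):
        # Duplicate the sample so the two paths never share mutable state.
        realtime_copy, batch_copy = list(features), list(features)
        # Real-time path: judge immediately and store the result.
        self.results[sample_id] = self.model(realtime_copy)
        # Batch path: queue the copy for the next scheduled run.
        self.batch_buffer.append((sample_id, batch_copy))

    def run_batch(self):
        """Scheduled task: judge all queued samples and store the results."""
        for sample_id, features in self.batch_buffer:
            self.results[sample_id] = self.model(features)
        processed = len(self.batch_buffer)
        self.batch_buffer.clear()
        return processed
```

In this sketch both paths write to the same result store, so a later batch run simply overwrites the earlier quasi-real-time judgment for the same sample.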
The technical scheme of the invention achieves the following technical effects:
1. The problem of having too few labeled samples in the initial stage is solved.
2. A machine learning algorithm replaces manual work, which saves labor and time cost, improves judgment accuracy, and effectively prevents misjudgment.
3. The operation of the platform is both an abnormal behavior discovery process and a process of self-adjustment and continuous improvement; the identification of abnormal behavior no longer depends on a strong security rule base preset in the system, but improves continuously in a self-adaptive manner.
Drawings
FIG. 1 is a flow chart of user abnormal behavior discovery in the prior art
FIG. 2 is a general flow chart of the present invention
FIG. 3 is a general architecture diagram of the system of the present invention
FIG. 4 is a flow chart of an embodiment of the present invention
Detailed Description
Explanation of terms:
Hadoop: a distributed system infrastructure whose core components are HDFS and MapReduce. HDFS provides storage for massive data, and MapReduce provides computation over massive data.
Spark: a general-purpose parallel computing framework similar to Hadoop MapReduce. Unlike MapReduce, the intermediate output of a job can be kept in memory, so computation is faster and better suited to iterative algorithms such as those used in data mining and machine learning.
Lambda architecture: a real-time big data processing architecture proposed by Nathan Marz. It integrates offline computation and real-time computation, incorporates a series of architectural principles such as immutability, read-write separation and complexity isolation, and can integrate various big data components such as Hadoop and Spark.
Sqoop: a big data component used for transferring data between a big data platform and a traditional relational database.
MLlib: Spark's machine learning library.
Canopy: an unsupervised clustering algorithm, mainly used for determining the number of clusters.
KMeans: the K-means algorithm, an unsupervised clustering algorithm.
KNN: the K-nearest neighbor algorithm, a supervised classification algorithm.
Random Forest: an algorithm that trains on and predicts samples using multiple decision trees; it is a supervised classification algorithm.
Fig. 2 shows the abnormal behavior discovery flow of the present invention.
(1) Preprocessing the raw data
The raw data is cleaned, converted and extracted.
(2) Feature engineering
Features that are representative of the preprocessed raw data are derived from experience and analysis.
(3) Clustering with MLlib to obtain samples
First, K cluster centers are determined with the Canopy algorithm, and then K-Means clustering is performed. After clustering, any class that contains too few instances, or clearly fewer than the other classes, is marked as an abnormal class; the instances in that class are marked as abnormal behaviors, and the instances in the other classes are marked as normal behaviors.
(4) Manual review and updating of the behavior library
Clustering assigns a label to each instance; the instances labeled as abnormal behavior are then reviewed manually. Data manually judged to be abnormal behavior is stored in the illegal behavior library, and the rest is put into the normal behavior library. In the early stage the number of samples is small, so manual review is used to improve sample quality; once a certain number of samples has accumulated, manual review is no longer performed.
(5) Classification with MLlib, updating of the behavior library, and model training
The samples are first classified with the KNN algorithm and the behavior library is updated; a RandomForest model is then trained on the samples in the behavior library. Once trained, the model is combined with a manually formulated rule library and used for quasi-real-time behavior judgment and batch offline behavior judgment.
The clustering in step (3) is unsupervised learning and requires no labeled samples, whereas classification is supervised learning and requires samples; using the output of the clustering as the input of the classification improves judgment accuracy.
(6) The results of real-time behavior judgment and batch offline behavior judgment are stored in the behavior library, so that the behavior library is continuously updated and improved.
Fig. 3 is the system architecture diagram of the present invention.
The system draws on the Lambda architecture and is divided into a real-time processing layer, a batch processing layer and a data service layer. After being ingested into the platform, the raw data is duplicated into two copies, which enter the real-time processing layer and the batch processing layer respectively.
The real-time processing layer provides streaming computing capability, judges user behavior in a quasi-real-time manner, and stores the judgment results in a high-performance database that provides real-time data service to users.
The batch processing layer provides batch processing capability for massive data and is used for model training and batch offline judgment. It comprises a plurality of scheduled tasks, processes the data set in full or incrementally, and stores the judgment results in the database.
Fig. 4 shows an abnormal behavior discovery embodiment of the present invention.
1 Data preprocessing
The security management and control terminal logs are stored in a traditional database and contain fields such as the unique device identifier, the unique user identifier and the operation behavior. The data is imported into the data warehouse of the big data platform using Sqoop and is then cleaned and converted: meaningless fields are removed and missing values are filled.
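The cleaning step (removing meaningless fields and filling missing values) can be illustrated with a minimal sketch over already-imported rows. The field names and default values below are hypothetical; the embodiment does not specify which fields are dropped or how missing values are filled.

```python
def preprocess(rows, keep_fields, defaults):
    """Keep only the listed fields and fill missing values with defaults.

    rows        : list of dicts, one per log record
    keep_fields : names of the fields considered meaningful (assumed)
    defaults    : fill value per field for None or empty entries (assumed)
    """
    cleaned = []
    for row in rows:
        record = {}
        for name in keep_fields:
            value = row.get(name)
            if value is None or value == "":
                value = defaults[name]  # fill the missing value
            record[name] = value
        cleaned.append(record)
    return cleaned
```

For example, a record lacking a user identifier would receive the default configured for that field, while any field not listed in keep_fields is discarded.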
2 Feature engineering
(1) For the security management and control terminal logs, the following features are extracted from the raw data based on experience and statistical analysis:
① The time at which the user uses the security management and control terminal, i.e. the time period of the operation: morning, noon or evening.
② The operation type, including supervision and reporting, sending outbound mail, working off-site and external communication.
③ The type of file operated on: office document, compressed file or picture.
④ The data traffic accessed by the operation.
⑤ The number of different terminals used, the number of IP changes, and the number of logins and logouts.
(2) The features are vectorized to obtain data that a machine learning model can process.
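One possible vectorization of the features listed above is sketched below: categorical features are one-hot encoded and numeric features are appended as floats. The vocabularies, the record keys and the function names are hypothetical, and the embodiment does not state which encoding is actually used.

```python
# Hypothetical category vocabularies for the three categorical features.
PERIODS = ["morning", "noon", "evening"]
OPERATIONS = ["report", "outbound_mail", "offsite_work", "external_communication"]
FILE_TYPES = ["office", "archive", "image"]


def one_hot(value, vocabulary):
    """1.0 at the position of `value`, 0.0 elsewhere (all zeros if unknown)."""
    return [1.0 if value == v else 0.0 for v in vocabulary]


def vectorize(record):
    """Turn one cleaned log record into a fixed-length numeric vector."""
    return (
        one_hot(record["period"], PERIODS)
        + one_hot(record["operation"], OPERATIONS)
        + one_hot(record["file_type"], FILE_TYPES)
        + [
            float(record["traffic_bytes"]),   # accessed data traffic
            float(record["terminal_count"]),  # number of different terminals
            float(record["ip_changes"]),      # number of IP changes
            float(record["logins"]),          # number of logins and logouts
        ]
    )
```

Because the clustering and the distance test both rely on distances between vectors, the numeric features would normally need scaling so that a large traffic value does not dominate the one-hot components; scaling is omitted here for brevity.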
3 Modeling and decision
(1) Coarse clustering is performed with the Canopy algorithm to obtain the number of classes into which the data set clusters.
(2) High-precision clustering is performed with the K-Means method. After clustering, any class that contains too few instances, or clearly fewer than the other classes, is marked as an abnormal class; the instances in that class are marked as abnormal behaviors, and the instances in the other classes are marked as normal behaviors. In the K-Means clustering result graph, classes that deviate clearly and contain few instances are marked as abnormal classes; once labeled, they are used for classification.
(3) A small-scale normal behavior library is generated by means of manual judgment.
The specific method is: manually check whether the clustered instances marked as abnormal actually involve abnormal operations; if so, mark them as abnormal behaviors. All instances corresponding to abnormal behaviors form the abnormal behavior library.
(4) A part of the samples is randomly extracted from the normal behavior library for use by the KNN algorithm to find abnormal behaviors: if the Euclidean distance between a new behavior and every sampled instance in the library is greater than a set threshold, the behavior is abnormal. If a behavior judged abnormal is manually judged to be normal, the normal behavior library is updated with it. In the KNN classification result graph, abnormal users are marked, but some users who are not abnormal are also marked as abnormal.
(5) When enough normal behavior data and abnormal behavior data exist, a random forest model is trained using this data as samples, and the trained model is deployed in the real-time processing module and the offline processing module respectively to judge abnormal behavior. In the RandomForest classification result graph, the number of wrongly marked users is clearly reduced.
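To make the final training step concrete, the following is a deliberately tiny random forest written in plain Python. It is a sketch under strong simplifying assumptions and is not Spark MLlib's RandomForest: each tree is a single-split decision stump, each stump sees a bootstrap resample and a random subset of features, and prediction is a majority vote. All names and parameters (train_stump, train_forest, n_trees, seed) are hypothetical.

```python
import random
from collections import Counter


def train_stump(samples, labels, feature_indices):
    """Find the (feature, threshold, direction) split with fewest errors.
    Labels are booleans, True meaning abnormal."""
    best = None
    for f in feature_indices:
        for threshold in sorted({s[f] for s in samples}):
            for above_is_abnormal in (True, False):
                errors = sum(
                    ((s[f] > threshold) == above_is_abnormal) != y
                    for s, y in zip(samples, labels)
                )
                if best is None or errors < best[0]:
                    best = (errors, f, threshold, above_is_abnormal)
    _, f, threshold, above_is_abnormal = best
    return lambda s: (s[f] > threshold) == above_is_abnormal


def train_forest(normal, abnormal, n_trees=15, seed=0):
    """Train stumps on bootstrap resamples with random feature subsets."""
    rng = random.Random(seed)
    data = [(s, False) for s in normal] + [(s, True) for s in abnormal]
    n_features = len(data[0][0])
    n_sub = max(1, int(n_features ** 0.5))  # features considered per tree
    trees = []
    for _ in range(n_trees):
        boot = [rng.choice(data) for _ in data]  # sample with replacement
        feats = rng.sample(range(n_features), n_sub)
        trees.append(train_stump([s for s, _ in boot],
                                 [y for _, y in boot], feats))
    return trees


def predict(trees, sample):
    """Majority vote of the trees; True means abnormal."""
    votes = Counter(tree(sample) for tree in trees)
    return votes[True] > votes[False]
```

The same trained list of trees could be handed to both the real-time path and the offline path, since prediction only reads the model.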
The terms 'instance' and 'sample' used above have the same meaning; each denotes a security management and control terminal log.
The above method solves the problem of having too few labeled samples in the initial stage. A machine learning algorithm replaces manual work, which saves labor and time cost, improves judgment accuracy, and effectively prevents misjudgment. The operation of the platform is both an abnormal behavior discovery process and a process of self-adjustment and continuous improvement; the identification of abnormal behavior no longer depends on a strong security rule base preset in the system, but improves continuously in a self-adaptive manner.
The above description covers only the preferred embodiment of the present invention and is not intended to limit its scope of protection. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (9)

1. An abnormal behavior discovery method based on big data machine learning, comprising the following steps:
1) preprocessing original security log data;
2) extracting feature data from the preprocessed result;
3) clustering the feature data, determining each behavior sample in the original security log data to be an abnormal behavior sample or a normal behavior sample, and putting it into an abnormal behavior library or a normal behavior library accordingly;
4) acquiring new behavior sample data from a new security log, and comparing the sample data with the normal behavior library and the abnormal behavior library to determine whether it is a normal behavior or an abnormal behavior;
5) updating the normal behavior library or the abnormal behavior library with the new behavior sample data;
6) when the normal behavior library and the abnormal behavior library contain the same quantity of normal behavior and abnormal behavior sample data, jumping to step 7); otherwise, jumping to step 4);
7) training a random forest model with the sample data in the normal behavior library and the abnormal behavior library, and deploying the trained random forest model in a real-time processing module and an offline processing module respectively to judge abnormal behavior in subsequent new behavior sample data, wherein after new behavior sample data is input, it is duplicated into two identical copies that are input into the real-time processing module and the offline processing module respectively;
the real-time processing module provides streaming computing capability, judges user behavior in a quasi-real-time manner, and stores the judgment result in a high-performance database that provides real-time data service to users; the offline processing module provides batch processing capability for massive data and is used for model training and batch offline judgment, comprises a plurality of scheduled tasks, processes the data set in full or incrementally, and stores the judgment results in the high-performance database;
8) updating the normal behavior library or the abnormal behavior library with the new behavior sample data of step 7).
2. The method of claim 1, wherein the feature data extracted in step 2) comprises: the time of the user using the terminal, the operation behavior category and the operation file type; vectorizing the extracted feature data.
3. The method of claim 1, the step 3) comprising: clustering the feature data by using Mllib, specifically comprising: determining K clustering centers by using a Canopy algorithm, then carrying out K-Means clustering, marking the class which contains the instances less than a certain threshold value after clustering as an abnormal class, marking the instances in the class as abnormal behaviors, and marking the other classes as normal classes, wherein the instances are marked as normal behaviors.
4. The method of claim 1, wherein step 4) comprises: randomly extracting a portion of the sample data from the normal behavior library for use by a KNN algorithm to find abnormal behaviors, wherein if the distances between the new behavior sample data and all of the randomly extracted sample data are greater than a set threshold, the new behavior sample data is judged to be abnormal behavior, and otherwise it is judged to be normal behavior; if behavior judged abnormal is manually determined to be normal, it is treated as normal behavior; and updating the normal behavior library or the abnormal behavior library with the sample data corresponding to the normal or abnormal behavior, respectively.
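As a rough illustration of the distance rule recited in claim 4, the following Python sketch draws a random subset of the normal behavior library and flags a new sample as abnormal only when it lies farther than the threshold from every drawn sample. The function name `judge_by_distance`, the parameters `sample_size`, `threshold` and `manual_override`, and the choice of Euclidean distance are assumptions made for illustration; the claim does not name a distance metric, and this is not the patentee's implementation.

```python
import math
import random
from typing import List, Optional, Sequence


def judge_by_distance(
    new_sample: Sequence[float],
    normal_library: List[Sequence[float]],
    sample_size: int,
    threshold: float,
    rng: Optional[random.Random] = None,
    manual_override: Optional[str] = None,
) -> str:
    """Return "normal" or "abnormal" for one vectorized behavior sample."""
    if rng is None:
        rng = random.Random()
    # Randomly extract part of the normal behavior library as reference points.
    k = min(sample_size, len(normal_library))
    reference = rng.sample(normal_library, k)
    # Abnormal only if the new sample is farther than the threshold from ALL of them.
    # (Assumption: with an empty reference set the sample is treated as abnormal.)
    all_far = all(math.dist(new_sample, ref) > threshold for ref in reference)
    label = "abnormal" if all_far else "normal"
    # A human reviewer may overrule an "abnormal" verdict (hypothetical parameter).
    if label == "abnormal" and manual_override == "normal":
        label = "normal"
    return label


# Hypothetical two-dimensional feature vectors.
normal_library = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
rng = random.Random(0)
near = judge_by_distance([0.5, 0.5], normal_library, 3, 1.0, rng)
far = judge_by_distance([10.0, 10.0], normal_library, 3, 1.0, rng)
reviewed = judge_by_distance(
    [10.0, 10.0], normal_library, 3, 1.0, rng, manual_override="normal"
)
```

The returned label decides which library receives the sample in the subsequent update step.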
5. An abnormal behavior discovery system based on big data machine learning, comprising:
a preprocessing module for preprocessing original security log data;
a feature data extraction module for extracting feature data from the preprocessed result;
a clustering module for clustering the feature data, determining whether each behavior sample in the original security log data is an abnormal behavior sample or a normal behavior sample, and placing it in an abnormal behavior library or a normal behavior library accordingly;
a behavior library generation module for acquiring new behavior sample data from a new security log and comparing it with the normal behavior library and the abnormal behavior library to determine whether it is normal behavior or abnormal behavior;
an updating module for updating the normal behavior library or the abnormal behavior library with the new behavior sample data;
a behavior judging module for training a random forest model with the sample data in the normal behavior library and the abnormal behavior library, deploying the trained random forest model in both a real-time processing module and an offline processing module, and judging whether subsequent new behavior sample data is abnormal; after new behavior sample data is input, it is copied into two identical copies, one of which is input into the real-time processing module and the other into the offline processing module; the real-time processing module provides streaming computing capability, judges user behavior in near real time, and stores its judgment results in a high-performance database that provides real-time data services to users; the offline processing module provides batch processing capability for massive data and is used for model training and batch offline judgment, the offline processing module comprising a plurality of scheduled tasks that process data sets in full or incremental mode and store their judgment results in the high-performance database;
wherein the updating module also updates the normal behavior library or the abnormal behavior library with the new behavior sample data processed by the behavior judging module.
6. The system of claim 5, wherein the extracted feature data comprises: the time at which the user uses the terminal, the operation type, and the type of file operated on; and the extracted feature data is vectorized.
7. The system of claim 5, wherein the clustering module clusters the feature data using MLlib, specifically: determining K cluster centers with the Canopy algorithm and then performing K-Means clustering; after clustering, marking any class that contains fewer instances than a certain threshold as an abnormal class and the instances in it as abnormal behaviors, and marking the remaining classes as normal classes and their instances as normal behaviors.
8. The system according to claim 5, wherein the behavior library generation module randomly extracts a portion of the sample data in the normal behavior library for use by a KNN algorithm to find abnormal behavior; if the distances between the new behavior sample data and all of the randomly extracted sample data are greater than a set threshold, the new behavior sample data is judged to be abnormal behavior, and otherwise normal behavior; if behavior judged abnormal is manually determined to be normal, it is treated as normal behavior; and the normal behavior library or the abnormal behavior library is updated with the sample data corresponding to the normal or abnormal behavior, respectively.
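The clustering recited in claims 3 and 7 (Canopy to pick the K centers, K-Means to cluster, then marking small classes as abnormal) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the MLlib-based implementation the claims refer to: the Canopy pass is simplified to a single distance threshold `t2`, distances are Euclidean, and the names `canopy_centers`, `k_means`, `label_by_cluster_size` and `min_cluster_size` are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence

Point = Sequence[float]


def canopy_centers(points: List[Point], t2: float) -> List[Point]:
    """Simplified Canopy pass: the number of centers found becomes K."""
    remaining = list(points)
    centers: List[Point] = []
    while remaining:
        center = remaining[0]
        centers.append(center)
        # Drop every point within t2 of the new center (the center included).
        remaining = [p for p in remaining if math.dist(p, center) > t2]
    return centers


def nearest(point: Point, centers: List[List[float]]) -> int:
    """Index of the center closest to the point."""
    return min(range(len(centers)), key=lambda j: math.dist(point, centers[j]))


def k_means(points: List[Point], initial: List[Point], iterations: int = 20) -> List[int]:
    """Plain Lloyd iterations seeded with the Canopy centers; returns cluster indices."""
    centers = [list(c) for c in initial]
    for _ in range(iterations):
        assignment = [nearest(p, centers) for p in points]
        for j in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return [nearest(p, centers) for p in points]


def label_by_cluster_size(points: List[Point], t2: float, min_cluster_size: int) -> List[str]:
    """Mark every instance in a cluster smaller than min_cluster_size as abnormal."""
    assignment = k_means(points, canopy_centers(points, t2))
    sizes = Counter(assignment)
    return ["abnormal" if sizes[a] < min_cluster_size else "normal" for a in assignment]


# Hypothetical vectorized behavior samples: five similar ones and one outlier.
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.5, 0.5], [10.0, 10.0]]
labels = label_by_cluster_size(points, t2=2.0, min_cluster_size=2)
```

Seeding K-Means with Canopy centers avoids having to choose K by hand, which is presumably why the claims pair the two algorithms.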
9. An abnormal behavior processing system based on big data machine learning, the system comprising a data service module, a real-time processing module and an offline processing module, wherein:
the data service module forms a normal behavior library and an abnormal behavior library based on the method of any one of claims 1-4;
a random forest model is trained with the sample data in the normal behavior library and the abnormal behavior library, and the trained random forest model is deployed in both the real-time processing module and the offline processing module;
after new sample data is input into the system, it is copied into two identical copies, one of which is input into the real-time processing module and the other into the offline processing module, to judge whether the sample data represents abnormal behavior;
the real-time processing module provides streaming computing capability, judges user behavior in near real time, and stores its judgment results in a high-performance database that provides real-time data services to users;
the offline processing module provides batch processing capability for massive data and is used for model training and batch offline judgment, the offline processing module comprising a plurality of scheduled tasks that process data sets in full or incremental mode and store their judgment results in the high-performance database.
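The two-path arrangement of claim 9 (each new sample copied once for the real-time module and once for the offline module, with both writing to the same database) might look roughly like the Python sketch below. All names here (`Sample`, `DualPathJudge`, `placeholder_model`, `result_store`) are hypothetical, an in-memory dictionary stands in for the high-performance database, and the model is a trivial placeholder rule rather than a trained random forest.

```python
import copy
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Sample:
    sample_id: str
    features: List[float]


Model = Callable[[List[float]], str]  # returns "normal" or "abnormal"


class DualPathJudge:
    """Fans each new sample out to a real-time path and an offline batch path."""

    def __init__(self, model: Model) -> None:
        self.model = model  # the same trained model is deployed on both paths
        self.result_store: Dict[str, Dict[str, str]] = {}  # stand-in for the database
        self.offline_buffer: List[Sample] = []

    def submit(self, sample: Sample) -> str:
        # Copy the input into two identical samples, one per path.
        realtime_copy = copy.deepcopy(sample)
        offline_copy = copy.deepcopy(sample)
        self.offline_buffer.append(offline_copy)  # judged later by the batch job
        label = self.model(realtime_copy.features)  # judged immediately
        self.result_store.setdefault(sample.sample_id, {})["realtime"] = label
        return label

    def run_offline_batch(self) -> int:
        """One scheduled task: judge buffered samples incrementally and store results."""
        processed = len(self.offline_buffer)
        for sample in self.offline_buffer:
            label = self.model(sample.features)
            self.result_store.setdefault(sample.sample_id, {})["offline"] = label
        self.offline_buffer.clear()
        return processed


def placeholder_model(features: List[float]) -> str:
    # Placeholder for the trained random forest; this is NOT a random forest.
    return "abnormal" if sum(features) > 10 else "normal"


judge = DualPathJudge(placeholder_model)
first = judge.submit(Sample("s1", [1.0, 2.0]))
second = judge.submit(Sample("s2", [9.0, 9.0]))
processed = judge.run_offline_batch()
```

In a real deployment the real-time path would be a streaming job and the offline path a scheduled batch job; the sketch only shows the fan-out and the shared result store.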
CN201611232408.7A 2016-12-28 2016-12-28 Abnormal behavior discovery method and system based on big data machine learning Active CN106778259B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611232408.7A CN106778259B (en) 2016-12-28 2016-12-28 Abnormal behavior discovery method and system based on big data machine learning

Publications (2)

Publication Number Publication Date
CN106778259A (en) 2017-05-31
CN106778259B (en) 2020-01-10

Family

ID=58921432

Family Applications (1)

Application Number Status Publication Priority Date Filing Date Title
CN201611232408.7A Active CN106778259B (en) 2016-12-28 2016-12-28 Abnormal behavior discovery method and system based on big data machine learning

Country Status (1)

Country Link
CN (1) CN106778259B (en)

Families Citing this family (43)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107404473A (en) * 2017-06-06 2017-11-28 西安电子科技大学 Based on Mshield machine learning multi-mode Web application means of defences
CN107291911B (en) * 2017-06-26 2020-01-21 北京奇艺世纪科技有限公司 Anomaly detection method and device
CN107341095B (en) * 2017-06-27 2020-07-28 北京优特捷信息技术有限公司 Method and device for intelligently analyzing log data
CN107426199B (en) * 2017-07-05 2020-10-30 浙江鹏信信息科技股份有限公司 Method and system for detecting and analyzing network abnormal behaviors
CN107204991A (en) * 2017-07-06 2017-09-26 深信服科技股份有限公司 A kind of server exception detection method and system
US10419468B2 (en) * 2017-07-11 2019-09-17 The Boeing Company Cyber security system with adaptive machine learning features
CN107707541A (en) * 2017-09-28 2018-02-16 小花互联网金融服务(深圳)有限公司 A kind of attack daily record real-time detection method based on machine learning of streaming
CN108011809A (en) * 2017-12-04 2018-05-08 北京明朝万达科技股份有限公司 Anti-data-leakage analysis method and system based on user behavior and document content
CN108319851B (en) * 2017-12-12 2022-03-11 中国电子科技集团公司电子科学研究院 Abnormal behavior active detection method, equipment and storage medium
CN108040052A (en) * 2017-12-13 2018-05-15 北京明朝万达科技股份有限公司 A kind of network security threats analysis method and system based on Netflow daily record datas
CN107968840B (en) * 2017-12-15 2020-10-09 华北电力大学(保定) Real-time processing method and system for monitoring alarm data of large-scale power equipment
CN108416376B (en) * 2018-02-27 2021-03-12 北京东方天得科技有限公司 SVM-based logistics man-vehicle tracking monitoring management system and method
CN108512841B (en) * 2018-03-23 2021-03-16 四川长虹电器股份有限公司 Intelligent defense system and method based on machine learning
CN108718296A (en) * 2018-04-27 2018-10-30 广州西麦科技股份有限公司 Network management-control method, device and computer readable storage medium based on SDN network
CN113159145A (en) * 2018-04-28 2021-07-23 华为技术有限公司 Characteristic engineering arrangement method and device
CN108614895B (en) * 2018-05-10 2020-09-29 中国移动通信集团海南有限公司 Abnormal data access behavior identification method and data processing device
CN108717510A (en) * 2018-05-11 2018-10-30 深圳市联软科技股份有限公司 A kind of method, system and terminal by clustering file abnormal operation behavior
CN108737222A (en) * 2018-06-29 2018-11-02 山东汇贸电子口岸有限公司 A kind of server exception method of real-time based on data extraction
CN108769079A (en) * 2018-07-09 2018-11-06 四川大学 A kind of Web Intrusion Detection Techniques based on machine learning
CA3105858A1 (en) * 2018-07-12 2020-01-16 Cyber Defence Qcd Corporation Systems and methods of cyber-monitoring which utilizes a knowledge database
CN109189819B (en) * 2018-07-12 2021-08-24 华南师范大学 Mobile k neighbor differential query method, system and device
CN110738827A (en) * 2018-07-20 2020-01-31 珠海格力电器股份有限公司 Abnormity early warning method, system, device and storage medium of electric appliance
CN109218077A (en) * 2018-08-14 2019-01-15 阿里巴巴集团控股有限公司 Prediction technique, device, electronic equipment and the storage medium of target device
CN109255001A (en) * 2018-08-31 2019-01-22 阿里巴巴集团控股有限公司 Maintaining method and device, the electronic equipment in interface instance library
CN109034140B (en) * 2018-09-13 2021-05-04 哈尔滨工业大学 Industrial control network signal abnormity detection method based on deep learning structure
CN109472293A (en) * 2018-10-12 2019-03-15 国家电网有限公司 A kind of grid equipment file data error correction method based on machine learning
CN109359098B (en) * 2018-10-31 2023-04-11 云南电网有限责任公司 System and method for monitoring scheduling data network behaviors
CN109472484B (en) * 2018-11-01 2021-08-03 凌云光技术股份有限公司 Production process abnormity recording method based on flow chart
CN109871954B (en) * 2018-12-24 2022-12-02 腾讯科技(深圳)有限公司 Training sample generation method, abnormality detection method and apparatus
CN109739846A (en) * 2018-12-27 2019-05-10 国电南瑞科技股份有限公司 A kind of electric network data mass analysis method
CN110210512B (en) * 2019-04-19 2024-03-26 北京亿阳信通科技有限公司 Automatic log anomaly detection method and system
CN110517469A (en) * 2019-08-08 2019-11-29 武汉兴图新科电子股份有限公司 A kind of intelligent alarm convergence method suitable for audio-video convergence platform
CN111026653B (en) * 2019-09-16 2022-04-08 腾讯科技(深圳)有限公司 Abnormal program behavior detection method and device, electronic equipment and storage medium
CN112784862A (en) * 2019-11-07 2021-05-11 中国石油化工股份有限公司 Fault diagnosis and identification method for refining process of atmospheric and vacuum distillation unit
CN110889441B (en) * 2019-11-19 2023-07-25 海南电网有限责任公司海南输变电检修分公司 Power transformation equipment data anomaly identification method based on distance and point density
CN110889451B (en) * 2019-11-26 2023-07-07 Oppo广东移动通信有限公司 Event auditing method, device, terminal equipment and storage medium
CN111597549A (en) * 2020-04-17 2020-08-28 国网浙江省电力有限公司湖州供电公司 Network security behavior identification method and system based on big data
CN112001533A (en) * 2020-08-06 2020-11-27 众安信息技术服务有限公司 Parameter detection method and device and computer system
CN112488226B (en) * 2020-12-10 2022-11-01 中国电子科技集团公司第三十研究所 Terminal abnormal behavior identification method based on machine learning algorithm
CN112383575B (en) * 2021-01-18 2021-05-04 北京晶未科技有限公司 Method, electronic device and electronic equipment for information security
CN112926773A (en) * 2021-02-23 2021-06-08 深圳市北斗智能科技有限公司 Riding safety early warning method and device, electronic equipment and storage medium
CN113722707A (en) * 2021-11-02 2021-11-30 西安热工研究院有限公司 Database abnormal access detection method, system and equipment based on distance measurement
CN115269981A (en) * 2022-03-03 2022-11-01 陈林 Abnormal behavior analysis method and system combined with artificial intelligence

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103095711A (en) * 2013-01-18 2013-05-08 重庆邮电大学 Application layer distributed denial of service (DDoS) attack detection method and defensive system aimed at website
CN104735074A (en) * 2015-03-31 2015-06-24 江苏通付盾信息科技有限公司 Malicious URL detection method and implement system thereof
CN104954453A (en) * 2015-06-02 2015-09-30 浙江工业大学 Data mining REST service platform based on cloud computing
CN105553998A (en) * 2015-12-23 2016-05-04 中国电子科技集团公司第三十研究所 Network attack abnormality detection method
CN105677615A (en) * 2016-01-04 2016-06-15 北京邮电大学 Distributed machine learning method based on weka interface

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105224872B (en) * 2015-09-30 2018-04-13 河南科技大学 A kind of user's anomaly detection method based on neural network clustering

Also Published As

Publication number Publication date
CN106778259A (en) 2017-05-31

Similar Documents

Publication Title
CN106778259B (en) Abnormal behavior discovery method and system based on big data machine learning
WO2021184630A1 (en) Method for locating pollutant discharge object on basis of knowledge graph, and related device
CN111639497B (en) Abnormal behavior discovery method based on big data machine learning
CN106682527B (en) A kind of data security control method and system based on data classification classification
CN105302911B (en) A kind of data screening engine method for building up and data screening engine
CN107992746A (en) Malicious act method for digging and device
CN112468347B (en) Security management method and device for cloud platform, electronic equipment and storage medium
CN104765733A (en) Method and device for analyzing social network event
CN111339297A (en) Network asset anomaly detection method, system, medium, and device
CN111126820A (en) Electricity stealing prevention method and system
CN105376193A (en) Intelligent association analysis method and intelligent association analysis device for security events
CN113918367A (en) Large-scale system log anomaly detection method based on attention mechanism
CN110851422A (en) Data anomaly monitoring model construction method based on machine learning
CN112202718B (en) XGboost algorithm-based operating system identification method, storage medium and device
CN112367303A (en) Distributed self-learning abnormal flow cooperative detection method and system
CN114841268B (en) Abnormal power customer identification method based on Transformer and LSTM fusion algorithm
CN109660656A (en) A kind of intelligent terminal method for identifying application program
CN117220920A (en) Firewall policy management method based on artificial intelligence
CN115277113A (en) Power grid network intrusion event detection and identification method based on ensemble learning
CN112532652A (en) Attack behavior portrait device and method based on multi-source data
CN110716957B (en) Intelligent mining and analyzing method for class case suspicious objects
Harbola et al. Improved intrusion detection in DDoS applying feature selection using rank & score of attributes in KDD-99 data set
CN111343143A (en) Data identification method, device and storage medium
CN105930430B (en) Real-time fraud detection method and device based on non-accumulative attribute
CN116865994A (en) Network data security prediction method based on big data

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant