CN106778259B - Abnormal behavior discovery method and system based on big data machine learning - Google Patents
- Publication number
- CN106778259B CN201611232408.7A
- Authority
- CN
- China
- Prior art keywords
- behavior
- data
- abnormal
- library
- normal
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
- G06F21/552—Detecting local intrusion or implementing counter-measures involving long-term monitoring or reporting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2413—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
- G06F18/24147—Distances to closest patterns, e.g. nearest neighbour classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/57—Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
- G06F21/577—Assessing vulnerabilities and evaluating computer system security
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/16—Threshold monitoring
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1408—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
- H04L63/1416—Event detection, e.g. attack signature detection
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1408—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
- H04L63/1425—Traffic logging, e.g. anomaly detection
Landscapes
- Engineering & Computer Science (AREA)
- Computer Security & Cryptography (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Computer Hardware Design (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- Software Systems (AREA)
- General Physics & Mathematics (AREA)
- Computing Systems (AREA)
- Signal Processing (AREA)
- Computer Networks & Wireless Communication (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Probability & Statistics with Applications (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses an abnormal behavior discovery method and system based on big data machine learning. The method comprises the following steps: preprocessing raw security log data; extracting feature data from the preprocessed result; clustering the feature data to determine an abnormal behavior library and a normal behavior library; acquiring new behavior sample data from a new security log, comparing the sample data with the normal behavior library and the abnormal behavior library to determine whether it is normal or abnormal behavior, and updating the normal behavior library or the abnormal behavior library with the new behavior sample data; and repeating the previous step until the normal behavior library and the abnormal behavior library contain enough normal and abnormal behavior samples, then training a random forest model on the sample data in the two libraries and using the trained random forest model to judge abnormal behavior. This scheme solves the problem of having too few labeled samples in the initial stage, improves judgment accuracy, and effectively prevents misjudgment.
Description
Technical Field
The invention relates to the field of data security, in particular to an abnormal behavior discovery method and system based on big data machine learning.
Background
Traditional network security and data security technologies, such as software and hardware firewalls, generally adopt a 'fence-type' protection strategy that imposes many artificial restrictions on the network and on application systems. Every data access must be filtered through all preset rules, which degrades the user experience and increases the operating burden of the system. In addition, generating a built-in rule in existing security software generally requires multiple stages, such as vulnerability discovery, attack simulation, message analysis, feature extraction, and rule generation. As attack methods are continually updated, the rule generation process must be repeated again and again, which consumes a large amount of labor. More importantly, traditional protection cannot handle big data. For these reasons, an abnormal behavior discovery method based on big data machine learning is proposed, which changes passive defense into active inspection, relaxes user access while strengthening behavior monitoring, and replaces manual work with machines.
Fig. 1 shows a prior-art process for discovering abnormal behaviors of management users based on big data log analysis, which specifically includes:
(1) Storing the logs to be analyzed in a log pool.
(2) Connecting the log pool to the preprocessing module through the interface module.
(3) Connecting the preprocessing module to an analysis module, where statistical analysis is carried out manually and rules are established.
(4) Judging the behavior logs according to the established rules, and storing the logs judged to be abnormal behavior in a knowledge base.
(5) Connecting the visualization module to the service module; the visualization module displays the abnormal behavior tracks obtained from log analysis on a user interface.
The prior art has the following defects:
(1) The data source is single: only logs are analyzed.
(2) Abnormal behaviors and users cannot be identified in real time.
(3) Everything relies on manual statistical analysis, which is costly and prone to misjudging behaviors.
Therefore, the following technical problems need to be solved:
(1) Receiving, storing, processing, and mining structured, semi-structured, and unstructured data.
(2) Replacing manual work with machine learning modeling, which improves judgment accuracy and saves labor cost. In addition, the trained model can be used both for batch offline behavior judgment and for online quasi-real-time behavior judgment.
(3) Identifying abnormal behaviors without relying on a strong security rule base preset by the system, and instead improving continuously in a self-adaptive manner.
Disclosure of Invention
To solve the above technical problems, the invention provides an abnormal behavior discovery method based on big data machine learning, comprising the following steps:
1) preprocessing raw security log data;
2) extracting feature data from the preprocessed result;
3) clustering the feature data, determining each behavior sample in the raw security log data to be an abnormal behavior sample or a normal behavior sample, and putting it into an abnormal behavior library or a normal behavior library accordingly;
4) acquiring new behavior sample data from a new security log, and comparing the sample data with the normal behavior library and the abnormal behavior library to determine whether it is normal behavior or abnormal behavior;
5) updating the normal behavior library or the abnormal behavior library with the new behavior sample data;
6) when the normal behavior library and the abnormal behavior library contain enough normal and abnormal behavior sample data, going to step 7); otherwise, going to step 4);
7) training a random forest model on the sample data in the normal behavior library and the abnormal behavior library, deploying the trained random forest model in a real-time processing module and an offline processing module respectively to judge whether subsequent new behavior sample data is abnormal, and going to step 5).
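The control flow of steps 4) to 7) can be sketched roughly as follows. This is only an illustrative sketch: `MIN_SAMPLES`, `bootstrap_judge`, `train_model`, and `run_pipeline` are hypothetical names, and the source does not state how many samples count as "enough".

```python
from typing import Callable, List, Sequence, Tuple

Sample = Sequence[float]
MIN_SAMPLES = 3  # hypothetical threshold; the source only says "enough"


def run_pipeline(
    stream: List[Sample],
    normal_lib: List[Sample],
    abnormal_lib: List[Sample],
    bootstrap_judge: Callable[[Sample, List[Sample], List[Sample]], bool],
    train_model: Callable[[List[Sample], List[Sample]], Callable[[Sample], bool]],
) -> List[Tuple[str, bool]]:
    """Judge each new sample and return (stage, is_abnormal) per sample."""
    model = None
    decisions = []
    for sample in stream:
        if model is None:
            # Step 4: compare the sample with the two behavior libraries.
            is_abnormal = bootstrap_judge(sample, normal_lib, abnormal_lib)
            stage = "bootstrap"
        else:
            # Step 7: the trained model judges subsequent samples.
            is_abnormal = model(sample)
            stage = "model"
        # Step 5: update the matching library with the new sample.
        (abnormal_lib if is_abnormal else normal_lib).append(sample)
        decisions.append((stage, is_abnormal))
        # Step 6: train the model once both libraries are large enough.
        if model is None and min(len(normal_lib), len(abnormal_lib)) >= MIN_SAMPLES:
            model = train_model(normal_lib, abnormal_lib)
    return decisions


# Toy usage with stand-in judges (both are assumptions for illustration).
normal_lib = [[1.0], [2.0], [1.5]]
abnormal_lib = [[9.0], [8.0]]
decisions = run_pipeline(
    stream=[[10.0], [1.0], [7.0]],
    normal_lib=normal_lib,
    abnormal_lib=abnormal_lib,
    bootstrap_judge=lambda s, n, a: s[0] > 5.0,
    train_model=lambda n, a: (lambda s: s[0] > 5.0),
)
```

In this sketch the libraries keep growing after the model is trained, which mirrors the jump from step 7) back to step 5).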
Preferably, the feature data extracted in step 2) includes: the time at which the user uses the terminal, the operation behavior category, and the type of file operated on; the extracted feature data is then vectorized.
Preferably, step 3) includes clustering the feature data using MLlib, specifically: determining K cluster centers using the Canopy algorithm and then performing K-Means clustering; after clustering, marking any class that contains fewer instances than a certain threshold, or clearly fewer instances than the other classes, as an abnormal class and marking its instances as abnormal behaviors; and marking the other classes as normal classes and their instances as normal behaviors.
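The labeling rule in step 3) can be illustrated with a small pure-Python K-Means. This is a sketch under assumptions rather than the MLlib implementation the patent refers to: the initial centers are taken as given (in the patent they come from Canopy), and `min_size` is a hypothetical value for the "certain threshold".

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def nearest(point: Point, centers: List[List[float]]) -> int:
    """Index of the center closest to the point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def kmeans(points: List[Point], centers: List[List[float]], iters: int = 20) -> List[int]:
    """Plain Lloyd iterations starting from the given initial centers."""
    centers = [list(c) for c in centers]
    for _ in range(iters):
        assign = [nearest(p, centers) for p in points]
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assign) if a == k]
            if members:  # keep the old center if a cluster becomes empty
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return [nearest(p, centers) for p in points]


def label_by_cluster_size(assign: List[int], k: int, min_size: int) -> List[str]:
    """Mark every instance in a cluster smaller than min_size as abnormal."""
    sizes = [assign.count(c) for c in range(k)]
    return ["abnormal" if sizes[a] < min_size else "normal" for a in assign]


# Four similar behavior vectors and one outlier (toy data).
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.0), (8.0, 8.0)]
assign = kmeans(points, centers=[[1.0, 1.0], [8.0, 8.0]])
labels = label_by_cluster_size(assign, k=2, min_size=2)
```

The second criterion in the text, a class with "clearly fewer instances than the other classes", is not modeled here; it would need a relative rule, for example a ratio against the median cluster size.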
Preferably, step 4) includes: randomly extracting a part of the sample data from the normal behavior library for use by a KNN algorithm to find abnormal behaviors, wherein if the distances between the new behavior sample data and all of the randomly extracted sample data are greater than a set threshold, the new behavior sample data is an abnormal behavior, and otherwise it is a normal behavior; if a behavior flagged as abnormal is manually judged to be normal, it is treated as a normal behavior; and the normal behavior library or the abnormal behavior library is updated with the sample data corresponding to the normal or abnormal behavior respectively.
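The distance check in step 4) might look like the sketch below. The function names, the threshold, and the subset size are illustrative assumptions, and Euclidean distance is used because the embodiment mentions it.

```python
import math
import random
from typing import List, Sequence

Sample = Sequence[float]


def knn_flag_abnormal(
    new_sample: Sample,
    normal_lib: List[Sample],
    threshold: float,
    subset_size: int,
    rng: random.Random,
) -> bool:
    """Flag the sample if it is farther than threshold from every drawn reference."""
    subset = rng.sample(normal_lib, min(subset_size, len(normal_lib)))
    return all(math.dist(new_sample, ref) > threshold for ref in subset)


def final_label(flagged_abnormal: bool, reviewer_says_normal: bool = False) -> str:
    """A manual judgment of 'normal' overrides an automatic abnormal flag."""
    if flagged_abnormal and not reviewer_says_normal:
        return "abnormal"
    return "normal"


normal_lib = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.0)]
rng = random.Random(0)
near = knn_flag_abnormal((1.0, 1.05), normal_lib, threshold=1.0, subset_size=3, rng=rng)
far = knn_flag_abnormal((9.0, 9.0), normal_lib, threshold=1.0, subset_size=3, rng=rng)
```

As described, the rule is a distance-threshold check against reference samples rather than majority voting among k neighbors. Note that with an empty library this sketch flags every sample as abnormal.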
Preferably, the real-time processing module provides streaming computing capability, judges user behavior in a quasi-real-time manner, and stores the judgment results in a high-performance database that provides real-time data services to users;
the batch processing module provides batch processing capability for massive data and is used for model training and batch offline judgment; it comprises a number of scheduled tasks, processes data sets in full or incremental mode, and stores the judgment results in the high-performance database.
To solve the above technical problems, the invention further provides an abnormal behavior discovery system based on big data machine learning, including:
a preprocessing module, used for preprocessing the raw security log data;
a feature data extraction module, used for extracting feature data from the preprocessed result;
a clustering module, used for clustering the feature data, determining each behavior sample in the raw security log data to be an abnormal behavior sample or a normal behavior sample, and putting it into an abnormal behavior library or a normal behavior library accordingly;
a behavior library generation module, used for acquiring new behavior sample data from a new security log, and comparing the sample data with the normal behavior library and the abnormal behavior library to determine whether it is normal behavior or abnormal behavior;
an updating module, used for updating the normal behavior library or the abnormal behavior library with the new behavior sample data;
and a behavior judgment module, used for training a random forest model on the sample data in the normal behavior library and the abnormal behavior library, deploying the trained random forest model in the real-time processing module and the offline processing module respectively, and judging whether subsequent new behavior sample data is abnormal.
Preferably, the extracted feature data includes: the time at which the user uses the terminal, the operation type, and the type of file operated on; the extracted feature data is then vectorized.
Preferably, the clustering module clusters the feature data using MLlib, specifically: determining K cluster centers using the Canopy algorithm and then performing K-Means clustering; after clustering, marking any class that contains fewer instances than a certain threshold, or clearly fewer instances than the other classes, as an abnormal class and marking its instances as abnormal behaviors; and marking the other classes as normal classes and their instances as normal behaviors.
Preferably, the behavior library generation module randomly extracts a part of the sample data from the normal behavior library for use by the KNN algorithm to find abnormal behaviors; if the distances between the new behavior sample data and all of the randomly extracted sample data are greater than a set threshold, the new behavior sample data is an abnormal behavior, and otherwise it is a normal behavior; if a behavior flagged as abnormal is manually judged to be normal, it is treated as a normal behavior; and the normal behavior library or the abnormal behavior library is updated with the sample data corresponding to the normal or abnormal behavior respectively.
To solve the above technical problems, the invention further provides an abnormal behavior processing system based on big data machine learning, which includes a data service module, a real-time processing module, and a batch processing module;
the data service module forms a normal behavior library and an abnormal behavior library based on the above method;
a random forest model is trained on the sample data in the normal behavior library and the abnormal behavior library, and the trained random forest model is deployed in the real-time processing module and the offline processing module respectively;
after new sample data is input into the system, it is copied into two identical copies, which are input into the real-time processing module and the offline processing module respectively to judge whether the sample data is abnormal behavior;
the real-time processing module provides streaming computing capability, judges user behavior in a quasi-real-time manner, and stores the judgment results in a high-performance database that provides real-time data services to users;
the batch processing module provides batch processing capability for massive data and is used for model training and batch offline judgment; it comprises a number of scheduled tasks, processes data sets in full or incremental mode, and stores the judgment results in the high-performance database.
The technical scheme of the invention achieves the following technical effects:
1. The problem of having too few labeled samples in the initial stage is solved.
2. Machine learning algorithms replace manual work, which saves labor and time, improves judgment accuracy, and effectively prevents misjudgment.
3. The operation of the platform is not only a process of discovering abnormal behaviors but also a process of self-adjustment and continuous improvement; the identification of abnormal behaviors no longer depends on a strong security rule base preset by the system, but improves continuously in a self-adaptive manner.
Drawings
FIG. 1 is a flow chart of user abnormal behavior discovery in the prior art
FIG. 2 is a general flow chart of the present invention
FIG. 3 is a general architecture diagram of the system of the present invention
FIG. 4 is a flow chart of an embodiment of the present invention
Detailed Description
Explanation of terms:
Hadoop: a distributed system infrastructure whose core components are HDFS and MapReduce. HDFS provides storage for massive data, and MapReduce provides computation over it.
Spark: a general-purpose parallel computing framework similar to Hadoop MapReduce. Unlike MapReduce, Spark can keep the intermediate output of a job in memory, so computation is faster and it is better suited to iterative algorithms such as data mining and machine learning.
Lambda architecture: a real-time big data processing architecture proposed by Nathan Marz. It combines offline computation and real-time computation, incorporates architectural principles such as immutability, read-write separation, and complexity isolation, and can integrate big data components such as Hadoop and Spark.
Sqoop: a big data component for transferring data between a big data platform and a traditional relational database.
MLlib: Spark's machine learning library.
Canopy: an unsupervised clustering algorithm, mainly used to determine the number of clusters.
KMeans: the K-Means algorithm, an unsupervised clustering algorithm.
KNN: the k-nearest neighbor algorithm, a supervised classification algorithm.
Random Forest: an algorithm that trains on and predicts samples using multiple decision trees; it is a supervised classification algorithm.
Fig. 2 shows the abnormal behavior discovery flow chart of the present invention.
(1) Preprocessing raw data
The raw data is cleaned, converted, and extracted.
(2) Feature engineering
Features that represent the preprocessed raw data are derived from experience and analysis.
(3) Clustering with MLlib to obtain samples
First, K cluster centers are determined using the Canopy algorithm, and then K-Means clustering is performed. After clustering, any class that contains too few instances, or clearly fewer instances than the other classes, is marked as an abnormal class; its instances are marked as abnormal behaviors, and the instances in the other classes are marked as normal behaviors.
(4) Manually reviewing and updating the behavior library
Instances are labeled through clustering, and the instances labeled as abnormal behavior are then reviewed manually. Data that is manually judged to be abnormal behavior is stored in the abnormal behavior library, and the rest is put into the normal behavior library. In the early stage the number of samples is small, so manual review is performed to improve sample quality; once a certain number of samples has accumulated, manual review is no longer performed.
(5) Classification using MLlib, updating of the behavior library, and training of models
Samples are first classified using the KNN algorithm and the behavior library is updated; a RandomForest model is then trained on the samples in the behavior library. After training, the model is combined with a manually formulated rule base and used for quasi-real-time behavior judgment and batch offline behavior judgment.
The clustering in step (3) is unsupervised learning and needs no labeled sample data, whereas classification is supervised learning and needs labeled samples; using the output of clustering as the input of classification improves judgment accuracy.
(6) The results of real-time behavior judgment and batch offline behavior judgment are stored in the behavior library, so the behavior library is continually updated and improved.
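The Canopy pass named in step (3) can be sketched as follows, in one common formulation with a loose threshold T1 and a tight threshold T2 (T1 > T2). The threshold values and the function name `canopy` are assumptions for illustration; the patent does not specify them. The number of canopies gives K, and the canopy centers can seed K-Means.

```python
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]


def canopy(points: List[Point], t1: float, t2: float) -> List[Tuple[Point, List[Point]]]:
    """Return (center, members) pairs; len(result) is the suggested K."""
    if t1 <= t2:
        raise ValueError("t1 (loose) must be greater than t2 (tight)")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)
        # Points within the loose threshold T1 join this canopy.
        members = [center] + [p for p in remaining if math.dist(p, center) < t1]
        canopies.append((center, members))
        # Points within the tight threshold T2 can no longer start a canopy.
        remaining = [p for p in remaining if math.dist(p, center) >= t2]
    return canopies


points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (8.0, 8.0), (8.1, 7.9)]
canopies = canopy(points, t1=3.0, t2=1.5)
k = len(canopies)
initial_centers = [center for center, _ in canopies]
```

The result depends on the order in which points are visited and on the two thresholds, so in practice both would need tuning against the actual feature scale.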
Fig. 3 is a system architecture diagram of the present invention.
The system draws on the Lambda architecture and is divided into a real-time processing layer, a batch processing layer, and a data service layer. After the raw data enters the platform, it is copied into two copies, which enter the real-time processing layer and the batch processing layer respectively.
The real-time processing layer provides streaming computing capability, judges user behavior in a quasi-real-time manner, and stores the judgment results in a high-performance database that provides real-time data services to users.
The batch processing layer provides batch processing capability for massive data and is used for model training and batch offline judgment. It comprises a number of scheduled tasks, processes data sets in full or incremental mode, and stores the judgment results in the database.
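The two-path data flow described above can be illustrated with a small in-process sketch. Everything here is a stand-in: `LambdaDispatcher`, the record fields, and the dictionary used in place of the high-performance database are hypothetical, and a real deployment would use a streaming engine and scheduled batch jobs instead.

```python
import copy
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]


class LambdaDispatcher:
    """Copies each record to a real-time path and a batch path."""

    def __init__(self, judge: Callable[[Record], bool]) -> None:
        self.judge = judge
        self.batch_buffer: List[Record] = []
        # Stand-in for the shared high-performance database.
        self.results: Dict[Tuple[str, object], bool] = {}

    def ingest(self, record: Record) -> None:
        realtime_copy = copy.deepcopy(record)
        batch_copy = copy.deepcopy(record)
        # Real-time layer: judge immediately and store the result.
        self.results[("realtime", realtime_copy["id"])] = self.judge(realtime_copy)
        # Batch layer: buffer the record until the next scheduled task.
        self.batch_buffer.append(batch_copy)

    def run_batch_job(self) -> int:
        """Incremental batch run over buffered records; returns the count."""
        processed = 0
        for rec in self.batch_buffer:
            self.results[("batch", rec["id"])] = self.judge(rec)
            processed += 1
        self.batch_buffer.clear()
        return processed


dispatcher = LambdaDispatcher(judge=lambda r: r["traffic"] > 100)
dispatcher.ingest({"id": "u1", "traffic": 500})
dispatcher.ingest({"id": "u2", "traffic": 10})
pending_before = len(dispatcher.batch_buffer)
processed = dispatcher.run_batch_job()
```

Keeping the two paths separate means a slow batch run never delays the quasi-real-time judgment, at the cost of judging each record twice.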
Fig. 4 shows an embodiment of abnormal behavior discovery according to the present invention.
1 Data preprocessing
The security management and control terminal logs are stored in a traditional database and have fields such as a unique device identifier, a unique user identifier, and the operation behavior. The data is imported into the data warehouse of the big data platform using Sqoop and is then cleaned and converted: meaningless fields are removed and missing values are filled.
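The cleaning step could look like the following sketch, which assumes the log rows have already been imported as dictionaries. The field names, the set of "meaningless" fields, and the default fill values are hypothetical, since the source does not list them.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[object]]

DROP_FIELDS = {"remark"}                          # hypothetical meaningless field
DEFAULTS = {"operation": "unknown", "traffic": 0}  # hypothetical fill values


def clean(rows: List[Row]) -> List[Row]:
    """Drop meaningless fields, fill missing values, and convert types."""
    cleaned = []
    for row in rows:
        out = {k: v for k, v in row.items() if k not in DROP_FIELDS}
        for field, default in DEFAULTS.items():
            if out.get(field) in (None, ""):
                out[field] = default
        out["traffic"] = int(out["traffic"])  # convert text counts to integers
        cleaned.append(out)
    return cleaned


raw_rows = [
    {"device_id": "d1", "user_id": "u1", "operation": "send_mail",
     "traffic": "120", "remark": "x"},
    {"device_id": "d2", "user_id": "u2", "operation": None,
     "traffic": None, "remark": ""},
]
cleaned_rows = clean(raw_rows)
```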
2 Feature engineering
(1) For the security management and control terminal logs, the following features are extracted from the raw data according to experience and statistical analysis:
① The time at which the user uses the security management and control terminal, i.e. the time period of the operation: morning, noon, or evening.
② The operation type, including supervision and reporting, sending outgoing mail, working off-site, and external communication.
③ The type of file operated on: office document, compressed file, or picture.
④ The data traffic accessed by the operation.
⑤ The number of different terminals used, the number of IP changes, and the number of log-ins and log-outs.
(2) The features are vectorized to obtain data that a machine learning model can process.
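One plausible way to vectorize the features listed above is one-hot encoding for the categorical features and plain numeric values for the counts. The category names and record keys below are illustrative assumptions; the patent does not say which encoding is used.

```python
from typing import Dict, List

TIME_PERIODS = ["morning", "noon", "evening"]
OPERATION_TYPES = ["report", "send_mail", "off_site_work", "external_communication"]
FILE_TYPES = ["office_document", "compressed_file", "picture"]


def one_hot(value: str, vocabulary: List[str]) -> List[float]:
    """Encode a categorical value as a 0/1 vector over the vocabulary."""
    return [1.0 if value == v else 0.0 for v in vocabulary]


def vectorize(record: Dict[str, object]) -> List[float]:
    """Concatenate one-hot categorical features with numeric features."""
    return (
        one_hot(str(record["time_period"]), TIME_PERIODS)
        + one_hot(str(record["operation_type"]), OPERATION_TYPES)
        + one_hot(str(record["file_type"]), FILE_TYPES)
        + [
            float(record["traffic"]),
            float(record["terminal_count"]),
            float(record["ip_changes"]),
            float(record["login_logout_count"]),
        ]
    )


vector = vectorize({
    "time_period": "evening",
    "operation_type": "send_mail",
    "file_type": "picture",
    "traffic": 120,
    "terminal_count": 2,
    "ip_changes": 1,
    "login_logout_count": 4,
})
```

Because the later steps rely on Euclidean distance, the numeric features would normally be scaled so that a large raw value such as traffic does not dominate the one-hot components.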
3 Modeling and decision
(1) Coarse clustering is carried out with the Canopy algorithm to obtain the number of classes into which the data set groups.
(2) High-precision clustering is carried out with the K-Means method. After clustering, any class that contains too few instances, or clearly fewer instances than the other classes, is marked as an abnormal class; its instances are marked as abnormal behaviors, and the instances in the other classes are marked as normal behaviors. In the K-Means clustering result graph, classes that deviate noticeably and contain a small number of instances are marked as abnormal classes, and the labeled instances are then used for classification.
(3) A small-scale normal behavior library is generated with the help of manual judgment.
The specific method is: manually check whether the instances marked as abnormal by clustering actually involve abnormal operations; if so, mark them as abnormal behaviors. All instances corresponding to abnormal behaviors form the abnormal behavior library.
(4) A part of the samples is randomly extracted from the normal behavior library for use by the KNN algorithm to find abnormal behaviors. If the Euclidean distance between a new behavior and each sample instance drawn from the library is greater than a set threshold, the behavior is an abnormal behavior; if the flagged behavior is manually judged to be normal, the normal behavior library is updated with it. In the KNN classification result graph, abnormal users are marked, but some users who are not abnormal are also marked as abnormal.
(5) When enough normal behavior data and abnormal behavior data exist, a random forest model is trained using the data as samples, and the trained model is deployed in the real-time processing module and the offline processing module respectively to judge abnormal behaviors. In the RandomForest classification result graph, the number of wrongly marked users is clearly reduced.
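The idea behind step (5), bootstrap sampling, random feature selection, and majority voting, can be shown with a deliberately simplified forest of one-level trees (decision stumps). This is a teaching sketch, not MLlib's RandomForest; all names and the toy data are assumptions.

```python
import random
from collections import Counter
from typing import List, Sequence, Tuple

Sample = Sequence[float]
Stump = Tuple[int, float, int, int]  # feature, threshold, label_if_le, label_if_gt


def majority(labels: List[int], default: int = 0) -> int:
    return Counter(labels).most_common(1)[0][0] if labels else default


def fit_stump(X: List[Sample], y: List[int], features: List[int]) -> Stump:
    """Pick the (feature, threshold) split with the fewest training errors."""
    best, best_err = None, None
    for f in features:
        for t in sorted({x[f] for x in X}):
            left = [lab for x, lab in zip(X, y) if x[f] <= t]
            right = [lab for x, lab in zip(X, y) if x[f] > t]
            l_lab = majority(left)
            r_lab = majority(right, default=l_lab)
            err = sum(lab != l_lab for lab in left) + sum(lab != r_lab for lab in right)
            if best_err is None or err < best_err:
                best, best_err = (f, t, l_lab, r_lab), err
    return best


def fit_forest(X: List[Sample], y: List[int], n_trees: int, rng: random.Random) -> List[Stump]:
    n_features = len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]  # bootstrap sample
        feats = rng.sample(range(n_features), max(1, int(n_features ** 0.5)))
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest


def predict(forest: List[Stump], x: Sample) -> int:
    """Majority vote over all stumps; 1 means abnormal, 0 means normal."""
    return majority([l if x[f] <= t else r for f, t, l, r in forest])


# Toy libraries: label 0 = normal behavior, label 1 = abnormal behavior.
X = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.0),
     (8.0, 8.0), (8.5, 7.5), (7.9, 8.2), (8.1, 8.0)]
y = [0, 0, 0, 0, 1, 1, 1, 1]
forest = fit_forest(X, y, n_trees=25, rng=random.Random(42))
```

Averaging many weak, decorrelated trees is what lets the forest correct some of the false alarms that a single distance threshold produces, which is consistent with the reduction in wrongly marked users described above.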
The terms "instance" and "sample" above have the same meaning; each denotes one security management and control terminal log.
This method solves the problem of having too few labeled samples in the initial stage. Machine learning algorithms replace manual work, which saves labor and time, improves judgment accuracy, and effectively prevents misjudgment. The operation of the platform is not only a process of discovering abnormal behaviors but also a process of self-adjustment and continuous improvement; the identification of abnormal behaviors no longer depends on a strong security rule base preset by the system, but improves continuously in a self-adaptive manner.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be protected within the protection scope of the present invention.
Claims (9)
1. An abnormal behavior discovery method based on big data machine learning comprises the following steps:
1) preprocessing raw security log data;
2) extracting feature data from the preprocessed result;
3) clustering the feature data, determining each behavior sample in the raw security log data to be an abnormal behavior sample or a normal behavior sample, and putting it into an abnormal behavior library or a normal behavior library accordingly;
4) acquiring new behavior sample data in the new security log, and comparing the sample data with the normal behavior library and the abnormal behavior library to determine that the sample data is a normal behavior or an abnormal behavior;
5) updating the normal behavior library or the abnormal behavior library by using the new behavior sample data;
6) when the normal behavior library and the abnormal behavior library have sample data of normal behaviors and abnormal behaviors with the same quantity, jumping to the step 7), otherwise, jumping to the step 4);
7) training a random forest model by using sample data in the normal behavior library and the abnormal behavior library, respectively deploying the random forest models obtained through training in a real-time processing module and an offline processing module to judge the abnormal behavior of subsequent new behavior sample data, copying the sample data into two same sample data after inputting the new behavior sample data, and respectively inputting the sample data into the real-time processing module and the offline processing module;
the real-time processing module provides streaming computing capability, judges user behaviors in a quasi-real-time mode, and stores a judgment result into a high-performance database providing real-time data service for a user; the offline processing module provides batch processing capability of mass data and is used for training a model and performing batch offline judgment, the offline processing module comprises a plurality of timing tasks, data sets are processed in a full or incremental mode, and judgment results are stored in the high-performance database;
8) updating the normal behavior library or the abnormal behavior library by using the new sample behavior data in the step 7).
2. The method of claim 1, wherein the feature data extracted in step 2) comprises: the time of the user using the terminal, the operation behavior category and the operation file type; vectorizing the extracted feature data.
3. The method of claim 1, wherein step 3) comprises clustering the feature data using MLlib, specifically: determining K cluster centers using a Canopy algorithm and then performing K-Means clustering; after clustering, marking any class that contains fewer instances than a certain threshold as an abnormal class and marking its instances as abnormal behaviors; and marking the other classes as normal classes and their instances as normal behaviors.
4. The method of claim 1, the step 4) comprising: randomly extracting a part of sample data from the normal behavior library for a KNN algorithm to find abnormal behaviors, wherein if the distances between the new behavior sample data and the randomly extracted sample data are both greater than a set threshold value, the behavior of the new behavior sample data is an abnormal behavior, and otherwise, the new behavior sample data is a normal behavior; if the abnormal behavior is manually judged to be normal behavior, the abnormal behavior is normal behavior; and updating the normal behavior library or the abnormal behavior library by using the sample data corresponding to the normal behavior or the abnormal behavior respectively.
5. An abnormal behavior discovery system based on big data machine learning, comprising:
a preprocessing module, used for preprocessing the raw security log data;
a feature data extraction module, used for extracting feature data from the preprocessed result;
a clustering module, used for clustering the feature data, determining each behavior sample in the raw security log data to be an abnormal behavior sample or a normal behavior sample, and putting it into an abnormal behavior library or a normal behavior library accordingly;
the behavior library generation module is used for acquiring new behavior sample data in the new security log, and comparing the sample data with the normal behavior library and the abnormal behavior library to determine that the sample data is normal behavior or abnormal behavior;
the updating module is used for updating the normal behavior library or the abnormal behavior library by using the new behavior sample data;
the behavior judging module is used for training a random forest model by using sample data in the normal behavior library and the abnormal behavior library, deploying the random forest model obtained through training in the real-time processing module and the offline processing module respectively, judging abnormal behaviors by using subsequent new behavior sample data, copying the new behavior sample data into two same sample data after inputting the new behavior sample data, and inputting the two same sample data into the real-time processing module and the offline processing module respectively; the real-time processing module provides streaming computing capability, judges user behaviors in a quasi-real-time mode, and stores a judgment result into a high-performance database providing real-time data service for a user; the offline processing module provides batch processing capability of mass data and is used for training a model and performing batch offline judgment, the offline processing module comprises a plurality of timing tasks, data sets are processed in a full or incremental mode, and judgment results are stored in the high-performance database;
and the updating module updates the normal behavior library or the abnormal behavior library by using the new sample behavior data processed by the behavior judging module.
6. The system of claim 5, wherein the extracted feature data comprises: the time at which the user uses the terminal, the operation type, and the operation file type; and the extracted feature data is vectorized.
7. The system of claim 5, wherein the clustering module clusters the feature data by using MLlib, specifically: determining K cluster centers by using the Canopy algorithm and then performing K-Means clustering; after clustering, marking each class that contains fewer instances than a set threshold as an abnormal class and the instances in it as abnormal behaviors; and marking the other classes as normal classes and the instances in them as normal behaviors.
8. The system of claim 5, wherein the behavior library generation module randomly extracts part of the sample data from the normal behavior library for a KNN algorithm to find abnormal behaviors; if the distances between the new behavior sample data and all of the randomly extracted sample data are greater than a set threshold, the behavior of the new behavior sample data is an abnormal behavior, and otherwise it is a normal behavior; if a behavior judged abnormal is manually judged to be a normal behavior, it is treated as a normal behavior; and the normal behavior library or the abnormal behavior library is updated with the sample data corresponding to the normal behavior or the abnormal behavior, respectively.
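The feature vectorization and the distance test against the normal behavior library can be sketched as follows. This is a hedged illustration only: the category lists, the one-hot encoding, the function names (`vectorize`, `judge`, `process`), and the threshold and sample-size values are all assumptions, since the claims do not fix an encoding or parameter values.

```python
import math
import random

OPERATIONS = ["read", "write", "copy", "delete"]    # hypothetical operation categories
FILE_TYPES = ["doc", "source", "archive", "other"]  # hypothetical file types


def vectorize(hour, operation, file_type):
    """Encode one behavior as [hour/24, one-hot operation, one-hot file type]."""
    vec = [hour / 24.0]
    vec += [1.0 if operation == op else 0.0 for op in OPERATIONS]
    vec += [1.0 if file_type == ft else 0.0 for ft in FILE_TYPES]
    return vec


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def judge(sample, normal_library, threshold, sample_size, rng):
    """Abnormal only if the sample is farther than threshold from every drawn reference."""
    refs = rng.sample(normal_library, min(sample_size, len(normal_library)))
    far_from_all = all(distance(sample, ref) > threshold for ref in refs)
    return "abnormal" if far_from_all else "normal"


def process(sample, normal_library, abnormal_library, threshold, sample_size, rng,
            manual_override=None):
    """Judge a new sample, apply an optional manual correction, update a library."""
    label = judge(sample, normal_library, threshold, sample_size, rng)
    if label == "abnormal" and manual_override == "normal":
        label = "normal"  # a reviewer decided the flagged behavior is acceptable
    target = normal_library if label == "normal" else abnormal_library
    target.append(sample)
    return label


if __name__ == "__main__":
    rng = random.Random(0)
    normal = [vectorize(9, "read", "doc"), vectorize(10, "write", "doc"),
              vectorize(14, "read", "source")]
    abnormal = []
    # A 3 a.m. archive deletion is far from every daytime reference sample.
    print(process(vectorize(3, "delete", "archive"), normal, abnormal, 1.0, 3, rng))
```

Drawing only a random subset of the normal library keeps the comparison cheap at the cost of occasional false alarms, which is where the manual review step in the claim comes in.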
9. An abnormal behavior processing system based on big data machine learning, comprising: a data service module, a real-time processing module, and an offline processing module;
the data service module forms a normal behavior library and an abnormal behavior library based on the method of any one of claims 1-4;
a random forest model is trained with sample data in the normal behavior library and the abnormal behavior library, and the trained random forest model is deployed in the real-time processing module and the offline processing module respectively;
after new sample data is input into the system, it is copied into two identical copies, which are input into the real-time processing module and the offline processing module respectively to judge whether the sample data is abnormal behavior;
the real-time processing module provides streaming computing capability, judges user behaviors in near real time, and stores the judgment results in a high-performance database that provides real-time data service to users;
the offline processing module provides batch processing capability for massive data and is used for model training and batch offline judgment; it comprises a plurality of scheduled tasks, processes data sets in full or incremental mode, and stores the judgment results in the high-performance database.
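The dual-path arrangement of claim 9 (each new sample is copied to a real-time path and to an offline path, and both write to one result store) can be sketched as below. This is an illustration under stated assumptions: `ResultStore`, `RealTimePath`, `OfflinePath`, `dispatch`, and `toy_model` are hypothetical names, the store is an in-memory dictionary standing in for the high-performance database, and `toy_model` is a one-line placeholder for the trained random forest, not an implementation of it.

```python
class ResultStore:
    """In-memory stand-in for the high-performance database named in the claim."""

    def __init__(self):
        self.rows = {}

    def save(self, sample_id, path, label):
        self.rows[(sample_id, path)] = label


class RealTimePath:
    """Judges each sample as soon as it arrives (streaming-style)."""

    def __init__(self, model, store):
        self.model = model
        self.store = store

    def submit(self, sample_id, features):
        self.store.save(sample_id, "realtime", self.model(features))


class OfflinePath:
    """Buffers samples and judges them when a scheduled task runs."""

    def __init__(self, model, store):
        self.model = model
        self.store = store
        self.pending = []

    def submit(self, sample_id, features):
        self.pending.append((sample_id, features))

    def run_incremental_task(self):
        """Process only the samples received since the previous run."""
        processed = len(self.pending)
        for sample_id, features in self.pending:
            self.store.save(sample_id, "offline", self.model(features))
        self.pending.clear()
        return processed


def dispatch(sample_id, features, realtime, offline):
    """Copy the sample and hand one copy to each processing path."""
    realtime.submit(sample_id, list(features))
    offline.submit(sample_id, list(features))


def toy_model(features):
    """Placeholder for the trained random forest: a single threshold rule."""
    return "abnormal" if features[0] > 0.8 else "normal"


if __name__ == "__main__":
    store = ResultStore()
    realtime = RealTimePath(toy_model, store)
    offline = OfflinePath(toy_model, store)
    dispatch("s1", [0.9, 0.1], realtime, offline)
    dispatch("s2", [0.2, 0.3], realtime, offline)
    offline.run_incremental_task()
    print(store.rows)
```

Keeping the same model object behind both paths mirrors the claim's statement that one trained model is deployed in both modules; in practice the offline path would also retrain the model and run full reprocessing tasks.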
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611232408.7A CN106778259B (en) | 2016-12-28 | 2016-12-28 | Abnormal behavior discovery method and system based on big data machine learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106778259A (en) | 2017-05-31 |
CN106778259B (en) | 2020-01-10 |
Family
ID=58921432
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611232408.7A Active CN106778259B (en) | 2016-12-28 | 2016-12-28 | Abnormal behavior discovery method and system based on big data machine learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106778259B (en) |
Families Citing this family (43)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107404473A (en) * | 2017-06-06 | 2017-11-28 | 西安电子科技大学 | Based on Mshield machine learning multi-mode Web application means of defences |
CN107291911B (en) * | 2017-06-26 | 2020-01-21 | 北京奇艺世纪科技有限公司 | Anomaly detection method and device |
CN107341095B (en) * | 2017-06-27 | 2020-07-28 | 北京优特捷信息技术有限公司 | Method and device for intelligently analyzing log data |
CN107426199B (en) * | 2017-07-05 | 2020-10-30 | 浙江鹏信信息科技股份有限公司 | Method and system for detecting and analyzing network abnormal behaviors |
CN107204991A (en) * | 2017-07-06 | 2017-09-26 | 深信服科技股份有限公司 | Server anomaly detection method and system |
US10419468B2 (en) * | 2017-07-11 | 2019-09-17 | The Boeing Company | Cyber security system with adaptive machine learning features |
CN107707541A (en) * | 2017-09-28 | 2018-02-16 | 小花互联网金融服务(深圳)有限公司 | Streaming machine-learning-based real-time attack log detection method |
CN108011809A (en) * | 2017-12-04 | 2018-05-08 | 北京明朝万达科技股份有限公司 | Anti-data-leakage analysis method and system based on user behavior and document content |
CN108319851B (en) * | 2017-12-12 | 2022-03-11 | 中国电子科技集团公司电子科学研究院 | Abnormal behavior active detection method, equipment and storage medium |
CN108040052A (en) * | 2017-12-13 | 2018-05-15 | 北京明朝万达科技股份有限公司 | Network security threat analysis method and system based on NetFlow log data |
CN107968840B (en) * | 2017-12-15 | 2020-10-09 | 华北电力大学(保定) | Real-time processing method and system for monitoring alarm data of large-scale power equipment |
CN108416376B (en) * | 2018-02-27 | 2021-03-12 | 北京东方天得科技有限公司 | SVM-based logistics man-vehicle tracking monitoring management system and method |
CN108512841B (en) * | 2018-03-23 | 2021-03-16 | 四川长虹电器股份有限公司 | Intelligent defense system and method based on machine learning |
CN108718296A (en) * | 2018-04-27 | 2018-10-30 | 广州西麦科技股份有限公司 | Network management-control method, device and computer readable storage medium based on SDN network |
CN113159145A (en) * | 2018-04-28 | 2021-07-23 | 华为技术有限公司 | Characteristic engineering arrangement method and device |
CN108614895B (en) * | 2018-05-10 | 2020-09-29 | 中国移动通信集团海南有限公司 | Abnormal data access behavior identification method and data processing device |
CN108717510A (en) * | 2018-05-11 | 2018-10-30 | 深圳市联软科技股份有限公司 | Method, system and terminal for clustering abnormal file operation behavior |
CN108737222A (en) * | 2018-06-29 | 2018-11-02 | 山东汇贸电子口岸有限公司 | Real-time server anomaly monitoring method based on data extraction |
CN108769079A (en) * | 2018-07-09 | 2018-11-06 | 四川大学 | Web intrusion detection technique based on machine learning |
CA3105858A1 (en) * | 2018-07-12 | 2020-01-16 | Cyber Defence Qcd Corporation | Systems and methods of cyber-monitoring which utilizes a knowledge database |
CN109189819B (en) * | 2018-07-12 | 2021-08-24 | 华南师范大学 | Mobile k neighbor differential query method, system and device |
CN110738827A (en) * | 2018-07-20 | 2020-01-31 | 珠海格力电器股份有限公司 | Abnormity early warning method, system, device and storage medium of electric appliance |
CN109218077A (en) * | 2018-08-14 | 2019-01-15 | 阿里巴巴集团控股有限公司 | Prediction method, device, electronic equipment and storage medium for target device |
CN109255001A (en) * | 2018-08-31 | 2019-01-22 | 阿里巴巴集团控股有限公司 | Maintenance method and device for interface instance library, and electronic equipment |
CN109034140B (en) * | 2018-09-13 | 2021-05-04 | 哈尔滨工业大学 | Industrial control network signal abnormity detection method based on deep learning structure |
CN109472293A (en) * | 2018-10-12 | 2019-03-15 | 国家电网有限公司 | Grid equipment file data error correction method based on machine learning |
CN109359098B (en) * | 2018-10-31 | 2023-04-11 | 云南电网有限责任公司 | System and method for monitoring scheduling data network behaviors |
CN109472484B (en) * | 2018-11-01 | 2021-08-03 | 凌云光技术股份有限公司 | Production process abnormity recording method based on flow chart |
CN109871954B (en) * | 2018-12-24 | 2022-12-02 | 腾讯科技(深圳)有限公司 | Training sample generation method, abnormality detection method and apparatus |
CN109739846A (en) * | 2018-12-27 | 2019-05-10 | 国电南瑞科技股份有限公司 | Power grid data quality analysis method |
CN110210512B (en) * | 2019-04-19 | 2024-03-26 | 北京亿阳信通科技有限公司 | Automatic log anomaly detection method and system |
CN110517469A (en) * | 2019-08-08 | 2019-11-29 | 武汉兴图新科电子股份有限公司 | Intelligent alarm convergence method suitable for audio-video convergence platform |
CN111026653B (en) * | 2019-09-16 | 2022-04-08 | 腾讯科技(深圳)有限公司 | Abnormal program behavior detection method and device, electronic equipment and storage medium |
CN112784862A (en) * | 2019-11-07 | 2021-05-11 | 中国石油化工股份有限公司 | Fault diagnosis and identification method for refining process of atmospheric and vacuum distillation unit |
CN110889441B (en) * | 2019-11-19 | 2023-07-25 | 海南电网有限责任公司海南输变电检修分公司 | Power transformation equipment data anomaly identification method based on distance and point density |
CN110889451B (en) * | 2019-11-26 | 2023-07-07 | Oppo广东移动通信有限公司 | Event auditing method, device, terminal equipment and storage medium |
CN111597549A (en) * | 2020-04-17 | 2020-08-28 | 国网浙江省电力有限公司湖州供电公司 | Network security behavior identification method and system based on big data |
CN112001533A (en) * | 2020-08-06 | 2020-11-27 | 众安信息技术服务有限公司 | Parameter detection method and device and computer system |
CN112488226B (en) * | 2020-12-10 | 2022-11-01 | 中国电子科技集团公司第三十研究所 | Terminal abnormal behavior identification method based on machine learning algorithm |
CN112383575B (en) * | 2021-01-18 | 2021-05-04 | 北京晶未科技有限公司 | Method, electronic device and electronic equipment for information security |
CN112926773A (en) * | 2021-02-23 | 2021-06-08 | 深圳市北斗智能科技有限公司 | Riding safety early warning method and device, electronic equipment and storage medium |
CN113722707A (en) * | 2021-11-02 | 2021-11-30 | 西安热工研究院有限公司 | Database abnormal access detection method, system and equipment based on distance measurement |
CN115269981A (en) * | 2022-03-03 | 2022-11-01 | 陈林 | Abnormal behavior analysis method and system combined with artificial intelligence |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103095711A (en) * | 2013-01-18 | 2013-05-08 | 重庆邮电大学 | Application layer distributed denial of service (DDoS) attack detection method and defensive system aimed at website |
CN104735074A (en) * | 2015-03-31 | 2015-06-24 | 江苏通付盾信息科技有限公司 | Malicious URL detection method and implement system thereof |
CN104954453A (en) * | 2015-06-02 | 2015-09-30 | 浙江工业大学 | Data mining REST service platform based on cloud computing |
CN105553998A (en) * | 2015-12-23 | 2016-05-04 | 中国电子科技集团公司第三十研究所 | Network attack abnormality detection method |
CN105677615A (en) * | 2016-01-04 | 2016-06-15 | 北京邮电大学 | Distributed machine learning method based on weka interface |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105224872B (en) * | 2015-09-30 | 2018-04-13 | 河南科技大学 | A kind of user's anomaly detection method based on neural network clustering |
Also Published As
Publication number | Publication date |
---|---|
CN106778259A (en) | 2017-05-31 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN106778259B (en) | Abnormal behavior discovery method and system based on big data machine learning | |
WO2021184630A1 (en) | Method for locating pollutant discharge object on basis of knowledge graph, and related device | |
CN111639497B (en) | Abnormal behavior discovery method based on big data machine learning | |
CN106682527B (en) | Data security control method and system based on data classification and grading | |
CN105302911B (en) | Method for building a data screening engine, and data screening engine | |
CN107992746A (en) | Malicious act method for digging and device | |
CN112468347B (en) | Security management method and device for cloud platform, electronic equipment and storage medium | |
CN104765733A (en) | Method and device for analyzing social network event | |
CN111339297A (en) | Network asset anomaly detection method, system, medium, and device | |
CN111126820A (en) | Electricity stealing prevention method and system | |
CN105376193A (en) | Intelligent association analysis method and intelligent association analysis device for security events | |
CN113918367A (en) | Large-scale system log anomaly detection method based on attention mechanism | |
CN110851422A (en) | Data anomaly monitoring model construction method based on machine learning | |
CN112202718B (en) | XGboost algorithm-based operating system identification method, storage medium and device | |
CN112367303A (en) | Distributed self-learning abnormal flow cooperative detection method and system | |
CN114841268B (en) | Abnormal power customer identification method based on Transformer and LSTM fusion algorithm | |
CN109660656A (en) | Intelligent terminal application program identification method | |
CN117220920A (en) | Firewall policy management method based on artificial intelligence | |
CN115277113A (en) | Power grid network intrusion event detection and identification method based on ensemble learning | |
CN112532652A (en) | Attack behavior portrait device and method based on multi-source data | |
CN110716957B (en) | Intelligent mining and analyzing method for class case suspicious objects | |
Harbola et al. | Improved intrusion detection in DDoS applying feature selection using rank & score of attributes in KDD-99 data set | |
CN111343143A (en) | Data identification method, device and storage medium | |
CN105930430B (en) | Real-time fraud detection method and device based on non-accumulative attribute | |
CN116865994A (en) | Network data security prediction method based on big data |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||