CN114039837B - Alarm data processing method, device, system, equipment and storage medium

Info

Publication number: CN114039837B
Application number: CN202111309066.5A
Authority: CN
Inventors: 范敏
Original assignee: Qianxin Technology Group Co Ltd; Secworld Information Technology Beijing Co Ltd
Current assignee: Qianxin Technology Group Co Ltd; Secworld Information Technology Beijing Co Ltd
Priority date: 2021-11-05
Filing date: 2021-11-05
Publication date: 2023-10-31
Anticipated expiration: 2041-11-05
Also published as: CN114039837A

Abstract

The application provides an alarm data processing method, a device, a system, equipment and a storage medium, wherein the method comprises the following steps: acquiring alarm data to be processed in a database; extracting a multidimensional feature set of the alarm data; and inputting the multidimensional feature set into a preset anomaly detection integrated model, and outputting anomaly IP information for generating the alarm data. The application realizes the automatic prediction batch processing of the alarm data, obtains the detection result of the abnormal IP information in the alarm data, does not need human participation, greatly saves the labor cost of the alarm data processing and improves the alarm processing efficiency.

Description

Alarm data processing method, device, system, equipment and storage medium

Technical Field

The present application relates to the field of information processing technologies, and in particular, to a method, an apparatus, a system, a device, and a storage medium for processing alarm data.

Background

Enterprise data centers (IDC, Internet Data Center) typically deploy a large number of security devices to build network security systems. These security devices generate a large number of alarm logs based on mirrored traffic, and the logs eventually converge on a network security situation awareness platform (NSSA, Network Security Situation Awareness). Security analysis teams usually rely on such platforms to handle alarm events manually, drawing on the experience of security experts. Because actual alarm logs contain a large number of false alarms, repeated alarms, and invalid alarms that cause no attack harm, real and effective attacks are often submerged in massive alarm logs. It is therefore difficult for a security analysis team to accurately find effective attack alarm events, which creates many difficulties and hidden dangers for the security operation of a data center.

Two general approaches exist to the problem that real and effective attacks are submerged in massive alarm logs:

1) Improve the detection effect of the security devices (e.g., IDS, WAF) at the early stage. This approach starts from the traffic, and two methods are generally used to improve detection: one is to refine the rule engine based on manual experience; the other is to use machine learning modeling to reduce false negatives and false positives.

2) Screen and filter the alarm logs at the later stage. This approach starts from the alarm logs of the security devices, and two methods are used for filtering: one is for security personnel to reconfirm alarms; the other is multi-level association rules, in which an expert system and a rule engine are used to dynamically configure rules and to filter and merge alarms.

In approach 2), alarm reconfirmation must be performed by security personnel, yet security analysts can hardly process alarms one by one, and the resulting alarm fatigue affects their judgment. Multi-level association rules depend on association rules and empirical rules derived from manual knowledge and experience; in actual service scenarios, maintaining these rules is very costly, and a great deal of manpower is needed to update and modify them. Therefore, how to improve the security operation efficiency of the data center is a technical problem to be solved.

Disclosure of Invention

The embodiments of the application aim to provide an alarm data processing method, apparatus, system, device, and storage medium that realize automatic prediction batch processing of alarm data and obtain detection results for the abnormal IP information in the alarm data without human participation, thereby greatly saving the labor cost of alarm data processing and improving alarm processing efficiency.

A first aspect of an embodiment of the present application provides a method for processing alarm data, including: acquiring alarm data to be processed in a database; extracting a multidimensional feature set of the alarm data; and inputting the multidimensional feature set into a preset anomaly detection integrated model, and outputting abnormal IP information for the IPs that generated the alarm data.

In one embodiment, the anomaly detection integration model includes: a plurality of anomaly detection base models; the step of inputting the multidimensional feature set into a preset anomaly detection integrated model, outputting anomaly IP information for generating the alarm data, comprises the following steps: inputting each sub-feature set in the multi-dimensional feature set to a plurality of abnormal detection base models respectively, and outputting a plurality of initial detection results for generating abnormal IP information of the alarm data; and combining the plurality of initial detection results according to a preset combination strategy to obtain a final detection result of the abnormal IP information generating the alarm data.

In one embodiment, the step of establishing the anomaly detection integration model includes: acquiring sample alarm data in the database; extracting a multi-dimensional sample feature set of the sample alarm data; respectively training a plurality of local anomaly detection algorithm models with different neighbor scale parameters by adopting each sample sub-feature set in the multi-dimensional sample feature set to obtain a plurality of anomaly detection base models; and carrying out combination processing on detection results of the plurality of abnormal detection base models according to a preset combination strategy to generate the abnormal detection integrated model.

In one embodiment, the step of establishing the anomaly detection integrated model further includes: acquiring updated alarm data in a first time period from the database at intervals of the first time period; extracting a multidimensional updating feature set of the updating alarm data; respectively training the local anomaly detection algorithm models with different neighbor scale parameters by adopting each updated sub-feature set in the multi-dimensional updated feature set to obtain a plurality of updated base models, and combining the plurality of updated base models according to the preset combining strategy; and updating the anomaly detection integrated model according to the model file combined by the plurality of updating base models.

In an embodiment, the updating the anomaly detection integrated model according to the update model file after the plurality of update base models are combined includes: and verifying the updated model file, and if the output precision of the updated model file reaches a preset threshold value, covering the original file of the abnormality detection integrated model with the updated model file to obtain an updated abnormality detection integrated model.

In an embodiment, the preset combination strategy is an average-result strategy that averages the detection results over the number of neighbor scale parameters and over the dimensions of the feature set, respectively.

In an embodiment, inputting the multi-dimensional feature set into a preset anomaly detection integration model, outputting anomaly IP information for generating the alarm data, includes: and inputting the multidimensional feature set into the anomaly detection integrated model every second time period to obtain a plurality of pieces of anomaly IP information in the alarm data in the second time period, and outputting a sequencing result of the plurality of pieces of anomaly IP information based on the anomaly score of each anomaly IP.

In an embodiment, before the obtaining the alarm data to be processed in the database, the method further includes: and collecting an original alarm data set, and carrying out standardization processing on the original alarm data set to generate the database, wherein the database comprises attack types of each piece of alarm data.

A second aspect of an embodiment of the present application provides an alarm data processing apparatus, including: an acquisition module, used for acquiring alarm data to be processed in the database; an extraction module, used for extracting the multi-dimensional feature set of the alarm data; and a detection module, used for inputting the multidimensional feature set into a preset anomaly detection integrated model and outputting abnormal IP information for the IPs that generated the alarm data.

In one embodiment, the anomaly detection integration model includes: a plurality of anomaly detection base models; the detection module is used for: inputting each sub-feature set in the multi-dimensional feature set to a plurality of abnormal detection base models respectively, and outputting a plurality of initial detection results for generating abnormal IP information of the alarm data; and combining the plurality of initial detection results according to a preset combination strategy to obtain a final detection result of the abnormal IP information generating the alarm data.

In an embodiment, the apparatus further includes a model building module for: acquiring sample alarm data in the database; extracting a multi-dimensional sample feature set of the sample alarm data; respectively training a plurality of local anomaly detection algorithm models with different neighbor scale parameters by adopting each sample sub-feature set in the multi-dimensional sample feature set to obtain a plurality of anomaly detection base models; and carrying out combination processing on detection results of the plurality of anomaly detection base models according to a preset combination strategy to generate the anomaly detection integrated model.

In an embodiment, the model building module is further configured to: acquiring updated alarm data in a first time period from the database at intervals of the first time period; extracting a multidimensional updating feature set of the updating alarm data; respectively training the local anomaly detection algorithm models with different neighbor scale parameters by adopting each updated sub-feature set in the multi-dimensional updated feature set to obtain a plurality of updated base models, and combining the plurality of updated base models according to the preset combining strategy; and updating the anomaly detection integrated model according to the model file combined by the plurality of updating base models.

In an embodiment, the detection module is further configured to: and inputting the multidimensional feature set into the anomaly detection integrated model every second time period to obtain a plurality of pieces of anomaly IP information in the alarm data in the second time period, and outputting a sequencing result of the plurality of pieces of anomaly IP information based on the anomaly score of each anomaly IP.

In one embodiment, the apparatus further comprises: a standardization module, used for collecting an original alarm data set before the alarm data to be processed in the database is acquired, and for standardizing the original alarm data set to generate the database, wherein the database comprises the attack type to which each piece of alarm data belongs.

A third aspect of an embodiment of the present application provides an alarm data processing system, including: the data layer is used for storing alarm data and generating a database; the feature layer is used for periodically reading alarm data from the database of the data layer and extracting a multidimensional feature set of the alarm data; the deployment module is deployed with an anomaly detection integrated model and is used for predicting the anomaly IP information for generating the alarm data based on the feature set of the feature layer; the model training module is used for updating the anomaly detection integrated model of the deployment module based on the feature set of the feature layer; and the output layer is used for outputting the prediction result of the deployment module.

A fourth aspect of an embodiment of the present application provides an electronic device, including: a memory for storing a computer program and data; a processor for executing the computer program to implement the method of the first aspect of the embodiment of the present application and any of the embodiments thereof.

A fifth aspect of an embodiment of the present application provides a non-transitory electronic device readable storage medium, comprising: a program which, when run by an electronic device, causes the electronic device to perform the method of the first aspect of the embodiments of the application and any of its embodiments.

According to the alarm data processing method, the device, the system, the equipment and the storage medium, the alarm data to be processed are obtained from the database, the multi-dimensional feature extraction is carried out on the alarm data, the multi-dimensional feature set is obtained, then the multi-dimensional feature set is input into the pre-configured anomaly detection integrated model, the automatic prediction batch processing of the alarm data can be realized, the detection result of the anomaly IP information in the alarm data is obtained, the human participation is not needed, the labor cost of the alarm data processing is greatly saved, and the alarm processing efficiency is improved.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and should not be considered as limiting the scope, and other related drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic diagram of an electronic device according to an embodiment of the application;

FIG. 2A is a diagram illustrating an application architecture of an alarm data processing system according to an embodiment of the present application;

FIG. 2B is a schematic diagram illustrating an alarm data processing system according to an embodiment of the present application;

FIG. 2C is a schematic diagram illustrating a distribution scenario of an alarm log according to an embodiment of the present application;

FIG. 3 is a flowchart of an alarm data processing method according to an embodiment of the present application;

FIG. 4A is a flowchart illustrating an alarm data processing method according to an embodiment of the present application;

FIG. 4B is a schematic diagram of a feature subspace anomaly according to an embodiment of the present application;

FIG. 4C is a schematic diagram of local anomalies and global anomalies according to one embodiment of the present application;

FIG. 5 is a schematic diagram illustrating an alarm data processing apparatus according to an embodiment of the application.

Detailed Description

The technical solutions in the embodiments of the present application will be described below with reference to the accompanying drawings in the embodiments of the present application. In the description of the present application, the terms "first," "second," and the like are used merely to distinguish between descriptions and are not to be construed as indicating or implying relative importance.

As shown in fig. 1, the present embodiment provides an electronic device 1 including: at least one processor 11 and a memory 12, one processor being exemplified in fig. 1. The processor 11 and the memory 12 are connected by a bus 10. The memory 12 stores instructions executable by the processor 11; when the instructions are executed by the processor 11, the electronic device 1 can perform all or part of the methods in the embodiments described below, so as to implement automatic prediction batch processing of the alarm data and obtain the abnormal IP information in the alarm data.

In an embodiment, the electronic device 1 may be a mobile phone, a tablet computer, a notebook computer, a desktop computer, or a large computer system composed of a plurality of computers.

In a real production environment, once a machine learning system is deployed at a customer or first-party site, updating the model manually and offline is difficult and costly because of network isolation and similar constraints. A machine learning application should therefore be designed so that the learned model is updated automatically online, based on the latest data of the customer or first party.

Referring to FIG. 2A, the application architecture of an alarm data processing system according to an embodiment of the present application mainly includes two parts: prediction batch processing and learning batch processing.

In the learning batch stage, the feature extractor reads alarm logs from the database in which they were stored after normalization and forms feature vectors. The learner is the core algorithm module that trains the prediction model. In the prediction batch stage, the feature extractor extracts features from the database, prediction is performed with the prediction model, and the prediction results are output through a Web application. With this architecture the model no longer depends on manual offline updates: it is retrained at regular intervals on the alarm log data of the most recent time window, updated online, and used for batch prediction.

The application architecture combines batch learning with retrieval of prediction results through a database to build an online model-updating scheme. This solves the online model update problem at the system level, and because the Web application and the machine learning system are decoupled through the database, no dependency on a specific programming language arises.
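The decoupling described above can be illustrated with a minimal sketch. It is not the implementation of the embodiment: the table names, the single `alert_count` feature, and the mean/standard-deviation "model" are all hypothetical placeholders, and an in-memory SQLite database stands in for the shared database. The point shown is only that the learning batch, the prediction batch, and the Web application communicate exclusively through database tables.

```python
import json
import sqlite3
import statistics


def learning_batch(conn):
    """Learning batch: read stored features, fit a model, persist it in the database."""
    counts = [row[0] for row in conn.execute("SELECT alert_count FROM features").fetchall()]
    # Placeholder "model" (assumption): mean and standard deviation of one feature.
    model = {"mean": statistics.mean(counts), "std": statistics.pstdev(counts) or 1.0}
    conn.execute("INSERT INTO models (payload) VALUES (?)", (json.dumps(model),))
    conn.commit()


def prediction_batch(conn):
    """Prediction batch: load the newest model, score every source IP, store the results."""
    payload = conn.execute("SELECT payload FROM models ORDER BY id DESC LIMIT 1").fetchone()[0]
    model = json.loads(payload)
    rows = conn.execute("SELECT src_ip, alert_count FROM features").fetchall()
    conn.execute("DELETE FROM predictions")
    for src_ip, count in rows:
        score = abs(count - model["mean"]) / model["std"]
        conn.execute("INSERT INTO predictions (src_ip, score) VALUES (?, ?)", (src_ip, score))
    conn.commit()


def web_query(conn, k):
    """What a Web application would do: read ranked results straight from the database."""
    sql = "SELECT src_ip, score FROM predictions ORDER BY score DESC LIMIT ?"
    return conn.execute(sql, (k,)).fetchall()


def demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE features (src_ip TEXT PRIMARY KEY, alert_count REAL);
        CREATE TABLE models (id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT);
        CREATE TABLE predictions (src_ip TEXT PRIMARY KEY, score REAL);
    """)
    conn.executemany(
        "INSERT INTO features VALUES (?, ?)",
        [("10.0.0.1", 5), ("10.0.0.2", 6), ("10.0.0.3", 4), ("203.0.113.9", 90)],
    )
    learning_batch(conn)
    prediction_batch(conn)
    return web_query(conn, 1)
```

Because each stage touches only the database, the two batches can run on independent schedules, and the Web application can be written in any language that can query the same tables.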

In an embodiment, based on the architecture of fig. 2A and as shown in fig. 2B, the alarm data processing system of the present embodiment may mainly include: a data layer, a feature layer, a deployment module, a model training module, and an output layer, wherein:

The data layer is used for storing alarm data and generating a database. The alarm data may be alarm logs converged to the situation awareness platform. In an actual scenario, the situation awareness platform often needs to ingest external alarm data from many different security vendors and many different security devices, and the alarm data structures of different devices are generally inconsistent. The original alarm data can therefore be normalized to a standard form through CAPEC (Common Attack Pattern Enumeration and Classification), the database is generated from the processed alarm data, and the database is stored in the big data analysis platform.

The feature layer is used for periodically reading the alarm data from the database of the data layer and extracting a multidimensional feature set of the alarm data. For example, every hour, statistical features of each source IP (the IP that generated the alarm logs) are computed along multiple dimensions from the alarm log database through feature engineering. These features characterize the behavior of the source IP from multiple dimensions, and the multidimensional feature set is stored in a MySQL (a relational database management system) database for periodic model training. At the same time, the feature data of the latest batch is retained for model deployment and prediction.
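As a hedged illustration of the hourly per-source-IP statistics, the following sketch aggregates normalized alarm records into one feature vector per source IP and hour. The record fields (`src_ip`, `dst_ip`, `attack_type`, `timestamp`) and the three features are assumptions chosen for illustration; the embodiment does not enumerate its features here.

```python
from collections import defaultdict
from datetime import datetime


def extract_features(alerts):
    """Aggregate alerts into one feature vector per (source IP, hour) window."""
    buckets = defaultdict(list)
    for alert in alerts:
        # Truncate the timestamp to the start of its hour to form the window key.
        hour = datetime.fromisoformat(alert["timestamp"]).replace(
            minute=0, second=0, microsecond=0
        )
        buckets[(alert["src_ip"], hour)].append(alert)

    features = {}
    for key, group in buckets.items():
        features[key] = {
            # Hypothetical statistical features; real deployments would define their own.
            "alert_count": len(group),
            "distinct_dst_ips": len({a["dst_ip"] for a in group}),
            "distinct_attack_types": len({a["attack_type"] for a in group}),
        }
    return features
```

Each resulting dictionary is one row of the multidimensional feature set; writing these rows to a table is what would make them available to both the training and the prediction batches.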

The deployment module is deployed with the anomaly detection integrated model and is used for predicting, based on the feature set of the feature layer, the abnormal IP information for the IPs that generated the alarm data. The deployment module reads the feature data and the model file of the latest batch, predicts the anomaly score of each source IP, and performs Top-K ranking based on those scores. To interpret an anomaly score produced by the algorithm, the z-score of each feature, that is, its deviation from the mean in units of standard deviation, may be used.
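The Top-K ranking and the z-score interpretation can be sketched as follows. This is an illustrative sketch only: the function names and the dictionary layout (one feature dictionary per IP) are assumptions, and the anomaly scores are taken as given rather than computed by the integrated model.

```python
import statistics


def top_k(scores, k):
    """Return the k (ip, score) pairs with the highest anomaly scores."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


def z_scores(features, ip):
    """Per-feature z-score of one IP, measured against all IPs in the batch."""
    result = {}
    for name in features[ip]:
        column = [row[name] for row in features.values()]
        mean = statistics.mean(column)
        std = statistics.pstdev(column)
        # A constant feature carries no information, so report 0 instead of dividing by 0.
        result[name] = 0.0 if std == 0 else (features[ip][name] - mean) / std
    return result
```

A feature with a large absolute z-score is one on which the IP deviates strongly from the batch, which gives a reviewer a first hint as to why the IP received a high anomaly score.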

The model training module is used for updating the anomaly detection integrated model of the deployment module based on the feature set of the feature layer. The model training module may include an algorithm layer and model evaluation and verification. The algorithm layer is configured with multiple unsupervised anomaly detection models and can train them on the multidimensional feature set to obtain the anomaly detection integrated model. The updated multidimensional feature set can be read from the feature layer periodically (for example, every 24 hours) to update the anomaly detection integrated model. Periodic batch learning combined with asynchronous data updates thus keeps the whole model learning online and avoids the problem of data distribution drift over time.

Further, the updated model can be evaluated and verified: if verification succeeds, the original model file is overwritten; otherwise, the original model file is retained. Evaluation and verification may follow a worst-case test, in which the output of the updated model is compared with the alarms generated by rules. If the overlap between the IP set inferred by the system and the set of IPs whose rule-generated danger level is below a certain threshold exceeds a certain quantity, the prediction precision of the updated model has not reached the preset precision and model verification fails; otherwise, model verification succeeds.
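The verification gate described above might be sketched as follows. The parameter names and the default values for the danger-level threshold and the allowed overlap are hypothetical; the embodiment only states that such thresholds exist.

```python
def validate(predicted_ips, rule_levels, level_threshold, max_overlap):
    """Worst-case check: fail if too many predicted IPs are rated low-danger by the rules."""
    low_danger = {ip for ip, level in rule_levels.items() if level < level_threshold}
    overlap = len(set(predicted_ips) & low_danger)
    return overlap <= max_overlap


def select_model(current_model, candidate_model, predicted_ips, rule_levels,
                 level_threshold=3, max_overlap=1):
    """Return the candidate if it passes verification; otherwise keep the current model."""
    if validate(predicted_ips, rule_levels, level_threshold, max_overlap):
        return candidate_model
    return current_model
```

Here `predicted_ips` stands for the IPs that the updated model reports as abnormal, and `rule_levels` maps each IP to the danger level assigned by the rule-based alarms. Returning the current model on failure corresponds to retaining the original model file.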

The output layer is used for outputting the prediction result of the deployment module. The output layer may be a Web application that outputs the ranking of the abnormal IPs that generated the alarm data and may group the abnormal IPs by area, such as external network and internal network, so that relevant personnel can review them quickly.

As shown in fig. 2C, in an actual scenario, the alarm logs received by an enterprise-side situation awareness platform generally follow the pyramid distribution in fig. 2C. L5 represents alarms that actually resulted in a security event or that indicate a security event has occurred. L4 indicates that an attacker tried to exploit a vulnerability but was unsuccessful. L3 represents probing malicious behavior that would not lead to compromise even if a vulnerability existed. L2 represents false alarms: alarms classified as attacks that are in fact not malicious. L1 represents log-type, low-risk alerts that are difficult to associate directly with malicious behavior. As can be seen from fig. 2C, most alarms are invalid alarms, repeated alarms, and false alarms. For example, probing malicious behavior, such as a hacker launching attempted attacks with automated tools, is common, whereas truly high-threat attacks at level L5 are relatively rare. The problem can therefore be modeled from the perspective of anomaly detection: the alarm data generated by most source IPs are taken as normal points, and the alarms generated by the small number of truly high-threat IPs are taken as abnormal points; that is, an anomaly detection algorithm is used to identify abnormal IPs in the alarm data.

Referring to fig. 3, an alarm data processing method according to an embodiment of the present application may be executed by the electronic device 1 shown in fig. 1, and may be applied to the above-mentioned scenarios of the alarm data processing systems shown in fig. 2A to 2C, so as to implement automatic prediction batch processing of alarm data and obtain abnormal IP information in the alarm data. The method comprises the following steps:

Step 301: Acquiring alarm data to be processed in the database.

In this step, the database stores a large amount of alarm data, which may be alarm logs. For example, an enterprise data center deploys many security devices, which generate a large number of alarm logs based on mirrored traffic; these logs eventually converge on the network security situation awareness platform, so the database can be built from the data of that platform. The alarm data to be processed may be the alarm logs from a given period of time.

In an embodiment, before step 301, the method may further include: collecting an original alarm data set and standardizing it to generate a database, wherein the database includes the attack type of each piece of alarm data.

In an actual scenario, the situation awareness platform often needs to ingest external alarm data from many different security vendors and many different security devices, and the alarm data structures of different devices are generally inconsistent. The final database is generated after the original alarm data set is normalized to a standard form through CAPEC. The alarm logs can be classified in the database, so that each alarm log has its own attack type.
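A minimal sketch of such standardization is given below. The vendor names, the field mappings, and the keyword-based attack-type lookup are all hypothetical stand-ins; in particular, the lookup table only imitates a CAPEC-based classification and does not reproduce the CAPEC data set.

```python
# Hypothetical per-vendor field mappings: vendor field name -> common field name.
FIELD_MAPS = {
    "vendor_a": {"src": "src_ip", "dst": "dst_ip", "sig": "signature", "ts": "timestamp"},
    "vendor_b": {
        "SourceAddress": "src_ip",
        "DestAddress": "dst_ip",
        "RuleName": "signature",
        "EventTime": "timestamp",
    },
}

# Hypothetical keyword lookup standing in for a CAPEC-based attack taxonomy.
ATTACK_TYPES = {"sql": "SQL Injection", "xss": "Cross-Site Scripting", "brute": "Brute Force"}


def normalize(vendor, raw):
    """Map one vendor-specific alarm record onto the common schema and add an attack type."""
    mapping = FIELD_MAPS[vendor]
    record = {common: raw[field] for field, common in mapping.items() if field in raw}
    signature = record.get("signature", "").lower()
    record["attack_type"] = next(
        (name for keyword, name in ATTACK_TYPES.items() if keyword in signature),
        "Unknown",
    )
    record["vendor"] = vendor
    return record
```

After this step, records from different devices share the same field names, so later feature extraction does not need to know which device produced an alarm.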

Step 302: Extracting a multi-dimensional feature set of the alarm data.

In this step, the feature set characterizes the degree of security risk of the alarm data. A multidimensional feature set represents the real intention of the IP that generated the alarm data more comprehensively, so that high-risk source IPs can be identified more accurately. For example, suppose a batch of alarm data contains the following. Alarm log 1: account A of a certain APP logs in in the early morning. Alarm log 2: account A changes the login password of the APP in the early morning. Alarm log 3: account A purchases, in the early morning, items on the APP that it does not usually purchase. If only a single feature dimension, login time, is considered, it cannot be determined accurately whether the IP that generated alarm log 1 is abnormal, because a normal user may also log in to the APP in the early morning. If a second feature dimension, behavior, is added, that is, if login time and behavior are considered together, the following event can be derived for the IP that generated alarm log 1: account A logs in to the APP in the early morning, changes the login password, and purchases items it rarely buys. Such an event does not normally occur, so its occurrence indicates that the IP is likely to be an attacker. Therefore, the multidimensional feature set of the alarm data can be extracted from the attribute dimension, the spatio-temporal dimension, and other dimensions of the attacker IP, so that the risk degree of the alarm data is represented more completely.

Step 303: inputting the multidimensional feature set into a preset anomaly detection integrated model, and outputting anomaly IP information for generating alarm data.

In this step, the modeling assumption from FIG. 2C applies: because truly high-risk IPs are rare in practice, the alarm data generated by most source IPs can be regarded as normal points, and the alarms generated by the small number of truly high-risk IPs can be regarded as abnormal points. Under this assumption, truly high-risk IPs can be detected with an anomaly detection algorithm. Specifically, an anomaly detection integrated model can be obtained based on an anomaly detection algorithm, and the multidimensional feature set of the alarm data is input into the model to find the outliers among the IPs that generated the alarm data. The outlier IPs are the truly high-risk abnormal IPs, and the abnormal IP information can be output to an interactive interface for management personnel to review.

According to the alarm data processing method, the alarm data to be processed is obtained from the database, the multi-dimensional feature extraction is carried out on the alarm data, the multi-dimensional feature set is obtained, then the multi-dimensional feature set is input into the pre-configured anomaly detection integrated model, the automatic prediction batch processing of the alarm data can be realized, the detection result of the anomaly IP information in the alarm data is obtained, human participation is not needed, the labor cost of alarm data processing is greatly saved, and the alarm processing efficiency is improved.

Referring to fig. 4A, an alarm data processing method according to an embodiment of the present application may be executed by the electronic device 1 shown in fig. 1, and may be applied to the above-mentioned scenario of the alarm data processing system shown in fig. 2A to 2C, so as to implement automatic prediction batch processing of alarm data and obtain abnormal IP information in the alarm data. The method comprises the following steps:

Step 401: Acquiring sample alarm data in the database.

In this step, before alarm data is detected, the alarm data in the database may be used as sample alarm data to build the anomaly detection integrated model. The alarm logs in the database, which may be alarm logs that have undergone standardization, can be selected directly as sample alarm data through the data layer shown in FIG. 2B.

Step 402: a multi-dimensional sample feature set of sample alert data is extracted.

In this step, the modeling assumption from FIG. 2C applies: the alarm data generated by most source IPs are regarded as normal points, and the alarms generated by the small number of truly high-threat IPs are regarded as abnormal points. In an actual scenario, different choices of feature dimensions for the data points lead to different anomaly detection results. For example, in fig. 4B, assume that the triangle points and the diamond points are all true outliers. Viewed in the full feature space A on the left of the broken line, both the triangle points and the diamond points appear to be normal points, so both are missed. Viewed in the feature subspace B on the right of the broken line, the diamond points are outliers while the triangle points still appear normal. Looking at the data from different feature subspaces therefore yields different conclusions: some outliers look normal in the global feature space but are outliers in a subspace. To make the subsequent anomaly detection integrated model more robust and less sensitive, the feature-space sampling idea of random forests can be borrowed: sample feature sets of the sample alarm data are extracted from multiple dimensions, and the anomaly scores output for the multiple feature subspaces are then averaged.
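The feature-subspace idea can be sketched as two small helpers: one that draws random feature subsets, and one that averages the per-IP anomaly scores produced for each subset. The helper names, the fixed subset size, and the seed are assumptions for illustration; how each subspace is scored is left to the base models and is not shown.

```python
import random


def sample_subspaces(feature_names, n_subspaces, size, seed=0):
    """Draw n_subspaces random feature subsets of the given size (without replacement)."""
    rng = random.Random(seed)
    return [sorted(rng.sample(feature_names, size)) for _ in range(n_subspaces)]


def average_scores(per_subspace_scores):
    """Average each IP's anomaly score across all feature subspaces."""
    count = len(per_subspace_scores)
    ips = per_subspace_scores[0].keys()
    return {ip: sum(scores[ip] for scores in per_subspace_scores) / count for ip in ips}
```

An IP that is an outlier in only one subspace still contributes to the averaged score, while a spurious high score from a single subspace is damped by the others.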

Step 403: A plurality of local anomaly detection algorithm models with different neighbor scale parameters are respectively trained using each sample sub-feature set in the multi-dimensional sample feature set, to obtain a plurality of anomaly detection base models.

In this step, the reason for selecting a local anomaly detection algorithm model is as follows. Different enterprises in actual scenes run different businesses, so their deployments of security products and devices differ. For example, government clouds, internet companies and the like face the public, and more security devices are deployed on their external networks. Some group enterprises with higher information security requirements are more inward-facing, so more security devices are deployed in their intranets. As a result, the number and type of alarms generated by intranet IPs and extranet IPs differ greatly, two independent distributions are often formed after processing by the feature layer, and both global anomalies and local anomalies are present.

As shown in fig. 4C, C1 is the external network IP distribution cluster, C2 is the internal network IP distribution cluster, O2 is a local outlier, and O1 is a global outlier. If a global anomaly detection algorithm such as iForest (Isolation Forest) or HBOS (Histogram-based Outlier Score, an unsupervised anomaly detection algorithm) is chosen during algorithm selection, minority clusters and isolated points, such as the C2 cluster and O1, will be regarded as anomalies during detection, which does not match reality. If a local anomaly detection algorithm is chosen, C1 and C2 are treated as two clusters, and O1 and O2 are identified as abnormal points. Therefore, in this step, a local anomaly detection algorithm is selected as the core algorithm of the anomaly detection integrated model, so as to obtain detection results that better match reality.

In an embodiment, the local anomaly detection algorithm may be implemented based on LOF (Local Outlier Factor, a density-based anomaly detection algorithm). The LOF algorithm evaluates anomalies by local relative density and is better suited to datasets whose clusters have inconsistent densities. The LOF algorithm calculates anomaly scores mainly through the following four steps:

(1) k-nearest neighbor distance: the distance between point p and the k-th nearest point to p is called the k-nearest neighbor distance of point p and is denoted k-distance(p), where k is the neighbor scale parameter.

(2) Reachable distance (reachability distance): given the hyperparameter k of the k-nearest neighbor distance, the reachable distance from a point p to any point o is the maximum of the k-nearest neighbor distance of point o and the distance between points p and o, namely:

$$\text{reach-dist}_k(p, o) = \max\bigl(d(p, o),\ k\text{-distance}(o)\bigr)$$

(3) Local reachable density (local reachability density): given the hyperparameter k and a point p, the set of points whose distance from p is less than or equal to the k-nearest neighbor distance of p is denoted $N_k(p)$. The local reachable density of point p is the reciprocal of the average reachable distance from p to the points in $N_k(p)$, namely:

$$\text{lrd}_k(p) = \left( \frac{1}{|N_k(p)|} \sum_{o \in N_k(p)} \text{reach-dist}_k(p, o) \right)^{-1}$$

(4) Local abnormality factor (local outlier factor): given the hyperparameter k, the local abnormality factor of point p is the ratio of the average local reachable density of all points in the k-neighbor point set $N_k(p)$ of p to the local reachable density of p, namely:

$$\text{LOF}_k(p) = \frac{\dfrac{1}{|N_k(p)|} \sum_{o \in N_k(p)} \text{lrd}_k(o)}{\text{lrd}_k(p)}$$

The local outlier factor of point p, which is the final output of the LOF algorithm, measures whether point p is an outlier relative to its surrounding points. For a given k, the higher the local abnormality factor of point p, the more abnormal the distribution of p is compared with the surrounding points. As long as a proper hyperparameter k is selected, local abnormal points can be found even in datasets with uneven distribution and differing densities.
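The four steps above can be sketched in plain Python as follows. This is a minimal illustration rather than the implementation of the embodiment: it uses brute-force Euclidean distances, assumes that all points are distinct (duplicates would give a zero reachable distance), and the function name `lof_scores` is hypothetical.

```python
import math


def lof_scores(points, k):
    """Compute the local outlier factor of every point for neighbor scale k."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    k_distance, neighbours = [], []
    for i in range(n):
        others = sorted((dist[i][j], j) for j in range(n) if j != i)
        kd = others[k - 1][0]                 # step (1): k-nearest neighbor distance
        k_distance.append(kd)
        # N_k(p): every point within the k-distance of p (ties included)
        neighbours.append([j for d, j in others if d <= kd])

    def reach_dist(p, o):                     # step (2): reachable distance
        return max(dist[p][o], k_distance[o])

    lrd = []
    for p in range(n):                        # step (3): local reachable density
        mean_reach = sum(reach_dist(p, o) for o in neighbours[p]) / len(neighbours[p])
        lrd.append(1.0 / mean_reach)          # assumes distinct points

    scores = []
    for p in range(n):                        # step (4): local abnormality factor
        mean_lrd = sum(lrd[o] for o in neighbours[p]) / len(neighbours[p])
        scores.append(mean_lrd / lrd[p])
    return scores


# Four points forming a tight cluster and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = lof_scores(points, k=2)
```

In this toy data the isolated point receives the largest factor, while the cluster points receive factors close to 1, which is the usual reading of an LOF value.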

The following can be concluded from the above principle. On the one hand, the hyperparameter k of the LOF algorithm determines the reference range used to judge whether point p is an abnormal point: as k tends to infinity the algorithm detects global anomalies, and when k is 1 it detects local anomalies at the smallest scale. In reality, the data distribution differs between production environments, and so does the suitable hyperparameter k. This embodiment therefore borrows the idea of ensemble learning and integrates a plurality of LOF base models with different neighbor scale parameters k, which improves the robustness of the anomaly detection integrated model on real environment data. On the other hand, each LOF base model uses sample features of a different dimension, so the feature distribution of real environment data in different dimensions is considered comprehensively, and the result better matches the actual situation and is more accurate than that of a single LOF algorithm model trained on single-dimension features.

Step 404: The detection results of the plurality of anomaly detection base models are combined according to a preset combination strategy to generate the anomaly detection integrated model.

In this step, the anomaly detection integrated model is an ensemble learner. The principle of an ensemble algorithm is to train a plurality of base learners (base models) and then synthesize their results through a combination strategy to obtain the result of the ensemble learner.

In an embodiment, the preset combination strategy is an averaging strategy that averages the results over the number of neighbor scale parameters and over the dimensions of the feature set. Ensemble algorithms admit various combination strategies. Because the feature set of each dimension contributes roughly equally to the model, an average value strategy can be selected, and the specific process of generating the anomaly detection integrated model can be exemplified as follows:

1) Let X be the multi-dimensional sample feature set.

2) Let $X_j$ be the j-th feature subspace of X (i.e., a sample sub-feature set), where j is a positive integer, and let the space X be the sum of all m feature subspaces, i.e., $X = \sum_{j=1}^{m} X_j$.

3) Any two feature subspaces $X_i$, $X_j$ satisfy the following condition:

where $X_i$ is the i-th feature subspace of X, and i is a positive integer.

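4) Let $\{K_1, \dots, K_n\}$ be the n preset neighbor scale parameters K of the LOF algorithm. Under the average value strategy, the anomaly score of a single IP may be calculated as:

$$\text{Score}(ip) = \frac{1}{n \cdot m} \sum_{i=1}^{n} \sum_{j=1}^{m} \text{LOF}_{K_i}\bigl(ip \mid X_j\bigr)$$

where i = 1, 2, 3, ..., n; j = 1, 2, 3, ..., m.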
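The average value strategy over neighbor scale parameters and feature subspaces can be sketched as below. This is an illustration under stated assumptions: the function names and feature names are hypothetical, and the base scorer is passed in as a callable. The stand-in scorer used in the usage lines is the distance to the k-th nearest neighbor, which is not LOF and only keeps the example short. An LOF implementation with the same `(points, k)` signature would be used in practice.

```python
import math
from statistics import mean


def ensemble_score(rows, subspaces, ks, base_scorer):
    """Average base-model anomaly scores over every (subspace, k) pair.

    rows: list of dicts mapping feature name -> value, one per source IP
    subspaces: list of feature-name lists (X_1 .. X_m)
    ks: neighbor scale parameters (K_1 .. K_n)
    base_scorer: callable (points, k) -> list of scores, e.g. an LOF scorer
    """
    per_model = []
    for subspace in subspaces:
        points = [tuple(row[f] for f in subspace) for row in rows]
        for k in ks:
            per_model.append(base_scorer(points, k))
    # One averaged score per IP across all n * m base models.
    return [mean(scores) for scores in zip(*per_model)]


def knn_distance_scorer(points, k):
    """Stand-in base scorer (NOT LOF): distance to the k-th nearest neighbor."""
    out = []
    for i, p in enumerate(points):
        d = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        out.append(d[k - 1])
    return out


# Hypothetical feature rows, one per source IP.
rows = [
    {"a": 0, "b": 0, "c": 0},
    {"a": 0, "b": 1, "c": 0},
    {"a": 1, "b": 0, "c": 1},
    {"a": 1, "b": 1, "c": 1},
    {"a": 9, "b": 9, "c": 9},
]
subspaces = [["a", "b"], ["b", "c"]]
ks = [1, 2]
final_scores = ensemble_score(rows, subspaces, ks, knn_distance_scorer)
```

Because every base model is independent of the others, the inner loops could be distributed across workers without changing the averaged result.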

This algorithm is an LOF integrated model that combines multiple neighbor scale parameters K with feature subspace sampling. Because different anomaly detection base models use sample features of different dimensions, the feature distribution of real environment data in different dimensions is considered comprehensively, and the result better matches the actual situation and is more accurate than that of a single LOF algorithm model trained on single-dimension features. The plurality of anomaly detection base models can be run in parallel, and the robustness of the model can be improved.

In one embodiment, the step of establishing the anomaly detection integrated model may further include a model update process, including: acquiring, at intervals of a first time period, updated alarm data within the first time period from the database; extracting a multi-dimensional update feature set of the updated alarm data; respectively training a plurality of local anomaly detection algorithm models with different neighbor scale parameters using each update sub-feature set in the multi-dimensional update feature set to obtain a plurality of update base models, and combining the plurality of update base models according to the preset combination strategy; and updating the anomaly detection integrated model according to the update model file obtained by combining the plurality of update base models.

Based on the system as shown in fig. 2B, the first period of time may be set based on actual scene requirements, for example, the first period of time may be 24 hours, i.e., the anomaly detection integration model is updated once a day. The specific updating process of the model may refer to the process of establishing the anomaly detection integrated model in the steps 401 to 404, which is not described herein.

In one embodiment, updating the anomaly detection integrated model according to the update model file obtained by combining the plurality of update base models includes: verifying the update model file, and if the output precision of the update model file reaches a preset threshold, overwriting the original file of the anomaly detection integrated model with the update model file to obtain the updated anomaly detection integrated model.

The verification of the update model file can be performed by the model evaluation and verification part shown in fig. 2B, by comparison with the alarms generated by rules. If the overlap between the IP set inferred by the system and the set of IPs whose rule-generated danger level is below a certain threshold exceeds a certain amount, the prediction accuracy of the updated model does not reach the preset threshold and the model verification fails; otherwise, the model verification succeeds. If verification succeeds, the original model file is overwritten; otherwise, the original model file is retained. This ensures that the update model file meets the output precision requirement.
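The verify-then-overwrite flow can be sketched as follows. The function names, the use of an overlap ratio and the default value 0.2 are assumptions made for illustration, since the embodiment only states that the overlap must not exceed a certain amount. The atomic file swap via `os.replace` is likewise one possible way to overwrite the original model file.

```python
import os
import tempfile


def passes_verification(predicted_ips, low_risk_rule_ips, max_overlap_ratio=0.2):
    """Fail when too many predicted IPs coincide with IPs the rules grade as low risk.

    max_overlap_ratio is a hypothetical threshold, not a value from the embodiment.
    """
    predicted = set(predicted_ips)
    if not predicted:
        return False
    overlap = len(predicted & set(low_risk_rule_ips))
    return overlap / len(predicted) <= max_overlap_ratio


def publish_model(new_model_bytes, model_path, verified):
    """Overwrite the deployed model file only after successful verification."""
    if not verified:
        return False                      # keep the original model file
    directory = os.path.dirname(os.path.abspath(model_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as handle:
        handle.write(new_model_bytes)
    os.replace(tmp_path, model_path)      # atomic swap on the same filesystem
    return True
```

Writing to a temporary file first means that a prediction job reading the model file never sees a half-written update.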

The model building process of steps 401 to 404 described above may be implemented based on a model training module of the system as shown in fig. 2B.

Step 405: Alarm data to be processed in the database is acquired. See the description of step 301 in the above embodiments for details.

Step 406: A multi-dimensional feature set of the alarm data is extracted. See the description of step 302 in the above embodiments for details. In addition, for the extraction method of the multi-dimensional feature set, refer to the description of the extraction method of the multi-dimensional sample feature set in step 402.

Step 407: The multi-dimensional feature set is input into the anomaly detection integrated model every second time period to obtain a plurality of pieces of abnormal IP information in the alarm data within the second time period, and a sorting result of the plurality of pieces of abnormal IP information is output based on the anomaly score of each abnormal IP.

In this step, the second time period is the interval at which prediction results are output. In an actual scene, 1 hour may be selected, that is, the output prediction result is updated every hour. Other suitable durations may also be selected as the second time period, and the first time period (the model update interval) may typically be greater than the second time period (the prediction result update interval).

The anomaly detection integrated model may be configured in the deployment module shown in fig. 2B. The deployment module may read the latest batch of feature data (one batch every second time period) and the latest model file, input the feature data into the anomaly detection integrated model given by the latest model file, predict an anomaly score for each source IP that generates alarm data, and perform Top-K sorting based on the anomaly scores of the source IPs. The higher the anomaly score, the greater the likelihood that the IP is a malicious attacker. To interpret the anomaly score derived by the algorithm, the z-score of each feature's deviation from its mean may be used.
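The Top-K sorting and the z-score based interpretation can be sketched as follows. The IP addresses, feature names and function names are hypothetical, and using the population standard deviation of the current batch as the z-score denominator is an assumption.

```python
import heapq
from statistics import mean, pstdev


def top_k_ips(scores_by_ip, k):
    """Return the k source IPs with the highest anomaly scores, highest first."""
    return heapq.nlargest(k, scores_by_ip.items(), key=lambda item: item[1])


def feature_z_scores(rows_by_ip, ip):
    """z-score of each feature of one IP against the batch mean, for interpretation."""
    result = {}
    for name, value in rows_by_ip[ip].items():
        column = [row[name] for row in rows_by_ip.values()]
        sigma = pstdev(column)
        # A constant feature carries no deviation, so report 0.0 for it.
        result[name] = 0.0 if sigma == 0 else (value - mean(column)) / sigma
    return result


# Hypothetical anomaly scores and features for one batch.
scores_by_ip = {"10.0.0.1": 1.02, "10.0.0.2": 0.98, "203.0.113.7": 6.4}
rows_by_ip = {
    "10.0.0.1": {"alert_count": 10, "night_ratio": 0.5},
    "10.0.0.2": {"alert_count": 12, "night_ratio": 0.5},
    "203.0.113.7": {"alert_count": 200, "night_ratio": 0.5},
}
ranking = top_k_ips(scores_by_ip, k=2)
explanation = feature_z_scores(rows_by_ip, "203.0.113.7")
```

A feature with a large positive z-score, such as `alert_count` for the top-ranked IP here, indicates which dimension pushed that IP away from the rest of the batch.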

In one embodiment, step 407 may specifically include: inputting each sub-feature set in the multi-dimensional feature set into the plurality of anomaly detection base models respectively, and outputting a plurality of initial detection results of the abnormal IP information that generates the alarm data; and combining the plurality of initial detection results according to the preset combination strategy to obtain a final detection result of the abnormal IP information that generates the alarm data. In this scenario, the anomaly detection integrated model includes a plurality of anomaly detection base models. For the specific identification process of the anomaly detection integrated model, refer to the detailed description above of establishing the anomaly detection integrated model, which is not repeated here.

The above alarm data processing method builds an architecture for updating the model online by combining batch learning in a batch-processing mode with synchronization of prediction results through an asynchronous database, so that the online model update problem is solved at the system level rather than the algorithm level. Massive single-point security device alarm logs undergo real-time streaming analysis and CAPEC standardization, and feature engineering is constructed from the attribute dimension of the attacker IP and the space-time dimension of the attack behavior. Based on the unsupervised anomaly detection algorithm LOF and the idea of ensemble learning, an improved LOF algorithm, EBLOF (Ensemble Based Local Outlier Factors, an anomaly detection algorithm combined with ensemble learning), is proposed with robustness and sensitivity in mind. The method discovers attacker IPs and assists security analysts in analyzing massive, multi-source, heterogeneous security alarm logs, accelerating and enhancing that analysis.

Referring to fig. 5, an alarm data processing apparatus 500 according to an embodiment of the present application is applicable to the electronic device 1 shown in fig. 1 and to the scenario of the alarm data processing system shown in fig. 2A to 2C, so as to implement automatic batch prediction processing of alarm data and obtain abnormal IP information in the alarm data. The device comprises an acquisition module 501, an extraction module 502 and a detection module 503, whose relationship is as follows:

The obtaining module 501 is configured to obtain alarm data to be processed in the database.

The extracting module 502 is configured to extract a multi-dimensional feature set of the alarm data.

The detection module 503 is configured to input the multidimensional feature set into a preset anomaly detection integrated model, and output anomaly IP information for generating alarm data.

In one embodiment, the anomaly detection integrated model includes a plurality of anomaly detection base models. The detection module 503 is configured to: input each sub-feature set in the multi-dimensional feature set into the plurality of anomaly detection base models respectively, and output a plurality of initial detection results of the abnormal IP information that generates the alarm data; and combine the plurality of initial detection results according to a preset combination strategy to obtain a final detection result of the abnormal IP information that generates the alarm data.

In one embodiment, the device further includes a model building module 504 configured to: obtain sample alarm data in the database; extract a multi-dimensional sample feature set of the sample alarm data; respectively train a plurality of local anomaly detection algorithm models with different neighbor scale parameters using each sample sub-feature set in the multi-dimensional sample feature set to obtain a plurality of anomaly detection base models; and combine the detection results of the plurality of anomaly detection base models according to a preset combination strategy to generate the anomaly detection integrated model.

In one embodiment, the model building module 504 is further configured to: acquire, at intervals of a first time period, updated alarm data within the first time period from the database; extract a multi-dimensional update feature set of the updated alarm data; respectively train a plurality of local anomaly detection algorithm models with different neighbor scale parameters using each update sub-feature set in the multi-dimensional update feature set to obtain a plurality of update base models, and combine the plurality of update base models according to the preset combination strategy; and update the anomaly detection integrated model according to the update model file obtained by combining the plurality of update base models.

In an embodiment, the preset combination strategy is an averaging strategy that averages the results over the number of neighbor scale parameters and over the dimensions of the feature set.

In one embodiment, the detection module 503 is further configured to: input the multi-dimensional feature set into the anomaly detection integrated model every second time period to obtain a plurality of pieces of abnormal IP information in the alarm data within the second time period, and output a sorting result of the plurality of pieces of abnormal IP information based on the anomaly score of each abnormal IP.

In one embodiment, the device further comprises a normalization module 505 configured to collect an original alarm data set before the alarm data to be processed in the database is obtained, and to perform standardization processing on the original alarm data set to generate the database, where the database includes the attack type to which each piece of alarm data belongs.

For a detailed description of the alarm data processing device 500, please refer to the description of the relevant method steps in the above embodiments.

The embodiment of the invention also provides a non-transitory electronic device readable storage medium comprising a program which, when run on an electronic device, causes the electronic device to perform all or part of the flow of the method in the above-described embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), a flash memory, a hard disk drive (HDD), a solid-state drive (SSD), or the like. The storage medium may also comprise a combination of memories of the kinds described above.

Although embodiments of the present invention have been described in connection with the accompanying drawings, various modifications and variations may be made by those skilled in the art without departing from the spirit and scope of the invention, and such modifications and variations are within the scope of the invention as defined by the appended claims.

Claims

1. An alarm data processing method, comprising:

acquiring alarm data to be processed in a database;

extracting a multidimensional feature set of the alarm data;

inputting the multidimensional feature set into a preset anomaly detection integrated model, and outputting anomaly IP information for generating the alarm data;

the step of establishing the anomaly detection integrated model comprises the following steps:

acquiring sample alarm data in the database;

extracting a multi-dimensional sample feature set of the sample alarm data;

respectively training a plurality of local anomaly detection algorithm models with different neighbor scale parameters by adopting each sample sub-feature set in the multi-dimensional sample feature set to obtain a plurality of anomaly detection base models;

combining the detection results of the plurality of abnormal detection base models according to a preset combination strategy to generate an abnormal detection integrated model;

the preset combination strategy is an average result strategy for respectively averaging the number of the neighbor scale parameters and the dimension of the feature set.

2. The method of claim 1, wherein the anomaly detection integration model comprises: a plurality of anomaly detection base models; the step of inputting the multidimensional feature set into a preset anomaly detection integrated model, outputting anomaly IP information for generating the alarm data, comprises the following steps:

inputting each sub-feature set in the multi-dimensional feature set to a plurality of abnormal detection base models respectively, and outputting a plurality of initial detection results for generating abnormal IP information of the alarm data;

and combining the plurality of initial detection results according to a preset combination strategy to obtain a final detection result of the abnormal IP information generating the alarm data.

3. The method of claim 1, wherein the step of building the anomaly detection integration model further comprises:

acquiring updated alarm data in a first time period from the database at intervals of the first time period;

extracting a multidimensional updating feature set of the updating alarm data;

respectively training the local anomaly detection algorithm models with different neighbor scale parameters by adopting each updated sub-feature set in the multi-dimensional updated feature set to obtain a plurality of updated base models, and combining the plurality of updated base models according to the preset combining strategy;

and updating the anomaly detection integrated model according to the model file combined by the plurality of updating base models.

4. The method of claim 3, wherein updating the anomaly detection integration model based on the update model file after the plurality of update base models are combined comprises:

verifying the updated model file, and if the output precision of the updated model file reaches a preset threshold value, covering the original file of the abnormality detection integrated model with the updated model file to obtain an updated abnormality detection integrated model.

5. The method according to claim 1, wherein inputting the multi-dimensional feature set into a preset anomaly detection integration model, outputting anomaly IP information that generates the alert data, comprises:

and inputting the multidimensional feature set into the anomaly detection integrated model every second time period to obtain a plurality of pieces of anomaly IP information in the alarm data in the second time period, and outputting a sequencing result of the plurality of pieces of anomaly IP information based on the anomaly score of each anomaly IP.

6. The method of claim 1, further comprising, prior to the obtaining alert data to be processed in the database:

and collecting an original alarm data set, and carrying out standardization processing on the original alarm data set to generate the database, wherein the database comprises attack types of each piece of alarm data.

7. An alert data processing apparatus, comprising:

the acquisition module is used for acquiring alarm data to be processed in the database;

the extraction module is used for extracting the multi-dimensional feature set of the alarm data;

the detection module is used for inputting the multidimensional feature set into a preset anomaly detection integrated model and outputting anomaly IP information for generating the alarm data;

acquiring sample alarm data in the database;

extracting a multi-dimensional sample feature set of the sample alarm data;

8. An alert data processing system, comprising:

the data layer is used for storing alarm data and generating a database;

the feature layer is used for periodically reading alarm data from the database of the data layer and extracting a multidimensional feature set of the alarm data;

the deployment module is deployed with an anomaly detection integrated model and is used for predicting the anomaly IP information for generating the alarm data based on the feature set of the feature layer;

the model training module is used for updating the anomaly detection integrated model of the deployment module based on the feature set of the feature layer;

the output layer is used for outputting the prediction result of the deployment module;

acquiring sample alarm data in the database;

extracting a multi-dimensional sample feature set of the sample alarm data;

9. An electronic device, comprising:

a memory for storing a computer program;

a processor for executing the computer program to implement the method of any one of claims 1 to 6.

10. A non-transitory electronic device-readable storage medium, comprising: program which, when run by an electronic device, causes the electronic device to perform the method of any one of claims 1 to 6.