CN114039837A - Alarm data processing method, device, system, equipment and storage medium

Info

Publication number: CN114039837A
Application number: CN202111309066.5A
Authority: CN
Inventors: 范敏
Original assignee: Qianxin Technology Group Co Ltd; Secworld Information Technology Beijing Co Ltd
Current assignee: Qianxin Technology Group Co Ltd; Secworld Information Technology Beijing Co Ltd
Priority date: 2021-11-05
Filing date: 2021-11-05
Publication date: 2022-02-11
Anticipated expiration: 2041-11-05
Also published as: CN114039837B

Abstract

The application provides an alarm data processing method, a device, a system, equipment and a storage medium, wherein the method comprises the following steps: acquiring alarm data to be processed in a database; extracting a multi-dimensional feature set of the alarm data; and inputting the multi-dimensional feature set into a preset anomaly detection integrated model, and outputting the anomaly IP information for generating the alarm data. According to the method and the device, the alarm data can be automatically subjected to prediction batch processing, the detection result of the abnormal IP information in the alarm data can be obtained, manual participation is not needed, the labor cost for processing the alarm data is greatly saved, and the alarm processing efficiency is improved.

Description

Alarm data processing method, device, system, equipment and storage medium

Technical Field

The present application relates to the field of information processing technologies, and in particular, to a method, an apparatus, a system, a device, and a storage medium for processing alarm data.

Background

An enterprise Internet Data Center (IDC) typically deploys a large number of security devices to build its network security system. These security devices generate a large volume of alarm logs from mirrored traffic, and the logs are ultimately aggregated on a Network Security Situation Awareness (NSSA) platform. Security analysis teams usually rely on the situation awareness platform and on expert security knowledge to handle alarm events on the platform manually. Because the actual alarm logs contain many false alarms, repeated alarms, and invalid alarms that pose no attack threat, real and effective attacks are often buried in the mass of alarm logs. The security analysis team therefore struggles to identify effective attack alarm events accurately, which creates difficulties and hidden risks for the secure operation of the data center.

To address the problem that real and effective attacks are often buried in massive alarm logs, two approaches are generally adopted:

1) Improve the detection effectiveness of the front-end security devices (such as IDS and WAF). This approach starts from the traffic, and two methods are generally used: one refines the rule engine based on manual experience; the other uses machine learning modeling to reduce false positives and false negatives.

2) Screen and filter the alarm logs afterwards. This approach starts from the alarm logs of the security devices, and two methods are used for filtering: in one, security personnel manually re-confirm the alarms; the other uses multi-level association rules, in which a dedicated system and engine dynamically configure rules to filter and merge alarms.

For approach 2), when security personnel must re-confirm alarms, it is difficult for analysts to process the alarms one by one, so they develop alarm fatigue, which affects their judgment. Multi-level association rules depend on association rules and empirical rules derived from human knowledge and experience; in real business scenarios the rules are very costly to maintain, and substantial manpower must be invested to update and modify them. How to improve the security operation efficiency of the data center has therefore become an urgent technical problem.

Disclosure of Invention

The embodiment of the application aims to provide an alarm data processing method, an alarm data processing device, an alarm data processing system, alarm data processing equipment and a storage medium, so that automatic prediction batch processing of alarm data is realized, detection results of abnormal IP information in the alarm data are obtained, manual participation is not needed, the labor cost of alarm data processing is greatly saved, and the alarm processing efficiency is improved.

A first aspect of the embodiments of the present application provides an alarm data processing method, including: acquiring alarm data to be processed in a database; extracting a multi-dimensional feature set of the alarm data; and inputting the multi-dimensional feature set into a preset anomaly detection integrated model, and outputting the anomaly IP information for generating the alarm data.

In one embodiment, the anomaly detection integration model includes: a plurality of anomaly detection basis models; the inputting the multidimensional feature set into a preset anomaly detection integrated model and outputting the abnormal IP information for generating the alarm data comprises the following steps: respectively inputting each sub-feature set in the multi-dimensional feature set to a plurality of abnormal detection base models, and outputting a plurality of initial detection results of abnormal IP information for generating the alarm data; and combining the plurality of initial detection results according to a preset combination strategy to obtain a final detection result of the abnormal IP information generating the alarm data.

In one embodiment, the step of establishing the anomaly detection integration model includes: acquiring sample alarm data in the database; extracting a multi-dimensional sample feature set of the sample alarm data; respectively training a plurality of local anomaly detection algorithm models with different neighbor scale parameters by adopting each sample sub-feature set in the multi-dimensional sample feature set to obtain a plurality of anomaly detection base models; and combining the detection results of the plurality of abnormal detection base models according to a preset combination strategy to generate the abnormal detection integrated model.

In one embodiment, the step of establishing the anomaly detection integration model further includes: acquiring updated alarm data in a first time period from the database every other first time period; extracting a multi-dimensional updating feature set of the updating alarm data; respectively training the local anomaly detection algorithm models with different parameters of the neighboring scales by adopting each updating sub-feature set in the multi-dimensional updating feature set to obtain a plurality of updating base models, and combining the plurality of updating base models according to the preset combination strategy; and updating the anomaly detection integrated model according to the model file combined by the plurality of updating base models.

In an embodiment, the updating the anomaly detection integrated model according to the combined update model file of the plurality of update base models includes: and verifying the updated model file, and if the output precision of the updated model file reaches a preset threshold value, covering the updated model file with the original file of the anomaly detection integrated model to obtain the updated anomaly detection integrated model.

In an embodiment, the preset combination strategy is an averaging strategy in which the results are averaged over the neighbor scale parameters and over the feature set dimensions, respectively.
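The embodiments above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes Local Outlier Factor (LOF) as the local anomaly detection algorithm, Euclidean distance, and a simplified LOF that ignores ties at the k-th distance. The function names `lof_scores` and `ensemble_scores`, the toy data, the subspaces, and the `ks` values are all hypothetical.

```python
import math

def lof_scores(points, k):
    """Simplified Local Outlier Factor; ties at the k-th distance are ignored."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    neighbours, k_dist = [], []
    for i in range(n):
        # Indices of the other points, nearest first.
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbours.append(order[:k])
        k_dist.append(dist[i][order[k - 1]])
    lrd = []
    for i in range(n):
        # Reachability distance of i from each neighbour j.
        reach = [max(k_dist[j], dist[i][j]) for j in neighbours[i]]
        lrd.append(len(reach) / max(sum(reach), 1e-12))
    return [
        sum(lrd[j] for j in neighbours[i]) / (len(neighbours[i]) * lrd[i])
        for i in range(n)
    ]

def ensemble_scores(rows, subspaces, ks):
    """Average LOF scores over the k values, then over the feature subspaces."""
    per_subspace = []
    for cols in subspaces:
        sub = [tuple(row[c] for c in cols) for row in rows]
        runs = [lof_scores(sub, k) for k in ks]  # one base model per k
        per_subspace.append([sum(vals) / len(ks) for vals in zip(*runs)])
    return [sum(vals) / len(subspaces) for vals in zip(*per_subspace)]

# Toy data (hypothetical): two dense clusters and one isolated point.
cluster_a = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0), (0.5, 0.5, 0.5)]
cluster_b = [tuple(v + 10 for v in p) for p in cluster_a]
rows = cluster_a + cluster_b + [(5, 5, 5)]

scores = ensemble_scores(rows, subspaces=[(0, 1), (1, 2), (0, 2)], ks=[2, 3])
most_anomalous = max(range(len(rows)), key=scores.__getitem__)
```

In this toy run, the isolated point receives the largest averaged score, while points inside either cluster score close to 1.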

In an embodiment, the inputting the multidimensional feature set into a preset anomaly detection integrated model and outputting the abnormal IP information for generating the alarm data includes: and inputting the multi-dimensional feature set to the anomaly detection integrated model every other second time period to obtain a plurality of abnormal IP information in the alarm data in the second time period, and outputting a plurality of abnormal IP information sequencing results based on the anomaly score of each abnormal IP.

In an embodiment, before the acquiring the alarm data to be processed in the database, the method further includes: and collecting an original alarm data set, and carrying out standardization processing on the original alarm data set to generate the database, wherein the database comprises the attack type of each alarm data.

A second aspect of the embodiments of the present application provides an alarm data processing apparatus, including: an acquisition module, configured to acquire alarm data to be processed in a database; an extraction module, configured to extract a multi-dimensional feature set of the alarm data; and a detection module, configured to input the multi-dimensional feature set into a preset anomaly detection integrated model and output the abnormal IP information that generates the alarm data.

In one embodiment, the anomaly detection integration model includes: a plurality of anomaly detection basis models; the detection module is used for: respectively inputting each sub-feature set in the multi-dimensional feature set to a plurality of abnormal detection base models, and outputting a plurality of initial detection results of abnormal IP information for generating the alarm data; and combining the plurality of initial detection results according to a preset combination strategy to obtain a final detection result of the abnormal IP information generating the alarm data.

In an embodiment, the apparatus further includes a model building module, configured to: acquire sample alarm data in the database; extract a multi-dimensional sample feature set of the sample alarm data; train a plurality of local anomaly detection algorithm models with different neighbor scale parameters on each sample sub-feature set in the multi-dimensional sample feature set to obtain a plurality of anomaly detection base models; and combine the detection results of the plurality of anomaly detection base models according to a preset combination strategy to generate the anomaly detection integrated model.

In one embodiment, the model building module is further configured to: acquiring updated alarm data in a first time period from the database every other first time period; extracting a multi-dimensional updating feature set of the updating alarm data; respectively training the local anomaly detection algorithm models with different parameters of the neighboring scales by adopting each updating sub-feature set in the multi-dimensional updating feature set to obtain a plurality of updating base models, and combining the plurality of updating base models according to the preset combination strategy; and updating the anomaly detection integrated model according to the model file combined by the plurality of updating base models.

In one embodiment, the detection module is further configured to: and inputting the multi-dimensional feature set to the anomaly detection integrated model every other second time period to obtain a plurality of abnormal IP information in the alarm data in the second time period, and outputting a plurality of abnormal IP information sequencing results based on the anomaly score of each abnormal IP.

In one embodiment, the apparatus further includes a standardization module, configured to collect an original alarm data set before the alarm data to be processed in the database is acquired, and to standardize the original alarm data set to generate the database, wherein the database includes the attack type of each piece of alarm data.

A third aspect of the embodiments of the present application provides an alarm data processing system, including: the data layer is used for storing alarm data and generating a database; the characteristic layer is used for regularly reading alarm data from the database of the data layer and extracting a multi-dimensional characteristic set of the alarm data; the deployment module is deployed with an abnormal detection integrated model and used for predicting abnormal IP information for generating the alarm data based on the feature set of the feature layer; a model training module for updating the anomaly detection integration model of the deployment module based on a feature set of the feature layer; and the output layer is used for outputting the prediction result of the deployment module.

A fourth aspect of the embodiments of the present application provides an electronic device, including: a memory to store computer programs and data; a processor configured to execute the computer program to implement the method of the first aspect and any embodiment of the present application.

A fifth aspect of embodiments of the present application provides a non-transitory electronic device-readable storage medium, including: a program which, when run by an electronic device, causes the electronic device to perform the method of the first aspect of an embodiment of the present application and any embodiment thereof.

According to the alarm data processing method, the device, the system, the equipment and the storage medium, the alarm data to be processed are obtained from the database, the multi-dimensional feature extraction is carried out on the alarm data to obtain the multi-dimensional feature set, then the multi-dimensional feature set is input into the pre-configured abnormal detection integrated model, the automatic prediction batch processing of the alarm data can be realized, the detection result of abnormal IP information in the alarm data is obtained, manual participation is not needed, the labor cost for processing the alarm data is greatly saved, and the alarm processing efficiency is improved.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and that those skilled in the art can also obtain other related drawings based on the drawings without inventive efforts.

Fig. 1 is a schematic structural diagram of an electronic device according to an embodiment of the present application;

Fig. 2A is a schematic diagram of an application architecture of an alarm data processing system according to an embodiment of the present application;

Fig. 2B is a schematic structural diagram of an alarm data processing system according to an embodiment of the present application;

Fig. 2C is a schematic view of a distribution scene of an alarm log according to an embodiment of the present application;

Fig. 3 is a schematic flowchart of an alarm data processing method according to an embodiment of the present application;

Fig. 4A is a schematic flowchart of an alarm data processing method according to an embodiment of the present application;

Fig. 4B is a schematic diagram of a feature subspace anomaly according to an embodiment of the present application;

Fig. 4C is a schematic diagram of a local anomaly and a global anomaly according to an embodiment of the present application;

Fig. 5 is a schematic structural diagram of an alarm data processing apparatus according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application. In the description of the present application, the terms "first," "second," and the like are used solely to distinguish one from another and are not to be construed as indicating or implying relative importance.

As shown in Fig. 1, the present embodiment provides an electronic device 1 including at least one processor 11 and a memory 12; one processor is taken as an example in Fig. 1. The processor 11 and the memory 12 are connected by a bus 10. The memory 12 stores instructions executable by the processor 11, and when the instructions are executed by the processor 11, the electronic device 1 can perform all or part of the processes of the methods in the embodiments described below, so as to implement automatic batch prediction on alarm data and obtain the abnormal IP information in the alarm data.

In an embodiment, the electronic device 1 may be a mobile phone, a tablet computer, a notebook computer, a desktop computer, or a mainframe computer system composed of a plurality of computers.

In a real production environment, once a machine learning system is deployed at a customer site (Party A), manually updating the model offline is difficult and costly because of network isolation and similar constraints. The machine learning application system should therefore be designed so that the learned model is updated online automatically from the customer's latest data.

Please refer to Fig. 2A, a schematic diagram of the application architecture of an alarm data processing system according to an embodiment of the present application. The architecture mainly includes two parts: prediction batch processing and learning batch processing.

In the learning batch processing stage, the feature extractor builds feature vectors from the alarm logs in the normalized database. The learner is the core algorithm module and is used to train the prediction model. In the prediction batch processing stage, the feature extractor extracts features from the database, prediction is performed with the prediction model, and the prediction results are output through a Web application. This framework frees the model from manual offline updates: the model is updated online at regular intervals, being retrained on the alarm log data of the most recent time window, and is then used for batch prediction.

By combining batch learning with retrieval of prediction results through a database, the application architecture forms an online model updating architecture that solves the online model update problem at the system level. The Web application and the machine learning system are decoupled through the database, which avoids dependence on any specific programming language.
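The decoupling through a database can be sketched as below. This is an illustrative sketch only: it uses Python's built-in `sqlite3` with an in-memory database as a stand-in for the production database, and the table name `prediction`, its columns, the batch identifiers, and both function names are hypothetical.

```python
import sqlite3

def write_predictions(conn, batch_id, scores):
    """Prediction batch job: persist one anomaly score per source IP."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS prediction "
        "(batch_id TEXT, src_ip TEXT, score REAL)"
    )
    conn.executemany(
        "INSERT INTO prediction VALUES (?, ?, ?)",
        [(batch_id, ip, score) for ip, score in scores.items()],
    )
    conn.commit()

def read_latest_predictions(conn):
    """Web application side: reads the table only, never the model itself."""
    return conn.execute(
        "SELECT src_ip, score FROM prediction "
        "WHERE batch_id = (SELECT MAX(batch_id) FROM prediction) "
        "ORDER BY score DESC"
    ).fetchall()

conn = sqlite3.connect(":memory:")
# Two consecutive prediction batches written by the machine learning side.
write_predictions(conn, "2021-11-05T01", {"10.0.0.5": 1.1, "203.0.113.7": 6.4})
write_predictions(conn, "2021-11-05T02", {"10.0.0.5": 1.0, "198.51.100.9": 5.2})
latest = read_latest_predictions(conn)
```

Because the two sides share only a table, the Web application could be written in any language that has a database driver.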

In an embodiment, based on the architecture of Fig. 2A, Fig. 2B shows a schematic structural diagram of the alarm data processing system of this embodiment, which mainly includes a data layer, a feature layer, a deployment module, a model training module, and an output layer, wherein:

The data layer is used to store alarm data and generate a database. The alarm data may be alarm logs aggregated on a situation awareness platform. In practice, the situation awareness platform often has to ingest external alarm data from many different security vendors and many different security devices, and the alarm data structures of different devices are usually inconsistent. The original alarm data can therefore be normalized against the Common Attack Pattern Enumeration and Classification (CAPEC) data set, and the database is generated from the processed alarm data and stored on a big data analysis platform.

The feature layer is used to periodically read alarm data from the database of the data layer and extract a multi-dimensional feature set of the alarm data. For example, every hour, multi-dimensional statistical features of each source IP (the IP that generated the alarm logs) over the last hour are computed from the alarm log database through feature engineering, and the multi-dimensional feature set is stored in a MySQL database (a relational database management system) for periodic model training. At the same time, the most recent batch of feature data is retained for model deployment and prediction.
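A minimal sketch of such hourly per-source-IP feature extraction follows. The record fields (`time`, `src_ip`, `dst_ip`, `attack_type`) and the three statistics are assumptions chosen for illustration; the source does not enumerate the actual features.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def extract_features(alarms, window_end, window=timedelta(hours=1)):
    """Aggregate per-source-IP statistics over [window_end - window, window_end)."""
    start = window_end - window
    grouped = defaultdict(list)
    for alarm in alarms:
        if start <= alarm["time"] < window_end:
            grouped[alarm["src_ip"]].append(alarm)
    return {
        ip: {
            "alarm_count": len(items),
            "distinct_attack_types": len({a["attack_type"] for a in items}),
            "distinct_dst_ips": len({a["dst_ip"] for a in items}),
        }
        for ip, items in grouped.items()
    }

# Hypothetical alarm records; the third one falls outside the one-hour window.
now = datetime(2021, 11, 5, 12, 0)
alarms = [
    {"time": now - timedelta(minutes=10), "src_ip": "203.0.113.7",
     "dst_ip": "10.0.0.1", "attack_type": "SQL Injection"},
    {"time": now - timedelta(minutes=20), "src_ip": "203.0.113.7",
     "dst_ip": "10.0.0.2", "attack_type": "Cross-Site Scripting"},
    {"time": now - timedelta(hours=2), "src_ip": "203.0.113.7",
     "dst_ip": "10.0.0.3", "attack_type": "SQL Injection"},
    {"time": now - timedelta(minutes=5), "src_ip": "198.51.100.9",
     "dst_ip": "10.0.0.1", "attack_type": "SQL Injection"},
]
features = extract_features(alarms, window_end=now)
```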

The deployment module hosts the anomaly detection integrated model and is used to predict, based on the feature set of the feature layer, the abnormal IP information that generates the alarm data. The deployment module reads the latest batch of feature data and the model file, predicts the anomaly score of each source IP, and performs Top-K ranking based on these scores. To interpret the anomaly scores produced by the algorithm, the z-score of each feature, which measures its deviation from the mean, can be used.
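The Top-K ranking and z-score interpretation might look like the sketch below. The function names, feature names, and values are hypothetical, and the population standard deviation is an assumed choice, since the source does not say which variant is used.

```python
from statistics import mean, pstdev

def top_k(scores, k):
    """Return the k source IPs with the highest anomaly scores."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]

def feature_z_scores(features, ip):
    """z-score of each feature of `ip` relative to all IPs in the batch."""
    result = {}
    for name in features[ip]:
        values = [f[name] for f in features.values()]
        sigma = pstdev(values)
        # A constant feature carries no deviation, so report 0.0.
        result[name] = 0.0 if sigma == 0 else (features[ip][name] - mean(values)) / sigma
    return result

# Hypothetical batch of features and anomaly scores.
features = {
    "10.0.0.5": {"alarm_count": 1, "distinct_attack_types": 1},
    "10.0.0.6": {"alarm_count": 1, "distinct_attack_types": 1},
    "203.0.113.7": {"alarm_count": 10, "distinct_attack_types": 1},
}
scores = {"10.0.0.5": 1.0, "10.0.0.6": 1.1, "203.0.113.7": 6.4}

ranked = top_k(scores, k=2)
explanation = feature_z_scores(features, ranked[0][0])
```

Here the top-ranked IP is explained by its `alarm_count`, which lies well above the batch mean, while the constant feature contributes nothing.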

The model training module is used to update the anomaly detection integrated model of the deployment module based on the feature set of the feature layer. The model training module may include an algorithm layer and a model evaluation and verification part. The algorithm layer is provided with a plurality of unsupervised anomaly detection models, which can be trained on the multi-dimensional feature set to obtain the anomaly detection integrated model. The updated multi-dimensional feature set can also be read from the feature layer periodically (for example, every 24 hours) to update the anomaly detection integrated model. Periodic batch learning combined with asynchronous data updates thus provides online learning for the whole model and avoids the problem of the data distribution drifting over time.

Further, the updated model may be evaluated and verified: if verification succeeds, the original model file is overwritten; otherwise, the original model file is retained. Evaluation and verification can use a worst-case test, namely a comparison with the alarms generated by rules: if the overlap between the set of IPs output by the system and the set of IPs whose risk level is below a certain threshold exceeds a certain number, the prediction precision of the updated model has not reached the preset precision and model verification fails; otherwise, model verification succeeds.
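A hedged sketch of this verify-then-overwrite step is shown below. The function names, file names, and the `max_overlap` parameter are hypothetical; the source does not specify how the threshold is chosen or how model files are stored.

```python
import os
import tempfile

def passes_worst_case_test(predicted_ips, low_risk_ips, max_overlap):
    """Fail when too many predicted IPs are ones the rules rate as low risk."""
    return len(set(predicted_ips) & set(low_risk_ips)) <= max_overlap

def deploy_if_valid(candidate_path, deployed_path, predicted_ips, low_risk_ips, max_overlap):
    """Overwrite the deployed model file only if the candidate passes verification."""
    if passes_worst_case_test(predicted_ips, low_risk_ips, max_overlap):
        os.replace(candidate_path, deployed_path)  # replaces the old file in one step
        return True
    return False  # the original model file is retained

with tempfile.TemporaryDirectory() as tmp:
    deployed = os.path.join(tmp, "model.bin")
    candidate = os.path.join(tmp, "model.candidate.bin")
    with open(deployed, "wb") as f:
        f.write(b"old")
    with open(candidate, "wb") as f:
        f.write(b"new")

    # Both predicted IPs are low risk according to the rules: verification fails.
    rejected = deploy_if_valid(candidate, deployed,
                               ["192.0.2.1", "192.0.2.2"], ["192.0.2.1", "192.0.2.2"],
                               max_overlap=1)
    with open(deployed, "rb") as f:
        after_reject = f.read()

    # No overlap with the low-risk set: verification succeeds.
    accepted = deploy_if_valid(candidate, deployed,
                               ["203.0.113.7"], ["192.0.2.1"], max_overlap=1)
    with open(deployed, "rb") as f:
        after_accept = f.read()
```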

The output layer is used to output the prediction results of the deployment module. The output layer may be a Web application that outputs the ranking information of the abnormal IPs that generate the alarm data and can group the abnormal IPs by region, for example into external network and internal network, so that relevant personnel can review them conveniently.

Fig. 2C is a schematic diagram of an alarm log distribution scenario. In practice, the alarm logs received by an enterprise-side situation awareness platform generally follow the pyramid distribution shown in Fig. 2C. L5 represents an alarm that actually results in, or indicates the occurrence of, a security event. L4 indicates that an attacker is trying to exploit a vulnerability but has not succeeded. L3 represents tentative malicious behavior that would not lead to compromise even if a vulnerability exists. L2 indicates a false alarm: an attack-class alarm that is actually not malicious. L1 represents a log-type low-risk alarm that is difficult to associate directly with malicious behavior. As Fig. 2C shows, the vast majority of alarms are invalid alarms, repeated alarms, and false alarms, for example tentative malicious behavior such as attack attempts launched by hackers using automated tools, whereas truly high-threat attacks (L5) are relatively rare. The problem can therefore be modeled from the perspective of anomaly detection: alarm data generated by the majority of source IPs are regarded as normal points, the small number of alarms generated by truly high-threat IPs are regarded as abnormal points, and an anomaly detection algorithm is used to identify abnormal IPs in the alarm data.

Please refer to Fig. 3, which shows an alarm data processing method according to an embodiment of the present application. The method may be executed by the electronic device 1 shown in Fig. 1 and may be applied to the alarm data processing system scenarios shown in Figs. 2A to 2C, so as to implement automatic batch prediction on alarm data and obtain the abnormal IP information in the alarm data. The method includes the following steps:

Step 301: Acquire the alarm data to be processed in the database.

In this step, the database stores a large amount of alarm data, which may be alarm logs. For example, an enterprise data center deploys a large number of security devices, which generate many alarm logs from mirrored traffic; these logs are ultimately aggregated on a network security situation awareness platform, so the database can be built from the data of the situation awareness platform. The alarm data to be processed may be a plurality of alarm logs within a period of time.

In an embodiment, before step 301, the method may further include: collecting an original alarm data set and standardizing it to generate the database, wherein the database includes the attack type of each piece of alarm data.

In practice, the situation awareness platform often has to ingest external alarm data from many different security vendors and many different security devices, and the alarm data structures of different devices are usually inconsistent. The final database is generated after the original alarm data set is normalized against CAPEC (Common Attack Pattern Enumeration and Classification). The alarm logs can be classified in the database so that each alarm log has its own attack type.
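The normalization step could be sketched as follows, assuming each vendor's record is a flat dictionary. The vendor names, field names, signature strings, and the signature-to-category table are all hypothetical placeholders; a real deployment would map signatures to entries of the CAPEC data set.

```python
# Hypothetical per-vendor field mappings: vendor field name -> common field name.
FIELD_MAPS = {
    "vendor_a": {"src": "src_ip", "dst": "dst_ip", "sig": "signature"},
    "vendor_b": {"source_address": "src_ip", "dest_address": "dst_ip",
                 "rule_name": "signature"},
}

# Hypothetical lookup from lower-cased signature text to an attack category.
ATTACK_TYPES = {
    "sql injection attempt": "SQL Injection",
    "sqli detected": "SQL Injection",
    "xss probe": "Cross-Site Scripting",
}

def normalize(vendor, raw):
    """Rename vendor-specific fields to the common schema and attach an attack type."""
    mapping = FIELD_MAPS[vendor]
    record = {common: raw[original] for original, common in mapping.items()}
    record["attack_type"] = ATTACK_TYPES.get(record["signature"].lower(), "Unknown")
    return record

a = normalize("vendor_a", {"src": "203.0.113.7", "dst": "10.0.0.1",
                           "sig": "SQL Injection Attempt"})
b = normalize("vendor_b", {"source_address": "198.51.100.9",
                           "dest_address": "10.0.0.2", "rule_name": "SQLi detected"})
```

Both records end up with the same keys and the same attack type, even though the two hypothetical vendors name their fields and signatures differently.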

Step 302: Extract a multi-dimensional feature set of the alarm data.

In this step, the feature set characterizes the security risk level of the alarm data. A multi-dimensional feature set represents more comprehensively the real intent of the IP that generated the alarm data, so that high-risk source IPs can be identified more accurately. For example, suppose a batch of alarm data contains the following. Alarm log 1: account A of a certain APP logged in in the early morning. Alarm log 2: account A changed the login password of the APP in the early morning. Alarm log 3: account A purchased, in the early morning, items that it does not often purchase on the APP. If only a single feature dimension, the login time, is considered, it is impossible to know accurately whether the IP that generated alarm log 1 is abnormal, because a normal user may also log in to the APP in the early morning. If a behavior feature dimension is added, so that login time and behavior are considered together, the event associated with the IP that generated alarm log 1 becomes: account A logged in to the APP in the early morning, changed the login password, and purchased items that are not often purchased. Such an event does not normally occur, so its occurrence suggests that the IP is likely an attacker. A multi-dimensional feature set of the alarm data can therefore be extracted from dimensions such as the attributes of the attacker's IP and the spatio-temporal characteristics of the attack behavior, so as to characterize the risk level of the alarm data more completely.

Step 303: Input the multi-dimensional feature set into a preset anomaly detection integrated model, and output the abnormal IP information that generates the alarm data.

In this step, the modeling assumption for Fig. 2C applies: because truly high-risk IPs are a small minority in practice, alarm data generated by the majority of source IPs can be regarded as normal points, and alarms generated by the small number of truly high-risk IPs can be regarded as abnormal points. Under this assumption, an anomaly detection algorithm can be used to detect truly high-risk IPs. Specifically, an anomaly detection integrated model can be obtained based on an anomaly detection algorithm. When the multi-dimensional feature set of the alarm data is input into the anomaly detection integrated model, the model outputs the outlier IPs among the IPs that generated the alarm data. These outlier IPs are the truly high-risk abnormal IP information, which can be output to an interactive interface for review by front-line management personnel.

According to the alarm data processing method, the alarm data to be processed is obtained from the database, the multi-dimensional feature extraction is carried out on the alarm data to obtain the multi-dimensional feature set, then the multi-dimensional feature set is input into the pre-configured anomaly detection integrated model, the automatic prediction batch processing of the alarm data can be realized, the detection result of the abnormal IP information in the alarm data is obtained, manual participation is not needed, the labor cost for alarm data processing is greatly saved, and the alarm processing efficiency is improved.

Please refer to Fig. 4A, which shows an alarm data processing method according to an embodiment of the present application. The method may be executed by the electronic device 1 shown in Fig. 1 and may be applied to the alarm data processing system scenarios shown in Figs. 2A to 2C, so as to implement automatic batch prediction on alarm data and obtain the abnormal IP information in the alarm data. The method includes the following steps:

Step 401: Acquire sample alarm data in the database.

In this step, before alarm data is detected, the anomaly detection integrated model may be established by using alarm data in the database as sample alarm data. Alarm logs in the database can be selected directly as sample alarm data through the data layer shown in Fig. 2B, and these may be alarm logs that have been standardized.

Step 402: Extract a multi-dimensional sample feature set of the sample alarm data.

In this step, the modeling assumption for Fig. 2C applies: alarm data generated by the majority of source IPs are regarded as normal points, and the small number of alarms generated by truly high-threat IPs are regarded as abnormal points. In practice, different choices of feature dimensions for the data points lead to different anomaly detection results. For example, in Fig. 4B, assume that the triangle point and the diamond point are both true outliers. Viewed in the full feature space A on the left of the dotted line, both points appear to be normal points (a misjudgment), whereas viewed in the feature subspace B on the right of the dotted line, the diamond point is an outlier and the triangle point is a normal point. The conclusions therefore differ depending on the feature subspace from which the data is viewed: some outliers appear normal in the global feature space but are outliers from the perspective of a subspace. To make the subsequent anomaly detection integrated model more robust and less sensitive, the feature space sampling idea of random forests can be borrowed: sample feature sets of the sample alarm data are extracted from a plurality of dimensions, and the anomaly scores output for the feature subspaces are then averaged.
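The feature space sampling idea can be illustrated with a short sketch. The feature names, the number of subspaces, the subspace size, and the fixed seed are assumptions for illustration; the source does not state how the sub-feature sets are actually chosen.

```python
import random

def sample_subspaces(feature_names, n_subspaces, subspace_size, seed=0):
    """Draw random feature subsets, in the spirit of random forest feature sampling."""
    rng = random.Random(seed)  # fixed seed so training and prediction agree
    return [
        sorted(rng.sample(feature_names, subspace_size))
        for _ in range(n_subspaces)
    ]

# Hypothetical feature names for a source IP.
feature_names = ["alarm_count", "distinct_attack_types", "distinct_dst_ips",
                 "night_ratio", "max_severity"]
subspaces = sample_subspaces(feature_names, n_subspaces=4, subspace_size=3)
```

Each sampled subset would then be used to train its own base models, whose anomaly scores are averaged afterwards.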

Step 403: Train a plurality of local anomaly detection algorithm models with different neighbor scale parameters on each sample sub-feature set in the multi-dimensional sample feature set, to obtain a plurality of anomaly detection base models.

In this step, a local anomaly detection algorithm model is selected for the following reasons. In practice, different enterprises serve different businesses, so their deployments of security products and devices differ. For example, organizations that serve the public, such as government-affairs clouds and internet companies, deploy more security devices on the external network, whereas group enterprises with higher information security requirements, whose systems are mostly used internally, deploy more security devices on the intranet. As a result, the number and types of alarms generated by intranet IPs and extranet IPs differ greatly, two independent distributions often form after feature-layer processing, and both global anomalies and local anomalies are present.

As shown in fig. 4C, C1 is the extranet IP distribution cluster, C2 is the intranet IP distribution cluster, O2 is a local anomaly, and O1 is a global anomaly. If a global anomaly detection algorithm such as iForest (Isolation Forest) or HBOS (Histogram-Based Outlier Score) is selected, the smaller cluster, for example cluster C2, is treated as anomalous together with O1 at detection time, which does not match reality. If a local anomaly detection algorithm is selected, C1 and C2 can be treated as two clusters, and O1 and O2 are identified as anomaly points. Therefore, in this step, a local anomaly detection algorithm is selected as the core algorithm of the anomaly detection integrated model, so as to obtain detection results that better match reality.

In an embodiment, the local anomaly detection algorithm may be implemented based on the Local Outlier Factor (LOF) algorithm. LOF evaluates how anomalous a point is in terms of its local relative density, and is therefore well suited to data sets whose clusters have inconsistent densities. The LOF algorithm calculates the anomaly score primarily by the following four steps:

(1) k-nearest-neighbor distance: the distance between point p and the k-th nearest point to p is called the k-nearest-neighbor distance of p, denoted k-distance(p). The value k is the neighbor scale parameter.

(2) Reachability distance: given the hyperparameter k of the k-nearest-neighbor distance, the reachability distance from point p to any point o is the maximum of the k-nearest-neighbor distance of o and the distance between p and o, namely:

$$\text{reach-dist}_k(p, o) = \max\{\, d(p, o),\ k\text{-distance}(o) \,\}$$

(3) Local reachability density: given the hyperparameter k and a point p, the set of points whose distance from p is less than or equal to the k-nearest-neighbor distance of p is denoted N_k(p). The local reachability density of p is the inverse of the average reachability distance from p to the points in N_k(p), i.e.:

$$\mathrm{lrd}_k(p) = \left( \frac{1}{|N_k(p)|} \sum_{o \in N_k(p)} \text{reach-dist}_k(p, o) \right)^{-1}$$

(4) Local outlier factor: given the hyperparameter k, the local outlier factor of a point p is the ratio of the average local reachability density of all points in the k-neighbor set N_k(p) of p to the local reachability density of p, i.e.:

$$\mathrm{LOF}_k(p) = \frac{1}{|N_k(p)|} \sum_{o \in N_k(p)} \frac{\mathrm{lrd}_k(o)}{\mathrm{lrd}_k(p)}$$

The local outlier factor of point p is the final output of the LOF algorithm and measures whether p is anomalous relative to its surrounding points. For a given k, the higher the local outlier factor of p, the more anomalous p is compared with the points around it. As long as a suitable hyperparameter k is selected, local outliers can be found even in data sets with uneven distribution and differing densities.
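The four steps can be traced in a small pure-Python sketch. It follows the standard LOF definitions described above and assumes Euclidean distance, distinct input points, and k smaller than the number of points; the function name `lof_scores` and the toy data are illustrative only.

```python
import math

def lof_scores(points, k):
    """Return the local outlier factor of every point (assumes distinct points, k < len(points))."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # Step 1: k-distance and the neighbor set N_k(p) (ties may give more than k neighbors).
    k_dist, neigh = [], []
    for i in range(n):
        others = sorted((dist[i][j], j) for j in range(n) if j != i)
        kd = others[k - 1][0]
        k_dist.append(kd)
        neigh.append([j for d, j in others if d <= kd])

    # Steps 2 and 3: reachability distance and local reachability density.
    lrd = []
    for p in range(n):
        reach = [max(dist[p][o], k_dist[o]) for o in neigh[p]]
        lrd.append(len(reach) / sum(reach))

    # Step 4: average neighbor density divided by the point's own density.
    return [sum(lrd[o] for o in neigh[p]) / (len(neigh[p]) * lrd[p]) for p in range(n)]

# Four points on a unit square plus one far-away point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = lof_scores(points, k=2)
```

In this toy data set the four square corners have a factor of 1 (their density matches that of their neighbors), while the isolated point at (5, 5) receives a much larger factor.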

The following can be concluded. First, the hyperparameter k of the LOF algorithm determines the reference range used to judge whether point p is an abnormal point: as k tends to infinity, global anomalies are detected, and when k is 1, local anomalies at the smallest scale are detected. In reality, data distributions differ across production environments, and so does the suitable hyperparameter k. Using the idea of ensemble learning, integrating LOF base models with a plurality of different neighbor scale parameters k can therefore improve the robustness of the anomaly detection integrated model on real environment data. In addition, each LOF base model uses sample features of a different dimension, so the feature distributions of real environment data in different dimensions are considered together, and the result is more accurate and closer to the actual situation than that of a single LOF algorithm model trained on features of a single dimension.

Step 404: Combine the detection results of the plurality of anomaly detection base models according to a preset combination strategy to generate the anomaly detection integrated model.

In this step, the anomaly detection integrated model is an ensemble learner. An ensemble algorithm trains a plurality of base learners (base models) and then synthesizes their results through a combination strategy to obtain the result of the ensemble learner. Accordingly, the detection results of the plurality of anomaly detection base models are combined according to the preset combination strategy to generate the anomaly detection integrated model.

In an embodiment, the preset combination strategy is an averaging strategy that averages the results over the number of neighbor scale parameters and over the number of feature-set dimensions, respectively. An ensemble algorithm admits many combination strategies. Here, since the feature sets of the individual dimensions contribute roughly equally to the model, an averaging strategy can be selected, and a specific process for generating the anomaly detection integrated model can be exemplified as follows:

1) let X be the multi-dimensional sample feature set.

2) Let X_j be the j-th feature subspace (i.e., the sample sub-feature set) of X, where j = 1, 2, ..., m is a positive integer, and let the space X be the union of all feature subspaces, i.e., $X = \bigcup_{j=1}^{m} X_j$.

3) Any two feature subspaces X_i and X_j satisfy the following condition:

where X_i is the i-th feature subspace of X, and i is a positive integer.

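4) Let {K_1, ..., K_n} be the preset neighbor scale parameters K of the n LOF algorithms. Following the averaging strategy described above, the anomaly score of a single IP x can be calculated as:

$$\mathrm{Score}(x) = \frac{1}{n \cdot m} \sum_{i=1}^{n} \sum_{j=1}^{m} \mathrm{LOF}_{K_i}\big(x \mid X_j\big)$$

where i = 1, 2, 3, ..., n indexes the neighbor scale parameters, j = 1, 2, 3, ..., m indexes the feature subspaces (m being the number of feature subspaces), and LOF_{K_i}(x | X_j) denotes the local outlier factor of x computed in feature subspace X_j with neighbor scale parameter K_i.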

The algorithm adopts an LOF integrated model that combines multiple neighbor scale parameters K with feature subspace sampling. Different anomaly detection base models use sample features of different dimensions, so the feature distributions of real environment data in different dimensions are considered together, and the result of the LOF integrated model is more accurate and closer to the actual situation than that of a single LOF algorithm model trained on features of a single dimension. The plurality of anomaly detection base models can run in parallel, and the robustness of the model is improved.
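The averaging strategy over neighbor scale parameters and feature subspaces can be sketched as follows. This is an illustration under stated assumptions rather than the implementation of the application: the compact `lof_scores` helper is a plain LOF for distinct points, `ensemble_scores` and the toy feature rows are hypothetical, and the subspaces are given as tuples of column indices.

```python
import math

def lof_scores(points, k):
    """Plain LOF for a small list of distinct points (k < len(points))."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    k_dist, neigh = [], []
    for i in range(n):
        others = sorted((dist[i][j], j) for j in range(n) if j != i)
        kd = others[k - 1][0]
        k_dist.append(kd)
        neigh.append([j for d, j in others if d <= kd])
    lrd = []
    for p in range(n):
        reach = [max(dist[p][o], k_dist[o]) for o in neigh[p]]
        lrd.append(len(reach) / sum(reach))
    return [sum(lrd[o] for o in neigh[p]) / (len(neigh[p]) * lrd[p]) for p in range(n)]

def ensemble_scores(rows, subspaces, ks):
    """Average LOF over every (feature subspace, neighbor scale parameter) pair."""
    total = [0.0] * len(rows)
    for cols in subspaces:
        # Project every row onto the columns of this feature subspace.
        projected = [tuple(row[c] for c in cols) for row in rows]
        for k in ks:
            for i, score in enumerate(lof_scores(projected, k)):
                total[i] += score
    n_models = len(subspaces) * len(ks)
    return [t / n_models for t in total]

# One row per source IP; the last row is unremarkable in columns (0, 1)
# but far from the others in columns (2, 3).
rows = [
    (0.0, 0.0, 0.0, 0.0), (0.0, 1.0, 0.0, 1.0), (1.0, 0.0, 1.0, 0.0),
    (1.0, 1.0, 1.0, 1.0), (0.5, 0.5, 0.5, 0.5), (0.5, 0.4, 6.0, 6.0),
]
scores = ensemble_scores(rows, subspaces=[(0, 1), (2, 3)], ks=[2, 3])
```

The last row is only anomalous in the second subspace, yet its averaged score is still the highest, which is the effect that sampling several subspaces is meant to achieve. Because each (subspace, k) pair is scored independently, the inner loops could also be dispatched in parallel.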

In an embodiment, the step of building the anomaly detection integrated model may further include a model updating process, which includes: acquiring, once every first time period, the update alarm data within that first time period from the database; extracting a multi-dimensional update feature set of the update alarm data; training a plurality of local anomaly detection algorithm models with different neighbor scale parameters on each update sub-feature set in the multi-dimensional update feature set to obtain a plurality of update base models, and combining the plurality of update base models according to the preset combination strategy; and updating the anomaly detection integrated model according to the update model file obtained by combining the plurality of update base models.

Based on the system as shown in fig. 2B, the first time period may be set based on actual scene requirements, for example, the first time period may be 24 hours, i.e., the anomaly detection integration model is updated once a day. The specific updating process of the model may refer to the process of establishing the anomaly detection integrated model in steps 401 to 404, which is not described herein again.

In one embodiment, updating the anomaly detection integrated model according to the update model file obtained by combining the plurality of update base models includes: verifying the update model file, and if the output accuracy of the update model file reaches a preset threshold, overwriting the original file of the anomaly detection integrated model with the update model file to obtain the updated anomaly detection integrated model.

The verification of the update model file can be implemented by the model evaluation and verification part shown in fig. 2B, that is, by comparing the output of the update model with the alarms generated by rules. If the overlap between the IP set flagged by the system and the set of IPs whose rule-generated risk level is below a certain threshold exceeds a certain number, the prediction accuracy of the update model does not reach the preset threshold and the model verification fails; otherwise, the model verification succeeds. If the verification succeeds, the original model file is overwritten; otherwise, the original model file is retained. This ensures that only a model file whose output accuracy meets the requirement is put into use.
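The verification rule can be made concrete with a short sketch. The function names, the numeric risk levels, and the two thresholds are hypothetical; the application only states that verification fails when the overlap with rule-rated low-risk IPs exceeds a certain number.

```python
def update_passes_verification(flagged_ips, rule_risk_levels, low_risk_below, max_overlap):
    """Return True if the updated model may replace the original model file.

    flagged_ips      : IPs flagged as abnormal by the updated model
    rule_risk_levels : dict mapping IP -> risk level assigned by rule-based alarms
    low_risk_below   : IPs with a rule risk level below this value count as low risk
    max_overlap      : largest tolerated number of flagged IPs that rules rate low risk
    """
    low_risk_ips = {ip for ip, level in rule_risk_levels.items() if level < low_risk_below}
    overlap = len(set(flagged_ips) & low_risk_ips)
    return overlap <= max_overlap

def select_model_file(current_path, candidate_path, passed):
    """Keep the original model file unless the candidate passed verification."""
    return candidate_path if passed else current_path

# Documentation-range IP addresses, illustrative values only.
flagged = ["203.0.113.5", "203.0.113.9", "198.51.100.7"]
rule_risk = {"203.0.113.5": 4, "203.0.113.9": 1, "198.51.100.7": 5, "192.0.2.1": 1}

passed = update_passes_verification(flagged, rule_risk, low_risk_below=2, max_overlap=1)
model_in_use = select_model_file("model_current.bin", "model_candidate.bin", passed)
```

Here one flagged IP is rated low risk by the rules, which is within the tolerated overlap, so the candidate file is selected. With `max_overlap=0` the same input would fail verification and the original file would be kept.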

The model building process from step 401 to step 404 can be implemented based on a model training module of the system shown in fig. 2B.

Step 405: Acquire alarm data to be processed from the database. See the description of step 301 in the above embodiments for details.

Step 406: Extract a multi-dimensional feature set of the alarm data. See the description of step 302 in the above embodiments for details. In addition, for the manner of extracting the multi-dimensional feature set, refer to the description of extracting the multi-dimensional sample feature set in step 402.

Step 407: Input the multi-dimensional feature set to the anomaly detection integrated model once every second time period to obtain a plurality of abnormal IP information in the alarm data within that second time period, and output a ranking result of the plurality of abnormal IP information based on the anomaly score of each abnormal IP.

In this step, the second time period is the interval at which prediction results are output. In an actual scenario it may be set to 1 hour, so that the output prediction results are updated every hour. Other suitable time periods may also be selected as the second time period. Typically, the first time period (model update interval) is greater than the second time period (prediction result update interval).
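The relationship between the two periods can be shown with a small scheduling sketch. It assumes the example values from the text (a 24-hour model update interval and a 1-hour prediction interval) and an hour counter that starts at zero; the function name and task labels are hypothetical.

```python
def tasks_due(elapsed_hours, update_period_h=24, predict_period_h=1):
    """List the batch tasks that fall due at a given whole hour since start-up."""
    tasks = []
    if elapsed_hours % update_period_h == 0:
        tasks.append("update_model")   # first time period: retrain and verify the model
    if elapsed_hours % predict_period_h == 0:
        tasks.append("predict")        # second time period: score the latest batch
    return tasks

# Count how often each task runs over two days.
schedule = [tasks_due(h) for h in range(1, 49)]
n_updates = sum("update_model" in t for t in schedule)
n_predictions = sum("predict" in t for t in schedule)
```

Over 48 hours this yields 2 model updates and 48 prediction batches, so each model file serves many prediction batches before it is replaced.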

The anomaly detection integrated model may be configured in the deployment module shown in fig. 2B. The deployment module may read the feature data of the latest batch (one batch every second time period) and the latest model file, input the feature data into the anomaly detection integrated model, predict the anomaly score of each source IP that generated the alarm data, and perform Top-K sorting based on the anomaly scores of the source IPs. An IP with a higher anomaly score is more likely to be a malicious attacker. To interpret the anomaly scores produced by the algorithm, the z-score of each feature, i.e., its deviation from the mean, can be used.

In an embodiment, step 407 may specifically include: inputting each sub-feature set in the multi-dimensional feature set to the plurality of anomaly detection base models, respectively, and outputting a plurality of initial detection results of the abnormal IP information that generated the alarm data; and combining the plurality of initial detection results according to the preset combination strategy to obtain a final detection result of the abnormal IP information that generated the alarm data. In this scenario, the anomaly detection integrated model may include a plurality of anomaly detection base models. For the specific identification process of the anomaly detection integrated model, refer to the above detailed description of establishing the anomaly detection integrated model, which is not repeated here.
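Top-K sorting and z-score based interpretation can be sketched as below. The anomaly scores, feature names, and IP addresses are made-up illustrative values, and `top_k_ips` and `zscore_explanation` are hypothetical helper names; the population standard deviation is used as one possible choice.

```python
import heapq
import statistics

def top_k_ips(anomaly_scores, k):
    """Return the k (ip, score) pairs with the highest anomaly scores, highest first."""
    return heapq.nlargest(k, anomaly_scores.items(), key=lambda item: item[1])

def zscore_explanation(features, ip):
    """For one IP, report how many standard deviations each feature lies from the mean."""
    explanation = {}
    for name in features[ip]:
        column = [row[name] for row in features.values()]
        mean = statistics.fmean(column)
        std = statistics.pstdev(column)
        explanation[name] = 0.0 if std == 0 else (features[ip][name] - mean) / std
    return explanation

# Hypothetical per-IP features and ensemble anomaly scores for one batch.
features = {
    "192.0.2.10":   {"alarm_count": 10, "distinct_targets": 2},
    "192.0.2.11":   {"alarm_count": 12, "distinct_targets": 3},
    "192.0.2.12":   {"alarm_count": 11, "distinct_targets": 2},
    "198.51.100.7": {"alarm_count": 95, "distinct_targets": 3},
}
anomaly_scores = {"192.0.2.10": 1.0, "192.0.2.11": 1.1, "192.0.2.12": 0.9, "198.51.100.7": 4.2}

ranking = top_k_ips(anomaly_scores, k=2)
why = zscore_explanation(features, ranking[0][0])
```

For the top-ranked IP, the feature with the largest absolute z-score (here the alarm count) indicates which dimension contributes most to its deviation from the other source IPs.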

The alarm data processing method builds a framework for updating the model online by combining batch learning in a batch-processing mode with the synchronization of prediction results through an asynchronous database, so that the online model update problem is solved at the system level rather than at the algorithm level. Massive alarm logs from single-point security devices undergo real-time streaming analysis and CAPEC standardization, and feature engineering is built from the attribute dimension of the attacker IP and the spatio-temporal dimension of the attack behavior. Based on the unsupervised anomaly detection algorithm LOF and the idea of ensemble learning, an improved LOF algorithm, namely EBLOF (Embedded Based Local outliers), is proposed with robustness and sensitivity in mind. The method finds attacker IPs and assists security analysts in analyzing massive, multi-source, heterogeneous security alarm logs faster and more efficiently.

Please refer to fig. 5, which shows an alarm data processing apparatus 500 according to an embodiment of the present application. The apparatus may be applied to the electronic device 1 shown in fig. 1 and to the scenarios of the alarm data processing systems shown in fig. 2A to 2C, so as to implement automatic prediction batch processing on alarm data and obtain abnormal IP information in the alarm data. The apparatus includes an obtaining module 501, an extracting module 502 and a detection module 503, which work as follows:

The obtaining module 501 is configured to obtain alarm data to be processed in a database.

The extracting module 502 is configured to extract a multi-dimensional feature set of the alarm data.

The detection module 503 is configured to input the multidimensional feature set into a preset anomaly detection integrated model, and output the anomaly IP information that generates the alarm data.

In one embodiment, the anomaly detection integrated model includes a plurality of anomaly detection base models. The detection module 503 is configured to: input each sub-feature set in the multi-dimensional feature set to the plurality of anomaly detection base models, respectively, and output a plurality of initial detection results of the abnormal IP information that generated the alarm data; and combine the plurality of initial detection results according to a preset combination strategy to obtain a final detection result of the abnormal IP information that generated the alarm data.

In an embodiment, the apparatus further includes a model building module 504 configured to: acquire sample alarm data in the database; extract a multi-dimensional sample feature set of the sample alarm data; train a plurality of local anomaly detection algorithm models with different neighbor scale parameters on each sample sub-feature set in the multi-dimensional sample feature set to obtain a plurality of anomaly detection base models; and combine the detection results of the plurality of anomaly detection base models according to the preset combination strategy to generate the anomaly detection integrated model.

In one embodiment, the model building module 504 is further configured to: acquire, once every first time period, the update alarm data within that first time period from the database; extract a multi-dimensional update feature set of the update alarm data; train a plurality of local anomaly detection algorithm models with different neighbor scale parameters on each update sub-feature set in the multi-dimensional update feature set to obtain a plurality of update base models, and combine the plurality of update base models according to the preset combination strategy; and update the anomaly detection integrated model according to the update model file obtained by combining the plurality of update base models.

In an embodiment, the preset combination strategy is an averaging strategy that averages the results over the number of neighbor scale parameters and over the number of feature-set dimensions, respectively.

In one embodiment, the detection module 503 is further configured to: input the multi-dimensional feature set to the anomaly detection integrated model once every second time period to obtain a plurality of abnormal IP information in the alarm data within that second time period, and output a ranking result of the plurality of abnormal IP information based on the anomaly score of each abnormal IP.

In one embodiment, the apparatus further includes a standardizing module 505 configured to collect an original alarm data set before the alarm data to be processed is obtained from the database, and to standardize the original alarm data set to generate the database, where the database includes the attack type to which each piece of alarm data belongs.

For a detailed description of the above alarm data processing device 500, please refer to the description of the related method steps in the above embodiments.

An embodiment of the present invention further provides a non-transitory electronic device readable storage medium, including: a program that, when run on an electronic device, causes the electronic device to perform all or part of the procedures of the methods in the above-described embodiments. The storage medium may be a magnetic Disk, an optical Disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a Flash Memory (Flash Memory), a Hard Disk (Hard Disk Drive, abbreviated as HDD), a Solid State Drive (SSD), or the like. The storage medium may also comprise a combination of memories of the kind described above.

Although the embodiments of the present invention have been described in conjunction with the accompanying drawings, those skilled in the art may make various modifications and variations without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope defined by the appended claims.

Claims

1. An alarm data processing method, characterized by comprising:

acquiring alarm data to be processed in a database;

extracting a multi-dimensional feature set of the alarm data;

and inputting the multi-dimensional feature set into a preset anomaly detection integrated model, and outputting the anomaly IP information for generating the alarm data.

2. The method of claim 1, wherein the anomaly detection integration model comprises: a plurality of anomaly detection basis models; the inputting the multidimensional feature set into a preset anomaly detection integrated model and outputting the abnormal IP information for generating the alarm data comprises the following steps:

respectively inputting each sub-feature set in the multi-dimensional feature set to a plurality of abnormal detection base models, and outputting a plurality of initial detection results of abnormal IP information for generating the alarm data;

and combining the plurality of initial detection results according to a preset combination strategy to obtain a final detection result of the abnormal IP information generating the alarm data.

3. The method of claim 1, wherein the step of building the anomaly detection integration model comprises:

acquiring sample alarm data in the database;

extracting a multi-dimensional sample feature set of the sample alarm data;

respectively training a plurality of local anomaly detection algorithm models with different neighbor scale parameters by adopting each sample sub-feature set in the multi-dimensional sample feature set to obtain a plurality of anomaly detection base models;

and combining the detection results of the plurality of abnormal detection base models according to a preset combination strategy to generate the abnormal detection integrated model.

4. The method of claim 3, wherein the step of building the anomaly detection integration model further comprises:

acquiring updated alarm data within a first time period from the database at intervals of the first time period;

extracting a multi-dimensional updating feature set of the updating alarm data;

respectively training the local anomaly detection algorithm models with different parameters of the neighboring scales by adopting each updating sub-feature set in the multi-dimensional updating feature set to obtain a plurality of updating base models, and combining the plurality of updating base models according to the preset combination strategy;

and updating the anomaly detection integrated model according to the model file combined by the plurality of updating base models.

5. The method of claim 4, wherein updating the anomaly detection integration model from the combined update model files of the plurality of update base models comprises:

verifying the updated model file, and if the output precision of the updated model file reaches a preset threshold value, overwriting the original file of the anomaly detection integrated model with the updated model file to obtain the updated anomaly detection integrated model.

6. The method of claim 3, wherein the predetermined combination strategy is an averaging strategy that averages the number of neighboring scale parameters and the dimension of the feature set, respectively.

7. The method according to claim 1, wherein the inputting the multidimensional feature set into a preset anomaly detection integration model and outputting the anomaly IP information for generating the alarm data comprises:

inputting the multi-dimensional feature set to the anomaly detection integrated model at intervals of a second time period to obtain a plurality of abnormal IP information in the alarm data within the second time period, and outputting a ranking result of the plurality of abnormal IP information based on the anomaly score of each abnormal IP.

8. The method of claim 1, further comprising, prior to said obtaining alarm data to be processed in the database:

and collecting an original alarm data set, and carrying out standardization processing on the original alarm data set to generate the database, wherein the database comprises the attack type of each alarm data.

9. An alarm data processing apparatus, comprising:

the acquisition module is used for acquiring alarm data to be processed in the database;

the extraction module is used for extracting a multi-dimensional feature set of the alarm data;

and the detection module is used for inputting the multi-dimensional feature set into a preset anomaly detection integrated model and outputting the anomaly IP information for generating the alarm data.

10. An alarm data processing system, comprising:

the data layer is used for storing alarm data and generating a database;

the characteristic layer is used for regularly reading alarm data from the database of the data layer and extracting a multi-dimensional characteristic set of the alarm data;

the deployment module is deployed with an abnormal detection integrated model and used for predicting abnormal IP information for generating the alarm data based on the feature set of the feature layer;

a model training module for updating the anomaly detection integration model of the deployment module based on a feature set of the feature layer;

and the output layer is used for outputting the prediction result of the deployment module.

11. An electronic device, comprising:

a memory to store a computer program;

a processor to execute the computer program to implement the method of any one of claims 1 to 8.

12. A non-transitory electronic device readable storage medium, comprising: a program which, when run by an electronic device, causes the electronic device to perform the method of any one of claims 1 to 8.