CN110019070A

CN110019070A - A security log clustering method and accountability system based on Hadoop

Info

Publication number: CN110019070A
Application number: CN201711101507.6A
Authority: CN
Inventors: 陆勰; 李明珍
Original assignee: BEIJING SAFE-CODE TECHNOLOGY Co Ltd
Current assignee: BEIJING SAFE-CODE TECHNOLOGY Co Ltd
Priority date: 2017-11-10
Filing date: 2017-11-10
Publication date: 2019-07-16

Abstract

The invention belongs to the fields of data mining and information security. It discloses an accountability system for massive security management-and-control logs. By studying massive dynamic control logs, it establishes a practicable accountability system that holds anomalous events to account, traces them to their source, and provides threat situation awareness for security incidents. In particular, it relates to an improvement of the K-Means clustering algorithm: combined with the parallel computation of Map/Reduce, K-Means is iterated in parallel, which improves the speed and accuracy of log processing.

Description

A security log clustering method and accountability system based on Hadoop

Technical field

The invention relates to a security log clustering method and accountability system based on Hadoop. It belongs to the fields of data mining and information security and is built on a cloud computing environment using the Hadoop framework. It concerns accountability for massive security management-and-control logs and, in particular, an improvement of the K-Means algorithm used for log clustering.

Background art

Cloud computing carries enormous data resources, and the sustained growth of big data puts physical equipment to a severe test. How to store, process, and analyze this rapidly expanding data has increasingly become a focus of computer science and related fields. The open-source big data processing framework Hadoop emerged in response and has become popular across industries. The Hadoop platform is efficient, reliable, and highly scalable. Its two main components are the Hadoop Distributed File System (HDFS) and the parallel processing model Map/Reduce: files are extracted from HDFS and processed iteratively by Map/Reduce to handle massive data. Map/Reduce, the core component of Hadoop, processes data efficiently, and the whole process can be divided roughly into three stages. In the first stage, data extracted from HDFS (the file storage system) is divided into fixed-size blocks (usually 64 MB) and passed to Map, which performs the initial processing and produces (key, value) pairs. The second, intermediate stage is Combine, which mainly sorts the key-value pairs; this stage is I/O-intensive. In the last stage, the intermediate values produced by Combine enter Reduce, which processes the data according to the output requirements and outputs the result.
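The three stages described above can be illustrated with a small in-memory simulation. This is only a sketch and not a Hadoop job: the function names, the sample log lines, and the choice of the first token as the key are all assumptions made for illustration.

```python
from collections import defaultdict

def map_phase(block):
    # Emit one (key, value) pair per log line; the key is the first token.
    return [(line.split()[0], 1) for line in block if line.strip()]

def combine_phase(pairs):
    # Sort the pairs by key and group the values that share a key.
    grouped = defaultdict(list)
    for key, value in sorted(pairs):
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Collapse each group of values into a single output value.
    return {key: sum(values) for key, values in grouped.items()}

def run_job(blocks):
    # Each block stands in for one fixed-size split read from storage.
    pairs = [pair for block in blocks for pair in map_phase(block)]
    return reduce_phase(combine_phase(pairs))

blocks = [
    ["WARN disk usage high", "INFO login ok"],
    ["WARN port scan detected", "ERROR service down"],
]
result = run_job(blocks)
```

Here `result` counts the log lines per leading token; in a real deployment each stage would run distributed across nodes rather than in one process.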

As the saying goes, birds of a feather flock together. Cluster analysis is long established and widely used in big data mining. It is a form of unsupervised machine learning and comes in several families, generally partition-based, grid-based, density-based, and so on. A suitable method is usually chosen according to factors such as the shape of the objects to be clustered, and methods can also be combined. Given the characteristics of the logs studied here (unstructured text), the partition-based K-Means algorithm is chosen as a typical representative because its principle is simple and it is easy to implement. It works by repeatedly computing the distance from each sample point to the cluster centers and assigning the data, by distance, to K classes specified in advance (K is generally supplied as an arbitrary input). For a sample set S of size n, the process is as follows:

Input a value of K (generally specified according to the expected number of output classes).

Randomly select K samples from the sample set S as initial cluster centers, compute the distance from each remaining sample to every initial cluster center, assign each sample to the nearest cluster, and iterate until convergence.
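A minimal sketch of this baseline procedure is given below. It assumes samples are tuples of numbers and uses the Euclidean distance; the function name `kmeans` and the parameters `max_iter` and `seed` are illustrative and do not come from the text.

```python
import math
import random

def kmeans(samples, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(samples, k)  # K random samples as initial centers
    clusters = []
    for _ in range(max_iter):
        # Assign every sample to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in samples:
            nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[nearest].append(p)
        # Recompute each center as the mean of its cluster;
        # an empty cluster keeps its previous center.
        new_centers = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else centers[j]
            for j, members in enumerate(clusters)
        ]
        if new_centers == centers:  # no center moved: converged
            break
        centers = new_centers
    return centers, clusters
```

Because the initial centers are drawn at random, different seeds can give different results, which is the weakness discussed in the next paragraph.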

Research on the K-Means iteration mainly follows two directions. The first is the selection of K: a randomly chosen K easily degrades the clustering result. The second is the selection of the initial cluster centers: there is a risk of choosing an isolated point or a noise point as an initial center, which leads to a local optimum. Recent research on these defects has concentrated on these two aspects.

Given the powerful cloud computing capability of Hadoop and the simplicity of K-Means, the parallel computing capability of Map/Reduce can be used to combine K-Means with Map/Reduce and greatly improve cloud computing capacity. On this basis, and taking massive control logs as the object of study, improving the K-Means algorithm and combining it with Map/Reduce can raise log processing capacity and reduce the missed-report and false-alarm rates. This in turn supports the discovery and reconstruction of control behaviors such as information leakage, process blocking, and network traffic interception, and the formation of accountability evidence according to a unified specification. Building a system that can trace dynamic information and hold it to account is therefore necessary for improving Internet security and the operating environment of a trusted cloud.

Summary of the invention

By studying massive dynamic control logs, the present invention establishes a practicable accountability system that holds security events to account and traces them to their source. By monitoring the security situation, it can manage the security of each network node and take preventive measures. The invention also concerns the K-Means algorithm. The known defect of K-Means clustering is that the choice of initial cluster centers and of K easily drives the algorithm into a locally optimal solution. Although many experts and scholars have done a great deal of work on improvements, shortcomings remain to some degree, and research applied to control logs in particular falls far short of current demand for analyzing and processing massive log data. The data sets used in such studies are essentially UCI samples, so high-dimensional control log sample sets have not been well covered. In addition, most K-Means research is incomplete and has not kept pace in its handling of isolated-point removal. To address these shortcomings of K-Means and to extend the scope and value of big data applications, the invention makes improvements in two respects. First, the sample set is purified by removing the isolated points that interfere with clustering. Second, a threshold is set using point density and average distance, and the initial cluster centers are selected mainly with the MMD (max-min distance) algorithm, with the aim of minimal similarity between classes and maximal similarity within classes. Under the Hadoop cloud computing model, iterating with Map/Reduce yields a good clustering result and provides a good data environment for log correlation analysis. An association algorithm then forms an evidence chain that supports accountability and presents it to the user. The user enters extracted features, the accountability system finds the relevant evidence chain, and accountability and tracing follow. The invention can greatly improve log processing speed and network security performance, manage network equipment better, and achieve control of and accountability for dynamic data.

The technical solution provided by the present invention is as follows:

A Hadoop-based security log analysis and accountability system follows the flow and characteristics of the information from log collection through to presentation. The system consists mainly of four modules: log acquisition, log storage, log analysis, and platform presentation. Its main goal is to hold anomalous events such as SQL injection, APT attacks, and DDoS attacks to account and trace them to their source through log analysis, thereby improving the security of the trusted cloud environment. The system uses a B/S (browser/server) architecture. According to their needs, users enter keywords to view the associated log evidence chain and then trace the information source from the log records, achieving control of and accountability for key information, including the discovery and reconstruction of control behaviors such as information leakage, process blocking, and network traffic interception. Logs record the operating state of the various network devices. By preprocessing, storing, and analyzing massive security logs, the network security situation can be viewed in real time on the presentation platform, and monitoring and dashboard displays allow device operation to be managed layer by layer at the program, network, and system levels. The system mainly involves log acquisition, storage, cluster analysis, association analysis, denoising, and related techniques. Log collection mainly uses collection software compatible with the Hadoop big data computing framework. Collected logs pass through the collector into the HDFS file storage system for storage, and Map/Reduce then processes and analyzes the data: denoising, clustering, and finally association analysis. During this process the control logs can be manipulated according to the control policy, including sorting and labeling, to improve the accuracy of accountability. Finally, the web interface shows the security situation trend of the target system, so that potential threats and attacks are found, anomalous events receive a rapid response at the earliest moment, and the user sees the security state of the whole system.

The technical stages of an accountability system must be closely linked. Log acquisition uses concurrent multipoint collection to improve collection efficiency, coordinated with HDFS for log storage and management. Through continuous Map/Reduce iteration, the K-Means cluster centers are updated until their values no longer change, that is, until the clustering converges. The invention uses the partition-based K-Means algorithm and applies association techniques to relate log data semantically, so that data items are no longer independent individuals but are linked to one another, which helps in distinguishing and identifying them. The purpose of log stream association is, according to the accountability requirement, to merge the program-level, system-level, and network-level control event logs related to that requirement into a complete event sequence and then generate an evidence chain that makes accountability undeniable. By entering feature values, the user can hold to account or trace specific network devices and network events. For example, log association can determine whether a virtual machine escape has occurred, whether a DDoS attack or a potential APT attack has occurred in a given domain, and whether sensitive data has been modified, so that these can be monitored and managed.
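The merging of program-level, system-level, and network-level events into one ordered sequence might look like the sketch below. The text does not specify the association algorithm, so everything here is an assumption: the event layout (dictionaries with hypothetical `time` and `message` fields), the substring match on a keyword, and the function name `build_evidence_chain`.

```python
def build_evidence_chain(logs_by_level, keyword):
    # Keep the events that mention the keyword, tagging each with its level.
    matched = [
        dict(event, level=level)
        for level, events in logs_by_level.items()
        for event in events
        if keyword in event["message"]
    ]
    # Order the merged events by time to obtain one event sequence.
    return sorted(matched, key=lambda e: e["time"])

logs = {
    "program": [{"time": 3, "message": "vm-42 API call"}],
    "system": [
        {"time": 1, "message": "vm-42 escape attempt"},
        {"time": 2, "message": "vm-7 boot"},
    ],
    "network": [{"time": 4, "message": "vm-42 traffic spike"}],
}
chain = build_evidence_chain(logs, "vm-42")
```

A production system would likely associate events on richer features than a substring match; the sketch only shows the merge-and-order idea.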

The successful operation of the accountability system depends on the improved K-Means clustering algorithm that is the focus of the invention. The improvement is based at its core on combining MMD (max-min distance) with density: first the sample set data is optimized, then K initial cluster centers are found using distance and density, and iteration continues until the sample set has been assigned to K clusters. In this stage, the sample mean is computed according to the basic principle of K-Means. Using the definition of point density, the point density of the mean point and of every sample in the set is computed and sorted, and samples whose density is far below that of the sample mean are removed, which eliminates noise points. The mean of the new sample set is then recomputed and taken as the center; the distance from each sample point to this center is computed and sorted in descending order, and the two sample points nearest to and farthest from the center are chosen as initial cluster centers. The third, fourth, and subsequent centers are found in turn according to the MMD principle until there are K initial cluster centers. K-Means iteration then clusters the sample set, and finally a minimum-variance evaluation function assesses the clustering result, completing the whole clustering process.

Preferably, the improved K-Means algorithm proposed by the invention is combined with Map/Reduce to analyze and process massive control logs and perform association analysis, in order to control and block certain sensitive behaviors. Accountability is achieved through feature extraction, so that sensitive data, potential threats (APT), DDoS attacks, information leakage, and the like can be held to account and traced, forming a complete security accountability system with monitoring beforehand and accountability afterward.
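The removal of isolated points can be sketched as follows. Two points are assumptions, since the text does not pin them down: point density is taken to be the number of samples within the average-distance radius of a point, and "far below" is modelled with a hypothetical `ratio` parameter applied to the threshold.

```python
import math

def mean_point(samples):
    # Coordinate-wise mean of the sample set.
    return tuple(sum(c) / len(samples) for c in zip(*samples))

def point_density(center, samples, radius):
    # Assumed definition: number of samples within `radius` of `center`.
    return sum(1 for p in samples if math.dist(center, p) <= radius)

def remove_isolated_points(samples, ratio=0.5):
    mean = mean_point(samples)
    # Average distance from the samples to the mean is used as the radius.
    radius = sum(math.dist(mean, p) for p in samples) / len(samples)
    # Density of the mean point serves as the threshold.
    threshold = point_density(mean, samples, radius)
    # Keep only samples whose density is not far below the threshold.
    return [p for p in samples
            if point_density(p, samples, radius) >= ratio * threshold]

samples = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (100, 100)]
cleaned = remove_isolated_points(samples)
```

In this small example the distant point `(100, 100)` has only itself within the radius and is dropped, while the five clustered points are kept.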

Brief description of the drawings

Fig. 1 is the architecture diagram of the Hadoop-based security log accountability system of the present invention.

Fig. 2 is a schematic diagram of the Map/Reduce algorithm of the present invention.

Fig. 3 is the flow chart of the Map/Reduce parallelization of the improved K-Means of the present invention.

Detailed description of the embodiments

In this system, the user enters the feature keywords for accountability in the web front-end interface. The system accesses the log association database and, following a reverse-lookup approach, retrieves the data needed for accountability and tracing, so that sensitive data, potential threats (APT), DDoS attacks, information leakage, and the like can be held to account and traced. This forms a complete security accountability system with monitoring beforehand and accountability afterward. The whole system is divided into four modules. The first is the data acquisition module, which collects log data. The second is the log storage module, which stores logs and splits them into blocks. The third is the analysis module, which clusters and associates logs to form the accountability evidence chain. The fourth is the platform presentation module, which displays and queries key information through functional modules such as security situation, dashboard, monitoring, and query. Following the overall architecture, the system is built up layer by layer from low to high in four stages.

(1) the log collection stage

Massive control logs are first collected by log collectors. These logs come from a wide range of sources at the program, system, and network levels, and the sample sets collected at these three levels record logs that correspond to the security labeling policy. Collection of control logs covers network devices, security devices, operating systems, application programs, and so on. The collector must be well compatible with Hadoop's HDFS file storage system, and the logs are vectorized. This system uses the Flume log collector, which is well compatible with Hadoop; it is deployed on each node to collect log information, and the log format is then unified.

(2) the log storage stage

Processed logs are stored in HDFS. This stage mainly does two things. On the one hand, logs unrelated to security are removed, including request-failure logs, commercial logs, and video and audio logs. On the other hand, the screened logs are stored in blocks of fixed size, usually 64 MB.

(3) the log analysis stage

The main task of this stage is to cluster and associate the massive control logs and form a well-founded, traceable evidence chain that supports the presentation on the upper-layer platform. The inventive focus of the system is the improvement of the clustering algorithm, in particular the K-Means clustering algorithm parallelized with Map/Reduce, and the improvement mainly concerns the selection of the initial cluster centers. In this stage, Map tasks are allocated elastically according to the splits created during storage. The Map stage performs the assignment: it repeatedly computes the distance between each sample point and the cluster centers and assigns the point accordingly. The Reduce stage rebuilds the cluster centers: the key-value pairs from the Map stage are processed by Combine and passed as input to Reduce, which recomputes the cluster centers according to the K-Means principle until convergence. The principle is as follows. A threshold is set using point density and average distance: taking the sample mean as the center and the average distance as the radius, the density of that point is computed and used as the threshold. The point density of every sample point is then computed and compared with the threshold; points whose density is far below the threshold are regarded as isolated points and removed. The sample mean is then recomputed and taken as the center, the distance from each point of the new sample set to the center is computed and sorted in descending order, and the initial cluster centers are selected with the MMD (max-min distance) algorithm, with the aim of minimal similarity between classes and maximal similarity within classes. Under the Hadoop cloud computing model, iterating with Map/Reduce yields a good clustering result. In this process the distance between two points is usually computed as the Euclidean distance. In general, for a sample set of size n, K is selected from a bounded interval (the expression for this interval is missing from the text). The clustering procedure is as follows:

Input: sample set S and the value K.

Output: K clusters.

Process:

Step 1: input the sample set S and the value K.

Step 2: compute the sample mean of S and the average distance from the samples to that mean point, then compute the point density of the mean point and set it as the threshold.

Step 3: compute the point density of each sample and compare it with the threshold; remove the points whose density is far below the threshold to form a new sample set X.

Step 4: recompute the sample mean point of X, take it as the center, compute the distance from each point to it, and sort the distances in descending order.

Step 5: according to MMD, choose the points with the minimum and maximum distance as the first and second initial cluster centers, then find the third, fourth, and subsequent centers by the MMD principle until K initial cluster centers have been found.

Step 6: according to the value of K and the working principle of Map/Reduce, the Map stage repeatedly computes the distance from each point to the cluster centers and forms (key, value) pairs; after sorting by the intermediate Combine stage, these are passed to Reduce for computation, which outputs K cluster results. If the result has not converged, iteration continues; otherwise the result is output. A suitable association algorithm is then chosen for the output and applied according to association rules to form the accountability evidence chain.
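The Map/Reduce iteration of Step 6 can be sketched in memory as below. This is a single-process simulation under stated assumptions, not Hadoop code: the function names are illustrative, the initial centers are passed in directly, the combine step both sorts the pairs and accumulates partial sums (the text mentions only sorting), and convergence is tested with a hypothetical tolerance `tol`.

```python
import math

def kmeans_map(point, centers):
    # Key: index of the nearest center. Value: the point and a count of 1.
    key = min(range(len(centers)), key=lambda j: math.dist(point, centers[j]))
    return key, (point, 1)

def kmeans_combine(pairs):
    # Sort the pairs by key and accumulate a partial sum and count per key.
    partial = {}
    for key, (point, count) in sorted(pairs):
        if key in partial:
            acc, n = partial[key]
            partial[key] = (tuple(a + b for a, b in zip(acc, point)), n + count)
        else:
            partial[key] = (tuple(point), count)
    return partial

def kmeans_reduce(partial, centers):
    # New center = mean of the assigned points; unused centers stay put.
    new_centers = list(centers)
    for key, (acc, n) in partial.items():
        new_centers[key] = tuple(a / n for a in acc)
    return new_centers

def mapreduce_kmeans(samples, centers, max_iter=50, tol=1e-9):
    for _ in range(max_iter):
        pairs = [kmeans_map(p, centers) for p in samples]
        new_centers = kmeans_reduce(kmeans_combine(pairs), centers)
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:  # centers stopped moving: converged
            break
    return centers

samples = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
final_centers = mapreduce_kmeans(samples, [(0.0, 0.0), (10.0, 10.0)])
```

In the full procedure, the `centers` argument would come from Step 5 rather than being chosen by hand, and `samples` would be the purified set X from Step 3.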

The related definitions used in this process are as follows:

Definition 1: the basic idea of the MMD algorithm is to select samples that are far apart from one another as initial cluster centers.

For a data set $X = \{x_1, x_2, x_3, \ldots, x_n\}$, the algorithm determines the initial cluster centers as follows:

Step 1: select one sample at random from $X$ as the first cluster center $O_1$.

Step 2: select the sample in $X$ farthest from $O_1$ as the second cluster center $O_2$.

Step 3: compute the distances $d_{i1}$ and $d_{i2}$ from each remaining sample $x_i$ to $O_1$ and $O_2$ respectively, and take their minimum $d_i = \min(d_{i1}, d_{i2})$, $i = 1, 2, \ldots, n$.

Step 4: if $D_t = \max\{d_i\} > \alpha \lVert O_1 - O_2 \rVert$, take the corresponding sample $x_t$ as the third cluster center $O_3$. The proportionality coefficient $\alpha$ controls the number of clusters.

Step 5: if $q$ cluster centers ($4 \le q < k$) have been determined, compute the distances $d_{ij}$ from the remaining samples to the existing cluster centers. If $D_r = \max\{\min(d_{i1}, d_{i2}, \ldots, d_{iq})\} > \alpha \lVert O_1 - O_2 \rVert$, take the corresponding sample $x_r$ as the $(q+1)$-th cluster center.

Step 6: repeat the same processing until the required number of initial cluster centers has been found.
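The steps of Definition 1 can be sketched as follows. The function name, the default `alpha=0.5`, and the `seed` parameter are assumptions made for illustration, and the sketch assumes `k >= 2`. It follows the random first center of Definition 1 rather than the mean-based choice described elsewhere in the text.

```python
import math
import random

def mmd_initial_centers(samples, k, alpha=0.5, seed=0):
    rng = random.Random(seed)
    # Step 1: a random sample is the first center.
    centers = [rng.choice(samples)]
    # Step 2: the sample farthest from the first center is the second.
    centers.append(max(samples, key=lambda p: math.dist(p, centers[0])))
    base = math.dist(centers[0], centers[1])
    while len(centers) < k:
        # Steps 3 to 5: for each sample take the distance to its nearest
        # existing center, then pick the sample where that distance is largest.
        candidate = max(samples,
                        key=lambda p: min(math.dist(p, c) for c in centers))
        gap = min(math.dist(candidate, c) for c in centers)
        if gap <= alpha * base:
            break  # no remaining sample is far enough to become a center
        centers.append(candidate)
    return centers

samples = [(0, 0), (0, 1), (100, 0), (100, 1), (50, 90), (50, 91)]
centers = mmd_initial_centers(samples, k=3)
```

With three tight, well-separated groups as in this example, the procedure places one initial center in each group regardless of which sample is drawn first.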

Definition 2: the Euclidean distance is defined as $d = \sqrt{\sum_{i=1}^{n} (x_{i1} - x_{i2})^2}$, where $i = 1, 2, \ldots, n$, $x_{i1}$ is the $i$-th coordinate of the first point, and $x_{i2}$ is the $i$-th coordinate of the second point.
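As a direct transcription of Definition 2, a short helper might look like this; the function name and the dimension check are additions for illustration.

```python
import math

def euclidean_distance(p, q):
    # p and q are sequences holding the coordinates of the two points.
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
```

For example, `euclidean_distance((0, 0), (3, 4))` evaluates to `5.0`.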

(4) the platform presentation stage

This stage is mainly for presentation and query. By entering a keyword, the user can query information related to it; for example, entering the ID of a virtual machine retrieves the security logs related to that machine. The stage also shows the security situation of the system, forming a security situation awareness system with functions such as query, monitoring, and dashboards, and provides visual interaction for the user.

Building the four stages above step by step completes the whole visual, interactive, intelligent log analysis and accountability system. The improved K-Means algorithm is applied in the Map/Reduce parallel computing environment, which increases data processing speed and allows the security situation of the whole network to be displayed.

Claims

1. A security log clustering method and accountability system based on Hadoop, characterized by comprising:

an accountability system capable of handling dynamic data, which tracks the key information that is entered, finds the root cause of a problem, and discovers and reconstructs control behaviors such as information leakage, process blocking, and network traffic interception;

an improved K-Means clustering algorithm which, compared with the traditional K-Means clustering algorithm, on the one hand improves the accuracy of the algorithm and thus the efficiency and quality of the accountability system, and on the other hand improves the clustering of control logs; it uses the HDFS storage of the big data cloud computing framework Hadoop and the efficient iteration capability of Map/Reduce to parallelize K-Means with Map/Reduce, and performs particularly well on massive control logs under the cloud computing model of big data processing.

2. The accountability system with dynamic threat awareness according to claim 1, characterized in that:

according to the accountability feature keywords that are entered, the associated log evidence chain is found and compared against the accountability features, and the relevant devices or program interfaces in the cloud environment are held to account at the program, system, and network levels, involving program-level API call logs, system-level virtual machine escape logs, and network-level running-state logs of security devices and network devices, thereby managing and holding to account dynamic log data.

3. The improved K-Means clustering algorithm according to claim 1, characterized in that:

the control logs are vectorized; a threshold is set using point density, the sample mean, and the average distance; isolated points are removed; the initial cluster centers are determined by the max-min distance algorithm, improving log clustering; and high-quality cluster data is output through continuous Map/Reduce iteration.