CN105138661B - A Hadoop-based network security log k-means cluster analysis system and method - Google Patents


Info

Publication number
CN105138661B
CN105138661B (application CN201510553636.3A)
Authority
CN
China
Prior art keywords
data
log
hadoop
analysis
hive
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201510553636.3A
Other languages
Chinese (zh)
Other versions
CN105138661A (en)
Inventor
高岭
苏蓉
高妮
王帆
杨建锋
雷艳婷
申元
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwest University
Original Assignee
Northwest University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwest University
Priority to CN201510553636.3A
Publication of CN105138661A
Application granted
Publication of CN105138661B
Expired - Fee Related
Anticipated expiration


Classifications

    • G06F16/2457 — Query processing with adaptation to user needs (G Physics; G06F Electric digital data processing; G06F16/00 Information retrieval; G06F16/24 Querying)
    • G06F16/182 — Distributed file systems (G06F16/10 File systems; G06F16/18 File system types)
    • G06F16/25 — Integrating or interfacing systems involving database management systems (G06F16/20 Information retrieval of structured data)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A Hadoop-based k-means cluster analysis system and method for network security logs, comprising a log data acquisition subsystem, a hybrid-mechanism log data storage management subsystem, and a log data analysis subsystem. At the data storage layer, log data are stored under a hybrid storage mechanism in which Hadoop cooperates with a traditional data warehouse, and an interface for operating on Hive is provided at the data access layer; the storage layer and the computation layer receive instructions from the Hive engine and, through HDFS working with MapReduce, carry out efficient query analysis on the data. When the log data are mined, a MapReduce implementation of the k-means algorithm clusters the data and the clustering results are analysed. The architecture in which Hadoop cooperates with the traditional data warehouse makes up for the warehouse's shortcomings in mass data processing and storage while keeping the existing warehouse in full use, and clustering with the MapReduce-based k-means algorithm allows timely security-grade assessment and early warning on the log data.

Description

A Hadoop-based network security log k-means cluster analysis system and method
Technical field
The invention belongs to the technical field of computer information processing, and relates in particular to a Hadoop-based k-means cluster analysis system and method for network security logs.
Background technology
With the explosive growth of data, the amount of information increases sharply, and the traditional data warehouses that enterprises already own can hardly keep pace with this growth. Traditional data warehouses are generally built on high-performance integrated machines, which are costly and poorly scalable, and they are only good at processing structured data; this limits them when mining the inherent value of massive heterogeneous data, which is the biggest difference between Hadoop and traditional data processing. The existing traditional warehouse should therefore be used rationally: by combining it with a big data platform and establishing a unified architecture for data analysis and data processing, the cooperation of Hadoop with the traditional data warehouse realizes monitoring and statistical analysis of network logs.
Hadoop is an open-source distributed computing platform managed by the Apache organization; it is a software framework capable of distributed processing of massive data. With the Hadoop distributed file system HDFS and MapReduce at its core, Hadoop provides users with a distributed infrastructure whose low-level details are transparent. The high fault tolerance, scalability, availability and throughput of HDFS allow users to deploy Hadoop on cheap hardware to form a distributed system, and the MapReduce distributed programming model lets users develop parallel applications without understanding the low-level details of the distributed system.
HDFS is the basis of data storage management in distributed computing; it was developed to meet the demand of streaming access to and processing of very large files. Its characteristics give massive data failure-tolerant storage and bring many conveniences to applications that process very large data sets. HDFS has a master/slave architecture with two classes of nodes: the NameNode, also called the metadata node, and the DataNode, also called the data node; these respectively play the Master role and the Worker role that executes concrete tasks. Because of the nature of distributed storage, an HDFS cluster has one NameNode and multiple DataNodes. The metadata node manages the namespace of the file system; the data nodes are where the file data are actually stored.
The MapReduce parallel computation framework is a parallel program execution system. It provides a parallel processing model and process comprising the two stages Map and Reduce, handles data input as key-value pairs, and can automatically complete data partitioning and scheduling. During program execution the framework is responsible for scheduling and allocating computing resources, partitioning the input and output data, scheduling program execution, monitoring execution state, and for synchronizing the compute nodes and collecting intermediate results.
Sqoop is a tool for fast batch data exchange between relational databases and the Hadoop platform. It can import batch data from a relational database into HDFS or Hive on Hadoop, and conversely export data from the Hadoop platform into a relational database.
Hive is a data warehouse built on Hadoop for managing structured and semi-structured data stored in HDFS. Its programming interface allows query analysis programs to be written directly in HiveQL, an SQL-like query language, and it provides the persistent schema, storage management and query analysis functions a data warehouse needs; at the lower layer, HiveQL statements are translated into corresponding MapReduce programs for execution.
Summary of the invention
To overcome the above shortcomings of the prior art, the purpose of the present invention is to provide a Hadoop-based k-means cluster analysis system and method for network security logs which, on the basis of reasonable use of an already-built traditional data warehouse, integrates a big data platform and establishes a unified data storage and data processing architecture, overcoming the traditional warehouse's poor scalability, its restriction to structured data, and its inability to mine the inherent value of massive heterogeneous data.
To achieve the above object, the technical solution adopted by the present invention is: a Hadoop-based k-means cluster analysis system for network security logs, comprising a log data acquisition subsystem, a hybrid-mechanism log data storage management subsystem, and a log data analysis subsystem;
the log data acquisition subsystem collects the network security log data of all devices;
the hybrid-mechanism log data storage management subsystem manages and stores all log data;
the log data analysis subsystem performs fast retrieval and analysis on all log data and mines the potential value of the log data.
The log data acquisition subsystem runs under Linux: a Syslogd centralized log server is configured, the syslog data of the recording devices are collected by syslog, and the log data are managed centrally.
The hybrid-mechanism log data storage management subsystem integrates the Hadoop platform with the traditional data warehouse and comprises an HDFS distributed file system module and a Hadoop-platform/traditional-data-warehouse cooperation module.
The log data retrieval and analysis subsystem mainly performs simple statistical query analysis of the data with the tool Hive: HiveQL query statements are written according to demand, and under the drive of the Hive Driver the lexical analysis, syntactic analysis, compilation, optimization and query-plan generation of the HiveQL statements are completed; the generated query plan is stored in HDFS and subsequently called and executed by MapReduce. For the analysis of potential information in the log data, the corresponding algorithms are realized by writing MapReduce programs that mine its inherent value.
The file access process of the HDFS distributed file system module is:
1) the application program sends a file name to the NameNode through the HDFS client program;
2) after the NameNode receives the file name, it retrieves the data blocks corresponding to the file name in the HDFS directory, finds the DataNode addresses holding those blocks according to the block information, and sends these addresses back to the client;
3) after receiving the DataNode addresses, the client carries out data transfer operations with these DataNodes concurrently and at the same time submits the relevant operation log to the NameNode.
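The three-step file access flow above can be sketched as follows. This is a toy Python model of the lookup-then-fetch pattern, not the real HDFS client API; the directory contents, block ids and node addresses are all illustrative assumptions.

```python
# Fake namenode metadata: file name -> list of (block id, datanode address)
NAMENODE_DIRECTORY = {
    "syslog_incoming": [("blk_1", "dn-1:50010"), ("blk_2", "dn-2:50010")],
}

# Fake datanode storage: (address, block id) -> block contents
DATANODE_BLOCKS = {
    ("dn-1:50010", "blk_1"): b"2015-09-01 host1 sshd: failed login\n",
    ("dn-2:50010", "blk_2"): b"2015-09-01 host2 kernel: link up\n",
}

def namenode_lookup(filename):
    """Steps 1-2: the client sends the file name, the namenode returns
    the block list with the datanode address holding each block."""
    return NAMENODE_DIRECTORY[filename]

def read_file(filename):
    """Step 3: the client contacts each returned DataNode address and
    concatenates the blocks into the file contents."""
    data = b""
    for block_id, address in namenode_lookup(filename):
        data += DATANODE_BLOCKS[(address, block_id)]
    return data

content = read_file("syslog_incoming")
```

In the real system the per-block reads happen concurrently against different DataNodes; the sequential loop here only models the metadata-first ordering.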
In the Hadoop-platform/traditional-data-warehouse cooperation module, a MySQL database is configured as the metadata store of Hive, used to hold Hive's schema (table structure) information, and the Sqoop tool realizes data transfer between the traditional data warehouse and the Hadoop platform.
With a Hadoop cluster as the platform, the traditional data warehouse and the big data platform are integrated: MySQL serves as Hive's metadata store, holding Hive's schema information, and the Sqoop tool transfers data between the traditional data warehouse and the big data platform. The architecture comprises a data source layer, a data storage layer, a computation layer, a data analysis layer and a result display layer;
the data source layer acquires the log data of all devices through the configured Syslogd centralized log server, and the log data are then imported into the data storage layer from the traditional data warehouse by the Sqoop tool;
the data storage layer is the hybrid storage architecture in which Hadoop cooperates with the traditional data warehouse: data are imported into HDFS through the Sqoop transfer tool, the imported data are processed, and the processed data are loaded into the corresponding tables in Hive;
at the result display layer, the user issues requests to the data analysis layer;
the data analysis layer converts the user's request into the corresponding HiveQL statement and, under the drive of the Hive Driver, completes the execution of the operation;
the computation layer receives instructions from the Hive engine and, through the HDFS of the data storage layer in cooperation with MapReduce, realizes the processing and analysis of the data, finally returning the results to the result display layer.
A Hadoop-based k-means cluster analysis method for network security logs is characterized by comprising the following steps:
1) log data preprocessing: the Syslog_incoming_mes file holding the text of the log description information is converted into a text vector file;
2) realization of the MapReduce-based k-means algorithm: the k-means clustering algorithm is run on the text vectors.
The log data preprocessing comprises the following steps:
1) removing function words: words without real meaning are removed from the log description text;
2) part-of-speech tagging: the system uses the english-left3words-distsim.tagger tagger to tag each word in the log description;
3) extracting content words: after tagging, the system extracts nouns (NN, NNS, NNP, NNPS), verbs (VB, VBP, VBN, VBD) and adjectives (JJ, JJR, JJS); these words carry the actual meaning and can accurately express the log information;
4) obtaining the frequent dictionary: word frequencies are counted over all records; high-frequency words are representative of the description field, so words whose frequency exceeds a threshold are taken as keyword elements to form the frequent dictionary, which can effectively express the log information;
5) generating the text vector file: each log description field is compared with the frequent dictionary to obtain a keyspace composed of a string of 0s and 1s, and the set of keyspaces constitutes the text vector file.
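Under simplifying assumptions, the five preprocessing steps can be sketched in Python as below. A tiny hand-written stop-word list stands in for the english-left3words-distsim.tagger part-of-speech filter used by the real system, and the sample log lines and the frequency threshold are invented for illustration.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "of", "to", "is", "on", "for", "from"}  # function words (step 1)

def tokenize(message):
    # steps 1-3 collapsed: lower-case, keep alphabetic words, drop function words
    words = re.findall(r"[a-z]+", message.lower())
    return [w for w in words if w not in STOP_WORDS]

def frequent_dictionary(messages, threshold=1):
    # step 4: keep words whose frequency over all records exceeds the threshold
    counts = Counter(w for m in messages for w in tokenize(m))
    return sorted(w for w, c in counts.items() if c > threshold)

def to_vector(message, dictionary):
    # step 5: the 0/1 keyspace -- 1 where a dictionary word occurs in the log
    present = set(tokenize(message))
    return [1 if w in present else 0 for w in dictionary]

logs = ["Failed password for root",
        "Failed password for admin",
        "Accepted password for root"]
dictionary = frequent_dictionary(logs)
vectors = [to_vector(m, dictionary) for m in logs]
```

The list of 0/1 vectors plays the role of the text vector file on which the clustering below runs.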
The realization of the MapReduce-based k-means algorithm comprises the following steps:
1) all points of the original data set are scanned and k points are randomly selected as the initial cluster centers;
2) each Map node reads its local data split and generates cluster sets with the k-means algorithm; in the Reduce stage the several cluster sets are combined into new global cluster centers, and this process is repeated until the termination condition is met;
3) all data elements are partitioned into clusters according to the finally generated cluster centers.
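One round of the Map/Reduce loop in step 2) can be sketched in plain Python as follows. The patent realizes this as Hadoop MapReduce programs in Java; the function names and the toy two-split data set here are assumptions for the sketch.

```python
def nearest(point, centers):
    # map side: index of the closest cluster center (squared Euclidean distance)
    return min(range(len(centers)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])))

def map_combine(split, centers):
    # a Map node's work on its local split: per cluster id, a partial
    # (sum-vector, point count) pair, already merged combiner-style
    partial = {}
    for point in split:
        cid = nearest(point, centers)
        s, n = partial.get(cid, ([0.0] * len(point), 0))
        partial[cid] = ([a + b for a, b in zip(s, point)], n + 1)
    return partial

def reduce_centers(partials, old_centers):
    # Reduce stage: merge the partial sums of all Map nodes and average
    merged = {}
    for partial in partials:
        for cid, (s, n) in partial.items():
            ms, mn = merged.get(cid, ([0.0] * len(s), 0))
            merged[cid] = ([a + b for a, b in zip(ms, s)], mn + n)
    new_centers = list(old_centers)
    for cid, (s, n) in merged.items():
        new_centers[cid] = tuple(v / n for v in s)
    return new_centers

splits = [[(0.0, 0.0), (0.0, 1.0)], [(5.0, 5.0), (6.0, 5.0)]]  # two Map nodes
centers = [(0.0, 0.0), (5.0, 5.0)]                             # k = 2 initial centers
partials = [map_combine(s, centers) for s in splits]
centers = reduce_centers(partials, centers)
```

Step 2) of the method repeats this round until the centers stop moving; step 3) is then a final pass of `nearest` over every point.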
The advantages of the technical solution of the present invention are mainly reflected in:
1) Hadoop support is provided on the basis of the existing traditional data warehouse, and a unified data storage and data processing architecture is established, making up for the traditional warehouse's shortcomings in mass data processing and storage while keeping the existing warehouse in full use.
2) As the data grow, more cluster resources are needed to process them; Hadoop is an easily extended system, and simply configuring a new node conveniently extends the cluster and raises its computing power.
3) For the heterogeneous data in massive network logs, the source data are first processed with MapReduce, the processed data are imported into the corresponding Hive tables, HiveQL statements are then written according to demand for simple query analysis of the data, and the k-means algorithm is realized with MapReduce to mine the data. Query analysis efficiency is improved, and the potential value of the data is excavated.
Description of the drawings
Fig. 1 is a functional block diagram of the system structure of the present invention.
Fig. 2 is an architecture diagram of the network log analysis system of the present invention, in which Hadoop cooperates with the traditional data warehouse.
Fig. 3 is the research framework of the k-means clustering algorithm for log data in the present invention.
Specific implementation mode
The technical scheme of the present invention is described in detail below with reference to the embodiments and the accompanying drawings, but is not limited thereto.
Referring to Fig. 1, a Hadoop-based k-means cluster analysis system for network security logs comprises a log data acquisition subsystem 11, a hybrid-mechanism log data storage management subsystem 12, and a log data analysis subsystem 13;
the log data acquisition subsystem 11 collects the network security log data of all devices;
the hybrid-mechanism log data storage management subsystem 12 manages and stores all log data;
the log data analysis subsystem 13 performs fast retrieval and analysis on all log data and mines the potential value of the log data.
The operation workflow of the system modules is as follows:
Step 1: log data acquisition. A Syslogd centralized log server is configured with UDP as the transport protocol; the log management configuration of every security device sends its logs through the destination port to the log server on which the Syslog software is installed, and the log server automatically receives the log data and writes them into log files.
Step 2: the table syslog_incoming holding the log information in MySQL is imported into HDFS with Sqoop, using the command:
sqoop import --connect jdbc:mysql://219.245.31.39:3306/syslog --username sqoop --password sqoop --table syslog_incoming -m 1
Sqoop imports a table from MySQL through a MapReduce job that extracts the records row by row and writes them to HDFS. The NameNode in the cluster manages where the data are stored and tells the client the storage locations; after obtaining the locations the client starts writing. When writing, the data are split into blocks and stored in several replicas placed on different DataNodes: the client first writes the data to the first node, and while the first node receives the data it pushes what it has received to the second node, the second pushes to the third, and so on.
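The block-splitting and pipelined replication just described can be modelled with a few lines of Python. The block size, node names and data are toy values, and the forwarding loop only imitates the hop-by-hop push of the write pipeline, not the real DataNode protocol.

```python
BLOCK_SIZE = 8                       # toy block size (HDFS defaults are 64/128 MB)
PIPELINE = ["dn-1", "dn-2", "dn-3"]  # replication pipeline of three DataNodes

def split_blocks(data, block_size=BLOCK_SIZE):
    # the client cuts the outgoing data into fixed-size blocks
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def write_block(block, pipeline, storage):
    # the client writes only to pipeline[0]; each node forwards the block
    # downstream while receiving it -- modelled here as a simple loop
    for node in pipeline:
        storage.setdefault(node, []).append(block)

def hdfs_write(data, pipeline=PIPELINE):
    storage = {}
    for block in split_blocks(data):
        write_block(block, pipeline, storage)
    return storage

copies = hdfs_write(b"row1;row2;row3;row4;")
```

After the write, every node in the pipeline holds a full replica of all blocks, which is the property the hop-by-hop push guarantees.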
Step 3: MapReduce programs are written to extract the useful information from the log data imported into HDFS.
Step 4: according to the table in the extracted relational data source, a Hive table is generated with Sqoop; the command directly generates the definition of the corresponding Hive table, and the data stored in HDFS are then loaded:
sqoop create-hive-table --connect jdbc:mysql://219.245.31.39:3306/syslog --table syslog_incoming --fields-terminated-by ','
Hive is started and the data are loaded:
load data inpath "syslog_incoming" into table syslog_incoming;
Step 5: according to the business demand, the corresponding HiveQL statements or MapReduce programs are written to perform statistical analysis on the log data. The specific steps of the statistical analysis are: the data table is partitioned according to the business demand, a partition being defined with the PARTITIONED BY clause when the table is created. The table records are therefore defined, according to demand, as partitioned by severity level and by time (year, quarter, month); the following example defines the table records as partitioned by level:
hive> create table syslog_incoming_priority (facility string, data string, host string)
    > partitioned by (priority string)
    > row format delimited
    > fields terminated by '\t'
    > stored as textfile;
After the table structure is defined, the data are loaded into the partitioned table:
hive> insert into table syslog_incoming_priority
    > partition (priority)
    > select facility, data, host, priority
    > from syslog_incoming;
At the file system level a partition is a nested subdirectory under the table directory, so the table directory structure now contains one level subdirectory per priority, with the data files stored in the bottom directories. Finally, HiveQL query statements are written according to demand, and the cluster converts the query statements into MapReduce tasks and runs them.
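The partition-as-subdirectory layout described in this step can be illustrated with a small Python model. The table and column names follow the example above, while the records and the dynamic-partition routing are simplified assumptions; real Hive lays the same structure out as HDFS directories.

```python
from collections import defaultdict

def load_partitioned(records, table="syslog_incoming_priority"):
    """Route each record into <table>/priority=<value>/, the way a
    dynamic-partition insert lays partitions out as subdirectories."""
    layout = defaultdict(list)
    for facility, data, host, priority in records:
        layout[f"{table}/priority={priority}"].append((facility, data, host))
    return dict(layout)

def query_by_priority(layout, priority, table="syslog_incoming_priority"):
    # partition pruning: only one subdirectory is read, not the whole table
    return layout.get(f"{table}/priority={priority}", [])

rows = [("auth", "2015-09-01", "host1", "err"),
        ("kern", "2015-09-01", "host2", "info"),
        ("auth", "2015-09-02", "host3", "err")]
layout = load_partitioned(rows)
```

This is why a query that filters on the partition column is faster: the cluster only scans the matching subdirectory.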
Step 6: the query analysis results are exported into MySQL through Sqoop, and the foreground display interface presents them to the user as charts.
Referring to Fig. 3, a Hadoop-based k-means cluster analysis method for network security logs comprises the following steps:
1) log data preprocessing: the Syslog_incoming_mes file holding the text of the log description information is converted into a text vector file;
2) realization of the MapReduce-based k-means algorithm: the k-means clustering algorithm is run on the text vectors.
The log data preprocessing comprises the following steps:
1) removing function words: words without real meaning are removed from the log description text;
2) part-of-speech tagging: the system uses the english-left3words-distsim.tagger tagger to tag each word in the log description;
3) extracting content words: after tagging, the system extracts nouns (NN, NNS, NNP, NNPS), verbs (VB, VBP, VBN, VBD) and adjectives (JJ, JJR, JJS); these words carry the actual meaning and can accurately express the log information;
4) obtaining the frequent dictionary: word frequencies are counted over all records; high-frequency words are representative of the description field, so words whose frequency exceeds a threshold are taken as keyword elements to form the frequent dictionary, which can effectively express the log information;
5) generating the text vector file: each log description field is compared with the frequent dictionary to obtain a keyspace composed of a string of 0s and 1s, and the set of keyspaces constitutes the text vector file.
The realization of the MapReduce-based k-means algorithm comprises the following steps:
Selection of the initial cluster centers. First the data structure of a cluster is given; this class holds the basic information of a cluster, namely its id, its center coordinates and the number of points belonging to it. The type is defined as follows:
public class Cluster implements Writable {
    private int clusterID;      // cluster id
    private long numOfPoints;   // number of points belonging to the cluster
    private Instance center;    // cluster center point information
}
Then k points are randomly selected as the initial cluster centers. The extraction flow is: the cluster center set is initialized empty, and the whole data set is scanned; if the current size of the center set is less than k, the scanned point is added to it, otherwise it replaces a point in the center set with probability 1/(1+k). Through this step, the generated cluster center information is written under the Cluster-0 directory, and the files in this directory serve as globally shared information for the next round of iteration, being added to MapReduce's distributed cache as globally shared data.
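The extraction flow just described can be sketched as follows. The fixed replacement probability 1/(1+k) follows the patent's wording (a textbook reservoir sample would instead use a probability that shrinks with the number of points seen), and the seeded RNG is only there to make the sketch reproducible.

```python
import random

def pick_initial_centers(points, k, seed=0):
    rng = random.Random(seed)
    centers = []
    for p in points:
        if len(centers) < k:
            centers.append(p)                 # center set not yet full: keep the point
        elif rng.random() < 1.0 / (1 + k):
            centers[rng.randrange(k)] = p     # replace a random held center
    return centers

points = [(float(i), 0.0) for i in range(100)]
centers = pick_initial_centers(points, k=3)
```

In the patented system the resulting centers are serialized under Cluster-0 and pushed to every node via the distributed cache; here they are simply returned.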
Iterative computation of the cluster centers. This stage needs several iterations; before starting, each map node first reads, in its setup() method, the information of the clusters generated in the previous round of iteration. It comprises the following steps:
1) reading in the initial cluster centers: the initial cluster center data shared by all nodes are read from the shared cache;
2) realization of the map method: for each incoming data point the map method finds the nearest cluster center and emits the cluster id as key and the data point as value, indicating that the point belongs to the cluster with that id;
3) realization of the combiner: to reduce the network transfer overhead, a combiner is used at the map side to merge the results generated there, which relieves both the data transfer from map to reduce and the computation at the reduce side; the key and value types output by the combiner must be the same as the key and value types output by map. In the reduce program, the temporary center of the points belonging to the same cluster is computed from their information; here this is realized by simple averaging, i.e. all points in the cluster are added and divided by the number of points the cluster currently contains;
4) realization of the Reducer: the Reduce stage does roughly the same as the Combiner, further merging the Combiner's output.
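The type contract stressed in steps 3) and 4) — the combiner must emit the same key/value types as map — can be made concrete with a plain-Python stand-in for the Java classes: map emits (sum-vector, count = 1) so that combiner and reducer both merge values of one uniform type.

```python
def map_emit(point, cid):
    # map output: key = cluster id, value = (sum vector, count), with count 1
    return cid, (list(point), 1)

def merge_values(values):
    # the shared combiner/reducer merge step: fold (sum, count) pairs of one key;
    # input and output have the same value type, so it is legal as a combiner
    total = [0.0] * len(values[0][0])
    count = 0
    for s, n in values:
        total = [a + b for a, b in zip(total, s)]
        count += n
    return total, count

def new_center(values):
    # reducer only: average the merged sums into the new cluster center
    total, count = merge_values(values)
    return [v / count for v in total]

emits = [map_emit((1.0, 2.0), 0), map_emit((3.0, 4.0), 0)]
values = [v for _, v in emits]
```

Because `merge_values` is associative, Hadoop may apply it zero or more times on the map side without changing the reducer's result, which is exactly why the combiner saves transfer without affecting correctness.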
The cluster center computation step is repeated until the obtained cluster centers no longer change.
The data are partitioned according to the final cluster centers: once the final centers are obtained, the whole data set is scanned and each data point is assigned to its nearest cluster center.
Embodiment:
First the Hadoop distributed cluster environment is built from 5 PCs: one master server and four slave servers. Hadoop is configured on every machine, and Sqoop, Hive and MySQL are installed and configured on the NameNode. This embodiment uses the log records of all security devices of the Shaanxi Li An electricity supermarkets, a file of 16 GB. The logs are updated daily on a timer as required, and the statistical query results of the business are refreshed.
The method realizes fast statistical queries through Hive. Its advantages are a low learning cost: simple MapReduce statistics can be implemented quickly through SQL-like statements without developing dedicated MapReduce applications, which suits the statistical analysis of a data warehouse very well, and partitioning speeds up queries over data slices and improves retrieval efficiency. The k-means algorithm is realized through MapReduce, and a security-grade assessment is made on the output of the k-means clustering: prompt warnings are given for alarms whose dangerous-grade proportion within the same IP is large, so that the potential value of the log data is excavated.
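The per-IP warning rule mentioned at the end can be illustrated as below. The field names, cluster labels and the 0.5 threshold are assumptions for the sketch, not values fixed by the patent; the inputs stand for clustered alarm records keyed by source IP.

```python
from collections import defaultdict

def flag_risky_ips(alarms, dangerous_clusters, threshold=0.5):
    """Flag IPs whose share of alarms falling in a dangerous cluster
    exceeds the threshold; alarms is a list of (ip, cluster_label)."""
    per_ip = defaultdict(lambda: [0, 0])        # ip -> [dangerous, total]
    for ip, cluster in alarms:
        per_ip[ip][1] += 1
        if cluster in dangerous_clusters:
            per_ip[ip][0] += 1
    return sorted(ip for ip, (d, t) in per_ip.items() if d / t > threshold)

alarms = [("10.0.0.5", "bruteforce"), ("10.0.0.5", "bruteforce"),
          ("10.0.0.5", "info"), ("10.0.0.9", "info")]
risky = flag_risky_ips(alarms, {"bruteforce"})
```

The flagged list is what the foreground display layer would surface as an early-warning prompt.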

Claims (4)

1. A Hadoop-based k-means cluster analysis system for network security logs, characterized by comprising a log data acquisition subsystem (11), a hybrid-mechanism log data storage management subsystem (12), and a log data analysis subsystem (13);
the log data acquisition subsystem (11) collects the network security log data of all devices;
the hybrid-mechanism log data storage management subsystem (12) manages and stores all log data;
the log data analysis subsystem (13) performs fast retrieval and analysis on all log data and mines the potential value of the log data;
the log data acquisition subsystem (11) runs under Linux: a Syslogd centralized log server is configured, the syslog data of the recording devices are collected by syslog, and the log data are managed centrally;
the file access process of the HDFS distributed file system module is:
1) the application program sends a file name to the NameNode through the HDFS client program;
2) after the NameNode receives the file name, it retrieves the data blocks corresponding to the file name in the HDFS directory, finds the DataNode addresses holding those blocks according to the block information, and sends these addresses back to the client;
3) after receiving the DataNode addresses, the client carries out data transfer operations with these DataNodes concurrently and at the same time submits the relevant operation log to the NameNode;
in the Hadoop-platform/traditional-data-warehouse cooperation module, a MySQL database is configured as the metadata store of Hive, used to hold Hive's schema (table structure) information, and the Sqoop tool realizes data transfer between the traditional data warehouse and the Hadoop platform;
the log data analysis subsystem (13) mainly performs simple statistical query analysis of the data with the tool Hive: HiveQL query statements are written according to demand, and under the drive of the Hive Driver the lexical analysis, syntactic analysis, compilation, optimization and query-plan generation of the HiveQL statements are completed; the generated query plan is stored in HDFS and then called and executed by MapReduce; for the analysis of potential information in the log data, the corresponding algorithms are realized by writing MapReduce programs that mine its inherent value.
2. The Hadoop-based k-means cluster analysis system for network security logs according to claim 1, characterized in that the hybrid-mechanism log data storage management subsystem (12) integrates the Hadoop platform with the traditional data warehouse and comprises an HDFS distributed file system module and a Hadoop-platform/traditional-data-warehouse cooperation module.
3. A Hadoop-based k-means cluster analysis system for network security logs, characterized in that, with a Hadoop cluster as the platform, the traditional data warehouse and the big data platform are integrated: a MySQL database serves as the metadata store of Hive, holding Hive's schema (table structure) information, and the Sqoop tool realizes data transfer between the traditional data warehouse and the big data platform; the system comprises a data source layer (21), a data storage layer (22), a computation layer (23), a data analysis layer (24) and a result display layer (25);
the data source layer (21) acquires the log data of all devices through the configured Syslogd centralized log server, and the log data are then imported into the data storage layer from the traditional data warehouse by the Sqoop tool;
the data storage layer (22) is the hybrid storage architecture in which Hadoop cooperates with the traditional data warehouse: data are imported into HDFS through the Sqoop transfer tool, the imported data are processed, and the processed data are loaded into the corresponding tables in Hive;
at the result display layer (25), the user issues requests to the data analysis layer;
the data analysis layer (24) converts the user's request into the corresponding HiveQL statement and, under the drive of the Hive Driver, completes the execution of the operation;
the computation layer (23) receives instructions from the Hive engine and, through the HDFS of the data storage layer in cooperation with MapReduce, realizes the processing and analysis of the data, finally returning the results to the result display layer.
4. A method using the Hadoop-based network security log k-means cluster analysis system of claim 1, characterized by comprising the following steps:
1) Log data preprocessing: convert the Syslog_incoming_mes file containing the text of the log description information into a text vector file;
2) Implementation of the k-means algorithm based on MapReduce: run the k-means clustering algorithm on the text vectors;
The log data preprocessing comprises the following steps:
1) Remove function words: words without actual meaning are removed from the log description text;
2) Part-of-speech tagging: the system uses the english-left3words-distsim.tagger tagger to tag the words in each log description;
3) Extract content words: after tagging, the system extracts the nouns (NN, NNS, NNP, NNPS), verbs (VB, VBP, VBN, VBD) and adjectives (JJ, JJR, JJS); these words all carry actual meaning and can accurately express the log information;
4) Obtain the frequent dictionary: word frequencies are counted over all records; high-frequency words are representative of the description field, and words whose frequency exceeds a threshold are selected as keyword elements to form the frequent dictionary, which can effectively express the log information;
5) Generate the text vector file: each log description field is compared against the frequent dictionary to obtain a keyspace composed of a sequence of 0s and 1s; the set of keyspaces constitutes the text vector file;
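The five preprocessing steps above can be sketched in plain Python. This is a hedged sketch: the stop-word list, the frequency threshold, and the sample log descriptions are all invented for illustration, and a simple whitespace split plus stop-word filter stands in for the Stanford english-left3words-distsim.tagger POS filtering that the claim specifies:

```python
from collections import Counter

STOP_WORDS = {"the", "a", "for", "from", "on", "is"}  # hypothetical function-word list

def tokens(description):
    # Steps 1-3 condensed: lowercase, split, drop function words.
    # (The patent filters by POS tag instead, which requires the Stanford tagger.)
    return [w for w in description.lower().split() if w not in STOP_WORDS]

def frequent_dictionary(descriptions, threshold):
    # Step 4: count word frequency over all records; keep words at/above threshold.
    counts = Counter(w for d in descriptions for w in tokens(d))
    return sorted(w for w, c in counts.items() if c >= threshold)

def to_vector(description, dictionary):
    # Step 5: the 0/1 "keyspace" -- 1 if the dictionary word occurs in the record.
    present = set(tokens(description))
    return [1 if w in present else 0 for w in dictionary]

logs = [
    "failed login from host",       # sample log descriptions (invented)
    "failed password for user",
    "login succeeded for user",
]
dictionary = frequent_dictionary(logs, threshold=2)   # -> ['failed', 'login', 'user']
vectors = [to_vector(d, dictionary) for d in logs]    # the text vector file
```

The resulting list of 0/1 vectors is what the k-means stage below consumes.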
The implementation of the k-means algorithm based on MapReduce comprises the following steps:
1) Scan all points in the original data set and randomly select k points as the initial cluster centers;
2) Each Map node reads its local data set and generates cluster sets with the k-means algorithm; in the Reduce stage, the several cluster sets are merged to generate new global cluster centers; this process is repeated until the termination condition is met;
3) Partition all data elements into clusters according to the finally generated cluster centers.
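One iteration of the MapReduce scheme in steps 1)-3) can be sketched in plain Python. This is a hedged sketch, not the patent's implementation: each "map node" assigns its local points to the nearest center and emits (center index, partial vector sum, count), and the reduce step averages the merged partial sums into new global centers. The toy 2-D points and k=2 are invented for illustration; the patent applies this to the 0/1 text vectors:

```python
import math
from collections import defaultdict

def nearest(point, centers):
    # Index of the closest cluster center (Euclidean distance).
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))

def map_phase(split, centers):
    # Map: a node processes its local split, emitting per-center partial
    # results of the form (vector_sum, point_count).
    partial = defaultdict(lambda: ([0.0] * len(centers[0]), 0))
    for p in split:
        i = nearest(p, centers)
        s, n = partial[i]
        partial[i] = ([a + b for a, b in zip(s, p)], n + 1)
    return partial

def reduce_phase(partials, centers):
    # Reduce: merge the partial sums from all map nodes and average them
    # into the new global cluster centers.
    total = defaultdict(lambda: ([0.0] * len(centers[0]), 0))
    for partial in partials:
        for i, (s, n) in partial.items():
            ts, tn = total[i]
            total[i] = ([a + b for a, b in zip(ts, s)], tn + n)
    return [[x / n for x in s] if n else list(centers[i])
            for i, (s, n) in sorted(total.items())]

splits = [[(0.0, 0.0), (0.0, 1.0)], [(5.0, 5.0), (6.0, 5.0)]]  # two "map nodes"
centers = [(0.0, 0.0), (5.0, 5.0)]                             # k = 2 initial centers
centers = reduce_phase([map_phase(s, centers) for s in splits], centers)
print(centers)  # [[0.0, 0.5], [5.5, 5.0]]
```

In a real job this map/reduce pair would be rerun until the centers stop moving (the termination condition of step 2), after which every point is assigned to its final cluster as in step 3.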
CN201510553636.3A 2015-09-02 2015-09-02 A kind of network security daily record k-means cluster analysis systems and method based on Hadoop Expired - Fee Related CN105138661B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510553636.3A CN105138661B (en) 2015-09-02 2015-09-02 A kind of network security daily record k-means cluster analysis systems and method based on Hadoop


Publications (2)

Publication Number Publication Date
CN105138661A CN105138661A (en) 2015-12-09
CN105138661B true CN105138661B (en) 2018-10-30

Family

ID=54724008

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510553636.3A Expired - Fee Related CN105138661B (en) 2015-09-02 2015-09-02 A kind of network security daily record k-means cluster analysis systems and method based on Hadoop

Country Status (1)

Country Link
CN (1) CN105138661B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106294580A (en) * 2016-07-28 2017-01-04 武汉虹信技术服务有限责任公司 LTE network MR data analysing method based on HADOOP platform

Families Citing this family (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105608203B (en) * 2015-12-24 2019-09-17 Tcl集团股份有限公司 A kind of Internet of Things log processing method and device based on Hadoop platform
US9973521B2 (en) * 2015-12-28 2018-05-15 International Business Machines Corporation System and method for field extraction of data contained within a log stream
CN105824892A (en) * 2016-03-11 2016-08-03 广东电网有限责任公司电力科学研究院 Method for synchronizing and processing data by data pool
CN106168965B (en) * 2016-07-01 2020-06-30 竹间智能科技(上海)有限公司 Knowledge graph construction system
CN107579944B (en) * 2016-07-05 2020-08-11 南京联成科技发展股份有限公司 Artificial intelligence and MapReduce-based security attack prediction method
CN107958022A (en) * 2017-11-06 2018-04-24 余帝乾 A kind of method that Web log excavates
CN107943647A (en) * 2017-11-21 2018-04-20 北京小度互娱科技有限公司 A kind of reliable distributed information log collection method and system
CN108133043B (en) * 2018-01-12 2022-07-29 福建星瑞格软件有限公司 Structured storage method for server running logs based on big data
CN110135184B (en) * 2018-02-09 2023-12-22 中兴通讯股份有限公司 Method, device, equipment and storage medium for desensitizing static data
CN108446568B (en) * 2018-03-19 2021-04-13 西北大学 Histogram data publishing method for trend analysis differential privacy protection
CN110581873B (en) * 2018-06-11 2022-06-14 中国移动通信集团浙江有限公司 Cross-cluster redirection method and monitoring server
CN108933785B (en) * 2018-06-29 2021-02-05 平安科技(深圳)有限公司 Network risk monitoring method and device, computer equipment and storage medium
CN109254903A (en) * 2018-08-03 2019-01-22 挖财网络技术有限公司 A kind of intelligentized log analysis method and device
CN109446042B (en) * 2018-10-12 2021-12-14 安徽南瑞中天电力电子有限公司 Log management method and system for intelligent electric equipment
CN109766368B (en) * 2018-11-14 2021-08-27 国云科技股份有限公司 Hive-based data query multi-type view output system and method
CN109525593B (en) * 2018-12-20 2022-02-22 中科曙光国际信息产业有限公司 Centralized safety management and control system and method for hadoop big data platform
CN110069551A (en) * 2019-04-25 2019-07-30 江南大学 Medical Devices O&M information excavating analysis system and its application method based on Spark
CN110378550A (en) * 2019-06-03 2019-10-25 东南大学 The processing method of the extensive food data of multi-source based on distributed structure/architecture
CN112306787B (en) * 2019-07-24 2022-08-09 阿里巴巴集团控股有限公司 Error log processing method and device, electronic equipment and intelligent sound box
CN111259068A (en) * 2020-04-28 2020-06-09 成都四方伟业软件股份有限公司 Data development method and system based on data warehouse
CN111913954B (en) * 2020-06-20 2023-08-04 杭州城市大数据运营有限公司 Intelligent data standard catalog generation method and device
CN111966293A (en) * 2020-08-18 2020-11-20 北京明略昭辉科技有限公司 Cold and hot data analysis method and system
CN112380348B (en) * 2020-11-25 2024-03-26 中信百信银行股份有限公司 Metadata processing method, apparatus, electronic device and computer readable storage medium
CN113238912B (en) * 2021-05-08 2022-12-06 国家计算机网络与信息安全管理中心 Aggregation processing method for network security log data
CN113641750A (en) * 2021-08-20 2021-11-12 广东云药科技有限公司 Enterprise big data analysis platform
CN116737846A (en) * 2023-05-31 2023-09-12 深圳华夏凯词财富管理有限公司 Asset management data safety protection warehouse system based on Hive

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102999633A (en) * 2012-12-18 2013-03-27 北京师范大学珠海分校 Cloud cluster extraction method of network information
CN103399887A (en) * 2013-07-19 2013-11-20 蓝盾信息安全技术股份有限公司 Query and statistical analysis system for mass logs
CN103425762A (en) * 2013-08-05 2013-12-04 南京邮电大学 Telecom operator mass data processing method based on Hadoop platform
CN103544328A (en) * 2013-11-15 2014-01-29 南京大学 Parallel k mean value clustering method based on Hadoop
CN104616205A (en) * 2014-11-24 2015-05-13 北京科东电力控制系统有限责任公司 Distributed log analysis based operation state monitoring method of power system


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Design of a Web log analysis scheme based on Hadoop and K-means; Fu Wei et al.; Proceedings of the 19th National Youth Communication Academic Conference; 2014-10-15; p. 171 left column, section 5.2 "Parallelization of the K-means algorithm based on Hadoop" to p. 173 left column, section 5.3.3 "Session identification", Fig. 2 *
Application of big data technology in network security analysis; Wang Shuai et al.; Telecommunications Science; July 2015 (No. 7); p. 2015176-3 right column paragraph 2 to p. 2015176-5 left column paragraph 4, Fig. 1 *


Also Published As

Publication number Publication date
CN105138661A (en) 2015-12-09

Similar Documents

Publication Publication Date Title
CN105138661B (en) A kind of network security daily record k-means cluster analysis systems and method based on Hadoop
CN105183834B (en) A kind of traffic big data semantic applications method of servicing based on ontology library
CN114399006B (en) Multi-source abnormal composition image data fusion method and system based on super-calculation
CN103430144A (en) Data source analytics
Mohammed et al. A review of big data environment and its related technologies
CN102917009B (en) A kind of stock certificate data collection based on cloud computing technology and storage means and system
CN104133858A (en) Intelligent double-engine analysis system and intelligent double-engine analysis method based on column storage
CN111026874A (en) Data processing method and server of knowledge graph
Das et al. A study on big data integration with data warehouse
CN109783484A (en) The construction method and system of the data service platform of knowledge based map
Li et al. The overview of big data storage and management
Chen et al. Metadata-based information resource integration for research management
Banaei et al. Hadoop and its role in modern image processing
CN103226608A (en) Parallel file searching method based on folder-level telescopic Bloom Filter bit diagram
CN115858513A (en) Data governance method, data governance device, computer equipment and storage medium
Shakhovska et al. Big Data Model "Entity and Features"
Li et al. Survey of recent research progress and issues in big data
Arputhamary et al. A review on big data integration
Chen et al. Related technologies
Suciu et al. Big data technology for scientific applications
Jo et al. Constructing national geospatial big data platform: current status and future direction
Ediger et al. Real-time streaming intelligence: Integrating graph and nlp analytics
Yu et al. A police big data analytics platform: Framework and implications
CN113590651A (en) Cross-cluster data processing system and method based on HQL
Pan et al. An open sharing pattern design of massive power big data

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20181030

Termination date: 20200902
