CN105138661A - Hadoop-based k-means clustering analysis system and method of network security log - Google Patents

Info

Publication number
CN105138661A
CN105138661A (application CN201510553636.3A)
Authority
CN
China
Prior art keywords
data
daily record
hadoop
analysis
network security
Prior art date
Legal status
Granted
Application number
CN201510553636.3A
Other languages
Chinese (zh)
Other versions
CN105138661B (en)
Inventor
高岭
苏蓉
高妮
王帆
杨建锋
雷艳婷
申元
Current Assignee
Northwest University
Original Assignee
Northwest University
Priority date
Filing date
Publication date
Application filed by Northwest University
Priority to CN201510553636.3A
Publication of CN105138661A
Application granted
Publication of CN105138661B
Status: Expired - Fee Related

Classifications

    • G06F — ELECTRIC DIGITAL DATA PROCESSING (G — PHYSICS; G06 — COMPUTING; CALCULATING OR COUNTING)
    • G06F16/2457 — Query processing with adaptation to user needs
    • G06F16/182 — Distributed file systems
    • G06F16/25 — Integrating or interfacing systems involving database management systems

Abstract

The invention provides a Hadoop-based k-means clustering analysis system and method for network security logs. The system comprises a log data acquisition subsystem, a hybrid-storage log data management subsystem, and a log data analysis subsystem. In the data storage layer, a hybrid storage mechanism in which Hadoop cooperates with a traditional data warehouse stores the log data; the data access layer provides a Hive operation interface; the storage and computing layers receive instructions from the Hive engine and, with HDFS working together with MapReduce, achieve efficient query analysis of the data. For mining analysis of the log data, MapReduce runs a k-means algorithm to cluster the network security logs. The architecture in which Hadoop cooperates with the traditional data warehouse makes up for the defects of the traditional data warehouse in mass data processing, storage and the like, while making full use of the existing traditional data warehouse; clustering analysis with the MapReduce-based k-means algorithm allows timely security-grade evaluation and early warning on the log data.

Description

A Hadoop-based k-means cluster analysis system and method for network security logs
Technical field
The invention belongs to the technical field of computer information processing, and specifically relates to a Hadoop-based k-means cluster analysis system and method for network security logs.
Background art
With the explosion of data and the sharp increase in the amount of information, the traditional data warehouses that enterprises already own can hardly keep up with the growth rate of data. A traditional data warehouse is usually built on a high-performance integrated appliance, which is costly and poorly scalable, and it is only good at processing structured data; these properties limit a traditional warehouse's ability to mine the inherent value of massive heterogeneous data, and this is the biggest difference between Hadoop and traditional data processing. For the traditional data warehouse an enterprise already has, the goal is to make rational use of it while combining it with a big data platform, establishing a unified data analysis and data processing architecture, so that the cooperation of Hadoop with the traditional data warehouse realizes monitoring and statistical analysis of network logs.
Hadoop is an open-source distributed computing platform managed by the Apache organization; it is a software framework for distributed processing of massive data. With the Hadoop distributed file system HDFS and MapReduce at its core, Hadoop provides users with a distributed infrastructure whose low-level system details are transparent. The high fault tolerance, high scalability, high availability and high throughput of HDFS allow users to deploy Hadoop on cheap hardware to form a distributed system; the MapReduce distributed programming model lets users develop parallel applications without understanding the low-level details of the distributed system.
HDFS is the basis of data storage management in distributed computing; it was developed for streaming access to, and processing of, very large files. Its characteristic is fault-tolerant storage for massive data, which brings much convenience to applications processing very large data sets. HDFS has a master/slave architecture with two kinds of nodes: the NameNode, also called the "metadata node", and the DataNode, also called the "data node"; these two kinds of nodes act respectively as the Master and the Worker executing specific tasks. Owing to the nature of distributed storage, an HDFS cluster has one NameNode and multiple DataNodes. The metadata node manages the namespace of the file system; the data nodes are where file data is actually stored.
The MapReduce parallel computing framework is a parallel program execution system. It provides a parallel processing model comprising the two stages Map and Reduce, processes data in key-value-pair form, and automatically handles data partitioning and scheduling. During program execution, the MapReduce framework is responsible for scheduling and allocating computing resources, partitioning the input and output data, scheduling program execution, monitoring execution state, and synchronizing the computing nodes and collecting intermediate results while the program runs.
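The Map/Reduce key-value model described above can be sketched independently of the Hadoop API. The following is a minimal illustration, not the patent's code: counting log lines per priority is a hypothetical example, and the explicit grouping step stands in for the framework's shuffle phase.

```java
import java.util.*;
import java.util.stream.*;

public class MapReduceSketch {

    // Map stage: each input record is turned into a (key, value) pair.
    static Map.Entry<String, Integer> map(String logLine) {
        String priority = logLine.split("\\s+")[0]; // assume priority is the first token
        return Map.entry(priority, 1);
    }

    // Reduce stage: all values sharing one key are merged into a single result.
    static int reduce(List<Integer> counts) {
        return counts.stream().mapToInt(Integer::intValue).sum();
    }

    public static Map<String, Integer> run(List<String> lines) {
        // "Shuffle": group the mapped pairs by key, as the framework would do between stages.
        Map<String, List<Integer>> grouped = lines.stream()
                .map(MapReduceSketch::map)
                .collect(Collectors.groupingBy(Map.Entry::getKey,
                        Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
        Map<String, Integer> result = new TreeMap<>();
        grouped.forEach((k, v) -> result.put(k, reduce(v)));
        return result;
    }
}
```

On a real cluster the framework performs the grouping across nodes; here it is a local stream operation purely to show the data flow.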
Sqoop is a tool for fast batch data exchange between relational databases and the Hadoop platform. It can import batches of data from a relational database into HDFS or Hive, and conversely export data from the Hadoop platform into a relational database.
Hive is a data warehouse built on top of Hadoop for managing structured/semi-structured data stored in HDFS. It allows query and analysis programs to be written directly in HiveQL, an SQL-like query language serving as the programming interface; it provides the persistence architecture, storage management and query analysis functions required by a data warehouse; and at the implementation level it converts HiveQL statements into corresponding MapReduce programs for execution.
Summary of the invention
To overcome the above deficiencies of the prior art, the object of the present invention is to provide a Hadoop-based k-means cluster analysis system and method for network security logs, which integrates a big data platform on the basis of making rational use of an existing traditional data warehouse, establishing a unified data storage and data processing architecture and overcoming the shortcomings of the traditional data warehouse: poor scalability, being good only at processing structured data, and inability to mine the inherent value of massive heterogeneous data.
To achieve the above object, the technical solution adopted by the present invention is: a Hadoop-based k-means cluster analysis system for network security logs, comprising a log data acquisition subsystem, a log data hybrid-storage management subsystem, and a log data analysis subsystem;
the log data acquisition subsystem gathers the network security log data of all devices;
the log data hybrid-storage management subsystem manages and stores all log data;
the log data analysis subsystem performs fast query analysis and processing on all log data, and performs mining analysis on the potential value of the log data.
The log data acquisition subsystem runs under Linux: a Syslogd centralized log server is configured, syslog is used to collect and record device and system log data, and the log data is managed centrally.
The log data hybrid-storage management subsystem integrates the Hadoop platform with the traditional data warehouse, and comprises an HDFS distributed file system module and a Hadoop-platform/traditional-data-warehouse collaboration module.
The log data query and analysis subsystem mainly uses the tool Hive to perform simple statistical query analysis on the data: HiveQL query statements are written as required; under the HiveDriver, lexical analysis, syntax analysis, compilation, optimization and query-plan generation for the HiveQL statement are completed; the generated query plan is stored in HDFS and subsequently invoked for execution by MapReduce. For analysis of the potential information in the log data, its inherent value is mined by writing MapReduce programs that implement the corresponding algorithms.
The basic file access process of the HDFS distributed file system module is:
1) the application program sends a file name to the NameNode through the HDFS client program;
2) after the NameNode receives the file name, it looks up the data blocks corresponding to the file name in the HDFS directory, then finds the addresses of the DataNodes storing those blocks from the block information, and returns these addresses to the client;
3) after receiving the DataNode addresses, the client transfers data with those DataNodes concurrently, and at the same time submits the relevant operation log to the NameNode.
The Hadoop-platform/traditional-data-warehouse collaboration module configures a MySQL database as the metastore of Hive, for storing Hive schema (table structure) information and the like; the Sqoop tool realizes data transfer between the traditional data warehouse and the Hadoop platform.
Taking a Hadoop cluster as the platform and integrating the traditional data warehouse with the big data platform, a MySQL database is used as the Hive metastore for storing Hive schema (table structure) information, and the Sqoop tool is used to transfer data between the traditional data warehouse and the big data platform. The architecture comprises a data source layer, a data storage layer, a computing layer, a data analysis layer and a result presentation layer;
the data source layer collects the log data of all devices through the configured Syslogd centralized log server, and then imports the log data from the traditional data warehouse into the data storage layer via the Sqoop tool;
the data storage layer adopts the hybrid storage architecture in which Hadoop cooperates with the traditional data warehouse: the Sqoop data transfer tool imports the data into HDFS, the raw data is processed, and the processed data is imported into the corresponding Hive tables;
the result presentation layer is where the user issues requests to the data analysis layer;
the data analysis layer converts the user's request into the corresponding HiveQL statement and, driven by the HiveDriver, completes the executable operations;
the computing layer accepts instructions from the Hive engine and, through the HDFS of the data storage layer working with MapReduce, realizes the processing and analysis of the data, the results finally being returned to the result presentation layer.
A Hadoop-based k-means clustering method for network security logs, characterized in that it comprises the following steps:
1) log data preprocessing: converting the Syslog_incoming_mes file holding the text content of the log description field into a text vector file;
2) MapReduce-based implementation of the k-means algorithm: running the k-means clustering algorithm on the text vectors.
The log data preprocessing comprises the following steps:
1) removing function words: words without substantive meaning are removed from the log description text;
2) part-of-speech tagging: the system uses the english-left3words-distsim.tagger tagger to tag the words in each log description;
3) extracting content words: after tagging, the system extracts the nouns (NN, NNS, NNP, NNPS), verbs (VB, VBP, VBN, VBD) and adjectives (JJ, JJR, JJS); these words carry actual meaning and can express the log information accurately;
4) building a frequent dictionary: word frequencies are counted over all records; high-frequency words are representative of this description field, and words whose frequency exceeds a threshold are selected as keyword elements to form the frequent dictionary, which can express the log information effectively;
5) generating the text vector file: each log description field is compared against the frequent dictionary to obtain a key vector composed of a sequence of 0s and 1s, and the set of key vectors forms the text vector file.
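Steps 4) and 5) above can be sketched as follows. This is a minimal local illustration under assumptions: whitespace tokenization and the threshold value are placeholders, and the patent's pipeline additionally filters words by part of speech before counting.

```java
import java.util.*;

public class TextVectorSketch {

    // Step 4): count word frequencies over all log descriptions and keep
    // the words at or above the threshold as the frequent dictionary.
    static List<String> buildFrequentDictionary(List<String> descriptions, int threshold) {
        Map<String, Integer> freq = new TreeMap<>();
        for (String d : descriptions)
            for (String w : d.toLowerCase().split("\\s+"))
                freq.merge(w, 1, Integer::sum);
        List<String> dict = new ArrayList<>();
        for (Map.Entry<String, Integer> e : freq.entrySet())
            if (e.getValue() >= threshold) dict.add(e.getKey()); // keyword elements
        return dict;
    }

    // Step 5): one 0/1 key vector per description, 1 where the dictionary
    // word occurs in the description.
    static int[] toVector(String description, List<String> dict) {
        Set<String> words = new HashSet<>(Arrays.asList(description.toLowerCase().split("\\s+")));
        int[] v = new int[dict.size()];
        for (int i = 0; i < dict.size(); i++)
            v[i] = words.contains(dict.get(i)) ? 1 : 0;
        return v;
    }
}
```

The set of vectors produced by `toVector`, one per log record, corresponds to the text vector file that the k-means stage consumes.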
The MapReduce-based implementation of the k-means algorithm comprises the following steps:
1) scanning all points in the raw data set, and randomly selecting k points as the initial cluster centers;
2) each Map node reads its local data set and runs the k-means assignment to generate cluster sets; in the Reduce stage the cluster sets are merged to generate new global cluster centers, and this process is repeated until the termination condition is met;
3) partitioning all data elements into clusters according to the final cluster centers.
The advantages of the technical solution of the present invention are mainly reflected in:
1) Hadoop support is provided on the basis of the existing traditional data warehouse, establishing a unified data storage and data processing architecture; this makes up for the deficiency of the traditional data warehouse in mass data processing, storage and the like, while putting the original traditional data warehouse to best use.
2) As data grows, more cluster resources are needed to process it, and Hadoop is an easily extensible system: simply configuring a new node expands the cluster conveniently and raises its computing power.
3) For the heterogeneous data in massive network logs, MapReduce first processes the raw data, the processed data is imported into the corresponding Hive tables, HiveQL statements are then written on demand to perform simple query analysis on the data, and the k-means algorithm implemented with MapReduce performs mining analysis on the data. Query analysis efficiency is improved, and the potential value of the data is also mined.
Brief description of the drawings
Fig. 1 is a block diagram of the system architecture of the present invention.
Fig. 2 is the architecture diagram of the network log analysis system of the present invention in which Hadoop cooperates with the traditional data warehouse.
Fig. 3 is the research framework of the log data k-means clustering algorithm of the present invention.
Detailed description of the embodiments
The technical scheme of the present invention is described in detail below in conjunction with the embodiments and the accompanying drawings, but is not limited thereto.
Referring to Fig. 1, a Hadoop-based k-means cluster analysis system for network security logs comprises a log data acquisition subsystem 11, a log data hybrid-storage management subsystem 12, and a log data analysis subsystem 13;
the log data acquisition subsystem 11 gathers the network security log data of all devices;
the log data hybrid-storage management subsystem 12 manages and stores all log data;
the log data analysis subsystem 13 performs fast query analysis and processing on all log data, and performs mining analysis on the potential value of the log data.
The operation workflow of the system modules is as follows:
Step 1: log data acquisition: a Syslogd centralized log server is configured; using UDP as the transport protocol, the logs of all configured security devices are sent through the destination port to the log server on which the Syslog software is installed, and the SYSLOG server automatically receives the log data and writes it into the log files;
Step 2: Sqoop is used to import the table syslog_incoming of log information from MySQL into HDFS, with the command:
sqoop import --connect jdbc:mysql://219.245.31.39:3306/syslog --username sqoop --password sqoop --table syslog_incoming -m 1
Sqoop imports a table from MySQL with a MapReduce job that extracts records from the table row by row and writes them to HDFS. The NameNode in the cluster is responsible for the placement of the stored data and tells the client where to write; after obtaining the location information, the client starts writing data. While writing, the data is split into blocks and stored as multiple replicas placed on different DataNodes: the client first writes the data to the first node; while receiving the data, the first node pushes the data it has received on to the second node, the second pushes it to the third node, and so on;
Step 3: MapReduce programs are written to extract useful information from the log data imported into HDFS;
Step 4: Sqoop is used to generate a Hive table from the table in the extracted relational data source; the command directly generates the corresponding Hive table definition, and the data kept in HDFS is then loaded:
sqoop create-hive-table --connect jdbc:mysql://219.245.31.39:3306/syslog --table syslog_incoming --fields-terminated-by ','
Start Hive and load the data:
load data inpath 'syslog_incoming' into table syslog_incoming;
Step 5: according to business demand, the corresponding HiveQL statements or MapReduce programs are written to perform statistical analysis on the log data. The concrete steps of the statistical analysis are: the data table is partitioned according to business demand, the partitions being defined with the PARTITIONED BY clause when the table is created. Accordingly, the table records are defined as partitioned by priority and by time (year, quarter, month) as required; the following example defines the table records as partitioned by priority:
hive> create table syslog_incoming_priority (facility varchar, data date, host varchar)
    > partitioned by (priority varchar)
    > row format delimited
    > fields terminated by '\t'
    > stored as textfile;
After the table structure is defined, the data is loaded into the partition table:
hive> insert into table syslog_incoming_priority
    > partition (priority)
    > select facility, data, host, priority
    > from syslog_incoming;
At the file-system level, a partition is a nested subdirectory under the table directory; the table directory structure now contains one subdirectory per priority partition, and the data files are kept in the bottom-level directories. HiveQL query statements are written as required, and the cluster finally converts each query statement into MapReduce tasks and runs them.
Step 6: the query analysis results are imported into MySQL by Sqoop, and the foreground display interface presents them to the user in charts.
Referring to Fig. 3, a Hadoop-based k-means clustering method for network security logs comprises the following steps:
1) log data preprocessing: converting the Syslog_incoming_mes file holding the text content of the log description field into a text vector file;
2) MapReduce-based implementation of the k-means algorithm: running the k-means clustering algorithm on the text vectors.
The log data preprocessing comprises the following steps:
1) removing function words: words without substantive meaning are removed from the log description text;
2) part-of-speech tagging: the system uses the english-left3words-distsim.tagger tagger to tag the words in each log description;
3) extracting content words: after tagging, the system extracts the nouns (NN, NNS, NNP, NNPS), verbs (VB, VBP, VBN, VBD) and adjectives (JJ, JJR, JJS); these words carry actual meaning and can express the log information accurately;
4) building a frequent dictionary: word frequencies are counted over all records; high-frequency words are representative of this description field, and words whose frequency exceeds a threshold are selected as keyword elements to form the frequent dictionary, which can express the log information effectively;
5) generating the text vector file: each log description field is compared against the frequent dictionary to obtain a key vector composed of a sequence of 0s and 1s, and the set of key vectors forms the text vector file.
The MapReduce-based implementation of the k-means algorithm comprises the following steps:
Selecting the initial cluster centers. First the data structure of a cluster is provided; it saves the essential information of a cluster, such as the cluster id, the center coordinates and the number of points belonging to the cluster, and its type is defined as follows:
public class Cluster implements Writable {
    private int clusterID;      // cluster id
    private long numOfPoints;   // number of points belonging to this cluster
    private Instance center;    // information of the cluster center point
}
Then k points are drawn at random as the initial cluster centers. The extraction flow is: the cluster-center set is initialized to empty, and the whole data set is scanned; if the current size of the cluster-center set is less than k, the scanned point is added to the set, otherwise it replaces a point in the set with probability 1/(1+k). The cluster-center information produced by this step is written to the Cluster-0 directory; during the next round of iteration the files in this directory are shared as global information and added to the MapReduce distributed cache as globally shared data.
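The extraction flow just described can be sketched locally as follows, using the replacement probability 1/(1+k) stated above. The `double[]` point representation and the injected `Random` are assumptions for illustration; the patent runs this scan as part of a MapReduce job.

```java
import java.util.*;

public class InitialCenters {

    static List<double[]> select(List<double[]> points, int k, Random rnd) {
        List<double[]> centers = new ArrayList<>(); // cluster-center set, initially empty
        for (double[] p : points) {
            if (centers.size() < k) {
                centers.add(p);                     // fill the set up to k centers
            } else if (rnd.nextDouble() < 1.0 / (1 + k)) {
                centers.set(rnd.nextInt(k), p);     // replace a random current center
            }
        }
        return centers;
    }
}
```

After this pass the selected centers would be written out (the Cluster-0 directory in the patent) so every map node can read them in the next iteration.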
Iteratively computing the cluster centers. This stage requires several iterations; before each iteration starts, every map node first reads, in its setup() method, the cluster information produced in the previous iteration. The stage comprises the following steps:
1) reading in the initial cluster centers: the initial cluster-center data kept in the shared cache for all nodes to share is read in;
2) implementation of the map method: for each incoming data point, the map method finds the cluster center nearest to it and emits the cluster id as the key and the data point as the value, indicating that the point belongs to the cluster with that id;
3) implementation of the combiner: to reduce network transfer overhead, a combiner is used at the map end to merge the results the map end produces; this both reduces the data transferred from the map side and lightens the computation on the reduce side. The types of the key and value output by the Combiner must be identical to the types of the key and value output by map. In the reduce program, the provisional center of the points belonging to the same cluster is computed from their information; this is implemented by simple averaging, i.e. the points in a cluster are summed and divided by the number of points the cluster now contains;
4) implementation of the Reducer: the Reduce stage does roughly the same as the Combiner; it further merges the Combiner's output and emits the result.
The cluster-center computation step is repeated until the cluster centers obtained no longer change.
Partitioning the data according to the final cluster centers. After the final cluster centers are obtained, the whole data set is scanned and each data point is assigned to its nearest cluster center.
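One iteration of the map/combine/reduce cycle above can be sketched locally: `nearest` plays the role of the map method (key = cluster id) and `step` plays the role of the averaging done by the combiner/reducer. Squared Euclidean distance is an assumption; in the patent the centers are read from the distributed cache and the work is spread over the cluster.

```java
import java.util.*;

public class KMeansIteration {

    // Map step: index of the nearest cluster center for one point.
    static int nearest(double[] p, double[][] centers) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centers.length; c++) {
            double d = 0;
            for (int j = 0; j < p.length; j++) {
                double diff = p[j] - centers[c][j];
                d += diff * diff;
            }
            if (d < bestDist) { bestDist = d; best = c; }
        }
        return best;
    }

    // One full iteration: assign every point, then recompute each center as
    // the mean of its points (the simple averaging of the reduce program).
    static double[][] step(double[][] points, double[][] centers) {
        int k = centers.length, dim = points[0].length;
        double[][] sum = new double[k][dim];
        long[] count = new long[k];
        for (double[] p : points) {
            int c = nearest(p, centers);
            count[c]++;
            for (int j = 0; j < dim; j++) sum[c][j] += p[j];
        }
        double[][] next = new double[k][dim];
        for (int c = 0; c < k; c++)
            for (int j = 0; j < dim; j++)
                next[c][j] = count[c] == 0 ? centers[c][j] : sum[c][j] / count[c];
        return next;
    }
}
```

Calling `step` repeatedly until the returned centers stop changing corresponds to the termination condition stated above.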
Embodiment:
First the Hadoop distributed cluster environment is built from 5 PCs: one master server and four slave servers. Hadoop is configured on every machine, and Sqoop, Hive and MySQL are then installed and configured on the NameNode. This embodiment uses the log records of all security devices of the Li An electric supermarket in Shaanxi, a file of 16 GB. The logs are updated on a daily schedule as required, and the statistical query results are refreshed along with the business data.
The method realizes fast statistical queries through Hive. Its advantages are: the learning cost is low, and simple MapReduce statistics can be realized quickly with SQL-like statements, with no need to develop dedicated MapReduce applications, which suits the statistical analysis of a data warehouse very well. Partitioning speeds up queries over data slices and improves retrieval efficiency. The k-means algorithm is implemented with MapReduce, and security-grade assessment is performed on the output of the k-means clustering: for alarms from the same IP in which the proportion of high danger grades is large, a prompt warning is issued in time, so the potential value of the log data is mined.
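The per-IP security-grade assessment described above can be sketched as follows. This is a hypothetical illustration: the record shape (an IP paired with a high-danger flag derived from its cluster) and the 0.5 warning threshold are assumptions, not taken from the patent.

```java
import java.util.*;

public class GradeAssessment {

    // Flag every IP whose proportion of high-danger alarms exceeds the threshold.
    static List<String> flag(List<Map.Entry<String, Boolean>> alarms, double threshold) {
        Map<String, long[]> perIp = new TreeMap<>(); // per IP: {highDangerCount, total}
        for (Map.Entry<String, Boolean> a : alarms) {
            long[] c = perIp.computeIfAbsent(a.getKey(), ip -> new long[2]);
            if (a.getValue()) c[0]++;
            c[1]++;
        }
        List<String> flagged = new ArrayList<>();
        for (Map.Entry<String, long[]> e : perIp.entrySet())
            if ((double) e.getValue()[0] / e.getValue()[1] > threshold)
                flagged.add(e.getKey()); // issue a timely warning for this IP
        return flagged;
    }
}
```

In the system, the high-danger flag would come from the security grade assigned to the cluster each alarm falls into, and the flagged IPs would drive the early-warning display.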

Claims (10)

1. A Hadoop-based k-means cluster analysis system for network security logs, characterized in that it comprises a log data acquisition subsystem (11), a log data hybrid-storage management subsystem (12), and a log data analysis subsystem (13);
the log data acquisition subsystem (11) gathers the network security log data of all devices;
the log data hybrid-storage management subsystem (12) manages and stores all log data;
the log data analysis subsystem (13) performs fast query analysis and processing on all log data, and performs mining analysis on the potential value of the log data.
2. The Hadoop-based k-means cluster analysis system for network security logs according to claim 1, characterized in that the log data acquisition subsystem (11) runs under Linux: a Syslogd centralized log server is configured, syslog is used to collect and record device and system log data, and the log data is managed centrally.
3. The Hadoop-based k-means cluster analysis system for network security logs according to claim 1, characterized in that the log data hybrid-storage management subsystem (12) integrates the Hadoop platform with the traditional data warehouse, and comprises an HDFS distributed file system module and a Hadoop-platform/traditional-data-warehouse collaboration module.
4. The Hadoop-based k-means cluster analysis system for network security logs according to claim 1, characterized in that the log data query and analysis subsystem (13) mainly uses the tool Hive to perform simple statistical query analysis on the data: HiveQL query statements are written as required; under the HiveDriver, lexical analysis, syntax analysis, compilation, optimization and query-plan generation for the HiveQL statement are completed; the generated query plan is stored in HDFS and subsequently invoked for execution by MapReduce; for analysis of the potential information in the log data, its inherent value is mined by writing MapReduce programs that implement the corresponding algorithms.
5. The Hadoop-based k-means cluster analysis system for network security logs according to claim 3, characterized in that the basic file access process of the HDFS distributed file system module is:
1) the application program sends a file name to the NameNode through the HDFS client program;
2) after the NameNode receives the file name, it looks up the data blocks corresponding to the file name in the HDFS directory, then finds the addresses of the DataNodes storing those blocks from the block information, and returns these addresses to the client;
3) after receiving the DataNode addresses, the client transfers data with those DataNodes concurrently, and at the same time submits the relevant operation log to the NameNode.
6. The Hadoop-based k-means cluster analysis system for network security logs according to claim 3, characterized in that the Hadoop-platform/traditional-data-warehouse collaboration module configures a MySQL database as the metastore of Hive, for storing Hive schema (table structure) information and the like, and the Sqoop tool realizes data transfer between the traditional data warehouse and the Hadoop platform.
7. A Hadoop-based k-means cluster analysis system for network security logs, characterized in that, taking a Hadoop cluster as the platform and integrating the traditional data warehouse with the big data platform, a MySQL database is used as the Hive metastore for storing Hive schema (table structure) information, and the Sqoop tool is used to transfer data between the traditional data warehouse and the big data platform; the system comprises a data source layer (21), a data storage layer (22), a computing layer (23), a data analysis layer (24) and a result presentation layer (25);
the data source layer (21) collects the log data of all devices through a configured Syslogd centralized log server, and then imports the log data from the traditional data warehouse into the data storage layer via the Sqoop tool;
the data storage layer (22) adopts the hybrid storage architecture in which Hadoop cooperates with the traditional data warehouse: the Sqoop data transfer tool imports the data into HDFS, the raw data is processed, and the processed data is imported into the corresponding Hive tables;
the result presentation layer (25) is where the user issues requests to the data analysis layer;
the data analysis layer (24) converts the user's request into the corresponding HiveQL statement and, driven by the HiveDriver, completes the executable operations;
the computing layer (23) accepts instructions from the Hive engine and, through the HDFS of the data storage layer working with MapReduce, realizes the processing and analysis of the data, the results finally being returned to the result presentation layer.
8., based on a network security daily record k-means clustering method of Hadoop, it is characterized in that, comprise the following steps:
Daily record data pre-service, the Syslog_incoming_mes file of the content of text of conversion log descriptor is a text vector file;
Based on the realization of the k-means algorithm of MapReduce, text vector runs k-means clustering algorithm.
9. The Hadoop-based network security log k-means clustering method according to claim 8, characterized in that the log data preprocessing comprises the following steps:
1) Remove stop words: delete words without substantive meaning from the log description text;
2) Tag parts of speech: the system uses the english-left3words-distsim.tagger model (a Stanford POS tagger model) to tag every word in each log description;
3) Extract content words: after tagging, the system keeps the nouns (NN, NNS, NNP, NNPS), verbs (VB, VBP, VBN, VBD) and adjectives (JJ, JJR, JJS); these words carry the actual meaning and can accurately express the log information;
4) Build the frequent dictionary: count word frequencies over all records; high-frequency words are representative of the description field, so words whose frequency exceeds a threshold are selected as keyword elements to form the frequent dictionary, which can express the log information effectively;
5) Generate the text vector file: compare each log description field against the frequent dictionary to obtain a key vector composed of a sequence of 0s and 1s; the set of all key vectors forms the text vector file.
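The five preprocessing steps above can be sketched in plain Python. This is a minimal illustration only: the tiny stop-word list, the frequency threshold, and the one-line tagger stub are assumptions; the patent uses the Stanford english-left3words-distsim.tagger model for real part-of-speech tagging.

```python
from collections import Counter

# Step 1): a toy stop-word list (assumption; a real system uses a fuller one).
STOP_WORDS = {"the", "a", "an", "of", "to", "on", "is", "was", "by"}
# Step 3): the POS tags the patent keeps (nouns, verbs, adjectives).
CONTENT_TAGS = {"NN", "NNS", "NNP", "NNPS", "VB", "VBP", "VBN", "VBD",
                "JJ", "JJR", "JJS"}

def tag(word):
    # Stub standing in for the Stanford tagger (step 2): every surviving
    # word is treated as a noun here, purely to keep the sketch runnable.
    return "NN"

def content_words(description):
    words = [w.lower() for w in description.split()
             if w.lower() not in STOP_WORDS]            # step 1)
    return [w for w in words if tag(w) in CONTENT_TAGS]  # steps 2)-3)

def frequent_dictionary(descriptions, threshold=2):
    # Step 4): keep words whose frequency reaches the threshold,
    # in a fixed (sorted) order so vector positions are stable.
    counts = Counter(w for d in descriptions for w in content_words(d))
    return sorted(w for w, c in counts.items() if c >= threshold)

def text_vector(description, dictionary):
    # Step 5): one 0/1 entry per dictionary word.
    present = set(content_words(description))
    return [1 if w in present else 0 for w in dictionary]

logs = [
    "Failed password for root from host",
    "Failed password for admin from host",
    "Accepted password for root",
]
dic = frequent_dictionary(logs)
vectors = [text_vector(d, dic) for d in logs]
```

Each row of `vectors` is one key vector; together they form the text vector file on which the MapReduce k-means of claim 10 would run.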
10. The Hadoop-based network security log k-means clustering method according to claim 8, characterized in that the MapReduce-based implementation of the k-means algorithm comprises the following steps:
1) Scan all points in the raw data set and randomly select k points as the initial cluster centers;
2) Each Map node reads its local data set and runs the k-means algorithm on it to generate cluster sets; in the Reduce stage, the cluster sets are merged to generate new global cluster centers; this process is repeated until the termination condition is met;
3) Partition all data elements into clusters according to the final cluster centers.
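The three steps above can be sketched as a single-process imitation of the MapReduce dataflow: the "map" stage assigns each point to its nearest center, the "reduce" stage merges the cluster sets into new global centers, and the loop repeats until the centers stop moving. The sample points are invented; a real deployment would run the map and reduce stages as Hadoop jobs over HDFS.

```python
import random

def nearest(point, centers):
    # Index of the center with the smallest squared Euclidean distance.
    return min(range(len(centers)),
               key=lambda i: sum((p - c) ** 2
                                 for p, c in zip(point, centers[i])))

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # step 1): random initial centers
    for _ in range(max_iter):
        # "Map" stage: group each point under its nearest center.
        partial = {i: [] for i in range(k)}
        for p in points:
            partial[nearest(p, centers)].append(p)
        # "Reduce" stage: merge cluster sets into new global centers
        # (componentwise mean; empty clusters keep their old center).
        new_centers = [
            tuple(sum(d) / len(pts) for d in zip(*pts)) if pts else centers[i]
            for i, pts in partial.items()
        ]
        if new_centers == centers:           # termination condition
            break
        centers = new_centers
    # Step 3): partition all data elements by the final centers.
    return centers, [nearest(p, centers) for p in points]

points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
centers, labels = kmeans(points, k=2)
```

On this toy data the two tight groups of points end up in separate clusters regardless of which two points are drawn as initial centers.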
CN201510553636.3A 2015-09-02 2015-09-02 A kind of network security daily record k-means cluster analysis systems and method based on Hadoop Expired - Fee Related CN105138661B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510553636.3A CN105138661B (en) 2015-09-02 2015-09-02 A kind of network security daily record k-means cluster analysis systems and method based on Hadoop


Publications (2)

Publication Number Publication Date
CN105138661A true CN105138661A (en) 2015-12-09
CN105138661B CN105138661B (en) 2018-10-30

Family

ID=54724008

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510553636.3A Expired - Fee Related CN105138661B (en) 2015-09-02 2015-09-02 A kind of network security daily record k-means cluster analysis systems and method based on Hadoop

Country Status (1)

Country Link
CN (1) CN105138661B (en)

Cited By (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105608203A (en) * 2015-12-24 2016-05-25 Tcl集团股份有限公司 Internet of things log processing method and device based on Hadoop platform
CN105824892A (en) * 2016-03-11 2016-08-03 广东电网有限责任公司电力科学研究院 Method for synchronizing and processing data by data pool
CN106168965A (en) * 2016-07-01 2016-11-30 竹间智能科技(上海)有限公司 Knowledge mapping constructing system
CN106919555A (en) * 2015-12-28 2017-07-04 国际商业机器公司 The system and method that the field of the data for being included in log stream is extracted
CN107579944A (en) * 2016-07-05 2018-01-12 南京联成科技发展股份有限公司 Based on artificial intelligence and MapReduce security attack Forecasting Methodologies
CN107943647A (en) * 2017-11-21 2018-04-20 北京小度互娱科技有限公司 A kind of reliable distributed information log collection method and system
CN107958022A (en) * 2017-11-06 2018-04-24 余帝乾 A kind of method that Web log excavates
CN108133043A (en) * 2018-01-12 2018-06-08 福建星瑞格软件有限公司 A kind of server running log structured storage method based on big data
CN108446568A (en) * 2018-03-19 2018-08-24 西北大学 A kind of histogram data dissemination method going trend analysis difference secret protection
CN108933785A (en) * 2018-06-29 2018-12-04 平安科技(深圳)有限公司 Network risks monitoring method, device, computer equipment and storage medium
CN109254903A (en) * 2018-08-03 2019-01-22 挖财网络技术有限公司 A kind of intelligentized log analysis method and device
CN109446042A (en) * 2018-10-12 2019-03-08 安徽南瑞中天电力电子有限公司 A kind of blog management method and system for intelligent power equipment
CN109525593A (en) * 2018-12-20 2019-03-26 中科曙光国际信息产业有限公司 A kind of pair of hadoop big data platform concentrates security management and control system and method
CN109766368A (en) * 2018-11-14 2019-05-17 国云科技股份有限公司 A kind of data query polymorphic type view output system and method based on Hive
CN110069551A (en) * 2019-04-25 2019-07-30 江南大学 Medical Devices O&M information excavating analysis system and its application method based on Spark
CN110135184A (en) * 2018-02-09 2019-08-16 中兴通讯股份有限公司 A kind of method, apparatus, equipment and the storage medium of static data desensitization
CN110378550A (en) * 2019-06-03 2019-10-25 东南大学 The processing method of the extensive food data of multi-source based on distributed structure/architecture
CN110581873A (en) * 2018-06-11 2019-12-17 中国移动通信集团浙江有限公司 cross-cluster redirection method and monitoring server
CN111259068A (en) * 2020-04-28 2020-06-09 成都四方伟业软件股份有限公司 Data development method and system based on data warehouse
CN111913954A (en) * 2020-06-20 2020-11-10 杭州城市大数据运营有限公司 Intelligent data standard catalog generation method and device
CN111966293A (en) * 2020-08-18 2020-11-20 北京明略昭辉科技有限公司 Cold and hot data analysis method and system
CN112306787A (en) * 2019-07-24 2021-02-02 阿里巴巴集团控股有限公司 Error log processing method and device, electronic equipment and intelligent sound box
CN112380348A (en) * 2020-11-25 2021-02-19 中信百信银行股份有限公司 Metadata processing method and device, electronic equipment and computer-readable storage medium
CN113220760A (en) * 2021-04-28 2021-08-06 北京达佳互联信息技术有限公司 Data processing method, device, server and storage medium
CN113238912A (en) * 2021-05-08 2021-08-10 国家计算机网络与信息安全管理中心 Aggregation processing method for network security log data
CN113641750A (en) * 2021-08-20 2021-11-12 广东云药科技有限公司 Enterprise big data analysis platform
CN116737846A (en) * 2023-05-31 2023-09-12 深圳华夏凯词财富管理有限公司 Asset management data safety protection warehouse system based on Hive

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106294580A (en) * 2016-07-28 2017-01-04 武汉虹信技术服务有限责任公司 LTE network MR data analysing method based on HADOOP platform

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102999633A (en) * 2012-12-18 2013-03-27 北京师范大学珠海分校 Cloud cluster extraction method of network information
CN103399887A (en) * 2013-07-19 2013-11-20 蓝盾信息安全技术股份有限公司 Query and statistical analysis system for mass logs
CN103425762A (en) * 2013-08-05 2013-12-04 南京邮电大学 Telecom operator mass data processing method based on Hadoop platform
CN103544328A (en) * 2013-11-15 2014-01-29 南京大学 Parallel k mean value clustering method based on Hadoop
CN104616205A (en) * 2014-11-24 2015-05-13 北京科东电力控制系统有限责任公司 Distributed log analysis based operation state monitoring method of power system


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
FU Wei et al., "Design of a Web Log Analysis Scheme Based on Hadoop and K-means", Proceedings of the 19th National Youth Communication Academic Conference *
WANG Shuai et al., "Application of Big Data Technology in Network Security Analysis", Telecommunications Science *

Cited By (42)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105608203B (en) * 2015-12-24 2019-09-17 Tcl集团股份有限公司 A kind of Internet of Things log processing method and device based on Hadoop platform
CN105608203A (en) * 2015-12-24 2016-05-25 Tcl集团股份有限公司 Internet of things log processing method and device based on Hadoop platform
CN106919555A (en) * 2015-12-28 2017-07-04 国际商业机器公司 The system and method that the field of the data for being included in log stream is extracted
CN106919555B (en) * 2015-12-28 2020-04-24 国际商业机器公司 System and method for field extraction of data contained within a log stream
CN105824892A (en) * 2016-03-11 2016-08-03 广东电网有限责任公司电力科学研究院 Method for synchronizing and processing data by data pool
CN106168965A (en) * 2016-07-01 2016-11-30 竹间智能科技(上海)有限公司 Knowledge mapping constructing system
CN106168965B (en) * 2016-07-01 2020-06-30 竹间智能科技(上海)有限公司 Knowledge graph construction system
CN107579944A (en) * 2016-07-05 2018-01-12 南京联成科技发展股份有限公司 Based on artificial intelligence and MapReduce security attack Forecasting Methodologies
CN107579944B (en) * 2016-07-05 2020-08-11 南京联成科技发展股份有限公司 Artificial intelligence and MapReduce-based security attack prediction method
CN107958022A (en) * 2017-11-06 2018-04-24 余帝乾 A kind of method that Web log excavates
CN107943647A (en) * 2017-11-21 2018-04-20 北京小度互娱科技有限公司 A kind of reliable distributed information log collection method and system
CN108133043A (en) * 2018-01-12 2018-06-08 福建星瑞格软件有限公司 A kind of server running log structured storage method based on big data
CN110135184A (en) * 2018-02-09 2019-08-16 中兴通讯股份有限公司 A kind of method, apparatus, equipment and the storage medium of static data desensitization
CN110135184B (en) * 2018-02-09 2023-12-22 中兴通讯股份有限公司 Method, device, equipment and storage medium for desensitizing static data
CN108446568A (en) * 2018-03-19 2018-08-24 西北大学 A kind of histogram data dissemination method going trend analysis difference secret protection
CN108446568B (en) * 2018-03-19 2021-04-13 西北大学 Histogram data publishing method for trend analysis differential privacy protection
CN110581873B (en) * 2018-06-11 2022-06-14 中国移动通信集团浙江有限公司 Cross-cluster redirection method and monitoring server
CN110581873A (en) * 2018-06-11 2019-12-17 中国移动通信集团浙江有限公司 cross-cluster redirection method and monitoring server
CN108933785B (en) * 2018-06-29 2021-02-05 平安科技(深圳)有限公司 Network risk monitoring method and device, computer equipment and storage medium
CN108933785A (en) * 2018-06-29 2018-12-04 平安科技(深圳)有限公司 Network risks monitoring method, device, computer equipment and storage medium
CN109254903A (en) * 2018-08-03 2019-01-22 挖财网络技术有限公司 A kind of intelligentized log analysis method and device
CN109446042A (en) * 2018-10-12 2019-03-08 安徽南瑞中天电力电子有限公司 A kind of blog management method and system for intelligent power equipment
CN109446042B (en) * 2018-10-12 2021-12-14 安徽南瑞中天电力电子有限公司 Log management method and system for intelligent electric equipment
CN109766368B (en) * 2018-11-14 2021-08-27 国云科技股份有限公司 Hive-based data query multi-type view output system and method
CN109766368A (en) * 2018-11-14 2019-05-17 国云科技股份有限公司 A kind of data query polymorphic type view output system and method based on Hive
CN109525593A (en) * 2018-12-20 2019-03-26 中科曙光国际信息产业有限公司 A kind of pair of hadoop big data platform concentrates security management and control system and method
CN109525593B (en) * 2018-12-20 2022-02-22 中科曙光国际信息产业有限公司 Centralized safety management and control system and method for hadoop big data platform
CN110069551A (en) * 2019-04-25 2019-07-30 江南大学 Medical Devices O&M information excavating analysis system and its application method based on Spark
CN110378550A (en) * 2019-06-03 2019-10-25 东南大学 The processing method of the extensive food data of multi-source based on distributed structure/architecture
CN112306787B (en) * 2019-07-24 2022-08-09 阿里巴巴集团控股有限公司 Error log processing method and device, electronic equipment and intelligent sound box
CN112306787A (en) * 2019-07-24 2021-02-02 阿里巴巴集团控股有限公司 Error log processing method and device, electronic equipment and intelligent sound box
CN111259068A (en) * 2020-04-28 2020-06-09 成都四方伟业软件股份有限公司 Data development method and system based on data warehouse
CN111913954B (en) * 2020-06-20 2023-08-04 杭州城市大数据运营有限公司 Intelligent data standard catalog generation method and device
CN111913954A (en) * 2020-06-20 2020-11-10 杭州城市大数据运营有限公司 Intelligent data standard catalog generation method and device
CN111966293A (en) * 2020-08-18 2020-11-20 北京明略昭辉科技有限公司 Cold and hot data analysis method and system
CN112380348A (en) * 2020-11-25 2021-02-19 中信百信银行股份有限公司 Metadata processing method and device, electronic equipment and computer-readable storage medium
CN112380348B (en) * 2020-11-25 2024-03-26 中信百信银行股份有限公司 Metadata processing method, apparatus, electronic device and computer readable storage medium
CN113220760A (en) * 2021-04-28 2021-08-06 北京达佳互联信息技术有限公司 Data processing method, device, server and storage medium
CN113238912A (en) * 2021-05-08 2021-08-10 国家计算机网络与信息安全管理中心 Aggregation processing method for network security log data
CN113238912B (en) * 2021-05-08 2022-12-06 国家计算机网络与信息安全管理中心 Aggregation processing method for network security log data
CN113641750A (en) * 2021-08-20 2021-11-12 广东云药科技有限公司 Enterprise big data analysis platform
CN116737846A (en) * 2023-05-31 2023-09-12 深圳华夏凯词财富管理有限公司 Asset management data safety protection warehouse system based on Hive

Also Published As

Publication number Publication date
CN105138661B (en) 2018-10-30

Similar Documents

Publication Publication Date Title
CN105138661A (en) Hadoop-based k-means clustering analysis system and method of network security log
CN107291807B (en) SPARQL query optimization method based on graph traversal
Das et al. Big data analytics: A framework for unstructured data analysis
CN110674154B (en) Spark-based method for inserting, updating and deleting data in Hive
CN106446153A (en) Distributed newSQL database system and method
CN110633186A (en) Log monitoring system for electric power metering micro-service architecture and implementation method
CN104102710A (en) Massive data query method
WO2018036324A1 (en) Smart city information sharing method and device
Mohammed et al. A review of big data environment and its related technologies
CN102917009B (en) A kind of stock certificate data collection based on cloud computing technology and storage means and system
CN104239377A (en) Platform-crossing data retrieval method and device
CN103399894A (en) Distributed transaction processing method on basis of shared storage pool
CN106569896A (en) Data distribution and parallel processing method and system
Li et al. The overview of big data storage and management
Das et al. A study on big data integration with data warehouse
Ding et al. ComMapReduce: An improvement of MapReduce with lightweight communication mechanisms
CN104516985A (en) Rapid mass data importing method based on HBase database
Arputhamary et al. A review on big data integration
Chen et al. Related technologies
Suciu et al. Big data technology for scientific applications
Suguna et al. Improvement of Hadoop ecosystem and their pros and cons in Big data
CN113590651A (en) Cross-cluster data processing system and method based on HQL
Pan et al. An open sharing pattern design of massive power big data
Zhang et al. The research and design of SQL processing in a data-mining system based on MapReduce
Yang et al. AstroServ: A distributed database for serving large-scale full life-cycle astronomical data

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20181030

Termination date: 20200902
