CN105138661A - Hadoop-based k-means clustering analysis system and method of network security log - Google Patents

Info

Publication number
CN105138661A
CN105138661A (application CN201510553636.3A)
Authority
CN
China
Prior art keywords
data
daily record
hadoop
analysis
network security
Prior art date
Legal status
Granted
Application number
CN201510553636.3A
Other languages
Chinese (zh)
Other versions
CN105138661B (en)
Inventor
高岭
苏蓉
高妮
王帆
杨建锋
雷艳婷
申元
Current Assignee
Northwest University
Original Assignee
Northwest University
Priority date
Filing date
Publication date
Application filed by Northwest University
Priority to CN201510553636.3A
Publication of CN105138661A
Application granted
Publication of CN105138661B
Status: Expired - Fee Related

Classifications

    • G06F — ELECTRIC DIGITAL DATA PROCESSING (G — PHYSICS; G06 — COMPUTING; CALCULATING OR COUNTING)
    • G06F16/2457 — Query processing with adaptation to user needs
    • G06F16/182 — Distributed file systems
    • G06F16/25 — Integrating or interfacing systems involving database management systems

Abstract

The invention provides a Hadoop-based k-means clustering analysis system and method for network security logs. The system comprises a log data acquisition subsystem, a hybrid-storage log data management subsystem, and a log data analysis subsystem. In the data storage layer, a hybrid storage mechanism in which Hadoop cooperates with a traditional data warehouse stores the log data; the data access layer provides a Hive operation interface; the storage and computing layers receive instructions from the Hive engine and, with HDFS working together with MapReduce, achieve efficient query analysis of the data. For mining analysis of the log data, MapReduce runs a k-means algorithm to cluster the network security logs. The architecture in which Hadoop cooperates with the traditional data warehouse makes up for the defects of the traditional data warehouse in mass data processing, storage and the like, while making full use of the existing traditional data warehouse; clustering analysis with the MapReduce-based k-means algorithm allows timely security-grade evaluation and early warning on the log data.

Description

A Hadoop-based k-means cluster analysis system and method for network security logs
Technical field
The invention belongs to the technical field of computer information processing, and specifically relates to a Hadoop-based k-means cluster analysis system and method for network security logs.
Background art
With the explosion of data and the sharp increase in the amount of information, the traditional data warehouses that enterprises already own can hardly keep up with the growth rate of data. A traditional data warehouse is usually built on a high-performance integrated appliance, which is costly and poorly scalable, and it is only good at processing structured data; these properties limit a traditional warehouse's ability to mine the inherent value of massive heterogeneous data, and this is the biggest difference between Hadoop and traditional data processing. For the traditional data warehouse an enterprise already has, the goal is to make rational use of it while combining it with a big data platform, establishing a unified data analysis and data processing architecture, so that the cooperation of Hadoop with the traditional data warehouse realizes monitoring and statistical analysis of network logs.
Hadoop is an open-source distributed computing platform managed by the Apache organization; it is a software framework for distributed processing of massive data. With the Hadoop distributed file system HDFS and MapReduce at its core, Hadoop provides users with a distributed infrastructure whose low-level system details are transparent. The high fault tolerance, high scalability, high availability and high throughput of HDFS allow users to deploy Hadoop on cheap hardware to form a distributed system; the MapReduce distributed programming model lets users develop parallel applications without understanding the low-level details of the distributed system.
HDFS is the basis of data storage management in distributed computing; it was developed for streaming access to, and processing of, very large files. Its characteristic is fault-tolerant storage for massive data, which brings much convenience to applications processing very large data sets. HDFS has a master/slave architecture with two kinds of nodes: the NameNode, also called the "metadata node", and the DataNode, also called the "data node"; these two kinds of nodes act respectively as the Master and the Worker executing specific tasks. Owing to the nature of distributed storage, an HDFS cluster has one NameNode and multiple DataNodes. The metadata node manages the namespace of the file system; the data nodes are where file data is actually stored.
The MapReduce parallel computing framework is a parallel program execution system. It provides a parallel processing model comprising the two stages Map and Reduce, processes data in key-value-pair form, and automatically handles data partitioning and scheduling. During program execution, the MapReduce framework is responsible for scheduling and allocating computing resources, partitioning the input and output data, scheduling program execution, monitoring execution state, and synchronizing the computing nodes and collecting intermediate results while the program runs.
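The Map/Reduce key-value model described above can be sketched independently of the Hadoop API. The following is a minimal illustration, not the patent's code: counting log lines per priority is a hypothetical example, and the explicit grouping step stands in for the framework's shuffle phase.

```java
import java.util.*;
import java.util.stream.*;

public class MapReduceSketch {

    // Map stage: each input record is turned into a (key, value) pair.
    static Map.Entry<String, Integer> map(String logLine) {
        String priority = logLine.split("\\s+")[0]; // assume priority is the first token
        return Map.entry(priority, 1);
    }

    // Reduce stage: all values sharing one key are merged into a single result.
    static int reduce(List<Integer> counts) {
        return counts.stream().mapToInt(Integer::intValue).sum();
    }

    public static Map<String, Integer> run(List<String> lines) {
        // "Shuffle": group the mapped pairs by key, as the framework would do between stages.
        Map<String, List<Integer>> grouped = lines.stream()
                .map(MapReduceSketch::map)
                .collect(Collectors.groupingBy(Map.Entry::getKey,
                        Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
        Map<String, Integer> result = new TreeMap<>();
        grouped.forEach((k, v) -> result.put(k, reduce(v)));
        return result;
    }
}
```

On a real cluster the framework performs the grouping across nodes; here it is a local stream operation purely to show the data flow.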
Sqoop is a tool for fast batch data exchange between relational databases and the Hadoop platform. It can import batches of data from a relational database into HDFS or Hive, and conversely export data from the Hadoop platform into a relational database.
Hive is a data warehouse built on top of Hadoop for managing structured/semi-structured data stored in HDFS. It allows query and analysis programs to be written directly in HiveQL, an SQL-like query language serving as the programming interface; it provides the persistence architecture, storage management and query analysis functions required by a data warehouse; and at the implementation level it converts HiveQL statements into corresponding MapReduce programs for execution.
Summary of the invention
To overcome the above deficiencies of the prior art, the object of the present invention is to provide a Hadoop-based k-means cluster analysis system and method for network security logs, which integrates a big data platform on the basis of making rational use of an existing traditional data warehouse, establishing a unified data storage and data processing architecture and overcoming the shortcomings of the traditional data warehouse: poor scalability, being good only at processing structured data, and inability to mine the inherent value of massive heterogeneous data.
To achieve the above object, the technical solution adopted by the present invention is: a Hadoop-based k-means cluster analysis system for network security logs, comprising a log data acquisition subsystem, a log data hybrid-storage management subsystem, and a log data analysis subsystem;
the log data acquisition subsystem gathers the network security log data of all devices;
the log data hybrid-storage management subsystem manages and stores all log data;
the log data analysis subsystem performs fast query analysis and processing on all log data, and performs mining analysis on the potential value of the log data.
The log data acquisition subsystem runs under Linux: a Syslogd centralized log server is configured, syslog is used to collect and record device and system log data, and the log data is managed centrally.
The log data hybrid-storage management subsystem integrates the Hadoop platform with the traditional data warehouse, and comprises an HDFS distributed file system module and a Hadoop-platform/traditional-data-warehouse collaboration module.
The log data query and analysis subsystem mainly uses the tool Hive to perform simple statistical query analysis on the data: HiveQL query statements are written as required; under the HiveDriver, lexical analysis, syntax analysis, compilation, optimization and query-plan generation for the HiveQL statement are completed; the generated query plan is stored in HDFS and subsequently invoked for execution by MapReduce. For analysis of the potential information in the log data, its inherent value is mined by writing MapReduce programs that implement the corresponding algorithms.
The basic file access process of the HDFS distributed file system module is:
1) the application program sends a file name to the NameNode through the HDFS client program;
2) after the NameNode receives the file name, it looks up the data blocks corresponding to the file name in the HDFS directory, then finds the addresses of the DataNodes storing those blocks from the block information, and returns these addresses to the client;
3) after receiving the DataNode addresses, the client transfers data with those DataNodes concurrently, and at the same time submits the relevant operation log to the NameNode.
The Hadoop-platform/traditional-data-warehouse collaboration module configures a MySQL database as the metastore of Hive, for storing Hive schema (table structure) information and the like; the Sqoop tool realizes data transfer between the traditional data warehouse and the Hadoop platform.
Taking a Hadoop cluster as the platform and integrating the traditional data warehouse with the big data platform, a MySQL database is used as the Hive metastore for storing Hive schema (table structure) information, and the Sqoop tool is used to transfer data between the traditional data warehouse and the big data platform. The architecture comprises a data source layer, a data storage layer, a computing layer, a data analysis layer and a result presentation layer;
the data source layer collects the log data of all devices through the configured Syslogd centralized log server, and then imports the log data from the traditional data warehouse into the data storage layer via the Sqoop tool;
the data storage layer adopts the hybrid storage architecture in which Hadoop cooperates with the traditional data warehouse: the Sqoop data transfer tool imports the data into HDFS, the raw data is processed, and the processed data is imported into the corresponding Hive tables;
the result presentation layer is where the user issues requests to the data analysis layer;
the data analysis layer converts the user's request into the corresponding HiveQL statement and, driven by the HiveDriver, completes the executable operations;
the computing layer accepts instructions from the Hive engine and, through the HDFS of the data storage layer working with MapReduce, realizes the processing and analysis of the data, the results finally being returned to the result presentation layer.
A Hadoop-based k-means clustering method for network security logs, characterized in that it comprises the following steps:
1) log data preprocessing: converting the Syslog_incoming_mes file holding the text content of the log description field into a text vector file;
2) MapReduce-based implementation of the k-means algorithm: running the k-means clustering algorithm on the text vectors.
The log data preprocessing comprises the following steps:
1) removing function words: words without substantive meaning are removed from the log description text;
2) part-of-speech tagging: the system uses the english-left3words-distsim.tagger tagger to tag the words in each log description;
3) extracting content words: after tagging, the system extracts the nouns (NN, NNS, NNP, NNPS), verbs (VB, VBP, VBN, VBD) and adjectives (JJ, JJR, JJS); these words carry actual meaning and can express the log information accurately;
4) building a frequent dictionary: word frequencies are counted over all records; high-frequency words are representative of this description field, and words whose frequency exceeds a threshold are selected as keyword elements to form the frequent dictionary, which can express the log information effectively;
5) generating the text vector file: each log description field is compared against the frequent dictionary to obtain a key vector composed of a sequence of 0s and 1s, and the set of key vectors forms the text vector file.
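Steps 4) and 5) above can be sketched as follows. This is a minimal local illustration under assumptions: whitespace tokenization and the threshold value are placeholders, and the patent's pipeline additionally filters words by part of speech before counting.

```java
import java.util.*;

public class TextVectorSketch {

    // Step 4): count word frequencies over all log descriptions and keep
    // the words at or above the threshold as the frequent dictionary.
    static List<String> buildFrequentDictionary(List<String> descriptions, int threshold) {
        Map<String, Integer> freq = new TreeMap<>();
        for (String d : descriptions)
            for (String w : d.toLowerCase().split("\\s+"))
                freq.merge(w, 1, Integer::sum);
        List<String> dict = new ArrayList<>();
        for (Map.Entry<String, Integer> e : freq.entrySet())
            if (e.getValue() >= threshold) dict.add(e.getKey()); // keyword elements
        return dict;
    }

    // Step 5): one 0/1 key vector per description, 1 where the dictionary
    // word occurs in the description.
    static int[] toVector(String description, List<String> dict) {
        Set<String> words = new HashSet<>(Arrays.asList(description.toLowerCase().split("\\s+")));
        int[] v = new int[dict.size()];
        for (int i = 0; i < dict.size(); i++)
            v[i] = words.contains(dict.get(i)) ? 1 : 0;
        return v;
    }
}
```

The set of vectors produced by `toVector`, one per log record, corresponds to the text vector file that the k-means stage consumes.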
The MapReduce-based implementation of the k-means algorithm comprises the following steps:
1) scanning all points in the raw data set, and randomly selecting k points as the initial cluster centers;
2) each Map node reads its local data set and runs the k-means assignment to generate cluster sets; in the Reduce stage the cluster sets are merged to generate new global cluster centers, and this process is repeated until the termination condition is met;
3) partitioning all data elements into clusters according to the final cluster centers.
The advantages of the technical solution of the present invention are mainly reflected in:
1) Hadoop support is provided on the basis of the existing traditional data warehouse, establishing a unified data storage and data processing architecture; this makes up for the deficiency of the traditional data warehouse in mass data processing, storage and the like, while putting the original traditional data warehouse to best use.
2) As data grows, more cluster resources are needed to process it, and Hadoop is an easily extensible system: simply configuring a new node expands the cluster conveniently and raises its computing power.
3) For the heterogeneous data in massive network logs, MapReduce first processes the raw data, the processed data is imported into the corresponding Hive tables, HiveQL statements are then written on demand to perform simple query analysis on the data, and the k-means algorithm implemented with MapReduce performs mining analysis on the data. Query analysis efficiency is improved, and the potential value of the data is also mined.
Brief description of the drawings
Fig. 1 is a block diagram of the system architecture of the present invention.
Fig. 2 is the architecture diagram of the network log analysis system of the present invention in which Hadoop cooperates with the traditional data warehouse.
Fig. 3 is the research framework of the log data k-means clustering algorithm of the present invention.
Detailed description of the embodiments
The technical scheme of the present invention is described in detail below in conjunction with the embodiments and the accompanying drawings, but is not limited thereto.
Referring to Fig. 1, a Hadoop-based k-means cluster analysis system for network security logs comprises a log data acquisition subsystem 11, a log data hybrid-storage management subsystem 12, and a log data analysis subsystem 13;
the log data acquisition subsystem 11 gathers the network security log data of all devices;
the log data hybrid-storage management subsystem 12 manages and stores all log data;
the log data analysis subsystem 13 performs fast query analysis and processing on all log data, and performs mining analysis on the potential value of the log data.
The operation workflow of the system modules is as follows:
Step 1: log data acquisition: a Syslogd centralized log server is configured; using UDP as the transport protocol, the logs of all configured security devices are sent through the destination port to the log server on which the Syslog software is installed, and the SYSLOG server automatically receives the log data and writes it into the log files;
Step 2: Sqoop is used to import the table syslog_incoming of log information from MySQL into HDFS, with the command:
sqoop import --connect jdbc:mysql://219.245.31.39:3306/syslog --username sqoop --password sqoop --table syslog_incoming -m 1
Sqoop imports a table from MySQL with a MapReduce job that extracts records from the table row by row and writes them to HDFS. The NameNode in the cluster is responsible for the placement of the stored data and tells the client where to write; after obtaining the location information, the client starts writing data. While writing, the data is split into blocks and stored as multiple replicas placed on different DataNodes: the client first writes the data to the first node; while receiving the data, the first node pushes the data it has received on to the second node, the second pushes it to the third node, and so on;
Step 3: MapReduce programs are written to extract useful information from the log data imported into HDFS;
Step 4: Sqoop is used to generate a Hive table from the table in the extracted relational data source; the command directly generates the corresponding Hive table definition, and the data kept in HDFS is then loaded:
sqoop create-hive-table --connect jdbc:mysql://219.245.31.39:3306/syslog --table syslog_incoming --fields-terminated-by ','
Start Hive and load the data:
load data inpath 'syslog_incoming' into table syslog_incoming;
Step 5: according to business demand, the corresponding HiveQL statements or MapReduce programs are written to perform statistical analysis on the log data. The concrete steps of the statistical analysis are: the data table is partitioned according to business demand, the partitions being defined with the PARTITIONED BY clause when the table is created. Accordingly, the table records are defined as partitioned by priority and by time (year, quarter, month) as required; the following example defines the table records as partitioned by priority:
hive> create table syslog_incoming_priority (facility varchar, data date, host varchar)
    > partitioned by (priority varchar)
    > row format delimited
    > fields terminated by '\t'
    > stored as textfile;
After the table structure is defined, the data is loaded into the partition table:
hive> insert into table syslog_incoming_priority
    > partition (priority)
    > select facility, data, host, priority
    > from syslog_incoming;
At the file-system level, a partition is a nested subdirectory under the table directory; the table directory structure now contains one subdirectory per priority partition, and the data files are kept in the bottom-level directories. HiveQL query statements are written as required, and the cluster finally converts each query statement into MapReduce tasks and runs them.
Step 6: the query analysis results are imported into MySQL by Sqoop, and the foreground display interface presents them to the user in charts.
Referring to Fig. 3, a Hadoop-based k-means clustering method for network security logs comprises the following steps:
1) log data preprocessing: converting the Syslog_incoming_mes file holding the text content of the log description field into a text vector file;
2) MapReduce-based implementation of the k-means algorithm: running the k-means clustering algorithm on the text vectors.
The log data preprocessing comprises the following steps:
1) removing function words: words without substantive meaning are removed from the log description text;
2) part-of-speech tagging: the system uses the english-left3words-distsim.tagger tagger to tag the words in each log description;
3) extracting content words: after tagging, the system extracts the nouns (NN, NNS, NNP, NNPS), verbs (VB, VBP, VBN, VBD) and adjectives (JJ, JJR, JJS); these words carry actual meaning and can express the log information accurately;
4) building a frequent dictionary: word frequencies are counted over all records; high-frequency words are representative of this description field, and words whose frequency exceeds a threshold are selected as keyword elements to form the frequent dictionary, which can express the log information effectively;
5) generating the text vector file: each log description field is compared against the frequent dictionary to obtain a key vector composed of a sequence of 0s and 1s, and the set of key vectors forms the text vector file.
The MapReduce-based implementation of the k-means algorithm comprises the following steps:
Selecting the initial cluster centers. First the data structure of a cluster is provided; it saves the essential information of a cluster, such as the cluster id, the center coordinates and the number of points belonging to the cluster, and its type is defined as follows:
public class Cluster implements Writable {
    private int clusterID;      // cluster id
    private long numOfPoints;   // number of points belonging to this cluster
    private Instance center;    // information of the cluster center point
}
Then k points are drawn at random as the initial cluster centers. The extraction flow is: the cluster-center set is initialized to empty, and the whole data set is scanned; if the current size of the cluster-center set is less than k, the scanned point is added to the set, otherwise it replaces a point in the set with probability 1/(1+k). The cluster-center information produced by this step is written to the Cluster-0 directory; during the next round of iteration the files in this directory are shared as global information and added to the MapReduce distributed cache as globally shared data.
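The extraction flow just described can be sketched locally as follows, using the replacement probability 1/(1+k) stated above. The `double[]` point representation and the injected `Random` are assumptions for illustration; the patent runs this scan as part of a MapReduce job.

```java
import java.util.*;

public class InitialCenters {

    static List<double[]> select(List<double[]> points, int k, Random rnd) {
        List<double[]> centers = new ArrayList<>(); // cluster-center set, initially empty
        for (double[] p : points) {
            if (centers.size() < k) {
                centers.add(p);                     // fill the set up to k centers
            } else if (rnd.nextDouble() < 1.0 / (1 + k)) {
                centers.set(rnd.nextInt(k), p);     // replace a random current center
            }
        }
        return centers;
    }
}
```

After this pass the selected centers would be written out (the Cluster-0 directory in the patent) so every map node can read them in the next iteration.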
Iteratively computing the cluster centers. This stage requires several iterations; before each iteration starts, every map node first reads, in its setup() method, the cluster information produced in the previous iteration. The stage comprises the following steps:
1) reading in the initial cluster centers: the initial cluster-center data kept in the shared cache for all nodes to share is read in;
2) implementation of the map method: for each incoming data point, the map method finds the cluster center nearest to it and emits the cluster id as the key and the data point as the value, indicating that the point belongs to the cluster with that id;
3) implementation of the combiner: to reduce network transfer overhead, a combiner is used at the map end to merge the results the map end produces; this both reduces the data transferred from the map side and lightens the computation on the reduce side. The types of the key and value output by the Combiner must be identical to the types of the key and value output by map. In the reduce program, the provisional center of the points belonging to the same cluster is computed from their information; this is implemented by simple averaging, i.e. the points in a cluster are summed and divided by the number of points the cluster now contains;
4) implementation of the Reducer: the Reduce stage does roughly the same as the Combiner; it further merges the Combiner's output and emits the result.
The cluster-center computation step is repeated until the cluster centers obtained no longer change.
Partitioning the data according to the final cluster centers. After the final cluster centers are obtained, the whole data set is scanned and each data point is assigned to its nearest cluster center.
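One iteration of the map/combine/reduce cycle above can be sketched locally: `nearest` plays the role of the map method (key = cluster id) and `step` plays the role of the averaging done by the combiner/reducer. Squared Euclidean distance is an assumption; in the patent the centers are read from the distributed cache and the work is spread over the cluster.

```java
import java.util.*;

public class KMeansIteration {

    // Map step: index of the nearest cluster center for one point.
    static int nearest(double[] p, double[][] centers) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centers.length; c++) {
            double d = 0;
            for (int j = 0; j < p.length; j++) {
                double diff = p[j] - centers[c][j];
                d += diff * diff;
            }
            if (d < bestDist) { bestDist = d; best = c; }
        }
        return best;
    }

    // One full iteration: assign every point, then recompute each center as
    // the mean of its points (the simple averaging of the reduce program).
    static double[][] step(double[][] points, double[][] centers) {
        int k = centers.length, dim = points[0].length;
        double[][] sum = new double[k][dim];
        long[] count = new long[k];
        for (double[] p : points) {
            int c = nearest(p, centers);
            count[c]++;
            for (int j = 0; j < dim; j++) sum[c][j] += p[j];
        }
        double[][] next = new double[k][dim];
        for (int c = 0; c < k; c++)
            for (int j = 0; j < dim; j++)
                next[c][j] = count[c] == 0 ? centers[c][j] : sum[c][j] / count[c];
        return next;
    }
}
```

Calling `step` repeatedly until the returned centers stop changing corresponds to the termination condition stated above.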
Embodiment:
First the Hadoop distributed cluster environment is built from 5 PCs: one master server and four slave servers. Hadoop is configured on every machine, and Sqoop, Hive and MySQL are then installed and configured on the NameNode. This embodiment uses the log records of all security devices of the Li An electric supermarket in Shaanxi, a file of 16 GB. The logs are updated on a daily schedule as required, and the statistical query results are refreshed along with the business data.
The method realizes fast statistical queries through Hive. Its advantages are: the learning cost is low, and simple MapReduce statistics can be realized quickly with SQL-like statements, with no need to develop dedicated MapReduce applications, which suits the statistical analysis of a data warehouse very well. Partitioning speeds up queries over data slices and improves retrieval efficiency. The k-means algorithm is implemented with MapReduce, and security-grade assessment is performed on the output of the k-means clustering: for alarms from the same IP in which the proportion of high danger grades is large, a prompt warning is issued in time, so the potential value of the log data is mined.
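The per-IP security-grade assessment described above can be sketched as follows. This is a hypothetical illustration: the record shape (an IP paired with a high-danger flag derived from its cluster) and the 0.5 warning threshold are assumptions, not taken from the patent.

```java
import java.util.*;

public class GradeAssessment {

    // Flag every IP whose proportion of high-danger alarms exceeds the threshold.
    static List<String> flag(List<Map.Entry<String, Boolean>> alarms, double threshold) {
        Map<String, long[]> perIp = new TreeMap<>(); // per IP: {highDangerCount, total}
        for (Map.Entry<String, Boolean> a : alarms) {
            long[] c = perIp.computeIfAbsent(a.getKey(), ip -> new long[2]);
            if (a.getValue()) c[0]++;
            c[1]++;
        }
        List<String> flagged = new ArrayList<>();
        for (Map.Entry<String, long[]> e : perIp.entrySet())
            if ((double) e.getValue()[0] / e.getValue()[1] > threshold)
                flagged.add(e.getKey()); // issue a timely warning for this IP
        return flagged;
    }
}
```

In the system, the high-danger flag would come from the security grade assigned to the cluster each alarm falls into, and the flagged IPs would drive the early-warning display.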

Claims (10)

1. A Hadoop-based k-means cluster analysis system for network security logs, characterized in that it comprises a log data acquisition subsystem (11), a log data hybrid-storage management subsystem (12), and a log data analysis subsystem (13);
the log data acquisition subsystem (11) gathers the network security log data of all devices;
the log data hybrid-storage management subsystem (12) manages and stores all log data;
the log data analysis subsystem (13) performs fast query analysis and processing on all log data, and performs mining analysis on the potential value of the log data.
2. The Hadoop-based k-means cluster analysis system for network security logs according to claim 1, characterized in that the log data acquisition subsystem (11) runs under Linux: a Syslogd centralized log server is configured, syslog is used to collect and record device and system log data, and the log data is managed centrally.
3. The Hadoop-based k-means cluster analysis system for network security logs according to claim 1, characterized in that the log data hybrid-storage management subsystem (12) integrates the Hadoop platform with the traditional data warehouse, and comprises an HDFS distributed file system module and a Hadoop-platform/traditional-data-warehouse collaboration module.
4. The Hadoop-based k-means cluster analysis system for network security logs according to claim 1, characterized in that the log data query and analysis subsystem (13) mainly uses the tool Hive to perform simple statistical query analysis on the data: HiveQL query statements are written as required; under the HiveDriver, lexical analysis, syntax analysis, compilation, optimization and query-plan generation for the HiveQL statement are completed; the generated query plan is stored in HDFS and subsequently invoked for execution by MapReduce; for analysis of the potential information in the log data, its inherent value is mined by writing MapReduce programs that implement the corresponding algorithms.
5. The Hadoop-based k-means cluster analysis system for network security logs according to claim 3, characterized in that the basic file access process of the HDFS distributed file system module is:
1) the application program sends a file name to the NameNode through the HDFS client program;
2) after the NameNode receives the file name, it looks up the data blocks corresponding to the file name in the HDFS directory, then finds the addresses of the DataNodes storing those blocks from the block information, and returns these addresses to the client;
3) after receiving the DataNode addresses, the client transfers data with those DataNodes concurrently, and at the same time submits the relevant operation log to the NameNode.
6. The Hadoop-based k-means cluster analysis system for network security logs according to claim 3, characterized in that the Hadoop-platform/traditional-data-warehouse collaboration module configures a MySQL database as the metastore of Hive, for storing Hive schema (table structure) information and the like, and the Sqoop tool realizes data transfer between the traditional data warehouse and the Hadoop platform.
7. A Hadoop-based k-means cluster analysis system for network security logs, characterized in that, taking a Hadoop cluster as the platform and integrating the traditional data warehouse with the big data platform, a MySQL database is used as the Hive metastore for storing Hive schema (table structure) information, and the Sqoop tool is used to transfer data between the traditional data warehouse and the big data platform; the system comprises a data source layer (21), a data storage layer (22), a computing layer (23), a data analysis layer (24) and a result presentation layer (25);
the data source layer (21) collects the log data of all devices through a configured Syslogd centralized log server, and then imports the log data from the traditional data warehouse into the data storage layer via the Sqoop tool;
the data storage layer (22) adopts the hybrid storage architecture in which Hadoop cooperates with the traditional data warehouse: the Sqoop data transfer tool imports the data into HDFS, the raw data is processed, and the processed data is imported into the corresponding Hive tables;
the result presentation layer (25) is where the user issues requests to the data analysis layer;
the data analysis layer (24) converts the user's request into the corresponding HiveQL statement and, driven by the HiveDriver, completes the executable operations;
the computing layer (23) accepts instructions from the Hive engine and, through the HDFS of the data storage layer working with MapReduce, realizes the processing and analysis of the data, the results finally being returned to the result presentation layer.
8., based on a network security daily record k-means clustering method of Hadoop, it is characterized in that, comprise the following steps:
Daily record data pre-service, the Syslog_incoming_mes file of the content of text of conversion log descriptor is a text vector file;
Based on the realization of the k-means algorithm of MapReduce, text vector runs k-means clustering algorithm.
9. The Hadoop-based network security log k-means clustering method according to claim 8, characterized in that the log data preprocessing comprises the following steps:
1) Remove stop words: delete words without substantive meaning from the log description text;
2) Tag parts of speech: the system uses the english-left3words-distsim.tagger model (a Stanford POS tagger model) to tag every word in each log description;
3) Extract content words: after tagging, the system keeps the nouns (NN, NNS, NNP, NNPS), verbs (VB, VBP, VBN, VBD) and adjectives (JJ, JJR, JJS); these words carry the actual meaning and can accurately express the log information;
4) Build the frequent dictionary: count word frequencies over all records; high-frequency words are representative of the description field, so words whose frequency exceeds a threshold are selected as keyword elements to form the frequent dictionary, which can express the log information effectively;
5) Generate the text vector file: compare each log description field against the frequent dictionary to obtain a key vector composed of a sequence of 0s and 1s; the set of all key vectors forms the text vector file.
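The five preprocessing steps above can be sketched in plain Python. This is a minimal illustration only: the tiny stop-word list, the frequency threshold, and the one-line tagger stub are assumptions; the patent uses the Stanford english-left3words-distsim.tagger model for real part-of-speech tagging.

```python
from collections import Counter

# Step 1): a toy stop-word list (assumption; a real system uses a fuller one).
STOP_WORDS = {"the", "a", "an", "of", "to", "on", "is", "was", "by"}
# Step 3): the POS tags the patent keeps (nouns, verbs, adjectives).
CONTENT_TAGS = {"NN", "NNS", "NNP", "NNPS", "VB", "VBP", "VBN", "VBD",
                "JJ", "JJR", "JJS"}

def tag(word):
    # Stub standing in for the Stanford tagger (step 2): every surviving
    # word is treated as a noun here, purely to keep the sketch runnable.
    return "NN"

def content_words(description):
    words = [w.lower() for w in description.split()
             if w.lower() not in STOP_WORDS]            # step 1)
    return [w for w in words if tag(w) in CONTENT_TAGS]  # steps 2)-3)

def frequent_dictionary(descriptions, threshold=2):
    # Step 4): keep words whose frequency reaches the threshold,
    # in a fixed (sorted) order so vector positions are stable.
    counts = Counter(w for d in descriptions for w in content_words(d))
    return sorted(w for w, c in counts.items() if c >= threshold)

def text_vector(description, dictionary):
    # Step 5): one 0/1 entry per dictionary word.
    present = set(content_words(description))
    return [1 if w in present else 0 for w in dictionary]

logs = [
    "Failed password for root from host",
    "Failed password for admin from host",
    "Accepted password for root",
]
dic = frequent_dictionary(logs)
vectors = [text_vector(d, dic) for d in logs]
```

Each row of `vectors` is one key vector; together they form the text vector file on which the MapReduce k-means of claim 10 would run.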
10. The Hadoop-based network security log k-means clustering method according to claim 8, characterized in that the MapReduce-based implementation of the k-means algorithm comprises the following steps:
1) Scan all points in the raw data set and randomly select k points as the initial cluster centers;
2) Each Map node reads its local data set and runs the k-means algorithm on it to generate cluster sets; in the Reduce stage, the cluster sets are merged to generate new global cluster centers; this process is repeated until the termination condition is met;
3) Partition all data elements into clusters according to the final cluster centers.
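The three steps above can be sketched as a single-process imitation of the MapReduce dataflow: the "map" stage assigns each point to its nearest center, the "reduce" stage merges the cluster sets into new global centers, and the loop repeats until the centers stop moving. The sample points are invented; a real deployment would run the map and reduce stages as Hadoop jobs over HDFS.

```python
import random

def nearest(point, centers):
    # Index of the center with the smallest squared Euclidean distance.
    return min(range(len(centers)),
               key=lambda i: sum((p - c) ** 2
                                 for p, c in zip(point, centers[i])))

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # step 1): random initial centers
    for _ in range(max_iter):
        # "Map" stage: group each point under its nearest center.
        partial = {i: [] for i in range(k)}
        for p in points:
            partial[nearest(p, centers)].append(p)
        # "Reduce" stage: merge cluster sets into new global centers
        # (componentwise mean; empty clusters keep their old center).
        new_centers = [
            tuple(sum(d) / len(pts) for d in zip(*pts)) if pts else centers[i]
            for i, pts in partial.items()
        ]
        if new_centers == centers:           # termination condition
            break
        centers = new_centers
    # Step 3): partition all data elements by the final centers.
    return centers, [nearest(p, centers) for p in points]

points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
centers, labels = kmeans(points, k=2)
```

On this toy data the two tight groups of points end up in separate clusters regardless of which two points are drawn as initial centers.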
CN201510553636.3A 2015-09-02 2015-09-02 A kind of network security daily record k-means cluster analysis systems and method based on Hadoop Expired - Fee Related CN105138661B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510553636.3A CN105138661B (en) 2015-09-02 2015-09-02 A kind of network security daily record k-means cluster analysis systems and method based on Hadoop


Publications (2)

Publication Number Publication Date
CN105138661A true CN105138661A (en) 2015-12-09
CN105138661B CN105138661B (en) 2018-10-30

Family

ID=54724008

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510553636.3A Expired - Fee Related CN105138661B (en) 2015-09-02 2015-09-02 A kind of network security daily record k-means cluster analysis systems and method based on Hadoop

Country Status (1)

Country Link
CN (1) CN105138661B (en)

Cited By (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105608203A (en) * 2015-12-24 2016-05-25 Tcl集团股份有限公司 Internet of things log processing method and device based on Hadoop platform
CN105824892A (en) * 2016-03-11 2016-08-03 广东电网有限责任公司电力科学研究院 Method for synchronizing and processing data by data pool
CN106168965A (en) * 2016-07-01 2016-11-30 竹间智能科技(上海)有限公司 Knowledge mapping constructing system
CN106919555A (en) * 2015-12-28 2017-07-04 国际商业机器公司 The system and method that the field of the data for being included in log stream is extracted
CN107579944A (en) * 2016-07-05 2018-01-12 南京联成科技发展股份有限公司 Based on artificial intelligence and MapReduce security attack Forecasting Methodologies
CN107943647A (en) * 2017-11-21 2018-04-20 北京小度互娱科技有限公司 A kind of reliable distributed information log collection method and system
CN107958022A (en) * 2017-11-06 2018-04-24 余帝乾 A kind of method that Web log excavates
CN108133043A (en) * 2018-01-12 2018-06-08 福建星瑞格软件有限公司 A kind of server running log structured storage method based on big data
CN108446568A (en) * 2018-03-19 2018-08-24 西北大学 A kind of histogram data dissemination method going trend analysis difference secret protection
CN108933785A (en) * 2018-06-29 2018-12-04 平安科技(深圳)有限公司 Network risks monitoring method, device, computer equipment and storage medium
CN109254903A (en) * 2018-08-03 2019-01-22 挖财网络技术有限公司 A kind of intelligentized log analysis method and device
CN109446042A (en) * 2018-10-12 2019-03-08 安徽南瑞中天电力电子有限公司 A kind of blog management method and system for intelligent power equipment
CN109525593A (en) * 2018-12-20 2019-03-26 中科曙光国际信息产业有限公司 A kind of pair of hadoop big data platform concentrates security management and control system and method
CN109766368A (en) * 2018-11-14 2019-05-17 国云科技股份有限公司 A kind of data query polymorphic type view output system and method based on Hive
CN110069551A (en) * 2019-04-25 2019-07-30 江南大学 Medical Devices O&M information excavating analysis system and its application method based on Spark
CN110135184A (en) * 2018-02-09 2019-08-16 中兴通讯股份有限公司 A kind of method, apparatus, equipment and the storage medium of static data desensitization
CN110378550A (en) * 2019-06-03 2019-10-25 东南大学 The processing method of the extensive food data of multi-source based on distributed structure/architecture
CN110581873A (en) * 2018-06-11 2019-12-17 中国移动通信集团浙江有限公司 cross-cluster redirection method and monitoring server
CN111259068A (en) * 2020-04-28 2020-06-09 成都四方伟业软件股份有限公司 Data development method and system based on data warehouse
CN111913954A (en) * 2020-06-20 2020-11-10 杭州城市大数据运营有限公司 Intelligent data standard catalog generation method and device
CN111966293A (en) * 2020-08-18 2020-11-20 北京明略昭辉科技有限公司 Cold and hot data analysis method and system
CN112306787A (en) * 2019-07-24 2021-02-02 阿里巴巴集团控股有限公司 Error log processing method and device, electronic equipment and intelligent sound box
CN112380348A (en) * 2020-11-25 2021-02-19 中信百信银行股份有限公司 Metadata processing method and device, electronic equipment and computer-readable storage medium
CN113220760A (en) * 2021-04-28 2021-08-06 北京达佳互联信息技术有限公司 Data processing method, device, server and storage medium
CN113238912A (en) * 2021-05-08 2021-08-10 国家计算机网络与信息安全管理中心 Aggregation processing method for network security log data
CN113641750A (en) * 2021-08-20 2021-11-12 广东云药科技有限公司 Enterprise big data analysis platform
CN116737846A (en) * 2023-05-31 2023-09-12 深圳华夏凯词财富管理有限公司 Asset management data safety protection warehouse system based on Hive

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106294580A (en) * 2016-07-28 2017-01-04 武汉虹信技术服务有限责任公司 LTE network MR data analysing method based on HADOOP platform

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102999633A (en) * 2012-12-18 2013-03-27 北京师范大学珠海分校 Cloud cluster extraction method of network information
CN103399887A (en) * 2013-07-19 2013-11-20 蓝盾信息安全技术股份有限公司 Query and statistical analysis system for mass logs
CN103425762A (en) * 2013-08-05 2013-12-04 南京邮电大学 Telecom operator mass data processing method based on Hadoop platform
CN103544328A (en) * 2013-11-15 2014-01-29 南京大学 Parallel k mean value clustering method based on Hadoop
CN104616205A (en) * 2014-11-24 2015-05-13 北京科东电力控制系统有限责任公司 Distributed log analysis based operation state monitoring method of power system


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
FU Wei et al., "Design of a Web Log Analysis Scheme Based on Hadoop and K-means", Proceedings of the 19th National Youth Communication Academic Conference *
WANG Shuai et al., "Application of Big Data Technology in Network Security Analysis", Telecommunications Science *

Cited By (42)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105608203B (en) * 2015-12-24 2019-09-17 Tcl集团股份有限公司 A kind of Internet of Things log processing method and device based on Hadoop platform
CN105608203A (en) * 2015-12-24 2016-05-25 Tcl集团股份有限公司 Internet of things log processing method and device based on Hadoop platform
CN106919555A (en) * 2015-12-28 2017-07-04 国际商业机器公司 The system and method that the field of the data for being included in log stream is extracted
CN106919555B (en) * 2015-12-28 2020-04-24 国际商业机器公司 System and method for field extraction of data contained within a log stream
CN105824892A (en) * 2016-03-11 2016-08-03 广东电网有限责任公司电力科学研究院 Method for synchronizing and processing data by data pool
CN106168965A (en) * 2016-07-01 2016-11-30 竹间智能科技(上海)有限公司 Knowledge mapping constructing system
CN106168965B (en) * 2016-07-01 2020-06-30 竹间智能科技(上海)有限公司 Knowledge graph construction system
CN107579944A (en) * 2016-07-05 2018-01-12 南京联成科技发展股份有限公司 Based on artificial intelligence and MapReduce security attack Forecasting Methodologies
CN107579944B (en) * 2016-07-05 2020-08-11 南京联成科技发展股份有限公司 Artificial intelligence and MapReduce-based security attack prediction method
CN107958022A (en) * 2017-11-06 2018-04-24 余帝乾 A kind of method that Web log excavates
CN107943647A (en) * 2017-11-21 2018-04-20 北京小度互娱科技有限公司 A kind of reliable distributed information log collection method and system
CN108133043A (en) * 2018-01-12 2018-06-08 福建星瑞格软件有限公司 A kind of server running log structured storage method based on big data
CN110135184A (en) * 2018-02-09 2019-08-16 中兴通讯股份有限公司 A kind of method, apparatus, equipment and the storage medium of static data desensitization
CN110135184B (en) * 2018-02-09 2023-12-22 中兴通讯股份有限公司 Method, device, equipment and storage medium for desensitizing static data
CN108446568A (en) * 2018-03-19 2018-08-24 西北大学 A kind of histogram data dissemination method going trend analysis difference secret protection
CN108446568B (en) * 2018-03-19 2021-04-13 西北大学 Histogram data publishing method for trend analysis differential privacy protection
CN110581873B (en) * 2018-06-11 2022-06-14 中国移动通信集团浙江有限公司 Cross-cluster redirection method and monitoring server
CN110581873A (en) * 2018-06-11 2019-12-17 中国移动通信集团浙江有限公司 cross-cluster redirection method and monitoring server
CN108933785B (en) * 2018-06-29 2021-02-05 平安科技(深圳)有限公司 Network risk monitoring method and device, computer equipment and storage medium
CN108933785A (en) * 2018-06-29 2018-12-04 平安科技(深圳)有限公司 Network risks monitoring method, device, computer equipment and storage medium
CN109254903A (en) * 2018-08-03 2019-01-22 挖财网络技术有限公司 A kind of intelligentized log analysis method and device
CN109446042A (en) * 2018-10-12 2019-03-08 安徽南瑞中天电力电子有限公司 A kind of blog management method and system for intelligent power equipment
CN109446042B (en) * 2018-10-12 2021-12-14 安徽南瑞中天电力电子有限公司 Log management method and system for intelligent electric equipment
CN109766368B (en) * 2018-11-14 2021-08-27 国云科技股份有限公司 Hive-based data query multi-type view output system and method
CN109766368A (en) * 2018-11-14 2019-05-17 国云科技股份有限公司 A kind of data query polymorphic type view output system and method based on Hive
CN109525593A (en) * 2018-12-20 2019-03-26 中科曙光国际信息产业有限公司 A kind of pair of hadoop big data platform concentrates security management and control system and method
CN109525593B (en) * 2018-12-20 2022-02-22 中科曙光国际信息产业有限公司 Centralized safety management and control system and method for hadoop big data platform
CN110069551A (en) * 2019-04-25 2019-07-30 江南大学 Medical Devices O&M information excavating analysis system and its application method based on Spark
CN110378550A (en) * 2019-06-03 2019-10-25 东南大学 The processing method of the extensive food data of multi-source based on distributed structure/architecture
CN112306787B (en) * 2019-07-24 2022-08-09 阿里巴巴集团控股有限公司 Error log processing method and device, electronic equipment and intelligent sound box
CN112306787A (en) * 2019-07-24 2021-02-02 阿里巴巴集团控股有限公司 Error log processing method and device, electronic equipment and intelligent sound box
CN111259068A (en) * 2020-04-28 2020-06-09 成都四方伟业软件股份有限公司 Data development method and system based on data warehouse
CN111913954B (en) * 2020-06-20 2023-08-04 杭州城市大数据运营有限公司 Intelligent data standard catalog generation method and device
CN111913954A (en) * 2020-06-20 2020-11-10 杭州城市大数据运营有限公司 Intelligent data standard catalog generation method and device
CN111966293A (en) * 2020-08-18 2020-11-20 北京明略昭辉科技有限公司 Cold and hot data analysis method and system
CN112380348A (en) * 2020-11-25 2021-02-19 中信百信银行股份有限公司 Metadata processing method and device, electronic equipment and computer-readable storage medium
CN112380348B (en) * 2020-11-25 2024-03-26 中信百信银行股份有限公司 Metadata processing method, apparatus, electronic device and computer readable storage medium
CN113220760A (en) * 2021-04-28 2021-08-06 北京达佳互联信息技术有限公司 Data processing method, device, server and storage medium
CN113238912A (en) * 2021-05-08 2021-08-10 国家计算机网络与信息安全管理中心 Aggregation processing method for network security log data
CN113238912B (en) * 2021-05-08 2022-12-06 国家计算机网络与信息安全管理中心 Aggregation processing method for network security log data
CN113641750A (en) * 2021-08-20 2021-11-12 广东云药科技有限公司 Enterprise big data analysis platform
CN116737846A (en) * 2023-05-31 2023-09-12 深圳华夏凯词财富管理有限公司 Asset management data safety protection warehouse system based on Hive

Also Published As

Publication number Publication date
CN105138661B (en) 2018-10-30

Similar Documents

Publication Publication Date Title
CN105138661A (en) Hadoop-based k-means clustering analysis system and method of network security log
CN107291807B (en) SPARQL query optimization method based on graph traversal
Das et al. Big data analytics: A framework for unstructured data analysis
CN110674154B (en) Spark-based method for inserting, updating and deleting data in Hive
CN106446153A (en) Distributed newSQL database system and method
CN110633186A (en) Log monitoring system for electric power metering micro-service architecture and implementation method
CN104102710A (en) Massive data query method
WO2018036324A1 (en) Smart city information sharing method and device
Mohammed et al. A review of big data environment and its related technologies
CN102917009B (en) A kind of stock certificate data collection based on cloud computing technology and storage means and system
CN104239377A (en) Platform-crossing data retrieval method and device
CN103399894A (en) Distributed transaction processing method on basis of shared storage pool
CN106569896A (en) Data distribution and parallel processing method and system
Li et al. The overview of big data storage and management
Das et al. A study on big data integration with data warehouse
Ding et al. ComMapReduce: An improvement of MapReduce with lightweight communication mechanisms
CN104516985A (en) Rapid mass data importing method based on HBase database
Arputhamary et al. A review on big data integration
Chen et al. Related technologies
Suciu et al. Big data technology for scientific applications
Suguna et al. Improvement of Hadoop ecosystem and their pros and cons in Big data
CN113590651A (en) Cross-cluster data processing system and method based on HQL
Pan et al. An open sharing pattern design of massive power big data
Zhang et al. The research and design of SQL processing in a data-mining system based on MapReduce
Yang et al. AstroServ: A distributed database for serving large-scale full life-cycle astronomical data

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20181030

Termination date: 20200902
