CN105138661B - A Hadoop-based network security log k-means cluster analysis system and method - Google Patents


Info

Publication number
CN105138661B
CN105138661B (application CN201510553636.3A)
Authority
CN
China
Prior art keywords
data
log
hadoop
analysis
hive
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201510553636.3A
Other languages
Chinese (zh)
Other versions
CN105138661A (en)
Inventor
高岭
苏蓉
高妮
王帆
杨建锋
雷艳婷
申元
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwest University
Original Assignee
Northwest University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwest University
Priority to CN201510553636.3A
Publication of CN105138661A
Application granted
Publication of CN105138661B
Expired - Fee Related
Anticipated expiration


Classifications

    • G06F16/2457 — Query processing with adaptation to user needs (G Physics; G06F Electric digital data processing; G06F16/00 Information retrieval; G06F16/24 Querying)
    • G06F16/182 — Distributed file systems (G06F16/10 File systems; G06F16/18 File system types)
    • G06F16/25 — Integrating or interfacing systems involving database management systems (G06F16/20 Information retrieval of structured data)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A Hadoop-based k-means cluster analysis system and method for network security logs, comprising a log data acquisition subsystem, a hybrid-mechanism log data storage management subsystem, and a log data analysis subsystem. At the data storage layer, log data are stored under a hybrid storage mechanism in which Hadoop cooperates with a traditional data warehouse, and an interface for operating on Hive is provided at the data access layer; the storage layer and the computation layer receive instructions from the Hive engine and, through HDFS working with MapReduce, carry out efficient query analysis on the data. When the log data are mined, a MapReduce implementation of the k-means algorithm clusters the data and the clustering results are analysed. The architecture in which Hadoop cooperates with the traditional data warehouse makes up for the warehouse's shortcomings in mass data processing and storage while keeping the existing warehouse in full use, and clustering with the MapReduce-based k-means algorithm allows timely security-grade assessment and early warning on the log data.

Description

A Hadoop-based network security log k-means cluster analysis system and method
Technical field
The invention belongs to the technical field of computer information processing, and relates in particular to a Hadoop-based k-means cluster analysis system and method for network security logs.
Background technology
With the explosive growth of data, the amount of information increases sharply, and the traditional data warehouses that enterprises already own can hardly keep pace with this growth. Traditional data warehouses are generally built on high-performance integrated machines, which are costly and poorly scalable, and they are only good at processing structured data; this limits them when mining the inherent value of massive heterogeneous data, which is the biggest difference between Hadoop and traditional data processing. The existing traditional warehouse should therefore be used rationally: by combining it with a big data platform and establishing a unified architecture for data analysis and data processing, the cooperation of Hadoop with the traditional data warehouse realizes monitoring and statistical analysis of network logs.
Hadoop is an open-source distributed computing platform managed by the Apache organization; it is a software framework capable of distributed processing of massive data. With the Hadoop distributed file system HDFS and MapReduce at its core, Hadoop provides users with a distributed infrastructure whose low-level details are transparent. The high fault tolerance, scalability, availability and throughput of HDFS allow users to deploy Hadoop on cheap hardware to form a distributed system, and the MapReduce distributed programming model lets users develop parallel applications without understanding the low-level details of the distributed system.
HDFS is the basis of data storage management in distributed computing; it was developed to meet the demand of streaming access to and processing of very large files. Its characteristics give massive data failure-tolerant storage and bring many conveniences to applications that process very large data sets. HDFS has a master/slave architecture with two classes of nodes: the NameNode, also called the metadata node, and the DataNode, also called the data node; these respectively play the Master role and the Worker role that executes concrete tasks. Because of the nature of distributed storage, an HDFS cluster has one NameNode and multiple DataNodes. The metadata node manages the namespace of the file system; the data nodes are where the file data are actually stored.
The MapReduce parallel computation framework is a parallel program execution system. It provides a parallel processing model and process comprising the two stages Map and Reduce, handles data input as key-value pairs, and can automatically complete data partitioning and scheduling. During program execution the framework is responsible for scheduling and allocating computing resources, partitioning the input and output data, scheduling program execution, monitoring execution state, and for synchronizing the compute nodes and collecting intermediate results.
Sqoop is a tool for fast batch data exchange between relational databases and the Hadoop platform. It can import batch data from a relational database into HDFS or Hive on Hadoop, and conversely export data from the Hadoop platform into a relational database.
Hive is a data warehouse built on Hadoop for managing structured and semi-structured data stored in HDFS. Its programming interface allows query analysis programs to be written directly in HiveQL, an SQL-like query language, and it provides the persistent schema, storage management and query analysis functions a data warehouse needs; at the lower layer, HiveQL statements are translated into corresponding MapReduce programs for execution.
Summary of the invention
To overcome the above shortcomings of the prior art, the purpose of the present invention is to provide a Hadoop-based k-means cluster analysis system and method for network security logs which, on the basis of reasonable use of an already-built traditional data warehouse, integrates a big data platform and establishes a unified data storage and data processing architecture, overcoming the traditional warehouse's poor scalability, its restriction to structured data, and its inability to mine the inherent value of massive heterogeneous data.
To achieve the above object, the technical solution adopted by the present invention is: a Hadoop-based k-means cluster analysis system for network security logs, comprising a log data acquisition subsystem, a hybrid-mechanism log data storage management subsystem, and a log data analysis subsystem;
the log data acquisition subsystem collects the network security log data of all devices;
the hybrid-mechanism log data storage management subsystem manages and stores all log data;
the log data analysis subsystem performs fast retrieval and analysis on all log data and mines the potential value of the log data.
The log data acquisition subsystem runs under Linux: a Syslogd centralized log server is configured, the syslog data of the recording devices are collected by syslog, and the log data are managed centrally.
The hybrid-mechanism log data storage management subsystem integrates the Hadoop platform with the traditional data warehouse and comprises an HDFS distributed file system module and a Hadoop-platform/traditional-data-warehouse cooperation module.
The log data retrieval and analysis subsystem mainly performs simple statistical query analysis of the data with the tool Hive: HiveQL query statements are written according to demand, and under the drive of the Hive Driver the lexical analysis, syntactic analysis, compilation, optimization and query-plan generation of the HiveQL statements are completed; the generated query plan is stored in HDFS and subsequently called and executed by MapReduce. For the analysis of potential information in the log data, the corresponding algorithms are realized by writing MapReduce programs that mine its inherent value.
The file access process of the HDFS distributed file system module is:
1) the application program sends a file name to the NameNode through the HDFS client program;
2) after the NameNode receives the file name, it retrieves the data blocks corresponding to the file name in the HDFS directory, finds the DataNode addresses holding those blocks according to the block information, and sends these addresses back to the client;
3) after receiving the DataNode addresses, the client carries out data transfer operations with these DataNodes concurrently and at the same time submits the relevant operation log to the NameNode.
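The three-step file access flow above can be sketched as follows. This is a toy Python model of the lookup-then-fetch pattern, not the real HDFS client API; the directory contents, block ids and node addresses are all illustrative assumptions.

```python
# Fake namenode metadata: file name -> list of (block id, datanode address)
NAMENODE_DIRECTORY = {
    "syslog_incoming": [("blk_1", "dn-1:50010"), ("blk_2", "dn-2:50010")],
}

# Fake datanode storage: (address, block id) -> block contents
DATANODE_BLOCKS = {
    ("dn-1:50010", "blk_1"): b"2015-09-01 host1 sshd: failed login\n",
    ("dn-2:50010", "blk_2"): b"2015-09-01 host2 kernel: link up\n",
}

def namenode_lookup(filename):
    """Steps 1-2: the client sends the file name, the namenode returns
    the block list with the datanode address holding each block."""
    return NAMENODE_DIRECTORY[filename]

def read_file(filename):
    """Step 3: the client contacts each returned DataNode address and
    concatenates the blocks into the file contents."""
    data = b""
    for block_id, address in namenode_lookup(filename):
        data += DATANODE_BLOCKS[(address, block_id)]
    return data

content = read_file("syslog_incoming")
```

In the real system the per-block reads happen concurrently against different DataNodes; the sequential loop here only models the metadata-first ordering.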
In the Hadoop-platform/traditional-data-warehouse cooperation module, a MySQL database is configured as the metadata store of Hive, used to hold Hive's schema (table structure) information, and the Sqoop tool realizes data transfer between the traditional data warehouse and the Hadoop platform.
With a Hadoop cluster as the platform, the traditional data warehouse and the big data platform are integrated: MySQL serves as Hive's metadata store, holding Hive's schema information, and the Sqoop tool transfers data between the traditional data warehouse and the big data platform. The architecture comprises a data source layer, a data storage layer, a computation layer, a data analysis layer and a result display layer;
the data source layer acquires the log data of all devices through the configured Syslogd centralized log server, and the log data are then imported into the data storage layer from the traditional data warehouse by the Sqoop tool;
the data storage layer is the hybrid storage architecture in which Hadoop cooperates with the traditional data warehouse: data are imported into HDFS through the Sqoop transfer tool, the imported data are processed, and the processed data are loaded into the corresponding tables in Hive;
at the result display layer, the user issues requests to the data analysis layer;
the data analysis layer converts the user's request into the corresponding HiveQL statement and, under the drive of the Hive Driver, completes the execution of the operation;
the computation layer receives instructions from the Hive engine and, through the HDFS of the data storage layer in cooperation with MapReduce, realizes the processing and analysis of the data, finally returning the results to the result display layer.
A Hadoop-based k-means cluster analysis method for network security logs is characterized by comprising the following steps:
1) log data preprocessing: the Syslog_incoming_mes file holding the text of the log description information is converted into a text vector file;
2) realization of the MapReduce-based k-means algorithm: the k-means clustering algorithm is run on the text vectors.
The log data preprocessing comprises the following steps:
1) removing function words: words without real meaning are removed from the log description text;
2) part-of-speech tagging: the system uses the english-left3words-distsim.tagger tagger to tag each word in the log description;
3) extracting content words: after tagging, the system extracts nouns (NN, NNS, NNP, NNPS), verbs (VB, VBP, VBN, VBD) and adjectives (JJ, JJR, JJS); these words carry the actual meaning and can accurately express the log information;
4) obtaining the frequent dictionary: word frequencies are counted over all records; high-frequency words are representative of the description field, so words whose frequency exceeds a threshold are taken as keyword elements to form the frequent dictionary, which can effectively express the log information;
5) generating the text vector file: each log description field is compared with the frequent dictionary to obtain a keyspace composed of a string of 0s and 1s, and the set of keyspaces constitutes the text vector file.
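Under simplifying assumptions, the five preprocessing steps can be sketched in Python as below. A tiny hand-written stop-word list stands in for the english-left3words-distsim.tagger part-of-speech filter used by the real system, and the sample log lines and the frequency threshold are invented for illustration.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "of", "to", "is", "on", "for", "from"}  # function words (step 1)

def tokenize(message):
    # steps 1-3 collapsed: lower-case, keep alphabetic words, drop function words
    words = re.findall(r"[a-z]+", message.lower())
    return [w for w in words if w not in STOP_WORDS]

def frequent_dictionary(messages, threshold=1):
    # step 4: keep words whose frequency over all records exceeds the threshold
    counts = Counter(w for m in messages for w in tokenize(m))
    return sorted(w for w, c in counts.items() if c > threshold)

def to_vector(message, dictionary):
    # step 5: the 0/1 keyspace -- 1 where a dictionary word occurs in the log
    present = set(tokenize(message))
    return [1 if w in present else 0 for w in dictionary]

logs = ["Failed password for root",
        "Failed password for admin",
        "Accepted password for root"]
dictionary = frequent_dictionary(logs)
vectors = [to_vector(m, dictionary) for m in logs]
```

The list of 0/1 vectors plays the role of the text vector file on which the clustering below runs.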
The realization of the MapReduce-based k-means algorithm comprises the following steps:
1) all points of the original data set are scanned and k points are randomly selected as the initial cluster centers;
2) each Map node reads its local data split and generates cluster sets with the k-means algorithm; in the Reduce stage the several cluster sets are combined into new global cluster centers, and this process is repeated until the termination condition is met;
3) all data elements are partitioned into clusters according to the finally generated cluster centers.
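One round of the Map/Reduce loop in step 2) can be sketched in plain Python as follows. The patent realizes this as Hadoop MapReduce programs in Java; the function names and the toy two-split data set here are assumptions for the sketch.

```python
def nearest(point, centers):
    # map side: index of the closest cluster center (squared Euclidean distance)
    return min(range(len(centers)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])))

def map_combine(split, centers):
    # a Map node's work on its local split: per cluster id, a partial
    # (sum-vector, point count) pair, already merged combiner-style
    partial = {}
    for point in split:
        cid = nearest(point, centers)
        s, n = partial.get(cid, ([0.0] * len(point), 0))
        partial[cid] = ([a + b for a, b in zip(s, point)], n + 1)
    return partial

def reduce_centers(partials, old_centers):
    # Reduce stage: merge the partial sums of all Map nodes and average
    merged = {}
    for partial in partials:
        for cid, (s, n) in partial.items():
            ms, mn = merged.get(cid, ([0.0] * len(s), 0))
            merged[cid] = ([a + b for a, b in zip(ms, s)], mn + n)
    new_centers = list(old_centers)
    for cid, (s, n) in merged.items():
        new_centers[cid] = tuple(v / n for v in s)
    return new_centers

splits = [[(0.0, 0.0), (0.0, 1.0)], [(5.0, 5.0), (6.0, 5.0)]]  # two Map nodes
centers = [(0.0, 0.0), (5.0, 5.0)]                             # k = 2 initial centers
partials = [map_combine(s, centers) for s in splits]
centers = reduce_centers(partials, centers)
```

Step 2) of the method repeats this round until the centers stop moving; step 3) is then a final pass of `nearest` over every point.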
The advantages of the technical solution of the present invention are mainly reflected in:
1) Hadoop support is provided on the basis of the existing traditional data warehouse, and a unified data storage and data processing architecture is established, making up for the traditional warehouse's shortcomings in mass data processing and storage while keeping the existing warehouse in full use.
2) As the data grow, more cluster resources are needed to process them; Hadoop is an easily extended system, and simply configuring a new node conveniently extends the cluster and raises its computing power.
3) For the heterogeneous data in massive network logs, the source data are first processed with MapReduce, the processed data are imported into the corresponding Hive tables, HiveQL statements are then written according to demand for simple query analysis of the data, and the k-means algorithm is realized with MapReduce to mine the data. Query analysis efficiency is improved, and the potential value of the data is excavated.
Description of the drawings
Fig. 1 is a functional block diagram of the system structure of the present invention.
Fig. 2 is an architecture diagram of the network log analysis system of the present invention, in which Hadoop cooperates with the traditional data warehouse.
Fig. 3 is the research framework of the k-means clustering algorithm for log data in the present invention.
Specific implementation mode
The technical scheme of the present invention is described in detail below with reference to the embodiments and the accompanying drawings, but is not limited thereto.
Referring to Fig. 1, a Hadoop-based k-means cluster analysis system for network security logs comprises a log data acquisition subsystem 11, a hybrid-mechanism log data storage management subsystem 12, and a log data analysis subsystem 13;
the log data acquisition subsystem 11 collects the network security log data of all devices;
the hybrid-mechanism log data storage management subsystem 12 manages and stores all log data;
the log data analysis subsystem 13 performs fast retrieval and analysis on all log data and mines the potential value of the log data.
The operation workflow of the system modules is as follows:
Step 1: log data acquisition. A Syslogd centralized log server is configured with UDP as the transport protocol; the log management configuration of every security device sends its logs through the destination port to the log server on which the Syslog software is installed, and the log server automatically receives the log data and writes them into log files.
Step 2: the table syslog_incoming holding the log information in MySQL is imported into HDFS with Sqoop, using the command:
sqoop import --connect jdbc:mysql://219.245.31.39:3306/syslog --username sqoop --password sqoop --table syslog_incoming -m 1
Sqoop imports a table from MySQL through a MapReduce job that extracts the records row by row and writes them to HDFS. The NameNode in the cluster manages where the data are stored and tells the client the storage locations; after obtaining the locations the client starts writing. When writing, the data are split into blocks and stored in several replicas placed on different DataNodes: the client first writes the data to the first node, and while the first node receives the data it pushes what it has received to the second node, the second pushes to the third, and so on.
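The block-splitting and pipelined replication just described can be modelled with a few lines of Python. The block size, node names and data are toy values, and the forwarding loop only imitates the hop-by-hop push of the write pipeline, not the real DataNode protocol.

```python
BLOCK_SIZE = 8                       # toy block size (HDFS defaults are 64/128 MB)
PIPELINE = ["dn-1", "dn-2", "dn-3"]  # replication pipeline of three DataNodes

def split_blocks(data, block_size=BLOCK_SIZE):
    # the client cuts the outgoing data into fixed-size blocks
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def write_block(block, pipeline, storage):
    # the client writes only to pipeline[0]; each node forwards the block
    # downstream while receiving it -- modelled here as a simple loop
    for node in pipeline:
        storage.setdefault(node, []).append(block)

def hdfs_write(data, pipeline=PIPELINE):
    storage = {}
    for block in split_blocks(data):
        write_block(block, pipeline, storage)
    return storage

copies = hdfs_write(b"row1;row2;row3;row4;")
```

After the write, every node in the pipeline holds a full replica of all blocks, which is the property the hop-by-hop push guarantees.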
Step 3: MapReduce programs are written to extract the useful information from the log data imported into HDFS.
Step 4: according to the table in the extracted relational data source, a Hive table is generated with Sqoop; the command directly generates the definition of the corresponding Hive table, and the data stored in HDFS are then loaded:
sqoop create-hive-table --connect jdbc:mysql://219.245.31.39:3306/syslog --table syslog_incoming --fields-terminated-by ','
Hive is started and the data are loaded:
load data inpath "syslog_incoming" into table syslog_incoming;
Step 5: according to the business demand, the corresponding HiveQL statements or MapReduce programs are written to perform statistical analysis on the log data. The specific steps of the statistical analysis are: the data table is partitioned according to the business demand, a partition being defined with the PARTITIONED BY clause when the table is created. The table records are therefore defined, according to demand, as partitioned by severity level and by time (year, quarter, month); the following example defines the table records as partitioned by level:
hive> create table syslog_incoming_priority (facility string, data string, host string)
    > partitioned by (priority string)
    > row format delimited
    > fields terminated by '\t'
    > stored as textfile;
After the table structure is defined, the data are loaded into the partitioned table:
hive> insert into table syslog_incoming_priority
    > partition (priority)
    > select facility, data, host, priority
    > from syslog_incoming;
At the file system level a partition is a nested subdirectory under the table directory, so the table directory structure now contains one level subdirectory per priority, with the data files stored in the bottom directories. Finally, HiveQL query statements are written according to demand, and the cluster converts the query statements into MapReduce tasks and runs them.
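The partition-as-subdirectory layout described in this step can be illustrated with a small Python model. The table and column names follow the example above, while the records and the dynamic-partition routing are simplified assumptions; real Hive lays the same structure out as HDFS directories.

```python
from collections import defaultdict

def load_partitioned(records, table="syslog_incoming_priority"):
    """Route each record into <table>/priority=<value>/, the way a
    dynamic-partition insert lays partitions out as subdirectories."""
    layout = defaultdict(list)
    for facility, data, host, priority in records:
        layout[f"{table}/priority={priority}"].append((facility, data, host))
    return dict(layout)

def query_by_priority(layout, priority, table="syslog_incoming_priority"):
    # partition pruning: only one subdirectory is read, not the whole table
    return layout.get(f"{table}/priority={priority}", [])

rows = [("auth", "2015-09-01", "host1", "err"),
        ("kern", "2015-09-01", "host2", "info"),
        ("auth", "2015-09-02", "host3", "err")]
layout = load_partitioned(rows)
```

This is why a query that filters on the partition column is faster: the cluster only scans the matching subdirectory.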
Step 6: the query analysis results are exported into MySQL through Sqoop, and the foreground display interface presents them to the user as charts.
Referring to Fig. 3, a Hadoop-based k-means cluster analysis method for network security logs comprises the following steps:
1) log data preprocessing: the Syslog_incoming_mes file holding the text of the log description information is converted into a text vector file;
2) realization of the MapReduce-based k-means algorithm: the k-means clustering algorithm is run on the text vectors.
The log data preprocessing comprises the following steps:
1) removing function words: words without real meaning are removed from the log description text;
2) part-of-speech tagging: the system uses the english-left3words-distsim.tagger tagger to tag each word in the log description;
3) extracting content words: after tagging, the system extracts nouns (NN, NNS, NNP, NNPS), verbs (VB, VBP, VBN, VBD) and adjectives (JJ, JJR, JJS); these words carry the actual meaning and can accurately express the log information;
4) obtaining the frequent dictionary: word frequencies are counted over all records; high-frequency words are representative of the description field, so words whose frequency exceeds a threshold are taken as keyword elements to form the frequent dictionary, which can effectively express the log information;
5) generating the text vector file: each log description field is compared with the frequent dictionary to obtain a keyspace composed of a string of 0s and 1s, and the set of keyspaces constitutes the text vector file.
The realization of the MapReduce-based k-means algorithm comprises the following steps:
Selection of the initial cluster centers. First the data structure of a cluster is given; this class holds the basic information of a cluster, namely its id, its center coordinates and the number of points belonging to it. The type is defined as follows:
public class Cluster implements Writable {
    private int clusterID;      // cluster id
    private long numOfPoints;   // number of points belonging to the cluster
    private Instance center;    // cluster center point information
}
Then k points are randomly selected as the initial cluster centers. The extraction flow is: the cluster center set is initialized empty, and the whole data set is scanned; if the current size of the center set is less than k, the scanned point is added to it, otherwise it replaces a point in the center set with probability 1/(1+k). Through this step, the generated cluster center information is written under the Cluster-0 directory, and the files in this directory serve as globally shared information for the next round of iteration, being added to MapReduce's distributed cache as globally shared data.
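The extraction flow just described can be sketched as follows. The fixed replacement probability 1/(1+k) follows the patent's wording (a textbook reservoir sample would instead use a probability that shrinks with the number of points seen), and the seeded RNG is only there to make the sketch reproducible.

```python
import random

def pick_initial_centers(points, k, seed=0):
    rng = random.Random(seed)
    centers = []
    for p in points:
        if len(centers) < k:
            centers.append(p)                 # center set not yet full: keep the point
        elif rng.random() < 1.0 / (1 + k):
            centers[rng.randrange(k)] = p     # replace a random held center
    return centers

points = [(float(i), 0.0) for i in range(100)]
centers = pick_initial_centers(points, k=3)
```

In the patented system the resulting centers are serialized under Cluster-0 and pushed to every node via the distributed cache; here they are simply returned.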
Iterative computation of the cluster centers. This stage needs several iterations; before starting, each map node first reads, in its setup() method, the information of the clusters generated in the previous round of iteration. It comprises the following steps:
1) reading in the initial cluster centers: the initial cluster center data shared by all nodes are read from the shared cache;
2) realization of the map method: for each incoming data point the map method finds the nearest cluster center and emits the cluster id as key and the data point as value, indicating that the point belongs to the cluster with that id;
3) realization of the combiner: to reduce the network transfer overhead, a combiner is used at the map side to merge the results generated there, which relieves both the data transfer from map to reduce and the computation at the reduce side; the key and value types output by the combiner must be the same as the key and value types output by map. In the reduce program, the temporary center of the points belonging to the same cluster is computed from their information; here this is realized by simple averaging, i.e. all points in the cluster are added and divided by the number of points the cluster currently contains;
4) realization of the Reducer: the Reduce stage does roughly the same as the Combiner, further merging the Combiner's output.
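The type contract stressed in steps 3) and 4) — the combiner must emit the same key/value types as map — can be made concrete with a plain-Python stand-in for the Java classes: map emits (sum-vector, count = 1) so that combiner and reducer both merge values of one uniform type.

```python
def map_emit(point, cid):
    # map output: key = cluster id, value = (sum vector, count), with count 1
    return cid, (list(point), 1)

def merge_values(values):
    # the shared combiner/reducer merge step: fold (sum, count) pairs of one key;
    # input and output have the same value type, so it is legal as a combiner
    total = [0.0] * len(values[0][0])
    count = 0
    for s, n in values:
        total = [a + b for a, b in zip(total, s)]
        count += n
    return total, count

def new_center(values):
    # reducer only: average the merged sums into the new cluster center
    total, count = merge_values(values)
    return [v / count for v in total]

emits = [map_emit((1.0, 2.0), 0), map_emit((3.0, 4.0), 0)]
values = [v for _, v in emits]
```

Because `merge_values` is associative, Hadoop may apply it zero or more times on the map side without changing the reducer's result, which is exactly why the combiner saves transfer without affecting correctness.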
The cluster center computation step is repeated until the obtained cluster centers no longer change.
The data are partitioned according to the final cluster centers: once the final centers are obtained, the whole data set is scanned and each data point is assigned to its nearest cluster center.
Embodiment:
First the Hadoop distributed cluster environment is built from 5 PCs: one master server and four slave servers. Hadoop is configured on every machine, and Sqoop, Hive and MySQL are installed and configured on the NameNode. This embodiment uses the log records of all security devices of the Shaanxi Li An electricity supermarkets, a file of 16 GB. The logs are updated daily on a timer as required, and the statistical query results of the business are refreshed.
The method realizes fast statistical queries through Hive. Its advantages are a low learning cost: simple MapReduce statistics can be implemented quickly through SQL-like statements without developing dedicated MapReduce applications, which suits the statistical analysis of a data warehouse very well, and partitioning speeds up queries over data slices and improves retrieval efficiency. The k-means algorithm is realized through MapReduce, and a security-grade assessment is made on the output of the k-means clustering: prompt warnings are given for alarms whose dangerous-grade proportion within the same IP is large, so that the potential value of the log data is excavated.
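The per-IP warning rule mentioned at the end can be illustrated as below. The field names, cluster labels and the 0.5 threshold are assumptions for the sketch, not values fixed by the patent; the inputs stand for clustered alarm records keyed by source IP.

```python
from collections import defaultdict

def flag_risky_ips(alarms, dangerous_clusters, threshold=0.5):
    """Flag IPs whose share of alarms falling in a dangerous cluster
    exceeds the threshold; alarms is a list of (ip, cluster_label)."""
    per_ip = defaultdict(lambda: [0, 0])        # ip -> [dangerous, total]
    for ip, cluster in alarms:
        per_ip[ip][1] += 1
        if cluster in dangerous_clusters:
            per_ip[ip][0] += 1
    return sorted(ip for ip, (d, t) in per_ip.items() if d / t > threshold)

alarms = [("10.0.0.5", "bruteforce"), ("10.0.0.5", "bruteforce"),
          ("10.0.0.5", "info"), ("10.0.0.9", "info")]
risky = flag_risky_ips(alarms, {"bruteforce"})
```

The flagged list is what the foreground display layer would surface as an early-warning prompt.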

Claims (4)

1. A Hadoop-based k-means cluster analysis system for network security logs, characterized by comprising a log data acquisition subsystem (11), a hybrid-mechanism log data storage management subsystem (12), and a log data analysis subsystem (13);
the log data acquisition subsystem (11) collects the network security log data of all devices;
the hybrid-mechanism log data storage management subsystem (12) manages and stores all log data;
the log data analysis subsystem (13) performs fast retrieval and analysis on all log data and mines the potential value of the log data;
the log data acquisition subsystem (11) runs under Linux: a Syslogd centralized log server is configured, the syslog data of the recording devices are collected by syslog, and the log data are managed centrally;
the file access process of the HDFS distributed file system module is:
1) the application program sends a file name to the NameNode through the HDFS client program;
2) after the NameNode receives the file name, it retrieves the data blocks corresponding to the file name in the HDFS directory, finds the DataNode addresses holding those blocks according to the block information, and sends these addresses back to the client;
3) after receiving the DataNode addresses, the client carries out data transfer operations with these DataNodes concurrently and at the same time submits the relevant operation log to the NameNode;
in the Hadoop-platform/traditional-data-warehouse cooperation module, a MySQL database is configured as the metadata store of Hive, used to hold Hive's schema (table structure) information, and the Sqoop tool realizes data transfer between the traditional data warehouse and the Hadoop platform;
the log data analysis subsystem (13) mainly performs simple statistical query analysis of the data with the tool Hive: HiveQL query statements are written according to demand, and under the drive of the Hive Driver the lexical analysis, syntactic analysis, compilation, optimization and query-plan generation of the HiveQL statements are completed; the generated query plan is stored in HDFS and then called and executed by MapReduce; for the analysis of potential information in the log data, the corresponding algorithms are realized by writing MapReduce programs that mine its inherent value.
2. The Hadoop-based k-means cluster analysis system for network security logs according to claim 1, characterized in that the hybrid-mechanism log data storage management subsystem (12) integrates the Hadoop platform with the traditional data warehouse and comprises an HDFS distributed file system module and a Hadoop-platform/traditional-data-warehouse cooperation module.
3. A Hadoop-based k-means cluster analysis system for network security logs, characterized in that, with a Hadoop cluster as the platform, the traditional data warehouse and the big data platform are integrated: a MySQL database serves as the metadata store of Hive, holding Hive's schema (table structure) information, and the Sqoop tool realizes data transfer between the traditional data warehouse and the big data platform; the system comprises a data source layer (21), a data storage layer (22), a computation layer (23), a data analysis layer (24) and a result display layer (25);
the data source layer (21) acquires the log data of all devices through the configured Syslogd centralized log server, and the log data are then imported into the data storage layer from the traditional data warehouse by the Sqoop tool;
the data storage layer (22) is the hybrid storage architecture in which Hadoop cooperates with the traditional data warehouse: data are imported into HDFS through the Sqoop transfer tool, the imported data are processed, and the processed data are loaded into the corresponding tables in Hive;
at the result display layer (25), the user issues requests to the data analysis layer;
the data analysis layer (24) converts the user's request into the corresponding HiveQL statement and, under the drive of the Hive Driver, completes the execution of the operation;
the computation layer (23) receives instructions from the Hive engine and, through the HDFS of the data storage layer in cooperation with MapReduce, realizes the processing and analysis of the data, finally returning the results to the result display layer.
4. A method using the Hadoop-based network security log k-means cluster analysis system of claim 1, characterized by comprising the following steps:
1) Log data preprocessing: convert the Syslog_incoming_mes file containing the text of the log description information into a text vector file;
2) Implementation of the k-means algorithm based on MapReduce: run the k-means clustering algorithm on the text vectors;
The log data preprocessing comprises the following steps:
1) Remove function words: words without actual meaning are removed from the log description text;
2) Part-of-speech tagging: the system uses the english-left3words-distsim.tagger tagger to tag the words in each log description;
3) Extract content words: after tagging, the system extracts the nouns (NN, NNS, NNP, NNPS), verbs (VB, VBP, VBN, VBD) and adjectives (JJ, JJR, JJS); these words all carry actual meaning and can accurately express the log information;
4) Obtain the frequent dictionary: word frequencies are counted over all records; high-frequency words are representative of the description field, and words whose frequency exceeds a threshold are selected as keyword elements to form the frequent dictionary, which can effectively express the log information;
5) Generate the text vector file: each log description field is compared against the frequent dictionary to obtain a keyspace composed of a sequence of 0s and 1s; the set of keyspaces constitutes the text vector file;
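The five preprocessing steps above can be sketched in plain Python. This is a hedged sketch: the stop-word list, the frequency threshold, and the sample log descriptions are all invented for illustration, and a simple whitespace split plus stop-word filter stands in for the Stanford english-left3words-distsim.tagger POS filtering that the claim specifies:

```python
from collections import Counter

STOP_WORDS = {"the", "a", "for", "from", "on", "is"}  # hypothetical function-word list

def tokens(description):
    # Steps 1-3 condensed: lowercase, split, drop function words.
    # (The patent filters by POS tag instead, which requires the Stanford tagger.)
    return [w for w in description.lower().split() if w not in STOP_WORDS]

def frequent_dictionary(descriptions, threshold):
    # Step 4: count word frequency over all records; keep words at/above threshold.
    counts = Counter(w for d in descriptions for w in tokens(d))
    return sorted(w for w, c in counts.items() if c >= threshold)

def to_vector(description, dictionary):
    # Step 5: the 0/1 "keyspace" -- 1 if the dictionary word occurs in the record.
    present = set(tokens(description))
    return [1 if w in present else 0 for w in dictionary]

logs = [
    "failed login from host",       # sample log descriptions (invented)
    "failed password for user",
    "login succeeded for user",
]
dictionary = frequent_dictionary(logs, threshold=2)   # -> ['failed', 'login', 'user']
vectors = [to_vector(d, dictionary) for d in logs]    # the text vector file
```

The resulting list of 0/1 vectors is what the k-means stage below consumes.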
The implementation of the k-means algorithm based on MapReduce comprises the following steps:
1) Scan all points in the original data set and randomly select k points as the initial cluster centers;
2) Each Map node reads its local data set and generates cluster sets with the k-means algorithm; in the Reduce stage, the several cluster sets are merged to generate new global cluster centers; this process is repeated until the termination condition is met;
3) Partition all data elements into clusters according to the finally generated cluster centers.
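One iteration of the MapReduce scheme in steps 1)-3) can be sketched in plain Python. This is a hedged sketch, not the patent's implementation: each "map node" assigns its local points to the nearest center and emits (center index, partial vector sum, count), and the reduce step averages the merged partial sums into new global centers. The toy 2-D points and k=2 are invented for illustration; the patent applies this to the 0/1 text vectors:

```python
import math
from collections import defaultdict

def nearest(point, centers):
    # Index of the closest cluster center (Euclidean distance).
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))

def map_phase(split, centers):
    # Map: a node processes its local split, emitting per-center partial
    # results of the form (vector_sum, point_count).
    partial = defaultdict(lambda: ([0.0] * len(centers[0]), 0))
    for p in split:
        i = nearest(p, centers)
        s, n = partial[i]
        partial[i] = ([a + b for a, b in zip(s, p)], n + 1)
    return partial

def reduce_phase(partials, centers):
    # Reduce: merge the partial sums from all map nodes and average them
    # into the new global cluster centers.
    total = defaultdict(lambda: ([0.0] * len(centers[0]), 0))
    for partial in partials:
        for i, (s, n) in partial.items():
            ts, tn = total[i]
            total[i] = ([a + b for a, b in zip(ts, s)], tn + n)
    return [[x / n for x in s] if n else list(centers[i])
            for i, (s, n) in sorted(total.items())]

splits = [[(0.0, 0.0), (0.0, 1.0)], [(5.0, 5.0), (6.0, 5.0)]]  # two "map nodes"
centers = [(0.0, 0.0), (5.0, 5.0)]                             # k = 2 initial centers
centers = reduce_phase([map_phase(s, centers) for s in splits], centers)
print(centers)  # [[0.0, 0.5], [5.5, 5.0]]
```

In a real job this map/reduce pair would be rerun until the centers stop moving (the termination condition of step 2), after which every point is assigned to its final cluster as in step 3.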
CN201510553636.3A 2015-09-02 2015-09-02 A kind of network security daily record k-means cluster analysis systems and method based on Hadoop Expired - Fee Related CN105138661B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510553636.3A CN105138661B (en) 2015-09-02 2015-09-02 A kind of network security daily record k-means cluster analysis systems and method based on Hadoop


Publications (2)

Publication Number Publication Date
CN105138661A CN105138661A (en) 2015-12-09
CN105138661B true CN105138661B (en) 2018-10-30

Family

ID=54724008

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510553636.3A Expired - Fee Related CN105138661B (en) 2015-09-02 2015-09-02 A kind of network security daily record k-means cluster analysis systems and method based on Hadoop

Country Status (1)

Country Link
CN (1) CN105138661B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106294580A (en) * 2016-07-28 2017-01-04 武汉虹信技术服务有限责任公司 LTE network MR data analysing method based on HADOOP platform

Families Citing this family (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105608203B (en) * 2015-12-24 2019-09-17 Tcl集团股份有限公司 A kind of Internet of Things log processing method and device based on Hadoop platform
US9973521B2 (en) * 2015-12-28 2018-05-15 International Business Machines Corporation System and method for field extraction of data contained within a log stream
CN105824892A (en) * 2016-03-11 2016-08-03 广东电网有限责任公司电力科学研究院 Method for synchronizing and processing data by data pool
CN106168965B (en) * 2016-07-01 2020-06-30 竹间智能科技(上海)有限公司 Knowledge graph construction system
CN107579944B (en) * 2016-07-05 2020-08-11 南京联成科技发展股份有限公司 Artificial intelligence and MapReduce-based security attack prediction method
CN107958022A (en) * 2017-11-06 2018-04-24 余帝乾 A kind of method that Web log excavates
CN107943647A (en) * 2017-11-21 2018-04-20 北京小度互娱科技有限公司 A kind of reliable distributed information log collection method and system
CN108133043B (en) * 2018-01-12 2022-07-29 福建星瑞格软件有限公司 Structured storage method for server running logs based on big data
CN110135184B (en) * 2018-02-09 2023-12-22 中兴通讯股份有限公司 Method, device, equipment and storage medium for desensitizing static data
CN108446568B (en) * 2018-03-19 2021-04-13 西北大学 Histogram data publishing method for trend analysis differential privacy protection
CN110581873B (en) * 2018-06-11 2022-06-14 中国移动通信集团浙江有限公司 Cross-cluster redirection method and monitoring server
CN108933785B (en) * 2018-06-29 2021-02-05 平安科技(深圳)有限公司 Network risk monitoring method and device, computer equipment and storage medium
CN109254903A (en) * 2018-08-03 2019-01-22 挖财网络技术有限公司 A kind of intelligentized log analysis method and device
CN109446042B (en) * 2018-10-12 2021-12-14 安徽南瑞中天电力电子有限公司 Log management method and system for intelligent electric equipment
CN109766368B (en) * 2018-11-14 2021-08-27 国云科技股份有限公司 Hive-based data query multi-type view output system and method
CN109525593B (en) * 2018-12-20 2022-02-22 中科曙光国际信息产业有限公司 Centralized safety management and control system and method for hadoop big data platform
CN110069551A (en) * 2019-04-25 2019-07-30 江南大学 Medical Devices O&M information excavating analysis system and its application method based on Spark
CN110378550A (en) * 2019-06-03 2019-10-25 东南大学 The processing method of the extensive food data of multi-source based on distributed structure/architecture
CN112306787B (en) * 2019-07-24 2022-08-09 阿里巴巴集团控股有限公司 Error log processing method and device, electronic equipment and intelligent sound box
CN111259068A (en) * 2020-04-28 2020-06-09 成都四方伟业软件股份有限公司 Data development method and system based on data warehouse
CN111913954B (en) * 2020-06-20 2023-08-04 杭州城市大数据运营有限公司 Intelligent data standard catalog generation method and device
CN111966293A (en) * 2020-08-18 2020-11-20 北京明略昭辉科技有限公司 Cold and hot data analysis method and system
CN112380348B (en) * 2020-11-25 2024-03-26 中信百信银行股份有限公司 Metadata processing method, apparatus, electronic device and computer readable storage medium
CN113238912B (en) * 2021-05-08 2022-12-06 国家计算机网络与信息安全管理中心 Aggregation processing method for network security log data
CN113641750A (en) * 2021-08-20 2021-11-12 广东云药科技有限公司 Enterprise big data analysis platform
CN116737846A (en) * 2023-05-31 2023-09-12 深圳华夏凯词财富管理有限公司 Asset management data safety protection warehouse system based on Hive

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102999633A (en) * 2012-12-18 2013-03-27 北京师范大学珠海分校 Cloud cluster extraction method of network information
CN103399887A (en) * 2013-07-19 2013-11-20 蓝盾信息安全技术股份有限公司 Query and statistical analysis system for mass logs
CN103425762A (en) * 2013-08-05 2013-12-04 南京邮电大学 Telecom operator mass data processing method based on Hadoop platform
CN103544328A (en) * 2013-11-15 2014-01-29 南京大学 Parallel k mean value clustering method based on Hadoop
CN104616205A (en) * 2014-11-24 2015-05-13 北京科东电力控制系统有限责任公司 Distributed log analysis based operation state monitoring method of power system


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Design of a Web log analysis scheme based on Hadoop and K-means; Fu Wei et al.; Proceedings of the 19th National Youth Communication Academic Conference; 2014-10-15; p. 171 left column, section 5.2 "Parallelization of the K-means algorithm based on Hadoop" to p. 173 left column, section 5.3.3 "Session identification", Fig. 2 *
Application of big data technology in network security analysis; Wang Shuai et al.; Telecommunications Science; July 2015 (No. 7); p. 2015176-3 right column paragraph 2 to p. 2015176-5 left column paragraph 4, Fig. 1 *


Also Published As

Publication number Publication date
CN105138661A (en) 2015-12-09

Similar Documents

Publication Publication Date Title
CN105138661B (en) A kind of network security daily record k-means cluster analysis systems and method based on Hadoop
CN105183834B (en) A kind of traffic big data semantic applications method of servicing based on ontology library
CN114399006B (en) Multi-source abnormal composition image data fusion method and system based on super-calculation
CN103430144A (en) Data source analytics
Mohammed et al. A review of big data environment and its related technologies
CN102917009B (en) A kind of stock certificate data collection based on cloud computing technology and storage means and system
CN104133858A (en) Intelligent double-engine analysis system and intelligent double-engine analysis method based on column storage
CN111026874A (en) Data processing method and server of knowledge graph
Das et al. A study on big data integration with data warehouse
CN109783484A (en) The construction method and system of the data service platform of knowledge based map
Li et al. The overview of big data storage and management
Chen et al. Metadata-based information resource integration for research management
Banaei et al. Hadoop and its role in modern image processing
CN103226608A (en) Parallel file searching method based on folder-level telescopic Bloom Filter bit diagram
CN115858513A (en) Data governance method, data governance device, computer equipment and storage medium
Shakhovska et al. Big Data Model "Entity and Features"
Li et al. Survey of recent research progress and issues in big data
Arputhamary et al. A review on big data integration
Chen et al. Related technologies
Suciu et al. Big data technology for scientific applications
Jo et al. Constructing national geospatial big data platform: current status and future direction
Ediger et al. Real-time streaming intelligence: Integrating graph and nlp analytics
Yu et al. A police big data analytics platform: Framework and implications
CN113590651A (en) Cross-cluster data processing system and method based on HQL
Pan et al. An open sharing pattern design of massive power big data

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20181030

Termination date: 20200902
