CN105138661B - A Hadoop-based k-means cluster analysis system and method for network security logs - Google Patents
A Hadoop-based k-means cluster analysis system and method for network security logs
- Publication number
- CN105138661B (application CN201510553636.3A)
- Authority
- CN
- China
- Prior art keywords
- data
- log
- hadoop
- analysis
- hive
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2457—Query processing with adaptation to user needs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/25—Integrating or interfacing systems involving database management systems
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A Hadoop-based k-means cluster analysis system and method for network security logs, comprising a log data acquisition subsystem, a hybrid-storage log management subsystem, and a log data analysis subsystem. At the data storage layer, the log data are held in a hybrid storage mechanism in which Hadoop cooperates with a traditional data warehouse, and the data access layer exposes an interface for operating on Hive; the storage and computation layers receive instructions from the Hive engine and, through HDFS together with MapReduce, perform efficient query analysis over the data. For mining analysis of the log data, a k-means algorithm implemented on MapReduce produces the cluster analysis results. The architecture in which Hadoop cooperates with the traditional warehouse compensates for the warehouse's weaknesses in massive-data processing and storage while still putting the existing warehouse to full use, and the MapReduce-based k-means clustering permits timely security-grade assessment of, and early warning on, the log data.
Description
Technical field
The invention belongs to the field of computer information processing, and in particular relates to a Hadoop-based k-means cluster analysis system and method for network security logs.
Background art
With the explosion of data, the volume of information is increasing sharply, and the traditional data warehouses that enterprises already operate can no longer keep up with the rate of data growth. Traditional warehouses are generally built on high-performance integrated appliances, which are costly and scale poorly, and they are only good at processing structured data; this limits their ability to mine the inherent value of massive heterogeneous data, and it is the biggest difference between Hadoop and traditional data processing. The existing warehouse should therefore be used rationally: by combining it with a big data platform under a unified data analysis and processing architecture, the cooperation of Hadoop with the traditional warehouse enables monitoring and statistical analysis of network logs.
Hadoop is an open-source distributed computing platform managed by the Apache organization — a software framework for the distributed processing of massive data. With the Hadoop Distributed File System (HDFS) and MapReduce at its core, Hadoop provides users with a distributed infrastructure that is transparent to the low-level details of the system. The high fault tolerance, high scalability, high availability and high throughput of HDFS allow users to deploy Hadoop on inexpensive hardware to form a distributed system, while the MapReduce distributed programming model lets users develop parallel applications without understanding the low-level details of the distributed system.
HDFS is the foundation of data storage management in distributed computing, developed from the need to access and process very large files in a streaming fashion. It provides failure-tolerant storage for massive data and greatly eases the processing of very large data sets. HDFS has a master/slave architecture with two classes of nodes: the NameNode, or "metadata node", and the DataNodes, or "data nodes", which respectively play the Master role and the Worker role that executes specific tasks. Because storage is distributed, an HDFS cluster has one NameNode and many DataNodes: the metadata node manages the namespace of the file system, and the data nodes are where file data is actually stored.
The MapReduce parallel computing framework is an execution system for parallelized programs. It provides a processing model with two stages, Map and Reduce, takes its input as key-value pairs, and automatically handles the partitioning and scheduling of the data. During execution, the framework schedules and allocates computing resources, partitions the input and output data, schedules program execution, monitors execution state, and takes care of synchronizing the compute nodes and collecting the intermediate results.
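The two-stage key-value model described above can be illustrated with a minimal in-memory sketch (plain Python standing in for the Hadoop framework; the word-count job and all function names are illustrative, not from the patent):

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every input record, collecting (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs

def shuffle(pairs):
    """Group values by key, as the framework does between Map and Reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Apply the reducer to each key group."""
    return {key: reducer(key, values) for key, values in groups.items()}

# A word-count job expressed in this model.
lines = ["error login failed", "error timeout", "login ok"]
mapper = lambda line: [(word, 1) for word in line.split()]
reducer = lambda key, values: sum(values)
counts = reduce_phase(shuffle(map_phase(lines, mapper)), reducer)
print(counts["error"])  # → 2
```

The framework's real value is that the shuffle and the scheduling of map/reduce tasks across nodes happen automatically; the user only supplies the two functions.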
Sqoop is a tool for fast bulk data exchange between relational databases and the Hadoop platform. It can import batch data from a relational database into the HDFS or Hive of Hadoop, and conversely export data from the Hadoop platform into a relational database.
Hive is a data warehouse built on Hadoop for managing the structured and semi-structured data stored in HDFS. As its programming interface it offers HiveQL, an SQL-like query language in which data query analysis programs are written directly; it supplies the persistence architecture, storage management and query analysis functions a data warehouse requires, and at the bottom layer HiveQL statements are translated into the corresponding MapReduce programs for execution.
Summary of the invention
To overcome the above deficiencies of the prior art, the object of the present invention is to provide a Hadoop-based k-means cluster analysis system and method for network security logs which, while making rational use of the traditional data warehouse already in place, integrates a big data platform into it and establishes a unified data storage and processing architecture, overcoming the traditional warehouse's poor scalability, its restriction to structured data, and its inability to mine the inherent value of massive heterogeneous data.
To achieve the above object, the technical solution adopted by the present invention is a Hadoop-based k-means cluster analysis system for network security logs, comprising a log data acquisition subsystem, a hybrid-storage log management subsystem, and a log data analysis subsystem;
the log data acquisition subsystem collects the network security log data of all devices;
the hybrid-storage log management subsystem manages and stores all log data;
the log data analysis subsystem performs fast query analysis over all log data and mines the potential value of the log data.
The log data acquisition subsystem runs under Linux, where Syslogd is configured as a centralized log server; equipment records and syslog data are collected via the syslog protocol, and the log data are managed centrally.
The hybrid-storage log management subsystem integrates the Hadoop platform with the traditional data warehouse and comprises an HDFS distributed file system module and a Hadoop/traditional-warehouse collaboration module.
The log data analysis subsystem mainly uses the Hive tool for simple statistical query analysis of the data: HiveQL query statements are written as required and, driven by the Hive Driver, undergo lexical analysis, syntactic analysis, compilation, optimization and query-plan generation; the generated query plan is stored in HDFS and subsequently invoked and executed by MapReduce. For analysing the latent information in the log data, the corresponding algorithms are implemented as hand-written MapReduce programs that mine its inherent value.
The file access procedure of the HDFS distributed file system module is:
1) the application program passes the file name to the NameNode through the HDFS client program;
2) after the NameNode receives the file name, it looks up the data blocks corresponding to that name in the HDFS directory, then uses the block information to find the DataNode addresses holding those blocks and returns these addresses to the client;
3) after the client receives the DataNode addresses, it transfers data with those DataNodes concurrently, while submitting the relevant operation log to the NameNode.
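The three-step read path can be sketched as a toy simulation (dictionaries stand in for the NameNode's block map and for the DataNodes; all file names, block ids and addresses are illustrative):

```python
# Toy model of the HDFS read path: the namenode maps a file name to
# block IDs, and each block ID to the datanode addresses holding replicas.
namenode = {
    "blocks": {"/logs/syslog.txt": ["blk_1", "blk_2"]},
    "locations": {"blk_1": ["dn1:50010", "dn2:50010"],
                  "blk_2": ["dn2:50010", "dn3:50010"]},
}
datanodes = {
    ("dn1:50010", "blk_1"): b"Aug 1 host1 sshd: failed login\n",
    ("dn2:50010", "blk_2"): b"Aug 1 host2 kernel: link up\n",
}

def open_file(filename):
    """Steps 1-2: ask the namenode for the block list and replica addresses."""
    block_ids = namenode["blocks"][filename]
    return [(bid, namenode["locations"][bid]) for bid in block_ids]

def read_file(filename):
    """Step 3: fetch each block from the first replica that actually has it."""
    data = b""
    for block_id, addresses in open_file(filename):
        for addr in addresses:
            chunk = datanodes.get((addr, block_id))
            if chunk is not None:
                data += chunk
                break
    return data

print(read_file("/logs/syslog.txt").decode())
```

The point of the design shows even in the toy: the namenode only hands out locations, and the (potentially parallel) block transfers involve the client and the datanodes alone.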
In the Hadoop/traditional-warehouse collaboration module, a MySQL database is configured as Hive's metadata store, holding Hive's schema (table structure) information, and the Sqoop tool carries out the data transfers between the traditional data warehouse and the Hadoop platform.
With a Hadoop cluster as the platform, the traditional data warehouse and the big data platform are integrated: MySQL serves as Hive's metadata store holding Hive's schema information, and Sqoop transfers data between the traditional warehouse and the big data platform. The architecture comprises a data source layer, a data storage layer, a computation layer, a data analysis layer, and a result display layer;
the data source layer collects the log data on all devices through the Syslogd centralized log server, after which the log data are imported from the traditional data warehouse into the data storage layer by the Sqoop tool;
the data storage layer uses the hybrid storage architecture in which Hadoop cooperates with the traditional warehouse: the Sqoop transfer tool imports data into HDFS, the metadata are processed, and the processed data are loaded into the corresponding Hive tables;
at the result display layer, the user issues requests to the data analysis layer;
the data analysis layer converts the user's request into the corresponding HiveQL statement and, driven by the Hive Driver, carries out the execution;
the computation layer receives instructions from the Hive engine and, through the HDFS of the storage layer together with MapReduce, performs the processing and analysis of the data, finally returning the result to the display layer.
A Hadoop-based k-means cluster analysis method for network security logs, characterized by comprising the following steps:
1) log data preprocessing: the textual content of the log description information in the Syslog_incoming_mes file is converted into a text vector file;
2) MapReduce-based k-means: the k-means clustering algorithm is run on the text vectors.
The log data preprocessing comprises the following steps:
1) remove function words: words without real meaning are removed from the log description text;
2) tag parts of speech: the system tags every word in the log descriptions with the english-left3words-distsim.tagger tagger;
3) extract content words: after tagging, the system extracts the nouns (NN, NNS, NNP, NNPS), verbs (VB, VBP, VBN, VBD) and adjectives (JJ, JJR, JJS); these words carry the actual meaning and can express the log information accurately;
4) build the frequent dictionary: term frequencies are computed over all records; the high-frequency words are the most descriptive in this field, so words above a threshold are taken as keyword elements and selected into a frequent dictionary, which can express the log information effectively;
5) generate the text vector file: comparing each log description field against the frequent dictionary yields a keyspace composed of a string of 0s and 1s, and the set of these keyspaces constitutes the text vector file.
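Steps 4) and 5) above can be sketched as follows (a minimal illustration; the threshold value, the sample messages and the function names are assumptions, and a simple whitespace split stands in for the part-of-speech tagging of steps 2) and 3)):

```python
from collections import Counter

def build_frequent_dictionary(descriptions, threshold):
    """Step 4: keep the words whose corpus frequency exceeds the threshold."""
    counts = Counter(word for text in descriptions for word in text.split())
    # Fixed (sorted) order so every vector has the same layout.
    return sorted(w for w, c in counts.items() if c > threshold)

def to_vector(description, dictionary):
    """Step 5: one 0/1 slot per dictionary word (the 'keyspace')."""
    words = set(description.split())
    return [1 if w in words else 0 for w in dictionary]

descriptions = [
    "failed password for root",
    "failed password for admin",
    "session opened for root",
]
dictionary = build_frequent_dictionary(descriptions, threshold=1)
vectors = [to_vector(d, dictionary) for d in descriptions]
print(dictionary)   # → ['failed', 'for', 'password', 'root']
print(vectors[0])   # → [1, 1, 1, 1]
```

Each log description thus becomes a fixed-length binary vector, which is exactly the input shape the k-means stage expects.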
The MapReduce-based k-means implementation comprises the following steps:
1) all points of the original data set are scanned and k points are chosen at random as the initial cluster centers;
2) each Map node reads its local data set and generates a cluster set with the k-means algorithm; in the Reduce stage the several cluster sets are merged into new global cluster centers, and this process repeats until the termination condition is met;
3) all data elements are partitioned into clusters according to the finally generated cluster centers.
The advantages of the technical solution of the present invention are mainly reflected in:
1) with Hadoop supporting the existing traditional data warehouse, a unified data storage and processing architecture is established that compensates for the warehouse's weaknesses in massive-data processing and storage while still putting the original warehouse to full use;
2) as the data keep growing, more cluster resources are needed to process them, and Hadoop is an easily extended system: simply configuring a new node is enough to extend the cluster and raise its computing power;
3) for the heterogeneous data in massive network logs, the metadata are first processed with MapReduce and the processed data are loaded into the corresponding Hive tables; HiveQL statements written on demand then perform simple query analysis of the data, and the MapReduce implementation of the k-means algorithm performs the mining analysis. Query analysis efficiency improves and the potential value of the data is excavated.
Description of the drawings
Fig. 1 is the functional block diagram of the system of the present invention.
Fig. 2 is the architecture diagram of the log analysis system of the present invention, in which Hadoop cooperates with the traditional data warehouse.
Fig. 3 is the research framework of the k-means clustering algorithm of the present invention for log data.
Detailed description of the embodiments
The technical scheme of the present invention is described in detail below with reference to the embodiments and the accompanying drawings, but is not limited thereto.
Referring to Fig. 1, a Hadoop-based k-means cluster analysis system for network security logs comprises a log data acquisition subsystem 11, a hybrid-storage log management subsystem 12, and a log data analysis subsystem 13;
the log data acquisition subsystem 11 collects the network security log data of all devices;
the hybrid-storage log management subsystem 12 manages and stores all log data;
the log data analysis subsystem 13 performs fast query analysis over all log data and mines the potential value of the log data.
The workflow of the system modules is as follows:
Step 1: log data acquisition. Syslogd is configured as a centralized log server using UDP as the transport protocol; the log management configuration of every security device forwards, through the destination port, to the log server on which the Syslog software is installed, and the log server receives the log data and writes them automatically into log files;
Step 2: the table syslog_incoming holding the log information in MySQL is imported into HDFS using Sqoop, with the command:
sqoop import --connect jdbc:mysql://219.245.31.39:3306/syslog --username sqoop --password sqoop --table syslog_incoming -m 1
Sqoop imports a table from MySQL by means of a MapReduce job that extracts the records from the table row by row and writes them to HDFS. The namenode of the cluster keeps the placement information of the data and tells it to the client; once the client has the location information it starts writing: the data are split into blocks, stored as several replicas, and placed on different datanodes. The client first writes a block to the first node; while the first node is receiving the data it pushes what it has received to the second node, the second pushes to the third, and so on.
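The pipelined replication at the end of Step 2 can be sketched as a toy model (the chunk size, node names and block content are illustrative; the chain forwarding is flattened into a loop):

```python
def pipeline_write(block, nodes, chunk_size=4):
    """Toy model of HDFS pipelined replication: the block travels the chain
    node by node in small chunks, and every node in the chain stores a copy
    of each chunk as it passes through."""
    stores = {node: b"" for node in nodes}
    for i in range(0, len(block), chunk_size):
        chunk = block[i:i + chunk_size]
        for node in nodes:           # the chunk is forwarded down the chain
            stores[node] += chunk    # each node keeps a copy while forwarding
    return stores

replicas = pipeline_write(b"syslog block data", ["dn1", "dn2", "dn3"])
print(sorted(replicas))                         # → ['dn1', 'dn2', 'dn3']
print(replicas["dn3"] == b"syslog block data")  # → True
```

The design choice this illustrates: the client only ever sends each chunk once, to the first datanode, and replication bandwidth is paid between datanodes rather than at the client.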
Step 3: MapReduce programs are written to extract the useful information from the log data imported into HDFS;
Step 4: from the table in the extracted relational data source, a Hive table is generated using Sqoop; the command directly generates the definition of the corresponding Hive table, after which the data stored in HDFS are loaded:
sqoop create-hive-table --connect jdbc:mysql://219.245.31.39:3306/syslog --table syslog_incoming --fields-terminated-by ','
Start Hive and load the data:
load data inpath "syslog_incoming" into table syslog_incoming;
Step 5: according to the business requirements, the corresponding HiveQL statements or MapReduce programs are written and the log data are statistically analysed. The specific steps of the statistical analysis are: the data table is partitioned according to the business requirements, the partitions being defined with a PARTITIONED BY clause when the table is created. The table records are therefore partitioned by severity grade and by time (year, quarter, month) as required; the following example partitions the table records by grade:
hive> create table syslog_incoming_priority (facility string, data string, host string)
    > partitioned by (priority string)
    > row format delimited
    > fields terminated by '\t'
    > stored as textfile;
After the table structure is defined, the data are loaded into the partitioned table:
hive> insert into table syslog_incoming_priority
    > partition (priority)
    > select facility, data, host, priority
    > from syslog_incoming;
At the file system level a partition is a nested subdirectory of the table directory; here the table directory holds several grade partitions, and the data files sit in the bottom-level directories. Finally the HiveQL query statements are written as required, and the cluster converts them into MapReduce tasks and runs them.
Step 6: the query analysis results are imported into MySQL through Sqoop, and the foreground display interface presents them to the user as charts.
Referring to Fig. 3, a Hadoop-based k-means cluster analysis method for network security logs comprises the following steps:
1) log data preprocessing: the textual content of the log description information in the Syslog_incoming_mes file is converted into a text vector file;
2) MapReduce-based k-means: the k-means clustering algorithm is run on the text vectors.
The log data preprocessing comprises the following steps:
1) remove function words: words without real meaning are removed from the log description text;
2) tag parts of speech: the system tags every word in the log descriptions with the english-left3words-distsim.tagger tagger;
3) extract content words: after tagging, the system extracts the nouns (NN, NNS, NNP, NNPS), verbs (VB, VBP, VBN, VBD) and adjectives (JJ, JJR, JJS); these words carry the actual meaning and can express the log information accurately;
4) build the frequent dictionary: term frequencies are computed over all records; words above a threshold are taken as keyword elements and selected into a frequent dictionary, which can express the log information effectively;
5) generate the text vector file: comparing each log description field against the frequent dictionary yields a keyspace composed of a string of 0s and 1s, and the set of these keyspaces constitutes the text vector file.
The MapReduce-based k-means implementation comprises the following steps.
Selection of the initial cluster centers. First the data structure of a cluster is given; this class holds the essential information of a cluster — its id, its center coordinates and the number of points belonging to it — and is defined as follows:
public class Cluster implements Writable {
    private int clusterID;       // cluster id
    private long numOfPoints;    // number of points belonging to the cluster
    private Instance center;     // cluster center point information
}
Then k points are chosen at random as the initial cluster centers. The extraction flow is: the cluster center set is initialized empty and the whole data set is scanned; if the current center set holds fewer than k points, the scanned point is added to it, otherwise the point replaces a member of the center set with probability 1/(1+k). Through this step the generated cluster center information is written to the Cluster-0 directory, whose files serve as the globally shared information of the next iteration round and are added to MapReduce's distributed cache as globally shared data.
Iterative computation of the cluster centers. This stage runs several iterations; before starting, each map node first reads, in its setup() method, the cluster information generated in the previous round. The steps are:
1) read in the initial cluster centers: the initial center data shared by all nodes is read from the distributed cache;
2) implementation of the map method: for each incoming data point the map method finds the nearest cluster center and emits the cluster's id as the key with the point as the value, indicating that the point belongs to the cluster identified by that id;
3) implementation of the combiner: to lighten the network transfer overhead, a combiner merges the map-side results at the map end; this reduces both the map-to-reduce transfer cost and the computation at the reduce end. The key and value types output by the combiner must match those output by map. In the reduce program the provisional center of the points belonging to the same cluster is computed from their information — here implemented as a simple average, i.e. all points of the cluster are summed and divided by the number of points the cluster currently contains;
4) implementation of the Reducer: the Reduce stage does roughly the same as the Combiner, further merging the Combiner's output.
The cluster-center computation is repeated until the obtained centers no longer change.
Division of the data by the final cluster centers. Once the final centers are obtained, the whole data set is scanned again according to them and each data point is assigned to its nearest cluster center.
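The map → combine/reduce iteration and the final division described above can be sketched in plain Python (an in-memory stand-in for the MapReduce implementation; the Euclidean distance metric and the sample points are assumptions, not from the patent):

```python
import math
from collections import defaultdict

def nearest(point, centers):
    """Map step: id of the closest cluster center."""
    return min(range(len(centers)),
               key=lambda i: math.dist(point, centers[i]))

def kmeans_iteration(points, centers):
    """One round: emit (cluster id, point), then average each group."""
    groups = defaultdict(list)           # shuffle: points grouped by center id
    for p in points:                     # map: nearest center per point
        groups[nearest(p, centers)].append(p)
    new_centers = list(centers)
    for cid, members in groups.items():  # reduce: mean of the member points
        dim = len(members[0])
        new_centers[cid] = tuple(sum(m[d] for m in members) / len(members)
                                 for d in range(dim))
    return new_centers

def kmeans(points, centers, max_rounds=20):
    """Repeat until the centers no longer change (the termination condition)."""
    for _ in range(max_rounds):
        new_centers = kmeans_iteration(points, centers)
        if new_centers == centers:
            break
        centers = new_centers
    # final division: each point goes to its nearest center
    return centers, [nearest(p, centers) for p in points]

points = [(0.0, 0.0), (0.0, 1.0), (9.0, 9.0), (10.0, 9.0)]
centers, labels = kmeans(points, centers=[(0.0, 0.0), (9.0, 9.0)])
print(centers)  # → [(0.0, 0.5), (9.5, 9.0)]
print(labels)   # → [0, 0, 1, 1]
```

In the real MapReduce version the per-cluster sums would be computed per map node by the combiner and merged by the reducer; the single-process sketch collapses those two merging steps into one averaging loop.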
Embodiment:
First the Hadoop distributed cluster environment is built from 5 PCs: one master server and four slave servers. Hadoop is configured on every machine, and Sqoop, Hive and MySQL are then installed and configured on the namenode. This embodiment uses the log records of all security devices of the Shaanxi Li'an electricity supermarket, a file of 16 GB in size. The logs are updated on a daily schedule as required, refreshing the statistical query results of the business.
The method achieves fast statistical queries through Hive. Its advantages are a low learning cost — simple MapReduce statistics are produced quickly through SQL-like statements, with no need to develop dedicated MapReduce applications — which suits the statistical analysis of a data warehouse well, and partitioning, which speeds up queries over data fragments and improves retrieval efficiency. The k-means algorithm is implemented through MapReduce and a security-grade assessment is carried out on the output of the k-means clustering; prompts are raised in time for alarms within the same IP whose danger grades take a large proportion, so that the potential value of the log data is excavated.
Claims (4)
1. A Hadoop-based k-means cluster analysis system for network security logs, characterized by comprising a log data acquisition subsystem (11), a hybrid-storage log management subsystem (12), and a log data analysis subsystem (13);
the log data acquisition subsystem (11) collects the network security log data of all devices;
the hybrid-storage log management subsystem (12) manages and stores all log data;
the log data analysis subsystem (13) performs fast query analysis over all log data and mines the potential value of the log data;
the log data acquisition subsystem (11) runs under Linux, where Syslogd is configured as a centralized log server; equipment records and syslog data are collected via the syslog protocol, and the log data are managed centrally;
the file access procedure of the HDFS distributed file system module is:
1) the application program passes the file name to the NameNode through the HDFS client program;
2) after the NameNode receives the file name, it looks up the data blocks corresponding to that name in the HDFS directory, then uses the block information to find the DataNode addresses holding those blocks and returns these addresses to the client;
3) after the client receives the DataNode addresses, it transfers data with those DataNodes concurrently while submitting the relevant operation log to the NameNode;
in the Hadoop/traditional-warehouse collaboration module, a MySQL database is configured as Hive's metadata store, holding Hive's schema (table structure) information, and the Sqoop tool carries out the data transfers between the traditional data warehouse and the Hadoop platform;
the log data analysis subsystem (13) mainly uses the Hive tool for simple statistical query analysis of the data: HiveQL query statements are written as required and, driven by the Hive Driver, undergo lexical analysis, syntactic analysis, compilation, optimization and query-plan generation; the generated query plan is stored in HDFS and subsequently invoked and executed by MapReduce; for analysing the latent information in the log data, the corresponding algorithms are implemented as hand-written MapReduce programs that mine its inherent value.
2. The Hadoop-based k-means cluster analysis system for network security logs according to claim 1, characterized in that the hybrid-storage log management subsystem (12) integrates the Hadoop platform with the traditional data warehouse and comprises an HDFS distributed file system module and a Hadoop/traditional-warehouse collaboration module.
3. A Hadoop-based k-means cluster analysis system for network security logs, characterized in that, with a Hadoop cluster as the platform, the traditional data warehouse and the big data platform are integrated, a MySQL database serves as Hive's metadata store holding Hive's schema information, and the Sqoop tool transfers data between the traditional warehouse and the big data platform; the system comprises a data source layer (21), a data storage layer (22), a computation layer (23), a data analysis layer (24) and a result display layer (25);
the data source layer (21) collects the log data on all devices through the Syslogd centralized log server, after which the log data are imported from the traditional data warehouse into the data storage layer by the Sqoop tool;
the data storage layer (22) uses the hybrid storage architecture in which Hadoop cooperates with the traditional warehouse: the Sqoop transfer tool imports data into HDFS, the metadata are processed, and the processed data are loaded into the corresponding Hive tables;
at the result display layer (25), the user issues requests to the data analysis layer;
the data analysis layer (24) converts the user's request into the corresponding HiveQL statement and, driven by the Hive Driver, carries out the execution;
the computation layer (23) receives instructions from the Hive engine and, through the HDFS of the storage layer together with MapReduce, performs the processing and analysis of the data, finally returning the result to the display layer.
4. utilizing the method for hierarchial-cluster analysis described in claim 1, which is characterized in that include the following steps:
1)Daily record data pre-processes, and the Syslog_incoming_mes files of the content of text of conversion log description information are one
A text vector file;
2)The realization of k-means algorithms based on MapReduce runs k-means clustering algorithms on text vector;
The log data preprocessing comprises the following steps:
1) Removing function words: words without substantive meaning are removed from the log description text;
2) Part-of-speech tagging: the system uses the english-left3words-distsim.tagger tagger to tag the words in every log description;
3) Keyword extraction: after tagging, the system extracts the nouns (NN, NNS, NNP, NNPS), verbs (VB, VBP, VBN, VBD) and adjectives (JJ, JJR, JJS); these words all carry actual meaning and can accurately express the log information;
4) Building the frequent dictionary: word frequencies are counted over all records; high-frequency words are representative of the description field, and the words whose frequency exceeds a threshold are selected as keyword elements of the frequent dictionary, which can effectively express the log information;
5) Generating the text vector file: each log description field is compared against the frequent dictionary to obtain a keyspace composed of a series of 0s and 1s; the set of all keyspaces constitutes the text vector file;
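Steps 4) and 5) above can be sketched in a few lines. This is an illustrative toy, assuming a frequency threshold of 2 and whitespace-tokenized, already-filtered log descriptions; the names `THRESHOLD`, `build_frequent_dict`, and `vectorize` are hypothetical, not from the patent:

```python
# Hedged sketch of frequent-dictionary construction and 0/1 vectorization.
from collections import Counter

THRESHOLD = 2  # assumed frequency threshold for "high-frequency" words

def build_frequent_dict(descriptions):
    """Count keyword frequencies across all log records and keep the
    words whose frequency reaches the threshold (the frequent dictionary)."""
    counts = Counter(w for desc in descriptions for w in desc.split())
    return sorted(w for w, c in counts.items() if c >= THRESHOLD)

def vectorize(description, frequent_dict):
    """Compare one log description field against the frequent dictionary,
    producing its keyspace: a vector of 0s and 1s."""
    words = set(description.split())
    return [1 if w in words else 0 for w in frequent_dict]

logs = [
    "connection refused port scan",
    "login failed invalid password",
    "port scan detected",
    "login failed connection reset",
]
fd = build_frequent_dict(logs)
vectors = [vectorize(d, fd) for d in logs]  # the text vector file
```

Each row of `vectors` is one keyspace; together they form the text vector file that the k-means step consumes.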
The MapReduce-based implementation of the k-means algorithm comprises the following steps:
1) All points in the original data set are scanned, and k points are randomly selected as the initial cluster centers;
2) Each Map node reads its local data set and generates a cluster set with the k-means algorithm; in the Reduce stage, the several cluster sets are combined to generate new global cluster centers; this process is repeated until the termination condition is met;
3) All data elements are partitioned into clusters according to the finally generated cluster centers.
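The steps above can be sketched as a single-process analogue of the MapReduce loop: the map phase assigns each vector to its nearest center, the reduce phase averages each cluster's members into a new global center. A real deployment would run these phases as Hadoop MapReduce jobs; the function names here are illustrative only:

```python
# Minimal sketch of the MapReduce-style k-means iteration (assumed names).
import random

def map_assign(points, centers):
    """Map phase: emit (cluster_index, point) for each point's nearest center."""
    pairs = []
    for p in points:
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
        pairs.append((dists.index(min(dists)), p))
    return pairs

def reduce_recompute(pairs, k, old_centers):
    """Reduce phase: average the members of each cluster into a new center."""
    new_centers = []
    for i in range(k):
        members = [p for idx, p in pairs if idx == i]
        if members:
            dim = len(members[0])
            new_centers.append(
                [sum(p[d] for p in members) / len(members) for d in range(dim)]
            )
        else:
            new_centers.append(old_centers[i])  # keep an empty cluster's center
    return new_centers

def kmeans(points, k, max_iter=100):
    random.seed(0)
    centers = random.sample(points, k)  # step 1): random initial centers
    for _ in range(max_iter):
        pairs = map_assign(points, centers)                 # step 2), map side
        new_centers = reduce_recompute(pairs, k, centers)   # step 2), reduce side
        if new_centers == centers:                          # termination condition
            break
        centers = new_centers
    return centers, map_assign(points, centers)  # step 3): final partition
```

On a small separable data set this converges to the obvious two centers regardless of the random initialization.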
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510553636.3A CN105138661B (en) | 2015-09-02 | 2015-09-02 | A kind of network security daily record k-means cluster analysis systems and method based on Hadoop |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105138661A CN105138661A (en) | 2015-12-09 |
CN105138661B true CN105138661B (en) | 2018-10-30 |
Family
ID=54724008
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510553636.3A Expired - Fee Related CN105138661B (en) | 2015-09-02 | 2015-09-02 | A kind of network security daily record k-means cluster analysis systems and method based on Hadoop |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105138661B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106294580A (en) * | 2016-07-28 | 2017-01-04 | 武汉虹信技术服务有限责任公司 | LTE network MR data analysing method based on HADOOP platform |
Families Citing this family (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105608203B (en) * | 2015-12-24 | 2019-09-17 | Tcl集团股份有限公司 | A kind of Internet of Things log processing method and device based on Hadoop platform |
US9973521B2 (en) * | 2015-12-28 | 2018-05-15 | International Business Machines Corporation | System and method for field extraction of data contained within a log stream |
CN105824892A (en) * | 2016-03-11 | 2016-08-03 | 广东电网有限责任公司电力科学研究院 | Method for synchronizing and processing data by data pool |
CN106168965B (en) * | 2016-07-01 | 2020-06-30 | 竹间智能科技(上海)有限公司 | Knowledge graph construction system |
CN107579944B (en) * | 2016-07-05 | 2020-08-11 | 南京联成科技发展股份有限公司 | Artificial intelligence and MapReduce-based security attack prediction method |
CN107958022A (en) * | 2017-11-06 | 2018-04-24 | 余帝乾 | A kind of method that Web log excavates |
CN107943647A (en) * | 2017-11-21 | 2018-04-20 | 北京小度互娱科技有限公司 | A kind of reliable distributed information log collection method and system |
CN108133043B (en) * | 2018-01-12 | 2022-07-29 | 福建星瑞格软件有限公司 | Structured storage method for server running logs based on big data |
CN110135184B (en) * | 2018-02-09 | 2023-12-22 | 中兴通讯股份有限公司 | Method, device, equipment and storage medium for desensitizing static data |
CN108446568B (en) * | 2018-03-19 | 2021-04-13 | 西北大学 | Histogram data publishing method for trend analysis differential privacy protection |
CN110581873B (en) * | 2018-06-11 | 2022-06-14 | 中国移动通信集团浙江有限公司 | Cross-cluster redirection method and monitoring server |
CN108933785B (en) * | 2018-06-29 | 2021-02-05 | 平安科技(深圳)有限公司 | Network risk monitoring method and device, computer equipment and storage medium |
CN109254903A (en) * | 2018-08-03 | 2019-01-22 | 挖财网络技术有限公司 | A kind of intelligentized log analysis method and device |
CN109446042B (en) * | 2018-10-12 | 2021-12-14 | 安徽南瑞中天电力电子有限公司 | Log management method and system for intelligent electric equipment |
CN109766368B (en) * | 2018-11-14 | 2021-08-27 | 国云科技股份有限公司 | Hive-based data query multi-type view output system and method |
CN109525593B (en) * | 2018-12-20 | 2022-02-22 | 中科曙光国际信息产业有限公司 | Centralized safety management and control system and method for hadoop big data platform |
CN110069551A (en) * | 2019-04-25 | 2019-07-30 | 江南大学 | Medical Devices O&M information excavating analysis system and its application method based on Spark |
CN110378550A (en) * | 2019-06-03 | 2019-10-25 | 东南大学 | The processing method of the extensive food data of multi-source based on distributed structure/architecture |
CN112306787B (en) * | 2019-07-24 | 2022-08-09 | 阿里巴巴集团控股有限公司 | Error log processing method and device, electronic equipment and intelligent sound box |
CN111259068A (en) * | 2020-04-28 | 2020-06-09 | 成都四方伟业软件股份有限公司 | Data development method and system based on data warehouse |
CN111913954B (en) * | 2020-06-20 | 2023-08-04 | 杭州城市大数据运营有限公司 | Intelligent data standard catalog generation method and device |
CN111966293A (en) * | 2020-08-18 | 2020-11-20 | 北京明略昭辉科技有限公司 | Cold and hot data analysis method and system |
CN112380348B (en) * | 2020-11-25 | 2024-03-26 | 中信百信银行股份有限公司 | Metadata processing method, apparatus, electronic device and computer readable storage medium |
CN113238912B (en) * | 2021-05-08 | 2022-12-06 | 国家计算机网络与信息安全管理中心 | Aggregation processing method for network security log data |
CN113641750A (en) * | 2021-08-20 | 2021-11-12 | 广东云药科技有限公司 | Enterprise big data analysis platform |
CN116737846A (en) * | 2023-05-31 | 2023-09-12 | 深圳华夏凯词财富管理有限公司 | Asset management data safety protection warehouse system based on Hive |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102999633A (en) * | 2012-12-18 | 2013-03-27 | 北京师范大学珠海分校 | Cloud cluster extraction method of network information |
CN103399887A (en) * | 2013-07-19 | 2013-11-20 | 蓝盾信息安全技术股份有限公司 | Query and statistical analysis system for mass logs |
CN103425762A (en) * | 2013-08-05 | 2013-12-04 | 南京邮电大学 | Telecom operator mass data processing method based on Hadoop platform |
CN103544328A (en) * | 2013-11-15 | 2014-01-29 | 南京大学 | Parallel k mean value clustering method based on Hadoop |
CN104616205A (en) * | 2014-11-24 | 2015-05-13 | 北京科东电力控制系统有限责任公司 | Distributed log analysis based operation state monitoring method of power system |
2015-09-02: Application CN201510553636.3A filed (CN); granted as CN105138661B; status: Expired - Fee Related
Non-Patent Citations (2)
Title |
---|
Design of a Web Log Analysis Scheme Based on Hadoop and K-means; Fu Wei et al.; Proceedings of the 19th National Youth Communication Academic Conference; 2014-10-15; p. 171 left column "5.2 Parallelization of the Hadoop-based K-means algorithm" to p. 173 left column "5.3.3 Session identification", Fig. 2 * |
Application of Big Data Technology in Network Security Analysis; Wang Shuai et al.; Telecommunications Science; 2015-07-31 (No. 7); p. 2015176-3 right column paragraph 2 to p. 2015176-5 left column paragraph 4, Fig. 1 * |
Also Published As
Publication number | Publication date |
---|---|
CN105138661A (en) | 2015-12-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105138661B (en) | A kind of network security daily record k-means cluster analysis systems and method based on Hadoop | |
CN105183834B (en) | A kind of traffic big data semantic applications method of servicing based on ontology library | |
CN114399006B (en) | Multi-source abnormal composition image data fusion method and system based on super-calculation | |
CN103430144A (en) | Data source analytics | |
Mohammed et al. | A review of big data environment and its related technologies | |
CN102917009B (en) | A kind of stock certificate data collection based on cloud computing technology and storage means and system | |
CN104133858A (en) | Intelligent double-engine analysis system and intelligent double-engine analysis method based on column storage | |
CN111026874A (en) | Data processing method and server of knowledge graph | |
Das et al. | A study on big data integration with data warehouse | |
CN109783484A (en) | The construction method and system of the data service platform of knowledge based map | |
Li et al. | The overview of big data storage and management | |
Chen et al. | Metadata-based information resource integration for research management | |
Banaei et al. | Hadoop and its role in modern image processing | |
CN103226608A (en) | Parallel file searching method based on folder-level telescopic Bloom Filter bit diagram | |
CN115858513A (en) | Data governance method, data governance device, computer equipment and storage medium | |
Shakhovska et al. | Big Data Model" Entity and Features" | |
Li et al. | Survey of recent research progress and issues in big data | |
Arputhamary et al. | A review on big data integration | |
Chen et al. | Related technologies | |
Suciu et al. | Big data technology for scientific applications | |
Jo et al. | Constructing national geospatial big data platform: current status and future direction | |
Ediger et al. | Real-time streaming intelligence: Integrating graph and nlp analytics | |
Yu et al. | A police big data analytics platform: Framework and implications | |
CN113590651A (en) | Cross-cluster data processing system and method based on HQL | |
Pan et al. | An open sharing pattern design of massive power big data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 2018-10-30; Termination date: 2020-09-02 |