CN104156389B - Deep-packet detection system and method based on Hadoop platform - Google Patents

Deep-packet detection system and method based on Hadoop platform

Publication number
CN104156389B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410317160.9A
Other languages
Chinese (zh)
Other versions
CN104156389A (en)
Inventor
雒江涛
杨军超
胡汝荣
向程超
高伟
王小平
申建
刘勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2471 Distributed queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Abstract

The present invention discloses a deep packet inspection (DPI) system and method based on the Hadoop platform and relates to data mining technology. The invention comprises a web crawler part and a deep packet inspection part. The web crawler unit fetches pages from the Internet; the document analysis unit analyzes the pages to obtain the mapping between each uniform resource locator (URL) and the category of the page content, and iteratively updates the mapping database. The deep packet inspection part parses the raw data into five-tuple flows and inputs them to the TC module, which labels the flows by service type and generates specific service flows; the specific service flows are converted into DPI events, and the DPI events are matched against the mapping database to complete the DPI event statistics. By integrating deep packet inspection into the Hadoop platform, the invention meets the needs of big data storage and in-depth traffic analysis.

Description

Deep-packet detection system and method based on Hadoop platform
Technical field
The present invention relates to the analysis of massive network data, and more particularly to a deep packet inspection system.
Background
Deep packet inspection (DPI) is an application-layer traffic detection and control technology. It is widely used in packet application-type analysis, user behavior analysis, intrusion detection, and virus/worm detection, and is an important means of data mining.
The arrival of the big data era has brought new challenges to traditional network traffic analysis methods. In particular, network traffic monitoring, security management, content auditing, and telecom operators' classified charging, marketing, and intelligent pipe construction place higher demands on traffic analysis.
Traditional network traffic analysis methods mainly include statistical analysis based on transport protocol ports, on features, and on traffic characteristics. These methods cannot meet the combined demands of traffic classification and in-depth analysis. Traffic identification based on deep packet inspection has the advantage that it can parse deeper network protocols and achieves higher matching accuracy. However, because DPI must parse every packet and network traffic is growing explosively, processing speed has become the bottleneck of DPI-based in-depth traffic analysis. A new method is needed to address the challenges of accuracy, speed, and cost faced by in-depth analysis of big data.
Summary of the invention
In view of the above problems, the present invention makes full use of the advantages of the Hadoop distributed computing platform, namely that it is open source, efficient, stable, and highly fault tolerant, and integrates deep packet inspection into the Hadoop platform to meet the needs of big data storage and in-depth traffic analysis.
The technical solution adopted by the present invention to solve the above technical problem is as follows. A deep packet inspection system based on the Hadoop (distributed system infrastructure) platform is proposed. The system comprises a web crawler part and a deep packet inspection part. The web crawler part crawls and analyzes web pages and iteratively updates the mapping database, which the deep packet inspection part uses for matching. The web crawler part comprises a web crawler module and a web page analysis module: the web crawler module fetches web page files from specific websites and provides them as input to the web page analysis module; the web page analysis module analyzes the web page files to obtain the mapping between each uniform resource locator (URL) and the category of the page content, for the DPI module to use in matching. The mapping database is iteratively updated as pages are crawled. The deep packet inspection part comprises a packet parsing (PA) module, a traffic classification (TC) module, and a deep packet inspection (DPI) module. The PA module parses the raw data into five-tuple flows and inputs them to the TC module; the TC module labels the input five-tuple flows by service type and generates specific service flows, which are input to the DPI module; the DPI module converts the specific service flows into DPI events and matches the DPI events against the mapping database to complete the DPI event statistics.
The PA module parsing the raw data into five-tuple flows and inputting them to the TC module specifically comprises: the PA module reads the raw data stream from HDFS and uses key-value pairs, with the packet offset as the Key and the packet content as the Value, as the MapReduce input; the result is output with the five-tuple as the Key and the five-tuple flow and flow-characteristic statistics as the Value, and is stored in HDFS. The TC module labeling the input five-tuple flows and generating specific service flows for input to the DPI module specifically comprises: the TC module reads the five-tuple flows from HDFS and uses key-value pairs, with the five-tuple as the Key and the five-tuple flow as the Value, as the MapReduce input; the result is output with the five-tuple/service label as the Key and the labeled service flow as the Value, and is stored in HDFS. The DPI module converting the specific service flows into DPI events specifically comprises: the DPI module reads the specific service flows from HDFS and uses key-value pairs, with the five-tuple/service label as the Key and the characteristic fields of the specific service flow as the Value, as the MapReduce input; the result is output with the five-tuple/service label as the Key and the DPI event as the Value.
The present invention also proposes a deep packet inspection method based on the Hadoop platform, comprising the steps of: the web crawler module cyclically fetches web page files from specific websites; the document analysis module analyzes the web page files to obtain the mapping between URLs and page content categories and stores it in a database, and the mapping database is iteratively updated as pages are crawled; the PA module parses the raw data into five-tuple flows and inputs them to the TC module; the TC module labels the input five-tuple flows by service type and generates specific service flows, which are input to the DPI module; the DPI module converts the specific service flows into DPI events and matches the DPI events against the mapping database to complete the DPI event statistics.
The present invention makes full use of the advantages of the Hadoop distributed computing platform (open source, efficient, stable, highly fault tolerant) and integrates web-crawler-based deep packet inspection into the Hadoop platform to achieve efficient in-depth traffic analysis. The invention can parse deeper network protocols, achieves higher matching accuracy, and processes data quickly, thereby addressing the accuracy and speed problems of in-depth analysis of big data.
Brief description of the drawings
Figure 1 is a schematic block diagram of the deep packet inspection system based on the Hadoop platform according to the present invention;
Figure 2 is a flowchart of the web crawler part of the deep packet inspection system based on the Hadoop platform according to the present invention;
Figure 3 is a flowchart of the deep packet inspection part of the deep packet inspection system based on the Hadoop platform according to the present invention.
Detailed description
The deep packet inspection part is built on the Hadoop platform. It labels service flows and converts specific service flows (mainly Web service flows) into DPI events. A DPI event is the result of in-depth identification of a network event (for example, user A browsed a certain video website at a certain time) and is the output of the DPI module. This part comprises a packet parsing (PA) module, a traffic classification (TC) module, and a deep packet inspection (DPI) module. The PA module mainly performs packet parsing: it parses the raw data into five-tuple flows (a five-tuple consists of the source IP address, source port, destination IP address, destination port, and transport-layer protocol number) and outputs them to the TC module. The TC module labels the input five-tuple flows by service type and provides the input for the DPI module. The DPI module converts the specific service flows into DPI events, matches the DPI events against the mapping database, and completes the DPI event statistics according to the match between the DPI events and the information in the mapping database.
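By way of illustration only, the five-tuple extraction performed by the PA module might be sketched in Python as follows. The patent does not give an implementation; the names `FiveTuple` and `parse_five_tuple` are hypothetical, and the sketch assumes each record is a raw IPv4 packet carrying TCP or UDP with no link-layer header.

```python
import socket
import struct
from typing import NamedTuple, Optional


class FiveTuple(NamedTuple):
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: int  # transport-layer protocol number (6 = TCP, 17 = UDP)


def parse_five_tuple(ip_packet: bytes) -> Optional[FiveTuple]:
    """Extract the five-tuple from a raw IPv4 packet (assumed framing)."""
    if len(ip_packet) < 20 or ip_packet[0] >> 4 != 4:
        return None  # truncated or not IPv4
    header_len = (ip_packet[0] & 0x0F) * 4  # IHL is counted in 32-bit words
    protocol = ip_packet[9]
    if protocol not in (6, 17) or len(ip_packet) < header_len + 4:
        return None  # only TCP and UDP carry ports at this position
    src_ip = socket.inet_ntoa(ip_packet[12:16])
    dst_ip = socket.inet_ntoa(ip_packet[16:20])
    src_port, dst_port = struct.unpack(
        "!HH", ip_packet[header_len:header_len + 4]
    )
    return FiveTuple(src_ip, src_port, dst_ip, dst_port, protocol)


# Hand-built 20-byte IPv4 header plus the first 4 bytes of a TCP header.
sample = (
    bytes([0x45, 0, 0, 24, 0, 0, 0, 0, 64, 6, 0, 0])
    + socket.inet_aton("10.0.0.1")
    + socket.inet_aton("93.184.216.34")
    + struct.pack("!HH", 51000, 80)
)
print(parse_five_tuple(sample))
```

In a MapReduce setting, a function of this kind would sit in the map step, emitting the five-tuple as the key for each packet.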
The present invention is further described below with reference to the accompanying drawings and specific embodiments:
Figure 1 shows a schematic block diagram of the deep packet inspection system based on the Hadoop platform according to the present invention. The system comprises two parts: the web crawler part and the deep packet inspection part.
The web crawler part comprises a web crawler module, a document analysis module, and a database. The web crawler module fetches pages from the Internet; the document analysis module analyzes the pages to obtain the mapping between each uniform resource locator (URL) and the category of the page content; the mapping database is iteratively updated as pages are crawled and is used by the DPI module of the deep packet inspection part for matching.
The deep packet inspection part is built on the Hadoop platform; it labels service flows and converts specific service flows into DPI events. This part comprises three modules: packet parsing (PA), traffic classification (TC), and deep packet inspection (DPI). The PA module performs packet parsing: it parses the raw data into five-tuple flows and provides the input for the TC module. The TC module performs flow labeling: it labels the input five-tuple flows by service type, generates specific service flows, and inputs them to the DPI module. The DPI module converts the specific service flows into DPI events and matches the DPI events against the mapping database to complete the DPI event statistics. The main function of IO Format is to split and read the input and output data of each module. HDFS, the distributed storage system of Hadoop, mainly stores the raw data and the processing results of each module.
Figure 2 shows the flowchart of the web crawler part of the deep packet inspection system based on the Hadoop platform according to the present invention. The web crawler part is divided into two stages, web page fetching and web page analysis, and is completed by the following steps:
The web crawler module cyclically fetches web page files from specific websites; the document analysis module analyzes the web page files to obtain the mapping between URLs and page content categories and stores it in the database for use by the DPI module.
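The analysis stage can be illustrated with a minimal Python sketch. The patent does not state how a page category is derived; this sketch assumes, purely for illustration, that the category is inferred from the page's `<meta name="keywords">` tag using a small hypothetical rule table (`CATEGORY_RULES`). The pages are supplied as in-memory strings, so no actual crawling takes place.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional


class MetaKeywordParser(HTMLParser):
    """Collects the comma-separated content of <meta name="keywords">."""

    def __init__(self) -> None:
        super().__init__()
        self.keywords: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "keywords":
            content = attr.get("content") or ""
            self.keywords += [
                k.strip().lower() for k in content.split(",") if k.strip()
            ]


# Hypothetical keyword-to-category rules (an assumption, not from the patent).
CATEGORY_RULES = {"video": "video", "movie": "video", "news": "news"}


def classify_page(html: str) -> Optional[str]:
    """Return the first category whose rule token occurs in a keyword."""
    parser = MetaKeywordParser()
    parser.feed(html)
    for keyword in parser.keywords:
        for token, category in CATEGORY_RULES.items():
            if token in keyword:
                return category
    return None


def update_mapping(mapping: Dict[str, str], pages: Dict[str, str]) -> None:
    """One crawl iteration: add or refresh URL -> category entries."""
    for url, html in pages.items():
        category = classify_page(html)
        if category is not None:
            mapping[url] = category


mapping: Dict[str, str] = {}
crawled = {
    "http://video.example.com/":
        '<html><head><meta name="keywords" content="Movies, TV"></head></html>',
    "http://news.example.com/":
        '<html><head><meta name="keywords" content="daily news"></head></html>',
    "http://blank.example.com/":
        "<html><head><title>x</title></head></html>",
}
update_mapping(mapping, crawled)
print(mapping)
```

Calling `update_mapping` again with newly fetched pages corresponds to the iterative update of the mapping described above.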
Figure 3 shows the flowchart of the deep packet inspection part. Deep packet inspection comprises three stages: packet parsing, traffic classification and labeling, and deep packet inspection. Together with the web crawler part, the method comprises the following steps:
Step 1: the web crawler module cyclically fetches web page files from specific websites;
Step 2: the web page analysis module analyzes the fetched web page files to obtain the mapping between URLs and page content categories and stores it in the mapping database, for use by the deep packet inspection (DPI) module;
The deep packet inspection part comprises the three stages of packet parsing, traffic classification and labeling, and deep packet inspection; the specific steps are:
Step 3: a data collector captures the raw network data stream and stores it in the distributed storage system HDFS (Hadoop Distributed File System);
Step 4: the packet parsing module PA reads the raw data stream from HDFS and uses key-value pairs, with the packet offset as the Key and the packet content as the Value, as the input of the MapReduce programming model; the result is output with the five-tuple as the Key and the five-tuple flow and flow-characteristic statistics as the Value, and is stored in HDFS;
Step 5: the traffic classification and labeling module TC reads the five-tuple flows from HDFS and uses key-value pairs, with the five-tuple as the Key and the five-tuple flow as the Value, as the MapReduce input; the result is output with the five-tuple/service label as the Key and the labeled service flow as the Value, and is stored in HDFS;
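The key-value transformations of Steps 4 and 5 can be sketched as plain Python functions that imitate the reduce and map phases in a single process; this is not Hadoop code. The port-based rule table `PORT_LABELS` is an assumption made for the example, since the patent does not say how the TC module chooses a label, and the statistics kept per flow (packet and byte counts) are likewise illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (source IP, source port, destination IP, destination port, protocol number)
FiveTuple = Tuple[str, int, str, int, int]

# Hypothetical labeling rules keyed by destination port (an assumption).
PORT_LABELS = {80: "web", 8080: "web", 53: "dns"}


def pa_reduce(parsed: Iterable[Tuple[FiveTuple, bytes]]) -> Dict[FiveTuple, dict]:
    """Step 4 (reduce side): group packets by five-tuple, keep flow statistics."""
    flows: Dict[FiveTuple, dict] = defaultdict(
        lambda: {"packets": 0, "bytes": 0, "payloads": []}
    )
    for key, payload in parsed:
        flow = flows[key]
        flow["packets"] += 1
        flow["bytes"] += len(payload)
        flow["payloads"].append(payload)
    return dict(flows)


def tc_map(flows: Dict[FiveTuple, dict]) -> Dict[Tuple[FiveTuple, str], dict]:
    """Step 5: re-key each flow as (five-tuple, service label)."""
    labeled = {}
    for key, flow in flows.items():
        label = PORT_LABELS.get(key[3], "other")
        labeled[(key, label)] = flow
    return labeled


def select_service(labeled: Dict[Tuple[FiveTuple, str], dict], label: str):
    """Keep only the flows carrying the given label (the specific service flows)."""
    return {k: v for k, v in labeled.items() if k[1] == label}


web_flow = ("10.0.0.1", 51000, "93.184.216.34", 80, 6)
dns_flow = ("10.0.0.2", 40000, "8.8.8.8", 53, 17)
parsed = [
    (web_flow, b"GET / HTTP/1.1\r\n"),
    (web_flow, b"Host: example.com\r\n\r\n"),
    (dns_flow, b"\x00\x01"),
]
labeled = tc_map(pa_reduce(parsed))
web_only = select_service(labeled, "web")
print(sorted(label for _, label in labeled))
```

Keeping the five-tuple inside the output key, as the text describes, lets the next stage recover both the flow identity and its label without a join.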
Step 6: the deep packet inspection module DPI reads the labeled specific service flows from HDFS and uses key-value pairs, with the five-tuple/service label as the Key and the characteristic fields of the specific service flow as the Value, as the MapReduce input; the result is output with the five-tuple/service label as the Key and the DPI event as the Value;
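As a hedged illustration of Step 6, the following Python sketch turns the payload of a Web flow into a DPI event. It assumes that the characteristic fields are the HTTP request line and the Host header, that the user is identified by source IP address, and that a capture timestamp is available; none of these details is stated in the patent, and `DpiEvent` and `to_dpi_event` are hypothetical names.

```python
from typing import NamedTuple, Optional


class DpiEvent(NamedTuple):
    user: str     # assumed here to be the source IP address
    time: float   # assumed capture timestamp, in seconds
    action: str   # assumed here to be the HTTP method
    url: str      # rebuilt from the Host header and the request path


def to_dpi_event(src_ip: str, timestamp: float, payload: bytes) -> Optional[DpiEvent]:
    """Build a DPI event from an HTTP request payload, or return None."""
    try:
        head = payload.split(b"\r\n\r\n", 1)[0].decode("ascii")
    except UnicodeDecodeError:
        return None  # binary payload, not a plain HTTP request
    lines = head.split("\r\n")
    parts = lines[0].split(" ")
    if len(parts) != 3 or not parts[2].startswith("HTTP/"):
        return None  # first line is not an HTTP request line
    method, path, _version = parts
    host = None
    for line in lines[1:]:
        name, _, value = line.partition(":")
        if name.strip().lower() == "host":
            host = value.strip()
            break
    if host is None:
        return None
    return DpiEvent(src_ip, timestamp, method, f"http://{host}{path}")


request = (
    b"GET /watch?v=1 HTTP/1.1\r\n"
    b"Host: video.example.com\r\n"
    b"User-Agent: demo\r\n\r\n"
)
event = to_dpi_event("10.0.0.1", 1404432000.0, request)
print(event)
```

Payloads that are not recognizable HTTP requests yield `None`, so a caller can simply skip them.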
Step 7: the DPI events are matched against the mapping database to obtain the DPI statistical results, which are stored in the database for querying; based on the DPI events (including user, time, and action), in-depth data mining of the network traffic is completed.
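The matching and counting in Step 7 might look like the following Python sketch. Matching on the host part of the URL is an assumption chosen for simplicity (the patent only says that DPI events are matched against the URL-to-category mapping), and `match_and_count` is a hypothetical name.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple
from urllib.parse import urlsplit


def match_and_count(
    events: Iterable[Tuple[str, str]], mapping: Dict[str, str]
) -> Counter:
    """Count (user, category) pairs for events whose host is in the mapping.

    events  -- iterable of (user, url) pairs taken from DPI events
    mapping -- URL -> page category, as built by the web crawler part
    """
    # Index the mapping by host so that any path on a known site matches.
    host_to_category = {
        urlsplit(url).netloc: category for url, category in mapping.items()
    }
    stats: Counter = Counter()
    for user, url in events:
        category = host_to_category.get(urlsplit(url).netloc)
        if category is not None:
            stats[(user, category)] += 1
    return stats


mapping = {
    "http://video.example.com/": "video",
    "http://news.example.com/": "news",
}
events = [
    ("10.0.0.1", "http://video.example.com/watch?v=1"),
    ("10.0.0.1", "http://video.example.com/watch?v=2"),
    ("10.0.0.2", "http://news.example.com/today"),
    ("10.0.0.2", "http://unknown.example.org/"),
]
stats = match_and_count(events, mapping)
print(dict(stats))
```

The resulting counts are the kind of per-user, per-category statistics that could then be written to the database for querying.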
In summary, the data collector captures network data and stores it as the raw data stream in HDFS, the distributed storage system of the Hadoop platform. The PA module reads the raw data stream from HDFS and uses key-value pairs, with the packet offset as the Key and the packet content as the Value, as the MapReduce input; the result is output with the five-tuple as the Key and the five-tuple flow and flow-characteristic statistics as the Value and is stored in HDFS, which completes the work of the PA module. The TC module reads the five-tuple flows from HDFS and uses key-value pairs, with the five-tuple as the Key and the five-tuple flow as the Value, as the MapReduce input; the result is output with the five-tuple/service label as the Key and the labeled service flow as the Value and is stored in HDFS. The DPI module reads the specific service flows from HDFS and uses key-value pairs, with the five-tuple/service label as the Key and the specific service flow as the Value, as the MapReduce input; the result is output with the five-tuple/service label as the Key and the DPI event as the Value. The characteristic fields of the DPI events are matched against the mapping database built by the web crawler part and, based on the DPI events (including user, time, and action), in-depth data mining of the network traffic is completed.

Claims (2)

1. A deep packet inspection system based on the Hadoop platform, characterized in that the system comprises a web crawler part and a deep packet inspection part; the web crawler part comprises a web crawler module, a document analysis module, and a database; the web crawler module fetches pages from the Internet, the document analysis module analyzes the pages to obtain the mapping between uniform resource locators (URLs) and page content categories, and the mapping database is iteratively updated as pages are crawled; the deep packet inspection part comprises a packet parsing (PA) module, a traffic classification (TC) module, and a deep packet inspection (DPI) module; the PA module parses the raw data into five-tuple flows and inputs them to the TC module; the TC module labels the input five-tuple flows by service type and generates specific service flows, which are input to the DPI module; the DPI module converts the specific service flows into DPI events and matches the DPI events against the mapping database to complete the DPI event statistics; specifically: a data collector captures the raw network data stream and stores it in the distributed storage system HDFS; the PA module reads the raw data stream from HDFS and uses key-value pairs, with the packet offset as the Key and the packet content as the Value, as the input of the MapReduce programming model, outputs the result with the five-tuple as the Key and the five-tuple flow and flow-characteristic statistics as the Value, and stores it in HDFS, thereby parsing the raw data into five-tuple flows for input to the TC module; the TC module reads the five-tuple flows from HDFS and uses key-value pairs, with the five-tuple as the Key and the five-tuple flow as the Value, as the MapReduce input, outputs the result with the five-tuple/service label as the Key and the labeled service flow as the Value, and stores the result in HDFS; the DPI module reads the labeled specific service flows from HDFS and uses key-value pairs, with the five-tuple/service label as the Key and the characteristic fields of the specific service flow as the Value, as the MapReduce input, and outputs the result with the five-tuple/service label as the Key and the DPI event as the Value, thereby converting the specific service flows into DPI events; the DPI events are matched against the mapping database to obtain the DPI statistical results, which are stored in the database for querying; and in-depth data mining of the network traffic is completed based on the DPI events.
2. A deep packet inspection method based on the Hadoop platform, characterized by comprising the steps of: the web crawler module cyclically fetches web page files from specific websites, the document analysis module analyzes the web page files to obtain the mapping between URLs and page content categories and stores it in a database, and the mapping database is iteratively updated as pages are crawled; the packet parsing (PA) module parses the raw data into five-tuple flows and inputs them to the traffic classification (TC) module; the TC module labels the input five-tuple flows by service type and generates specific service flows, which are input to the deep packet inspection (DPI) module; the DPI module converts the specific service flows into DPI events and matches the DPI events against the mapping database to complete the DPI event statistics; specifically: a data collector captures the raw network data stream and stores it in the distributed storage system HDFS; the PA module reads the raw data stream from HDFS and uses key-value pairs, with the packet offset as the Key and the packet content as the Value, as the input of the MapReduce programming model, outputs the result with the five-tuple as the Key and the five-tuple flow and flow-characteristic statistics as the Value, and stores it in HDFS, thereby parsing the raw data into five-tuple flows for input to the TC module; the TC module reads the five-tuple flows from HDFS and uses key-value pairs, with the five-tuple as the Key and the five-tuple flow as the Value, as the MapReduce input, outputs the result with the five-tuple/service label as the Key and the labeled service flow as the Value, and stores the result in HDFS; the DPI module reads the labeled specific service flows from HDFS and uses key-value pairs, with the five-tuple/service label as the Key and the characteristic fields of the specific service flow as the Value, as the MapReduce input, and outputs the result with the five-tuple/service label as the Key and the DPI event as the Value, thereby converting the specific service flows into DPI events; the DPI events are matched against the mapping database to obtain the DPI statistical results, which are stored in the database for querying; and in-depth data mining of the network traffic is completed based on the DPI events.
CN201410317160.9A 2014-07-04 2014-07-04 Deep-packet detection system and method based on Hadoop platform Active CN104156389B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410317160.9A CN104156389B (en) 2014-07-04 2014-07-04 Deep-packet detection system and method based on Hadoop platform

Publications (2)

Publication Number Publication Date
CN104156389A CN104156389A (en) 2014-11-19
CN104156389B true CN104156389B (en) 2017-12-26

Family

ID=51881893

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410317160.9A Active CN104156389B (en) 2014-07-04 2014-07-04 Deep-packet detection system and method based on Hadoop platform

Country Status (1)

Country Link
CN (1) CN104156389B (en)
