CN104156389B - Deep-packet detection system and method based on Hadoop platform - Google Patents
- Publication number
- CN104156389B (application CN201410317160.9A; also published as CN104156389A)
- Authority
- CN
- China
- Prior art keywords
- dpi
- modules
- tuple
- stream
- key
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2471—Distributed queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
Abstract
The invention discloses a deep packet inspection (DPI) system and method based on the Hadoop platform, in the field of data mining. The system comprises a web crawler part and a deep packet inspection part. In the web crawler part, a web crawler unit fetches pages from the Internet, and a document analysis unit analyzes each page to obtain the mapping between its uniform resource locator (URL) and its content category; the mapping store in the database is updated iteratively as pages are crawled. In the deep packet inspection part, raw data is parsed into five-tuple flows and fed to the traffic classification (TC) module, which labels each flow by service type and generates specific service flows; these flows are converted into DPI events, which are matched against the mapping store to produce DPI event statistics. By integrating deep packet inspection into the Hadoop platform, the invention meets the needs of big data storage and in-depth traffic analysis.
Description
Technical field
The invention relates to the analysis of massive network data, and more particularly to a deep packet inspection system.
Background
Deep packet inspection (DPI) is an application-layer traffic detection and control technology. It is widely used for packet application-type analysis, user behavior analysis, intrusion detection, and virus/worm detection, and is an important means of data mining.
The arrival of the big data era poses new challenges for traditional network traffic analysis. Network traffic monitoring, security management, content auditing and, for telecom operators, differentiated charging, marketing, and smart-pipe construction all place higher demands on traffic analysis.
Traditional network traffic analysis methods rely mainly on transport-protocol ports, signatures, and statistical analysis of traffic characteristics. These methods cannot meet the demands of traffic classification and multi-functional in-depth analysis. DPI-based traffic identification can parse deeper protocol layers and achieves higher matching accuracy, but because DPI must parse every packet, and network traffic is growing explosively, processing speed has become the bottleneck of DPI-based in-depth traffic analysis. A new method is needed to address the challenges of accuracy, speed, and cost in big data in-depth analysis.
Summary of the invention
In view of the above problems, the invention takes full advantage of the Hadoop distributed computing platform, which is open source, efficient, stable, and highly fault tolerant, and integrates deep packet inspection into it to meet the needs of big data storage and in-depth traffic analysis.
The technical solution of the invention is a deep packet inspection system based on the Hadoop platform (a distributed system infrastructure). The system comprises a web crawler part and a deep packet inspection part. The web crawler part crawls and analyzes web pages and iteratively updates a mapping store that the deep packet inspection part uses for matching. It comprises a web crawler module and a web page analysis module: the web crawler module fetches web page files from specified websites and supplies them as input to the web page analysis module; the web page analysis module analyzes the files to obtain the mapping between each uniform resource locator (URL) and the page's content category, for use by the DPI module in matching. The mapping store in the database is updated iteratively as pages are crawled. The deep packet inspection part comprises a packet parsing (PA) module, a traffic classification (TC) module, and a deep packet inspection (DPI) module. The PA module parses raw data into five-tuple flows and feeds them to the TC module; the TC module labels each input five-tuple flow by service type and generates specific service flows as input to the DPI module; the DPI module converts the specific service flows into DPI events and matches the events against the mapping store to complete DPI event statistics.
Specifically, the PA module reads the raw data stream from HDFS and supplies it to MapReduce as key-value pairs with the packet offset as the Key and the packet content as the Value; the output, with the five-tuple as the Key and the five-tuple flow plus flow feature statistics as the Value, is stored in HDFS. The TC module reads the five-tuple flows from HDFS and supplies them to MapReduce as key-value pairs with the five-tuple as the Key and the five-tuple flow as the Value; the output, with five-tuple/service label as the Key and the service-labelled flow as the Value, is stored in HDFS. The DPI module reads the specific service flows from HDFS and supplies them to MapReduce as key-value pairs with five-tuple/service label as the Key and the features of the specific service flow as the Value; the output has five-tuple/service label as the Key and the DPI event as the Value.
The invention also provides a deep packet inspection method based on the Hadoop platform, comprising the following steps: the web crawler module cyclically fetches web page files from specified websites; the document analysis module analyzes the web page files to obtain the mapping between URLs and page content categories and stores it in the database, and the mapping store in the database is updated iteratively as pages are crawled; the PA module parses raw data into five-tuple flows and feeds them to the TC module; the TC module labels each input five-tuple flow by service type and generates specific service flows as input to the DPI module; the DPI module converts the specific service flows into DPI events and matches them against the mapping store to complete DPI event statistics.
The invention takes full advantage of the Hadoop distributed computing platform, which is open source, efficient, stable, and highly fault tolerant, and integrates web-crawler-based deep packet inspection into it to achieve efficient in-depth traffic analysis. The invention can parse deeper protocol layers, achieves higher matching accuracy, and processes data quickly, addressing the accuracy and speed problems of big data in-depth analysis.
Brief description of the drawings
Fig. 1 is a schematic block diagram of the Hadoop-based deep packet inspection system of the invention;
Fig. 2 is a flowchart of the web crawler part of the Hadoop-based deep packet inspection system of the invention;
Fig. 3 is a flowchart of the deep packet inspection part of the Hadoop-based deep packet inspection system of the invention.
Detailed description
The deep packet inspection part is built on the Hadoop platform. It labels service flows and converts specific service flows (mainly web service flows) into DPI events. A DPI event is the in-depth recognition result for a network event (for example, user A browsed a certain video website at a certain time) and is the output of the DPI module. This part comprises the packet parsing (PA) module, the traffic classification (TC) module, and the deep packet inspection (DPI) module. The PA module performs packet parsing: it parses raw data into five-tuple flows (a five-tuple consists of source IP address, source port, destination IP address, destination port, and transport-layer protocol number) and outputs them to the TC module. The TC module labels each input five-tuple flow by service type and supplies the result to the DPI module. The DPI module converts specific service flows into DPI events, matches the events against the mapping store, and completes DPI event statistics from the matches.
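To make the five-tuple concrete, the following is a minimal Python sketch of how one could be extracted from a single packet. It assumes the input is a raw IPv4 packet with the link-layer header already stripped and that only TCP and UDP are of interest; the names `FiveTuple` and `parse_five_tuple` are illustrative and do not come from the patent.

```python
import socket
import struct
from typing import NamedTuple, Optional


class FiveTuple(NamedTuple):
    """Flow key as listed in the text (field names are illustrative)."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: int  # transport-layer protocol number, e.g. 6 = TCP, 17 = UDP


def parse_five_tuple(ip_packet: bytes) -> Optional[FiveTuple]:
    """Extract the five-tuple from a raw IPv4 packet (assumption: no link-layer header)."""
    if len(ip_packet) < 20 or ip_packet[0] >> 4 != 4:
        return None  # too short, or not IPv4
    ihl = (ip_packet[0] & 0x0F) * 4  # IPv4 header length in bytes
    protocol = ip_packet[9]
    if ihl < 20 or protocol not in (6, 17) or len(ip_packet) < ihl + 4:
        return None  # malformed header, non-TCP/UDP, or ports missing
    src_ip = socket.inet_ntoa(ip_packet[12:16])
    dst_ip = socket.inet_ntoa(ip_packet[16:20])
    # TCP and UDP both start with source port and destination port.
    src_port, dst_port = struct.unpack("!HH", ip_packet[ihl:ihl + 4])
    return FiveTuple(src_ip, src_port, dst_ip, dst_port, protocol)
```

Because `FiveTuple` is hashable, it can serve directly as a grouping key, which is the role the five-tuple plays as the MapReduce Key in the PA module output.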
The invention is further described below with reference to the drawings and a specific implementation.
Fig. 1 shows the block diagram of the Hadoop-based deep packet inspection system of the invention. The system comprises two parts: a web crawler part and a deep packet inspection part.
The web crawler part comprises a web crawler module, a document analysis module, and a database. The web crawler unit fetches pages from the Internet, and the document analysis unit analyzes each page to obtain the mapping between its uniform resource locator (URL) and its content category. The mapping store in the database is updated iteratively as pages are crawled and is used by the DPI module of the deep packet inspection part for matching.
The deep packet inspection part is built on the Hadoop platform; it labels service flows and converts specific service flows into DPI events. It comprises three modules: packet parsing (PA), traffic classification (TC), and deep packet inspection (DPI). The PA module parses packets, turning raw data into five-tuple flows as input to the TC module. The TC module labels flows: it marks each input five-tuple flow by service type, generates specific service flows, and feeds them to the DPI module. The DPI module converts the specific service flows into DPI events and matches the events against the mapping store to complete DPI event statistics. The main function of IO Format is to split and read the input and output data of each module. HDFS, the distributed storage system of Hadoop, stores the raw data and the processing results of each module.
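As an illustration of the splitting and reading role described for IO Format, the sketch below turns a capture buffer into (offset, packet) records, the key-value shape the PA module consumes. The patent does not name a capture file format; this example assumes the classic little-endian libpcap layout (24-byte global header, 16-byte per-record header), and `iter_packet_records` is a hypothetical helper name.

```python
import struct
from typing import Iterator, Tuple

PCAP_GLOBAL_HEADER_LEN = 24   # classic libpcap file header
PCAP_RECORD_HEADER_LEN = 16   # ts_sec, ts_usec, incl_len, orig_len
PCAP_MAGIC_LE = 0xA1B2C3D4    # microsecond-resolution magic, read little-endian


def iter_packet_records(capture: bytes) -> Iterator[Tuple[int, bytes]]:
    """Yield (byte_offset, packet_bytes) for each record in a pcap buffer.

    Assumption: the raw data is a little-endian classic pcap file.
    """
    if len(capture) < PCAP_GLOBAL_HEADER_LEN:
        raise ValueError("buffer shorter than a pcap global header")
    (magic,) = struct.unpack_from("<I", capture, 0)
    if magic != PCAP_MAGIC_LE:
        raise ValueError("not a little-endian microsecond pcap buffer")
    offset = PCAP_GLOBAL_HEADER_LEN
    while offset + PCAP_RECORD_HEADER_LEN <= len(capture):
        _ts_sec, _ts_usec, incl_len, _orig_len = struct.unpack_from("<IIII", capture, offset)
        start = offset + PCAP_RECORD_HEADER_LEN
        end = start + incl_len
        if end > len(capture):
            break  # truncated final record: stop rather than emit partial data
        yield offset, capture[start:end]  # Key = record offset, Value = packet content
        offset = end
```

A real Hadoop input format would also have to cope with records that straddle split boundaries; this sketch reads one contiguous buffer and leaves that problem aside.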
Fig. 2 shows the flowchart of the web crawler part of the Hadoop-based deep packet inspection system of the invention. The web crawler part has two stages, web page crawling and web page analysis, carried out as follows:
The web crawler module cyclically fetches web page files from specified websites. The document analysis module analyzes the web page files to obtain the mapping between URLs and page content categories and stores it in the database for use by the DPI module.
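The analysis stage can be pictured with the following Python sketch, which classifies already-fetched pages and updates a URL-to-category mapping. The patent does not specify how a page's category is determined, so the keyword table and the functions `classify_page` and `update_mapping` are purely hypothetical; fetching is left out, and an in-memory dict stands in for the database.

```python
from html.parser import HTMLParser
from typing import Dict, Iterable, List, Tuple

# Hypothetical category keywords; the source does not describe the classifier.
CATEGORY_KEYWORDS = {
    "video": ("video", "movie", "episode"),
    "news": ("news", "headline"),
    "shopping": ("cart", "checkout", "price"),
}


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: List[str] = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)


def classify_page(html: str) -> str:
    """Pick the category whose keywords occur most often in the page text."""
    parser = _TextExtractor()
    parser.feed(html)
    text = " ".join(parser.chunks).lower()
    scores = {
        category: sum(text.count(word) for word in words)
        for category, words in CATEGORY_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def update_mapping(mapping: Dict[str, str], pages: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """One crawl iteration: (url, html) pairs overwrite or extend the mapping store."""
    for url, html in pages:
        mapping[url] = classify_page(html)
    return mapping
```

Calling `update_mapping` once per crawl pass gives the iterative behavior described in the text: newly crawled URLs are added, and a URL whose content has changed is reassigned on the next pass.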
Fig. 3 shows the flowchart of the deep packet inspection part, which comprises three stages: packet parsing, traffic classification and labelling, and deep packet inspection. Together with the crawler stage, the procedure is as follows:
Step 1, the web crawler module cyclically fetches web page files from specified websites;
Step 2, the web page analysis module analyzes the fetched web page files to obtain the mapping between URLs and page content categories and stores it in the mapping store, for use by the DPI module in deep packet inspection;
The deep packet inspection part then performs packet parsing, traffic classification and labelling, and deep packet inspection in the following steps:
Step 3, a data collector captures the raw network data stream and stores it in the distributed storage system HDFS (Hadoop Distributed File System);
Step 4, the packet parsing module PA reads the raw data stream from HDFS and supplies it to the MapReduce programming-model unit as key-value pairs with the packet offset as the Key and the packet content as the Value; the output, with the five-tuple as the Key and the five-tuple flow plus flow feature statistics as the Value, is stored in HDFS;
Step 5, the traffic classification and labelling module TC reads the five-tuple flows from HDFS and supplies them to MapReduce as key-value pairs with the five-tuple as the Key and the five-tuple flow as the Value; the output, with five-tuple/service label as the Key and the service-labelled flow as the Value, is stored in HDFS;
Step 6, the deep packet inspection module DPI reads the specific service-labelled flows from HDFS and supplies them to MapReduce as key-value pairs with five-tuple/service label as the Key and the feature fields of the specific service flow as the Value; the output has five-tuple/service label as the Key and the DPI event as the Value;
Step 7, the DPI events are matched against the mapping store to obtain DPI statistical results, which are stored in the database for querying; in-depth data mining of the network traffic is completed on the basis of the DPI events (each comprising user, time, and action).
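The key-value chaining of steps 4 and 5 can be sketched in plain Python as below. This is an in-memory stand-in for two MapReduce jobs, not Hadoop code: `run_job` is a hypothetical helper, packets are assumed to arrive as already-decoded dicts with `five_tuple`, `ts`, and `payload` fields, and the port-based labelling rule in `tc_map` is an assumption, since the source does not say how the TC module assigns service labels.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Toy single-process MapReduce job: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    return [reducer(key, values) for key, values in groups.items()]


# PA stage: Key = packet offset, Value = packet content.
def pa_map(offset, packet):
    yield packet["five_tuple"], (offset, packet["ts"], packet["payload"])


def pa_reduce(five_tuple, items):
    items = sorted(items)  # restore capture order by offset
    packets = [(ts, payload) for _, ts, payload in items]
    stats = {"packets": len(packets), "bytes": sum(len(p) for _, p in packets)}
    # Output: Key = five-tuple, Value = five-tuple flow plus flow statistics.
    return five_tuple, {"packets": packets, "stats": stats}


WEB_PORTS = {80, 8080}  # assumption: label flows by destination port


# TC stage: Key = five-tuple, Value = five-tuple flow.
def tc_map(five_tuple, flow):
    label = "web" if five_tuple[3] in WEB_PORTS else "other"
    yield (five_tuple, label), flow


def tc_reduce(key, flows):
    # Output: Key = five-tuple/service label, Value = labelled flow.
    return key, flows[0]
```

Feeding the output list of `run_job(records, pa_map, pa_reduce)` into `run_job(..., tc_map, tc_reduce)` mirrors the way each module reads the previous module's results back from HDFS.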
In summary, a data collector captures network data and stores it as the raw data stream in HDFS, the distributed storage system of the Hadoop platform. The PA module reads the raw data stream from HDFS and supplies it to MapReduce as key-value pairs with the packet offset as the Key and the packet content as the Value; the output, with the five-tuple as the Key and the five-tuple flow plus flow feature statistics as the Value, is stored in HDFS, which completes the work of the PA module. The TC module reads the five-tuple flows from HDFS and supplies them to MapReduce as key-value pairs with the five-tuple as the Key and the five-tuple flow as the Value; the output, with five-tuple/service label as the Key and the service-labelled flow as the Value, is stored in HDFS. The DPI module reads the specific service flows from HDFS and supplies them to MapReduce as key-value pairs with five-tuple/service label as the Key and the specific service flow as the Value; the output has five-tuple/service label as the Key and the DPI event as the Value. The feature fields of the DPI events are matched against the mapping store built by the web crawler part, and in-depth data mining of the network traffic is completed on the basis of the DPI events (each comprising user, time, and action).
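The last two stages, turning web flows into DPI events and matching them against the mapping store, might look like the following Python sketch. Several details are assumptions not found in the source: events are derived from plain HTTP/1.x request lines and `Host` headers, the source IP address stands in for the user, the mapping store is a dict keyed by host plus path, and matching is an exact lookup. The function names are illustrative.

```python
from collections import Counter
from typing import Optional


def http_request_url(payload: bytes) -> Optional[str]:
    """Return host + path for a plain HTTP/1.x request, or None if not parseable."""
    head = payload.split(b"\r\n\r\n", 1)[0].decode("latin-1")
    lines = head.split("\r\n")
    parts = lines[0].split(" ")
    if len(parts) != 3 or not parts[2].startswith("HTTP/"):
        return None  # not an HTTP request line
    host = None
    for line in lines[1:]:
        name, _, value = line.partition(":")
        if name.strip().lower() == "host":
            host = value.strip()
    return host + parts[1] if host else None


def dpi_map(key, flow):
    """Key = (five_tuple, label), Value = flow; emits (key, DPI event)."""
    five_tuple, label = key
    if label != "web":
        return  # only web service flows are converted in this sketch
    for ts, payload in flow["packets"]:
        url = http_request_url(payload)
        if url is not None:
            # A DPI event carries user, time, and action (here: the visited URL).
            yield key, {"user": five_tuple[0], "time": ts, "url": url}


def match_events(events, url_categories):
    """Count events per (user, category) using the crawler's mapping store."""
    stats = Counter()
    for event in events:
        category = url_categories.get(event["url"], "unknown")
        stats[(event["user"], category)] += 1
    return stats
```

The resulting counter is one possible shape for the DPI statistics that are stored in the database for querying; a prefix or host-only lookup could replace the exact match if the mapping store is coarser than individual URLs.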
Claims (2)
1. A deep packet inspection system based on the Hadoop platform, characterized in that the system comprises a web crawler part and a deep packet inspection part. The web crawler part comprises a web crawler module, a document analysis module, and a database; the web crawler module fetches pages from the Internet, the document analysis module analyzes the pages to obtain the mapping between uniform resource locators (URLs) and page content categories, and the mapping store in the database is updated iteratively as pages are crawled. The deep packet inspection part comprises a packet parsing (PA) module, a traffic classification (TC) module, and a deep packet inspection (DPI) module; the PA module parses raw data into five-tuple flows and feeds them to the TC module, the TC module labels each input five-tuple flow by service type and generates specific service flows as input to the DPI module, and the DPI module converts the specific service flows into DPI events and matches the DPI events against the mapping store to complete DPI event statistics. Specifically: a data collector captures the raw network data stream and stores it in the distributed storage system HDFS; the PA module reads the raw data stream from HDFS and supplies it to the MapReduce programming-model unit as key-value pairs with the packet offset as the key (Key) and the packet content as the value (Value), and the output, with the five-tuple as Key and the five-tuple flow and flow feature statistics as Value, is stored in HDFS, whereby the raw data is parsed into five-tuple flows that are input to the TC module; the TC module reads the five-tuple flows from HDFS and supplies them to MapReduce as key-value pairs with the five-tuple as Key and the five-tuple flow as Value, and the output, with five-tuple/service label as Key and the service-labelled flow as Value, is stored in HDFS; the DPI module reads the specific service-labelled flows from HDFS and supplies them to MapReduce as key-value pairs with five-tuple/service label as Key and the feature fields of the specific service flow as Value, and the output has five-tuple/service label as Key and the DPI event as Value, whereby the specific service flows are converted into DPI events; the DPI events are matched against the mapping store to obtain DPI statistical results, which are stored in the database for querying; and in-depth data mining of the network traffic is completed on the basis of the DPI events.
2. A deep packet inspection method based on the Hadoop platform, characterized by comprising the following steps: the web crawler module cyclically fetches web page files from specified websites; the document analysis module analyzes the web page files to obtain the mapping between URLs and page content categories and stores it in the database, and the mapping store in the database is updated iteratively as pages are crawled; the packet parsing (PA) module parses raw data into five-tuple flows and feeds them to the traffic classification (TC) module; the TC module labels each input five-tuple flow by service type and generates specific service flows as input to the deep packet inspection (DPI) module; the DPI module converts the specific service flows into DPI events and matches the DPI events against the mapping store to complete DPI event statistics. Specifically: a data collector captures the raw network data stream and stores it in the distributed storage system HDFS; the PA module reads the raw data stream from HDFS and supplies it to the MapReduce programming-model unit as key-value pairs with the packet offset as the key (Key) and the packet content as the value (Value), and the output, with the five-tuple as Key and the five-tuple flow and flow feature statistics as Value, is stored in HDFS, whereby the raw data is parsed into five-tuple flows that are input to the TC module; the TC module reads the five-tuple flows from HDFS and supplies them to MapReduce as key-value pairs with the five-tuple as Key and the five-tuple flow as Value, and the output, with five-tuple/service label as Key and the service-labelled flow as Value, is stored in HDFS; the DPI module reads the specific service-labelled flows from HDFS and supplies them to MapReduce as key-value pairs with five-tuple/service label as Key and the feature fields of the specific service flow as Value, and the output has five-tuple/service label as Key and the DPI event as Value, whereby the specific service flows are converted into DPI events; the DPI events are matched against the mapping store to obtain DPI statistical results, which are stored in the database for querying; and in-depth data mining of the network traffic is completed on the basis of the DPI events.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410317160.9A CN104156389B (en) | 2014-07-04 | 2014-07-04 | Deep-packet detection system and method based on Hadoop platform |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410317160.9A CN104156389B (en) | 2014-07-04 | 2014-07-04 | Deep-packet detection system and method based on Hadoop platform |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104156389A (en) | 2014-11-19 |
CN104156389B (en) | 2017-12-26 |
Family
ID=51881893
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410317160.9A Active CN104156389B (en) | 2014-07-04 | 2014-07-04 | Deep-packet detection system and method based on Hadoop platform |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104156389B (en) |
Families Citing this family (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104486116A (en) * | 2014-12-12 | 2015-04-01 | 北京百度网讯科技有限公司 | Multidimensional query method and multidimensional query system of flow data |
CN104486157A (en) * | 2014-12-16 | 2015-04-01 | 国家电网公司 | Information system performance detecting method based on deep packet analysis |
CN105812324B (en) * | 2014-12-30 | 2019-04-05 | 华为技术有限公司 | The method, apparatus and system of IDC information security management |
CN104636434A (en) * | 2014-12-31 | 2015-05-20 | 百度在线网络技术(北京)有限公司 | Search result processing method and device |
CN106303751B (en) * | 2015-05-18 | 2020-06-30 | 中兴通讯股份有限公司 | Method and system for realizing directional flow packet |
CN105117649B (en) * | 2015-07-30 | 2018-11-30 | 中国科学院计算技术研究所 | A kind of anti-virus method and system for virtual machine |
CN107948266A (en) * | 2017-11-17 | 2018-04-20 | 武汉绿色网络信息服务有限责任公司 | The processing method and system of HTTP uplink traffics in asymmetric routed environment |
CN108171887B (en) * | 2017-12-20 | 2020-10-20 | 新华三技术有限公司 | Method and device for charging electric quantity |
CN109829094A (en) * | 2019-01-23 | 2019-05-31 | 钟祥博谦信息科技有限公司 | Distributed reptile system |
CN110990669A (en) * | 2019-10-16 | 2020-04-10 | 广州丰石科技有限公司 | DPI (deep packet inspection) analysis method and system based on rule generation |
CN111641531B (en) * | 2020-05-12 | 2021-08-17 | 国家计算机网络与信息安全管理中心 | DPDK-based data packet distribution and feature extraction method |
CN112272123B (en) * | 2020-10-16 | 2022-04-15 | 北京锐安科技有限公司 | Network traffic analysis method, system, device, electronic equipment and storage medium |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101414939A (en) * | 2008-11-28 | 2009-04-22 | 武汉虹旭信息技术有限责任公司 | Internet application recognition method based on dynamical depth package detection |
CN101741744A (en) * | 2009-12-17 | 2010-06-16 | 东南大学 | Network flow identification method |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120182891A1 (en) * | 2011-01-19 | 2012-07-19 | Youngseok Lee | Packet analysis system and method using hadoop based parallel computation |
Non-Patent Citations (2)
Title |
---|
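Research on Hadoop-based deep packet inspection technology (基于Hadoop的深度包检测技术的研究); Wei Jun (魏军); China Master's Theses Full-text Database; 2013-12-15 (Issue S2); see Sections 5.1-5.3 * |
Research on deep packet inspection based on parallel Bloom filter groups (基于并行Bloom过滤器组的深度包检测研究); Hu Guoliang (胡国良); China Master's Theses Full-text Database; 2014-06-15 (Issue 06); full text * |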
Also Published As
Publication number | Publication date |
---|---|
CN104156389A (en) | 2014-11-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104156389B (en) | Deep-packet detection system and method based on Hadoop platform | |
WO2020119662A1 (en) | Network traffic classification method | |
CN104022920B (en) | A kind of LTE network flux recognition system and method | |
CN101741744B (en) | Network flow identification method | |
CN101645806B (en) | Network flow classifying system and network flow classifying method combining DPI and DFI | |
CN108259371A (en) | A kind of network flow data analysis method and device based on stream process | |
CN102035698B (en) | HTTP tunnel detection method based on decision tree classification algorithm | |
CN103942335B (en) | Construction method of uninterrupted crawler system oriented to web page structure change | |
CN104899324B (en) | One kind monitoring systematic sample training system based on IDC harmful informations | |
CN101593200A (en) | Chinese Web page classification method based on the keyword frequency analysis | |
CN109766525A (en) | A kind of sensitive information leakage detection framework of data-driven | |
WO2017071179A1 (en) | Method and apparatus for recognizing user behaviour object based on flow analysis | |
TW201214169A (en) | Recognition of target words using designated characteristic values | |
CN106330584A (en) | Identification method and identification device of business flow | |
CN106155817A (en) | Business information processing method, server and system | |
CN102542061A (en) | Intelligent product classification method | |
CN107465643A (en) | A kind of net flow assorted method of deep learning | |
CN105808722A (en) | Information discrimination method and system | |
CN109275045B (en) | DFI-based mobile terminal encrypted video advertisement traffic identification method | |
CN110020161B (en) | Data processing method, log processing method and terminal | |
CN103491089A (en) | Transcoding method and system of data recovery based on HTTP | |
CN112381119A (en) | Multi-scene classification method and system based on decentralized application encryption flow characteristics | |
CN107480270A (en) | A kind of real time individual based on user feedback data stream recommends method and system | |
CN107832344A (en) | A kind of food security Internet public opinion analysis method based on storm stream calculation frameworks | |
WO2016201876A1 (en) | Service identification method and device for encrypted traffic, and computer storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant |