
Deep packet detection device, webpage data processing method, and webpage data acquisition method and system

Info

Publication number
CN101997915A
Authority
CN
Grant status
Application
Patent type
Prior art keywords
webpage
data
acquisition
method
http
Prior art date
Application number
CN 201010532086
Other languages
Chinese (zh)
Other versions
CN101997915B (en)
Inventor
杨俊�
蒋丹舟
蔡逆水
陈强
Original Assignee
中国电信股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Abstract

The invention discloses a web page data processing method, a web page data collection method, a deep packet inspection device, and a web page data collection system. The web page data collection method comprises the following steps: selectively capturing Hypertext Transfer Protocol (HTTP) messages flowing to a web server according to a web page address information library; parsing the content of the captured HTTP messages; extracting the content of the tag field in the HTTP messages; and selectively collecting the data in the captured HTTP messages according to the content of the tag field. The invention combines deep packet inspection with web page data collection, which improves the efficiency of collecting and analyzing web page data and reduces the cost of collecting and analyzing massive data. Because tag fields are used, web page data can also be collected more accurately.

Description

Deep packet inspection device, web page data processing method, web page data collection method, and web page data collection system

Technical Field

[0001] The present invention relates to the field of Internet technologies, and in particular to a deep packet inspection device, a web page data processing method, a web page data collection method, and a web page data collection system.

Background

[0002] With the rapid development of Web technologies and Web applications, centralized monitoring, user data collection, and statistical analysis are increasingly applied to all kinds of Web application sites, especially platforms such as electronic channels and e-commerce. However, because such platforms have very large user bases, their user data is massive, so in practice the data must be collected selectively.

[0003] Existing web pages, however, were not designed with data collection in mind, and they commonly suffer from problems such as disorganized page addresses, disorganized collected data, and low accuracy. It is therefore difficult to collect data efficiently and accurately from existing web pages.

Summary of the Invention

[0004] A technical problem to be solved by the present invention is to provide a deep packet inspection device, a web page data processing method, a web page data collection method, and a web page data collection system that can collect web page data efficiently and accurately.

[0005] According to one aspect of the present invention, a web page data processing method is provided, comprising: determining, according to data collection requirements, the data collection scope of the HTTP message of each web page; and adding a tag field to the HTTP message of each web page, the content of the tag field indicating the data collection scope of the HTTP message of the web page.

[0006] According to one embodiment of the web page data processing method of the present invention, the tag field is set in a header field of the HTTP message of each web page.

[0007] According to another embodiment of the web page data processing method of the present invention, the data collection scope of the HTTP message of a web page includes extracting all data in the HTTP message, extracting part of the data in the HTTP message, and extracting no data from the HTTP message.

[0008] According to another aspect of the present invention, a web page data collection method is provided, comprising: selectively capturing HTTP messages flowing to a web server according to a web page address information library; parsing the content of the captured HTTP messages; extracting the content of the tag field in the HTTP messages; and selectively collecting the data in the captured HTTP messages according to the content of the tag field.

[0009] According to one embodiment of the web page data collection method of the present invention, the HTTP messages flowing to the web server are formed by the following steps: determining, according to data collection requirements, the data collection scope of the HTTP message of each web page; and adding a tag field to the HTTP message of each web page to form the HTTP message flowing to the web server, wherein the content of the tag field indicates the data collection scope of the HTTP message of the web page.

[0010] According to another embodiment of the web page data collection method of the present invention, the tag field is set in a header field of the HTTP message of each web page.

[0011] According to yet another embodiment of the web page data collection method of the present invention, the data collection scope of the HTTP message of a web page includes extracting all data in the HTTP message, extracting part of the data in the HTTP message, and extracting no data from the HTTP message.

[0012] According to yet another aspect of the present invention, a deep packet inspection device is provided, comprising: an address filtering module, configured to selectively capture HTTP messages flowing to a web server according to a web page address information library; a message parsing module, connected to the address filtering module and configured to parse the content of the captured HTTP messages; a tag content extraction module, connected to the message parsing module and configured to extract the content of the tag field in the HTTP messages, wherein the content of the tag field indicates the data collection scope of the HTTP message of a web page; and a data collection module, connected to the tag content extraction module and configured to selectively collect the data in the captured HTTP messages according to the content of the tag field.

[0013] According to one embodiment of the deep packet inspection device of the present invention, the tag field is set in a header field of the HTTP message flowing to the web server.

[0014] According to another embodiment of the deep packet inspection device of the present invention, the data collection scope of the HTTP message of a web page includes extracting all data in the HTTP message, extracting part of the data in the HTTP message, and extracting no data from the HTTP message.

[0015] According to a further aspect of the present invention, a web page data collection system is provided, comprising the deep packet inspection device of the above embodiments and a web page data processing device, wherein the web page data processing device comprises: a collection scope determination module, configured to determine, according to data collection requirements, the data collection scope of the HTTP message of each web page; and a data processing module, connected to the collection scope determination module and configured to add a tag field to the HTTP message of each web page to form the HTTP message flowing to the web server, wherein the content of the tag field indicates the data collection scope of the HTTP message of the web page.

[0016] The deep packet inspection device, web page data processing method, web page data collection method, and web page data collection system provided by the present invention combine Deep Packet Inspection (DPI) technology with web page data collection technology, which improves the efficiency of collecting web page data and reduces the cost of collecting and analyzing massive data. In addition, because a tag field is used, the data collection scope of a web page can be determined more accurately, which improves the accuracy of data collection.

Brief Description of the Drawings

[0017] The drawings described here provide a further understanding of the present invention and form part of this application. In the drawings:

[0018] FIG. 1 is a schematic flowchart of one embodiment of the web page data processing method of the present invention.

[0019] FIG. 2 is a schematic flowchart of one embodiment of the web page data collection method of the present invention.

[0020] FIG. 3 is a schematic flowchart of yet another embodiment of the web page data collection method of the present invention.

[0021] FIG. 4 is a schematic structural diagram of one embodiment of the deep packet inspection device of the present invention.

[0022] FIG. 5 is a schematic structural diagram of one embodiment of the web page data collection system of the present invention.

Detailed Description

[0023] The present invention is described more fully below with reference to the accompanying drawings, which illustrate exemplary embodiments of the invention. The exemplary embodiments and their description serve to explain the present invention and do not unduly limit it.

[0024] The following description of at least one exemplary embodiment is merely illustrative and in no way limits the present invention, its application, or its uses.

[0025] The present invention combines DPI technology with Web page data collection technology. Based on an analysis of the principle of DPI selective data collection, and in order to improve the efficiency of collection and analysis, it proposes a web page data processing method, a web page data collection method, a deep packet inspection device, and a web page data collection system that facilitate DPI collection.

[0026] For DPI selective collection, a library that stores the page addresses to be collected must first be built. Each request that reaches the server is first looked up in this library; if the address of the web page matches a page address in the library, the content of the page is extracted.
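The address lookup described in paragraph [0026] can be illustrated with a minimal Python sketch. It is not taken from the patent: the library contents, the function names `normalize` and `should_capture`, and the choice of exact host-and-path matching are all assumptions made for illustration. The patent only requires that the page address "match" an entry in the library.

```python
from urllib.parse import urlsplit

# Hypothetical address library: (host, path) pairs of the pages to be collected.
ADDRESS_LIBRARY = {
    ("202.23.24.153", "/news/sports"),
    ("202.23.24.153", "/shop/orders"),
}


def normalize(url):
    """Reduce a URL to (host, path) so query strings and a trailing slash do not affect matching."""
    parts = urlsplit(url)
    return (parts.hostname or "", parts.path.rstrip("/") or "/")


def should_capture(url):
    """Return True when the requested page address is stored in the library."""
    return normalize(url) in ADDRESS_LIBRARY
```

With this assumed matching rule, `should_capture("http://202.23.24.153/news/sports?id=7")` is true, while a request for `/news/finance` on the same host is skipped before any parsing takes place.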

[0027] FIG. 1 is a schematic flowchart of one embodiment of the web page data processing method of the present invention.

[0028] As shown in FIG. 1, this embodiment may include the following steps:

[0029] S102: determine, according to data collection requirements, the scope of data collection for the HTTP message of each web page;

[0030] S104: add a tag field to the HTTP message of each web page, the content of the tag field indicating the scope of data collection for the HTTP message of the web page. The tag field may be located anywhere in the HTTP message; preferably, it is set in a header field of the HTTP message of each web page.

[0031] In addition, the scope of data collection for the HTTP message of a web page may include extracting all data in the HTTP message (that is, all data from the message header to the message tail), extracting part of the data in the HTTP message (for example, IP address, user name, page address, access time, login category, and page parameters), and extracting no data from the HTTP message.

[0032] This embodiment takes DPI technology into account while processing web page data and, based on an analysis of the principle of DPI selective data collection, proposes a web page data processing method that facilitates DPI collection. The embodiment can significantly improve the efficiency of web page data collection and improve the accuracy of the collected data.

[0033] In another embodiment of the web page data processing method of the present invention, page addresses are first standardized (for example, http://202.23.24.153/news/sports denotes the sports content within news). The pages of the Web site are then divided into different levels corresponding to different data collection scopes, and a tag field whose content corresponds to the data collection scope is added to the HTTP message of each page. According to the RFC protocol specification, custom fields can be embedded in the header of an HTTP message as the application requires. Custom HTTP header field information (that is, the tag field) can therefore be embedded when the electronic-channel web pages are implemented. Different custom information is embedded in pages for different data collection requirements, which classifies the pages into levels and prepares them for data collection.
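Steps S102 and S104 can be sketched as follows. This is a hedged illustration, not the patent's implementation: the mapping `PAGE_SCOPES`, the function `tag_request`, the header name `X-type`, and the value `"2"` meaning "collect nothing" are assumptions chosen for the example.

```python
# S102 (assumed form): map each standardized page path to a collection scope.
PAGE_SCOPES = {
    "/news/sports": "0",    # assumed value: collect the whole message
    "/account/login": "1",  # assumed value: collect part of the message
}
DEFAULT_SCOPE = "2"  # assumed value: collect nothing


def tag_request(method, path, host, headers=None):
    """S104: build an HTTP/1.1 request whose header section carries the tag field."""
    scope = PAGE_SCOPES.get(path, DEFAULT_SCOPE)
    lines = [f"{method} {path} HTTP/1.1", f"Host: {host}"]
    for name, value in (headers or {}).items():
        lines.append(f"{name}: {value}")
    lines.append(f"X-type: {scope}")  # custom header used as the tag field
    # A blank line terminates the header section of an HTTP/1.1 message.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")
```

Placing the tag in the header section means a capture device only has to read the start of each message to decide how much of it to collect.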

[0034] FIG. 2 is a schematic flowchart of one embodiment of the web page data collection method of the present invention.

[0035] As shown in FIG. 2, this embodiment may include the following steps:

[0036] S202: selectively capture HTTP messages flowing to a web server according to a web page address information library. The library may store the page addresses of the web pages to be captured. A page flowing to the web server is captured, and then parsed and its data extracted, only when its address meets the requirements of the library (for example, the page address is stored in the library);

[0037] S204: parse the content of the captured HTTP messages;

[0038] S206: extract the content of the tag field in the HTTP messages;

[0039] S208: selectively collect the data in the captured HTTP messages according to the content of the tag field.

[0040] The HTTP messages flowing to the web server may be formed by the following steps: determining, according to data collection requirements, the scope of data collection for the HTTP message of each web page; and adding a tag field to the HTTP message of each web page to form the HTTP message flowing to the web server, wherein the content of the tag field indicates the scope of data collection for the HTTP message of the web page.

[0041] In one example, the scope of data collection for the HTTP message of a web page may include extracting all data in the HTTP message (that is, all data from the message header to the message tail), extracting part of the data in the HTTP message (for example, IP address, user name, page address, access time, login category, and page parameters), and extracting no data from the HTTP message.

[0042] Optionally, the tag field may be located anywhere in the HTTP message of each web page; preferably, it is set in a header field of the HTTP message of each web page.

[0043] During data collection, this embodiment first filters the pages to be collected according to the web page address information library, which greatly reduces interference from massive data. The embodiment also parses the HTTP header fields of the captured pages, extracts the content of the custom header tag, and follows a different collection and extraction procedure according to that content: for example, it may extract the whole HTTP message, part of it, or nothing. This achieves selective data collection, relieves the technical and cost pressure caused by massive data, and improves the efficiency and accuracy of data collection.

[0044] In another embodiment of the web page data collection method of the present invention, HTTP messages flowing to the Web site server are parsed according to the RFC protocol specification, and specific information is extracted from the corresponding positions of the data to be collected according to the parsed tag field content. Specifically, when processing HTTP, the DPI device parses the corresponding custom header field content (that is, the content of the tag field) and invokes a different data collection procedure according to the definition of that content, so as to extract the web page data. The custom content embedded in an HTTP header field consists of two parts, a tag and a content. By convention, the custom header field may begin with "X-". For example, "X-type: 0" may indicate extracting the entire content of the HTTP message, and "X-type: 1" may indicate extracting only the URL. Depending on the levels of data collection required, one or more custom header tags may be defined and given different contents, each representing different data to extract.
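The tag-driven dispatch in paragraph [0044] might look like the sketch below. The header name `X-type` and the meanings of the values `0` and `1` come from the example in the text. The function names, the hand-written parser, and the decision to return `None` for untagged or unknown messages are assumptions made for illustration.

```python
def parse_request(raw):
    """Split a raw HTTP/1.x request into request line, headers, and body."""
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    method, target, version = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()  # header names are case-insensitive
    return {"method": method, "target": target, "headers": headers, "body": body}


def collect_all(msg):
    """X-type: 0, keep the entire parsed message."""
    return msg


def collect_url(msg):
    """X-type: 1, keep only the URL (rebuilt here from Host and the request target)."""
    return {"url": "http://" + msg["headers"].get("host", "") + msg["target"]}


# Hypothetical routine table keyed by the content of the tag field.
ROUTINES = {"0": collect_all, "1": collect_url}


def collect(raw):
    """Parse the message, read the tag field, and call the matching routine."""
    msg = parse_request(raw)
    routine = ROUTINES.get(msg["headers"].get("x-type"))
    return routine(msg) if routine else None  # unknown or missing tag: collect nothing
```

Keeping the routines in a table means a new collection level only needs a new entry and a new function, and the parsing step stays unchanged.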

[0045] FIG. 3 is a schematic flowchart of yet another embodiment of the web page data collection method of the present invention.

[0046] As shown in FIG. 3, this embodiment may include the following steps:

[0047] S302: set up a DPI collection system and mirror the data of the target site;

[0048] S304: build a web page address information library that stores the addresses of the web pages to be captured;

[0049] S306: build a selective parsing content depth information library that stores the data collection and parsing subroutines corresponding to the different custom tags, for example, the full data collection and parsing subroutine used to extract the entire content of an HTTP message, and the partial data collection and parsing subroutine used to extract part of the content of an HTTP message;

[0050] S308: selectively capture pages flowing to the web server according to the web page address information library;

[0051] S310: store the captured data;

[0052] S312: parse the content of the HTTP messages of the captured pages, and selectively collect the data in the captured HTTP messages according to the content of the tag field in the HTTP messages;

[0053] S314: store the parsed data by category.
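Steps S304 to S314 can be sketched end to end as follows. This is a simplified illustration under stated assumptions: the mirrored traffic of S302 is simulated by a list of already-parsed messages (dictionaries with `path`, `headers`, and `body` keys), and the library contents, the `X-type` values, and the function `run_pipeline` are hypothetical.

```python
from collections import defaultdict

# S304: address library of the pages to be captured (hypothetical entries).
address_library = {"/news/sports", "/shop/orders"}

# S306: library of collection subroutines keyed by tag content (assumed values).
routine_library = {
    "0": lambda msg: dict(msg),              # full collection
    "1": lambda msg: {"path": msg["path"]},  # partial collection
}


def run_pipeline(mirrored_messages):
    """Run S308 to S314 over an iterable of mirrored, already-parsed messages."""
    raw_store = []                  # S310: captured data
    classified = defaultdict(list)  # S314: parsed data stored by category (tag value)
    for msg in mirrored_messages:
        if msg["path"] not in address_library:  # S308: selective capture
            continue
        raw_store.append(msg)
        tag = msg["headers"].get("X-type")
        routine = routine_library.get(tag)      # S312: choose the routine by tag
        if routine is not None:
            classified[tag].append(routine(msg))
    return raw_store, dict(classified)
```

In this sketch a captured message without a recognized tag is kept in the raw store but contributes nothing to the classified store, which corresponds to the "extract no data" scope.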

[0054] FIG. 4 is a schematic structural diagram of one embodiment of the deep packet inspection device of the present invention.

[0055] As shown in FIG. 4, the deep packet inspection device 10 of this embodiment may include:

[0056] an address filtering module 11, configured to selectively capture HTTP messages flowing to a web server according to a web page address information library;

[0057] a message parsing module 12, connected to the address filtering module and configured to parse the content of the captured HTTP messages;

[0058] a tag content extraction module 13, connected to the message parsing module and configured to extract the content of the tag field in the HTTP messages, wherein the content of the tag field indicates the data collection scope of the HTTP message of a web page. Optionally, the data collection scope of the HTTP message of a web page may include extracting all data in the HTTP message, extracting part of the data in the HTTP message, and extracting no data from the HTTP message;

[0059] a data collection module 14, connected to the tag content extraction module and configured to selectively collect the data in the captured HTTP messages according to the content of the tag field.

[0060] Optionally, the tag field may be set in a header field of the HTTP message flowing to the web server.

[0061] During data collection, this embodiment first filters the pages to be collected according to their page addresses, which greatly reduces the amount of massive data that has to be processed. The embodiment also parses the HTTP header fields of the captured pages, extracts the content of the custom header tag, and follows a different collection and extraction procedure according to that content: it may extract the whole HTTP message, part of it, or nothing. This achieves selective data collection, relieves the technical and cost pressure caused by massive data, and improves the efficiency and accuracy of data collection.

[0062] FIG. 5 is a schematic structural diagram of one embodiment of the web page data collection system of the present invention.

[0063] As shown in FIG. 5, the web page data collection system of this embodiment may include the deep packet inspection device 10 of the foregoing embodiment and a web page data processing device 21, wherein the web page data processing device 21 includes:

[0064] a collection scope determination module 211, configured to determine, according to data collection requirements, the data collection scope of the HTTP message of each web page;

[0065] a data processing module 212, connected to the collection scope determination module and configured to add a tag field to the HTTP message of each web page to form the HTTP message flowing to the web server, wherein the content of the tag field indicates the data collection scope of the HTTP message of the web page.

[0066] Although some specific embodiments of the present invention have been described in detail by way of example, those skilled in the art should understand that the above examples are for illustration only and are not intended to limit the scope of the invention. Those skilled in the art should also understand that the above embodiments may be modified without departing from the scope and spirit of the invention. The scope of the invention is defined by the appended claims.

Claims (11)

  1. A web page data processing method, characterized by comprising: determining, according to data collection requirements, the data collection scope of the HTTP message of each web page; and adding a tag field to the HTTP message of each web page, the content of the tag field indicating the data collection scope of the HTTP message of the web page.
  2. The method according to claim 1, characterized in that the tag field is set in a header field of the HTTP message of each web page.
  3. The method according to claim 1, characterized in that the data collection scope of the HTTP message of the web page includes extracting all data in the HTTP message, extracting part of the data in the HTTP message, and extracting no data from the HTTP message.
  4. A web page data collection method, characterized by comprising: selectively capturing HTTP messages flowing to a web server according to a web page address information library; parsing the content of the captured HTTP messages; extracting the content of the tag field in the HTTP messages; and selectively collecting the data in the captured HTTP messages according to the content of the tag field.
  5. The method according to claim 4, characterized in that the HTTP messages flowing to the web server are formed by the following steps: determining, according to data collection requirements, the data collection scope of the HTTP message of each web page; and adding the tag field to the HTTP message of each web page to form the HTTP message flowing to the web server, wherein the content of the tag field indicates the data collection scope of the HTTP message of the web page.
  6. The method according to claim 4 or 5, characterized in that the tag field is set in a header field of the HTTP message of each web page.
  7. The method according to claim 5, characterized in that the data collection scope of the HTTP message of the web page includes extracting all data in the HTTP message, extracting part of the data in the HTTP message, and extracting no data from the HTTP message.
  8. A deep packet inspection device, characterized by comprising: an address filtering module, configured to selectively capture HTTP messages flowing to a web server according to a web page address information library; a message parsing module, connected to the address filtering module and configured to parse the content of the captured HTTP messages; a tag content extraction module, connected to the message parsing module and configured to extract the content of the tag field in the HTTP messages, wherein the content of the tag field indicates the data collection scope of the HTTP message of a web page; and a data collection module, connected to the tag content extraction module and configured to selectively collect the data in the captured HTTP messages according to the content of the tag field.
  9. The device according to claim 8, characterized in that the tag field is set in a header field of the HTTP message flowing to the web server.
  10. The device according to claim 8, characterized in that the data collection scope of the HTTP message of the web page includes extracting all data in the HTTP message, extracting part of the data in the HTTP message, and extracting no data from the HTTP message.
  11. A web page data collection system, characterized by comprising the deep packet inspection device according to any one of claims 8 to 10 and a web page data processing device, wherein the web page data processing device comprises: a collection scope determination module, configured to determine, according to data collection requirements, the data collection scope of the HTTP message of each web page; and a data processing module, connected to the collection scope determination module and configured to add the tag field to the HTTP message of each web page to form the HTTP message flowing to the web server, wherein the content of the tag field indicates the data collection scope of the HTTP message of the web page.
CN 201010532086 2010-10-29 2010-10-29 Deep packet detection device, webpage data processing method, and webpage data acquisition method and system CN101997915B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201010532086 CN101997915B (en) 2010-10-29 2010-10-29 Deep packet detection device, webpage data processing method, and webpage data acquisition method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201010532086 CN101997915B (en) 2010-10-29 2010-10-29 Deep packet detection device, webpage data processing method, and webpage data acquisition method and system

Publications (2)

Publication Number Publication Date
CN101997915A (en) 2011-03-30
CN101997915B (en) 2014-01-08

Family

ID=43787485

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201010532086 CN101997915B (en) 2010-10-29 2010-10-29 Deep packet detection device, webpage data processing method, and webpage data acquisition method and system

Country Status (1)

Country Link
CN (1) CN101997915B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103888307A (en) * 2012-12-20 2014-06-25 中国电信股份有限公司 Method, user side board card and broadband access gateway used for optimizing deep packet detection
CN104486157A (en) * 2014-12-16 2015-04-01 国家电网公司 Information system performance detecting method based on deep packet analysis

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020091755A1 (en) * 2000-06-30 2002-07-11 Attila Narin Supplemental request header for applications or devices using web browsers
CN1402156A (en) * 2001-08-22 2003-03-12 威瑟科技股份有限公司 Web site information extracting system and method
WO2007101478A1 (en) * 2006-03-09 2007-09-13 Tecs Research And Development Limited A method of monitoring online banner activity
CN101094135A (en) * 2006-06-23 2007-12-26 腾讯科技(深圳)有限公司 Method and system for extracting information of content in Internet
CN101399749A (en) * 2007-09-27 2009-04-01 华为技术有限公司 Method, system and device for packet filtering
CN101556609A (en) * 2009-05-19 2009-10-14 杭州信杨通信技术有限公司 Customer behavior analysis and service system based on web contents
CN101667182A (en) * 2008-09-05 2010-03-10 华为技术有限公司 Method, system and device for performing secondary operation on web pages

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020091755A1 (en) * 2000-06-30 2002-07-11 Attila Narin Supplemental request header for applications or devices using web browsers
CN1402156A (en) * 2001-08-22 2003-03-12 威瑟科技股份有限公司 Web site information extracting system and method
WO2007101478A1 (en) * 2006-03-09 2007-09-13 Tecs Research And Development Limited A method of monitoring online banner activity
CN101094135A (en) * 2006-06-23 2007-12-26 腾讯科技(深圳)有限公司 Method and system for extracting information of content in Internet
CN101399749A (en) * 2007-09-27 2009-04-01 华为技术有限公司 Method, system and device for packet filtering
CN101667182A (en) * 2008-09-05 2010-03-10 华为技术有限公司 Method, system and device for performing secondary operation on web pages
CN101556609A (en) * 2009-05-19 2009-10-14 杭州信杨通信技术有限公司 Customer behavior analysis and service system based on web contents

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
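Ministry of Industry and Information Technology of the People's Republic of China: "Communications Industry Standard of the People's Republic of China", 15 June 2009, article "Technical Requirements for Deep Packet Inspection Equipment" *
Han Shuren et al.: "Remote Real-Time Data Acquisition Based on an Embedded Web Server", Computer Technology and Development (《计算机技术与发展》), vol. 18, no. 1, 31 January 2008 (2008-01-31) *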

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103888307A (en) * 2012-12-20 2014-06-25 中国电信股份有限公司 Method, user side board card and broadband access gateway used for optimizing deep packet detection
CN103888307B (en) * 2012-12-20 2017-11-17 中国电信股份有限公司 A method for optimizing deep packet inspection, and a user-side board Broadband Access
CN104486157A (en) * 2014-12-16 2015-04-01 国家电网公司 Information system performance detecting method based on deep packet analysis

Also Published As

Publication number Publication date Type
CN101997915B (en) 2014-01-08 grant

Similar Documents

Publication Publication Date Title
US20080294647A1 (en) Methods and apparatus to monitor content distributed by the internet
US20130086677A1 (en) Method and device for detecting phishing web page
CN101369276A (en) Evidence obtaining method for Web browser caching data
CN102098327A (en) Method and device for downloading online video sniffer
CN101655868A (en) Network data mining method, network data transmitting method and equipment
CN101715004A (en) Internet video-oriented distributed acquisition method and system
CN102801697A (en) Malicious code detection method and system based on plurality of URLs (Uniform Resource Locator)
CN101231661A (en) Method and system for digging object grade knowledge
US20130013583A1 (en) Online video tracking and identifying method and system
CN102624703A (en) Method and device for filtering uniform resource locators (URLs)
CN101763425A (en) Universal method for capturing webpage contents of any webpage
CN102739679A (en) URL(Uniform Resource Locator) classification-based phishing website detection method
CN102254027A (en) Method for obtaining webpage contents in batch
CN101937469A (en) Information capture method of video website
CN101902484A (en) Method and system for classifying local area network http application services
CN102710795A (en) Hotspot collecting method and device
CN101261643A (en) Website page information statistical method and apparatus
CN102819723A (en) Method and system for detecting malicious two-dimension codes
CN103179095A (en) Method and client device for detecting phishing websites
CN102790762A (en) Phishing website detection method based on uniform resource locator (URL) classification
CN102222187A (en) Domain name structural feature-based hang horse web page detection method
US20130198391A1 (en) System And Method For Main Page Identification In Web Decoding
CN102664935A (en) Method and system for associated output of WEB class user behavior and user information
CN102867053A (en) Method, device and system for collecting effective information web pages in website information
CN102063484A (en) Discovery method and device of third-party WEB application program

Legal Events

Date Code Title Description
C06 Publication
C10 Entry into substantive examination
C14 Grant of patent or utility model