CN104899323A

CN104899323A - Crawler system used for IDC harmful information monitoring platform

Info

Publication number: CN104899323A
Application number: CN201510343175.7A
Authority: CN
Inventors: 彭光辉; 屈立笳; 陶磊; 苏礼刚; 林伟
Original assignee: CHENGDU GOLDTEL INDUSTRY GROUP Co Ltd
Current assignee: CHENGDU GOLDTEL INDUSTRY GROUP Co Ltd
Priority date: 2015-06-19
Filing date: 2015-06-19
Publication date: 2015-09-09
Anticipated expiration: 2035-06-19
Also published as: CN104899323B

Abstract

The invention discloses a crawler system used for an IDC harmful information monitoring platform. The crawler system used for the IDC harmful information monitoring platform comprises one or more crawler clusters, wherein each crawler cluster comprises multiple crawler nodes and a crawler root node which form a distributed data acquisition network; each crawler root node is used for controlling and managing the crawler nodes in each crawler cluster; each crawler node is used for acquiring harmful information in the network and comprises a multithreading webpage acquisition module, a webpage library, a code identification and processing module, a webpage content automatic extraction module, a URL (Uniform Resource Locator) filter, a URL deduplication module and a URL scheduling module. The crawler system used for the IDC harmful information monitoring platform provides a powerful data collection function, and the dynamic webpage and static webpage are monitored comprehensively in real time through multiple crawler clusters.

Description

A crawler system for an IDC harmful information monitoring platform

Technical field

The present invention relates to a crawler system for an IDC harmful information monitoring platform.

Background art

With the rapid development of networks, the World Wide Web has become the carrier of a vast amount of information, and effectively extracting and using this information has become a major challenge. Search engines, as tools that help people retrieve information, have become the entrance and guide through which users access the World Wide Web. However, these general-purpose search engines have certain limitations.

In an increasingly active online community environment, every netizen may become a publisher and disseminator of harmful information, and the channels through which harmful information spreads are becoming ever broader, including blogs, news, forums, microblogs and other channels. The web crawler is the foundational technology that makes search engines possible; the arrival of the big data era and the rapid development of Internet technology have given web crawlers even greater research significance. To cope with a series of challenges such as the sharp growth in web data volume, short update cycles of online text and dynamically changing web page structures, efficient web crawlers that work without interruption have become a research hotspot in harmful information mining.

Summary of the invention

The object of the present invention is to overcome the deficiencies of the prior art by providing a crawler system for an IDC harmful information monitoring platform. The system provides a powerful data collection function and uses multiple crawler clusters to monitor dynamic and static web pages comprehensively and in real time.

The object of the present invention is achieved through the following technical solution: a crawler system for an IDC harmful information monitoring platform comprises one or more crawler clusters, each crawler cluster comprising multiple crawler nodes and one crawler root node, which together form a distributed data acquisition network, wherein the crawler root node controls and manages the crawler nodes in its crawler cluster, and the crawler nodes acquire harmful information in the network.

In the present invention, each crawler node consists of the following modules:

1. a multithreading web page acquisition module, comprising multiple web page acquisition channels and web page parsing modules, wherein each type of web page is acquired through the acquisition channel and parsing module that match it;

2. a web page library, which stores the web pages acquired by the multithreading web page acquisition module;

3. a code identification and processing module, which automatically identifies the encoding type of a web page and performs encoding conversion on it;

4. a web page content automatic extraction module, comprising a dynamic web page content extraction module and a static web page content extraction module, which, after encoding conversion, captures the URLs of pages containing harmful information according to a sensitive-word dictionary;

5. a URL filter, which filters out URLs that do not need to be downloaded;

6. a URL deduplication module, which judges whether a filtered URL is identical to a URL already stored in the URL store and, if so, performs no further processing on that URL;

7. a URL scheduling module, which controls the multithreading web page acquisition module to download the corresponding web pages according to the deduplicated URL queue.
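By way of non-limiting illustration, the cooperation of the URL filter, the URL deduplication module and the URL scheduling module listed above may be sketched as follows. The filter rule (HTTP/HTTPS only, skipping a few binary file suffixes), the class names and the in-memory set used as the URL store are assumptions made for this example and are not specified by the invention.

```python
from collections import deque
from typing import Optional
from urllib.parse import urldefrag, urlsplit

# Hypothetical filter rule: only HTTP(S) pages, no common binary assets.
SKIPPED_SUFFIXES = (".jpg", ".png", ".gif", ".zip", ".exe")


def url_filter(url: str) -> bool:
    """Return True if the URL should be kept for download."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False
    return not parts.path.lower().endswith(SKIPPED_SUFFIXES)


class UrlDeduplicator:
    """In-memory URL store; rejects URLs that were already recorded."""

    def __init__(self) -> None:
        self._seen = set()

    def is_new(self, url: str) -> bool:
        key, _fragment = urldefrag(url)  # ignore "#fragment" when comparing
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


class UrlScheduler:
    """FIFO queue of filtered, deduplicated URLs for the acquisition module."""

    def __init__(self) -> None:
        self._queue = deque()
        self._dedup = UrlDeduplicator()

    def submit(self, url: str) -> bool:
        """Queue the URL if it passes the filter and has not been seen."""
        if url_filter(url) and self._dedup.is_new(url):
            self._queue.append(url)
            return True
        return False

    def next_url(self) -> Optional[str]:
        """Return the next URL to download, or None if the queue is empty."""
        return self._queue.popleft() if self._queue else None
```

In this sketch a URL reaches the queue only once, so the acquisition module never downloads the same address twice within one run; a persistent URL store would be needed to carry that guarantee across runs.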

The crawler node further comprises a web page deduplication module, which judges whether the content of a web page is identical to the content of a web page that has already been downloaded; if so, no further processing is performed on that web page and it is deleted from the web page library.

The web page deduplication module comprises a fingerprint calculation module, a fingerprint library and a fingerprint deduplication module. The fingerprint calculation module computes a fingerprint from the content of a web page according to a web page fingerprint algorithm; the fingerprint deduplication module compares the generated fingerprint with the fingerprints in the fingerprint library and, if an identical or similar fingerprint exists, judges that the web page content has already been downloaded. The fingerprint library stores the fingerprint data, and the fingerprint libraries of the crawler nodes are updated synchronously.
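The invention does not name a particular web page fingerprint algorithm. As one possible illustration only, the sketch below uses a 64-bit SimHash over word tokens and treats two fingerprints as "similar" when their Hamming distance is at most 3; the algorithm choice, the threshold and all names are assumptions of this example. Synchronization of fingerprint libraries between crawler nodes is not shown.

```python
import hashlib
import re

FINGERPRINT_BITS = 64


def simhash(text: str) -> int:
    """Compute a 64-bit SimHash over the word tokens of the page text."""
    weights = [0] * FINGERPRINT_BITS
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")
        for bit in range(FINGERPRINT_BITS):
            weights[bit] += 1 if (token_hash >> bit) & 1 else -1
    fingerprint = 0
    for bit, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


class FingerprintLibrary:
    """Stores fingerprints and reports identical or similar page content."""

    def __init__(self, max_distance: int = 3) -> None:
        self._fingerprints = []
        self._max_distance = max_distance  # assumed similarity threshold

    def seen_before(self, text: str) -> bool:
        """Return True if a matching fingerprint exists; otherwise store it."""
        fingerprint = simhash(text)
        for known in self._fingerprints:
            if hamming_distance(fingerprint, known) <= self._max_distance:
                return True
        self._fingerprints.append(fingerprint)
        return False
```

The linear scan over stored fingerprints keeps the example short; a library holding many fingerprints would typically index them so that near matches can be found without comparing against every entry.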

The crawler node further comprises a label counter and a label count log file; the label counter records the number of downloads in the web page library and writes this data to the label count log file.

The crawler node further comprises an interval crawling module, which automatically generates interval rules from web page scores and website weights and controls the web page content automatic extraction module to crawl web pages at the corresponding intervals.
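The invention does not give the formula by which web page scores and website weights are turned into interval rules. Purely as an assumed example, the sketch below normalizes both values to the range 0 to 1, multiplies them into a priority, and interpolates linearly between a maximum and a minimum revisit interval; the function names and the default bounds are hypothetical.

```python
def crawl_interval_seconds(
    page_score: float,
    site_weight: float,
    min_interval: float = 60.0,
    max_interval: float = 86400.0,
) -> float:
    """Map a page score and site weight (both in [0, 1]) to a revisit interval.

    Higher priority (score * weight) yields a shorter interval. The linear
    mapping is an assumption made for illustration.
    """
    if not (0.0 <= page_score <= 1.0 and 0.0 <= site_weight <= 1.0):
        raise ValueError("page_score and site_weight must lie in [0, 1]")
    priority = page_score * site_weight
    return max_interval - priority * (max_interval - min_interval)


def is_due(last_crawl: float, now: float, interval: float) -> bool:
    """Return True once at least `interval` seconds have passed."""
    return now - last_crawl >= interval
```

With these defaults a page with score 1.0 on a site with weight 1.0 is revisited every minute, while a page with zero score is revisited once a day.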

The crawler node further comprises a crawl rule setting module, which controls the web page content automatic extraction module to perform the corresponding crawl actions on web pages according to the configured crawl rules.

The code identification and processing module automatically converts the encoding type of a web page to the Unicode Transformation Format (UTF).
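As a minimal sketch of encoding identification and conversion, the example below checks for a byte order mark and then tries a fixed list of candidate encodings in order, re-encoding the result as UTF-8. The detection method, the candidate order and the function names are assumptions of this example; the invention does not state how identification is performed. Trial decoding has known limits: many Big5 byte sequences also decode without error as GBK, so a practical system would likely add statistical detection or use the charset declared by the page.

```python
import codecs

# Candidates tried in order. GB2312 is covered by its superset GBK;
# ISO-8859-1 accepts any byte sequence, so it must come last.
CANDIDATE_ENCODINGS = ("utf-8", "gbk", "big5", "iso-8859-1")

# Byte order marks and the codec that consumes them.
BOMS = (
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def detect_and_decode(raw: bytes):
    """Return (encoding_name, decoded_text) for the raw page bytes."""
    for bom, encoding in BOMS:
        if raw.startswith(bom):
            try:
                return encoding, raw.decode(encoding)
            except UnicodeDecodeError:
                break  # damaged BOM-prefixed data: fall back to trial decoding
    for encoding in CANDIDATE_ENCODINGS:
        try:
            return encoding, raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Not reached in practice because ISO-8859-1 decodes every byte.
    raise ValueError("no candidate encoding matched")


def to_utf8(raw: bytes) -> bytes:
    """Convert page bytes in any candidate encoding to UTF-8."""
    _encoding, text = detect_and_decode(raw)
    return text.encode("utf-8")
```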

The crawler node further comprises an anti-crawler capture module; when a web page is protected by an anti-crawler program, the anti-crawler capture module is started to perform forced acquisition of the target web page.

The crawler node further comprises an acquisition monitoring module, which forwards the working status, acquisition tasks, acquisition depth and log information of the crawler node to the crawler root node for aggregation, and receives control from the crawler root node.
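The reporting path from the acquisition monitoring module to the crawler root node might, for illustration, look like the sketch below. The report fields mirror the items named above (working status, acquisition tasks, acquisition depth, logs); the class names, field names and the shape of the summary are hypothetical, and the transport between node and root node is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class NodeReport:
    """Status snapshot sent by one crawler node (hypothetical structure)."""

    node_id: str
    state: str          # e.g. "running" or "stopped"
    active_tasks: int   # number of acquisition tasks in progress
    max_depth: int      # deepest link level reached so far
    log_lines: list = field(default_factory=list)


class RootNode:
    """Keeps the latest report of every node and aggregates them."""

    def __init__(self) -> None:
        self._latest = {}

    def receive(self, report: NodeReport) -> None:
        """Store the report, replacing any earlier one from the same node."""
        self._latest[report.node_id] = report

    def summary(self) -> dict:
        """Aggregate the latest reports of all nodes in the cluster."""
        reports = list(self._latest.values())
        return {
            "nodes": len(reports),
            "running": sum(1 for r in reports if r.state == "running"),
            "active_tasks": sum(r.active_tasks for r in reports),
            "max_depth": max((r.max_depth for r in reports), default=0),
        }
```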

The crawler node further comprises a firewall, through which the multithreading web page acquisition module retrieves and crawls harmful information on the network.

The crawler system further comprises a full-text database, an index database and a sorting database, each of which is connected to the crawler nodes and the crawler root nodes.

The beneficial effects of the present invention are as follows. The crawler system for an IDC harmful information monitoring platform proposed by the present invention has the following functional features:

1) multithreaded acquisition: different strategies are customized for different types of websites, and acquisition supports multithreading, achieving fast information acquisition;

2) distributed acquisition: large-scale data acquisition is carried out by multiple crawler clusters and numerous crawler nodes;

3) acquisition monitoring: the working status, acquisition tasks, acquisition depth, logs, system operation reports and the like of the crawler nodes are monitored and managed;

4) automatic web page content extraction: a variety of dynamic and static web pages can be acquired, for example HTM, HTML, SHTML, XML, PHP, ASP, JSP and JavaScript pages;

5) automatic encoding identification and conversion: multiple encodings such as GBK, GB2312, BIG5, UTF-8, UTF-16, BIGENDIAN and ISO8859-1 are identified automatically, and the system automatically converts them to UTF;

6) incremental update: the crawler node acquires only web pages that were newly generated or changed since the last update, and already-downloaded web pages are not re-acquired, which keeps information updates efficient; the user may also configure a full acquisition as required;

7) anti-crawler capture: corresponding strategies are set for those websites that deploy anti-crawler programs, so that their pages can still be captured;

8) interval crawling: interval rules are generated automatically from web page scores, website weights and the like, and web pages are crawled at the corresponding intervals;

9) user-defined crawl rules: the user may also set the crawl rules.

Brief description of the drawings

Fig. 1 is a structural block diagram of the crawler system of the present invention;

Fig. 2 is a structural schematic block diagram of a crawler node in the present invention.

Detailed description of the embodiments

The technical solution of the present invention is described in further detail below with reference to the accompanying drawings, but the scope of protection of the present invention is not limited to the following.

As shown in Fig. 1, a crawler system for an IDC harmful information monitoring platform is responsible for discovering, crawling and normalizing raw data from the Internet. Depending on the Internet application, it comprises one or more crawler clusters, each crawler cluster comprising multiple crawler nodes and one crawler root node, which together form a distributed data acquisition network, wherein the crawler root node controls and manages the crawler nodes in its crawler cluster and communicates with a host computer, and the crawler nodes acquire harmful information in the network.

As shown in Fig. 2, in the present invention each crawler node consists of the following modules:

1. a multithreading web page acquisition module, comprising multiple web page acquisition channels and web page parsing modules, wherein each type of web page is acquired through the acquisition channel and parsing module that match it; the web page parsing modules include a DNS resolution module, an HTTP parsing module, an FTP parsing module, a GOPHER parsing module and the like;

This realizes the multithreaded acquisition function: different strategies can be customized for different types of websites, and acquisition supports multithreading, achieving fast information acquisition;

3. a code identification and processing module, which automatically identifies the encoding type of a web page and performs encoding conversion on it; multiple encodings such as GBK, GB2312, BIG5, UTF-8, UTF-16, BIGENDIAN and ISO8859-1 are identified automatically, and the system automatically converts them to UTF;

4. a web page content automatic extraction module, comprising a dynamic web page content extraction module and a static web page content extraction module, which, after encoding conversion, captures the URLs of pages containing harmful information according to a sensitive-word dictionary; a variety of dynamic and static web pages can be acquired, for example HTM, HTML, SHTML, XML, PHP, ASP, JSP and JavaScript pages;

5. a URL filter, which filters out URLs that do not need to be downloaded;

6. a URL deduplication module, which judges whether a filtered URL is identical to a URL already stored in the URL store and, if so, performs no further processing on that URL; this realizes the incremental update function: the crawler node acquires only web pages that were newly generated or changed since the last update, and already-downloaded web pages are not re-acquired, which keeps information updates efficient; the user may also configure a full acquisition as required;
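The multithreading web page acquisition module of item 1 above, with one acquisition channel per page type, may be sketched as follows by way of example only. Channels are selected here by URL scheme and run on a thread pool; the channel functions are placeholders that return a tagged string instead of performing network access, and all names are assumptions of this example.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit


def fetch_http(url: str) -> str:
    """Placeholder HTTP channel; a real channel would download the page."""
    return "http-channel:" + url


def fetch_ftp(url: str) -> str:
    """Placeholder FTP channel; a real channel would retrieve the file."""
    return "ftp-channel:" + url


# Hypothetical mapping from URL scheme to the matching acquisition channel.
CHANNELS = {"http": fetch_http, "https": fetch_http, "ftp": fetch_ftp}


def acquire(url: str) -> str:
    """Dispatch one URL to the channel that matches its scheme."""
    scheme = urlsplit(url).scheme
    channel = CHANNELS.get(scheme)
    if channel is None:
        raise ValueError("no acquisition channel for scheme: " + repr(scheme))
    return channel(url)


def acquire_all(urls, max_workers: int = 4) -> list:
    """Acquire all URLs concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(acquire, urls))
```

A thread pool suits this workload because acquisition is dominated by waiting on the network, so several downloads can overlap even within a single process.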

Claims

1. A crawler system for an IDC harmful information monitoring platform, characterized in that: it comprises one or more crawler clusters, each crawler cluster comprising multiple crawler nodes and one crawler root node, which together form a distributed data acquisition network, wherein the crawler root node controls and manages the crawler nodes in its crawler cluster, the crawler nodes acquire harmful information in the network, and each crawler node consists of the following modules:

a multithreading web page acquisition module, comprising multiple web page acquisition channels and web page parsing modules, wherein each type of web page is acquired through the acquisition channel and parsing module that match it;

a web page library, which stores the web pages acquired by the multithreading web page acquisition module;

a code identification and processing module, which automatically identifies the encoding type of a web page and performs encoding conversion on it;

a web page content automatic extraction module, comprising a dynamic web page content extraction module and a static web page content extraction module, which, after encoding conversion, captures the URLs of pages containing harmful information according to a sensitive-word dictionary;

a URL filter, which filters out URLs that do not need to be downloaded;

a URL deduplication module, which judges whether a filtered URL is identical to a URL already stored in the URL store and, if so, performs no further processing on that URL;

a URL scheduling module, which controls the multithreading web page acquisition module to download the corresponding web pages according to the deduplicated URL queue.

2. The crawler system for an IDC harmful information monitoring platform according to claim 1, characterized in that: the crawler node further comprises a web page deduplication module, which judges whether the content of a web page is identical to the content of a web page that has already been downloaded; if so, no further processing is performed on that web page and it is deleted from the web page library.

3. The crawler system for an IDC harmful information monitoring platform according to claim 2, characterized in that: the web page deduplication module comprises a fingerprint calculation module, a fingerprint library and a fingerprint deduplication module; the fingerprint calculation module computes a fingerprint from the content of a web page according to a web page fingerprint algorithm; the fingerprint deduplication module compares the generated fingerprint with the fingerprints in the fingerprint library and, if an identical or similar fingerprint exists, judges that the web page content has already been downloaded; the fingerprint library stores the fingerprint data, and the fingerprint libraries of the crawler nodes are updated synchronously.

4. The crawler system for an IDC harmful information monitoring platform according to claim 1, characterized in that: the crawler node further comprises a label counter and a label count log file; the label counter records the number of downloads in the web page library and writes this data to the label count log file.

5. The crawler system for an IDC harmful information monitoring platform according to claim 1, characterized in that: the crawler node further comprises an interval crawling module, which automatically generates interval rules from web page scores and website weights and controls the web page content automatic extraction module to crawl web pages at the corresponding intervals.

6. The crawler system for an IDC harmful information monitoring platform according to claim 1, characterized in that: the crawler node further comprises a crawl rule setting module, which controls the web page content automatic extraction module to perform the corresponding crawl actions on web pages according to the configured crawl rules.

7. The crawler system for an IDC harmful information monitoring platform according to claim 1, characterized in that: the code identification and processing module automatically converts the encoding type of a web page to the Unicode Transformation Format (UTF).

8. The crawler system for an IDC harmful information monitoring platform according to claim 1, characterized in that: the crawler node further comprises an anti-crawler capture module; when a web page is protected by an anti-crawler program, the anti-crawler capture module is started to perform forced acquisition of the target web page.

9. The crawler system for an IDC harmful information monitoring platform according to claim 1, characterized in that: the crawler node further comprises an acquisition monitoring module, which forwards the working status, acquisition tasks, acquisition depth and log information of the crawler node to the crawler root node for aggregation, and receives control from the crawler root node.

10. The crawler system for an IDC harmful information monitoring platform according to claim 1, characterized in that: the crawler node further comprises a firewall, through which the multithreading web page acquisition module retrieves and crawls harmful information on the network.