CN104899323A - Crawler system used for IDC harmful information monitoring platform - Google Patents

Publication number
CN104899323A
Authority
CN
China
Prior art keywords
module
reptile
webpage
crawler
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510343175.7A
Other languages
Chinese (zh)
Other versions
CN104899323B (en)
Inventor
彭光辉
屈立笳
陶磊
苏礼刚
林伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CHENGDU GOLDTEL INDUSTRY GROUP Co Ltd
Original Assignee
CHENGDU GOLDTEL INDUSTRY GROUP Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CHENGDU GOLDTEL INDUSTRY GROUP Co Ltd
Priority to CN201510343175.7A
Publication of CN104899323A
Application granted
Publication of CN104899323B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a crawler system for an IDC harmful information monitoring platform. The crawler system comprises one or more crawler clusters, wherein each crawler cluster comprises multiple crawler nodes and a crawler root node, which together form a distributed data acquisition network; each crawler root node controls and manages the crawler nodes in its crawler cluster; each crawler node acquires harmful information from the network and comprises a multithreading webpage acquisition module, a webpage library, an encoding identification and processing module, a webpage content automatic extraction module, a URL (Uniform Resource Locator) filter, a URL deduplication module and a URL scheduling module. The crawler system provides a powerful data collection function, and dynamic and static webpages are monitored comprehensively and in real time through multiple crawler clusters.

Description

A crawler system for an IDC harmful information monitoring platform
Technical field
The present invention relates to a crawler system for an IDC harmful information monitoring platform.
Background art
With the rapid development of networks, the World Wide Web has become the carrier of a vast amount of information, and extracting and using that information effectively has become a major challenge. Search engines, as tools that help people retrieve information, have become the entry point and guide for users accessing the World Wide Web. However, these general-purpose search engines have certain limitations.
In today's increasingly active online communities, every Internet user may become a publisher and disseminator of harmful information, and the channels through which harmful information spreads are becoming ever broader, including blogs, news sites, forums, microblogs and others. Web crawlers are the foundational technology on which search engines are built; the arrival of the big-data era and the rapid development of Internet technology give web crawler research even greater significance. To cope with challenges such as the sharp growth in web data volume, the short update cycle of online text and dynamically changing page structures, efficient web crawlers that work without interruption have become a research focus in harmful information mining.
Summary of the invention
The object of the present invention is to overcome the deficiencies of the prior art by providing a crawler system for an IDC harmful information monitoring platform. The system provides a powerful data collection function and uses multiple crawler clusters to monitor dynamic and static webpages comprehensively and in real time.
The object of the invention is achieved through the following technical solution: a crawler system for an IDC harmful information monitoring platform comprises one or more crawler clusters. Each crawler cluster comprises multiple crawler nodes and one crawler root node, which together form a distributed data acquisition network. The crawler root node controls and manages the crawler nodes in its cluster, and the crawler nodes collect harmful information from the network.
In the present invention, each crawler node consists of the following modules:
1. Multithreading webpage acquisition module: comprises multiple webpage acquisition channels and webpage parsing modules; each type of webpage is collected through the acquisition channel and parsing module that match it;
2. Webpage library: stores the webpages collected by the multithreading webpage acquisition module;
3. Encoding identification and processing module: automatically identifies the encoding of a webpage and converts it;
4. Webpage content automatic extraction module: comprises a dynamic webpage content extraction module and a static webpage content extraction module; after encoding conversion, it uses a sensitive-word dictionary to capture the URLs of pages that contain harmful information;
5. URL filter: filters out URLs that do not need to be downloaded;
6. URL deduplication module: determines whether a filtered URL matches a URL already held in the URL store; if it does, the URL receives no further processing;
7. URL scheduling module: controls the multithreading webpage acquisition module to download the corresponding webpages according to the deduplicated URL queue.
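As an illustration of item 4, the following minimal sketch checks the decoded text of a page against a sensitive-word dictionary and, only if a word matches, returns the links found on that page. The patent does not specify the matching method; plain substring matching, the class name `LinkAndTextParser` and the function name `extract_if_sensitive` are all assumptions made for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTextParser(HTMLParser):
    """Collects visible text and absolute link targets from an HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.text_parts = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.text_parts.append(data)


def extract_if_sensitive(html, base_url, sensitive_words):
    """Return (matched_words, links); links is empty when nothing matches."""
    parser = LinkAndTextParser(base_url)
    parser.feed(html)
    text = " ".join(parser.text_parts).lower()
    # Assumption: case-insensitive substring matching against the dictionary.
    matched = sorted(w for w in sensitive_words if w.lower() in text)
    return matched, (parser.links if matched else [])
```

Substring matching is the simplest possible choice; a production dictionary with many entries would more likely use a multi-pattern matcher, which is outside what the text describes.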
The crawler node further comprises a webpage deduplication module, which determines whether the content of a webpage matches content that has already been downloaded; if it does, the webpage receives no further processing and is deleted from the webpage library.
The webpage deduplication module comprises a fingerprint calculation module, a fingerprint library and a fingerprint deduplication module. The fingerprint calculation module computes a fingerprint from the webpage content using a webpage fingerprint algorithm. The fingerprint deduplication module compares the generated fingerprint with those in the fingerprint library; if an identical or similar fingerprint exists, the webpage content is judged to have been downloaded already. The fingerprint library stores the fingerprint data, and the fingerprint libraries of all crawler nodes are updated synchronously.
The crawler node further comprises a label counter and a label count log file; the label counter records the number of downloads in the webpage library and writes this figure to the label count log file.
The crawler node further comprises an interval crawling module, which automatically generates interval rules from webpage scores and website weights and controls the webpage content automatic extraction module to crawl webpages at the corresponding intervals.
The crawler node further comprises a crawl rule setting module, which controls the webpage content automatic extraction module to perform the corresponding crawl actions on webpages according to the configured crawl rules.
The encoding identification and processing module automatically converts the encoding of a webpage to a Unicode Transformation Format (UTF).
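A minimal sketch of such a conversion step is given below. It assumes that "UTF" means UTF-8 and that identification is done by checking for a byte-order mark and then trial-decoding a fixed list of candidate encodings; the patent does not say how identification works, and the names `CANDIDATES`, `BOMS` and `to_utf8` are hypothetical.

```python
import codecs

# Candidate encodings tried in order. gb18030 is a superset of GBK and GB2312.
# ISO-8859-1 goes last because it accepts any byte sequence and would
# otherwise mask the others.
CANDIDATES = ("utf-8", "gb18030", "big5", "iso-8859-1")

# Byte-order marks that identify an encoding unambiguously.
BOMS = (
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def to_utf8(raw, declared=None):
    """Decode raw page bytes and return (utf8_bytes, encoding_used).

    `declared` is an optional encoding name taken from an HTTP header or a
    <meta> tag; it is tried before the generic candidates.
    """
    for bom, name in BOMS:
        if raw.startswith(bom):
            return raw.decode(name).encode("utf-8"), name
    trial = ([declared] if declared else []) + list(CANDIDATES)
    for name in trial:
        try:
            return raw.decode(name).encode("utf-8"), name
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown encoding, try the next one
    raise ValueError("no candidate encoding could decode the page")
```

Trial decoding can misidentify short inputs, since several multi-byte encodings accept the same byte sequences; a statistical detector would be more robust but is not part of the Python standard library.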
The crawler node further comprises an anti-crawler capture module; when a webpage is protected by an anti-crawler program, the anti-crawler capture module is started to perform forced collection of the target webpage.
The crawler node further comprises an acquisition monitoring module, which transmits the working status, acquisition tasks, acquisition depth and log information of the crawler node to the crawler root node for aggregation, and receives control commands from the crawler root node.
The crawler node further comprises a firewall, through which the multithreading webpage acquisition module retrieves and crawls harmful information on the network.
The crawler system further comprises a full-text database, an index database and a sorting database, each of which is connected to the crawler nodes and the crawler root node.
The beneficial effects of the invention are as follows. The proposed crawler system for an IDC harmful information monitoring platform has the following functional features:
1) Multithreaded acquisition: different strategies are customized for different types of websites, and acquisition is multithreaded, so information is collected quickly;
2) Distributed acquisition: large-scale data acquisition is carried out by multiple crawler clusters, each with several crawler nodes;
3) Acquisition monitoring: the working status of crawler nodes, acquisition tasks, acquisition depth, logs, system operation reports and the like are monitored and managed;
4) Automatic webpage content extraction: many kinds of dynamic and static webpages can be collected, for example HTM, HTML, SHTML, XML, PHP, ASP, JSP and JavaScript pages;
5) Automatic encoding identification and conversion: encodings such as GBK, GB2312, BIG5, UTF-8, UTF-16, BIGENDIAN and ISO8859-1 are identified automatically, and the system converts them to UTF automatically;
6) Incremental update: each crawler node collects only webpages that were newly created or changed since the last update and does not re-collect webpages that have already been downloaded, which keeps information updates efficient; the user can also request a full collection when needed;
7) Anti-crawler capture: for websites that deploy anti-crawler programs, corresponding strategies are configured so that pages can still be captured;
8) Interval crawling: interval rules are generated automatically from webpage scores, website weights and the like, and webpages are crawled at the corresponding intervals;
9) Custom crawl rules: users can also define their own crawl rules.
Brief description of the drawings
Fig. 1 is a structural block diagram of the crawler system of the present invention;
Fig. 2 is a block diagram of the structure of a crawler node in the present invention.
Detailed description of the embodiments
The technical solution of the present invention is described in further detail below with reference to the accompanying drawings, but the scope of protection of the invention is not limited to the following.
As shown in Fig. 1, a crawler system for an IDC harmful information monitoring platform is responsible for discovering, crawling and normalizing raw data from the Internet. Depending on the Internet application, it comprises one or more crawler clusters. Each crawler cluster comprises multiple crawler nodes and one crawler root node, which together form a distributed data acquisition network. The crawler root node controls and manages the crawler nodes in its cluster and communicates with the host computer; the crawler nodes collect harmful information from the network.
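The cluster layout can be pictured with a small data-structure sketch. The text only states that the root node controls and manages the nodes of its cluster; the assignment policy shown here (hashing the host name so that one site always goes to the same node) and the names `CrawlerNode`, `CrawlerRootNode` and `assign` are assumptions chosen for illustration.

```python
import hashlib
from dataclasses import dataclass, field
from urllib.parse import urlsplit


@dataclass
class CrawlerNode:
    name: str
    queue: list = field(default_factory=list)  # URLs assigned to this node


@dataclass
class CrawlerRootNode:
    """Controls the crawler nodes of one cluster."""
    nodes: list

    def assign(self, url):
        """Hand a URL to one node of the cluster and return that node."""
        # Assumed policy: hash the host so one site always lands on one node.
        host = urlsplit(url).netloc.lower()
        digest = hashlib.sha256(host.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % len(self.nodes)
        node = self.nodes[index]
        node.queue.append(url)
        return node
```

Keeping each site on one node makes per-site rate limiting simple, at the cost of uneven load when a few sites dominate the URL stream.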
As shown in Fig. 2, in the present invention each crawler node consists of the following modules:
1. Multithreading webpage acquisition module: comprises multiple webpage acquisition channels and webpage parsing modules; each type of webpage is collected through the acquisition channel and parsing module that match it. The webpage parsing modules include a DNS parsing module, an HTTP parsing module, an FTP parsing module, a GOPHER parsing module and so on;
This implements multithreaded acquisition: different strategies can be customized for different types of websites, and acquisition is multithreaded, so information is collected quickly;
2. Webpage library: stores the webpages collected by the multithreading webpage acquisition module;
3. Encoding identification and processing module: automatically identifies the encoding of a webpage and converts it. Encodings such as GBK, GB2312, BIG5, UTF-8, UTF-16, BIGENDIAN and ISO8859-1 are identified automatically, and the system converts them to UTF automatically;
4. Webpage content automatic extraction module: comprises a dynamic webpage content extraction module and a static webpage content extraction module; after encoding conversion, it uses a sensitive-word dictionary to capture the URLs of pages that contain harmful information. Many kinds of dynamic and static webpages can be collected, for example HTM, HTML, SHTML, XML, PHP, ASP, JSP and JavaScript pages;
5. URL filter: filters out URLs that do not need to be downloaded;
6. URL deduplication module: determines whether a filtered URL matches a URL already held in the URL store; if it does, the URL receives no further processing. This implements incremental update: each crawler node collects only webpages that were newly created or changed since the last update and does not re-collect webpages that have already been downloaded, which keeps information updates efficient; the user can also request a full collection when needed;
7. URL scheduling module: controls the multithreading webpage acquisition module to download the corresponding webpages according to the deduplicated URL queue.
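Items 5 to 7 form a small pipeline: filter, deduplicate, then schedule downloads on multiple threads. The sketch below shows one way this could look. The filter rule (allowed schemes and skipped file suffixes), the in-memory set that stands in for the URL store, and the names `url_filter`, `UrlDeduplicator` and `schedule` are assumptions; the download function `fetch` is supplied by the caller so the example needs no network access.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urldefrag, urlsplit

# Assumed filter rules; the schemes mirror the parsing modules named in item 1.
ALLOWED_SCHEMES = ("http", "https", "ftp", "gopher")
SKIP_SUFFIXES = (".jpg", ".png", ".gif", ".css", ".zip")


def url_filter(url):
    """Return True if the URL should be downloaded."""
    parts = urlsplit(url)
    if parts.scheme not in ALLOWED_SCHEMES:
        return False
    return not parts.path.lower().endswith(SKIP_SUFFIXES)


class UrlDeduplicator:
    """Remembers every URL seen so far (stands in for the URL store)."""

    def __init__(self):
        self._seen = set()

    def is_new(self, url):
        key = urldefrag(url).url  # ignore the #fragment part
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


def schedule(urls, dedup, fetch, workers=4):
    """Filter and deduplicate urls, then download the rest in parallel."""
    queue = [u for u in urls if url_filter(u) and dedup.is_new(u)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pages = list(pool.map(fetch, queue))  # map preserves queue order
    return dict(zip(queue, pages))
```

Because the deduplicator keeps its state between calls, a second run over the same URLs downloads nothing, which is the incremental-update behavior described in item 6. Detecting pages that changed since the last visit would need extra information, such as a stored modification time, which this sketch does not model.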
The crawler node further comprises a webpage deduplication module, which determines whether the content of a webpage matches content that has already been downloaded; if it does, the webpage receives no further processing and is deleted from the webpage library.
The webpage deduplication module comprises a fingerprint calculation module, a fingerprint library and a fingerprint deduplication module. The fingerprint calculation module computes a fingerprint from the webpage content using a webpage fingerprint algorithm. The fingerprint deduplication module compares the generated fingerprint with those in the fingerprint library; if an identical or similar fingerprint exists, the webpage content is judged to have been downloaded already. The fingerprint library stores the fingerprint data, and the fingerprint libraries of all crawler nodes are updated synchronously.
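The patent does not name the fingerprint algorithm. One common choice that supports the "identical or similar" comparison is SimHash with a Hamming-distance threshold, and the following sketch uses it purely as an assumed stand-in. The names `simhash`, `hamming`, `FingerprintStore` and the threshold of 3 bits are hypothetical.

```python
import hashlib


def simhash(text, bits=64):
    """Compute a SimHash fingerprint over whitespace-separated tokens."""
    weights = [0] * bits
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    # Bit i of the fingerprint is set when more tokens voted 1 than 0.
    return sum(1 << i for i in range(bits) if weights[i] > 0)


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


class FingerprintStore:
    """Fingerprint library of one crawler node."""

    def __init__(self, max_distance=3):
        self.max_distance = max_distance  # assumed similarity threshold
        self.fingerprints = set()

    def seen_before(self, text):
        """Return True if an identical or similar page is already stored."""
        fp = simhash(text)
        if any(hamming(fp, old) <= self.max_distance for old in self.fingerprints):
            return True
        self.fingerprints.add(fp)
        return False

    def merge(self, other):
        """Synchronize with the fingerprint library of another node."""
        self.fingerprints |= other.fingerprints
```

The linear scan in `seen_before` is only suitable for small libraries; large deployments typically index fingerprints by bit blocks so that near matches can be found without comparing against every stored value.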
The crawler node further comprises a label counter and a label count log file; the label counter records the number of downloads in the webpage library and writes this figure to the label count log file.
The crawler node further comprises an interval crawling module, which automatically generates interval rules from webpage scores and website weights and controls the webpage content automatic extraction module to crawl webpages at the corresponding intervals.
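How the interval is derived from the webpage score and the website weight is not specified. As a sketch under stated assumptions, the rule below treats both values as numbers in (0, 1], multiplies them into a priority, and interpolates linearly between a maximum and a minimum revisit interval. The formula, the default bounds and the names `crawl_interval` and `is_due` are illustrative only.

```python
def crawl_interval(page_score, site_weight,
                   min_seconds=60.0, max_seconds=86400.0):
    """Return the revisit interval in seconds (assumed rule).

    A higher score or weight yields a shorter interval.
    """
    if not (0 < page_score <= 1 and 0 < site_weight <= 1):
        raise ValueError("score and weight must be in (0, 1]")
    priority = page_score * site_weight
    return max_seconds - priority * (max_seconds - min_seconds)


def is_due(last_crawl_ts, now_ts, page_score, site_weight):
    """Return True when the page should be crawled again."""
    return now_ts - last_crawl_ts >= crawl_interval(page_score, site_weight)
```

With the default bounds, a page with score 1.0 on a site with weight 1.0 is revisited every minute, while a page with score 0.5 on a site with weight 0.5 is revisited roughly every 18 hours.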
The crawler node further comprises a crawl rule setting module, which controls the webpage content automatic extraction module to perform the corresponding crawl actions on webpages according to the configured crawl rules.
The crawler node further comprises an anti-crawler capture module; when a webpage is protected by an anti-crawler program, the anti-crawler capture module is started to perform forced collection of the target webpage.
The crawler node further comprises an acquisition monitoring module, which transmits the working status, acquisition tasks, acquisition depth and log information of the crawler node to the crawler root node for aggregation, and receives control commands from the crawler root node.
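The reporting path from node to root node can be sketched as a status record plus an aggregation step. The field names, the status values and the summary produced by `aggregate` are assumptions; the text only lists which kinds of information are reported.

```python
from dataclasses import dataclass, field


@dataclass
class NodeReport:
    """Status record one crawler node sends to its root node."""
    node: str
    status: str          # assumed values: "running", "idle", "error"
    pending_tasks: int   # acquisition tasks still queued
    crawl_depth: int     # current acquisition depth
    log_lines: list = field(default_factory=list)


def aggregate(reports):
    """Root-node view: totals plus the nodes that need attention."""
    return {
        "nodes": len(reports),
        "pending_tasks": sum(r.pending_tasks for r in reports),
        "max_depth": max((r.crawl_depth for r in reports), default=0),
        "unhealthy": sorted(r.node for r in reports if r.status == "error"),
    }
```

The root node could use the `unhealthy` list to decide which nodes to restart or to reassign work away from, which corresponds to the control direction mentioned in the paragraph above.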
The crawler node further comprises a firewall, through which the multithreading webpage acquisition module retrieves and crawls harmful information on the network.
The crawler system further comprises a full-text database, an index database and a sorting database, each of which is connected to the crawler nodes and the crawler root node.

Claims (10)

1. A crawler system for an IDC harmful information monitoring platform, characterized in that: it comprises one or more crawler clusters, each crawler cluster comprising multiple crawler nodes and one crawler root node that together form a distributed data acquisition network, wherein the crawler root node controls and manages the crawler nodes in its crawler cluster and the crawler nodes collect harmful information from the network, and each crawler node consists of the following modules:
a multithreading webpage acquisition module, comprising multiple webpage acquisition channels and webpage parsing modules, wherein each type of webpage is collected through the acquisition channel and parsing module that match it;
a webpage library, which stores the webpages collected by the multithreading webpage acquisition module;
an encoding identification and processing module, which automatically identifies the encoding of a webpage and converts it;
a webpage content automatic extraction module, comprising a dynamic webpage content extraction module and a static webpage content extraction module, which, after encoding conversion, uses a sensitive-word dictionary to capture the URLs of pages that contain harmful information;
a URL filter, which filters out URLs that do not need to be downloaded;
a URL deduplication module, which determines whether a filtered URL matches a URL already held in the URL store and, if it does, performs no further processing on that URL;
a URL scheduling module, which controls the multithreading webpage acquisition module to download the corresponding webpages according to the deduplicated URL queue.
2. The crawler system for an IDC harmful information monitoring platform according to claim 1, characterized in that: the crawler node further comprises a webpage deduplication module, which determines whether the content of a webpage matches content that has already been downloaded and, if it does, performs no further processing on the webpage and deletes it from the webpage library.
3. The crawler system for an IDC harmful information monitoring platform according to claim 2, characterized in that: the webpage deduplication module comprises a fingerprint calculation module, a fingerprint library and a fingerprint deduplication module; the fingerprint calculation module computes a fingerprint from the webpage content using a webpage fingerprint algorithm; the fingerprint deduplication module compares the generated fingerprint with those in the fingerprint library and, if an identical or similar fingerprint exists, judges that the webpage content has already been downloaded; the fingerprint library stores the fingerprint data, and the fingerprint libraries of all crawler nodes are updated synchronously.
4. The crawler system for an IDC harmful information monitoring platform according to claim 1, characterized in that: the crawler node further comprises a label counter and a label count log file; the label counter records the number of downloads in the webpage library and writes this figure to the label count log file.
5. The crawler system for an IDC harmful information monitoring platform according to claim 1, characterized in that: the crawler node further comprises an interval crawling module, which automatically generates interval rules from webpage scores and website weights and controls the webpage content automatic extraction module to crawl webpages at the corresponding intervals.
6. The crawler system for an IDC harmful information monitoring platform according to claim 1, characterized in that: the crawler node further comprises a crawl rule setting module, which controls the webpage content automatic extraction module to perform the corresponding crawl actions on webpages according to the configured crawl rules.
7. The crawler system for an IDC harmful information monitoring platform according to claim 1, characterized in that: the encoding identification and processing module automatically converts the encoding of a webpage to a Unicode Transformation Format (UTF).
8. The crawler system for an IDC harmful information monitoring platform according to claim 1, characterized in that: the crawler node further comprises an anti-crawler capture module; when a webpage is protected by an anti-crawler program, the anti-crawler capture module is started to perform forced collection of the target webpage.
9. The crawler system for an IDC harmful information monitoring platform according to claim 1, characterized in that: the crawler node further comprises an acquisition monitoring module, which transmits the working status, acquisition tasks, acquisition depth and log information of the crawler node to the crawler root node for aggregation, and receives control commands from the crawler root node.
10. The crawler system for an IDC harmful information monitoring platform according to claim 1, characterized in that: the crawler node further comprises a firewall, through which the multithreading webpage acquisition module retrieves and crawls harmful information on the network;
the crawler system further comprises a full-text database, an index database and a sorting database, each of which is connected to the crawler nodes and the crawler root node.
CN201510343175.7A 2015-06-19 2015-06-19 A kind of crawler system for IDC harmful information monitoring platforms Active CN104899323B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510343175.7A CN104899323B (en) 2015-06-19 2015-06-19 A kind of crawler system for IDC harmful information monitoring platforms

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510343175.7A CN104899323B (en) 2015-06-19 2015-06-19 A kind of crawler system for IDC harmful information monitoring platforms

Publications (2)

Publication Number Publication Date
CN104899323A (en) 2015-09-09
CN104899323B (en) 2018-09-11

Family

ID=54031985

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510343175.7A Active CN104899323B (en) 2015-06-19 2015-06-19 A kind of crawler system for IDC harmful information monitoring platforms

Country Status (1)

Country Link
CN (1) CN104899323B (en)

Also Published As

Publication number Publication date
CN104899323B (en) 2018-09-11

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant