CN110807137A - Distributed big data acquisition implementation method - Google Patents


Publication number
CN110807137A
Authority
CN
China
Prior art keywords
url
module
pool
agent
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910290171.5A
Other languages
Chinese (zh)
Inventor
江晶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Congyun Information Technology Co Ltd
Original Assignee
Shanghai Congyun Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Congyun Information Technology Co Ltd filed Critical Shanghai Congyun Information Technology Co Ltd
Priority to CN201910290171.5A priority Critical patent/CN110807137A/en
Publication of CN110807137A publication Critical patent/CN110807137A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • G06F16/955Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention relates to the technical field of big data, and in particular to a distributed big data acquisition implementation method comprising a capture module, an IP proxy pool module, a parsing module, a URL processing module and a data storage module. The IP proxy pool module provides proxy updating and proxy allocation; the parsing module provides URL data extraction and basic data extraction; the URL processing module provides URL filtering, the URL filtering is connected to a URL queue, and the URL queue is connected to a URL allocation process. By switching proxies, the invention keeps the system working continuously, eliminates the waiting otherwise imposed by time-limited access restrictions, and greatly improves the acquisition efficiency of the system.

Description

Distributed big data acquisition implementation method
Technical Field
The invention relates to the technical field of big data, in particular to a distributed big data acquisition implementation method.
Background
With the popularization and application of emerging internet technologies such as the mobile internet, e-commerce and social networking, network data such as images, videos and logs has grown explosively. The commodity transaction data generated by the nearly 400 million members of Taobao is about 20 TB, and the log data generated by about 10 million users of Facebook exceeds 300 TB. The big data era has arrived, and big data is a hot topic of current research. Data is the foundation of big data research, and traditional data acquisition schemes struggle to meet the need to acquire high-quality data sets quickly. How to collect massive amounts of high-quality data efficiently therefore plays an extremely important role in big data applications and research.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a distributed big data acquisition implementation method.
To achieve this purpose, the invention adopts the following technical solution:
A distributed big data acquisition implementation method comprises a capture module, an IP proxy pool module, a parsing module, a URL processing module and a data storage module. The IP proxy pool module provides proxy updating and proxy allocation; the parsing module provides URL data extraction and basic data extraction; the URL processing module provides URL filtering, the URL filtering is connected to a URL queue, and the URL queue is connected to a URL allocation process.
Preferably, the capture module obtains data from the internet, the IP proxy pool module and the URL processing module respectively, and the parsing module receives the data from the capture module and passes it to the URL processing module and the data storage module respectively.
Preferably, the URL processing module comprises a master node and hosts. The master node takes URLs from the to-crawl URL queue and allocates them to the individual capture hosts. Each capture host then completes its acquisition and parsing tasks and submits the successfully captured URLs and the newly extracted URLs to the master node for processing. The successfully captured URLs are cached in the crawled set; the new URLs are then filtered against the crawled set and cached in the corresponding to-crawl queue. Both the to-crawl queue and the crawled set are implemented with an in-memory database, and the to-crawl queue uses a first-stored, first-allocated strategy for subsequent crawling.
Preferably, each proxy in the pool of the IP proxy pool module has a flag that records its state: useful, using, or useless. Each time, a proxy flagged useful is taken from the pool and allocated to a capture host, and its flag is changed to using. If the capture host returns the proxy because of an IP restriction, the proxy is flagged useless; otherwise it is returned normally and flagged useful. When an automatic update is triggered, all proxies in the pool that are not flagged using are tested, and the useless proxies are deleted from the pool after testing.
Preferably, a tag tree is built for the web page in the host, in the following steps:
Step one: push the root tag <HTML> onto the stack;
Step two: when a block start tag is encountered, make the block a child node of the block node at the top of the stack, and push the block tag onto the stack;
Step three: when a block end tag is encountered, pop the top of the stack;
Step four: if the stack is empty, construction is finished; otherwise continue scanning, going to step two when a block start tag is encountered and to step three when a block end tag is encountered.
Compared with the prior art, the invention provides a distributed big data acquisition implementation method with the following beneficial effects:
The capture module obtains a URL to be crawled from the URL queue, calls an available proxy obtained from the IP proxy pool, captures the raw data from the internet, and passes it to the parsing module. The parsing module first preprocesses the data to remove obvious noise, then extracts the text information using a text extraction algorithm based on the weights of the block nodes in the tag tree. URL-related data is passed to the URL processing module, and basic data is passed to the data storage module. The URL processing module mainly controls distributed capture, and the data storage module regularizes and persists the data, laying a foundation for subsequent analysis and processing.
Drawings
Fig. 1 is a schematic structural diagram of a distributed big data acquisition implementation method according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the description of the present invention, it is to be understood that the terms "upper", "lower", "front", "rear", "left", "right", "top", "bottom", "inner", "outer", and the like, indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, are merely for convenience in describing the present invention and simplifying the description, and do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be construed as limiting the present invention.
Referring to Fig. 1, a distributed big data acquisition implementation method comprises a capture module, an IP proxy pool module, a parsing module, a URL processing module and a data storage module. The IP proxy pool module provides proxy updating and proxy allocation; the parsing module provides URL data extraction and basic data extraction; the URL processing module provides URL filtering, the URL filtering is connected to a URL queue, and the URL queue is connected to a URL allocation process.
The capture module obtains data from the internet, the IP proxy pool module and the URL processing module respectively, and the parsing module receives the data from the capture module and passes it to the URL processing module and the data storage module respectively.
The URL processing module comprises a master node and hosts. The master node takes URLs from the to-crawl URL queue and allocates them to the individual capture hosts. Each capture host completes its acquisition and parsing tasks and submits the successfully captured URLs and the newly extracted URLs to the master node for processing. The successfully captured URLs are cached in the crawled set; the new URLs are filtered against the crawled set and cached in the corresponding to-crawl queue. Both the to-crawl queue and the crawled set are implemented with an in-memory database, and the to-crawl queue uses a first-stored, first-allocated strategy for subsequent crawling.
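As an illustration only, the master node's bookkeeping might be sketched as follows. The class name UrlFrontier and its methods are hypothetical, plain Python containers stand in for the in-memory database, and the extra queued set (which prevents the same URL from waiting in the queue twice) is an assumption that goes beyond the filtering against the crawled set described here.

```python
from collections import deque


class UrlFrontier:
    """In-memory stand-in for the master node's to-crawl queue and crawled set."""

    def __init__(self, seeds=()):
        self.to_crawl = deque()  # FIFO: first stored, first allocated
        self.queued = set()      # assumption: URLs currently waiting in the queue
        self.crawled = set()     # URLs that were captured successfully
        for url in seeds:
            self.add(url)

    def add(self, url):
        """Queue a URL unless it was already crawled or is already waiting."""
        if url in self.crawled or url in self.queued:
            return False
        self.to_crawl.append(url)
        self.queued.add(url)
        return True

    def allocate(self, n):
        """Hand out up to n URLs to a capture host, oldest first."""
        batch = []
        while self.to_crawl and len(batch) < n:
            url = self.to_crawl.popleft()
            self.queued.discard(url)
            batch.append(url)
        return batch

    def report(self, fetched, discovered):
        """Record a host's results and return the new URLs that were queued."""
        self.crawled.update(fetched)
        return [url for url in discovered if self.add(url)]
```

A capture host would call allocate to receive work and report to hand back both the URLs it captured and the URLs it extracted. A URL that has been allocated but not yet reported is in neither set in this sketch, so it could be queued again if another host discovers it in the meantime.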
Each proxy in the pool of the IP proxy pool module has a flag that records its state: useful, using, or useless. Each time, a proxy flagged useful is taken from the pool and allocated to a capture host, and its flag is changed to using. If the capture host returns the proxy because of an IP restriction, the proxy is flagged useless; otherwise it is returned normally and flagged useful. When an automatic update is triggered, all proxies in the pool that are not flagged using are tested, and the useless proxies are deleted from the pool after testing.
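The three-state flag can be read as a small state machine. The sketch below is one possible reading, not the patented implementation: the names State, ProxyPool and is_alive are hypothetical, and the proxy health check is passed in as a callable because the text does not say how proxies are tested.

```python
from enum import Enum


class State(Enum):
    USEFUL = "useful"
    USING = "using"
    USELESS = "useless"


class ProxyPool:
    def __init__(self, proxies):
        # Every proxy starts out flagged useful (an assumption).
        self.state = {proxy: State.USEFUL for proxy in proxies}

    def allocate(self):
        """Return a proxy flagged useful and flag it as using, or None."""
        for proxy, state in self.state.items():
            if state is State.USEFUL:
                self.state[proxy] = State.USING
                return proxy
        return None

    def release(self, proxy, ip_limited):
        """Take a proxy back from a capture host and update its flag."""
        self.state[proxy] = State.USELESS if ip_limited else State.USEFUL

    def update(self, is_alive):
        """Test every proxy that is not in use, then delete the useless ones."""
        for proxy, state in list(self.state.items()):
            if state is State.USING:
                continue
            self.state[proxy] = State.USEFUL if is_alive(proxy) else State.USELESS
        self.state = {p: s for p, s in self.state.items() if s is not State.USELESS}
```

Under this reading, a proxy flagged useless gets one more chance during the update: if the test passes, it is flagged useful again instead of being deleted.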
A tag tree is built for the web page in the host, in the following steps:
Step one: push the root tag <HTML> onto the stack;
Step two: when a block start tag is encountered, make the block a child node of the block node at the top of the stack, and push the block tag onto the stack;
Step three: when a block end tag is encountered, pop the top of the stack;
Step four: if the stack is empty, construction is finished; otherwise continue scanning, going to step two when a block start tag is encountered and to step three when a block end tag is encountered.
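The four steps above can be sketched with the HTMLParser class from the Python standard library. The text does not say which tags count as blocks, so the BLOCK_TAGS set is an assumption, as are the names Node, TagTreeBuilder, build_tag_tree and outline. The sketch also assumes well-formed markup in which every block start tag has a matching end tag.

```python
from html.parser import HTMLParser

# Assumption: the tags treated as blocks; the source does not list them.
BLOCK_TAGS = {"html", "body", "div", "table", "tr", "td", "ul", "li"}


class Node:
    def __init__(self, tag):
        self.tag = tag
        self.children = []
        self.text = []  # text found directly inside this block


class TagTreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = None
        self.stack = []

    def handle_starttag(self, tag, attrs):
        if tag not in BLOCK_TAGS:
            return
        node = Node(tag)
        if self.stack:
            self.stack[-1].children.append(node)  # step two: child of stack top
        elif self.root is None:
            self.root = node                      # step one: the root tag
        else:
            return                                # step four: tree already finished
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Step three: pop the stack when the matching block end tag arrives.
        if tag in BLOCK_TAGS and self.stack and self.stack[-1].tag == tag:
            self.stack.pop()

    def handle_data(self, data):
        if self.stack and data.strip():
            self.stack[-1].text.append(data.strip())


def build_tag_tree(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def outline(node):
    """Return the tree shape as nested (tag, children) tuples."""
    return (node.tag, [outline(child) for child in node.children])
```

HTMLParser lowercases tag names, so <HTML> appears in the tree as html. Text inside inline tags such as <b> is attributed to the enclosing block, which is the information a block-weight calculation would need.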
In use, the capture module obtains a URL to be crawled from the URL queue, calls an available proxy obtained from the IP proxy pool, captures the raw data from the internet, and passes it to the parsing module. The parsing module first preprocesses the data to remove obvious noise, then extracts the text information using a text extraction algorithm based on the weights of the block nodes in the tag tree. URL-related data is passed to the URL processing module, and basic data is passed to the data storage module. The URL processing module mainly controls distributed capture, and the data storage module regularizes and persists the data, laying a foundation for subsequent analysis and processing.
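The data flow of this paragraph can be sketched on a single host as follows. Everything named here is hypothetical: fetch is a caller-supplied function, PermissionError stands in for an IP restriction, a dictionary stands in for the data storage module, and stripping tags with a regular expression is only a placeholder for the weight-based text extraction, which the text does not specify.

```python
import re
from urllib.parse import urljoin

LINK_RE = re.compile(r'href="([^"]+)"', re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")


def parse(base_url, html):
    """Split a page into URL data (links) and basic data (plain text)."""
    links = [urljoin(base_url, href) for href in LINK_RE.findall(html)]
    text = " ".join(TAG_RE.sub(" ", html).split())  # placeholder extraction
    return links, text


def crawl(seed, fetch, proxies, max_pages=10):
    """Run the capture, parse, URL processing and storage loop."""
    proxies = list(proxies)  # work on a copy of the caller's list
    queue, crawled, store = [seed], set(), {}
    while queue and proxies and len(crawled) < max_pages:
        url = queue.pop(0)
        html = None
        while proxies and html is None:
            try:
                html = fetch(url, proxies[0])
            except PermissionError:  # stand-in for an IP restriction
                proxies.pop(0)       # switch proxy instead of waiting
        if html is None:
            break                    # no usable proxy is left
        links, text = parse(url, html)
        crawled.add(url)
        store[url] = text            # data storage module
        for link in links:           # URL processing module
            if link not in crawled and link not in queue:
                queue.append(link)
    return store
```

When a fetch is refused, the loop retries the same URL through the next proxy immediately, which is the behavior credited in the abstract with removing the wait imposed by access limits.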
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (5)

1. A distributed big data acquisition implementation method, comprising a capture module, an IP proxy pool module, a parsing module, a URL processing module and a data storage module, characterized in that the IP proxy pool module provides proxy updating and proxy allocation, the parsing module provides URL data extraction and basic data extraction, the URL processing module provides URL filtering, the URL filtering is connected to a URL queue, and the URL queue is connected to a URL allocation process.
2. The distributed big data acquisition implementation method of claim 1, wherein the capture module obtains data from the internet, the IP proxy pool module and the URL processing module respectively, and the parsing module receives the data from the capture module and passes it to the URL processing module and the data storage module respectively.
3. The distributed big data acquisition implementation method of claim 1, wherein the URL processing module comprises a master node and hosts; the master node takes URLs from the to-crawl URL queue and allocates them to the individual capture hosts; each capture host then completes its acquisition and parsing tasks and submits the successfully captured URLs and the newly extracted URLs to the master node for processing; the successfully captured URLs are cached in the crawled set; the new URLs are then filtered against the crawled set and cached in the corresponding to-crawl queue; both the to-crawl queue and the crawled set are implemented with an in-memory database; and the to-crawl queue uses a first-stored, first-allocated strategy for subsequent crawling.
4. The distributed big data acquisition implementation method of claim 1, wherein each proxy in the pool of the IP proxy pool module has a flag that records its state: useful, using, or useless; each time, a proxy flagged useful is taken from the pool and allocated to a capture host, and its flag is changed to using; if the capture host returns the proxy because of an IP restriction, the proxy is flagged useless, and otherwise it is returned normally and flagged useful; and when an automatic update is triggered, all proxies in the pool that are not flagged using are tested, and the useless proxies are deleted from the pool after testing.
5. The distributed big data acquisition implementation method of claim 3, wherein a tag tree is built for the web page in the host, in the following steps:
Step one: push the root tag <HTML> onto the stack;
Step two: when a block start tag is encountered, make the block a child node of the block node at the top of the stack, and push the block tag onto the stack;
Step three: when a block end tag is encountered, pop the top of the stack;
Step four: if the stack is empty, construction is finished; otherwise continue scanning, going to step two when a block start tag is encountered and to step three when a block end tag is encountered.
CN201910290171.5A 2019-04-11 2019-04-11 Distributed big data acquisition implementation method Pending CN110807137A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910290171.5A CN110807137A (en) 2019-04-11 2019-04-11 Distributed big data acquisition implementation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910290171.5A CN110807137A (en) 2019-04-11 2019-04-11 Distributed big data acquisition implementation method

Publications (1)

Publication Number Publication Date
CN110807137A true CN110807137A (en) 2020-02-18

Family

ID=69487330

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910290171.5A Pending CN110807137A (en) 2019-04-11 2019-04-11 Distributed big data acquisition implementation method

Country Status (1)

Country Link
CN (1) CN110807137A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030065711A1 (en) * 2001-10-01 2003-04-03 International Business Machines Corporation Method and apparatus for content-aware web switching
US20040148571A1 (en) * 2003-01-27 2004-07-29 Lue Vincent Wen-Jeng Method and apparatus for adapting web contents to different display area
CN102306204A (en) * 2011-09-28 2012-01-04 武汉大学 Subject area identifying method based on weight of text structure
CN103023998A (en) * 2012-11-29 2013-04-03 网宿科技股份有限公司 Temporary jump error correction method and system based on content distribution network fringe node
CN105243159A (en) * 2015-10-28 2016-01-13 福建亿榕信息技术有限公司 Visual script editor-based distributed web crawler system
CN106021608A (en) * 2016-06-22 2016-10-12 广东亿迅科技有限公司 Distributed crawler system and implementing method thereof
WO2017113324A1 (en) * 2015-12-31 2017-07-06 孙燕群 Regular expression-based url filtering method
CN109508422A (en) * 2018-12-05 2019-03-22 南京邮电大学 The height of multithreading intelligent scheduling is hidden crawler system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination