CN110807137A

CN110807137A - Distributed big data acquisition implementation method

Info

Publication number: CN110807137A
Application number: CN201910290171.5A
Authority: CN
Inventors: 江晶
Original assignee: Shanghai Congyun Information Technology Co Ltd
Current assignee: Shanghai Congyun Information Technology Co Ltd
Priority date: 2019-04-11
Filing date: 2019-04-11
Publication date: 2020-02-18

Abstract

The invention relates to the technical field of big data, in particular to a distributed big data acquisition implementation method which comprises a capture module, an IP proxy pool module, an analysis module, a URL processing module and a data storage module, wherein the IP proxy pool module provides proxy updating and proxy allocation, the analysis module provides URL data extraction and basic data extraction, the URL processing module provides URL filtering, the URL filtering is connected to a URL queue, and the URL queue is connected to a URL allocation process. By switching proxies, the invention keeps the system working continuously, eliminates the waiting imposed by time-based access limits, and greatly improves the acquisition efficiency of the system.

Description

Distributed big data acquisition implementation method

Technical Field

The invention relates to the technical field of big data, in particular to a distributed big data acquisition implementation method.

Background

With the spread of emerging internet technologies such as the mobile internet, e-commerce, and social networking, network data such as images, videos, and logs has grown explosively. The commodity transaction data generated by the nearly 400 million members of Taobao amounts to about 20 TB, and the log data generated by about 10 million Facebook users exceeds 300 TB. The big data era has arrived, and big data is a hot topic of current research. Data is the foundation of big data research, yet traditional data acquisition schemes struggle to collect high-quality data sets quickly. Efficiently collecting massive, high-quality data is therefore extremely important for big data applications and research.

Disclosure of Invention

The invention aims to solve the defects in the prior art and provides a distributed big data acquisition implementation method.

In order to achieve the purpose, the invention adopts the following technical scheme:

A distributed big data acquisition implementation method comprises a capture module, an IP proxy pool module, an analysis module, a URL processing module and a data storage module, wherein the IP proxy pool module provides proxy updating and proxy allocation, the analysis module provides URL data extraction and basic data extraction, the URL processing module provides URL filtering, the URL filtering is connected to a URL queue, and the URL queue is connected to a URL allocation process.

Preferably, the capture module obtains data from the internet, the IP proxy pool module and the URL processing module, respectively, and the analysis module receives the data from the capture module and passes it to the URL processing module and the data storage module, respectively.

Preferably, the URL processing module includes a main control node and capture hosts. The main control node takes URLs from the to-be-crawled URL queue and allocates them to the capture hosts. Each capture host then completes its acquisition and analysis tasks and submits the successfully crawled URLs and the newly extracted URLs to the main control node for processing. The successfully crawled URLs are cached in the crawled set, and the new URLs are filtered against the crawled set and cached in the corresponding to-be-crawled queue for subsequent crawling. Both the to-be-crawled queue and the crawled set are implemented with an in-memory database, and the to-be-crawled queue follows a first-stored, first-allocated strategy.

Preferably, each proxy in the pool of the IP proxy pool module carries a flag that records its state: useful, using, or useless. On each allocation, a proxy flagged useful is taken from the pool, assigned to a capture host, and its flag is changed to using. If the capture host returns the proxy because of an IP restriction, the proxy is flagged useless; otherwise it is returned normally and flagged useful. When an automatic update is triggered, all proxies in the pool that are not flagged using are tested, and the useless proxies are deleted from the pool after the test.

Preferably, a tag tree is built for each web page on the host, and the building steps are as follows:

step one: push the root tag <HTML> onto the stack;

step two: when a block start tag is encountered, make the block a child node of the block node at the top of the stack, and push the block tag onto the stack;

step three: when a block end tag is encountered, pop the top of the stack;

step four: if the stack is empty, construction is complete; otherwise continue scanning, going to step two when a block start tag is encountered and to step three when a block end tag is encountered.
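
The four steps above can be illustrated with a minimal Python sketch. It is not the patented implementation: the set of tags treated as blocks (`BLOCK_TAGS`), the `Node` and `TagTreeBuilder` names, and the use of the standard-library `html.parser` module as the scanner are all assumptions made for illustration. The sketch also assumes well-formed markup and pops only when the end tag matches the top of the stack, a guard the source does not mention.

```python
from html.parser import HTMLParser

# Assumption: the source does not say which tags count as "block" tags.
BLOCK_TAGS = {"html", "body", "div", "table", "tr", "td", "ul", "ol", "li", "p"}


class Node:
    """One block node of the tag tree (hypothetical helper type)."""

    def __init__(self, tag):
        self.tag = tag
        self.children = []


class TagTreeBuilder(HTMLParser):
    """Builds a tree of block tags with an explicit stack."""

    def __init__(self):
        super().__init__()
        self.root = None
        self.stack = []
        self.done = False

    def handle_starttag(self, tag, attrs):
        if self.done or tag not in BLOCK_TAGS:
            return
        node = Node(tag)
        if not self.stack:
            # Step one: the root tag <html> is pushed first.
            if tag != "html":
                return
            self.root = node
        else:
            # Step two: the new block becomes a child of the stack-top block.
            self.stack[-1].children.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if self.done or not self.stack or tag not in BLOCK_TAGS:
            return
        # Step three: a block end tag pops the stack (only on a match here).
        if self.stack[-1].tag == tag:
            self.stack.pop()
            # Step four: an empty stack means construction is complete.
            if not self.stack:
                self.done = True


def build_tag_tree(html):
    """Scan the page once and return the root node of the tag tree."""
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def outline(node, depth=0):
    """Return the tree as indented lines, for inspection."""
    lines = ["  " * depth + node.tag]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines
```

Calling `build_tag_tree(page)` and printing `"\n".join(outline(root))` shows the nesting of block nodes; tags outside `BLOCK_TAGS` are skipped, so inline markup does not appear in the tree.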

Compared with the prior art, the invention provides a distributed big data acquisition implementation method, which has the following beneficial effects:

In the invention, the capture module obtains a URL to be crawled from the URL queue, then uses an available proxy obtained from the IP proxy pool to fetch raw data from the internet and hands the data to the analysis module. The analysis module first preprocesses the data to remove obvious noise, then extracts text information with a text extraction algorithm based on the weights of block nodes in the tag tree. URL-related data is passed to the URL processing module, and basic data is passed to the data storage module. The URL processing module mainly controls distributed crawling, and the data storage module normalizes and persists the data, laying a foundation for subsequent analysis and processing.

Drawings

Fig. 1 is a schematic structural diagram of a distributed big data acquisition implementation method according to the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In the description of the present invention, it is to be understood that the terms "upper", "lower", "front", "rear", "left", "right", "top", "bottom", "inner", "outer", and the like, indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, are merely for convenience in describing the present invention and simplifying the description, and do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be construed as limiting the present invention.

Referring to fig. 1, a distributed big data acquisition implementation method includes a capture module, an IP proxy pool module, an analysis module, a URL processing module, and a data storage module, where the IP proxy pool module is provided with proxy update and proxy allocation, the analysis module is provided with URL data extraction and basic data extraction, the URL processing module is provided with URL filtering, the URL filtering is connected with a URL queue, and the URL queue is connected with a URL allocation process.

The capture module obtains data from the internet, the IP proxy pool module and the URL processing module, respectively, and the analysis module receives the data from the capture module and then passes it to the URL processing module and the data storage module, respectively.

The URL processing module comprises a main control node and capture hosts. The main control node takes URLs from the to-be-crawled URL queue and allocates them to the capture hosts. The capture hosts complete the acquisition and analysis tasks and submit the successfully crawled URLs and the newly extracted URLs to the main control node for processing. The successfully crawled URLs are cached in the crawled set, and the new URLs are filtered against the crawled set and cached in the corresponding to-be-crawled queue for subsequent crawling. Both the to-be-crawled queue and the crawled set are implemented with an in-memory database, and the to-be-crawled queue follows a first-stored, first-allocated strategy.
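
The scheduling behaviour of the main control node can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: a `collections.deque` and Python sets stand in for the in-memory database, and the `MasterNode` class, its method names, and the extra `pending` set (which keeps a URL from being queued twice while it waits or is in flight) are hypothetical.

```python
from collections import deque


class MasterNode:
    """Hypothetical main control node: FIFO queue plus crawled set."""

    def __init__(self, seed_urls):
        self.to_crawl = deque()  # stand-in for the in-memory to-be-crawled queue
        self.crawled = set()     # stand-in for the in-memory crawled set
        self.pending = set()     # assumption: URLs queued or handed to a host
        for url in seed_urls:
            self._enqueue(url)

    def _enqueue(self, url):
        # Filter new URLs against the crawled set (and against pending ones).
        if url not in self.crawled and url not in self.pending:
            self.pending.add(url)
            self.to_crawl.append(url)

    def allocate(self, n):
        """Hand out up to n URLs, first stored, first allocated."""
        batch = []
        while self.to_crawl and len(batch) < n:
            batch.append(self.to_crawl.popleft())
        return batch

    def report(self, crawled_urls, new_urls):
        """Called by a capture host when it finishes a batch."""
        self.crawled.update(crawled_urls)
        self.pending.difference_update(crawled_urls)
        for url in new_urls:
            self._enqueue(url)
```

A capture host would call `allocate`, crawl the batch, then call `report` with the URLs it fetched and the links it extracted. A real deployment would also need to handle URLs that fail and are never reported, which this sketch leaves out.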

Each proxy in the pool of the IP proxy pool module carries a flag that records its state: useful, using, or useless. On each allocation, a proxy flagged useful is taken from the pool, assigned to a capture host, and its flag is changed to using. If the capture host returns the proxy because of an IP restriction, the proxy is flagged useless; otherwise it is returned normally and flagged useful. When an automatic update is triggered, all proxies in the pool that are not flagged using are tested, and the useless proxies are deleted from the pool after the test.
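
The proxy state handling described above can be sketched as a small state machine. The `ProxyPool` class, its method names, and the `is_alive` callback used for testing a proxy are hypothetical, since the source does not specify how a proxy is tested or what triggers the update. The sketch further assumes that a proxy passing the test is reset to useful and a proxy failing it is deleted.

```python
from enum import Enum


class ProxyState(Enum):
    USEFUL = "useful"
    USING = "using"
    USELESS = "useless"


class ProxyPool:
    """Hypothetical pool that tracks one state flag per proxy."""

    def __init__(self, proxies):
        self.state = {proxy: ProxyState.USEFUL for proxy in proxies}

    def allocate(self):
        """Pick a proxy flagged useful and mark it as using."""
        for proxy, state in self.state.items():
            if state is ProxyState.USEFUL:
                self.state[proxy] = ProxyState.USING
                return proxy
        return None  # no proxy is currently available

    def release(self, proxy, ip_limited):
        """Return a proxy; flag it useless if the host hit an IP restriction."""
        if ip_limited:
            self.state[proxy] = ProxyState.USELESS
        else:
            self.state[proxy] = ProxyState.USEFUL

    def refresh(self, is_alive):
        """Automatic update: test every proxy that is not in use.

        is_alive is a caller-supplied check (an assumption of this sketch).
        """
        for proxy in list(self.state):
            if self.state[proxy] is ProxyState.USING:
                continue
            if is_alive(proxy):
                self.state[proxy] = ProxyState.USEFUL
            else:
                del self.state[proxy]
```

Because a capture host that hits an IP restriction can immediately call `allocate` again for a different proxy, it does not have to wait for the restriction to expire, which is the efficiency gain the abstract describes.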

A tag tree is built for each web page on the host, and the building steps are as follows:

step one: push the root tag <HTML> onto the stack;

In use, the capture module obtains a URL to be crawled from the URL queue, then uses an available proxy obtained from the IP proxy pool to fetch raw data from the internet and hands the data to the analysis module. The analysis module first preprocesses the data to remove obvious noise, then extracts text information with a text extraction algorithm based on the weights of block nodes in the tag tree. URL-related data is passed to the URL processing module, and basic data is passed to the data storage module. The URL processing module mainly controls distributed crawling, and the data storage module normalizes and persists the data, laying a foundation for subsequent analysis and processing.
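
The data flow between the modules in one crawl step can be sketched as below. All names here (`LinkExtractor`, `parse`, `crawl_once`, and the `fetch`, `url_sink` and `store` callbacks) are hypothetical, and the network fetch is injected as a function so the sketch runs offline. The noise preprocessing and the weight-based text extraction are not specified in enough detail to reproduce; plain text collection is swapped in as a placeholder.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute link URLs and visible text from one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_data(self, data):
        # Placeholder for the weight-based text extraction in the source.
        if data.strip():
            self.text.append(data.strip())


def parse(url, html):
    """Analysis step: split a page into URL data and basic data."""
    extractor = LinkExtractor(url)
    extractor.feed(html)
    extractor.close()
    return extractor.links, " ".join(extractor.text)


def crawl_once(url, proxy, fetch, url_sink, store):
    """One pass: fetch through a proxy, analyse, then route the results."""
    html = fetch(url, proxy)      # capture module
    links, text = parse(url, html)  # analysis module
    url_sink(url, links)          # URL data goes to the URL processing module
    store(url, text)              # basic data goes to the data storage module
```

In a fuller version, `fetch` would issue the HTTP request through the given proxy, `url_sink` would report to the main control node, and `store` would write to a persistent store.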

Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims

1. A distributed big data acquisition implementation method, comprising a capture module, an IP proxy pool module, an analysis module, a URL processing module and a data storage module, characterized in that the IP proxy pool module provides proxy updating and proxy allocation, the analysis module provides URL data extraction and basic data extraction, the URL processing module provides URL filtering, the URL filtering is connected to a URL queue, and the URL queue is connected to a URL allocation process.

2. The distributed big data acquisition implementation method of claim 1, wherein the capture module obtains data from the internet, the IP proxy pool module, and the URL processing module, respectively, and the analysis module receives the data from the capture module and then passes it to the URL processing module and the data storage module, respectively.

3. The distributed big data acquisition implementation method according to claim 1, wherein the URL processing module includes a main control node and capture hosts, the main control node takes URLs from the to-be-crawled URL queue and allocates them to the capture hosts, the capture hosts then complete the acquisition task and the analysis task and submit the successfully crawled URLs and the newly extracted URLs to the main control node for processing, the successfully crawled URLs are cached in the crawled set, and the new URLs are filtered against the crawled set and cached in the corresponding to-be-crawled queue for subsequent crawling, wherein the to-be-crawled queue and the crawled set are both implemented with an in-memory database, and the to-be-crawled queue follows a first-stored, first-allocated strategy.

4. The method as claimed in claim 1, wherein each proxy in the pool of the IP proxy pool module carries a flag that records its state: useful, using, or useless; on each allocation, a proxy flagged useful is taken from the pool, assigned to a capture host, and its flag is changed to using; if the capture host returns the proxy because of an IP restriction, the proxy is flagged useless, and otherwise it is returned normally and flagged useful; when an automatic update is triggered, all proxies in the pool that are not flagged using are tested, and the useless proxies are deleted from the pool after the test.

5. The distributed big data acquisition implementation method of claim 3, wherein a tag tree is built for each web page on the host, and the building steps are as follows:

step one: push the root tag <HTML> onto the stack;