CN110807137A - Distributed big data acquisition implementation method - Google Patents
Distributed big data acquisition implementation method
- Publication number
- CN110807137A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The invention relates to the technical field of big data, and in particular to a distributed big data acquisition implementation method comprising a capture module, an IP proxy pool module, an analysis module, a URL processing module and a data storage module. The IP proxy pool module provides proxy updating and proxy allocation; the analysis module provides URL data extraction and basic data extraction; the URL processing module provides URL filtering, which is connected to a URL queue, which in turn is connected to a URL allocation process. By switching proxies, the invention keeps the system working continuously and eliminates the waiting imposed by access-restriction periods, greatly improving the acquisition efficiency of the system.
Description
Technical Field
The invention relates to the technical field of big data, in particular to a distributed big data acquisition implementation method.
Background
With the popularization of emerging internet technologies such as the mobile internet, e-commerce and social networking, network data such as images, videos and logs has grown explosively. The nearly 400 million members of Taobao generate about 20 TB of commodity transaction data, and the roughly 10 million users of Facebook generate more than 300 TB of log data. The big data era has arrived, and big data is a hot research topic. Data is the foundation of big data research, yet traditional data acquisition schemes struggle to collect high-quality data sets quickly. How to collect massive amounts of high-quality data efficiently therefore plays an extremely important role in big data application and research.
Disclosure of Invention
The invention aims to overcome the shortcomings of the prior art and provides a distributed big data acquisition implementation method.
To achieve this purpose, the invention adopts the following technical solution:
A distributed big data acquisition implementation method comprises a capture module, an IP proxy pool module, an analysis module, a URL processing module and a data storage module, wherein the IP proxy pool module is provided with proxy updating and proxy allocation, the analysis module is provided with URL data extraction and basic data extraction, the URL processing module is provided with URL filtering, the URL filtering is connected to a URL queue, and the URL queue is connected to a URL allocation process.
Preferably, the capture module obtains data from the internet, the IP proxy pool module and the URL processing module respectively, and the analysis module receives the data from the capture module and passes it on to the URL processing module and the data storage module respectively.
Preferably, the URL processing module includes a master control node and capture hosts. The master control node takes URLs from the to-be-captured URL queue and allocates them to the capture hosts. Each capture host then completes its acquisition and analysis tasks and submits the successfully captured URLs and the newly extracted URLs to the master control node for processing. The successfully captured URLs are cached in the crawled set; the new URLs are then filtered against the crawled set and cached in the corresponding to-be-captured queue. Both the to-be-captured queue and the crawled set are implemented with an in-memory database, and the to-be-captured queue follows a first-stored, first-allocated strategy for subsequent crawling.
Preferably, each proxy in the IP proxy pool has a flag recording its state: useful, using or useless. Each time, a proxy flagged useful is taken from the pool and allocated to a capture host, and its flag is changed to using. If the capture host returns the proxy because its IP has been restricted, the proxy is flagged useless; otherwise it is returned normally and flagged useful. When an automatic update is triggered, all proxies in the pool that are not flagged using are checked, and useless proxies are deleted from the pool after the check.
Preferably, the capture host builds a tag tree for each web page, and the construction steps are as follows:
Step one: push the root tag <HTML> onto the stack;
Step two: when a block start tag is encountered, make the block a child node of the block node on top of the stack, and push the block tag onto the stack;
Step three: when a block end tag is encountered, pop the top of the stack;
Step four: if the stack is empty, construction is complete; otherwise continue scanning, going to step two when a block start tag is encountered and to step three when a block end tag is encountered.
Compared with the prior art, the distributed big data acquisition implementation method provided by the invention has the following beneficial effects:
The capture module obtains the URLs to be crawled from the URL queue, then uses an available proxy obtained from the IP proxy pool to capture raw data from the internet, and hands the data to the analysis module. The analysis module first preprocesses the data to remove obvious noise, then extracts the text information with a text extraction algorithm based on the weights of the block nodes of the tag tree. URL-related data is handed to the URL processing module, and basic data is handed to the data storage module. The URL processing module is mainly responsible for controlling distributed capture, and the data storage module normalizes and persists the data, laying a foundation for subsequent analysis and processing.
Drawings
Fig. 1 is a schematic structural diagram of a distributed big data acquisition implementation method according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the description of the present invention, it is to be understood that the terms "upper", "lower", "front", "rear", "left", "right", "top", "bottom", "inner", "outer", and the like, indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, are merely for convenience in describing the present invention and simplifying the description, and do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be construed as limiting the present invention.
Referring to Fig. 1, a distributed big data acquisition implementation method includes a capture module, an IP proxy pool module, an analysis module, a URL processing module and a data storage module. The IP proxy pool module is provided with proxy updating and proxy allocation, the analysis module is provided with URL data extraction and basic data extraction, the URL processing module is provided with URL filtering, the URL filtering is connected to a URL queue, and the URL queue is connected to a URL allocation process.
The capture module obtains data from the internet, the IP proxy pool module and the URL processing module respectively, and the analysis module receives the data from the capture module and then passes it on to the URL processing module and the data storage module respectively.
The URL processing module comprises a master control node and capture hosts. The master control node takes URLs from the to-be-captured URL queue and allocates them to the capture hosts. Each capture host completes its acquisition and analysis tasks and submits the successfully captured URLs and the newly extracted URLs to the master control node for processing. The successfully captured URLs are cached in the crawled set; new URLs are filtered against the crawled set and cached in the corresponding to-be-captured queue. Both the to-be-captured queue and the crawled set are implemented with an in-memory database, and the to-be-captured queue follows a first-stored, first-allocated strategy for subsequent crawling.
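The master control node's bookkeeping can be illustrated with a minimal Python sketch. It is not the patented implementation: the class name `UrlMaster`, its methods, and the extra `queued` set (which keeps the same URL from being enqueued twice) are assumptions made for illustration, and plain in-process containers stand in for the in-memory database named in the text.

```python
from collections import deque


class UrlMaster:
    """Hypothetical master node: FIFO to-be-captured queue plus a crawled set."""

    def __init__(self, seeds):
        self.to_crawl = deque()  # URLs waiting to be captured, oldest first
        self.queued = set()      # assumption: URLs currently waiting, to avoid duplicates
        self.crawled = set()     # URLs that were captured successfully
        self.submit(None, seeds)

    def allocate(self, n):
        """Hand out up to n URLs in first-stored, first-allocated order."""
        batch = []
        while self.to_crawl and len(batch) < n:
            url = self.to_crawl.popleft()
            self.queued.discard(url)
            batch.append(url)
        return batch

    def submit(self, done_url, new_urls):
        """Record a successful capture, then filter and enqueue new URLs."""
        if done_url is not None:
            self.crawled.add(done_url)
        for url in new_urls:
            # Filter against the crawled set (and the waiting set) before caching.
            if url not in self.crawled and url not in self.queued:
                self.to_crawl.append(url)
                self.queued.add(url)
```

A capture host would call `allocate` to receive work and `submit` to report the captured URL together with the links it extracted; in a real deployment both containers would live in a shared store so that several hosts see the same state.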
Each proxy in the IP proxy pool has a flag recording its state: useful, using or useless. Each time, a proxy flagged useful is taken from the pool and allocated to a capture host, and its flag is changed to using. If the capture host returns the proxy because its IP has been restricted, the proxy is flagged useless; otherwise it is returned normally and flagged useful. When an automatic update is triggered, all proxies in the pool that are not flagged using are checked, and useless proxies are deleted from the pool after the check.
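The proxy life cycle described above amounts to a small state machine, sketched below in Python. The names `ProxyPool`, `acquire`, `release`, `refresh` and the `probe` callable are hypothetical. The text does not say how a proxy is checked, nor whether a proxy flagged useless can be reinstated; this sketch assumes a proxy that passes the check is flagged useful again and one that fails is deleted.

```python
from enum import Enum


class State(Enum):
    USEFUL = "useful"
    USING = "using"
    USELESS = "useless"


class ProxyPool:
    """Hypothetical proxy pool tracking one state flag per proxy."""

    def __init__(self, proxies, probe):
        self.state = {p: State.USEFUL for p in proxies}
        self.probe = probe  # assumed callable: proxy -> True if it still works

    def acquire(self):
        """Allocate the first proxy flagged useful and flag it as using."""
        for proxy, st in self.state.items():
            if st is State.USEFUL:
                self.state[proxy] = State.USING
                return proxy
        return None  # no proxy is available right now

    def release(self, proxy, ip_limited):
        """Return a proxy: useless if its IP was restricted, else useful."""
        self.state[proxy] = State.USELESS if ip_limited else State.USEFUL

    def refresh(self):
        """Automatic update: check every proxy not in use, drop the dead ones."""
        for proxy, st in list(self.state.items()):
            if st is State.USING:
                continue  # never touch a proxy a capture host still holds
            if self.probe(proxy):
                self.state[proxy] = State.USEFUL
            else:
                del self.state[proxy]
```

Because a restricted proxy is only flagged rather than waited on, a capture host can immediately call `acquire` again and continue with a different proxy, which is the switching behaviour the abstract credits for the efficiency gain.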
The capture host builds a tag tree for each web page, and the construction steps are as follows:
Step one: push the root tag <HTML> onto the stack;
Step two: when a block start tag is encountered, make the block a child node of the block node on top of the stack, and push the block tag onto the stack;
Step three: when a block end tag is encountered, pop the top of the stack;
Step four: if the stack is empty, construction is complete; otherwise continue scanning, going to step two when a block start tag is encountered and to step three when a block end tag is encountered.
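The four steps above can be sketched with Python's standard `html.parser` module. Several things here are assumptions rather than part of the source: the set `BLOCK_TAGS` (the text does not say which tags count as block tags), the `Node` class, and the requirement that the input is well formed, with every block tag explicitly closed.

```python
from html.parser import HTMLParser

# Assumption: the source does not list which tags are treated as blocks.
BLOCK_TAGS = {"html", "body", "div", "table", "tr", "td", "ul", "li", "p"}


class Node:
    def __init__(self, tag):
        self.tag = tag
        self.children = []


class TagTreeBuilder(HTMLParser):
    """Builds a tree of block nodes using an explicit stack."""

    def __init__(self):
        super().__init__()
        self.root = None
        self.stack = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        if tag not in BLOCK_TAGS:
            return
        node = Node(tag)
        if self.stack:
            self.stack[-1].children.append(node)  # step two: child of the stack top
        elif self.root is None:
            self.root = node                      # step one: the root <HTML> tag
        else:
            return  # stack already emptied once: construction has finished
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Step three: pop when the matching block end tag arrives.
        if tag in BLOCK_TAGS and self.stack and self.stack[-1].tag == tag:
            self.stack.pop()


def build_tag_tree(html):
    """Scan the page once; the tree is complete when the stack is empty (step four)."""
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def outline(node):
    """Nested (tag, children) view of the tree, convenient for inspection."""
    return (node.tag, [outline(child) for child in node.children])
```

Calling `outline(build_tag_tree(page))` gives a compact view of the block structure. Real pages often omit closing tags, so a production parser would need a recovery rule that this sketch deliberately leaves out.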
In use, the capture module obtains the URLs to be crawled from the URL queue, then uses an available proxy obtained from the IP proxy pool to capture raw data from the internet, and hands the data to the analysis module. The analysis module first preprocesses the data to remove obvious noise, then extracts the text information with a text extraction algorithm based on the weights of the block nodes of the tag tree. URL-related data is handed to the URL processing module, and basic data is handed to the data storage module. The URL processing module is mainly responsible for controlling distributed capture, and the data storage module normalizes and persists the data, laying a foundation for subsequent analysis and processing.
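The data flow of a single capture-and-analysis pass can be sketched as follows. Everything named here is hypothetical: `fetch` stands in for a proxied download, `extract_text` stands in for the tag-tree weight algorithm (whose details the text does not give), and treating scripts, styles and comments as the "obvious noise" is an assumption.

```python
import re


def clean(raw_html):
    """Preprocess: drop script/style blocks and comments (assumed noise)."""
    no_scripts = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", "", raw_html)
    return re.sub(r"(?s)<!--.*?-->", "", no_scripts)


def extract_links(html):
    """URL data extraction: collect double-quoted href values."""
    return re.findall(r'(?i)href="([^"]+)"', html)


def process_page(url, fetch, extract_text):
    """Capture one URL and split the result into the two output streams.

    fetch(url) returns raw HTML; extract_text(html) returns the page text.
    Returns (url_data, basic_data): the first goes to the URL processing
    module, the second to the data storage module.
    """
    html = clean(fetch(url))
    url_data = {"crawled": url, "new_urls": extract_links(html)}
    basic_data = {"url": url, "text": extract_text(html)}
    return url_data, basic_data
```

Passing `fetch` and `extract_text` in as parameters keeps the sketch independent of any particular HTTP client or extraction algorithm.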
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.
Claims (5)
1. A distributed big data acquisition implementation method, comprising a capture module, an IP proxy pool module, an analysis module, a URL processing module and a data storage module, characterized in that the IP proxy pool module is provided with proxy updating and proxy allocation, the analysis module is provided with URL data extraction and basic data extraction, the URL processing module is provided with URL filtering, the URL filtering is connected to a URL queue, and the URL queue is connected to a URL allocation process.
2. The distributed big data acquisition implementation method according to claim 1, wherein the capture module obtains data from the internet, the IP proxy pool module and the URL processing module respectively, and the analysis module receives the data from the capture module and then passes it on to the URL processing module and the data storage module respectively.
3. The distributed big data acquisition implementation method according to claim 1, wherein the URL processing module includes a master control node and capture hosts; the master control node takes URLs from the to-be-captured URL queue and allocates them to the capture hosts; each capture host then completes its acquisition and analysis tasks and submits the successfully captured URLs and the newly extracted URLs to the master control node for processing; the successfully captured URLs are cached in the crawled set, and the new URLs are then filtered against the crawled set and cached in the corresponding to-be-captured queue; wherein both the to-be-captured queue and the crawled set are implemented with an in-memory database, and the to-be-captured queue follows a first-stored, first-allocated strategy for subsequent crawling.
4. The distributed big data acquisition implementation method according to claim 1, wherein each proxy in the IP proxy pool has a flag recording its state: useful, using or useless; each time, a proxy flagged useful is taken from the pool and allocated to a capture host, and its flag is changed to using; if the capture host returns the proxy because its IP has been restricted, the proxy is flagged useless, and otherwise it is returned normally and flagged useful; when an automatic update is triggered, all proxies in the pool that are not flagged using are checked, and useless proxies are deleted from the pool after the check.
5. The distributed big data acquisition implementation method according to claim 3, wherein the capture host builds a tag tree for each web page, and the construction steps are as follows:
Step one: push the root tag <HTML> onto the stack;
Step two: when a block start tag is encountered, make the block a child node of the block node on top of the stack, and push the block tag onto the stack;
Step three: when a block end tag is encountered, pop the top of the stack;
Step four: if the stack is empty, construction is complete; otherwise continue scanning, going to step two when a block start tag is encountered and to step three when a block end tag is encountered.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910290171.5A CN110807137A (en) | 2019-04-11 | 2019-04-11 | Distributed big data acquisition implementation method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110807137A (en) | 2020-02-18 |
Family
ID=69487330
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910290171.5A Pending CN110807137A (en) | 2019-04-11 | 2019-04-11 | Distributed big data acquisition implementation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110807137A (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030065711A1 (en) * | 2001-10-01 | 2003-04-03 | International Business Machines Corporation | Method and apparatus for content-aware web switching |
US20040148571A1 (en) * | 2003-01-27 | 2004-07-29 | Lue Vincent Wen-Jeng | Method and apparatus for adapting web contents to different display area |
CN102306204A (en) * | 2011-09-28 | 2012-01-04 | 武汉大学 | Subject area identifying method based on weight of text structure |
CN103023998A (en) * | 2012-11-29 | 2013-04-03 | 网宿科技股份有限公司 | Temporary jump error correction method and system based on content distribution network fringe node |
CN105243159A (en) * | 2015-10-28 | 2016-01-13 | 福建亿榕信息技术有限公司 | Visual script editor-based distributed web crawler system |
CN106021608A (en) * | 2016-06-22 | 2016-10-12 | 广东亿迅科技有限公司 | Distributed crawler system and implementing method thereof |
WO2017113324A1 (en) * | 2015-12-31 | 2017-07-06 | 孙燕群 | Regular expression-based url filtering method |
CN109508422A (en) * | 2018-12-05 | 2019-03-22 | 南京邮电大学 | Highly concealed crawler system with multithreaded intelligent scheduling |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110062025B (en) | Data acquisition method, device, server and storage medium | |
CN102054028B (en) | Method for implementing web-rendering function by using web crawler system | |
CN105589943B (en) | The method, apparatus and server of the picture adaptive processes of result of page searching | |
CN101986284B (en) | Dynamic recovery system for waste space of virtual machine image file | |
CN103970788A (en) | Webpage-crawling-based crawler technology | |
CN104077402B (en) | Data processing method and data handling system | |
CN104516982A (en) | Method and system for extracting Web information based on Nutch | |
CN102436513A (en) | Distributed search method and system | |
CN102045305A (en) | Method and system for monitoring and tracking multimedia resource transmission | |
CN102164186A (en) | Method and system for realizing cloud search service | |
US10462257B2 (en) | Method and apparatus for obtaining user account | |
CN104348859B (en) | File synchronisation method, device, server, terminal and system | |
CN104866528B (en) | Multi-platform collecting method and system | |
CN106599270B (en) | Network data capturing method and crawler | |
CN106921703A (en) | The method of cross-border data syn-chronization, system, and domestic and overseas data center | |
CN106599001A (en) | Webpage content acquisition method and system | |
CN104199893B (en) | A kind of system and method for quickly issuing full media content | |
US20180307530A1 (en) | Data persistence method and system thereof in stream computing | |
CN103955517B (en) | Method and system for converting data in documental database to relational database | |
CN113656673A (en) | Master-slave distributed content crawling robot for advertisement delivery | |
CN103136225B (en) | A kind of method and system of Internet picture conversion | |
EP3226149A1 (en) | Method and device for providing website authentication data for search engine | |
CN113946399B (en) | Space data loading method and device | |
CN107679091A (en) | A kind of search system and method based on big data | |
CN110807137A (en) | Distributed big data acquisition implementation method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||