CN112347330A - Distributed parallel acquisition method for urban big data - Google Patents

Distributed parallel acquisition method for urban big data

Info

Publication number
CN112347330A
Authority
CN
China
Prior art keywords
request
crawled
data
crawler
queue
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011221308.0A
Other languages
Chinese (zh)
Inventor
徐磊
王纪军
张震宇
赵琳
陈刚
王春波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Electric Power Information Technology Co Ltd
Original Assignee
Jiangsu Electric Power Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Electric Power Information Technology Co Ltd
Priority to CN202011221308.0A
Publication of CN112347330A
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/95 Retrieval from the web
    • G06F 16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/24 Querying
    • G06F 16/245 Query processing
    • G06F 16/2457 Query processing with adaptation to user needs
    • G06F 16/24578 Query processing with adaptation to user needs using ranking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/95 Retrieval from the web
    • G06F 16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]

Abstract

The invention discloses a distributed parallel acquisition method for urban big data. The method first selects the data to be acquired and determines the target websites from which to capture it; it then captures the webpage url information of the data from the target websites, stores it in the main server database, and maintains it as a queue of websites to be crawled; the main server then schedules the crawling queue, generates url crawling requests, and distributes them to crawler processes on multiple servers for execution; finally, the crawler program on each server stores the acquired data in the database on the main server. The method greatly improves the data acquisition rate of the crawler program.

Description

Distributed parallel acquisition method for urban big data
Technical Field
The invention belongs to the field of big data processing, and relates to a distributed parallel acquisition method for urban big data.
Background
In the early stage of big data research and application, traditional industries lacked fast and convenient data acquisition, processing and analysis methods when designing investment schemes, and could not efficiently analyze and predict investment cost from historical data. Investment decisions therefore usually depended on the experience of decision makers and lacked a systematic basis, which often led to inappropriate investment allocation and left funds underutilized. In recent years, with the development of internet technology and the maturation of big data acquisition and analysis techniques, big data analysis has achieved good results in many fields, and combining traditional industries with new technology has become the trend of the times. By analyzing decades of economic, population and other data across cities, combined with existing budget investment results, the underlying patterns of change can be grasped and used as reference and prediction for future investment schemes. Acquiring urban big data on GDP, population, enterprises, traffic and other aspects can therefore provide data support for the investment planning of traditional enterprises such as power utilities, and has important practical significance.
At present, web crawler technology is the main way to obtain large amounts of structured data from the internet. When the volume of data to be acquired is huge, a traditional single-machine crawler is limited in speed because large numbers of requests accumulate in a queue waiting to be scheduled while page content is parsed and downloaded, so the efficiency of a single-machine crawler is difficult to improve. For crawling tasks with deeper page hierarchies, a single crawler also makes the page-parsing logic excessively complex.
Disclosure of Invention
The invention aims to provide a distributed parallel acquisition method for urban big data that improves data acquisition efficiency, simplifies page-parsing logic, and greatly increases the data acquisition rate of the crawler program.
In order to solve the technical problems, the technical scheme adopted by the invention is as follows:
A distributed parallel acquisition method for urban big data performs distributed crawling of urban big data by sharing a crawling-request queue, and comprises the following steps:
1) selecting the types of city data to be acquired, such as year-by-year city GDP, population change and enterprise information, determining one or more main pages to be crawled, analyzing the hierarchical structure of the pages, and determining the position of the target data;
2) according to the webpage structure, crawling the url addresses of secondary pages from the main page, generating crawling requests, deduplicating them, storing them in the main server database, and constructing the request queue to be crawled;
3) configuring a crawler process and a scheduler on several sub-servers and connecting them to the main server; when the current crawler process has no crawling task, it sends an idle-state signal and calls a function to request that a crawl request be allocated; the Scheduler determines the main server queue to be crawled from the crawler name, selects a request according to the priority of the queue to be crawled, and sends it to the engine, which passes it to the crawler program for execution;
4) if the data crawled during crawler operation is a new url link, it is handed to the scheduler, which sends it to the main server, where a new request is generated and, after deduplication judgment, is either stored in the queue database or discarded; if the crawled data is the target data information, the data is preprocessed and stored in the main server database;
5) after finishing all crawling tasks in the current request pool, the crawler process sends an idle-state signal and calls a continuation function to request the next batch of requests, until the queue to be crawled in the main server is empty.
Further, the request deduplication in step 2) is performed as follows: a non-repeating set is established; the hash value of the url to be crawled is taken and an attempt is made to add it to the set; if the addition fails, the url is already in the queue and is discarded to avoid repetition; if the addition succeeds, the url has not been crawled and is added to the queue to be crawled.
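A minimal sketch of this set-based deduplication, assuming the non-repeating set lives in the main server's Redis and using a SHA-1 hash of the url as the fingerprint; the host, key names and function name are illustrative assumptions rather than values from the patent:

```python
# Illustrative sketch of the hash-set deduplication described above (names assumed).
import hashlib
import redis

r = redis.Redis(host="master-server", port=6379, db=0)  # Redis on the main server

def enqueue_if_new(spider_name: str, url: str) -> bool:
    """Add the url to the queue to be crawled unless its fingerprint is already in the set."""
    fingerprint = hashlib.sha1(url.encode("utf-8")).hexdigest()
    # SADD returns 1 if the member was newly added, 0 if it was already present.
    if r.sadd(f"{spider_name}:dupefilter", fingerprint) == 0:
        return False                              # joining failed: url already queued, discard
    r.rpush(f"{spider_name}:requests", url)       # joining succeeded: add to queue to be crawled
    return True
```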
The queue to be crawled in step 2) is constructed as follows: the request is converted into a dict object and serialized into a string using the serializer method in picklecompat; a list object is established to store each request to be crawled and record the order in which they were stored; this list is stored in the database as the value, with the crawler name as the key; the request stored earliest is taken out of the database each time, realizing a first-in first-out storage form for the queue to be crawled.
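A minimal sketch of this FIFO queue, assuming plain pickle serialization (the picklecompat serializer named above is a thin compatibility wrapper around pickle) and a Redis list keyed by the crawler name; all names are illustrative assumptions:

```python
# Illustrative sketch of the queue to be crawled: crawler name as key, list of
# serialized requests as value, earliest-stored request popped first (FIFO).
import pickle
import redis

r = redis.Redis(host="master-server", port=6379, db=0)

def push_request(spider_name: str, request: dict) -> None:
    """Serialize the request dict to a byte string and append it, preserving insertion order."""
    r.rpush(f"{spider_name}:requests", pickle.dumps(request))

def pop_request(spider_name: str):
    """Take out the request stored earliest; returns None when the queue is empty."""
    data = r.lpop(f"{spider_name}:requests")
    return pickle.loads(data) if data is not None else None
```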
Request distributed scheduling in step 3) works as follows: when a crawler process on a sub-server is idle, a function is called to ask the scheduler on that server for a new crawling request; on receiving the request, the scheduler uses the crawler name as the key to query the corresponding list of the queue to be crawled on the main server, takes out a batch of the earliest-entered requests, and hands them to the crawler engine, which sends them to the sub-server's crawler process for execution; the crawler processes on all servers connect to the same crawling queue in the main server database through the crawler name, so the queue to be crawled is shared and scheduled in a distributed way.
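A minimal sketch of the shared-queue scheduling loop run on each sub-server; because every process reads the same Redis key derived from the crawler name, they all draw work from one queue. The batch size and function names are illustrative assumptions:

```python
# Illustrative sketch: an idle crawler process asks for a batch of the earliest
# requests from the shared queue and executes them; every sub-server runs this loop.
import pickle
import redis

r = redis.Redis(host="master-server", port=6379, db=0)

def fetch_batch(spider_name: str, batch_size: int = 16) -> list:
    """Take a batch of the earliest-entered requests out of the shared queue."""
    batch = []
    for _ in range(batch_size):
        data = r.lpop(f"{spider_name}:requests")
        if data is None:
            break                          # queue on the main server is empty
        batch.append(pickle.loads(data))
    return batch

def crawl_loop(spider_name: str, execute_request) -> None:
    """Run until the queue to be crawled in the main server is empty."""
    while True:
        batch = fetch_batch(spider_name)   # called when the process signals an idle state
        if not batch:
            break
        for request in batch:
            execute_request(request)       # handed to the local crawler engine for execution
```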
The crawled data in step 4) is distinguished and processed as follows: different crawlers and page-parsing functions are executed according to the depth of the page; for url links crawled from the current page, a request is generated and stored in the crawling queue corresponding to the next-level page crawler program, waiting to be scheduled and executed by the next-level crawler process; the crawled non-url data is stored at the corresponding position of the main server database according to the parsing-function rules of the corresponding page crawler, so that page information of different levels is processed separately during deep crawling.
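A minimal sketch of this per-level routing, assuming parse results are plain dicts: url links are pushed to the next-level crawler's queue, and non-url target data is written to the main server's MongoDB. Database, collection and key names are illustrative assumptions:

```python
# Illustrative sketch: route each parse result either to the next-level crawl
# queue (if it is a new url link) or to the MongoDB store (if it is target data).
import pickle
import redis
import pymongo

r = redis.Redis(host="master-server", port=6379, db=0)
mongo_db = pymongo.MongoClient("mongodb://master-server:27017")["city_data"]

def handle_result(next_spider: str, collection: str, result: dict) -> None:
    if "url" in result:
        # New link crawled from the current page: becomes a request for the next-level crawler
        # (deduplication against the fingerprint set would happen before this push).
        r.rpush(f"{next_spider}:requests", pickle.dumps(result))
    else:
        # Target data: preprocess as needed, then store on the main server.
        mongo_db[collection].insert_one(result)
```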
The invention has the following beneficial effects:
1. the method realizes the distributed crawling process of the data on the multiple servers in a mode of sharing the crawling queue;
2. the duplicate removal operation of the request is carried out by utilizing the characteristic that the hash fingerprint of the url is not repeated with the set, so that the unnecessary repeated crawling process is reduced;
3. for crawling of multiple levels of pages, each level of page is analyzed by adopting an independent crawler logic, and nesting degree and complexity of the logic in each crawler are reduced.
Drawings
FIG. 1 is a flow chart of the distributed parallel acquisition procedure of the present invention.
Fig. 2 is an overall system framework diagram of the present invention.
Detailed Description
The present invention is further illustrated by the following examples, which are purely exemplary and are not intended to limit the scope of the invention as defined by the appended claims and interpreted by those skilled in the art.
Referring to fig. 1 and 2, the distributed parallel acquisition method for urban big data according to the present invention includes the following steps:
step 1: and determining the type of the urban data to be acquired according to the requirements of the project, and further determining a source website of the data. Crawling is carried out by taking enterprise information of Jiangsu province as target data, and yellow pages of enterprise information of Jiangsu province are selected as data source websites;
step 2: the structure of the data source website is analyzed, and the yellow page website of enterprise information of Jiangsu province in the embodiment has a four-layer structure. The first layer is the initial page which is the enterprise level classification page of Jiangsu province, and the data needing to be crawled on the page are 20 enterprise level class aliases and corresponding url links; the second layer is a second-level classification page contained in the first-level category, and the data needing to be crawled on the pages are enterprise second-level category names and corresponding url links; the third layer is a third-level classification page contained in the second-level category, and the data needing to be crawled on the pages are the third-level category names of the current enterprises, the names of the enterprises in the third-level category and url links of the detailed information pages of the enterprises; the fourth layer is an enterprise detailed information page, and the data to be crawled is information such as enterprise names, registration dates, registration funds, enterprise profiles, enterprise addresses and the like;
and step 3: installing a redis memory database on a main server for storing and maintaining a queue to be crawled, and installing a MongoDB database for storing non-url data;
and 4, step 4: the method comprises the steps of analyzing and determining the position of information to be crawled according to page source codes, compiling a crawler page analysis function aiming at different pages, and remotely connecting a redis database and a MongoDB database of a main server for follow-up url access and data storage, wherein url information is stored in a queue to be crawled corresponding to the name of a crawler at the next level of the redis database, so that the queues to be crawled at different levels correspond to the crawlers one by one;
and 5: and deploying and starting a crawler process at each sub-server, then inputting a starting url in a redis database of the main server, detecting that new data enters, starting building a crawling queue and distributing the crawling queue to the crawler process for execution.

Claims (6)

1. A distributed parallel acquisition method for urban big data, characterized in that a shared crawling-request queue is used to perform distributed crawling of the urban big data, the method comprising the following steps:
1) selecting the type of city data to be acquired, determining one or more main pages to be crawled, analyzing the hierarchical structure of the pages, and determining the position of target data;
2) according to the webpage structure, crawling a url address of a secondary page from a main page, generating a crawling request, removing duplication, storing the crawling request in a main server database, and constructing a request queue to be crawled;
3) a crawler process and a scheduler are configured on several sub-servers and connected to the main server; when the current crawler process has no crawling task, it sends an idle-state signal and calls a function to request that a crawl request be allocated; the Scheduler determines the main server queue to be crawled from the crawler name, selects a request according to the priority of the queue to be crawled, and sends it to the engine, which passes it to the crawler program for execution;
4) if the data crawled during crawler operation is a new url link, it is handed to the scheduler, which sends it to the main server, where a new request is generated and, after deduplication judgment, is either stored in the queue database or discarded; if the crawled data is the target data information, the data is preprocessed and stored in the main server database;
5) after finishing all crawling tasks in the current request pool, the crawler process sends an idle-state signal and calls a continuation function to request the next batch of requests, until the queue to be crawled in the main server is empty.
2. The distributed parallel acquisition method for urban big data according to claim 1, wherein in step 1), the city data types include year-by-year city GDP, population change and enterprise information.
3. The urban big data-oriented distributed parallel acquisition method according to claim 1, wherein the request deduplication in step 2) is performed in a specific manner as follows: establishing a non-repeated set, taking a hash value of the url to be crawled and trying to join the url into the set, and if the joining fails, indicating that the url is in the queue, and discarding the url to avoid repetition; and if the set is successfully added, the url is not crawled, and the url is added into a queue to be crawled.
4. The distributed parallel acquisition method for urban big data according to claim 1, wherein the queue to be crawled in step 2) is constructed as follows: the request is converted into a dict object and serialized into a string using the serializer method in picklecompat; a list object is established to store each request to be crawled and record the order in which they were stored; the list is stored in the database as the value, with the crawler name as the key; the request stored earliest is taken out of the database each time, realizing a first-in first-out storage form for the queue to be crawled.
5. The distributed parallel acquisition method for the urban big data according to claim 1, wherein in step 3), when a crawler process on a sub-server is idle, a function is called to request a scheduler corresponding to the server to acquire a new crawling request; after receiving the request, the dispatcher queries a corresponding queue list to be crawled in the main server by taking the crawler name as a key, takes out a batch of earlier-entered requests from the queue list and gives the requests to a crawler engine, and the engine sends the requests to a crawler process of the sub-server for execution; the crawler processes on the servers can be connected to the same crawling queue in the main server database through crawler names, so that sharing and distributed scheduling of the queue to be crawled are achieved.
6. The distributed parallel acquisition method for big urban data according to claim 1, wherein in step 4), the specific way of distinguishing and processing crawled data is as follows: executing different crawler and page analysis functions according to the depth of the page, generating a request for url links crawled by the current page, storing the request into a crawling queue corresponding to a next-level page crawler program, and waiting for scheduling and execution of a next-level crawler process; and storing the crawled non-url data to the position of the main server corresponding to the database according to the analysis function rule of the corresponding page crawler, thereby realizing the distinguishing processing of different levels of page information during deep crawling.
CN202011221308.0A 2020-11-05 2020-11-05 Distributed parallel acquisition method for urban big data Pending CN112347330A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011221308.0A CN112347330A (en) 2020-11-05 2020-11-05 Distributed parallel acquisition method for urban big data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011221308.0A CN112347330A (en) 2020-11-05 2020-11-05 Distributed parallel acquisition method for urban big data

Publications (1)

Publication Number Publication Date
CN112347330A true CN112347330A (en) 2021-02-09

Family

ID=74428812

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011221308.0A Pending CN112347330A (en) 2020-11-05 2020-11-05 Distributed parallel acquisition method for urban big data

Country Status (1)

Country Link
CN (1) CN112347330A (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140279056A1 (en) * 2013-03-15 2014-09-18 TriVu Media, Inc. Intelligent platform for real-time bidding
CN110633429A (en) * 2018-05-31 2019-12-31 北京京东尚科信息技术有限公司 Content crawling method and device and distributed crawler system
CN110874429A (en) * 2019-11-14 2020-03-10 北京京航计算通讯研究所 Distributed web crawler performance optimization method oriented to mass data acquisition
CN111209460A (en) * 2019-12-27 2020-05-29 青岛海洋科学与技术国家实验室发展中心 Data acquisition system and method based on script crawler framework
CN111488508A (en) * 2020-04-10 2020-08-04 长春博立电子科技有限公司 Internet information acquisition system and method supporting multi-protocol distributed high concurrency

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
FAN Yuhao (樊宇豪): "Design and Implementation of a Distributed Web Crawler System Based on Scrapy", China Master's Theses Full-text Database, Information Science and Technology Series, pages 138 - 1052 *

Similar Documents

Publication Publication Date Title
Zhao et al. Dache: A data aware caching for big-data applications using the MapReduce framework
CN108256115B (en) Spark Sql-oriented HDFS small file real-time combination implementation method
CN108415776B (en) Memory pre-estimation and configuration optimization method in distributed data processing system
CN107590188B (en) Crawler crawling method and management system for automatic vertical subdivision field
US6112196A (en) Method and system for managing connections to a database management system by reusing connections to a database subsystem
US9247022B2 (en) Method and apparatus for optimizing performance and network traffic in distributed workflow processing
CN101477524A (en) System performance optimization method and system based on materialized view
CN104899199A (en) Data processing method and system for data warehouse
US20120143844A1 (en) Multi-level coverage for crawling selection
CN110134738B (en) Distributed storage system resource estimation method and device
CN106383746A (en) Configuration parameter determination method and apparatus of big data processing system
CN113918793A (en) Multi-source scientific and creative resource data acquisition method
CN111680075A (en) Hadoop + Spark traffic prediction system and method based on combination of offline analysis and online prediction
CN109522138A (en) A kind of processing method and system of distributed stream data
US8612597B2 (en) Computing scheduling using resource lend and borrow
CN108228330A (en) The multi-process method for scheduling task and device of a kind of serialization
CN109992469B (en) Method and device for merging logs
CN101021916A (en) Business process analysis method
CN107133321B (en) Method and device for analyzing search characteristics of page
CN112667873A (en) Crawler system and method suitable for general data acquisition of most websites
CN116974994A (en) High-efficiency file collaboration system based on clusters
CN112052284A (en) Main data management method and system under big data
CN112347330A (en) Distributed parallel acquisition method for urban big data
CN116126901A (en) Data processing method, device, electronic equipment and computer readable storage medium
Iskhakova Processing of big data streams in intelligent electronic data analysis systems

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination