CN112347330A - Distributed parallel acquisition method for urban big data - Google Patents

Distributed parallel acquisition method for urban big data

Info

Publication number
CN112347330A
Authority
CN
China
Prior art keywords
request
crawled
data
crawler
queue
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011221308.0A
Other languages
Chinese (zh)
Inventor
徐磊
王纪军
张震宇
赵琳
陈刚
王春波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Electric Power Information Technology Co Ltd
Original Assignee
Jiangsu Electric Power Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Electric Power Information Technology Co Ltd
Priority to CN202011221308.0A
Publication of CN112347330A
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/95 Retrieval from the web
    • G06F 16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/24 Querying
    • G06F 16/245 Query processing
    • G06F 16/2457 Query processing with adaptation to user needs
    • G06F 16/24578 Query processing with adaptation to user needs using ranking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/95 Retrieval from the web
    • G06F 16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]

Abstract

The invention discloses a distributed parallel acquisition method for urban big data. The method first selects the data to be acquired and determines the target websites from which to capture it; it then captures the webpage url information of the data from the target websites, stores it in the main server database, and maintains it as a queue of websites to be crawled; the main server then schedules the crawling queue, generates url crawling requests, and distributes them to crawler processes on multiple servers for execution; finally, the crawler program on each server stores the acquired data in the database on the main server. The method greatly improves the data acquisition rate of the crawler program.

Description

Distributed parallel acquisition method for urban big data
Technical Field
The invention belongs to the field of big data processing, and relates to a distributed parallel acquisition method for urban big data.
Background
In the early stage of big data research and application, traditional industries lacked fast and convenient data acquisition, processing and analysis methods when designing investment schemes, and could not efficiently analyze and predict investment cost from historical data. Investment decisions therefore usually depended on the experience of decision makers and lacked a systematic basis, which often led to inappropriate investment allocation and left funds underutilized. In recent years, with the development of internet technology and the maturation of big data acquisition and analysis techniques, big data analysis has achieved good results in many fields, and combining traditional industries with new technology has become the trend of the times. By analyzing decades of economic, population and other data across cities, combined with existing budget investment results, the underlying patterns of change can be grasped and used as reference and prediction for future investment schemes. Acquiring urban big data on GDP, population, enterprises, traffic and other aspects can therefore provide data support for the investment planning of traditional enterprises such as power utilities, and has important practical significance.
At present, web crawler technology is the main way to obtain large amounts of structured data from the internet. When the volume of data to be acquired is huge, a traditional single-machine crawler is limited in speed because large numbers of requests accumulate in a queue waiting to be scheduled while page content is parsed and downloaded, so the efficiency of a single-machine crawler is difficult to improve. For crawling tasks with deeper page hierarchies, a single crawler also makes the page-parsing logic excessively complex.
Disclosure of Invention
The invention aims to provide a distributed parallel acquisition method for urban big data that improves data acquisition efficiency, simplifies page-parsing logic, and greatly increases the data acquisition rate of the crawler program.
In order to solve the technical problems, the technical scheme adopted by the invention is as follows:
A distributed parallel acquisition method for urban big data performs distributed crawling of urban big data by sharing a crawling-request queue, and comprises the following steps:
1) selecting the types of city data to be acquired, such as year-by-year city GDP, population change and enterprise information, determining one or more main pages to be crawled, analyzing the hierarchical structure of the pages, and determining the position of the target data;
2) according to the webpage structure, crawling the url addresses of secondary pages from the main page, generating crawling requests, deduplicating them, storing them in the main server database, and constructing the request queue to be crawled;
3) configuring a crawler process and a scheduler on several sub-servers and connecting them to the main server; when the current crawler process has no crawling task, it sends an idle-state signal and calls a function to request that a crawl request be allocated; the Scheduler determines the main server queue to be crawled from the crawler name, selects a request according to the priority of the queue to be crawled, and sends it to the engine, which passes it to the crawler program for execution;
4) if the data crawled during crawler operation is a new url link, it is handed to the scheduler, which sends it to the main server, where a new request is generated and, after deduplication judgment, is either stored in the queue database or discarded; if the crawled data is the target data information, the data is preprocessed and stored in the main server database;
5) after finishing all crawling tasks in the current request pool, the crawler process sends an idle-state signal and calls a continuation function to request the next batch of requests, until the queue to be crawled in the main server is empty.
Further, the request deduplication in step 2) is performed as follows: a non-repeating set is established; the hash value of the url to be crawled is taken and an attempt is made to add it to the set; if the addition fails, the url is already in the queue and is discarded to avoid repetition; if the addition succeeds, the url has not been crawled and is added to the queue to be crawled.
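A minimal sketch of this set-based deduplication, assuming the non-repeating set lives in the main server's Redis and using a SHA-1 hash of the url as the fingerprint; the host, key names and function name are illustrative assumptions rather than values from the patent:

```python
# Illustrative sketch of the hash-set deduplication described above (names assumed).
import hashlib
import redis

r = redis.Redis(host="master-server", port=6379, db=0)  # Redis on the main server

def enqueue_if_new(spider_name: str, url: str) -> bool:
    """Add the url to the queue to be crawled unless its fingerprint is already in the set."""
    fingerprint = hashlib.sha1(url.encode("utf-8")).hexdigest()
    # SADD returns 1 if the member was newly added, 0 if it was already present.
    if r.sadd(f"{spider_name}:dupefilter", fingerprint) == 0:
        return False                              # joining failed: url already queued, discard
    r.rpush(f"{spider_name}:requests", url)       # joining succeeded: add to queue to be crawled
    return True
```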
The queue to be crawled in step 2) is constructed as follows: the request is converted into a dict object and serialized into a string using the serializer method in picklecompat; a list object is established to store each request to be crawled and record the order in which they were stored; this list is stored in the database as the value, with the crawler name as the key; the request stored earliest is taken out of the database each time, realizing a first-in first-out storage form for the queue to be crawled.
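A minimal sketch of this FIFO queue, assuming plain pickle serialization (the picklecompat serializer named above is a thin compatibility wrapper around pickle) and a Redis list keyed by the crawler name; all names are illustrative assumptions:

```python
# Illustrative sketch of the queue to be crawled: crawler name as key, list of
# serialized requests as value, earliest-stored request popped first (FIFO).
import pickle
import redis

r = redis.Redis(host="master-server", port=6379, db=0)

def push_request(spider_name: str, request: dict) -> None:
    """Serialize the request dict to a byte string and append it, preserving insertion order."""
    r.rpush(f"{spider_name}:requests", pickle.dumps(request))

def pop_request(spider_name: str):
    """Take out the request stored earliest; returns None when the queue is empty."""
    data = r.lpop(f"{spider_name}:requests")
    return pickle.loads(data) if data is not None else None
```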
Request distributed scheduling in step 3) works as follows: when a crawler process on a sub-server is idle, a function is called to ask the scheduler on that server for a new crawling request; on receiving the request, the scheduler uses the crawler name as the key to query the corresponding list of the queue to be crawled on the main server, takes out a batch of the earliest-entered requests, and hands them to the crawler engine, which sends them to the sub-server's crawler process for execution; the crawler processes on all servers connect to the same crawling queue in the main server database through the crawler name, so the queue to be crawled is shared and scheduled in a distributed way.
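A minimal sketch of the shared-queue scheduling loop run on each sub-server; because every process reads the same Redis key derived from the crawler name, they all draw work from one queue. The batch size and function names are illustrative assumptions:

```python
# Illustrative sketch: an idle crawler process asks for a batch of the earliest
# requests from the shared queue and executes them; every sub-server runs this loop.
import pickle
import redis

r = redis.Redis(host="master-server", port=6379, db=0)

def fetch_batch(spider_name: str, batch_size: int = 16) -> list:
    """Take a batch of the earliest-entered requests out of the shared queue."""
    batch = []
    for _ in range(batch_size):
        data = r.lpop(f"{spider_name}:requests")
        if data is None:
            break                          # queue on the main server is empty
        batch.append(pickle.loads(data))
    return batch

def crawl_loop(spider_name: str, execute_request) -> None:
    """Run until the queue to be crawled in the main server is empty."""
    while True:
        batch = fetch_batch(spider_name)   # called when the process signals an idle state
        if not batch:
            break
        for request in batch:
            execute_request(request)       # handed to the local crawler engine for execution
```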
The crawled data in step 4) is distinguished and processed as follows: different crawlers and page-parsing functions are executed according to the depth of the page; for url links crawled from the current page, a request is generated and stored in the crawling queue corresponding to the next-level page crawler program, waiting to be scheduled and executed by the next-level crawler process; the crawled non-url data is stored at the corresponding position of the main server database according to the parsing-function rules of the corresponding page crawler, so that page information of different levels is processed separately during deep crawling.
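A minimal sketch of this per-level routing, assuming parse results are plain dicts: url links are pushed to the next-level crawler's queue, and non-url target data is written to the main server's MongoDB. Database, collection and key names are illustrative assumptions:

```python
# Illustrative sketch: route each parse result either to the next-level crawl
# queue (if it is a new url link) or to the MongoDB store (if it is target data).
import pickle
import redis
import pymongo

r = redis.Redis(host="master-server", port=6379, db=0)
mongo_db = pymongo.MongoClient("mongodb://master-server:27017")["city_data"]

def handle_result(next_spider: str, collection: str, result: dict) -> None:
    if "url" in result:
        # New link crawled from the current page: becomes a request for the next-level crawler
        # (deduplication against the fingerprint set would happen before this push).
        r.rpush(f"{next_spider}:requests", pickle.dumps(result))
    else:
        # Target data: preprocess as needed, then store on the main server.
        mongo_db[collection].insert_one(result)
```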
The invention has the following beneficial effects:
1. the method realizes the distributed crawling process of the data on the multiple servers in a mode of sharing the crawling queue;
2. the duplicate removal operation of the request is carried out by utilizing the characteristic that the hash fingerprint of the url is not repeated with the set, so that the unnecessary repeated crawling process is reduced;
3. for crawling of multiple levels of pages, each level of page is analyzed by adopting an independent crawler logic, and nesting degree and complexity of the logic in each crawler are reduced.
Drawings
FIG. 1 is a flow chart of the distributed parallel acquisition procedure of the present invention.
Fig. 2 is an overall system framework diagram of the present invention.
Detailed Description
The present invention is further illustrated by the following examples, which are purely exemplary and are not intended to limit the scope of the invention as defined by the appended claims and interpreted by those skilled in the art.
Referring to fig. 1 and 2, the distributed parallel acquisition method for urban big data according to the present invention includes the following steps:
step 1: and determining the type of the urban data to be acquired according to the requirements of the project, and further determining a source website of the data. Crawling is carried out by taking enterprise information of Jiangsu province as target data, and yellow pages of enterprise information of Jiangsu province are selected as data source websites;
step 2: the structure of the data source website is analyzed, and the yellow page website of enterprise information of Jiangsu province in the embodiment has a four-layer structure. The first layer is the initial page which is the enterprise level classification page of Jiangsu province, and the data needing to be crawled on the page are 20 enterprise level class aliases and corresponding url links; the second layer is a second-level classification page contained in the first-level category, and the data needing to be crawled on the pages are enterprise second-level category names and corresponding url links; the third layer is a third-level classification page contained in the second-level category, and the data needing to be crawled on the pages are the third-level category names of the current enterprises, the names of the enterprises in the third-level category and url links of the detailed information pages of the enterprises; the fourth layer is an enterprise detailed information page, and the data to be crawled is information such as enterprise names, registration dates, registration funds, enterprise profiles, enterprise addresses and the like;
and step 3: installing a redis memory database on a main server for storing and maintaining a queue to be crawled, and installing a MongoDB database for storing non-url data;
and 4, step 4: the method comprises the steps of analyzing and determining the position of information to be crawled according to page source codes, compiling a crawler page analysis function aiming at different pages, and remotely connecting a redis database and a MongoDB database of a main server for follow-up url access and data storage, wherein url information is stored in a queue to be crawled corresponding to the name of a crawler at the next level of the redis database, so that the queues to be crawled at different levels correspond to the crawlers one by one;
and 5: and deploying and starting a crawler process at each sub-server, then inputting a starting url in a redis database of the main server, detecting that new data enters, starting building a crawling queue and distributing the crawling queue to the crawler process for execution.

Claims (6)

1. A distributed parallel acquisition method for urban big data, characterized in that a shared crawling-request queue is used to perform distributed crawling of the urban big data, the method comprising the following steps:
1) selecting the type of city data to be acquired, determining one or more main pages to be crawled, analyzing the hierarchical structure of the pages, and determining the position of target data;
2) according to the webpage structure, crawling a url address of a secondary page from a main page, generating a crawling request, removing duplication, storing the crawling request in a main server database, and constructing a request queue to be crawled;
3) a crawler process and a scheduler are configured on several sub-servers and connected to the main server; when the current crawler process has no crawling task, it sends an idle-state signal and calls a function to request that a crawl request be allocated; the Scheduler determines the main server queue to be crawled from the crawler name, selects a request according to the priority of the queue to be crawled, and sends it to the engine, which passes it to the crawler program for execution;
4) if the data crawled during crawler operation is a new url link, it is handed to the scheduler, which sends it to the main server, where a new request is generated and, after deduplication judgment, is either stored in the queue database or discarded; if the crawled data is the target data information, the data is preprocessed and stored in the main server database;
5) after finishing all crawling tasks in the current request pool, the crawler process sends an idle-state signal and calls a continuation function to request the next batch of requests, until the queue to be crawled in the main server is empty.
2. The distributed parallel acquisition method for urban big data according to claim 1, wherein in step 1), the city data types include year-by-year city GDP, population change and enterprise information.
3. The urban big data-oriented distributed parallel acquisition method according to claim 1, wherein the request deduplication in step 2) is performed in a specific manner as follows: establishing a non-repeated set, taking a hash value of the url to be crawled and trying to join the url into the set, and if the joining fails, indicating that the url is in the queue, and discarding the url to avoid repetition; and if the set is successfully added, the url is not crawled, and the url is added into a queue to be crawled.
4. The distributed parallel acquisition method for urban big data according to claim 1, wherein the queue to be crawled in step 2) is constructed as follows: the request is converted into a dict object and serialized into a string using the serializer method in picklecompat; a list object is established to store each request to be crawled and record the order in which they were stored; the list is stored in the database as the value, with the crawler name as the key; the request stored earliest is taken out of the database each time, realizing a first-in first-out storage form for the queue to be crawled.
5. The distributed parallel acquisition method for the urban big data according to claim 1, wherein in step 3), when a crawler process on a sub-server is idle, a function is called to request a scheduler corresponding to the server to acquire a new crawling request; after receiving the request, the dispatcher queries a corresponding queue list to be crawled in the main server by taking the crawler name as a key, takes out a batch of earlier-entered requests from the queue list and gives the requests to a crawler engine, and the engine sends the requests to a crawler process of the sub-server for execution; the crawler processes on the servers can be connected to the same crawling queue in the main server database through crawler names, so that sharing and distributed scheduling of the queue to be crawled are achieved.
6. The distributed parallel acquisition method for big urban data according to claim 1, wherein in step 4), the specific way of distinguishing and processing crawled data is as follows: executing different crawler and page analysis functions according to the depth of the page, generating a request for url links crawled by the current page, storing the request into a crawling queue corresponding to a next-level page crawler program, and waiting for scheduling and execution of a next-level crawler process; and storing the crawled non-url data to the position of the main server corresponding to the database according to the analysis function rule of the corresponding page crawler, thereby realizing the distinguishing processing of different levels of page information during deep crawling.
CN202011221308.0A 2020-11-05 2020-11-05 Distributed parallel acquisition method for urban big data Pending CN112347330A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011221308.0A CN112347330A (en) 2020-11-05 2020-11-05 Distributed parallel acquisition method for urban big data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011221308.0A CN112347330A (en) 2020-11-05 2020-11-05 Distributed parallel acquisition method for urban big data

Publications (1)

Publication Number Publication Date
CN112347330A true CN112347330A (en) 2021-02-09

Family

ID=74428812

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011221308.0A Pending CN112347330A (en) 2020-11-05 2020-11-05 Distributed parallel acquisition method for urban big data

Country Status (1)

Country Link
CN (1) CN112347330A (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140279056A1 (en) * 2013-03-15 2014-09-18 TriVu Media, Inc. Intelligent platform for real-time bidding
CN110633429A (en) * 2018-05-31 2019-12-31 北京京东尚科信息技术有限公司 Content crawling method and device and distributed crawler system
CN110874429A (en) * 2019-11-14 2020-03-10 北京京航计算通讯研究所 Distributed web crawler performance optimization method oriented to mass data acquisition
CN111209460A (en) * 2019-12-27 2020-05-29 青岛海洋科学与技术国家实验室发展中心 Data acquisition system and method based on script crawler framework
CN111488508A (en) * 2020-04-10 2020-08-04 长春博立电子科技有限公司 Internet information acquisition system and method supporting multi-protocol distributed high concurrency

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
FAN Yuhao (樊宇豪): "Design and Implementation of a Distributed Web Crawler System Based on Scrapy", China Master's Theses Full-text Database, Information Science and Technology Series, pages 138 - 1052 *

Similar Documents

Publication Publication Date Title
Zhao et al. Dache: A data aware caching for big-data applications using the MapReduce framework
CN108256115B (en) Spark Sql-oriented HDFS small file real-time combination implementation method
CN108415776B (en) Memory pre-estimation and configuration optimization method in distributed data processing system
CN107590188B (en) Crawler crawling method and management system for automatic vertical subdivision field
US6112196A (en) Method and system for managing connections to a database management system by reusing connections to a database subsystem
US9247022B2 (en) Method and apparatus for optimizing performance and network traffic in distributed workflow processing
CN101477524A (en) System performance optimization method and system based on materialized view
CN104899199A (en) Data processing method and system for data warehouse
US20120143844A1 (en) Multi-level coverage for crawling selection
CN110134738B (en) Distributed storage system resource estimation method and device
CN106383746A (en) Configuration parameter determination method and apparatus of big data processing system
CN113918793A (en) Multi-source scientific and creative resource data acquisition method
CN111680075A (en) Hadoop + Spark traffic prediction system and method based on combination of offline analysis and online prediction
CN109522138A (en) A kind of processing method and system of distributed stream data
US8612597B2 (en) Computing scheduling using resource lend and borrow
CN108228330A (en) The multi-process method for scheduling task and device of a kind of serialization
CN109992469B (en) Method and device for merging logs
CN101021916A (en) Business process analysis method
CN107133321B (en) Method and device for analyzing search characteristics of page
CN112667873A (en) Crawler system and method suitable for general data acquisition of most websites
CN116974994A (en) High-efficiency file collaboration system based on clusters
CN112052284A (en) Main data management method and system under big data
CN112347330A (en) Distributed parallel acquisition method for urban big data
CN116126901A (en) Data processing method, device, electronic equipment and computer readable storage medium
Iskhakova Processing of big data streams in intelligent electronic data analysis systems

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination