CN111274466A

CN111274466A - Non-structural data acquisition system and method for overseas server

Info

Publication number: CN111274466A
Application number: CN201911310422.8A
Authority: CN
Inventors: 陈泽勇; 张治同; 姚松; 张莉
Original assignee: Chengdu Dippmann Information Technology Co Ltd
Current assignee: Chengdu Dippmann Information Technology Co Ltd
Priority date: 2019-12-18
Filing date: 2019-12-18
Publication date: 2020-06-12

Abstract

The invention discloses a system and a method for acquiring unstructured website data. The method comprises the following steps: creating a project, defining the Items to be extracted, writing a Spider for the website to be collected and extracting the Items, and writing an Item Pipeline to store the extracted Items. The system comprises a Scrapy engine module, a scheduler module, a downloader module, a crawler module, an Item pipeline module and a middleware module. Using a customized data acquisition template, the Scrapy engine module drives and coordinates the scheduler module, the downloader module, the crawler module, the Item pipeline module and the middleware module, so that the invention can support various network protocols and directionally acquire data of unstructured websites from the Internet.

Description

Non-structural data acquisition system and method for overseas server

Technical Field

The invention relates to the field of data processing, in particular to a system and a method for acquiring unstructured website data.

Background

With the development of the Internet and the big data industry, timely and effective data acquisition has become very important. However, as data volumes keep growing, they often contain a large amount of small unstructured data, and task scheduling for such data consumes a large amount of resources, which reduces acquisition efficiency.

In order to solve the above problems, a system capable of stable, targeted data acquisition from the Internet is required.

Disclosure of Invention

The invention aims to provide a system and a method for acquiring unstructured website data that address the above problems.

A method for collecting data of an unstructured website comprises the following steps:

the engine opens a website, finds the Spider that handles the website, and requests the first URL to be crawled from the Spider;

the engine acquires the first URL to be crawled from the Spider and schedules it in the scheduler as a Request;

the engine requests the next URL to be crawled from the scheduler;

the scheduler returns the next URL to be crawled to the engine, and the engine forwards the URL to the downloader through the downloader middleware;

once the page is downloaded, the downloader generates a Response for the page and sends the Response to the engine through the downloader middleware;

the engine receives the Response from the downloader and sends it to the Spider for processing through the Spider middleware;

the Spider processes the Response and returns the crawled Items and new Requests to the engine;

the engine sends the crawled Items to the Item pipeline and sends the Requests to the scheduler;

the above steps are repeated until there are no more Requests in the scheduler, at which point the engine closes the website.

Further, the Spider class defines the data collection actions and the methods for extracting structured data.
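The data flow described above can be sketched as a small loop. This is a minimal, standard-library-only illustration and not the framework's actual engine: the `Request`, `Response` and `Item` classes, the in-memory `FAKE_SITE`, the `download` function and the duplicate-URL filter are all assumptions introduced here for the example.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable


@dataclass
class Request:
    url: str
    callback: Callable  # Spider method that will parse the Response


@dataclass
class Response:
    url: str
    body: str


@dataclass
class Item:
    url: str
    title: str


# Hypothetical in-memory "website": URL -> (page title, outgoing links).
FAKE_SITE = {
    "http://example.test/": ("Home", ["http://example.test/a", "http://example.test/b"]),
    "http://example.test/a": ("Page A", ["http://example.test/b"]),
    "http://example.test/b": ("Page B", []),
}


def download(request):
    """Stand-in for the downloader: builds a Response from FAKE_SITE."""
    title, links = FAKE_SITE[request.url]
    return Response(url=request.url, body=title + "|" + ",".join(links))


class ToySpider:
    start_urls = ["http://example.test/"]

    def start_requests(self):
        # Initial Requests, each with parse() as the callback.
        for url in self.start_urls:
            yield Request(url, self.parse)

    def parse(self, response):
        # Return one Item per page plus a new Request per link.
        title, _, links = response.body.partition("|")
        yield Item(url=response.url, title=title)
        for link in filter(None, links.split(",")):
            yield Request(link, self.parse)


def run_engine(spider):
    scheduler = deque()  # FIFO queue of pending Requests
    seen = set()         # assumption: skip URLs that were already scheduled
    stored = []          # stand-in for the Item pipeline

    def schedule(request):
        if request.url not in seen:
            seen.add(request.url)
            scheduler.append(request)

    for request in spider.start_requests():
        schedule(request)

    while scheduler:                       # repeat until no Requests remain
        request = scheduler.popleft()      # next URL from the scheduler
        response = download(request)       # downloader produces a Response
        for result in request.callback(response):
            if isinstance(result, Request):
                schedule(result)           # new Requests go back to the scheduler
            else:
                stored.append(result)      # Items go to the pipeline
    return stored


items = run_engine(ToySpider())
```

With the three hypothetical pages above, the loop ends once the queue is empty and `items` holds one `Item` per page, in crawl order.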

Further, the data structure adopted by the scheduler module is a queue.

Further, the item class object is a container for storing the crawled data.
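The two data structures named above can be illustrated as follows. This is a sketch under stated assumptions: `ArticleItem` and its field names are hypothetical, and rejecting undeclared fields is a design choice of this example rather than something the text specifies.

```python
from collections import deque


class ArticleItem(dict):
    """Container for crawled data that only accepts declared fields."""

    fields = ("url", "title", "body")  # hypothetical field names

    def __init__(self, **values):
        super().__init__()
        for key, value in values.items():
            self[key] = value  # route through __setitem__ for validation

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f"ArticleItem does not support field: {key}")
        super().__setitem__(key, value)


# The scheduler stores pending requests in a first-in, first-out queue.
pending = deque()
pending.append("http://example.test/1")
pending.append("http://example.test/2")
first = pending.popleft()  # the request that was queued first

item = ArticleItem(url=first, title="Example page")
```

Declaring the fields up front means a misspelled field name fails immediately instead of silently producing an incomplete record.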

A non-structural website data acquisition system comprises a Scrapy engine, a downloader module, a crawler module and an Item pipeline module; the Scrapy engine module is responsible for transmitting data signals among the different modules; the scheduler module is used for storing the requests sent by the engine; the downloader module is used for downloading the requests sent by the engine and returning the results to the engine; the Item pipeline module is used for processing the data transmitted by the engine. The system further comprises a middleware module, wherein the middleware module is used for customizing download extensions and requests.

Further, the Scrapy engine module is configured to control the data flow in the framework: data is passed between the downloader module and the crawler module by way of the downloader middleware and the crawler middleware, and the crawled information is stored in the Item pipeline module.

Further, the middleware module comprises downloader middleware and crawler middleware; the downloader middleware is a framework of hooks into the request/response processing of the acquisition framework and is a system for globally modifying the framework's requests and responses; the crawler middleware is a framework of hooks into the crawler processing mechanism of the acquisition framework and is used for processing the Responses sent to Spiders and the Items and Requests generated by Spiders.
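As an illustration of the hook idea, the sketch below passes a request through a chain of downloader-style middleware before the download and passes the response back through the same chain afterwards. It uses only the standard library; the class names, the simplified `process_request` / `process_response` signatures, the `fake_downloader` function and the reverse ordering on the way back are assumptions of this example, not the framework's own code.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    url: str
    headers: dict = field(default_factory=dict)


@dataclass
class Response:
    url: str
    status: int
    body: str


class UserAgentMiddleware:
    """Globally modifies outgoing requests."""

    def process_request(self, request):
        request.headers.setdefault("User-Agent", "example-collector/0.1")
        return request

    def process_response(self, request, response):
        return response


class StripBodyMiddleware:
    """Globally modifies incoming responses."""

    def process_request(self, request):
        return request

    def process_response(self, request, response):
        response.body = response.body.strip()
        return response


def fake_downloader(request):
    # Hypothetical downloader that returns a fixed page.
    return Response(url=request.url, status=200, body="  <html>ok</html>\n")


def download_through(middlewares, request, downloader):
    # Requests pass through the hooks in order on the way out...
    for mw in middlewares:
        request = mw.process_request(request)
    response = downloader(request)
    # ...and responses pass back through them in reverse order.
    for mw in reversed(middlewares):
        response = mw.process_response(request, response)
    return response


request = Request("http://example.test/")
response = download_through(
    [UserAgentMiddleware(), StripBodyMiddleware()], request, fake_downloader
)
```

Because every request and response crosses the same chain, a change made in one middleware applies to the whole crawl without touching any Spider.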

Further, the crawler module consists of classes written by the user to parse Responses and extract Items or additional URLs to follow.

Further, the Item pipeline module is used for processing items extracted by the spider.
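A minimal sketch of such item processing is shown below, assuming items are plain dictionaries. The method names `open_spider`, `process_item(item, spider)` and `close_spider` follow the pipeline interface documented by Scrapy, but the two pipeline classes, the `run_pipelines` driver and the in-memory buffer are hypothetical and stand in for real storage.

```python
import io
import json


class StripWhitespacePipeline:
    """Cleans string fields before the item is stored."""

    def process_item(self, item, spider):
        return {k: v.strip() if isinstance(v, str) else v for k, v in item.items()}


class JsonLinesPipeline:
    """Stores each item as one JSON line (here in memory, not on disk)."""

    def open_spider(self, spider):
        self.buffer = io.StringIO()

    def process_item(self, item, spider):
        self.buffer.write(json.dumps(item, sort_keys=True) + "\n")
        return item

    def close_spider(self, spider):
        pass  # a real pipeline would close its file or connection here


def run_pipelines(pipelines, items, spider=None):
    """Hypothetical driver: feeds every item through each pipeline in order."""
    for pipeline in pipelines:
        if hasattr(pipeline, "open_spider"):
            pipeline.open_spider(spider)
    for item in items:
        for pipeline in pipelines:
            item = pipeline.process_item(item, spider)
    for pipeline in pipelines:
        if hasattr(pipeline, "close_spider"):
            pipeline.close_spider(spider)


writer = JsonLinesPipeline()
run_pipelines(
    [StripWhitespacePipeline(), writer],
    [{"url": "http://example.test/", "title": "  Hello \n"}],
)
stored_lines = writer.buffer.getvalue().splitlines()
```

Each pipeline returns the item so that the next stage receives the cleaned version, which keeps cleaning and storage independent of each other.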

Further, the whole data acquisition framework is written based on the event-driven network framework Twisted.

The invention collects data from unstructured data sources through a customized data collection template, supports various network protocols, and can directionally collect data from the Internet.

Drawings

FIG. 1 is a system architecture diagram of the present invention.

FIG. 2 is a flow chart of the method of the present invention.

Detailed Description

In order to more clearly understand the technical features, objects, and effects of the present invention, embodiments of the present invention will now be described with reference to the accompanying drawings.

In this embodiment, a method for acquiring unstructured website data includes:

Further, the data structure adopted by the scheduler module is a queue.

Further, the item class object is a container for storing the crawled data.

The foregoing shows and describes the general principles and features of the present invention, together with the advantages thereof. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are described in the specification and illustrated only to illustrate the principle of the present invention, but that various changes and modifications may be made therein without departing from the spirit and scope of the present invention, which should fall within the scope of the claimed invention. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims

1. A data acquisition method for an unstructured website is characterized by comprising the following steps:

S1: the engine opens a website, finds the Spider that handles the website, and requests the first URL to be crawled from the Spider;

S2: the engine acquires the first URL to be crawled from the Spider and schedules it in the scheduler as a Request;

S3: the engine requests the next URL to be crawled from the scheduler;

S4: the scheduler returns the next URL to be crawled to the engine, and the engine forwards the URL to the downloader through the downloader middleware;

S5: once the page is downloaded, the downloader generates a Response for the page and sends the Response to the engine through the downloader middleware;

S6: the engine receives the Response from the downloader and sends it to the Spider for processing through the Spider middleware;

S7: the Spider processes the Response and returns the crawled Items and new Requests to the engine;

S8: the engine sends the crawled Items to the Item pipeline and sends the Requests to the scheduler;

S9: steps S3 to S8 are repeated until there are no more Requests in the scheduler, at which point the engine closes the website.

2. The method for collecting data of unstructured website as defined in claim 1, wherein said step S1 comprises the following sub-steps:

S101: initializing a Request with an initial URL and setting a callback function;

S102: when the request is downloaded and returned, generating a response and passing the response as a parameter to the callback function;

S103: calling start_requests() to obtain the initial requests in the spider;

S104: start_requests() reads the URLs in start_urls and generates a Request for each with parse as the callback function.

3. The method for collecting data of unstructured website according to claim 1, wherein the spider class in step S2 comprises data collection action and method for extracting structured data.

4. The method as claimed in claim 1, wherein the data structure adopted by the scheduler module in step S8 is a queue.

5. The method of claim 1, wherein the Item class object in step S8 is a container for storing crawled data.

6. A non-structural website data acquisition system is characterized by comprising a Scrapy engine, a downloader module, a crawler module and an Item pipeline module;

the Scrapy engine module is responsible for transmitting data signals among the different modules;

the scheduler module is used for storing the requests sent by the engine;

the downloader module is used for downloading the requests sent by the engine and returning the results to the engine;

the Item pipeline module is used for processing the data transmitted by the engine.

7. The system of claim 6, further comprising a middleware module for customizing download extensions and requests.

8. The unstructured website data collection system of claim 1, wherein the Scrapy engine module is configured to control the data flow in the framework: data is passed between the downloader module and the crawler module by way of the downloader middleware and the crawler middleware, and the crawled information is stored in the Item pipeline module.

9. The system of claim 1, wherein the middleware module comprises downloader middleware and crawler middleware; the downloader middleware is a framework of hooks into the request/response processing of the acquisition framework and is a system for globally modifying the framework's requests and responses; the crawler middleware is a framework of hooks into the crawler processing mechanism of the acquisition framework and is used for processing the Responses sent to Spiders and the Items and Requests generated by Spiders.

10. The system of claim 1, wherein the crawler module consists of classes written by a user to parse Responses and extract Items or additional URLs to follow.

11. The system of claim 1, wherein the Item pipeline module is configured to process items extracted by spiders.

12. The system of claim 1, wherein the entire data collection framework is written based on the event-driven network framework Twisted.