CN111274466A - Non-structural data acquisition system and method for overseas server - Google Patents


Info

Publication number
CN111274466A
Authority
CN
China
Prior art keywords
module
engine
middleware
data
request
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911310422.8A
Other languages
Chinese (zh)
Inventor
陈泽勇
张治同
姚松
张莉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Dippmann Information Technology Co Ltd
Original Assignee
Chengdu Dippmann Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2019-12-18
Filing date
2019-12-18
Publication date
2020-06-12
Application filed by Chengdu Dippmann Information Technology Co Ltd
Priority to CN201911310422.8A
Publication of CN111274466A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a system and a method for acquiring unstructured website data. The method comprises the following steps: creating a project, defining the items to be extracted, writing a Spider for the website to be collected that extracts the items, and writing an Item Pipeline to store the extracted items. The system comprises a Scrapy engine module, a scheduler module, a downloader module, a crawler module, an Item pipeline module and a middleware module. Using a customized data acquisition template, the Scrapy engine module drives and coordinates the scheduler module, the downloader module, the crawler module, the Item pipeline module and the middleware module, so that the invention can support various network protocols and directionally acquire data of unstructured websites from the Internet.

Description

Non-structural data acquisition system and method for overseas server
Technical Field
The invention relates to the field of data processing, in particular to a system and a method for acquiring unstructured website data.
Background
With the development of the Internet and the big data industry, timely and effective data acquisition is very important. However, as data volumes keep growing, they often contain a large amount of small unstructured data, and scheduling the acquisition tasks consumes a large amount of resources, which reduces acquisition efficiency.
In order to solve the above problems, a system capable of stable, directed data acquisition from the Internet is required.
Disclosure of Invention
The invention aims to provide a system and a method for acquiring unstructured website data that address the above problems.
A method for collecting data of an unstructured website comprises the following steps:
the engine opens a website, finds the Spider that handles the website, and requests the first URL to be crawled from that Spider; the engine obtains the first URL to be crawled from the Spider and schedules it in the scheduler as a Request; the engine requests the next URL to be crawled from the scheduler; the scheduler returns the next URL to be crawled to the engine, and the engine forwards it to the downloader through the downloader middleware; once the page is downloaded, the downloader generates a Response for the page and sends it to the engine through the downloader middleware; the engine receives the Response from the downloader and sends it to the Spider for processing through the Spider middleware; the Spider processes the Response and returns the crawled Items and new Requests to the engine; the engine sends the crawled Items to the Item pipeline and the new Requests to the scheduler; the above steps repeat until there are no more Requests in the scheduler, whereupon the engine closes the website.
Further, the Spider class defines the data collection actions and the methods for extracting structured data.
Further, the data structure adopted by the scheduler module is a queue.
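As an illustration only, a queue-backed scheduler of this kind can be sketched in Python using the standard library. The class names Request and Scheduler and the method names enqueue and next_request are assumptions of this sketch and are not taken from the specification.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Request:
    """Hypothetical stand-in for a crawl request: just the URL to fetch."""
    url: str


class Scheduler:
    """Stores the requests sent by the engine and hands them back first-in, first-out."""

    def __init__(self) -> None:
        self._queue = deque()  # the queue data structure named in the text

    def enqueue(self, request: Request) -> None:
        # Called by the engine when a Spider produces a new request.
        self._queue.append(request)

    def next_request(self) -> Optional[Request]:
        # Called by the engine when it asks for the next URL to crawl;
        # returns None once the queue is empty.
        return self._queue.popleft() if self._queue else None

    def __len__(self) -> int:
        return len(self._queue)
```

A deque is used rather than a list so that removal from the front of the queue stays constant-time as the number of pending requests grows.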
Further, the item class object is a container for storing the crawled data.
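A minimal sketch of such a container, assuming a hypothetical ArticleItem with url, title and tags fields (the specification does not name any fields), could look as follows.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class ArticleItem:
    """Hypothetical item: a typed container for the fields crawled from one page."""
    url: str
    title: str = ""
    tags: list = field(default_factory=list)


# The Spider fills the container while parsing a page ...
item = ArticleItem(url="https://example.com/post/1", title="Hello")
item.tags.append("news")

# ... and later stages can treat it as a plain record.
record = asdict(item)
```

Declaring the fields up front means a misspelled field name fails when the item is constructed instead of silently creating a new key.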
A non-structural website data acquisition system comprises a Scrapy engine module, a scheduler module, a downloader module, a crawler module and an Item pipeline module; the Scrapy engine module is responsible for transmitting data signals among the different modules; the scheduler module is used for storing the requests sent by the engine; the downloader module is used for downloading the requests sent by the engine and returning the results to the engine; the Item pipeline module is used for processing the data passed on by the engine. The system further comprises a middleware module, which is used for customizing download extensions and requests.
Further, the Scrapy engine module is configured to control the data flow in the framework: data is passed among the downloader module, the crawler module, the downloader middleware and the crawler middleware, and the crawled information is stored in the Item pipeline module.
Further, the middleware module comprises downloader middleware and crawler middleware; the downloader middleware is a framework of hooks into the request/response processing of the acquisition framework and is used for globally modifying the framework's requests and responses; the crawler middleware is a framework of hooks into the crawler processing mechanism of the acquisition framework and is used for processing the responses sent to Spiders and the Items and Requests generated by Spiders.
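The hook mechanism of the downloader middleware can be illustrated with the following Python sketch. It is a simplified stand-in under stated assumptions: the method names process_request and process_response, the two example middleware classes and the fake downloader are all hypothetical and do not reproduce any framework's actual API.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    url: str
    headers: dict = field(default_factory=dict)


@dataclass
class Response:
    url: str
    status: int
    body: str


class UserAgentMiddleware:
    """Modifies every outgoing request globally by adding a default header."""

    def process_request(self, request):
        request.headers.setdefault("User-Agent", "example-collector/0.1")
        return request

    def process_response(self, response):
        return response


class StripBodyMiddleware:
    """Modifies every incoming response globally by trimming whitespace."""

    def process_request(self, request):
        return request

    def process_response(self, response):
        response.body = response.body.strip()
        return response


def download_through(middlewares, request, downloader):
    """Pass the request down the hook chain, download, then pass the response back up."""
    for mw in middlewares:
        request = mw.process_request(request)
    response = downloader(request)
    for mw in reversed(middlewares):
        response = mw.process_response(response)
    return response


def fake_downloader(request):
    # Stand-in for a network fetch: echoes the header so the effect is visible.
    return Response(request.url, 200, f"  UA={request.headers.get('User-Agent')}  ")
```

Because each hook receives and returns the request or response, a behaviour such as header injection can be added or removed for all requests without touching the downloader or any Spider.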
Further, the crawler module consists of classes written by the user to parse responses and extract Items or additional URLs to follow.
Further, the Item pipeline module is used for processing items extracted by the spider.
Further, the whole data acquisition framework is written based on the event-driven network framework Twisted.
The invention collects data from a non-structural data source through a customized data collecting template, supports various network protocols and can directionally collect data from the Internet.
Drawings
FIG. 1 is a system architecture diagram of the present invention.
FIG. 2 is a flow chart of the method of the present invention.
Detailed Description
In order to more clearly understand the technical features, objects, and effects of the present invention, embodiments of the present invention will now be described with reference to the accompanying drawings.
In this embodiment, a method for acquiring unstructured website data includes:
the engine opens a website, finds the Spider that handles the website, and requests the first URL to be crawled from that Spider; the engine obtains the first URL to be crawled from the Spider and schedules it in the scheduler as a Request; the engine requests the next URL to be crawled from the scheduler; the scheduler returns the next URL to be crawled to the engine, and the engine forwards it to the downloader through the downloader middleware; once the page is downloaded, the downloader generates a Response for the page and sends it to the engine through the downloader middleware; the engine receives the Response from the downloader and sends it to the Spider for processing through the Spider middleware; the Spider processes the Response and returns the crawled Items and new Requests to the engine; the engine sends the crawled Items to the Item pipeline and the new Requests to the scheduler; the above steps repeat until there are no more Requests in the scheduler, whereupon the engine closes the website.
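The loop described above can be sketched end to end in Python as follows. This is a minimal single-threaded illustration, not the claimed implementation: the in-memory PAGES table replaces real downloads, middleware is omitted, and the set of already seen URLs is an addition of this sketch to keep the example from crawling a page twice.

```python
from collections import deque

# Hypothetical in-memory "website": URL -> (page title, outgoing links).
PAGES = {
    "site://home": ("Home", ["site://a", "site://b"]),
    "site://a": ("Page A", ["site://b"]),
    "site://b": ("Page B", []),
}


class Request:
    def __init__(self, url):
        self.url = url


class Response:
    def __init__(self, url, title, links):
        self.url, self.title, self.links = url, title, links


class Spider:
    start_urls = ["site://home"]

    def parse(self, response):
        # Return one Item (a plain dict here) and one new Request per link.
        yield {"url": response.url, "title": response.title}
        for link in response.links:
            yield Request(link)


def downloader(request):
    # Stand-in for the downloader: builds a Response from the lookup table.
    title, links = PAGES[request.url]
    return Response(request.url, title, links)


def run_engine(spider, pipeline):
    scheduler = deque()  # the scheduler's queue of pending Requests
    seen = set()         # duplicate filter; an assumption of this sketch
    for url in spider.start_urls:          # first URLs come from the Spider
        scheduler.append(Request(url))
        seen.add(url)
    while scheduler:                       # repeat until no Requests remain
        request = scheduler.popleft()      # engine asks for the next URL
        response = downloader(request)     # downloader returns a Response
        for result in spider.parse(response):
            if isinstance(result, Request):
                if result.url not in seen:  # new Requests go to the scheduler
                    seen.add(result.url)
                    scheduler.append(result)
            else:
                pipeline.append(result)     # Items go to the Item pipeline
    return pipeline
```

Calling run_engine(Spider(), []) visits the three pages in breadth-first order and returns one Item per page.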
Further, the Spider class defines the data collection actions and the methods for extracting structured data.
Further, the data structure adopted by the scheduler module is a queue.
Further, the item class object is a container for storing the crawled data.
A non-structural website data acquisition system comprises a Scrapy engine module, a scheduler module, a downloader module, a crawler module and an Item pipeline module; the Scrapy engine module is responsible for transmitting data signals among the different modules; the scheduler module is used for storing the requests sent by the engine; the downloader module is used for downloading the requests sent by the engine and returning the results to the engine; the Item pipeline module is used for processing the data passed on by the engine. The system further comprises a middleware module, which is used for customizing download extensions and requests.
Further, the Scrapy engine module is configured to control the data flow in the framework: data is passed among the downloader module, the crawler module, the downloader middleware and the crawler middleware, and the crawled information is stored in the Item pipeline module.
Further, the middleware module comprises downloader middleware and crawler middleware; the downloader middleware is a framework of hooks into the request/response processing of the acquisition framework and is used for globally modifying the framework's requests and responses; the crawler middleware is a framework of hooks into the crawler processing mechanism of the acquisition framework and is used for processing the responses sent to Spiders and the Items and Requests generated by Spiders.
Further, the crawler module consists of classes written by the user to parse responses and extract Items or additional URLs to follow.
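By way of example, a user-written parsing routine of this kind might look like the Python sketch below, which extracts one item (the page title) and the follow-up URLs from an HTML response. The function name parse, the helper class and the choice of the title as the extracted field are assumptions of this sketch; only the standard library is used.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _LinkAndTitleParser(HTMLParser):
    """Collects the <title> text and every <a href> found in a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse(url, html):
    """Return (item, follow_urls) for one downloaded page."""
    parser = _LinkAndTitleParser()
    parser.feed(html)
    item = {"url": url, "title": parser.title.strip()}
    # Resolve relative links against the page URL so they can be requested.
    follow = [urljoin(url, href) for href in parser.links]
    return item, follow
```

Returning the item and the follow-up URLs separately mirrors the two kinds of output that the engine routes to the Item pipeline and to the scheduler respectively.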
Further, the Item pipeline module is used for processing items extracted by the spider.
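One possible shape for such a pipeline is a chain of small stages, each of which may clean, reject or store an item. The following Python sketch assumes hypothetical stage classes, a hypothetical process_item method name and a JSON Lines output; none of these details are specified by the invention.

```python
import io
import json


class DropItem(Exception):
    """Raised by a stage to discard an item."""


class RequireTitle:
    def process_item(self, item):
        if not item.get("title"):
            raise DropItem("missing title")
        return item


class NormalizeTitle:
    def process_item(self, item):
        # Collapse runs of whitespace inside the title.
        return {**item, "title": " ".join(item["title"].split())}


class JsonLinesWriter:
    """Stores each surviving item as one JSON object per line."""

    def __init__(self, stream):
        self.stream = stream

    def process_item(self, item):
        self.stream.write(json.dumps(item, ensure_ascii=False, sort_keys=True) + "\n")
        return item


def run_pipeline(stages, items):
    """Send every item through all stages in order; return how many were stored."""
    stored = 0
    for item in items:
        try:
            for stage in stages:
                item = stage.process_item(item)
        except DropItem:
            continue
        stored += 1
    return stored
```

For instance, passing an io.StringIO buffer to JsonLinesWriter lets the stored output be inspected in memory before a file or database writer is substituted.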
Further, the whole data acquisition framework is written based on the event-driven network framework Twisted.
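The benefit of an event-driven design is that many downloads can be in flight at once without one thread per request. The sketch below illustrates that idea with Python's standard asyncio module, which is swapped in here for Twisted purely so that the example runs without third-party packages; the function names, the simulated delays and the concurrency limit are all assumptions.

```python
import asyncio


async def fake_download(url, delay):
    # Stand-in for network I/O: waiting yields control to the event loop.
    await asyncio.sleep(delay)
    return url, f"<html>{url}</html>"


async def crawl(urls_with_delays, max_concurrent=2):
    """Download all URLs concurrently, at most max_concurrent at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)
    finished = []  # records completion order

    async def worker(url, delay):
        async with semaphore:
            result = await fake_download(url, delay)
            finished.append(url)
            return result

    pages = await asyncio.gather(*(worker(u, d) for u, d in urls_with_delays))
    return dict(pages), finished
```

Running asyncio.run(crawl([("site://slow", 0.2), ("site://fast", 0.01)])) returns both pages, with the fast page completing first even though it was requested second.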
The foregoing shows and describes the general principles and features of the present invention, together with the advantages thereof. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are described in the specification and illustrated only to illustrate the principle of the present invention, but that various changes and modifications may be made therein without departing from the spirit and scope of the present invention, which should fall within the scope of the claimed invention. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims (12)

1. A data acquisition method for an unstructured website, characterized by comprising the following steps:
S1: an engine opens a website, finds the Spider that handles the website, and requests the first URL to be crawled from the Spider;
S2: the engine obtains the first URL to be crawled from the Spider and schedules it in the scheduler as a Request;
S3: the engine requests the next URL to be crawled from the scheduler;
S4: the scheduler returns the next URL to be crawled to the engine, and the engine forwards it to the downloader through the downloader middleware;
S5: once the page is downloaded, the downloader generates a Response for the page and sends it to the engine through the downloader middleware;
S6: the engine receives the Response from the downloader and sends it to the Spider for processing through the Spider middleware;
S7: the Spider processes the Response and returns the crawled Items and new Requests to the engine;
S8: the engine sends the crawled Items to the Item pipeline and the new Requests to the scheduler;
S9: the above steps are repeated until there are no more Requests in the scheduler, whereupon the engine closes the website.
2. The data acquisition method for an unstructured website according to claim 1, wherein step S1 comprises the following sub-steps:
S101: initializing a Request with an initial URL and setting a callback function;
S102: when the Request has been downloaded and returns, generating a Response and passing it as a parameter to the callback function;
S103: calling start_requests() to obtain the initial Requests of the Spider;
S104: start_requests() reads the URLs in start_urls and generates Requests with parse as the callback function.
3. The method for collecting data of unstructured website according to claim 1, wherein the spider class in step S2 comprises data collection action and method for extracting structured data.
4. The method as claimed in claim 1, wherein the data structure adopted by the scheduler module in step S8 is a queue.
5. The method of claim 1, wherein the Item object in step S8 is a container for storing crawled data.
6. A non-structural website data acquisition system, characterized by comprising a Scrapy engine module, a downloader module, a crawler module and an Item pipeline module;
the Scrapy engine module is responsible for transmitting data signals among the different modules;
the scheduler module is used for storing the requests sent by the engine;
the downloader module is used for downloading the requests sent by the engine and returning the results to the engine;
the Item pipeline module is used for processing the data passed on by the engine.
7. The system of claim 6, further comprising a middleware module for customizing download extensions and requests.
8. The unstructured website data collection system of claim 1, wherein the Scrapy engine module is configured to control the data flow in the framework: data is passed among the downloader module, the crawler module, the downloader middleware and the crawler middleware, and the crawled information is stored in the Item pipeline module.
9. The system of claim 1, wherein the middleware module comprises downloader middleware and crawler middleware; the downloader middleware is a framework of hooks into the request/response processing of the acquisition framework and is used for globally modifying the framework's requests and responses; the crawler middleware is a framework of hooks into the crawler processing mechanism of the acquisition framework and is used for processing the responses sent to Spiders and the Items and Requests generated by Spiders.
10. The system of claim 1, wherein the crawler module consists of classes written by a user to parse responses and extract Items or additional URLs to follow.
11. The system of claim 1, wherein the Item pipeline module is configured to process the Items extracted by Spiders.
12. The system of claim 1, wherein the entire data collection framework is written based on the event-driven network framework Twisted.
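For illustration, the sub-steps S101 to S104 recited in claim 2 can be sketched in Python as follows. The names start_requests, start_urls and parse come from the claim; the Request and Response classes, the example URLs and the fake download function are hypothetical stand-ins introduced only for this sketch.

```python
class Request:
    def __init__(self, url, callback):
        self.url = url
        self.callback = callback  # function to call with the Response


class Response:
    def __init__(self, url, body):
        self.url = url
        self.body = body


class ExampleSpider:
    start_urls = ["https://example.com/a", "https://example.com/b"]

    def start_requests(self):
        # Read each URL in start_urls and build a Request whose
        # callback is parse (sub-steps S101 and S104).
        for url in self.start_urls:
            yield Request(url, callback=self.parse)

    def parse(self, response):
        # Receives the Response as its parameter (sub-step S102).
        return {"url": response.url, "length": len(response.body)}


def fake_download(request):
    # Stand-in for the downloader; the body is dummy text, not a real page.
    return Response(request.url, body="x" * len(request.url))


spider = ExampleSpider()
results = []
for request in spider.start_requests():          # sub-step S103
    response = fake_download(request)
    results.append(request.callback(response))   # sub-step S102
```

Binding the callback to each Request, rather than hard-coding it in the engine, is what allows different pages of the same website to be handled by different parsing methods.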

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911310422.8A CN111274466A (en) 2019-12-18 2019-12-18 Non-structural data acquisition system and method for overseas server

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911310422.8A CN111274466A (en) 2019-12-18 2019-12-18 Non-structural data acquisition system and method for overseas server

Publications (1)

Publication Number Publication Date
CN111274466A (en) 2020-06-12

Family

ID=71111950

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911310422.8A Pending CN111274466A (en) 2019-12-18 2019-12-18 Non-structural data acquisition system and method for overseas server

Country Status (1)

Country Link
CN (1) CN111274466A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103886033A (en) * 2014-03-05 2014-06-25 无锡香象生物科技有限公司 Intelligent vertical searching device and method for safety industry chain
CN107590188A (en) * 2017-08-08 2018-01-16 杭州灵皓科技有限公司 A crawler crawling method and management system for automated vertical subdivision fields
CN107944055A (en) * 2017-12-22 2018-04-20 成都优易数据有限公司 A crawler method for solving Web certificate verification
CN108011931A (en) * 2017-11-22 2018-05-08 用友金融信息技术股份有限公司 Web data acquisition method and web data acquisition system
CN110147476A (en) * 2019-04-12 2019-08-20 深圳壹账通智能科技有限公司 Data crawling method, terminal device and computer readable storage medium based on Scrapy

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
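Yao Liang (姚良): "Python3爬虫实战:数据清洗、数据分析与可视化" (Python 3 Crawler in Practice: Data Cleaning, Data Analysis and Visualization), China Railway Publishing House Co., Ltd. (中国铁道出版社有限公司), pages: 3 *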

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111881337A (en) * 2020-08-06 2020-11-03 成都信息工程大学 Data acquisition method and system based on Scapy framework and storage medium
CN111881337B (en) * 2020-08-06 2021-06-01 成都信息工程大学 Data acquisition method and system based on Scapy framework and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200612