CN111274466A - Non-structural data acquisition system and method for overseas server - Google Patents
- Publication number
- CN111274466A
- Authority
- CN
- China
- Prior art keywords
- module
- engine
- middleware
- data
- request
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The invention discloses a system and a method for acquiring unstructured website data. The method comprises the following steps: creating a project, defining the items to be extracted, writing a spider for the website to be collected, extracting the items, and writing an Item Pipeline to store the extracted items. The system comprises a Scrapy engine module, a scheduler module, a downloader module, a crawler module, an Item pipeline module and a middleware module. Through a customized data acquisition template, the Scrapy engine module drives and coordinates the scheduler module, the downloader module, the crawler module, the Item pipeline module and the middleware module, so that the invention can support various network protocols and directionally acquire unstructured website data from the Internet.
Description
Technical Field
The invention relates to the field of data processing, in particular to a system and a method for acquiring unstructured website data.
Background
With the development of the Internet and the big data industry, timely and effective data acquisition has become very important. However, as data volumes keep growing, mass data often contains a large amount of small unstructured data, and scheduling the acquisition tasks occupies a large amount of resources, which reduces acquisition efficiency.
To solve the above problems, a system capable of stable, targeted data acquisition on the Internet is required.
Disclosure of Invention
In view of the above problems, the invention aims to provide a system and a method for acquiring unstructured website data.
A method for collecting data of an unstructured website comprises the following steps:
S1: the engine opens a website, finds the Spider that handles the website, and requests the first URL to be crawled from the Spider;
S2: the engine obtains the first URL to be crawled from the Spider and schedules it in the scheduler as a Request;
S3: the engine requests the next URL to be crawled from the scheduler;
S4: the scheduler returns the next URL to be crawled to the engine, and the engine forwards the URL to the downloader through the downloader middleware;
S5: once the page is downloaded, the downloader generates a Response for the page and sends the Response to the engine through the downloader middleware;
S6: the engine receives the Response from the downloader and sends it to the Spider for processing through the Spider middleware;
S7: the Spider processes the Response and returns the crawled Items and new Requests to the engine;
S8: the engine sends the crawled Items to the Item pipeline and sends the new Requests to the scheduler;
S9: the process repeats from step S3 until there are no more Requests in the scheduler, at which point the engine closes the website.
Further, the Spider class defines the data collection actions and the methods for extracting structured data.
Further, the data structure adopted by the scheduler module is a queue.
Further, an object of the Item class is a container for storing the crawled data.
An unstructured website data acquisition system comprises a Scrapy engine module, a scheduler module, a downloader module, a crawler module and an Item pipeline module; the Scrapy engine module is responsible for transmitting data and signals among the different modules; the scheduler module is used for storing the requests sent by the engine; the downloader module is used for downloading the content of the requests sent by the engine and returning the results to the engine; the Item pipeline module is used for processing the data passed on by the engine. The system further comprises a middleware module, which is used for customizing and extending downloads and requests.
Further, the Scrapy engine module is configured to control the data flow in the framework: data is transferred among the downloader middleware, the crawler middleware, the crawler module and the downloader module, and the crawled information is stored through the Item pipeline module.
Further, the middleware module comprises downloader middleware and crawler middleware; the downloader middleware is a framework of hooks into the request/response processing of the acquisition framework and serves as a system for globally modifying the framework's requests and responses; the crawler middleware is a framework of hooks into the crawler processing mechanism of the acquisition framework and is used for processing the responses sent to spiders and the items and requests generated by spiders.
Further, the crawler module consists of classes written by the user to parse responses and to extract items or additional URLs to follow.
Further, the Item pipeline module is used for processing items extracted by the spider.
Further, the whole data acquisition framework is written based on the event-driven network framework Twisted.
The invention collects data from unstructured data sources through a customized data collection template, supports various network protocols, and can directionally collect data from the Internet.
Drawings
FIG. 1 is a system architecture diagram of the present invention.
FIG. 2 is a flow chart of the method of the present invention.
Detailed Description
In order to more clearly understand the technical features, objects, and effects of the present invention, embodiments of the present invention will now be described with reference to the accompanying drawings.
In this embodiment, a method for acquiring unstructured website data includes:
S1: the engine opens a website, finds the Spider that handles the website, and requests the first URL to be crawled from the Spider;
S2: the engine obtains the first URL to be crawled from the Spider and schedules it in the scheduler as a Request;
S3: the engine requests the next URL to be crawled from the scheduler;
S4: the scheduler returns the next URL to be crawled to the engine, and the engine forwards the URL to the downloader through the downloader middleware;
S5: once the page is downloaded, the downloader generates a Response for the page and sends the Response to the engine through the downloader middleware;
S6: the engine receives the Response from the downloader and sends it to the Spider for processing through the Spider middleware;
S7: the Spider processes the Response and returns the crawled Items and new Requests to the engine;
S8: the engine sends the crawled Items to the Item pipeline and sends the new Requests to the scheduler;
S9: the process repeats from step S3 until there are no more Requests in the scheduler, at which point the engine closes the website.
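The crawl loop described above can be illustrated with a toy model. The following is a minimal sketch under stated assumptions, not the implementation of the invention: the in-memory `FAKE_SITE` dictionary, the `example.test` URLs, the `run_engine` function and the duplicate-URL filter are all hypothetical names and choices introduced for illustration, and the downloader is replaced by a dictionary lookup so that the example runs without network access. Middleware is left out to keep the control flow visible.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Request:
    url: str
    callback: Callable  # called with the Response once the page is downloaded


@dataclass
class Response:
    url: str
    body: str


# Hypothetical in-memory "website": URL -> (page text, outgoing links).
FAKE_SITE = {
    "http://example.test/": ("home", ["http://example.test/a", "http://example.test/b"]),
    "http://example.test/a": ("page a", ["http://example.test/b"]),
    "http://example.test/b": ("page b", []),
}


class Spider:
    start_urls = ["http://example.test/"]

    def start_requests(self) -> Iterable[Request]:
        # Read the URLs in start_urls and generate Requests with parse as callback.
        for url in self.start_urls:
            yield Request(url, self.parse)

    def parse(self, response: Response):
        # Return one crawled item, then new Requests for the links on the page.
        yield {"url": response.url, "text": response.body}
        for link in FAKE_SITE[response.url][1]:
            yield Request(link, self.parse)


def download(request: Request) -> Response:
    # Stand-in for the downloader: look the page up instead of using the network.
    return Response(request.url, FAKE_SITE[request.url][0])


def run_engine(spider: Spider) -> list:
    scheduler = deque()    # FIFO queue of pending Requests
    seen = set()           # assumption: skip URLs that were already scheduled
    pipeline_output = []   # stand-in for the Item pipeline

    def schedule(request: Request) -> None:
        if request.url not in seen:
            seen.add(request.url)
            scheduler.append(request)

    for request in spider.start_requests():   # first URLs come from the Spider
        schedule(request)
    while scheduler:                           # stop when no Requests remain
        request = scheduler.popleft()          # next URL from the scheduler
        response = download(request)           # downloader produces a Response
        for result in request.callback(response):  # Spider processes the Response
            if isinstance(result, Request):
                schedule(result)               # new Request goes to the scheduler
            else:
                pipeline_output.append(result)  # Item goes to the Item pipeline
    return pipeline_output


items = run_engine(Spider())
print([item["url"] for item in items])
```

Running the sketch prints the three page URLs in the order in which the FIFO scheduler released them: the start page first, then `/a`, then `/b`.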
Further, the Spider class defines the data collection actions and the methods for extracting structured data.
Further, the data structure adopted by the scheduler module is a queue.
Further, an object of the Item class is a container for storing the crawled data.
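As an illustration of the two data structures just mentioned, the sketch below models the scheduler as a FIFO queue and the item as a plain container with declared fields. It is only a sketch: the class names `Scheduler` and `ArticleItem`, the field names, and the convention of returning `None` from an empty queue are assumptions made for this example.

```python
from collections import deque
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class ArticleItem:
    """Container for one crawled record; the field names are illustrative."""
    url: str
    title: str
    body: str = ""


class Scheduler:
    """Stores requests handed over by the engine and returns them in FIFO order."""

    def __init__(self) -> None:
        self._queue = deque()

    def enqueue(self, url: str) -> None:
        self._queue.append(url)

    def next_request(self) -> Optional[str]:
        # Return None when the queue is empty so the engine knows it can stop.
        return self._queue.popleft() if self._queue else None

    def __len__(self) -> int:
        return len(self._queue)


scheduler = Scheduler()
for url in ("http://example.test/1", "http://example.test/2"):
    scheduler.enqueue(url)

first = scheduler.next_request()
item = ArticleItem(url=first, title="First page")
print(asdict(item))
```

A queue keeps the crawl order predictable (first scheduled, first downloaded); a priority queue or a stack would change the traversal order without changing the rest of the flow.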
An unstructured website data acquisition system comprises a Scrapy engine module, a scheduler module, a downloader module, a crawler module and an Item pipeline module; the Scrapy engine module is responsible for transmitting data and signals among the different modules; the scheduler module is used for storing the requests sent by the engine; the downloader module is used for downloading the content of the requests sent by the engine and returning the results to the engine; the Item pipeline module is used for processing the data passed on by the engine. The system further comprises a middleware module, which is used for customizing and extending downloads and requests.
Further, the Scrapy engine module is configured to control the data flow in the framework: data is transferred among the downloader middleware, the crawler middleware, the crawler module and the downloader module, and the crawled information is stored through the Item pipeline module.
Further, the middleware module comprises downloader middleware and crawler middleware; the downloader middleware is a framework of hooks into the request/response processing of the acquisition framework and serves as a system for globally modifying the framework's requests and responses; the crawler middleware is a framework of hooks into the crawler processing mechanism of the acquisition framework and is used for processing the responses sent to spiders and the items and requests generated by spiders.
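The hook idea behind the downloader middleware can be sketched as a chain of objects that each see every outgoing request and every incoming response. This is a simplified model under stated assumptions: the two middleware classes, the `fetch` helper, the `fake_download` stand-in and the method signatures are hypothetical and chosen for brevity, not taken from the patent.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Request:
    url: str
    headers: Dict[str, str] = field(default_factory=dict)


@dataclass
class Response:
    url: str
    status: int
    body: str


class UserAgentMiddleware:
    """Hypothetical middleware that adds a header to every outgoing request."""

    def process_request(self, request: Request) -> Request:
        request.headers.setdefault("User-Agent", "demo-crawler/0.1")
        return request

    def process_response(self, response: Response) -> Response:
        return response


class StripWhitespaceMiddleware:
    """Hypothetical middleware that normalizes every response body."""

    def process_request(self, request: Request) -> Request:
        return request

    def process_response(self, response: Response) -> Response:
        response.body = response.body.strip()
        return response


def fake_download(request: Request) -> Response:
    # Stand-in downloader: echoes the User-Agent it received, padded with spaces.
    agent = request.headers.get("User-Agent", "")
    return Response(request.url, 200, "  agent=" + agent + "  ")


def fetch(request: Request, middlewares: List) -> Response:
    for mw in middlewares:                # engine -> downloader direction
        request = mw.process_request(request)
    response = fake_download(request)
    for mw in reversed(middlewares):      # downloader -> engine direction
        response = mw.process_response(response)
    return response


response = fetch(
    Request("http://example.test/"),
    [UserAgentMiddleware(), StripWhitespaceMiddleware()],
)
print(response.body)
```

Because every request and response passes through the same chain, a change made in one middleware (a header, a proxy setting, a retry rule) applies globally without touching any spider.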
Further, the crawler module consists of classes written by the user to parse responses and to extract items or additional URLs to follow.
Further, the Item pipeline module is used for processing items extracted by the spider.
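One way to picture the Item pipeline is as a sequence of stages that each receive an item, optionally transform it, and pass it on or drop it. The sketch below is illustrative only: the stage classes, the `run_pipeline` helper and the convention that returning `None` drops an item are assumptions of this example.

```python
import json
from typing import Dict, Iterable, List, Optional


class DropEmptyTitle:
    def process_item(self, item: Dict) -> Optional[Dict]:
        # Returning None tells the runner to drop the item.
        return item if item.get("title", "").strip() else None


class NormalizeTitle:
    def process_item(self, item: Dict) -> Optional[Dict]:
        # Collapse runs of whitespace in the title.
        return {**item, "title": " ".join(item["title"].split())}


class JsonLinesWriter:
    """Collects serialized items in memory; a real stage might write to a file."""

    def __init__(self) -> None:
        self.lines: List[str] = []

    def process_item(self, item: Dict) -> Optional[Dict]:
        self.lines.append(json.dumps(item, sort_keys=True))
        return item


def run_pipeline(items: Iterable[Dict], stages: List) -> List[Dict]:
    kept = []
    for item in items:
        for stage in stages:
            item = stage.process_item(item)
            if item is None:
                break          # item was dropped by this stage
        else:
            kept.append(item)  # item passed every stage
    return kept


writer = JsonLinesWriter()
crawled = [
    {"url": "u1", "title": "  Hello   world "},
    {"url": "u2", "title": "   "},
]
kept = run_pipeline(crawled, [DropEmptyTitle(), NormalizeTitle(), writer])
print(writer.lines)
```

Here the second item is dropped by the first stage, so only the cleaned first item reaches the writer.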
Further, the whole data acquisition framework is written based on the event-driven network framework Twisted.
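The benefit of an event-driven design is that the program does not block while one page is downloading. Twisted itself is not used in the sketch below; Python's standard `asyncio` event loop is swapped in as a stand-in so that the example needs no third-party package, and `asyncio.sleep` simulates network latency. The function names and URLs are hypothetical.

```python
import asyncio
from typing import List


async def download(url: str, delay: float = 0.05) -> str:
    # asyncio.sleep stands in for waiting on the network; while one download
    # waits, the event loop is free to advance the others.
    await asyncio.sleep(delay)
    return f"<html>{url}</html>"


async def crawl(urls: List[str]) -> List[str]:
    # Start all downloads concurrently; gather returns results in input order.
    return await asyncio.gather(*(download(url) for url in urls))


urls = [f"http://example.test/{n}" for n in range(3)]
pages = asyncio.run(crawl(urls))
print(pages[0])
```

The three simulated downloads overlap instead of running one after another, which is the property that lets a single-threaded crawler keep many requests in flight.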
The foregoing shows and describes the general principles and features of the present invention, together with the advantages thereof. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are described in the specification and illustrated only to illustrate the principle of the present invention, but that various changes and modifications may be made therein without departing from the spirit and scope of the present invention, which should fall within the scope of the claimed invention. The scope of the invention is defined by the appended claims and equivalents thereof.
Claims (12)
1. A data acquisition method for an unstructured website is characterized by comprising the following steps:
S1: the engine opens a website, finds the Spider that handles the website, and requests the first URL to be crawled from the Spider;
S2: the engine obtains the first URL to be crawled from the Spider and schedules it in the scheduler as a Request;
S3: the engine requests the next URL to be crawled from the scheduler;
S4: the scheduler returns the next URL to be crawled to the engine, and the engine forwards the URL to the downloader through the downloader middleware;
S5: once the page is downloaded, the downloader generates a Response for the page and sends the Response to the engine through the downloader middleware;
S6: the engine receives the Response from the downloader and sends it to the Spider for processing through the Spider middleware;
S7: the Spider processes the Response and returns the crawled Item and a new Request to the engine;
S8: the engine sends the crawled Item to the Item pipeline and sends the new Request to the scheduler;
S9: repeating until there are no more Requests in the scheduler, at which point the engine closes the website.
2. The method for collecting data of unstructured website as defined in claim 1, wherein said step S1 comprises the following sub-steps:
S101: initializing a Request with an initial URL and setting a callback function;
S102: when the Request has been downloaded and returns, generating a Response and passing the Response as a parameter to the callback function;
S103: calling start_requests() to obtain the initial Requests in the Spider;
S104: start_requests() reads the URLs in start_urls and generates Requests with parse as the callback function.
3. The method for collecting data of an unstructured website according to claim 1, wherein the Spider class in step S2 comprises data collection actions and methods for extracting structured data.
4. The method as claimed in claim 1, wherein the data structure adopted by the scheduler module in step S8 is a queue.
5. The method of claim 1, wherein the Item class object in step S8 is a container for storing crawled data.
6. An unstructured website data acquisition system, characterized by comprising a Scrapy engine module, a downloader module, a crawler module and an Item pipeline module;
the Scrapy engine module is responsible for transmitting data and signals among the different modules;
the scheduler module is used for storing the requests sent by the engine;
the downloader module is used for downloading the content of the requests sent by the engine and returning the results to the engine;
the Item pipeline module is used for processing the data passed on by the engine.
7. The system of claim 6, further comprising a middleware module for customizing and extending downloads and requests.
8. The unstructured website data collection system of claim 1, wherein the Scrapy engine module is configured to control the data flow in the framework: data is transferred among the downloader middleware, the crawler middleware, the crawler module and the downloader module, and the crawled information is stored through the Item pipeline module.
9. The system of claim 1, wherein the middleware module comprises downloader middleware and crawler middleware; the downloader middleware is a framework of hooks into the request/response processing of the acquisition framework and serves as a system for globally modifying the framework's requests and responses; the crawler middleware is a framework of hooks into the crawler processing mechanism of the acquisition framework and is used for processing the responses sent to spiders and the items and requests generated by spiders.
10. The system of claim 1, wherein the crawler module consists of classes written by a user to parse responses and to extract items or additional URLs to follow.
11. The system of claim 1, wherein the Item pipeline module is configured to process items extracted by spiders.
12. The system of claim 1, wherein the entire data collection framework is written based on the event-driven network framework Twisted.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911310422.8A CN111274466A (en) | 2019-12-18 | 2019-12-18 | Non-structural data acquisition system and method for overseas server |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911310422.8A CN111274466A (en) | 2019-12-18 | 2019-12-18 | Non-structural data acquisition system and method for overseas server |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111274466A (en) | 2020-06-12 |
Family
ID=71111950
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911310422.8A Pending CN111274466A (en) | 2019-12-18 | 2019-12-18 | Non-structural data acquisition system and method for overseas server |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111274466A (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103886033A (en) * | 2014-03-05 | 2014-06-25 | 无锡香象生物科技有限公司 | Intelligent vertical searching device and method for safety industry chain |
CN107590188A (en) * | 2017-08-08 | 2018-01-16 | 杭州灵皓科技有限公司 | A kind of reptile crawling method and its management system for automating vertical subdivision field |
CN108011931A (en) * | 2017-11-22 | 2018-05-08 | 用友金融信息技术股份有限公司 | Web data acquisition method and web data acquisition system |
CN107944055A (en) * | 2017-12-22 | 2018-04-20 | 成都优易数据有限公司 | A kind of reptile method of solution Web certificate verifications |
CN110147476A (en) * | 2019-04-12 | 2019-08-20 | 深圳壹账通智能科技有限公司 | Data crawling method, terminal device and computer readable storage medium based on Scrapy |
Non-Patent Citations (1)
Title |
---|
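Yao Liang (姚良): "Python 3 Crawler in Practice: Data Cleaning, Data Analysis and Visualization" (《Python3爬虫实战:数据清洗、数据分析与可视化》), China Railway Publishing House Co., Ltd. (中国铁道出版社有限公司), pages: 3 * |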
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111881337A (en) * | 2020-08-06 | 2020-11-03 | 成都信息工程大学 | Data acquisition method and system based on Scapy framework and storage medium |
CN111881337B (en) * | 2020-08-06 | 2021-06-01 | 成都信息工程大学 | Data acquisition method and system based on Scapy framework and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | RJ01 | Rejection of invention patent application after publication | Application publication date: 20200612 |