CN109766488B - Data acquisition method based on Scapy - Google Patents

Data acquisition method based on Scrapy

Info

Publication number
CN109766488B
CN109766488B · CN201910040521.2A
Authority
CN
China
Prior art keywords
data
engine
scapy
spider
collected
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910040521.2A
Other languages
Chinese (zh)
Other versions
CN109766488A (en
Inventor
赵蕾 (Zhao Lei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Institute of Industry Technology
Original Assignee
Nanjing Institute of Industry Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Institute of Industry Technology filed Critical Nanjing Institute of Industry Technology
Priority to CN201910040521.2A priority Critical patent/CN109766488B/en
Publication of CN109766488A publication Critical patent/CN109766488A/en
Application granted granted Critical
Publication of CN109766488B publication Critical patent/CN109766488B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a data acquisition method based on Scrapy, which comprises the steps of first determining the source of the big data to be collected, then qualitatively characterizing the data, then quantifying sample data, and finally collecting the data with the Scrapy technology. The invention realizes the acquisition of mass data with a clear collection order, so that collection does not become disordered even when the volume of collected data is huge.

Description

Data acquisition method based on Scrapy
Technical Field
The invention relates to a data acquisition method based on Scrapy, and belongs to the technical field of data acquisition methods.
Background
In recent years, with the continuous development of Chinese society, the scale of social production has kept expanding, forming super-large interconnected power grids whose operation is increasingly tightly coupled. Production and operation management faces ever more complex challenges, and more reliable, stable and safe production systems are urgently needed. Data acquisition, as an important component of social production, plays an increasingly important supporting role in its safe, stable and efficient operation. Big data is a new strategic resource for China and is drawing growing attention at home and abroad. In 2011, W. Brian Arthur proposed the concept of the "second economy": beyond the familiar physical economy (the first economy), a second economy (distinct from the virtual economy) is being formed by processors, sensors, actuators and the economic activities associated with them, and big data is the core connotation and key support of this second economy. Data acquisition services must adapt to the characteristics of the interconnected large power grid (many applications, large data volume, high real-time requirements and high safety), optimizing their design and integrating more advanced technical means to support coordinated regulation and control over a wider range together with panoramic monitoring and analysis of many kinds of data. Traditional data acquisition functions are mainly oriented to a single application and suffer from duplicated functionality, complex maintenance and insufficient information exchange and sharing; meanwhile, as system scale and the size of data acquisition tables keep growing, operation and maintenance become inconvenient and acquisition and processing capacity declines.
A corresponding technical scheme therefore needs to be designed to solve these problems.
Disclosure of Invention
The invention aims to solve the above technical problem by providing a data acquisition method based on Scrapy, which first determines the source of the big data to be collected, then qualitatively characterizes the data, then quantifies sample data, and finally collects the data with the Scrapy technology, thereby meeting the requirements of practical application.
In order to solve the problems, the technical scheme adopted by the invention is as follows:
A data acquisition method based on Scrapy comprises the following steps:
step 1: determining the source of big data to be collected
From the perspective of data sources, social networks, the mobile internet and informatized enterprises are the producers of mass data. Divided according to the field in which the data are generated, data fall into three types: network data, physical-world data and scientific research test data. Network data refers to the various data generated in cyberspace by communication, shopping, learning, website browsing and the like; by user behavior it can be subdivided into social behavior data (based on SNS networks), shopping behavior data (based on e-commerce platforms), search behavior data (based on search engines), and so on. Physical-world data refers to data obtained through perception and representation by smart devices: one kind is data collection based on large equipment such as airplanes, automobiles and large agricultural machinery; the other is sensor-based decentralized acquisition of data such as temperature, humidity, pressure, sound, image, light, magnetism and voltage;
Scientific research test data refers to the massive data generated during experiments and used for scientific analysis. With the informatization of research, traditional scientific activity is gradually shifting to data-centered science, and fields such as genomics, proteomics, astrophysics, meteorology and brain science in particular generate massive data;
step 2: qualitative acquisition of data to be acquired
Information that does not contain numbers can be called qualitative data; collecting it generally does not depend on tools or equipment and only yields rough suggestions about the data to be collected. Its typical form is the survey: by communicating with the users being surveyed, rough information about the data to be collected is obtained, giving a general understanding of the object of investigation;
Step 3: quantifying the sample data to be collected
Sample data belongs to the quantitative data acquisition stage, in which one or more kinds of data are collected by means of data acquisition tools to guide production or decision-making. This stage is mainly characterized by manual participation, with the data acquisition work completed with the help of equipment;
Step 4: data acquisition based on the Scrapy technology
Scrapy is an application framework, written on top of Twisted (a Python-based event-driven networking framework), for crawling web sites and extracting structured data; it can be applied to data mining, information processing, historical archiving and other tasks;
Scrapy mainly comprises the following components. The Engine is responsible for controlling the data flow between the different components of the system and triggering events when specific actions occur. The Scheduler receives requests from the engine and enqueues them, returning them to the engine when the engine asks for them. The Downloader is responsible for downloading web pages and passing them, via the engine, to the spiders. Spiders are classes written by the Scrapy user that parse responses and extract specific items from them. The Item Pipeline receives the items extracted by the spiders and processes them further, including cleaning, validation and consistency management. Downloader middleware sits between the engine and the downloader, processing requests from the engine to the downloader and responses from the downloader to the engine. Spider middleware sits between the engine and the spiders, processing the spiders' input (responses) and output (items and requests);
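The component relationships above can be sketched in pure Python. This is not Scrapy's actual API; it is a minimal illustration, with hypothetical class names and an in-memory "web", of how the engine shuttles requests from the scheduler to the downloader, responses to the spider, and extracted items to the pipeline:

```python
from collections import deque

# A tiny in-memory "web" standing in for real pages (hypothetical data).
PAGES = {
    "http://example.com/": "<a href='http://example.com/a'>A</a>",
    "http://example.com/a": "item:alpha",
}

class Scheduler:
    """Queues requests and hands them back to the engine in order."""
    def __init__(self):
        self.queue = deque()
    def enqueue(self, url):
        self.queue.append(url)
    def next_request(self):
        return self.queue.popleft() if self.queue else None

class Downloader:
    """Fetches the body for a request (here: a dict lookup, not HTTP)."""
    def fetch(self, url):
        return PAGES.get(url, "")

class Spider:
    """Parses a response, yielding extracted items and follow-up URLs."""
    start_urls = ["http://example.com/"]
    def parse(self, url, body):
        if body.startswith("item:"):
            yield ("item", body[5:])
        if "href='" in body:
            link = body.split("href='")[1].split("'")[0]
            yield ("request", link)

class Pipeline:
    """Post-processes items; cleaning and validation would go here."""
    def __init__(self):
        self.items = []
    def process(self, item):
        self.items.append(item.strip())

def run_engine(spider, scheduler, downloader, pipeline):
    """The engine drives the data flow between all the components."""
    for url in spider.start_urls:
        scheduler.enqueue(url)
    while (url := scheduler.next_request()) is not None:
        body = downloader.fetch(url)
        for kind, value in spider.parse(url, body):
            if kind == "request":
                scheduler.enqueue(value)   # back to the scheduler
            else:
                pipeline.process(value)    # on to the item pipeline
    return pipeline.items
```

In real Scrapy the middleware layers would intercept the request and response on their way between these components.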
The data acquisition method of the Scrapy web crawler comprises the following steps:
1) initializing Requests from the initial URLs and setting a callback function; when a request has been downloaded and returns, a response is generated and passed to the callback function as a parameter;
2) within the callback function, analyzing the returned (web page) content and returning Item objects, Request objects, or an iterable container including both;
3) within the callback function, Selectors can be used to analyze the contents of the web page and generate items from the analyzed data;
4) storing the items returned by the spider in a database, or writing them to a file using Feed exports. The main classes of Scrapy include Item, Spider and Selector. An Item is a container holding crawled data and can be defined by creating a scrapy.Item subclass. The Spider class defines how a web site (or group of sites) is crawled, including the crawling actions (e.g. whether to follow links) and how to extract structured data from page content. Selectors "select" a certain part of an HTML document by means of an XPath or CSS expression.
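The four steps above can be sketched as a callback-driven loop. The `Request`, `Response`, `crawl` and `SITE` names here are simplified stand-ins rather than Scrapy's real classes; in real Scrapy the spider's parse method plays the role of the callback:

```python
class Request:
    """A URL plus the callback that will parse its response (step 1)."""
    def __init__(self, url, callback):
        self.url = url
        self.callback = callback

class Response:
    """The downloaded content, handed to the callback as a parameter."""
    def __init__(self, url, body):
        self.url, self.body = url, body

def crawl(start_urls, parse, download):
    """Drain the request queue; callbacks return items or further
    Requests (steps 2-3), and items are collected for export (step 4)."""
    pending = [Request(u, parse) for u in start_urls]
    items = []
    while pending:
        req = pending.pop(0)
        resp = Response(req.url, download(req.url))
        for result in req.callback(resp):
            if isinstance(result, Request):
                pending.append(result)   # a follow-up request
            else:
                items.append(result)     # an extracted item
    return items

# A two-page in-memory "site" (hypothetical data) and its callback.
SITE = {"page0": "next:page1", "page1": "data:42"}

def parse(response):
    if response.body.startswith("next:"):
        yield Request(response.body[5:], parse)   # follow the link
    else:
        yield {"value": response.body[5:]}        # emit an item
```

Calling `crawl(["page0"], parse, SITE.get)` follows the link from page0 to page1 and returns the single extracted item.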
As an improvement of the technical scheme, when determining the source of the big data to be collected, the data are divided according to the field in which they are generated and the behavior of the user.
As an improvement of the technical scheme, qualitative acquisition of the data to be collected relies mainly on human experience and judgment, generally without tools or equipment, yielding only suggestions about the data to be collected.
As an improvement of the technical scheme, when quantitative sample data are collected, one or more kinds of data are collected by means of a suitable information tool to guide production or decision-making.
As an improvement of the technical scheme, when collecting data with Scrapy, the data are collected step by step using the Scrapy crawler.
Compared with the prior art, the invention has the following implementation effects:
the method comprises the steps of firstly determining the source of big data to be collected, secondly, qualitatively acquiring the data, then quantifying sample data, and finally, collecting the data based on the script technology, so that the mass data can be collected, and the data collection order is clear.
Drawings
FIG. 1 is a schematic flow chart of the Scrapy-based data acquisition method of the invention;
FIG. 2 is a framework diagram of the open-source Scrapy platform used in the invention.
Detailed Description
The present invention will be described with reference to specific examples.
FIG. 1 and FIG. 2 show, respectively, the flow and the framework of the Scrapy-based data acquisition method of the invention.
The invention provides a data acquisition method based on Scrapy, which first determines the source of the big data to be collected, then qualitatively characterizes the data, then quantifies sample data, and finally collects the data with the Scrapy technology.
The method collects mass data with a clear collection order. As shown in FIG. 1, the specific implementation steps are as follows:
step 1: determining the source of big data to be collected
From the perspective of data sources, social networks, the mobile internet and informatized enterprises are the producers of mass data. Divided according to the field in which the data are generated, data fall roughly into three types: network data, physical-world data and scientific research test data. Network data refers to the various data generated in cyberspace by communication, shopping, learning, website browsing and the like; by user behavior it can be subdivided into social behavior data (based on SNS networks), shopping behavior data (based on e-commerce platforms), search behavior data (based on search engines), and so on.
Physical-world data refers to data obtained through perception and representation by intelligent devices. One kind is data collection based on large equipment such as airplanes, automobiles and large agricultural machinery; the other is sensor-based decentralized acquisition of data such as temperature, humidity, pressure, sound, image, light, magnetism and voltage;
Scientific research test data refers to the massive data generated during experiments and used for scientific analysis. With the informatization of research, traditional scientific activity is gradually shifting to data-centered science, and fields such as genomics, proteomics, astrophysics, meteorology and brain science in particular generate massive data;
step 2: qualitative acquisition of data to be acquired
Information that does not contain numbers may be referred to as qualitative data. Qualitative data acquisition relies mainly on human experience and judgment, generally does not depend on tools or equipment, and only yields rough suggestions about the data to be collected. Its typical form is the survey: by communicating with the users being surveyed, rough information about the data to be collected is obtained, giving a general understanding of the object of investigation. Qualitative data is characterized by being 1) inexact: no precise value of the collected data can be given; and 2) descriptive: things are described in descriptive language;
Step 3: quantifying the sample data to be collected
Sample data belongs to the quantitative data acquisition stage, in which one or more kinds of data are collected by means of a suitable information tool to guide production or decision-making. This stage is mainly characterized by manual participation, with the data acquisition work completed with the help of equipment;
Step 4: data collection is performed based on the Scrapy technology.
Such massive data contains huge potential value, but because the data are dispersed and the knowledge fragmented, the massive information cannot be effectively organized, which greatly weakens its support for decision analysis;
A web crawler downloads web page resources from the Internet periodically or continuously according to certain URL rules; its main function is resource collection. Its working principle is as follows: starting from seed URLs, it accesses the specified web pages, downloads the page resources locally, extracts new URLs from the pages and adds them to a URL queue; according to a certain rule, the crawler then reads the next URL from the queue, accesses it and downloads its resources, and this process repeats continuously;
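The working principle just described (seed URLs, download, extract new URLs, re-queue) is a breadth-first traversal. A minimal sketch, in which `fetch` and `extract_links` are hypothetical stand-ins for real HTTP download and HTML link extraction:

```python
from collections import deque

def crawl_from_seeds(seed_urls, fetch, extract_links, max_pages=100):
    """Read a URL from the queue, download its page, append newly
    extracted URLs, and repeat until the queue drains or a page cap."""
    queue = deque(seed_urls)
    seen = set(seed_urls)            # avoid re-downloading the same URL
    downloaded = {}
    while queue and len(downloaded) < max_pages:
        url = queue.popleft()
        page = fetch(url)            # download the resource locally
        downloaded[url] = page
        for link in extract_links(page):
            if link not in seen:     # new URLs join the queue
                seen.add(link)
                queue.append(link)
    return downloaded

# An in-memory three-page "site": each page is just its list of out-links.
SITE = {"seed": ["a", "b"], "a": ["seed"], "b": []}
```

Running `crawl_from_seeds(["seed"], SITE.get, lambda page: page)` visits all three pages exactly once, even though page "a" links back to the seed.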
After years of development, web crawler technology is relatively mature and much open-source software is available: well-known examples include Apache's Nutch, SourceForge's Heritrix, and Larbin, WebLech, Arale, J-Spider, Arachnid and Pyspider.
Scrapy is another web crawling framework, written on top of Twisted (a Python-based event-driven networking framework), for extracting structured data; it can be applied to data mining, information processing, historical archiving and other tasks, and its latest version at the time of writing is 1.05. Scrapy has the following characteristics: 1) rapid deployment: the user defines the data extraction rules and Scrapy automatically performs the subsequent work; 2) easy extension: new functionality is easy to design and add without modifying the Scrapy core; 3) broad compatibility: Scrapy is a lightweight Python application that runs on Linux, Windows, Mac and BSD.
The invention uses the open-source Scrapy platform to build a customizable, theme-oriented website data acquisition platform that collects the specific column information of specific websites in a targeted manner, thereby realizing automatic monitoring of specific information.
Scrapy is composed of the engine, scheduler, downloader, spiders, item pipeline, downloader middleware, spider middleware, scheduling middleware and so on; the relationships between the components, i.e. the data flow, can be seen in the technical architecture diagram, where green indicates the direction of data flow. The Engine is responsible for controlling the data flow between the different components of the system and triggering events when specific actions occur. The Scheduler receives requests from the engine and enqueues them, returning them to the engine when the engine asks for them. The Downloader is responsible for downloading web pages and passing them, via the engine, to the spiders. Spiders are classes written by the Scrapy user that parse responses and extract specific items from them. The Item Pipeline receives the items extracted by the spiders and processes them further, including cleaning, validation and consistency management. Downloader middleware sits between the engine and the downloader, processing requests from the engine to the downloader and responses from the downloader to the engine. Spider middleware sits between the engine and the spiders, processing the spiders' input (responses) and output (items and requests).
The Scrapy crawler proceeds in the following steps: 1) Initialize Requests from the initial URLs and set a callback function; when a request has been downloaded and returns, a response is generated and passed to the callback function as a parameter. 2) Within the callback function, analyze the returned (web page) content and return Item objects, Request objects, or an iterable container including both. 3) Within the callback function, Selectors can be used to analyze the web content and generate items from the analyzed data. 4) The items returned by the spider are stored in a database or written to a file using Feed exports.
The main classes of Scrapy include Item, Spider and Selector. An Item is a container holding crawled data and can be defined by creating a scrapy.Item subclass. The Spider class defines how a web site (or group of sites) is crawled, including the crawling actions (e.g. whether to follow links) and how to extract structured data from page content. Selectors "select" a certain part of an HTML document by means of an XPath or CSS expression.
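To illustrate the idea of "selecting" part of an HTML document, here is a toy stand-in for Selectors built on the standard-library HTMLParser. Real Scrapy Selectors use XPath/CSS expressions (via the parsel library) rather than this hypothetical tag-matching class:

```python
from html.parser import HTMLParser

class TextByTag(HTMLParser):
    """Collects the text content of every element with a given tag name,
    roughly what a CSS selector like 'li::text' would return."""
    def __init__(self, tag):
        super().__init__()
        self.tag, self.depth, self.texts = tag, 0, []
    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1          # entering a matching element
    def handle_endtag(self, tag):
        if tag == self.tag and self.depth:
            self.depth -= 1          # leaving a matching element
    def handle_data(self, data):
        if self.depth and data.strip():
            self.texts.append(data.strip())

def select_text(html, tag):
    """'Select' the text of all elements named `tag` in an HTML string."""
    parser = TextByTag(tag)
    parser.feed(html)
    return parser.texts
```

For example, `select_text("<ul><li>a</li><li>b</li></ul><p>x</p>", "li")` returns only the list-item texts, ignoring the paragraph.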
The technology is realized as follows:
the script is installed via the pip install script command. Greater than python2.7 is required for scripy; after the installation is finished, configuring the script as a global variable through ln-s/usr/local/python 27/bin/script commands; a script project is newly built by script start project tutorial. A new spider is created through a script generating spider caas command, for crawling of different websites, a plurality of spiders need to be established and executed concurrently, and accordingly collection of a large amount of network resource data can be achieved.
The foregoing is a detailed description of the invention with reference to specific embodiments, and the practice of the invention is not to be construed as limited thereto. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.

Claims (5)

1. A data acquisition method based on Scrapy, characterized in that the method comprises the following steps:
step 1: determining the source of big data to be collected
From the perspective of data sources, social networks, the mobile internet and informatized enterprises are the producers of mass data. Divided according to the field in which the data are generated, data fall into three types: network data, physical-world data and scientific research test data. Network data refers to the various data generated in cyberspace by communication, shopping, learning, website browsing and the like; by user behavior it can be subdivided into social behavior data (based on SNS networks), shopping behavior data (based on e-commerce platforms), search behavior data (based on search engines), and so on. Physical-world data refers to data obtained through perception and representation by smart devices: one kind is data collection based on large equipment such as airplanes, automobiles and large agricultural machinery; the other is sensor-based decentralized acquisition of data such as temperature, humidity, pressure, sound, image, light, magnetism and voltage;
Scientific research test data refers to the massive data generated during experiments and used for scientific analysis. With the informatization of research, traditional scientific activity is gradually shifting to data-centered science, and fields such as genomics, proteomics, astrophysics, meteorology and brain science in particular generate massive data;
Step 2: qualitative acquisition of the data to be collected
Information that does not contain numbers can be called qualitative data; collecting it generally does not depend on tools or equipment and only yields rough suggestions about the data to be collected. Its typical form is the survey: by communicating with the users being surveyed, rough information about the data to be collected is obtained, giving a general understanding of the object of investigation;
Step 3: quantifying the sample data to be collected
Sample data belongs to the quantitative data acquisition stage, in which one or more kinds of data are collected by means of data acquisition tools to guide production or decision-making. This stage is mainly characterized by manual participation, with the data acquisition work completed with the help of equipment;
Step 4: data acquisition based on the Scrapy technology
Scrapy is an application framework, written on top of Twisted (a Python-based event-driven networking framework), for crawling web sites and extracting structured data; it can be applied to data mining, information processing, historical archiving and other tasks;
Scrapy mainly comprises the following components. The Engine is responsible for controlling the data flow between the different components of the system and triggering events when specific actions occur. The Scheduler receives requests from the engine and enqueues them, returning them to the engine when the engine asks for them. The Downloader is responsible for downloading web pages and passing them, via the engine, to the spiders. Spiders are classes written by the Scrapy user that parse responses and extract specific items from them. The Item Pipeline receives the items extracted by the spiders and processes them further, including cleaning, validation and consistency management. Downloader middleware sits between the engine and the downloader, processing requests from the engine to the downloader and responses from the downloader to the engine. Spider middleware sits between the engine and the spiders, processing the spiders' input (responses) and output (items and requests);
The data acquisition method of the Scrapy web crawler comprises the following steps:
1) initializing Requests from the initial URLs and setting a callback function; when a request has been downloaded and returns, a response is generated and passed to the callback function as a parameter;
2) within the callback function, analyzing the returned (web page) content and returning Item objects, Request objects, or an iterable container including both;
3) within the callback function, Selectors may be used to analyze the contents of the web page and generate items from the analyzed data;
4) storing the items returned by the spider in a database, or writing them to a file using Feed exports. The main classes of Scrapy include Item, Spider and Selector. An Item is a container holding crawled data and can be defined by creating a scrapy.Item subclass. The Spider class defines how a web site (or group of sites) is crawled, including the crawling actions (e.g. whether to follow links) and how to extract structured data from page content. Selectors "select" a certain part of an HTML document by means of an XPath or CSS expression.
2. The Scrapy-based data collection method of claim 1, characterized in that, when determining the source of the big data to be collected, the data are divided according to the field in which they are generated and the behavior of the user.
3. The Scrapy-based data collection method of claim 1, characterized in that qualitative acquisition of the data to be collected relies mainly on human experience and judgment, generally without tools or equipment, yielding only suggestions about the data to be collected.
4. The Scrapy-based data collection method of claim 1, characterized in that, when quantitative sample data are collected, one or more kinds of data are collected by means of a suitable information tool to guide production or decision-making.
5. The Scrapy-based data collection method of claim 1, characterized in that, when collecting data with Scrapy, the data are collected step by step using the Scrapy crawler.
CN201910040521.2A 2019-01-16 2019-01-16 Data acquisition method based on Scapy Active CN109766488B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910040521.2A CN109766488B (en) 2019-01-16 2019-01-16 Data acquisition method based on Scapy

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910040521.2A CN109766488B (en) 2019-01-16 2019-01-16 Data acquisition method based on Scapy

Publications (2)

Publication Number Publication Date
CN109766488A CN109766488A (en) 2019-05-17
CN109766488B true CN109766488B (en) 2022-09-16

Family

ID=66452455

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910040521.2A Active CN109766488B (en) 2019-01-16 2019-01-16 Data acquisition method based on Scapy

Country Status (1)

Country Link
CN (1) CN109766488B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110298386B (en) * 2019-06-10 2023-07-28 成都积微物联集团股份有限公司 Label automatic definition method based on image content
CN110826006B (en) * 2019-11-22 2021-03-19 支付宝(杭州)信息技术有限公司 Abnormal collection behavior identification method and device based on privacy data protection
CN116628248B (en) * 2023-07-21 2023-09-26 合肥焕峰智能科技有限公司 Processing method for intelligent equipment to collect image data

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017177872A1 (en) * 2016-04-11 2017-10-19 中兴通讯股份有限公司 Data collection method and apparatus, and storage medium
CN107273409A (en) * 2017-05-03 2017-10-20 广州赫炎大数据科技有限公司 A kind of network data acquisition, storage and processing method and system
CN107506502A (en) * 2017-10-10 2017-12-22 山东浪潮云服务信息科技有限公司 A kind of data collecting system and collecting method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017177872A1 (en) * 2016-04-11 2017-10-19 中兴通讯股份有限公司 Data collection method and apparatus, and storage medium
CN107273409A (en) * 2017-05-03 2017-10-20 广州赫炎大数据科技有限公司 A kind of network data acquisition, storage and processing method and system
CN107506502A (en) * 2017-10-10 2017-12-22 山东浪潮云服务信息科技有限公司 A kind of data collecting system and collecting method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
杨君 (Yang Jun) et al., "Design and Implementation of a Data Acquisition System Based on Scrapy Technology" (基于Scrapy技术的数据采集系统的设计与实现), Computer Technology and Development (计算机技术与发展), No. 10, 2018-05-16, full text *

Also Published As

Publication number Publication date
CN109766488A (en) 2019-05-17

Similar Documents

Publication Publication Date Title
CN101662493B (en) Data acquiring method, system and server of user access path
CN109766488B (en) Data acquisition method based on Scapy
CN101370024B (en) Distributed information collection method and system
EP3371665B1 (en) Distributed embedded data and knowledge management system integrated with plc historian
CN102880607A (en) Dynamic network content grabbing method and dynamic network content crawler system
CN109684370A (en) Daily record data processing method, system, equipment and storage medium
US9430579B2 (en) Hybrid web publishing system
CN109284430A (en) Visualization subject web page content based on distributed structure/architecture crawls system and method
CN103092936B (en) A kind of Internet of Things dynamic page real-time information collection method
CN103051496A (en) Monitoring method and device of monitoring point server
CN101441629A (en) Automatic acquiring method of non-structured web page information
CN103744845A (en) Method and system for WEB platform data caching
US20160203224A1 (en) System for analyzing social media data and method of analyzing social media data using the same
Vanhove et al. Tengu: An experimentation platform for big data applications
CN103593396A (en) Network resource extracting method and device based on browser
CN104166545A (en) Webpage resource sniffing method and device
CN110011827A (en) Towards doctor conjuncted multi-user's big data analysis service system and method
CN114443599A (en) Data synchronization method and device, electronic equipment and storage medium
KR20150089693A (en) Apparatus and Method for Extending Data Store System Based on Big Data Platform
CN112860844A (en) Case clue processing system, method and device and computer equipment
KR101235199B1 (en) An interface construction system and method to control low­erformance equipment using web technology
CN113515715A (en) Generating method, processing method and related equipment of buried point event code
CN117591229B (en) Device data viewing and displaying method and system based on gateway embedded Web
CN111159004A (en) Hadoop cluster simulation test method and device and storage medium
US11822566B2 (en) Interactive analytics workflow with integrated caching

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant