CN109766488B - Data acquisition method based on Scapy - Google Patents

Data acquisition method based on Scrapy

Info

Publication number
CN109766488B
CN109766488B · CN201910040521.2A
Authority
CN
China
Prior art keywords
data
engine
scapy
spider
collected
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910040521.2A
Other languages
Chinese (zh)
Other versions
CN109766488A (en
Inventor
赵蕾 (Zhao Lei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Institute of Industry Technology
Original Assignee
Nanjing Institute of Industry Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Institute of Industry Technology filed Critical Nanjing Institute of Industry Technology
Priority to CN201910040521.2A priority Critical patent/CN109766488B/en
Publication of CN109766488A publication Critical patent/CN109766488A/en
Application granted granted Critical
Publication of CN109766488B publication Critical patent/CN109766488B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a data acquisition method based on Scrapy, which comprises the steps of first determining the source of the big data to be collected, then qualitatively characterizing the data, then quantifying sample data, and finally collecting the data with the Scrapy technology. The invention realizes the acquisition of mass data with a clear collection order, so that collection does not become disordered even when the volume of collected data is huge.

Description

Data acquisition method based on Scrapy
Technical Field
The invention relates to a data acquisition method based on Scrapy, and belongs to the technical field of data acquisition methods.
Background
In recent years, with the continuous development of Chinese society, the scale of social production has kept expanding, forming super-large interconnected power grids whose operation is increasingly tightly coupled. Production and operation management faces ever more complex challenges, and more reliable, stable and safe production systems are urgently needed. Data acquisition, as an important component of social production, plays an increasingly important supporting role in its safe, stable and efficient operation. Big data is a new strategic resource for China and is drawing growing attention at home and abroad. In 2011, W. Brian Arthur proposed the concept of the "second economy": beyond the familiar physical economy (the first economy), a second economy (distinct from the virtual economy) is being formed by processors, sensors, actuators and the economic activities associated with them, and big data is the core connotation and key support of this second economy. Data acquisition services must adapt to the characteristics of the interconnected large power grid (many applications, large data volume, high real-time requirements and high safety), optimizing their design and integrating more advanced technical means to support coordinated regulation and control over a wider range together with panoramic monitoring and analysis of many kinds of data. Traditional data acquisition functions are mainly oriented to a single application and suffer from duplicated functionality, complex maintenance and insufficient information exchange and sharing; meanwhile, as system scale and the size of data acquisition tables keep growing, operation and maintenance become inconvenient and acquisition and processing capacity declines.
A corresponding technical scheme therefore needs to be designed to solve these problems.
Disclosure of Invention
The invention aims to solve the above technical problem by providing a data acquisition method based on Scrapy, which first determines the source of the big data to be collected, then qualitatively characterizes the data, then quantifies sample data, and finally collects the data with the Scrapy technology, thereby meeting the requirements of practical application.
In order to solve the problems, the technical scheme adopted by the invention is as follows:
A data acquisition method based on Scrapy comprises the following steps:
step 1: determining the source of big data to be collected
From the perspective of data sources, social networks, the mobile internet and informatized enterprises are the producers of mass data. Divided according to the field in which the data are generated, data fall into three types: network data, physical-world data and scientific research test data. Network data refers to the various data generated in cyberspace by communication, shopping, learning, website browsing and the like; by user behavior it can be subdivided into social behavior data (based on SNS networks), shopping behavior data (based on e-commerce platforms), search behavior data (based on search engines), and so on. Physical-world data refers to data obtained through perception and representation by smart devices: one kind is data collection based on large equipment such as airplanes, automobiles and large agricultural machinery; the other is sensor-based decentralized acquisition of data such as temperature, humidity, pressure, sound, image, light, magnetism and voltage;
Scientific research test data refers to the massive data generated during experiments and used for scientific analysis. With the informatization of research, traditional scientific activity is gradually shifting to data-centered science, and fields such as genomics, proteomics, astrophysics, meteorology and brain science in particular generate massive data;
step 2: qualitative acquisition of data to be acquired
Information that does not contain numbers can be called qualitative data; collecting it generally does not depend on tools or equipment and only yields rough suggestions about the data to be collected. Its typical form is the survey: by communicating with the users being surveyed, rough information about the data to be collected is obtained, giving a general understanding of the object of investigation;
Step 3: quantifying the sample data to be collected
Sample data belongs to the quantitative data acquisition stage, in which one or more kinds of data are collected by means of data acquisition tools to guide production or decision-making. This stage is mainly characterized by manual participation, with the data acquisition work completed with the help of equipment;
Step 4: data acquisition based on the Scrapy technology
Scrapy is an application framework, written on top of Twisted (a Python-based event-driven networking framework), for crawling web sites and extracting structured data; it can be applied to data mining, information processing, historical archiving and other tasks;
Scrapy mainly comprises the following components. The Engine is responsible for controlling the data flow between the different components of the system and triggering events when specific actions occur. The Scheduler receives requests from the engine and enqueues them, returning them to the engine when the engine asks for them. The Downloader is responsible for downloading web pages and passing them, via the engine, to the spiders. Spiders are classes written by the Scrapy user that parse responses and extract specific items from them. The Item Pipeline receives the items extracted by the spiders and processes them further, including cleaning, validation and consistency management. Downloader middleware sits between the engine and the downloader, processing requests from the engine to the downloader and responses from the downloader to the engine. Spider middleware sits between the engine and the spiders, processing the spiders' input (responses) and output (items and requests);
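The component relationships above can be sketched in pure Python. This is not Scrapy's actual API; it is a minimal illustration, with hypothetical class names and an in-memory "web", of how the engine shuttles requests from the scheduler to the downloader, responses to the spider, and extracted items to the pipeline:

```python
from collections import deque

# A tiny in-memory "web" standing in for real pages (hypothetical data).
PAGES = {
    "http://example.com/": "<a href='http://example.com/a'>A</a>",
    "http://example.com/a": "item:alpha",
}

class Scheduler:
    """Queues requests and hands them back to the engine in order."""
    def __init__(self):
        self.queue = deque()
    def enqueue(self, url):
        self.queue.append(url)
    def next_request(self):
        return self.queue.popleft() if self.queue else None

class Downloader:
    """Fetches the body for a request (here: a dict lookup, not HTTP)."""
    def fetch(self, url):
        return PAGES.get(url, "")

class Spider:
    """Parses a response, yielding extracted items and follow-up URLs."""
    start_urls = ["http://example.com/"]
    def parse(self, url, body):
        if body.startswith("item:"):
            yield ("item", body[5:])
        if "href='" in body:
            link = body.split("href='")[1].split("'")[0]
            yield ("request", link)

class Pipeline:
    """Post-processes items; cleaning and validation would go here."""
    def __init__(self):
        self.items = []
    def process(self, item):
        self.items.append(item.strip())

def run_engine(spider, scheduler, downloader, pipeline):
    """The engine drives the data flow between all the components."""
    for url in spider.start_urls:
        scheduler.enqueue(url)
    while (url := scheduler.next_request()) is not None:
        body = downloader.fetch(url)
        for kind, value in spider.parse(url, body):
            if kind == "request":
                scheduler.enqueue(value)   # back to the scheduler
            else:
                pipeline.process(value)    # on to the item pipeline
    return pipeline.items
```

In real Scrapy the middleware layers would intercept the request and response on their way between these components.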
The data acquisition method of the Scrapy web crawler comprises the following steps:
1) initializing Requests from the initial URLs and setting a callback function; when a request has been downloaded and returns, a response is generated and passed to the callback function as a parameter;
2) within the callback function, analyzing the returned (web page) content and returning Item objects, Request objects, or an iterable container including both;
3) within the callback function, Selectors can be used to analyze the contents of the web page and generate items from the analyzed data;
4) storing the items returned by the spider in a database, or writing them to a file using Feed exports. The main classes of Scrapy include Item, Spider and Selector. An Item is a container holding crawled data and can be defined by creating a scrapy.Item subclass. The Spider class defines how a web site (or group of sites) is crawled, including the crawling actions (e.g. whether to follow links) and how to extract structured data from page content. Selectors "select" a certain part of an HTML document by means of an XPath or CSS expression.
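The four steps above can be sketched as a callback-driven loop. The `Request`, `Response`, `crawl` and `SITE` names here are simplified stand-ins rather than Scrapy's real classes; in real Scrapy the spider's parse method plays the role of the callback:

```python
class Request:
    """A URL plus the callback that will parse its response (step 1)."""
    def __init__(self, url, callback):
        self.url = url
        self.callback = callback

class Response:
    """The downloaded content, handed to the callback as a parameter."""
    def __init__(self, url, body):
        self.url, self.body = url, body

def crawl(start_urls, parse, download):
    """Drain the request queue; callbacks return items or further
    Requests (steps 2-3), and items are collected for export (step 4)."""
    pending = [Request(u, parse) for u in start_urls]
    items = []
    while pending:
        req = pending.pop(0)
        resp = Response(req.url, download(req.url))
        for result in req.callback(resp):
            if isinstance(result, Request):
                pending.append(result)   # a follow-up request
            else:
                items.append(result)     # an extracted item
    return items

# A two-page in-memory "site" (hypothetical data) and its callback.
SITE = {"page0": "next:page1", "page1": "data:42"}

def parse(response):
    if response.body.startswith("next:"):
        yield Request(response.body[5:], parse)   # follow the link
    else:
        yield {"value": response.body[5:]}        # emit an item
```

Calling `crawl(["page0"], parse, SITE.get)` follows the link from page0 to page1 and returns the single extracted item.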
As an improvement of the technical scheme, when determining the source of the big data to be collected, the data are divided according to the field in which they are generated and the behavior of the user.
As an improvement of the technical scheme, qualitative acquisition of the data to be collected relies mainly on human experience and judgment, generally without tools or equipment, yielding only suggestions about the data to be collected.
As an improvement of the technical scheme, when quantitative sample data are collected, one or more kinds of data are collected by means of a suitable information tool to guide production or decision-making.
As an improvement of the technical scheme, when collecting data with Scrapy, the data are collected step by step using the Scrapy crawler.
Compared with the prior art, the invention has the following implementation effects:
the method comprises the steps of firstly determining the source of big data to be collected, secondly, qualitatively acquiring the data, then quantifying sample data, and finally, collecting the data based on the script technology, so that the mass data can be collected, and the data collection order is clear.
Drawings
FIG. 1 is a schematic flow chart of the Scrapy-based data acquisition method of the invention;
FIG. 2 is a framework diagram of the open-source Scrapy platform used in the invention.
Detailed Description
The present invention will be described with reference to specific examples.
FIG. 1 and FIG. 2 show, respectively, the flow and the framework of the Scrapy-based data acquisition method of the invention.
The invention provides a data acquisition method based on Scrapy, which first determines the source of the big data to be collected, then qualitatively characterizes the data, then quantifies sample data, and finally collects the data with the Scrapy technology.
The method collects mass data with a clear collection order. As shown in FIG. 1, the specific implementation steps are as follows:
step 1: determining the source of big data to be collected
From the perspective of data sources, social networks, the mobile internet and informatized enterprises are the producers of mass data. Divided according to the field in which the data are generated, data fall roughly into three types: network data, physical-world data and scientific research test data. Network data refers to the various data generated in cyberspace by communication, shopping, learning, website browsing and the like; by user behavior it can be subdivided into social behavior data (based on SNS networks), shopping behavior data (based on e-commerce platforms), search behavior data (based on search engines), and so on.
Physical-world data refers to data obtained through perception and representation by intelligent devices. One kind is data collection based on large equipment such as airplanes, automobiles and large agricultural machinery; the other is sensor-based decentralized acquisition of data such as temperature, humidity, pressure, sound, image, light, magnetism and voltage;
Scientific research test data refers to the massive data generated during experiments and used for scientific analysis. With the informatization of research, traditional scientific activity is gradually shifting to data-centered science, and fields such as genomics, proteomics, astrophysics, meteorology and brain science in particular generate massive data;
step 2: qualitative acquisition of data to be acquired
Information that does not contain numbers may be referred to as qualitative data. Qualitative data acquisition relies mainly on human experience and judgment, generally does not depend on tools or equipment, and only yields rough suggestions about the data to be collected. Its typical form is the survey: by communicating with the users being surveyed, rough information about the data to be collected is obtained, giving a general understanding of the object of investigation. Qualitative data is characterized by being 1) inexact: no precise value of the collected data can be given; and 2) descriptive: things are described in descriptive language;
Step 3: quantifying the sample data to be collected
Sample data belongs to the quantitative data acquisition stage, in which one or more kinds of data are collected by means of a suitable information tool to guide production or decision-making. This stage is mainly characterized by manual participation, with the data acquisition work completed with the help of equipment;
Step 4: data collection is performed based on the Scrapy technology.
Such massive data contains huge potential value, but because the data are dispersed and the knowledge fragmented, the massive information cannot be effectively organized, which greatly weakens its support for decision analysis;
A web crawler downloads web page resources from the Internet periodically or continuously according to certain URL rules; its main function is resource collection. Its working principle is as follows: starting from seed URLs, it accesses the specified web pages, downloads the page resources locally, extracts new URLs from the pages and adds them to a URL queue; according to a certain rule, the crawler then reads the next URL from the queue, accesses it and downloads its resources, and this process repeats continuously;
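The working principle just described (seed URLs, download, extract new URLs, re-queue) is a breadth-first traversal. A minimal sketch, in which `fetch` and `extract_links` are hypothetical stand-ins for real HTTP download and HTML link extraction:

```python
from collections import deque

def crawl_from_seeds(seed_urls, fetch, extract_links, max_pages=100):
    """Read a URL from the queue, download its page, append newly
    extracted URLs, and repeat until the queue drains or a page cap."""
    queue = deque(seed_urls)
    seen = set(seed_urls)            # avoid re-downloading the same URL
    downloaded = {}
    while queue and len(downloaded) < max_pages:
        url = queue.popleft()
        page = fetch(url)            # download the resource locally
        downloaded[url] = page
        for link in extract_links(page):
            if link not in seen:     # new URLs join the queue
                seen.add(link)
                queue.append(link)
    return downloaded

# An in-memory three-page "site": each page is just its list of out-links.
SITE = {"seed": ["a", "b"], "a": ["seed"], "b": []}
```

Running `crawl_from_seeds(["seed"], SITE.get, lambda page: page)` visits all three pages exactly once, even though page "a" links back to the seed.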
After years of development, web crawler technology is relatively mature and much open-source software is available: well-known examples include Apache's Nutch, SourceForge's Heritrix, and Larbin, WebLech, Arale, J-Spider, Arachnid and Pyspider.
Scrapy is another web crawling framework, written on top of Twisted (a Python-based event-driven networking framework), for extracting structured data; it can be applied to data mining, information processing, historical archiving and other tasks, and its latest version at the time of writing is 1.05. Scrapy has the following characteristics: 1) rapid deployment: the user defines the data extraction rules and Scrapy automatically performs the subsequent work; 2) easy extension: new functionality is easy to design and add without modifying the Scrapy core; 3) broad compatibility: Scrapy is a lightweight Python application that runs on Linux, Windows, Mac and BSD.
The invention uses the open-source Scrapy platform to build a customizable, theme-oriented website data acquisition platform that collects the specific column information of specific websites in a targeted manner, thereby realizing automatic monitoring of specific information.
Scrapy is composed of the engine, scheduler, downloader, spiders, item pipeline, downloader middleware, spider middleware, scheduling middleware and so on; the relationships between the components, i.e. the data flow, can be seen in the technical architecture diagram, where green indicates the direction of data flow. The Engine is responsible for controlling the data flow between the different components of the system and triggering events when specific actions occur. The Scheduler receives requests from the engine and enqueues them, returning them to the engine when the engine asks for them. The Downloader is responsible for downloading web pages and passing them, via the engine, to the spiders. Spiders are classes written by the Scrapy user that parse responses and extract specific items from them. The Item Pipeline receives the items extracted by the spiders and processes them further, including cleaning, validation and consistency management. Downloader middleware sits between the engine and the downloader, processing requests from the engine to the downloader and responses from the downloader to the engine. Spider middleware sits between the engine and the spiders, processing the spiders' input (responses) and output (items and requests).
The Scrapy crawler proceeds in the following steps: 1) Initialize Requests from the initial URLs and set a callback function; when a request has been downloaded and returns, a response is generated and passed to the callback function as a parameter. 2) Within the callback function, analyze the returned (web page) content and return Item objects, Request objects, or an iterable container including both. 3) Within the callback function, Selectors can be used to analyze the web content and generate items from the analyzed data. 4) The items returned by the spider are stored in a database or written to a file using Feed exports.
The main classes of Scrapy include Item, Spider and Selector. An Item is a container holding crawled data and can be defined by creating a scrapy.Item subclass. The Spider class defines how a web site (or group of sites) is crawled, including the crawling actions (e.g. whether to follow links) and how to extract structured data from page content. Selectors "select" a certain part of an HTML document by means of an XPath or CSS expression.
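To illustrate the idea of "selecting" part of an HTML document, here is a toy stand-in for Selectors built on the standard-library HTMLParser. Real Scrapy Selectors use XPath/CSS expressions (via the parsel library) rather than this hypothetical tag-matching class:

```python
from html.parser import HTMLParser

class TextByTag(HTMLParser):
    """Collects the text content of every element with a given tag name,
    roughly what a CSS selector like 'li::text' would return."""
    def __init__(self, tag):
        super().__init__()
        self.tag, self.depth, self.texts = tag, 0, []
    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1          # entering a matching element
    def handle_endtag(self, tag):
        if tag == self.tag and self.depth:
            self.depth -= 1          # leaving a matching element
    def handle_data(self, data):
        if self.depth and data.strip():
            self.texts.append(data.strip())

def select_text(html, tag):
    """'Select' the text of all elements named `tag` in an HTML string."""
    parser = TextByTag(tag)
    parser.feed(html)
    return parser.texts
```

For example, `select_text("<ul><li>a</li><li>b</li></ul><p>x</p>", "li")` returns only the list-item texts, ignoring the paragraph.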
The technology is realized as follows:
the script is installed via the pip install script command. Greater than python2.7 is required for scripy; after the installation is finished, configuring the script as a global variable through ln-s/usr/local/python 27/bin/script commands; a script project is newly built by script start project tutorial. A new spider is created through a script generating spider caas command, for crawling of different websites, a plurality of spiders need to be established and executed concurrently, and accordingly collection of a large amount of network resource data can be achieved.
The foregoing is a detailed description of the invention with reference to specific embodiments, and the practice of the invention is not to be construed as limited thereto. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.

Claims (5)

1. A data acquisition method based on Scrapy, characterized in that the method comprises the following steps:
step 1: determining the source of big data to be collected
From the perspective of data sources, social networks, the mobile internet and informatized enterprises are the producers of mass data. Divided according to the field in which the data are generated, data fall into three types: network data, physical-world data and scientific research test data. Network data refers to the various data generated in cyberspace by communication, shopping, learning, website browsing and the like; by user behavior it can be subdivided into social behavior data (based on SNS networks), shopping behavior data (based on e-commerce platforms), search behavior data (based on search engines), and so on. Physical-world data refers to data obtained through perception and representation by smart devices: one kind is data collection based on large equipment such as airplanes, automobiles and large agricultural machinery; the other is sensor-based decentralized acquisition of data such as temperature, humidity, pressure, sound, image, light, magnetism and voltage;
Scientific research test data refers to the massive data generated during experiments and used for scientific analysis. With the informatization of research, traditional scientific activity is gradually shifting to data-centered science, and fields such as genomics, proteomics, astrophysics, meteorology and brain science in particular generate massive data;
Step 2: qualitative acquisition of the data to be collected
Information that does not contain numbers can be called qualitative data; collecting it generally does not depend on tools or equipment and only yields rough suggestions about the data to be collected. Its typical form is the survey: by communicating with the users being surveyed, rough information about the data to be collected is obtained, giving a general understanding of the object of investigation;
Step 3: quantifying the sample data to be collected
Sample data belongs to the quantitative data acquisition stage, in which one or more kinds of data are collected by means of data acquisition tools to guide production or decision-making. This stage is mainly characterized by manual participation, with the data acquisition work completed with the help of equipment;
Step 4: data acquisition based on the Scrapy technology
Scrapy is an application framework, written on top of Twisted (a Python-based event-driven networking framework), for crawling web sites and extracting structured data; it can be applied to data mining, information processing, historical archiving and other tasks;
Scrapy mainly comprises the following components. The Engine is responsible for controlling the data flow between the different components of the system and triggering events when specific actions occur. The Scheduler receives requests from the engine and enqueues them, returning them to the engine when the engine asks for them. The Downloader is responsible for downloading web pages and passing them, via the engine, to the spiders. Spiders are classes written by the Scrapy user that parse responses and extract specific items from them. The Item Pipeline receives the items extracted by the spiders and processes them further, including cleaning, validation and consistency management. Downloader middleware sits between the engine and the downloader, processing requests from the engine to the downloader and responses from the downloader to the engine. Spider middleware sits between the engine and the spiders, processing the spiders' input (responses) and output (items and requests);
The data acquisition method of the Scrapy web crawler comprises the following steps:
1) initializing Requests from the initial URLs and setting a callback function; when a request has been downloaded and returns, a response is generated and passed to the callback function as a parameter;
2) within the callback function, analyzing the returned (web page) content and returning Item objects, Request objects, or an iterable container including both;
3) within the callback function, Selectors may be used to analyze the contents of the web page and generate items from the analyzed data;
4) storing the items returned by the spider in a database, or writing them to a file using Feed exports. The main classes of Scrapy include Item, Spider and Selector. An Item is a container holding crawled data and can be defined by creating a scrapy.Item subclass. The Spider class defines how a web site (or group of sites) is crawled, including the crawling actions (e.g. whether to follow links) and how to extract structured data from page content. Selectors "select" a certain part of an HTML document by means of an XPath or CSS expression.
2. The Scrapy-based data collection method of claim 1, characterized in that, when determining the source of the big data to be collected, the data are divided according to the field in which they are generated and the behavior of the user.
3. The Scrapy-based data collection method of claim 1, characterized in that qualitative acquisition of the data to be collected relies mainly on human experience and judgment, generally without tools or equipment, yielding only suggestions about the data to be collected.
4. The Scrapy-based data collection method of claim 1, characterized in that, when quantitative sample data are collected, one or more kinds of data are collected by means of a suitable information tool to guide production or decision-making.
5. The Scrapy-based data collection method of claim 1, characterized in that, when collecting data with Scrapy, the data are collected step by step using the Scrapy crawler.
CN201910040521.2A 2019-01-16 2019-01-16 Data acquisition method based on Scapy Active CN109766488B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910040521.2A CN109766488B (en) 2019-01-16 2019-01-16 Data acquisition method based on Scapy

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910040521.2A CN109766488B (en) 2019-01-16 2019-01-16 Data acquisition method based on Scapy

Publications (2)

Publication Number Publication Date
CN109766488A CN109766488A (en) 2019-05-17
CN109766488B true CN109766488B (en) 2022-09-16

Family

ID=66452455

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910040521.2A Active CN109766488B (en) 2019-01-16 2019-01-16 Data acquisition method based on Scapy

Country Status (1)

Country Link
CN (1) CN109766488B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110298386B (en) * 2019-06-10 2023-07-28 成都积微物联集团股份有限公司 Label automatic definition method based on image content
CN110826006B (en) * 2019-11-22 2021-03-19 支付宝(杭州)信息技术有限公司 Abnormal collection behavior identification method and device based on privacy data protection
CN116628248B (en) * 2023-07-21 2023-09-26 合肥焕峰智能科技有限公司 Processing method for intelligent equipment to collect image data

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017177872A1 (en) * 2016-04-11 2017-10-19 中兴通讯股份有限公司 Data collection method and apparatus, and storage medium
CN107273409A (en) * 2017-05-03 2017-10-20 广州赫炎大数据科技有限公司 A kind of network data acquisition, storage and processing method and system
CN107506502A (en) * 2017-10-10 2017-12-22 山东浪潮云服务信息科技有限公司 A kind of data collecting system and collecting method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017177872A1 (en) * 2016-04-11 2017-10-19 中兴通讯股份有限公司 Data collection method and apparatus, and storage medium
CN107273409A (en) * 2017-05-03 2017-10-20 广州赫炎大数据科技有限公司 A kind of network data acquisition, storage and processing method and system
CN107506502A (en) * 2017-10-10 2017-12-22 山东浪潮云服务信息科技有限公司 A kind of data collecting system and collecting method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
杨君 (Yang Jun) et al., "Design and Implementation of a Data Acquisition System Based on Scrapy Technology" (基于Scrapy技术的数据采集系统的设计与实现), Computer Technology and Development (计算机技术与发展), No. 10, 2018-05-16, full text *

Also Published As

Publication number Publication date
CN109766488A (en) 2019-05-17

Similar Documents

Publication Publication Date Title
CN101662493B (en) Data acquiring method, system and server of user access path
CN109766488B (en) Data acquisition method based on Scapy
CN101370024B (en) Distributed information collection method and system
EP3371665B1 (en) Distributed embedded data and knowledge management system integrated with plc historian
CN102880607A (en) Dynamic network content grabbing method and dynamic network content crawler system
CN109684370A (en) Daily record data processing method, system, equipment and storage medium
US9430579B2 (en) Hybrid web publishing system
CN109284430A (en) Visualization subject web page content based on distributed structure/architecture crawls system and method
CN103092936B (en) A kind of Internet of Things dynamic page real-time information collection method
CN103051496A (en) Monitoring method and device of monitoring point server
CN101441629A (en) Automatic acquiring method of non-structured web page information
CN103744845A (en) Method and system for WEB platform data caching
US20160203224A1 (en) System for analyzing social media data and method of analyzing social media data using the same
Vanhove et al. Tengu: An experimentation platform for big data applications
CN103593396A (en) Network resource extracting method and device based on browser
CN104166545A (en) Webpage resource sniffing method and device
CN110011827A (en) Towards doctor conjuncted multi-user's big data analysis service system and method
CN114443599A (en) Data synchronization method and device, electronic equipment and storage medium
KR20150089693A (en) Apparatus and Method for Extending Data Store System Based on Big Data Platform
CN112860844A (en) Case clue processing system, method and device and computer equipment
KR101235199B1 (en) An interface construction system and method to control low­erformance equipment using web technology
CN113515715A (en) Generating method, processing method and related equipment of buried point event code
CN117591229B (en) Device data viewing and displaying method and system based on gateway embedded Web
CN111159004A (en) Hadoop cluster simulation test method and device and storage medium
US11822566B2 (en) Interactive analytics workflow with integrated caching

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant