WO2020207022A1 - Scrapy-based data crawling method, system, terminal device and storage medium - Google Patents

Scrapy-based data crawling method, system, terminal device and storage medium (基于Scrapy的数据爬取方法、系统、终端设备及存储介质)

Info

Publication number
WO2020207022A1
WO2020207022A1 · PCT/CN2019/120540 · CN2019120540W
Authority
WO
WIPO (PCT)
Prior art keywords
file
crawler
scrapy
data
configuration parameters
Prior art date
Application number
PCT/CN2019/120540
Other languages
English (en)
French (fr)
Inventor
董润华
徐国强
邱寒
Original Assignee
深圳壹账通智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳壹账通智能科技有限公司
Publication of WO2020207022A1 publication Critical patent/WO2020207022A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques

Definitions

  • This application relates to the technical field of data crawling, and in particular to a data crawling method, system, terminal device and storage medium based on Scrapy.
  • this application proposes a data crawling method, system, terminal device, and storage medium based on Scrapy to solve the problem of poor data crawling effect during the use process based on the Scrapy framework.
  • this application proposes a data crawling method based on Scrapy.
  • the method includes the steps of: defining, in a JavaScript Object Notation (JSON) file, the configuration parameters of a crawler file based on the Scrapy framework; naming the JSON file, creating a crawler file, and naming the crawler file after the name of the JSON file; importing the configuration parameters of the JSON file into the crawler file; and running the crawler file after importing the configuration parameters to crawl web page data.
  • this application also provides a terminal device, including a memory and a processor, where the memory stores a Scrapy-based data crawling system that can run on the processor, and when the Scrapy-based data crawling system is executed by the processor, the steps of the Scrapy-based data crawling method described above are implemented.
  • this application also provides a computer-readable storage medium that stores a Scrapy-based data crawling system and computer instructions, where the Scrapy-based data crawling system can be executed by at least one processor, so that the at least one processor executes the steps of the aforementioned Scrapy-based data crawling method.
  • the Scrapy-based data crawling method, system, terminal device and storage medium proposed in this application can define the configuration parameters of the Scrapy file through a JSON file; different crawler projects only need to define a different JSON file. The JSON file gathers the configuration files required by a crawler file, so there is no need to perform parameter configuration and function modification under each level of the crawler file, which improves the efficiency of code writing, reduces the number of vulnerabilities, and improves the effect of crawling web page data.
  • Figure 1 is a schematic diagram of an optional hardware architecture of the terminal device of the present application.
  • FIG. 2 is a schematic diagram of program modules of the first embodiment of the data crawling system based on Scrapy in this application;
  • FIG. 3 is a schematic diagram of program modules of the second embodiment of the data crawling system based on Scrapy of the present application;
  • FIG. 4 is a schematic flowchart of the first embodiment of the data crawling method based on Scrapy in this application;
  • FIG. 5 is a schematic flowchart of the second embodiment of the data crawling method based on Scrapy in this application.
  • FIG. 1 is a schematic diagram of an optional hardware architecture of the terminal device 2 of the present application.
  • the terminal device 2 may include, but is not limited to, a memory 11, a processor 12, and a network interface 13 that can communicate with each other through a system bus. It should be pointed out that FIG. 1 only shows the terminal device 2 with the components 11-13, but it should be understood that it is not required to implement all the illustrated components, and more or fewer components may be implemented instead.
  • the terminal device can be implemented in various forms.
  • the terminal devices described in this application may include mobile terminals such as mobile phones, tablet computers, notebook computers, palmtop computers, personal digital assistants (Personal Digital Assistant, PDA), portable media players (Portable Media Player, PMP), navigation devices, wearable devices and pedometers, as well as fixed terminals such as digital TVs and desktop computers.
  • the memory 11 includes at least one type of readable storage medium, the readable storage medium including flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disks, optical disks, etc.
  • the memory 11 may be an internal storage unit of the terminal device 2, such as a hard disk or memory of the terminal device 2.
  • the memory 11 may also be an external storage device of the terminal device 2, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card or a flash card equipped on the terminal device 2.
  • the memory 11 may also include both the internal storage unit of the terminal device 2 and its external storage device.
  • the memory 11 is generally used to store an operating system and various application software installed in the terminal device 2, for example, the program code of the data crawling system 200 based on Scrapy.
  • the memory 11 can also be used to temporarily store various types of data that have been output or will be output.
  • the processor 12 may be a central processing unit (Central Processing Unit, CPU), controller, microcontroller, microprocessor, or other data processing chip in some embodiments.
  • the processor 12 is generally used to control the overall operation of the terminal device 2.
  • the processor 12 is used to run the program code or process data stored in the memory 11, for example, to run the Scrapy-based data crawling system 200.
  • the network interface 13 may include a wireless network interface or a wired network interface.
  • the network interface 13 is generally used to establish a communication connection between the terminal device 2 and other electronic devices.
  • this application proposes a data crawling system 200 based on Scrapy.
  • FIG. 2 is a program module diagram of the first embodiment of the data crawling system 200 based on Scrapy in this application.
  • the Scrapy-based data crawling system 200 includes a series of computer program instructions stored on the memory 11.
  • when the computer program instructions are executed by the processor 12, the Scrapy-based data crawling operations of the various embodiments of this application can be implemented.
  • the Scrapy-based data crawling system 200 may be divided into one or more modules based on the specific operations implemented by the various parts of the computer program instructions. For example, in FIG. 2, the Scrapy-based data crawling system 200 can be divided into a definition module 201, a naming module 202, an import module 203, and a crawling module 204. Among them:
  • the definition module 201 is used to define the configuration parameters of the crawler file based on the Scrapy framework in the Java script object notation JSON file.
  • JavaScript Object Notation (JSON) is a lightweight data exchange format that uses a text format completely independent of any programming language to store and represent data; it has a concise and clear hierarchical structure, is easy to read and write, and is also easy for machines to parse and generate.
  • the naming module 202 is configured to name the JSON file, create a crawler file, and name the crawler file according to the name of the JSON file.
  • for example, the JSON file may be named spidername, a crawler file may be created, and the crawler file may also be named spidername. It should be added that the created crawler file includes a crawler, an engine, a scheduler, a downloader, an entity pipeline, a default configuration level, download middleware, and crawling middleware.
  • it should also be noted that the English name of the crawler can be Spider, that of the engine can be Engine, that of the scheduler can be Scheduler, that of the downloader can be Downloader, that of the entity pipeline can be Item Pipeline, that of the default configuration level can be Setting, that of the download middleware can be Download Middleware, and that of the crawling middleware can be Spider Middleware.
  • the import module 203 is configured to import the configuration parameters of the JSON file into the crawler file.
  • the configuration parameters in the JSON file can be imported into the crawler file by running a startup file, which triggers the import.
  • the crawling module 204 is used to run a crawler file after importing configuration parameters to crawl webpage data.
  • after the configuration parameters are imported, the crawler file can determine the specific data that needs to be crawled, so that this data can be crawled quickly from the designated website. For example, after importing the configuration parameters, the crawler file determines that it needs to crawl data about Zhang San, and crawls all data about Zhang San from the designated first website.
  • the Scrapy-based data crawling system 200 includes a definition module 201, a naming module 202, an import module 203, and a crawling module 204.
  • the crawling module 204 further includes an acquisition submodule 2041 and a crawling submodule 2042. Among them:
  • the definition module 201 is used to define the configuration parameters of the crawler file based on the Scrapy framework in the Java script object notation JSON file.
  • JavaScript Object Notation (JSON) is a lightweight data exchange format that uses a text format completely independent of any programming language to store and represent data; it has a concise and clear hierarchical structure, is easy to read and write, and is also easy for machines to parse and generate.
  • the definition module 201 is further configured to define, in the JSON file, the configuration parameters of each level in the crawler file, where the levels of the crawler file include the crawler, the engine, the scheduler, the downloader, the entity pipeline, the default configuration level, the download middleware, and the crawling middleware.
  • the crawler is mainly used to parse responses and extract the items that need to be crawled from them, that is, to extract the data that needs to be crawled.
  • the download middleware consists of specific hooks located between the engine and downloader levels. Using download middleware can achieve the following purposes: processing a request before it is sent to the downloader; changing a received response before passing it to the crawler; sending a new request instead of passing the received response to the crawler; and passing a response to the crawler without fetching the web page.
  • the entity pipeline is responsible for processing the entities extracted by the crawler. Typical tasks include cleaning, validation, and persistence, where persistence includes storing the crawled data in a database (a hedged pipeline sketch is given after this list of components).
  • the engine is responsible for controlling the data flow between all components of the system and triggering events when certain operations occur. The scheduler receives requests from the engine, queues them, and later sends the requests back to the engine when the engine needs them.
  • the downloader is responsible for fetching web pages and feeding them to the engine, which then sends them to the crawler level.
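  • For illustration only, the entity pipeline tasks described above (cleaning, validation, persistence) might be sketched as follows. The use of MongoDB via pymongo is an assumption based on the "database based on distributed file storage" mentioned below, and the MONGO_* setting names are hypothetical, not taken from the application.

```python
# Hedged sketch of an entity (item) pipeline: cleaning, validation, persistence.
# MongoDB/pymongo and the MONGO_* setting names are assumptions for illustration.
import pymongo


class EntityPipeline:
    def __init__(self, uri, db, collection):
        self.uri, self.db, self.collection = uri, db, collection

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            uri=crawler.settings.get("MONGO_URI"),
            db=crawler.settings.get("MONGO_DB"),
            collection=crawler.settings.get("MONGO_COLLECTION"),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.uri)

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # cleaning: strip whitespace from string fields
        cleaned = {k: v.strip() if isinstance(v, str) else v for k, v in dict(item).items()}
        # validation: only persist items that have at least one non-empty field
        if any(cleaned.values()):
            self.client[self.db][self.collection].insert_one(cleaned)  # persistence
        return item
```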
  • the definition module 201 is also used to define, in the JavaScript Object Notation JSON file, the starting website name, the starting website homepage, the request header, the URI address of the database based on distributed file storage, the database name, and the collection name of the crawler file based on the Scrapy framework; to define the preprocessing of followed web page links; and to define the preprocessing of the homepage, the type of the start page, the allowed domain names, the name of the link-following function, the variable names of the data to be crawled, and the matching rules, where the matching rules include an xml path language selector, a cascading style sheet selector, and a regular expression; the xml path language selector may be referred to as the xpath selector for short, and the cascading style sheet selector may be referred to as the CSS selector for short.
  • the request header refers to the request header fields, including the cache, client, cookie/login, entity, miscellaneous, transport and other header field categories.
  • for the function of each header field, please refer to the corresponding header field documentation, which will not be repeated here.
  • the web page interface includes links that can be clicked to continue following up. For example, when the crawler obtains the News.baidu.com page, clicking a follow-up link on the News.baidu.com page opens the web page corresponding to that link. An illustrative sketch of such a JSON configuration file is given below.
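  • For illustration only, a configuration file covering the parameters listed above might look like the following sketch. Every field name, URL and value here is a hypothetical assumption (including the MongoDB-style URI standing in for the "database based on distributed file storage"); the application does not specify the actual key names.

```json
{
  "spider": "spidername",
  "start_site_name": "example-news",
  "start_urls": ["https://news.example.com/"],
  "allowed_domains": ["news.example.com"],
  "headers": {"User-Agent": "Mozilla/5.0"},
  "db_uri": "mongodb://localhost:27017",
  "db_name": "crawl_db",
  "collection": "articles",
  "follow_rule": {"xpath": "//a[@class='more']/@href", "callback": "parse_item"},
  "items": {
    "title": {"xpath": "//h1/text()", "css": "h1::text", "re": "<h1>(.*?)</h1>"}
  }
}
```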
  • the naming module 202 is configured to name the JSON file, create a crawler file, and name the crawler file according to the name of the JSON file.
  • for example, the JSON file may be named spidername, a crawler file may be created, and the crawler file may also be named spidername. It should be added that the created crawler file includes a crawler, an engine, a scheduler, a downloader, an entity pipeline, a default configuration level, download middleware, and crawling middleware.
  • the import module 203 is configured to import the configuration parameters of the JSON file into the crawler file.
  • the configuration parameters in the JSON file can be imported into the crawler file by running a startup file, which triggers the import.
  • the import module 203 is further configured to receive a preset start command at the command line and, according to the received start command, import the configuration parameters of the JSON file into the crawler file.
  • for example, the start command can be Python run.py spidername. After Python run.py spidername is received at the command line, the crawler file calls the get_config() function in Scrapyuniversal/utils.py, and the parsing of the configuration information of the crawler file is started through the get_config() function.
  • here, Python is the command that runs files ending in .py, run.py is the startup file, and spidername is the name of the defined JSON file; the file name of the crawler file corresponding to the JSON file named spidername is also spidername.
  • Python run.py spidername differs from the traditional Scrapy crawl spidername startup method and can improve startup efficiency. A sketch of what such a startup file might look like is given below.
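  • For illustration only, a startup file along the lines described above might be sketched as follows. The names run.py, get_config() and Scrapyuniversal/utils.py come from the application; everything else (the configs/ path, the "settings" key, and the use of CrawlerProcess) is an assumption, not the patented implementation.

```python
# run.py -- hedged sketch of the assumed startup file, invoked as: Python run.py spidername
import json
import sys

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


def get_config(name):
    """Assumed stand-in for Scrapyuniversal/utils.py::get_config(): load the JSON file
    whose name matches the crawler file name (the configs/ path is an assumption)."""
    with open(f"configs/{name}.json", encoding="utf-8") as f:
        return json.load(f)


def run():
    name = sys.argv[1]                           # e.g. "spidername"
    config = get_config(name)                    # parse the JSON configuration
    settings = get_project_settings()            # Scrapy's default configuration (Setting level)
    settings.update(config.get("settings", {}))  # merge JSON parameters into the defaults
    process = CrawlerProcess(settings)
    process.crawl(name, config=config)           # crawler file with the same name as the JSON file
    process.start()


if __name__ == "__main__":
    run()
```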
  • the import module 203 includes: a definition submodule, used to define a startup file based on the crawler Python language in the same-level directory as the Scrapy crawler configuration file; and an import submodule, used to import the configuration parameters in the JSON file into the Scrapy file through the startup file.
  • the import submodule includes: an obtaining unit, configured to obtain the name of the crawler file from the startup file; a determining unit, configured to determine a JSON file with the same name as the crawler file, and to obtain the corresponding configuration parameters from the determined JSON file; and a merging unit, configured to merge the obtained configuration parameters into the default configuration file of the Scrapy file.
  • the same-level directory of the Scrapy configuration file includes the following levels: crawler, engine, scheduler, downloader, project pipeline, default configuration level, download middleware, and crawling middleware. Only one startup file run.py needs to be defined at the Scrapy.cfg level of the crawler Scrapy file, and there is no need to define the startup file run.py separately under each level. In addition, in the default configuration level, the robots protocol switch is set by changing ROBOTSTXT_OBEY=True to False; the default configuration level is the default configuration file of the Scrapy framework.
  • the engine is used to handle the data flow of the entire system and to trigger transactions; it is the core of the framework.
  • the scheduler is used to accept requests sent by the engine, push them into a queue, and return them when the engine requests them again; it can be thought of as a priority queue of URLs (the addresses, or links, of the web pages to be crawled), which decides what the next URL to be crawled is and removes duplicate URLs at the same time.
  • the downloader is used to download web page content and return it to the Scrapy spider.
  • the crawler is used to extract the required information, the so-called entities, from specific web pages. The user can also extract links from a page and continue to fetch the next page.
  • the project pipeline is responsible for processing the entities extracted by the crawler from the web pages; its main functions are to persist the entities, verify the validity of the entities, and remove unneeded information. After a page has been parsed by the crawler, it is sent to the project pipeline, where the data is processed in several specific orders.
  • the download middleware mainly operates when the scheduler sends requests and when web pages return response results to the spiders, and is used to modify Scrapy requests and responses.
  • the crawler middleware is similar to the download middleware, except that the spider middleware processes the response objects about to be sent to the crawler for processing, as well as the response objects or Item objects returned from the crawler.
  • the crawler middleware is a hook framework for Scrapy's spider processing mechanism, into which custom functions can be inserted to process the responses sent by the engine to the crawler and the requests and entities sent by the crawler to the engine.
  • in this way, different website projects can share the same crawler code; there is no need to re-create the project and crawler files, and only a different JSON file needs to be defined.
  • the crawling module 204 is used to run a crawler file after importing configuration parameters to crawl webpage data.
  • after the configuration parameters are imported, the crawler file can determine the specific data that needs to be crawled, so that this data can be crawled quickly from the designated website. For example, after importing the configuration parameters, the crawler file determines that it needs to crawl data about Zhang San, and crawls all data about Zhang San from the designated first website.
  • the crawling module 204 further includes:
  • the obtaining sub-module 2041 is configured to obtain, through the engine of the crawler file, the starting URL from the crawler file after the configuration parameters have been imported, and to submit the obtained URL to the scheduler;
  • the crawling sub-module 2042 is configured so that, when data needs to be downloaded from the URL, the scheduler submits the URL to the downloader through the engine, and the downloader downloads the data to be crawled according to the URL.
  • the engine obtains the starting URL in the crawler file and submits the URL to the scheduler. If data needs to be downloaded from the URL, the scheduler submits the URL to the downloader through the engine, and the downloader downloads the specified content according to the URL, that is, downloads the response body.
  • the downloaded data is handed over to the crawler file through the engine, and the crawler file can parse the downloaded data in a specified format. If the parsed data needs to be stored persistently, the crawler file hands the parsed data over to the pipeline through the engine for persistent storage. A hedged spider sketch illustrating this flow is given below.
  • the URL may be the website of a government or an enterprise. In this way, website data can be crawled quickly, and the crawled data can be downloaded and stored as a backup.
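  • For illustration only, a crawler file driven by the imported configuration might be sketched as follows. The field names follow the illustrative JSON above and are assumptions; only the general Scrapy spider API is real.

```python
# spidername.py -- hedged sketch of a crawler file built from the JSON configuration
import scrapy


class ConfigurableSpider(scrapy.Spider):
    name = "spidername"

    def __init__(self, config=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.config = config or {}
        self.start_urls = self.config.get("start_urls", [])            # starting URL(s)
        self.allowed_domains = self.config.get("allowed_domains", [])

    def parse(self, response):
        # the engine hands each downloaded response back to the spider for parsing
        rules = self.config.get("items", {})
        item = {field: response.xpath(rule["xpath"]).get()
                for field, rule in rules.items()}
        yield item                                                      # passed on to the pipeline

        follow = self.config.get("follow_rule", {}).get("xpath")
        if follow:
            for href in response.xpath(follow).getall():                # follow-up links
                yield response.follow(href, callback=self.parse)
```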
  • the crawling module 204 is further configured to crawl data through the xml path language selector; if the xml path language selector does not crawl the data, to crawl data through the cascading style sheet selector; and if the cascading style sheet selector does not match the data, to crawl the data through regular expressions.
  • XPath stands for XML Path Language, a language for locating information in structured documents, where the structured documents can be XML or HTML documents.
  • XPath uses path expressions to select nodes or node sets in an XML document; nodes are selected by following paths or steps.
  • cascading style sheet selectors include the wildcard selector, element selector, class selector, id selector, E F descendant selector, multi-element selector, sibling selector, etc.
  • a regular expression describes a string matching pattern, which can be used to check whether a string contains a certain substring, to replace matched substrings, or to extract substrings that meet a certain condition from a string.
  • matching the data to be crawled from the web page through a variety of matching methods can improve the success rate of obtaining useful data; a hedged sketch of this fallback order is given below.
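  • For illustration only, the XPath-then-CSS-then-regular-expression fallback described above might be implemented along these lines; parsel is the selector library Scrapy uses, and the function name and signature are assumptions.

```python
# Hedged sketch of the fallback matching order: XPath first, then CSS, then regex.
import re

from parsel import Selector


def extract_field(html, xpath=None, css=None, pattern=None):
    sel = Selector(text=html)
    if xpath:
        value = sel.xpath(xpath).get()
        if value:
            return value
    if css:
        value = sel.css(css).get()
        if value:
            return value
    if pattern:
        match = re.search(pattern, html)
        if match:
            return match.group(1)
    return None


# usage example:
# extract_field(body, xpath="//h1/text()", css="h1::text", pattern=r"<h1>(.*?)</h1>")
```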
  • this application also proposes a data crawling method based on Scrapy.
  • FIG. 4 is a schematic flowchart of the first embodiment of the data crawling method based on Scrapy in this application.
  • the execution order of the steps in the flowchart shown in FIG. 4 can be changed, and some steps can be omitted.
  • the method includes the following steps:
  • in step S400, the configuration parameters of the crawler file based on the Scrapy framework are defined in the JavaScript Object Notation JSON file.
  • JavaScript Object Notation (JSON) is a lightweight data exchange format that uses a text format completely independent of any programming language to store and represent data; it has a concise and clear hierarchical structure, is easy to read and write, and is also easy for machines to parse and generate.
  • in step S402, the JSON file is named, a crawler file is created, and the crawler file is named after the name of the JSON file.
  • for example, the JSON file may be named spidername, a crawler file may be created, and the crawler file may also be named spidername. It should be added that the created crawler file includes a crawler, an engine, a scheduler, a downloader, an entity pipeline, a default configuration level, download middleware, and crawling middleware.
  • it should also be noted that the English name of the crawler can be Spider, that of the engine can be Engine, that of the scheduler can be Scheduler, that of the downloader can be Downloader, that of the entity pipeline can be Item Pipeline, that of the default configuration level can be Setting, that of the download middleware can be Download Middleware, and that of the crawling middleware can be Spider Middleware.
  • in step S404, the configuration parameters of the JSON file are imported into the crawler file.
  • the configuration parameters in the JSON file can be imported into the crawler file by running a startup file, which triggers the import.
  • in step S406, the crawler file is run after the configuration parameters have been imported, and web page data is crawled.
  • after the configuration parameters are imported, the crawler file can determine the specific data that needs to be crawled, so that this data can be crawled quickly from the designated website. For example, after importing the configuration parameters, the crawler file determines that it needs to crawl data about Zhang San, and crawls all data about Zhang San from the designated first website.
  • FIG. 5 is a flowchart of the second embodiment of the data crawling method based on Scrapy in this application.
  • the steps S500-S506 of the Scrapy-based data crawling method are similar to the steps S400-S406 of the first embodiment, the difference being that the method further includes steps S508-S510.
  • the method includes the following steps:
  • in step S500, the configuration parameters of the crawler file based on the Scrapy framework are defined in the JavaScript Object Notation JSON file.
  • JavaScript Object Notation (JSON) is a lightweight data exchange format that uses a text format completely independent of any programming language to store and represent data; it has a concise and clear hierarchical structure, is easy to read and write, and is also easy for machines to parse and generate.
  • this step S500 also includes the following step: defining, in the JSON file, the configuration parameters of each level in the crawler file, where the levels of the crawler file include the crawler, the engine, the scheduler, the downloader, the entity pipeline, the default configuration level, the download middleware, and the crawling middleware.
  • the crawler is mainly used to parse responses and extract the items that need to be crawled from them, that is, to extract the data that needs to be crawled.
  • the download middleware consists of specific hooks located between the engine and downloader levels. Using download middleware can achieve the following purposes: processing a request before it is sent to the downloader; changing a received response before passing it to the crawler; sending a new request instead of passing the received response to the crawler; and passing a response to the crawler without fetching the web page.
  • the entity pipeline is responsible for processing entities extracted by the crawler. Typical tasks include cleaning, validation, and persistence, where persistence includes storing the crawled data in a database.
  • the engine is responsible for controlling the data flow between all components of the system and triggering events when certain operations occur.
  • the scheduler receives requests from the engine, queues them, and later sends the requests back to the engine when the engine needs them.
  • the downloader is responsible for fetching web pages and feeding them to the engine, which then sends them to the crawler level.
  • this step S500 also includes the following steps: defining, in the JavaScript Object Notation JSON file, the starting website name, the starting website homepage, the request header, the URI address of the database based on distributed file storage, the database name, and the collection name of the crawler file based on the Scrapy framework; defining the preprocessing of followed web page links; and defining the preprocessing of the homepage, the type of the start page, the allowed domain names, the name of the link-following function, the variable names of the data to be crawled, and the matching mode, where the matching rules include an xml path language selector, a cascading style sheet selector, and a regular expression; the xml path language selector may be referred to as the xpath selector for short, and the cascading style sheet selector may be referred to as the CSS selector for short.
  • the request header refers to the request header fields, including the cache, client, cookie/login, entity, miscellaneous, transport and other header field categories.
  • for the function of each header field, please refer to the corresponding header field documentation, which will not be repeated here.
  • the web page interface includes links that can be clicked to continue following up. For example, when the crawler obtains the News.baidu.com page, clicking a follow-up link on the News.baidu.com page opens the web page corresponding to that link.
  • in step S502, the JSON file is named, a crawler file is created, and the crawler file is named after the name of the JSON file.
  • for example, the JSON file may be named spidername, a crawler file may be created, and the crawler file may also be named spidername. It should be added that the created crawler file includes a crawler, an engine, a scheduler, a downloader, an entity pipeline, a default configuration level, download middleware, and crawling middleware.
  • in step S504, the configuration parameters of the JSON file are imported into the crawler file.
  • the configuration parameters in the JSON file can be imported into the crawler file by running a startup file, which triggers the import.
  • step S504 further includes the following step: receiving a preset start command at the command line and, according to the received start command, importing the configuration parameters of the JSON file into the crawler file.
  • for example, the start command can be Python run.py spidername. After Python run.py spidername is received at the command line, the crawler file calls the get_config() function in Scrapyuniversal/utils.py, and the parsing of the configuration information of the crawler file is started through the get_config() function.
  • here, Python is the command that runs files ending in .py, run.py is the startup file, and spidername is the name of the defined JSON file; the file name of the crawler file corresponding to the JSON file named spidername is also spidername.
  • Python run.py spidername differs from the traditional Scrapy crawl spidername startup method and can improve startup efficiency.
  • step S504 further includes the following steps: defining a startup file based on the crawler Python language in the same-level directory as the Scrapy crawler configuration file; and importing the configuration parameters in the JSON file into the Scrapy file through the startup file.
  • importing the configuration parameters in the JSON file into the Scrapy file through the startup file further includes the following steps: obtaining the name of the crawler file from the startup file; determining a JSON file with the same name as the crawler file, and obtaining the corresponding configuration parameters from the determined JSON file; and merging the obtained configuration parameters into the default configuration file of the Scrapy file.
  • the same-level directory of the Scrapy configuration file includes the following levels: crawler, engine, scheduler, downloader, project pipeline, default configuration level, download middleware, and crawling middleware.
  • only one startup file run.py needs to be defined at the Scrapy.cfg level of the crawler Scrapy file, and there is no need to define the startup file run.py separately under each level. In addition, in the default configuration Setting level, the robots protocol switch is set by changing ROBOTSTXT_OBEY=True to False.
  • the default configuration level is the default configuration file of the Scrapy framework.
  • the engine is used to handle the data flow of the entire system and to trigger transactions; it is the core of the framework.
  • the scheduler is used to accept requests sent by the engine, push them into a queue, and return them when the engine requests them again; it can be thought of as a priority queue of URLs (the addresses, or links, of the web pages to be crawled), which decides what the next URL to be crawled is and removes duplicate URLs at the same time.
  • the downloader is used to download web page content and return it to the Scrapy spider.
  • the crawler is used to extract the required information, the so-called entities, from specific web pages. The user can also extract links from a page and continue to fetch the next page.
  • the project pipeline is responsible for processing the entities extracted by the crawler from the web pages.
  • its main functions are to persist the entities, verify the validity of the entities, and remove unneeded information.
  • the download middleware mainly operates when the scheduler sends requests and when web pages return response results to the spiders, and is used to modify Scrapy requests and responses.
  • the crawler middleware is similar to the download middleware, except that the crawler middleware processes the response objects about to be sent to the crawler for processing, as well as the response objects or entity objects returned from the crawler.
  • the crawler middleware is a hook framework for Scrapy's spider processing mechanism, into which custom functions can be inserted to process the responses sent by the engine to the crawler and the responses and entities sent by the crawler to the engine.
  • in step S506, the starting URL is obtained, through the engine of the crawler file, from the crawler file after the configuration parameters have been imported, and the obtained URL is submitted to the scheduler.
  • in step S508, when data needs to be downloaded from the URL, the scheduler submits the URL to the downloader through the engine, and the downloader downloads the data to be crawled according to the URL.
  • the engine obtains the starting URL in the crawler file and submits the URL to the scheduler. If data needs to be downloaded from the URL, the scheduler submits the URL to the downloader through the engine, and the downloader downloads the specified content according to the URL, that is, downloads the response body.
  • the downloaded data is handed over to the crawler file through the engine, and the crawler file can parse the downloaded data in a specified format. If the parsed data needs to be stored persistently, the crawler file hands the parsed data over to the pipeline through the engine for persistent storage.
  • the URL may be the website of a government or an enterprise. In this way, website data can be crawled quickly, and the crawled data can be downloaded and stored as a backup.
  • step S406 in the embodiment shown in FIG. 4 may also include the following steps: crawling data through the xml path language selector; if the xml path language selector has not crawled the data, crawling the data through the cascading style sheet selector; and if the cascading style sheet selector does not match the data, crawling the data through a regular expression.
  • XPath stands for XML Path Language, a language for locating information in structured documents, where the structured documents can be XML or HTML documents.
  • XPath uses path expressions to select nodes or node sets in an XML document; nodes are selected by following paths or steps. CSS selectors include wildcard selectors, element selectors, class selectors, id selectors, E F descendant selectors, multi-element selectors, sibling selectors, etc.
  • a regular expression describes a string matching pattern, which can be used to check whether a string contains a certain substring, to replace matched substrings, or to extract substrings that meet a certain condition from a string.
  • matching the data to be crawled from the webpage through a variety of matching methods can improve the success rate of obtaining useful data.
  • the Scrapy-based data crawling method provided in this embodiment can define the configuration parameters of the Scrapy file through a JSON file; different crawler projects only need to define a different JSON file.
  • the JSON file gathers the configuration files required by a crawler file, so there is no need to go to each level of the crawler file for parameter configuration and function modification, which improves the efficiency of code writing, reduces the number of vulnerabilities, and improves the effect of crawling web page data.
  • the present application also provides a computer-readable storage medium.
  • the computer-readable storage medium may be a non-volatile computer-readable storage medium or a volatile computer-readable storage medium.
  • the computer-readable storage medium stores computer instructions, and when the computer instructions are executed on the computer, the computer executes the following steps:
  • define, in the JavaScript Object Notation JSON file, the configuration parameters of the crawler file based on the Scrapy framework; name the JSON file, create a crawler file, and name the crawler file according to the name of the JSON file; import the configuration parameters of the JSON file into the crawler file; and run the crawler file after importing the configuration parameters to crawl web page data.
  • the technical solution of the present invention, in essence or in the part that contributes to the existing technology, can be embodied in the form of a software product; the computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk or an optical disc) and includes several instructions to enable a terminal device (which can be a mobile phone, a computer, a terminal device, an air conditioner, or a network device, etc.) to execute the method described in each embodiment of the present invention.

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A Scrapy-based data crawling method, system, terminal device and storage medium. The method includes: defining, in a JavaScript Object Notation (JSON) file, the configuration parameters of a crawler file based on the Scrapy framework (S400); naming the JSON file, creating a crawler file, and naming the crawler file after the name of the JSON file (S402); importing the configuration parameters of the JSON file into the crawler file (S404); and running the crawler file after importing the configuration parameters to crawl web page data (S406). The method can define the configuration parameters of the Scrapy file through a JSON file; the JSON file gathers the configuration files required by a crawler file, which improves the efficiency of code writing, reduces the number of vulnerabilities, and improves the effect of crawling web page data.

Description

Scrapy-based data crawling method, system, terminal device and storage medium
This application claims priority to the Chinese patent application filed with the China Patent Office on April 12, 2019, with application number 201910294765.3 and entitled "Scrapy-based data crawling method, terminal device and computer-readable storage medium", the entire contents of which are incorporated in this application by reference.
Technical Field
This application relates to the technical field of data crawling, and in particular to a Scrapy-based data crawling method, system, terminal device and storage medium.
Background
With the rapid development of the information society, there is more and more data on the Internet, and web crawler technology is currently often used to crawl useful data in order to obtain useful information. In existing crawler technology, when the Scrapy-based crawler framework is used, code has to be written repeatedly to crawl multiple websites; in the process of writing code, in addition to analyzing the logic of the code, the rules of the web pages also need to be parsed, which affects the accuracy of the web page rules. In addition, the inventors realized that the function switches and points of attention of the Scrapy crawler framework are too scattered, being distributed among the files under the various levels, so that vulnerabilities are very likely to appear in the process of crawling data with the Scrapy crawler framework. Therefore, in the process of using the Scrapy framework, there is the problem that the effect of crawling data is relatively poor.
Summary
In view of this, this application proposes a Scrapy-based data crawling method, system, terminal device and storage medium to solve the problem of the relatively poor data crawling effect in the process of using the Scrapy framework.
First, to achieve the above object, this application proposes a Scrapy-based data crawling method, which includes the steps of: defining, in a JavaScript Object Notation (JSON) file, the configuration parameters of a crawler file based on the Scrapy framework; naming the JSON file, creating a crawler file, and naming the crawler file after the name of the JSON file; importing the configuration parameters of the JSON file into the crawler file; and running the crawler file after importing the configuration parameters to crawl web page data.
In addition, to achieve the above object, this application also provides a terminal device, including a memory and a processor, where the memory stores a Scrapy-based data crawling system that can run on the processor, and when the Scrapy-based data crawling system is executed by the processor, the steps of the Scrapy-based data crawling method described above are implemented.
Furthermore, to achieve the above object, this application also provides a computer-readable storage medium that stores a Scrapy-based data crawling system and computer instructions, where the Scrapy-based data crawling system can be executed by at least one processor, so that the at least one processor executes the steps of the Scrapy-based data crawling method described above.
Compared with the prior art, the Scrapy-based data crawling method, system, terminal device and storage medium proposed in this application can define the configuration parameters of the Scrapy file through a JSON file; different crawler projects only need to define a different JSON file. The JSON file gathers the configuration files required by a crawler file, so there is no need to perform parameter configuration and function modification under each level of the crawler file, which improves the efficiency of code writing, reduces the number of vulnerabilities, and improves the effect of crawling web page data.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of an optional hardware architecture of the terminal device of this application;
FIG. 2 is a schematic diagram of the program modules of the first embodiment of the Scrapy-based data crawling system of this application;
FIG. 3 is a schematic diagram of the program modules of the second embodiment of the Scrapy-based data crawling system of this application;
FIG. 4 is a schematic flowchart of the first embodiment of the Scrapy-based data crawling method of this application;
FIG. 5 is a schematic flowchart of the second embodiment of the Scrapy-based data crawling method of this application.
Detailed Description of the Embodiments
Referring to FIG. 1, it is a schematic diagram of an optional hardware architecture of the terminal device 2 of this application.
In this embodiment, the terminal device 2 may include, but is not limited to, a memory 11, a processor 12 and a network interface 13 that can be communicatively connected to each other through a system bus. It should be pointed out that FIG. 1 only shows the terminal device 2 with components 11-13, but it should be understood that not all of the illustrated components are required to be implemented, and more or fewer components may be implemented instead.
The terminal device can be implemented in various forms. For example, the terminal devices described in this application may include mobile terminals such as mobile phones, tablet computers, notebook computers, palmtop computers, personal digital assistants (Personal Digital Assistant, PDA), portable media players (Portable Media Player, PMP), navigation devices, wearable devices and pedometers, as well as fixed terminals such as digital TVs and desktop computers.
In the following description, a terminal device is taken as an example. Those skilled in the art will understand that, apart from elements especially used for mobile purposes, the construction according to the embodiments of this application can also be applied to fixed-type terminal devices.
The memory 11 includes at least one type of readable storage medium, which includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disks, optical disks, etc. In some embodiments, the memory 11 may be an internal storage unit of the terminal device 2, such as the hard disk or memory of the terminal device 2. In other embodiments, the memory 11 may also be an external storage device of the terminal device 2, such as a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card or a flash card (Flash Card) equipped on the terminal device 2. Of course, the memory 11 may also include both the internal storage unit of the terminal device 2 and its external storage device. In this embodiment, the memory 11 is generally used to store the operating system and the various kinds of application software installed in the terminal device 2, such as the program code of the Scrapy-based data crawling system 200. In addition, the memory 11 can also be used to temporarily store various types of data that have been output or will be output.
In some embodiments, the processor 12 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor or another data processing chip. The processor 12 is generally used to control the overall operation of the terminal device 2. In this embodiment, the processor 12 is used to run the program code or process the data stored in the memory 11, for example, to run the Scrapy-based data crawling system 200.
The network interface 13 may include a wireless network interface or a wired network interface, and is generally used to establish a communication connection between the terminal device 2 and other electronic devices.
So far, the hardware structure and functions of the devices related to this application have been introduced in detail. In the following, the various embodiments of this application are proposed on the basis of the above introduction.
First, this application proposes a Scrapy-based data crawling system 200.
Referring to FIG. 2, it is a program module diagram of the first embodiment of the Scrapy-based data crawling system 200 of this application.
In this embodiment, the Scrapy-based data crawling system 200 includes a series of computer program instructions stored in the memory 11, and when these computer program instructions are executed by the processor 12, the Scrapy-based data crawling operations of the various embodiments of this application can be implemented. In some embodiments, the Scrapy-based data crawling system 200 may be divided into one or more modules based on the specific operations implemented by the various parts of the computer program instructions. For example, in FIG. 2, the Scrapy-based data crawling system 200 can be divided into a definition module 201, a naming module 202, an import module 203 and a crawling module 204. Among them:
The definition module 201 is used to define, in a JavaScript Object Notation JSON file, the configuration parameters of a crawler file based on the Scrapy framework.
In this embodiment, JavaScript Object Notation (JSON) is a lightweight data exchange format that uses a text format completely independent of any programming language to store and represent data; it has a concise and clear hierarchical structure, is easy to read and write, and is also easy for machines to parse and generate.
The naming module 202 is used to name the JSON file, create a crawler file, and name the crawler file after the name of the JSON file.
For example, the JSON file may be named spidername, a crawler file may be created, and the crawler file may also be named spidername. It should be added that the created crawler file includes a crawler, an engine, a scheduler, a downloader, an entity pipeline, a default configuration level, download middleware and crawling middleware.
It should also be added that the English name of the crawler can be Spider, that of the engine can be Engine, that of the scheduler can be Scheduler, that of the downloader can be Downloader, that of the entity pipeline can be Item Pipeline, that of the default configuration level can be Setting, that of the download middleware can be Download Middleware, and that of the crawling middleware can be Spider Middleware.
The import module 203 is used to import the configuration parameters of the JSON file into the crawler file.
In this embodiment, the configuration parameters in the JSON file can be imported into the crawler file by running a startup file, which triggers the import.
The crawling module 204 is used to run the crawler file after the configuration parameters have been imported, and to crawl web page data.
In this embodiment, after the configuration parameters are imported, the crawler file can determine the specific data that needs to be crawled, so that this data can be crawled quickly from the designated website. For example, after importing the configuration parameters, the crawler file determines that it needs to crawl data about Zhang San, and crawls all data about Zhang San from the designated first website.
Referring to FIG. 3, it is a program module diagram of the second embodiment of the Scrapy-based data crawling system 200 of this application. In this embodiment, the Scrapy-based data crawling system 200 includes a definition module 201, a naming module 202, an import module 203 and a crawling module 204, and the crawling module 204 further includes an acquisition submodule 2041 and a crawling submodule 2042. Among them:
The definition module 201 is used to define, in a JavaScript Object Notation JSON file, the configuration parameters of a crawler file based on the Scrapy framework.
In this embodiment, JavaScript Object Notation (JSON) is a lightweight data exchange format that uses a text format completely independent of any programming language to store and represent data; it has a concise and clear hierarchical structure, is easy to read and write, and is also easy for machines to parse and generate.
Specifically, the definition module 201 is further used to define, in the JSON file, the configuration parameters of each level in the crawler file, where the levels of the crawler file include the crawler, the engine, the scheduler, the downloader, the entity pipeline, the default configuration level, the download middleware and the crawling middleware.
In this embodiment, the crawler is mainly used to parse responses and extract the items that need to be crawled from them, that is, to extract the data that needs to be crawled. The download middleware consists of specific hooks located between the engine and downloader levels; using download middleware can achieve the following purposes: processing a request before it is sent to the downloader; changing a received response before passing it to the crawler; sending a new request instead of passing the received response to the crawler; and passing a response to the crawler without fetching the web page. The entity pipeline is responsible for processing the entities extracted by the crawler; typical tasks include cleaning, validation and persistence, where persistence includes storing the crawled data in a database. The engine is responsible for controlling the data flow between all components of the system and triggering events when certain operations occur. The scheduler receives requests from the engine, queues them, and later sends the requests back to the engine when the engine needs them. The downloader is responsible for fetching web pages and feeding them to the engine, which then sends them to the crawler level.
Specifically, the definition module 201 is further used to define, in the JavaScript Object Notation JSON file, the starting website name, the starting website homepage, the request header, the URI address of the database based on distributed file storage, the database name and the collection name of the crawler file based on the Scrapy framework; to define the preprocessing of followed web page links; and to define the preprocessing of the homepage, the type of the start page, the allowed domain names, the name of the link-following function, the variable names of the data to be crawled and the matching mode, where the matching rules include an xml path language selector, a cascading style sheet selector and a regular expression; the xml path language selector may be referred to as the xpath selector for short, and the cascading style sheet selector may be referred to as the CSS selector for short.
The request header refers to the request header fields, including the cache, client, cookie/login, entity, miscellaneous, transport and other header field categories; for their corresponding functions, please refer to the header field documentation, which will not be repeated here.
The web page interface includes links that can be clicked to continue following up. For example, when the crawler obtains the News.baidu.com page, clicking a follow-up link on the News.baidu.com page opens the web page corresponding to that link.
The naming module 202 is used to name the JSON file, create a crawler file, and name the crawler file after the name of the JSON file.
For example, the JSON file may be named spidername, a crawler file may be created, and the crawler file may also be named spidername. It should be added that the created crawler file includes a crawler, an engine, a scheduler, a downloader, an entity pipeline, a default configuration level, download middleware and crawling middleware.
The import module 203 is used to import the configuration parameters of the JSON file into the crawler file.
In this embodiment, the configuration parameters in the JSON file can be imported into the crawler file by running a startup file, which triggers the import.
Specifically, the import module 203 is further used to receive a preset start command at the command line and, according to the received start command, import the configuration parameters of the JSON file into the crawler file.
For example, the start command can be Python run.py spidername. After Python run.py spidername is received at the command line, the crawler file calls the get_config() function in Scrapyuniversal/utils.py, and the parsing of the configuration information of the crawler file is started through the get_config() function. Here, Python is the command that runs files ending in .py, run.py is the startup file, and spidername is the name of the defined JSON file; the file name of the crawler file corresponding to the JSON file named spidername is also spidername.
In this way, the crawler file can be started simply by entering the start command Python run.py spidername at the command line. Python run.py spidername differs from the traditional Scrapy crawl spidername startup method and can improve startup efficiency.
Specifically, the import module 203 includes:
a definition submodule, used to define a startup file based on the crawler Python language in the same-level directory as the Scrapy crawler configuration file;
an import submodule, used to import the configuration parameters in the JSON file into the Scrapy file through the startup file.
Specifically, the import submodule includes:
an obtaining unit, configured to obtain the name of the crawler file from the startup file;
a determining unit, configured to determine a JSON file with the same name as the crawler file, and to obtain the corresponding configuration parameters from the determined JSON file;
a merging unit, configured to merge the obtained configuration parameters into the default configuration file of the Scrapy file.
In this embodiment, the same-level directory of the Scrapy configuration file includes the following levels: crawler, engine, scheduler, downloader, project pipeline, default configuration level, download middleware and crawling middleware; only one startup file run.py needs to be defined at the Scrapy.cfg level of the crawler Scrapy file, and there is no need to define the startup file run.py separately under each level. It should be added that, in the default configuration level, the point of attention of the robots protocol is also set, changing ROBOTSTXT_OBEY=True to False; the default configuration level is the default configuration file of the Scrapy framework (a hedged sketch of this merge is given below).
It should be further added that the engine is used to handle the data flow of the entire system and to trigger transactions; it is the core of the framework. The scheduler is used to accept requests sent by the engine, push them into a queue, and return them when the engine requests them again; it can be thought of as a priority queue of URLs (the addresses, or links, of the web pages to be crawled), which decides what the next URL to be crawled is and removes duplicate URLs at the same time. The downloader is used to download web page content and return it to the Scrapy spider. The crawler is used to extract the required information, the so-called entities, from specific web pages; the user can also extract links from a page and continue to fetch the next page.
The project pipeline is responsible for processing the entities extracted by the crawler from the web pages; its main functions are to persist the entities, verify the validity of the entities and remove unneeded information. After a page has been parsed by the crawler, it is sent to the project pipeline, where the data is processed in several specific orders. The download middleware mainly operates when the scheduler sends requests and when web pages return response results to the spiders, and is used to modify Scrapy requests and responses. The crawler middleware is similar to the download middleware, except that the spider middleware processes the response objects about to be sent to the crawler for processing, as well as the response objects or Item objects returned from the crawler. The crawler middleware is a hook framework for Scrapy's spider processing mechanism, into which custom functions can be inserted to process the responses sent by the engine to the crawler and the requests and entities sent by the crawler to the engine.
In this way, different website projects can share the same crawler code; there is no need to re-create the project and the crawler files, and only a different JSON file needs to be defined.
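For illustration only, the merge and the robots setting described above might be sketched as follows. ROBOTSTXT_OBEY is the real Scrapy setting named in the description; the "settings" key inside the JSON and the helper function are assumptions.

```python
# Hedged sketch of the merging unit: JSON parameters are folded into Scrapy's
# default configuration (the Setting level), and the robots.txt switch is turned off.
from scrapy.settings import Settings


def merge_config(settings: Settings, config: dict) -> Settings:
    settings.set("ROBOTSTXT_OBEY", False)             # default True; changed to False as described
    for key, value in config.get("settings", {}).items():
        settings.set(key, value, priority="project")  # JSON values override the defaults
    return settings
```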
The crawling module 204 is used to run the crawler file after the configuration parameters have been imported, and to crawl web page data.
In this embodiment, after the configuration parameters are imported, the crawler file can determine the specific data that needs to be crawled, so that this data can be crawled quickly from the designated website. For example, after importing the configuration parameters, the crawler file determines that it needs to crawl data about Zhang San, and crawls all data about Zhang San from the designated first website.
Specifically, the crawling module 204 further includes:
an acquisition submodule 2041, used to obtain, through the engine of the crawler file, the starting URL from the crawler file after the configuration parameters have been imported, and to submit the obtained URL to the scheduler;
a crawling submodule 2042, used so that, when data needs to be downloaded from the URL, the scheduler submits the URL to the downloader through the engine, and the downloader downloads the data to be crawled according to the URL.
In this embodiment, the engine obtains the starting URL in the crawler file and submits the URL to the scheduler. If data needs to be downloaded from the URL, the scheduler submits the URL to the downloader through the engine, and the downloader downloads the specified content according to the URL, that is, downloads the response body. The downloaded data is handed over to the crawler file through the engine, and the crawler file can parse the downloaded data in a specified format. If the parsed data needs to be stored persistently, the crawler file hands the parsed data over to the pipeline through the engine for persistent storage. In this embodiment, the URL may be the website of a government or an enterprise. In this way, website data can be crawled quickly, and the crawled data can be downloaded and stored as a backup.
Specifically, the crawling module 204 is further used to crawl data through the xml path language selector; if the xml path language selector does not crawl the data, to crawl data through the cascading style sheet selector; and if the cascading style sheet selector does not match the data, to crawl the data through regular expressions.
In this embodiment, XPath stands for XML Path Language, a language for locating information in structured documents, where the structured documents can be XML or HTML documents. XPath uses path expressions to select nodes or node sets in an XML document; nodes are selected by following paths or steps. Cascading style sheet selectors include the wildcard selector, element selector, class selector, id selector, E F descendant selector, multi-element selector, sibling selector, etc. A regular expression describes a string matching pattern, which can be used to check whether a string contains a certain substring, to replace matched substrings, or to extract substrings that meet a certain condition from a string.
In this way, matching the data to be crawled from the web page through a variety of matching methods can improve the success rate of obtaining useful data.
In addition, this application also proposes a Scrapy-based data crawling method.
Referring to FIG. 4, it is a schematic flowchart of the first embodiment of the Scrapy-based data crawling method of this application. In this embodiment, according to different requirements, the execution order of the steps in the flowchart shown in FIG. 4 can be changed, and some steps can be omitted.
The method includes the following steps:
Step S400: the configuration parameters of the crawler file based on the Scrapy framework are defined in a JavaScript Object Notation JSON file.
In this embodiment, JavaScript Object Notation (JSON) is a lightweight data exchange format that uses a text format completely independent of any programming language to store and represent data; it has a concise and clear hierarchical structure, is easy to read and write, and is also easy for machines to parse and generate.
Step S402: the JSON file is named, a crawler file is created, and the crawler file is named after the name of the JSON file.
For example, the JSON file may be named spidername, a crawler file may be created, and the crawler file may also be named spidername. It should be added that the created crawler file includes a crawler, an engine, a scheduler, a downloader, an entity pipeline, a default configuration level, download middleware and crawling middleware.
It should also be added that the English name of the crawler can be Spider, that of the engine can be Engine, that of the scheduler can be Scheduler, that of the downloader can be Downloader, that of the entity pipeline can be Item Pipeline, that of the default configuration level can be Setting, that of the download middleware can be Download Middleware, and that of the crawling middleware can be Spider Middleware.
Step S404: the configuration parameters of the JSON file are imported into the crawler file.
In this embodiment, the configuration parameters in the JSON file can be imported into the crawler file by running a startup file, which triggers the import.
Step S406: the crawler file is run after the configuration parameters have been imported, and web page data is crawled.
In this embodiment, after the configuration parameters are imported, the crawler file can determine the specific data that needs to be crawled, so that this data can be crawled quickly from the designated website. For example, after importing the configuration parameters, the crawler file determines that it needs to crawl data about Zhang San, and crawls all data about Zhang San from the designated first website.
Referring to FIG. 5, it is a flowchart of the second embodiment of the Scrapy-based data crawling method of this application. In this embodiment, steps S500-S506 of the Scrapy-based data crawling method are similar to steps S400-S406 of the first embodiment, the difference being that the method further includes steps S508-S510.
The method includes the following steps:
Step S500: the configuration parameters of the crawler file based on the Scrapy framework are defined in a JavaScript Object Notation JSON file.
In this embodiment, JavaScript Object Notation (JSON) is a lightweight data exchange format that uses a text format completely independent of any programming language to store and represent data; it has a concise and clear hierarchical structure, is easy to read and write, and is also easy for machines to parse and generate.
Specifically, step S500 further includes the following step: defining, in the JSON file, the configuration parameters of each level in the crawler file, where the levels of the crawler file include the crawler, the engine, the scheduler, the downloader, the entity pipeline, the default configuration level, the download middleware and the crawling middleware.
In this embodiment, the crawler is mainly used to parse responses and extract the items that need to be crawled from them, that is, to extract the data that needs to be crawled. The download middleware consists of specific hooks located between the engine and downloader levels; using download middleware can achieve the following purposes: processing a request before it is sent to the downloader; changing a received response before passing it to the crawler; sending a new request instead of passing the received response to the crawler; and passing a response to the crawler without fetching the web page. The entity pipeline is responsible for processing the entities extracted by the crawler; typical tasks include cleaning, validation and persistence, where persistence includes storing the crawled data in a database. The engine is responsible for controlling the data flow between all components of the system and triggering events when certain operations occur. The scheduler receives requests from the engine, queues them, and later sends the requests back to the engine when the engine needs them. The downloader is responsible for fetching web pages and feeding them to the engine, which then sends them to the crawler level.
Specifically, step S500 further includes the following steps: defining, in the JavaScript Object Notation JSON file, the starting website name, the starting website homepage, the request header, the URI address of the database based on distributed file storage, the database name and the collection name of the crawler file based on the Scrapy framework; defining the preprocessing of followed web page links; and defining the preprocessing of the homepage, the type of the start page, the allowed domain names, the name of the link-following function, the variable names of the data to be crawled and the matching mode, where the matching rules include an xml path language selector, a cascading style sheet selector and a regular expression; the xml path language selector may be referred to as the xpath selector for short, and the cascading style sheet selector may be referred to as the CSS selector for short.
The request header refers to the request header fields, including the cache, client, cookie/login, entity, miscellaneous, transport and other header field categories; for their corresponding functions, please refer to the header field documentation, which will not be repeated here.
The web page interface includes links that can be clicked to continue following up. For example, when the crawler obtains the News.baidu.com page, clicking a follow-up link on the News.baidu.com page opens the web page corresponding to that link.
Step S502: the JSON file is named, a crawler file is created, and the crawler file is named after the name of the JSON file.
For example, the JSON file may be named spidername, a crawler file may be created, and the crawler file may also be named spidername. It should be added that the created crawler file includes a crawler, an engine, a scheduler, a downloader, an entity pipeline, a default configuration level, download middleware and crawling middleware.
Step S504: the configuration parameters of the JSON file are imported into the crawler file.
In this embodiment, the configuration parameters in the JSON file can be imported into the crawler file by running a startup file, which triggers the import.
Specifically, step S504 further includes the following step: receiving a preset start command at the command line and, according to the received start command, importing the configuration parameters of the JSON file into the crawler file.
For example, the start command can be Python run.py spidername. After Python run.py spidername is received at the command line, the crawler file calls the get_config() function in Scrapyuniversal/utils.py, and the parsing of the configuration information of the crawler file is started through the get_config() function. Here, Python is the command that runs files ending in .py, run.py is the startup file, and spidername is the name of the defined JSON file; the file name of the crawler file corresponding to the JSON file named spidername is also spidername.
In this way, the crawler file can be started simply by entering the start command Python run.py spidername at the command line. Python run.py spidername differs from the traditional Scrapy crawl spidername startup method and can improve startup efficiency.
Specifically, step S504 further includes the following steps:
defining a startup file based on the crawler Python language in the same-level directory as the Scrapy crawler configuration file;
importing the configuration parameters in the JSON file into the Scrapy file through the startup file.
Specifically, importing the configuration parameters in the JSON file into the Scrapy file through the startup file further includes the following steps:
obtaining the name of the crawler file from the startup file;
determining a JSON file with the same name as the crawler file, and obtaining the corresponding configuration parameters from the determined JSON file;
merging the obtained configuration parameters into the default configuration file of the Scrapy file.
In this embodiment, the same-level directory of the Scrapy configuration file includes the following levels: crawler, engine, scheduler, downloader, project pipeline, default configuration level, download middleware and crawling middleware; only one startup file run.py needs to be defined at the Scrapy.cfg level of the crawler Scrapy file, and there is no need to define the startup file run.py separately under each level. It should be added that, in the default configuration Setting level, the point of attention of the robots protocol is also set, changing ROBOTSTXT_OBEY=True to False; the default configuration level is the default configuration file of the Scrapy framework.
It should be further added that the engine is used to handle the data flow of the entire system and to trigger transactions; it is the core of the framework. The scheduler is used to accept requests sent by the engine, push them into a queue, and return them when the engine requests them again; it can be thought of as a priority queue of URLs (the addresses, or links, of the web pages to be crawled), which decides what the next URL to be crawled is and removes duplicate URLs at the same time. The downloader is used to download web page content and return it to the Scrapy spider. The crawler is used to extract the required information, the so-called entities, from specific web pages; the user can also extract links from a page and continue to fetch the next page.
The project pipeline is responsible for processing the entities extracted by the crawler from the web pages; its main functions are to persist the entities, verify the validity of the entities and remove unneeded information. After a page has been parsed by the crawler, it is sent to the project pipeline, where the data is processed in several specific orders. The download middleware mainly operates when the scheduler sends requests and when web pages return response results to the spiders, and is used to modify Scrapy requests and responses. The crawler middleware is similar to the download middleware, except that the crawler middleware processes the response objects about to be sent to the crawler for processing, as well as the response objects or entity objects returned from the crawler. The crawler middleware is a hook framework for Scrapy's spider processing mechanism, into which you can insert custom functions to process the responses sent by the engine to the crawler and the responses and entities sent by the crawler to the engine.
In this way, different website projects can share the same crawler code; there is no need to re-create the project and the crawler files, and only a different JSON file needs to be defined.
Step S506: the starting URL is obtained, through the engine of the crawler file, from the crawler file after the configuration parameters have been imported, and the obtained URL is submitted to the scheduler.
Step S508: when data needs to be downloaded from the URL, the scheduler submits the URL to the downloader through the engine, and the downloader downloads the data to be crawled according to the URL.
In this embodiment, the engine obtains the starting URL in the crawler file and submits the URL to the scheduler. If data needs to be downloaded from the URL, the scheduler submits the URL to the downloader through the engine, and the downloader downloads the specified content according to the URL, that is, downloads the response body. The downloaded data is handed over to the crawler file through the engine, and the crawler file can parse the downloaded data in a specified format. If the parsed data needs to be stored persistently, the crawler file hands the parsed data over to the pipeline through the engine for persistent storage. In this embodiment, the URL may be the website of a government or an enterprise. In this way, website data can be crawled quickly, and the crawled data can be downloaded and stored as a backup.
It should be added that step S406 in the embodiment shown in FIG. 4 may also include the following steps: crawling data through the xml path language selector; if the xml path language selector has not crawled the data, crawling the data through the cascading style sheet selector; and if the cascading style sheet selector does not match the data, crawling the data through a regular expression.
In this embodiment, XPath stands for XML Path Language, a language for locating information in structured documents, where the structured documents can be XML or HTML documents. XPath uses path expressions to select nodes or node sets in an XML document; nodes are selected by following paths or steps. CSS selectors include wildcard selectors, element selectors, class selectors, id selectors, E F descendant selectors, multi-element selectors, sibling selectors, etc. A regular expression describes a string matching pattern, which can be used to check whether a string contains a certain substring, to replace matched substrings, or to extract substrings that meet a certain condition from a string.
In this way, matching the data to be crawled from the web page through a variety of matching methods can improve the success rate of obtaining useful data.
The Scrapy-based data crawling method provided in this embodiment can define the configuration parameters of the Scrapy file through a JSON file; different crawler projects only need to define a different JSON file. The JSON file gathers the configuration files required by a crawler file, so there is no need to perform parameter configuration and function modification under each level of the crawler file, which improves the efficiency of code writing, reduces the number of vulnerabilities, and improves the effect of crawling web page data.
This application also provides a computer-readable storage medium, which may be a non-volatile computer-readable storage medium or a volatile computer-readable storage medium. The computer-readable storage medium stores computer instructions, and when the computer instructions are run on a computer, the computer is caused to execute the following steps:
defining, in a JavaScript Object Notation JSON file, the configuration parameters of a crawler file based on the Scrapy framework;
naming the JSON file, creating a crawler file, and naming the crawler file after the name of the JSON file;
importing the configuration parameters of the JSON file into the crawler file;
running the crawler file after importing the configuration parameters, and crawling web page data.
The serial numbers of the above embodiments of the present invention are for description only and do not represent the superiority or inferiority of the embodiments.
Through the description of the above implementations, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus the necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the existing technology, can be embodied in the form of a software product; the computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk or an optical disc) and includes several instructions to enable a terminal device (which can be a mobile phone, a computer, a terminal device, an air conditioner, or a network device, etc.) to execute the methods described in the embodiments of the present invention.
The above are only preferred embodiments of the present invention and do not therefore limit the patent scope of the present invention. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present invention, or any direct or indirect application in other related technical fields, shall likewise be included within the patent protection scope of the present invention.

Claims (20)

  1. A Scrapy-based data crawling method, the method including the steps of:
    defining, in a JavaScript Object Notation JSON file, the configuration parameters of a crawler file based on the Scrapy framework;
    naming the JSON file, creating a crawler file, and naming the crawler file after the name of the JSON file;
    importing the configuration parameters of the JSON file into the crawler file;
    running the crawler file after importing the configuration parameters, and crawling web page data.
  2. The Scrapy-based data crawling method according to claim 1, wherein the step of defining, in the JavaScript Object Notation JSON file, the configuration parameters of the crawler file based on the Scrapy framework includes:
    defining, in the JSON file, the configuration parameters of each level in the crawler file, where the levels of the crawler file include a crawler, an engine, a scheduler, a downloader, an entity pipeline, a default configuration Setting level, download middleware, and crawling middleware.
  3. The Scrapy-based data crawling method according to claim 2, wherein the step of running the crawler file after importing the configuration parameters and crawling web page data includes:
    obtaining, through the engine of the crawler file, the starting URL from the crawler file after the configuration parameters have been imported, and submitting the obtained URL to the scheduler;
    when data needs to be downloaded from the URL, the scheduler submitting the URL to the downloader through the engine, and the downloader downloading the data to be crawled according to the URL.
  4. The Scrapy-based data crawling method according to claim 1, wherein the step of defining, in the JavaScript Object Notation JSON file, the configuration parameters of the crawler file based on the Scrapy framework includes:
    defining, in the JavaScript Object Notation JSON file, the starting website name, the starting website homepage, the request header, the URI address of the database based on distributed file storage, the database name, and the collection name of the crawler file based on the Scrapy framework; defining the preprocessing of followed web page links; and defining the preprocessing of the homepage, the type of the start page, the allowed domain names, the name of the link-following function, the variable names of the data to be crawled, and the matching mode, where the matching mode includes an xml path language selector, a cascading style sheet selector, and a regular expression.
  5. The Scrapy-based data crawling method according to claim 1 or 2, wherein the step of running the crawler file after importing the configuration parameters and crawling web page data includes:
    crawling data through the xml path language selector;
    if the xml path language selector does not crawl the data, crawling data through the cascading style sheet selector;
    if the cascading style sheet selector does not match the data, crawling the data through the regular expression.
  6. The Scrapy-based data crawling method according to claim 1 or 2, wherein the step of importing the configuration parameters of the JSON file into the crawler file includes:
    receiving a preset start command at the command line, and importing the configuration parameters of the JSON file into the crawler file according to the received start command.
  7. The Scrapy-based data crawling method according to claim 1 or 2, wherein the step of importing the configuration parameters of the JSON file into the crawler file includes:
    defining a startup file based on the crawler Python language in the same-level directory as the Scrapy crawler configuration file;
    importing the configuration parameters in the JSON file into the Scrapy file through the startup file.
  8. The Scrapy-based data crawling method according to claim 7, wherein the step of importing the configuration parameters in the JSON file into the Scrapy file through the startup file includes:
    obtaining the name of the crawler file from the startup file;
    determining a JSON file with the same name as the crawler file, and obtaining the corresponding configuration parameters from the determined JSON file;
    merging the obtained configuration parameters into the default configuration file of the Scrapy file.
  9. A Scrapy-based data crawling system, including:
    a definition module, used to define, in a JavaScript Object Notation JSON file, the configuration parameters of a crawler file based on the Scrapy framework;
    a naming module, used to name the JSON file, create a crawler file, and name the crawler file after the name of the JSON file;
    an import module, used to import the configuration parameters of the JSON file into the crawler file;
    a crawling module, used to run the crawler file after importing the configuration parameters and crawl web page data.
  10. The Scrapy-based data crawling system according to claim 9, wherein the definition module is specifically used to:
    define, in the JSON file, the configuration parameters of each level in the crawler file, where the levels of the crawler file include a crawler, an engine, a scheduler, a downloader, an entity pipeline, a default configuration Setting level, download middleware, and crawling middleware.
  11. The Scrapy-based data crawling system according to claim 10, wherein the crawling module further includes:
    an acquisition submodule, used to obtain, through the engine of the crawler file, the starting URL from the crawler file after the configuration parameters have been imported, and to submit the obtained URL to the scheduler;
    a crawling submodule, used so that, when data needs to be downloaded from the URL, the scheduler submits the URL to the downloader through the engine, and the downloader downloads the data to be crawled according to the URL.
  12. The Scrapy-based data crawling system according to claim 9, wherein the definition module is specifically used to:
    define, in the JavaScript Object Notation JSON file, the starting website name, the starting website homepage, the request header, the URI address of the database based on distributed file storage, the database name, and the collection name of the crawler file based on the Scrapy framework; define the preprocessing of followed web page links; and define the preprocessing of the homepage, the type of the start page, the allowed domain names, the name of the link-following function, the variable names of the data to be crawled, and the matching mode, where the matching mode includes an xml path language selector, a cascading style sheet selector, and a regular expression.
  13. The Scrapy-based data crawling system according to claim 9 or 10, wherein the crawling module is specifically used to:
    crawl data through the xml path language selector;
    if the xml path language selector does not crawl the data, crawl data through the cascading style sheet selector;
    if the cascading style sheet selector does not match the data, crawl the data through the regular expression.
  14. The Scrapy-based data crawling system according to claim 9 or 10, wherein the import module is specifically used to:
    receive a preset start command at the command line and, according to the received start command, import the configuration parameters of the JSON file into the crawler file.
  15. The Scrapy-based data crawling system according to claim 9 or 10, wherein the import module further includes:
    a definition submodule, used to define a startup file based on the crawler Python language in the same-level directory as the Scrapy crawler configuration file;
    an import submodule, used to import the configuration parameters in the JSON file into the Scrapy file through the startup file.
  16. The Scrapy-based data crawling system according to claim 15, wherein the import submodule includes:
    an obtaining unit, used to obtain the name of the crawler file from the startup file;
    a determining unit, used to determine a JSON file with the same name as the crawler file, and to obtain the corresponding configuration parameters from the determined JSON file;
    a merging unit, used to merge the obtained configuration parameters into the default configuration file of the Scrapy file.
  17. A terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the following steps:
    defining, in a JavaScript Object Notation JSON file, the configuration parameters of a crawler file based on the Scrapy framework;
    naming the JSON file, creating a crawler file, and naming the crawler file after the name of the JSON file;
    importing the configuration parameters of the JSON file into the crawler file;
    running the crawler file after importing the configuration parameters, and crawling web page data.
  18. The terminal device according to claim 17, wherein, when the processor executes the computer program to implement the defining, in the JavaScript Object Notation JSON file, of the configuration parameters of the crawler file based on the Scrapy framework, the following step is included:
    defining, in the JSON file, the configuration parameters of each level in the crawler file, where the levels of the crawler file include a crawler, an engine, a scheduler, a downloader, an entity pipeline, a default configuration Setting level, download middleware, and crawling middleware.
  19. The terminal device according to claim 18, wherein, when the processor executes the computer program to implement the running of the crawler file after importing the configuration parameters and the crawling of web page data, the following steps are included:
    obtaining, through the engine of the crawler file, the starting URL from the crawler file after the configuration parameters have been imported, and submitting the obtained URL to the scheduler;
    when data needs to be downloaded from the URL, the scheduler submitting the URL to the downloader through the engine, and the downloader downloading the data to be crawled according to the URL.
  20. A computer-readable storage medium storing computer instructions which, when run on a computer, cause the computer to execute the following steps:
    defining, in a JavaScript Object Notation JSON file, the configuration parameters of a crawler file based on the Scrapy framework;
    naming the JSON file, creating a crawler file, and naming the crawler file after the name of the JSON file;
    importing the configuration parameters of the JSON file into the crawler file;
    running the crawler file after importing the configuration parameters, and crawling web page data.
PCT/CN2019/120540 2019-04-12 2019-11-25 Scrapy-based data crawling method, system, terminal device and storage medium WO2020207022A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910294765.3 2019-04-12
CN201910294765.3A CN110147476A (zh) 2019-04-12 2019-04-12 Scrapy-based data crawling method, terminal device and computer-readable storage medium

Publications (1)

Publication Number Publication Date
WO2020207022A1 true WO2020207022A1 (zh) 2020-10-15

Family

ID=67588746

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/120540 WO2020207022A1 (zh) 2019-04-12 2019-11-25 Scrapy-based data crawling method, system, terminal device and storage medium

Country Status (2)

Country Link
CN (1) CN110147476A (zh)
WO (1) WO2020207022A1 (zh)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110147476A (zh) * 2019-04-12 2019-08-20 深圳壹账通智能科技有限公司 基于Scrapy的数据爬取方法、终端设备及计算机可读存储介质
CN111274466A (zh) * 2019-12-18 2020-06-12 成都迪普曼林信息技术有限公司 一种海外服务器非结构数据采集系统及方法
CN111414523A (zh) * 2020-03-11 2020-07-14 中国建设银行股份有限公司 一种数据获取方法和装置
CN111554272A (zh) * 2020-04-27 2020-08-18 天津大学 一种面向中文语音识别的语言模型建模方法
CN111859075A (zh) * 2020-07-30 2020-10-30 吉林大学 一种基于异步处理框架的具有自动测试功能的数据爬取方法
CN113515681A (zh) * 2021-04-30 2021-10-19 广东科学技术职业学院 基于scrapy框架的房地产数据爬虫方法及装置

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160094497A1 (en) * 2014-09-26 2016-03-31 Oracle International Corporation Building message relationships for offline operation of an enterprise application
CN107944055A (zh) * 2017-12-22 2018-04-20 成都优易数据有限公司 Crawler method for handling Web certificate authentication
CN108829792A (zh) * 2018-06-01 2018-11-16 成都康乔电子有限责任公司 Scrapy-based distributed dark web resource mining system and method
CN110147476A (zh) 2019-04-12 2019-08-20 深圳壹账通智能科技有限公司 Scrapy-based data crawling method, terminal device and computer-readable storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104978408A (zh) * 2015-08-05 2015-10-14 许昌学院 Topic crawler system based on the Berkeley DB database
CN107943838B (zh) * 2017-10-30 2021-09-07 北京大数元科技发展有限公司 Method and system for automatically obtaining xpath to generate crawler scripts

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160094497A1 (en) * 2014-09-26 2016-03-31 Oracle International Corporation Building message relationships for offline operation of an enterprise application
CN107944055A (zh) * 2017-12-22 2018-04-20 成都优易数据有限公司 Crawler method for handling Web certificate authentication
CN108829792A (zh) * 2018-06-01 2018-11-16 成都康乔电子有限责任公司 Scrapy-based distributed dark web resource mining system and method
CN110147476A (zh) 2019-04-12 2019-08-20 深圳壹账通智能科技有限公司 Scrapy-based data crawling method, terminal device and computer-readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CUI, QINGCAI: "The General Crawler of Scrapy for the Use of Scrapy Framework", GOLD DIGGING, 21 May 2018 (2018-05-21), DOI: 20200216102159X *

Also Published As

Publication number Publication date
CN110147476A (zh) 2019-08-20

Similar Documents

Publication Publication Date Title
WO2020207022A1 (zh) Scrapy-based data crawling method, system, terminal device and storage medium
US10311073B2 (en) System and method for asynchronous retrieval of information from a server to a client based on incremental user input
US10567407B2 (en) Method and system for detecting malicious web addresses
US8452925B2 (en) System, method and computer program product for automatically updating content in a cache
EP2724251B1 (en) Methods for making ajax web applications bookmarkable and crawlable and devices thereof
US9141697B2 (en) Method, system and computer-readable storage medium for detecting trap of web-based perpetual calendar and building retrieval database using the same
US20090106296A1 (en) Method and system for automated form aggregation
US7630970B2 (en) Wait timer for partially formed query
US9154522B2 (en) Network security identification method, security detection server, and client and system therefor
US9454535B2 (en) Topical mapping
US20080133460A1 (en) Searching descendant pages of a root page for keywords
WO2020211367A1 (zh) Data crawling method and apparatus, computer device, and storage medium
US20130159480A1 (en) Smart Browsing Providers
US11775518B2 (en) Asynchronous predictive caching of content listed in search results
US11489907B1 (en) Systems, methods, and computer-readable storage media for extracting data from web applications
CN109246069B (zh) 网页登录方法、装置和可读存储介质
US11436291B1 (en) Source rank metric of measuring sources of influence
CN106919600A (zh) 一种失效网址访问方法及终端
US20210342413A1 (en) Identifying code dependencies in web applications
CN112527531A (zh) 一种缓存处理方法及系统
CN109344344A (zh) 网页客户端的标识方法、服务器及计算机可读存储介质
US11687612B2 (en) Deep learning approach to mitigate the cold-start problem in textual items recommendations
US20050240662A1 (en) Identifying, cataloging and retrieving web pages that use client-side scripting and/or web forms by a search engine robot
US11048866B1 (en) Ad hoc contact data capture
US11704173B1 (en) Streaming machine learning platform

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19924550

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 02/02/2022)

122 Ep: pct application non-entry in european phase

Ref document number: 19924550

Country of ref document: EP

Kind code of ref document: A1