CN112612943A - Asynchronous processing framework-based data crawling method with automatic testing function - Google Patents

Asynchronous processing framework-based data crawling method with automatic testing function

Info

Publication number
CN112612943A
CN112612943A
Authority
CN
China
Prior art keywords
data
request
webpage
automatic test
database
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110059894.1A
Other languages
Chinese (zh)
Inventor
康辉
孙鑫
赵旭
李佳辉
卢凌锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jilin University
Original Assignee
Jilin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jilin University filed Critical Jilin University
Publication of CN112612943A publication Critical patent/CN112612943A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/958Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention belongs to the technical field of web crawlers and relates to a data crawling method with an automatic testing function based on an asynchronous processing framework. Building on a relatively mature web crawler framework in the field, the invention targets crawling tasks against websites that adopt anti-crawler strategies from the very beginning of their design, in particular websites whose data are generated dynamically by scripts, and introduces an automatic testing technique at the point where the initial request link issued by the spider project file reaches the download middleware via the engine and the queue, so that the source code of the dynamic web page can be acquired. The web page response obtained by the method has already been rendered by the scripts, and the automatic testing technique makes it possible to control the browser in a user-defined way to complete a series of chained operations. This saves developers the framework analysis of the target website, reduces the project development difficulty, lets more time be spent on web page parsing, improves the quality of the crawled data, and shortens the project development cycle.

Description

Asynchronous processing framework-based data crawling method with automatic testing function
Technical Field
The invention belongs to the technical field of web crawlers, relates to a data crawling method based on an asynchronous processing framework, and particularly relates to a data crawling method with an automatic testing function based on the asynchronous processing framework.
Background
With the advent of the big data age, web crawlers occupy an increasingly important position on the internet. The data on the internet is massive, and how to acquire the information people are interested in automatically and efficiently is an important problem; crawler technology exists to solve it. As big data analysis, data mining, natural language processing and other artificial intelligence technologies continue to develop, their rapid progress presupposes the availability of data, and of high-quality data. Web crawlers not only solve the problem of data acquisition but also extract structured data from irregular web pages, and breakthroughs in this technology therefore play a significant role.
Web crawlers can be divided into personal crawlers and enterprise crawlers, but whether for an individual or an enterprise, web crawlers are an integral part of many project workflows. With the continued spread of web crawler applications, a large number of open source crawler frameworks have emerged, such as the Pyspider framework and the Scrapy framework, which are by now relatively mature.
With the development of web crawler technology, a crawler program in actual development is often a distributed crawler deployed on servers in order to improve efficiency. Because multiple computers with different physical addresses run the program simultaneously, de-duplicating the access links in the request queue during operation becomes a problem that must be considered first; compared with traditional crawler tools, existing crawler frameworks provide a scheduler structure that can solve this problem. On the other hand, in order to protect the privacy of site resources, the major data-source websites add anti-crawler strategies from the very beginning of website design: most of a website's data is generated dynamically by JavaScript scripts, and the server can identify whether the resources are being accessed by a script in order to judge whether the visitor is a real user. This places new requirements on existing crawler frameworks.
Disclosure of Invention
The invention aims to solve the above problems and provides a data crawling method with an automatic testing function based on an asynchronous processing framework.
The purpose of the invention is realized by the following technical scheme:
the method comprises the following steps:
A. determining information required for requesting a target website
The method comprises a target website request link, a user agent, a request mode and a request parameter;
B. determining web page loading characteristics
Checking a webpage source code, and determining whether the source code is consistent with the content presented by the current webpage;
C. determining code segment regions to crawl data
Positioning the position of the data needing to be crawled of the webpage and information of each field;
D. deploying unstructured database information
Determining an unstructured database address, a port and a database name for storing crawl data;
E. configuring a Selenium automatic test tool
Selenium is an automated testing tool for web applications; Selenium tests run directly in the browser and imitate a user's operations on the browser;
installing a Selenium toolkit and a browser driver of a corresponding version;
F. building a crawler framework based on the Scrapy technology
The Scrapy framework is a fast, high-level web crawling framework for Python; its modules are loosely coupled and highly extensible, and it is used to crawl websites and extract structured data from web pages.
Further, in step A, determining the information required to request the target website includes the following steps:
a1, opening a browser developer mode of a target website, clicking a 'Network' tab, and refreshing a current page;
a2, clicking an item consistent with the path of the navigation bar of the page browser;
a3, recording information of website request links, user agents, request modes and request parameters on the right side of the developer mode window.
Further, in step B, said determining the loading characteristics of the web page includes the following steps:
b1, opening the source code of the target webpage;
b2, comparing the data to be crawled in the target web page with the content of the corresponding tags in the source code and judging whether they are the same; if they are the same, the front-end web page is static; if they are different, it is a dynamic web page rendered by JavaScript scripts and possibly encryption algorithms.
Further, in step C, the determining the area where the crawling data is located includes the following steps:
c1, opening a browser developer mode of the target website, clicking an Elements tab, and displaying the webpage source code after script rendering;
c2, finding the data needed to be crawled in turn through the function of automatic code positioning of the browser, declaring field names respectively, and recording positioning elements corresponding to the code field areas.
Further, in step D, the deploying unstructured database information comprises the following steps:
d1, the unstructured database may be deployed on a local computer or on a server; the database can be connected to as long as the database address and the port number designated during deployment are known;
d2, connecting the deployed unstructured databases, creating the databases used for storing the crawled data and recording the names of the databases;
further, in step F, the Scapy framework is divided into the following parts:
the Engine is mainly responsible for transmitting data and signals among different modules of the whole system;
an Item, which defines the data structure of the crawled information;
the Scheduler receives the request sent by the engine and adds the request into the queue;
the Downloader downloads the webpage content sent by the engine and returns the webpage content;
the spider Spiders define crawling logic and analysis rules and generate extraction results and new requests;
an Item pipe Item Pipeline, which is responsible for processing results extracted from the web page by the spider, performing data cleansing and storage, and the like;
middleware Middlewares, comprising download middleware and spider middleware, are hooks positioned between the engine and the downloader and between the engine and the spiders; they implement customized request and download extensions and filter the returned response results.
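For orientation, the following is a minimal sketch of how these parts surface in a Scrapy project; it is illustrative only, and all names and selectors in it are assumptions rather than part of the claimed method:
import scrapy

class ExampleItem(scrapy.Item):
    # Item: defines the data structure of the crawled information
    title = scrapy.Field()

class ExampleSpider(scrapy.Spider):
    # Spider: defines the crawling logic and parsing rules
    name = 'example'
    start_urls = ['https://example.com/']

    def parse(self, response):
        # the engine routes the request through the scheduler and downloader;
        # the downloaded response arrives here for parsing
        for text in response.css('h1::text').getall():
            yield ExampleItem(title=text)              # handed to the item pipeline
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, self.parse)    # new request, queued by the scheduler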
Further, in step F, the crawler framework based on the Scrapy technology is built as follows:
f1, setting the recorded request information in the initial function of the spider project file, passing it together with the website request link as parameters of the request function, and accessing the initial link in an iterative loop;
f2, the request sent by the spider project file reaches the download middleware through the engine and the queue, where an automatic testing technique is introduced in order to acquire the dynamic web page source code and to control the browser in a user-defined way to perform a series of chained operations;
f3, defining an item class, in which the collection name of the stored data is defined and a Field class variable is declared for each recorded field of the required data;
f4, a callback function corresponding to the response result is appointed by a callback parameter in the request function;
f5, configuring parameters in the setting file of the project, including the use priority of the downloading middleware and the pipeline middleware, the database address and the database name, the timeout time, the number of accessed web pages and the number of concurrent requests.
Further, the operation in the callback function includes:
1) analyzing a source webpage, acquiring links of a list page to perform deeper page access, namely requesting a new link through iterative operation every time a link is acquired, and enabling a callback parameter to point to other callback functions;
2) and analyzing the detail page, screening useful information contained in the response result according to each recorded field of the required data and the corresponding positioning element, declaring a defined item class object to store a corresponding field value, and then transmitting the field value to an item pipeline through iterative operation to store the field value in a database.
Further, in step B2, when the target web page is determined to be a dynamic web page rendered by JavaScript and encryption algorithms, an automatic testing technique needs to be introduced into the download middleware before the request links of the target web page and related web pages are downloaded, so that the result of the web page rendered by the JavaScript or encryption algorithms is returned.
Compared with prior methods, the invention has the following improvement: by introducing an automatic testing technique into the download middleware, the crawler project can obtain web pages processed by JavaScript scripts and encryption algorithms, developers are spared the structural analysis of complex web pages, the project development difficulty is reduced, and the development cycle is shortened.
Drawings
FIG. 1 is a diagram of an asynchronous processing framework of the present invention incorporating automatic test techniques.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. The following examples are presented only for further understanding and implementation of the technical solution of the present invention, and do not constitute a further limitation to the claims of the present invention, therefore, all other examples obtained by one of ordinary skill in the art without creative efforts based on the examples of the present invention shall fall within the protection scope of the present invention.
The invention provides a data crawling method with an automatic testing function based on an asynchronous processing framework. By combining the Scrapy crawler framework with the Selenium automatic testing technique, it solves the problem that a front-end dynamic web page cannot yield a response result through the crawler framework alone, so that the target web page can then be parsed to obtain structured data. This saves developers the framework analysis of the target website, reduces the project development difficulty, lets more time be spent on web page parsing, improves the quality of the crawled data, and shortens the project development cycle.
The following example illustrates a specific embodiment of the present invention, and it should be understood that the example is only illustrative and not restrictive. The target website of the embodiment is the China Judgements Online website ("中国裁判文书网"), and the crawling target is all data on the home page of a specified court, i.e. the detailed information of all of that court's judgment documents, amounting to 600 records (the website only allows a user to view 600 records).
A data crawling method with an automatic test function based on an asynchronous processing framework comprises the following steps:
Step A, determining the information required to request the target website comprises the following steps:
a1, opening a browser developer mode of a target website, clicking a 'Network' tab, and refreshing a current page;
a2, clicking the item (generally the first one) consistent with the path in the page browser's navigation bar;
A3, recording information of website request links, user agents, request modes and request parameters on the right side of the developer mode window.
Step B, the step of determining the webpage loading characteristics comprises the following steps:
b1, opening the source code of the target webpage;
b2, comparing the data to be crawled in the target web page with the content of the corresponding tags in the source code, it is found that the information visible to the user in the web page does not appear in the web page source code; the China Judgements Online website is therefore judged to be a dynamic web page rendered by JavaScript and encryption algorithms, so the downloader cannot directly return the real web page result, and the web page source code containing the required data can only be obtained through dynamic rendering with the help of an automatic testing technique.
Step C, the step of determining the area of the crawled data comprises the following steps:
c1, opening a browser developer mode of the target website, and clicking an Elements tab to display the webpage source codes after the script is rendered;
c2, automatically positioning the code through the browser, finding each data to be crawled in turn, declaring the field name, and recording the xpath path or css selector of the corresponding code field area.
Crawling of the "chinese referee's paperweb" involves two pages in total, namely, a paperlist page and a paperdetail page.
Firstly, the link of each document in the document list page needs to be acquired, and the document detail page is continuously accessed iteratively after the link is acquired, wherein the record of the link is as follows: linking: // a [ @ class ═ caseName "]/@ href;
and (3) linking and accessing the document detail page by each document in the document list, then acquiring each part of information in the document detail page, declaring field names for each part of information, and recording an xpath path, wherein each part of information is recorded as follows:
Title: ./div[@class="PDF_title"],
Release time: .PDF_cut > div:nth-child(1) > table:nth-child(1) > tr:nth-child(1) > td:nth-child(1),
Browsing amount: .PDF_cut > div:nth-child(1) > table:nth-child(1) > tr:nth-child(1) > td:nth-child(2),
Court: ./div[@class="PDF_pox"]/div[1].
Because the writing of the documents is not standardized, the other fields do not have a fixed XPath or CSS selector format and need to be obtained through loops and conditional judgments; these fields include the document type, the parties, the case prelude, the trial reasons, the trial result, the presiding judge, the judges, the trial time, the judge assistant and the court clerk.
Step D, the information of the deployment unstructured database comprises the following steps:
d1, deploying the MongoDB database in a local computer, determining that the address of the database is localhost and the designated port number is 27017;
d2, connecting to the deployed MongoDB database and creating a database "test" that is used to store the crawled data.
Step E, configuring a Selenium automatic test tool
Selenium is an automated testing tool for web applications; Selenium tests run directly in the browser and imitate a user's operations on the browser;
Selenium is installed using Python's package installation command pip, namely pip install selenium; the version number of the Chrome browser is checked, the browser driver .exe file of the corresponding version is downloaded, and it is saved into Python's Scripts directory.
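A quick way to check this configuration is to launch the driver once from Python; the sketch below is illustrative only, and the URL is a placeholder rather than the target website of the embodiment:
from selenium import webdriver

browser = webdriver.Chrome()         # locates the browser driver saved in Python's Scripts directory / on PATH
browser.get('https://example.com')   # placeholder URL used only to confirm that rendering works
print(browser.title)
browser.quit()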
Step F, building a crawler framework based on the Scrapy technology:
f1, setting the recorded request information in the initial function of the spider project file, passing it together with the website request link as parameters of the request function, and accessing the initial link in an iterative loop using the yield operation.
The initial link contains request parameters encoded in Unicode, and the request is initiated through the Request method built into the Scrapy framework; the specific operation is as follows:
yield Request(url=self.start_urls, callback=self.parse_origin, meta={'tag': 0}, dont_filter=True)
f2, the request sent by the spider project file reaches the download middleware through the engine and the queue, where an automatic testing technique is introduced in order to acquire the dynamic web page source code and to control the browser in a user-defined way to perform a series of chained operations;
The Selenium toolkit used is first imported, as follows:
from selenium import webdriver;
An object of the Chrome browser used to perform the test is then declared, as follows:
self.browser=webdriver.Chrome();
access is performed again through self.
For the document list page, 600 records are to be displayed, but a single page can only show 5 records, which would require 120 page accesses. However, by using Selenium to change the number of documents displayed per page to 600 and then selecting that option, the document list page only needs to be accessed once to obtain the links of all 600 documents, which greatly improves the execution efficiency of the project; the operation is as follows:
import time
from selenium.webdriver.support.ui import Select  # imports required by the operations below

self.browser.execute_script("""document.querySelector("select.pageSizeSelect option:nth-child(3)").text="600";""")
time.sleep(4)
driver = self.browser.find_element_by_xpath('//div[@class="WS_my_pages"]/select[@class="pageSizeSelect"]')
sel = Select(driver)
sel.select_by_visible_text('600')
f3, defining an item class, in which the collection name of the stored data is defined and a Field class variable is declared for each recorded field of the required data;
According to the division of the document detail page information, the following fields are declared in the item class:
title=Field()
release=Field()
views=Field()
court=Field()
type=Field()
prelude=Field()
parties=Field()
justification=Field()
end=Field()
chief=Field()
judge=Field()
time=Field()
assistant=Field()
clerk=Field()
In addition, a variable collection recording the collection name is declared as 'wenshu'.
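Putting the declarations above together, the item class would look roughly as follows; this is a reconstruction from the listed fields and names, not the patent's verbatim code:
from scrapy import Item, Field

class WenshuItem(Item):
    collection = 'wenshu'   # name of the MongoDB collection that will store the documents
    title = Field()
    release = Field()
    views = Field()
    court = Field()
    type = Field()
    prelude = Field()
    parties = Field()
    justification = Field()
    end = Field()
    chief = Field()
    judge = Field()
    time = Field()
    assistant = Field()
    clerk = Field()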
F4, the callback parameter in the request function specifies the callback function corresponding to the response result; the operations in the callback function include:
1) Parsing the source web page: the list-page links are obtained in order to access deeper pages, i.e. every time a link is obtained a new request is issued through the yield iteration operation, with the callback parameter pointing to another callback function;
The initial request function Request has a parameter callback, which is used to return the response result of the currently accessed link as a parameter to the specified parsing function, namely parse_origin.
The set of all document links is acquired through an XPath path, as follows:
urls = response.xpath('//a[@class="caseName"]/@href').extract()
Then, by looping over the link set combined with the yield iteration operation, the detail pages of all documents can be accessed in turn, as follows:
for url in urls:
    # build the full detail-page link from the extracted href (the derivation is assumed here for completeness)
    target_url = response.urljoin(url)
    yield Request(url=target_url, callback=self.parse_detail, meta={'tag': 1}, dont_filter=False)
The callback function becomes parse_detail, i.e. the response result is handed to the detail page parsing function of the new document.
2) Parsing the detail page: the useful information contained in the response result is screened according to the recorded fields of the required data and the corresponding XPath path or CSS selector; an object of the defined item class is declared to store the corresponding field values, which are then passed to the item pipeline through the yield iteration and stored in the database.
An object of the item class is declared, item = WenshuItem(), and the corresponding field data are then screened through an XPath path or a CSS selector, as follows:
box = response.xpath('//div[@class="PDF_box"]')
item['title'] = box.xpath('./div[@class="PDF_title"]/text()').extract_first().replace('\n', '')
item['court'] = box.xpath('./div[@class="PDF_pox"]/div[1]/text()').extract_first()
Because the writing of the documents is not standardized, the other fields are extracted from the response result with for loops and conditional judgments that intercept and match the text; finally the item object of each document is emitted through the yield item operation and passed into the item pipeline, which executes the subsequent operation of storing it in the database.
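As an illustration only, the loop-and-condition matching for these irregular fields might look like the sketch below; the label strings and the paragraph structure are assumptions made for the example and are not taken from the patent:
# scan the rendered text blocks of the document body and match label keywords
paragraphs = [t.strip() for t in box.xpath('.//text()').extract() if t.strip()]
for text in paragraphs:
    if text.startswith('审判长'):        # presiding judge
        item['chief'] = text[len('审判长'):].strip()
    elif text.startswith('审判员'):      # judge
        item['judge'] = text[len('审判员'):].strip()
    elif text.startswith('法官助理'):    # judge assistant
        item['assistant'] = text[len('法官助理'):].strip()
    elif text.startswith('书记员'):      # court clerk
        item['clerk'] = text[len('书记员'):].strip()
yield item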
First, the MongoDB toolkit is imported (import pymongo), the database under the specified path is connected, and the selected database name is specified, as follows:
self.client = pymongo.MongoClient(self.mongo_uri)
self.db = self.client[self.mongo_db]
Then an insert operation is performed on the item object; in order to avoid the problem of mismatched object types, it must first be determined whether the item object belongs to the WenshuItem class, as follows:
if isinstance(item, WenshuItem):
    self.db[item.collection].insert(dict(item))
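In a Scrapy project this storage logic normally lives in an item pipeline class; the sketch below uses the pipeline name and settings that appear in this embodiment, but its method bodies are assumed rather than quoted from the patent:
import pymongo

from wenshu.items import WenshuItem   # assumed module path for the item class

class WsMongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # MONGO_URI and MONGO_DB come from the settings configured in step F5
        return cls(mongo_uri=crawler.settings.get('MONGO_URI'),
                   mongo_db=crawler.settings.get('MONGO_DB'))

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        if isinstance(item, WenshuItem):
            # insert() mirrors the code above; newer pymongo versions prefer insert_one()
            self.db[item.collection].insert(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()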
f5, configuring parameters in the project setting file, including the use priority of the download middleware and the pipeline middleware, the database address and the database name, the timeout time, the number of accessed web pages and the number of concurrent requests.
Parameters required in the project are configured in the setting file, so that the project is convenient to manage, and the robustness of the project is enhanced. Some of these parameters include:
Number of pages visited: MAX_PAGE = 2
Number of concurrent accesses: CONCURRENT_REQUESTS = 3
Selenium test timeout: SELENIUM_TIMEOUT = 60
Access path of MongoDB: MONGO_URI = 'localhost'
Name of the MongoDB storage database: MONGO_DB = 'test'
Item pipeline usage priority: ITEM_PIPELINES = {
'wenshu.pipelines.WsMongoPipeline': 300,
'wenshu.pipelines.MongoPipeline': 302}
Download middleware priority: DOWNLOADER_MIDDLEWARES = {
'wenshu.middlewares.WenshuSeleniumMiddleware': 543,
'scrapy_splash.SplashCookiesMiddleware': 723,
'scrapy_splash.SplashMiddleware': 725}
If a configured parameter value is to be acquired, the name of the corresponding parameter is used; for example, the number of page accesses can be read into a variable such as pageSize, as in the sketch below.
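A minimal sketch of reading these settings inside the spider, assuming the standard Scrapy settings API (the exact statements used in the original project are not reproduced in the text, so the lines below are illustrative):
pageSize = self.settings.get('MAX_PAGE')                      # e.g. 2, as configured above
mongo_uri = self.settings.get('MONGO_URI')                    # e.g. 'localhost'
selenium_timeout = self.settings.getint('SELENIUM_TIMEOUT')   # e.g. 60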
While embodiments of the invention have been disclosed above, it is not intended to be limited to the uses set forth in the specification and examples. It can be applied to all kinds of fields suitable for the present invention. Additional modifications will readily occur to those skilled in the art. It is therefore intended that the invention not be limited to the exact details and illustrations described and illustrated herein, but fall within the scope of the appended claims and equivalents thereof.

Claims (9)

1. A data crawling method with an automatic test function based on an asynchronous processing framework comprises the following steps:
A. determining information required for requesting a target website
The method comprises a target website request link, a user agent, a request mode and a request parameter;
B. determining web page loading characteristics
Checking a webpage source code, and determining whether the source code is consistent with the content presented by the current webpage;
C. determining code segment regions to crawl data
Positioning the position of the data needing to be crawled of the webpage and information of each field;
D. deploying unstructured database information
Determining an unstructured database address, a port and a database name for storing crawl data;
E. configuring a Selenium automatic test tool
Selenium is an automated testing tool for web applications; Selenium tests run directly in the browser and imitate a user's operations on the browser;
installing a Selenium toolkit and a browser driver of a corresponding version;
F. building a crawler framework based on the Scrapy technology
The Scrapy framework is a fast, high-level web crawling framework for Python; its modules are loosely coupled and highly extensible, and it is used to crawl websites and extract structured data from web pages.
2. The data crawling method with automatic test function based on asynchronous processing framework as claimed in claim 1, wherein:
step A, the step of determining the information required by the request target website comprises the following steps:
a1, opening a browser developer mode of a target website, clicking a 'Network' tab, and refreshing a current page;
a2, clicking an item consistent with the path of the navigation bar of the page browser;
a3, recording information of website request links, user agents, request modes and request parameters on the right side of the developer mode window.
3. The data crawling method with automatic test function based on asynchronous processing framework as claimed in claim 1, wherein:
step B, the step of determining the webpage loading characteristics comprises the following steps:
b1, opening the source code of the target webpage;
b2, comparing the data to be crawled in the target web page with the content of the corresponding tags in the source code and judging whether they are the same; if they are the same, the front-end web page is static; if they are different, it is a dynamic web page rendered by JavaScript scripts and possibly encryption algorithms.
4. The data crawling method with automatic test function based on asynchronous processing framework as claimed in claim 3, wherein:
Step B2, when the target web page is determined to be a dynamic web page rendered by JavaScript and encryption algorithms, an automatic testing technique needs to be introduced into the download middleware before the request links of the target web page and related web pages are downloaded, so that the result of the web page rendered by the JavaScript or encryption algorithms is returned.
5. The data crawling method with automatic test function based on asynchronous processing framework as claimed in claim 1, wherein:
step C, the code segment area of the crawling data is determined to comprise the following steps:
c1, opening a browser developer mode of the target website, clicking an Elements tab, and displaying the webpage source code after script rendering;
c2, finding the data needed to be crawled in turn through the function of automatic code positioning of the browser, declaring field names respectively, and recording positioning elements corresponding to the code field areas.
6. The data crawling method with automatic test function based on asynchronous processing framework as claimed in claim 1, wherein:
step D, the information of the deployment unstructured database comprises the following steps:
d1, the unstructured database may be deployed on a local computer or on a server; the database can be connected to as long as the database address and the port number designated during deployment are known;
d2, connecting the deployed unstructured database, creating the database used to store the crawled data and recording its name.
7. The data crawling method with automatic test function based on asynchronous processing framework as claimed in claim 1, wherein:
Step F, the Scrapy framework is divided into the following parts:
the Engine is mainly responsible for transmitting data and signals among different modules of the whole system;
an Item, which defines the data structure of the crawled information;
the Scheduler receives the request sent by the engine and adds the request into the queue;
the Downloader downloads the webpage content sent by the engine and returns the webpage content;
the spider Spiders define crawling logic and analysis rules and generate extraction results and new requests;
an Item pipe Item Pipeline, which is responsible for processing results extracted from the web page by the spider, performing data cleansing and storage, and the like;
middleware Middlewares, comprising download middleware and spider middleware, are hooks positioned between the engine and the downloader and between the engine and the spiders; they implement customized request and download extensions and filter the returned response results.
8. The data crawling method with automatic test function based on asynchronous processing framework as claimed in claim 1 or 7, wherein:
Step F, building the crawler framework based on the Scrapy technology:
f1, setting the recorded request information in the initial function of the spider project file, passing it together with the website request link as parameters of the request function, and cyclically accessing the initial link using the iterative operation;
f2, the request sent by the spider project file reaches the download middleware through the engine and the queue, where an automatic testing technique is introduced in order to acquire the dynamic web page source code and to control the browser in a user-defined way to perform a series of chained operations;
f3, defining an item class, in which the collection name of the stored data is defined and a Field class variable is declared for each recorded field of the required data;
f4, a callback function corresponding to the response result is appointed by a callback parameter in the request function;
f5, configuring parameters in the setting file of the project, including the use priority of the downloading middleware and the pipeline middleware, the database address and the database name, the timeout time, the number of accessed web pages and the number of concurrent requests.
9. The data crawling method with automatic test function based on asynchronous processing framework as claimed in claim 8, wherein:
the operation in the callback function in step F4 includes:
1) analyzing a source webpage, acquiring links of a list page to perform deeper page access, namely requesting a new link through iterative operation every time a link is acquired, and enabling a callback parameter to point to other callback functions;
2) and analyzing the detail page, screening useful information contained in the response result according to each recorded field of the required data and the corresponding positioning element, declaring a defined item class object to store a corresponding field value, and then transmitting the field value to an item pipeline through iterative operation to store the field value in a database.
CN202110059894.1A 2020-07-30 2021-01-18 Asynchronous processing framework-based data crawling method with automatic testing function Pending CN112612943A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010747535.0A CN111859075A (en) 2020-07-30 2020-07-30 Asynchronous processing framework-based data crawling method with automatic testing function
CN2020107475350 2020-07-30

Publications (1)

Publication Number Publication Date
CN112612943A true CN112612943A (en) 2021-04-06

Family

ID=72945639

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202010747535.0A Pending CN111859075A (en) 2020-07-30 2020-07-30 Asynchronous processing framework-based data crawling method with automatic testing function
CN202110059894.1A Pending CN112612943A (en) 2020-07-30 2021-01-18 Asynchronous processing framework-based data crawling method with automatic testing function

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN202010747535.0A Pending CN111859075A (en) 2020-07-30 2020-07-30 Asynchronous processing framework-based data crawling method with automatic testing function

Country Status (1)

Country Link
CN (2) CN111859075A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116719986A (en) * 2023-08-10 2023-09-08 深圳传趣网络技术有限公司 Python-based data grabbing method, device, equipment and storage medium

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113515681A (en) * 2021-04-30 2021-10-19 广东科学技术职业学院 Real estate data crawler method and device based on script framework
CN114969474A (en) * 2022-03-31 2022-08-30 安徽希施玛数据科技有限公司 Webpage data acquisition method, webpage data acquisition device and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109033115A (en) * 2017-06-12 2018-12-18 广东技术师范学院 A kind of dynamic web page crawler system
CN110147476A (en) * 2019-04-12 2019-08-20 深圳壹账通智能科技有限公司 Data crawling method, terminal device and computer readable storage medium based on Scrapy

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109033115A (en) * 2017-06-12 2018-12-18 广东技术师范学院 A kind of dynamic web page crawler system
CN110147476A (en) * 2019-04-12 2019-08-20 深圳壹账通智能科技有限公司 Data crawling method, terminal device and computer readable storage medium based on Scrapy

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ANPEDESTRIAN: "scrapy+selenium之中国裁判文书网文书爬取" (Crawling documents from China Judgements Online with scrapy+selenium), 《CSDN 博客》 (CSDN Blog) *
游攀利 et al.: "基于Scrapy的水利数据爬虫设计与实现" (Design and implementation of a Scrapy-based water conservancy data crawler), 《水利水电快报》 (Express Water Resources & Hydropower Information) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116719986A (en) * 2023-08-10 2023-09-08 深圳传趣网络技术有限公司 Python-based data grabbing method, device, equipment and storage medium
CN116719986B (en) * 2023-08-10 2023-12-26 深圳传趣网络技术有限公司 Python-based data grabbing method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN111859075A (en) 2020-10-30

Similar Documents

Publication Publication Date Title
US20210294727A1 (en) Monitoring web application behavior from a browser using a document object model
Khalil et al. RCrawler: An R package for parallel web crawling and scraping
CN112612943A (en) Asynchronous processing framework-based data crawling method with automatic testing function
CN112637361B (en) Page proxy method, device, electronic equipment and storage medium
US11263062B2 (en) API mashup exploration and recommendation
US8812551B2 (en) Client-side manipulation of tables
Jarmul et al. Python web scraping
US9122484B2 (en) Method and apparatus for mashing up web applications
CN110147476A (en) Data crawling method, terminal device and computer readable storage medium based on Scrapy
Hajba Website Scraping with Python
US11785039B2 (en) Scanning web applications for security vulnerabilities
US10114617B2 (en) Rapid visualization rendering package for statistical programming language
Berlin et al. To re-experience the web: A framework for the transformation and replay of archived web pages
Behfarshad et al. Hidden-web induced by client-side scripting: An empirical study
CN109471966B (en) Method and system for automatically acquiring target data source
CN112182338A (en) Monitoring method and device for hosting platform
Zochniak et al. Performance comparison of observer design pattern implementations in javascript
Kaczmarek et al. Harvesting deep web data through produser involvement
CN104778070B (en) Hidden variable abstracting method and equipment and information extracting method and equipment
Méndez Lobato SEO Analysis and its effects on Web Positioning
Koder Increasing Full Stack Development Productivity via Technology Selection
Salama “Down With Regression!”–Generating Test Suites for the Web
Wu et al. A web data extraction description language and its implementation
Ast et al. The SWAC Approach for Sharing a Web Application’s Codebase Between Server and Client
Ast et al. Efficient development of progressively enhanced web applications by sharing presentation and business logic between server and client

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210406