CN111859075A - Asynchronous processing framework-based data crawling method with automatic testing function - Google Patents

Asynchronous processing framework-based data crawling method with automatic testing function

Info

Publication number
CN111859075A
CN111859075A CN202010747535.0A
Authority
CN
China
Prior art keywords
data
webpage
request
automatic test
browser
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010747535.0A
Other languages
Chinese (zh)
Inventor
康辉
孙鑫
赵旭
李佳辉
卢凌锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jilin University
Original Assignee
Jilin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jilin University filed Critical Jilin University
Priority to CN202010747535.0A priority Critical patent/CN111859075A/en
Publication of CN111859075A publication Critical patent/CN111859075A/en
Priority to CN202110059894.1A priority patent/CN112612943A/en
Legal status: Pending (current)


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention belongs to the technical field of web crawlers and relates to a data crawling method with an automatic testing function based on an asynchronous processing framework. Building on a relatively mature web crawler framework, the invention targets crawling tasks on websites that adopt anti-crawler strategies from the beginning of their design, particularly websites whose data is dynamically generated through scripts. An automatic testing technology is introduced at the point where the initial request link, sent by the spider project file, reaches the download middleware through the engine and the queue, thereby acquiring the source code of dynamic webpages. The webpage response obtained by the method is the result after script rendering, and the browser can be custom-controlled through the automatic testing technology to complete a series of chain operations. This saves the developer the architecture analysis of the target website, reduces the project development difficulty, frees more time for webpage parsing, improves the quality of the crawled data, and shortens the project development period.

Description

Asynchronous processing framework-based data crawling method with automatic testing function
Technical Field
The invention belongs to the technical field of web crawlers, relates to a data crawling method based on an asynchronous processing framework, and particularly relates to a data crawling method with an automatic testing function based on the asynchronous processing framework.
Background
With the advent of the big data age, the position of web crawlers in the internet becomes more and more important. The data in the internet is massive, and how to automatically and efficiently acquire the information we are interested in is an important problem, which crawler technology was created to solve. As technologies such as big data analysis, data mining, and natural language processing in the field of artificial intelligence continue to develop, their rapid progress rests on the availability of large amounts of high-quality data. The web crawler not only solves the problem of data acquisition, it also extracts structured data from unstructured webpages, and breakthroughs in this technology play a significant role.
Web crawlers can be divided into personal crawlers and enterprise crawlers, but whether individual or enterprise, web crawlers are an integral part of many project processes. With the continuous spread of web crawler applications, a large set of open-source crawler frameworks has emerged, such as the Pyspider framework and the Scrapy framework, which are by now quite mature.
With the development of web crawler technology, a crawler program in actual development is often a distributed crawler deployed on servers in order to improve efficiency. Because multiple computers with different physical addresses run the program simultaneously, deduplication of the access links in the request queue becomes a problem that must be considered first; compared with traditional crawler tools, existing crawler frameworks provide a scheduler structure that solves this problem. On the other hand, to protect the privacy of site resources, the big data-source websites add anti-crawler strategies from the beginning of website design: most of a website's data is dynamically generated through javascript scripts, and the server can identify whether a script is accessing the resources to judge whether the visitor is a real user, which places new requirements on existing crawler frameworks.
Disclosure of Invention
The invention aims to solve the above problems and provides a data crawling method with an automatic testing function based on an asynchronous processing framework.
The purpose of the invention is realized by the following technical scheme:
the method comprises the following steps:
A. determining information required for requesting a target website
The method comprises a target website request link, a user agent, a request mode and a request parameter;
B. determining web page loading characteristics
Checking a webpage source code, and determining whether the source code is consistent with the content presented by the current webpage;
C. determining code segment regions to crawl data
Positioning the data to be crawled in the webpage and the information of each field;
D. deploying unstructured database information
Determining an unstructured database address, a port and a database name for storing crawl data;
E. configuring a Selenium automatic test tool
Selenium is an automatic testing tool for testing website applications; Selenium tests can run directly in a browser and imitate a user's operation of the browser;
installing a Selenium toolkit and a browser driver of a corresponding version;
F. building a crawler framework based on the Scrapy technology
The Scrapy framework is a fast, high-level web crawling framework for Python; its modules are loosely coupled and highly extensible, and it is used to crawl websites and extract structured data from webpages.
Further, in step a, the determining information required by the request target website includes the following steps:
a1, opening a browser developer mode of a target website, clicking a 'Network' tab, and refreshing a current page;
a2, clicking the item whose path is consistent with the browser navigation bar of the page;
a3, recording information of website request links, user agents, request modes and request parameters on the right side of the developer mode window.
Further, in step B, said determining the loading characteristics of the web page includes the following steps:
b1, opening the source code of the target webpage;
b2, comparing the data to be crawled in the target webpage with the content of the corresponding label in the source code and judging whether they are the same; if they are the same, the front-end webpage is static; if they differ, the webpage is a dynamic webpage rendered by javascript scripts and possibly encryption algorithms.
Further, in step C, the determining the area where the crawling data is located includes the following steps:
c1, opening a browser developer mode of the target website, clicking an Elements tab, and displaying the webpage source code after script rendering;
c2, finding the data needed to be crawled in turn through the function of automatic code positioning of the browser, declaring field names respectively, and recording positioning elements corresponding to the code field areas.
Further, in step D, the deploying unstructured database information comprises the following steps:
d1, the unstructured database may be deployed on a local computer or a server; the database can be connected to as long as the database address and the designated port number set during deployment are known;
d2, connecting the deployed unstructured database, creating the database used for storing the crawled data, and recording its name.
further, in step F, the Scapy framework is divided into the following parts:
the Engine is mainly responsible for transmitting data and signals among different modules of the whole system;
an Item, which defines the data structure of the crawled information;
the Scheduler receives the request sent by the engine and adds the request into the queue;
the Downloader downloads the webpage content sent by the engine and returns the webpage content;
the spider Spiders define crawling logic and analysis rules and generate extraction results and new requests;
an Item pipe Item Pipeline, which is responsible for processing results extracted from the web page by the spider, performing data cleansing and storage, and the like;
middleware Middlewares, comprising download middleware and spider middleware, which are hook structures positioned between the engine and the downloader and between the engine and the spiders; they implement customized request and download extensions and filter returned response results.
Further, in step F, the crawler framework based on the Scrapy technology is built as follows:
f1, setting the recorded request information in the initial function of the spider project file, passing it together with the website request link as parameters of the request function, and iterating over the initial links using a yield operation;
f2, the request sent by the spider project file reaches the download middleware through the engine and the queue, where an automatic testing technology is introduced to acquire dynamic webpage source code and to control the browser through a customized series of chain operations;
f3, defining a project class, wherein the class defines the collection name of the stored data, and defines a Field class variable for each Field of the recorded required data;
f4, a callback function corresponding to the response result is appointed by a callback parameter in the request function;
f5, configuring parameters in the setting file of the project, including the use priority of the downloading middleware and the pipeline middleware, the database address and the database name, the timeout time, the number of accessed web pages and the number of concurrent requests.
Further, the operation in the callback function includes:
1) analyzing the source webpage and acquiring the links of the list page for deeper page access, i.e., every time a link is acquired a new request is issued through a yield operation, with the callback parameter pointing to another callback function;
2) analyzing the detail page: screening the useful information contained in the response result according to each recorded field of the required data and its corresponding positioning element, declaring an object of the defined item class to store the corresponding field values, and then passing it to the item pipeline through a yield operation to be stored in the database.
Further, in step B2, when the target webpage is determined to be a dynamic webpage rendered by javascript and encryption algorithms, an automatic testing technology needs to be introduced into the download middleware before the request links of the target webpage and related webpages are downloaded, so that the download middleware returns the webpage result after javascript or encryption-algorithm rendering.
Compared with prior methods, the invention has the following improvement: by introducing customized operations into the download middleware, the crawler project can handle complex webpages processed by javascript scripts and encryption algorithms without a separate architecture analysis process, reducing the project development difficulty and shortening the development period.
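For concreteness, the following is a minimal sketch of such a download middleware; the class name WenshuSeleniumMiddleware and the SELENIUM_TIMEOUT setting match the configuration shown in the embodiment below, but the patent does not disclose the middleware body, so the implementation details are assumptions based on standard Scrapy and Selenium usage:
from scrapy.http import HtmlResponse
from selenium import webdriver

class WenshuSeleniumMiddleware:
    def __init__(self, timeout=60):
        # One shared browser instance; requires a chromedriver matching the installed Chrome
        self.browser = webdriver.Chrome()
        self.browser.set_page_load_timeout(timeout)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(timeout=crawler.settings.get('SELENIUM_TIMEOUT', 60))

    def process_request(self, request, spider):
        # Render the page in the real browser so javascript-generated content is present,
        # then return the rendered source to the engine as an ordinary response
        self.browser.get(request.url)
        return HtmlResponse(url=request.url, body=self.browser.page_source,
                            encoding='utf-8', request=request)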
Drawings
FIG. 1 is a diagram of an asynchronous processing framework of the present invention incorporating automatic test techniques.
Detailed description of the preferred embodiments
The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. The following examples are presented merely to further understand and practice the present invention and are not to be construed as further limiting the claims of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention provides a data crawling method with an automatic testing function based on an asynchronous processing framework. By combining the Scrapy crawler framework with the Selenium automatic testing technology, it solves the problem that the response result of a front-end dynamic webpage cannot be obtained through the crawler framework alone, so that the webpage can be further parsed to obtain structured data. This saves the developer the architecture analysis of the target website, reduces the project development difficulty, frees more time for webpage parsing, improves the quality of the crawled data, and shortens the project development period.
The following example illustrates a specific embodiment of the present invention and should not be construed as limiting the invention. The target website of this embodiment is China Judgements Online (the "Chinese referee document network"): the task is to crawl all homepage data of a specified court, i.e., all document detail information of that court (the website only allows a user to view 600 records).
A data crawling method with an automatic testing function based on an asynchronous processing framework comprises the following steps.
Step A, the information required by determining the target website request comprises the following steps:
a1, opening the browser developer mode of the target website, clicking the "Network" tab, and refreshing the current page;
a2, clicking the item (generally the first one) whose path is consistent with the browser navigation bar of the page;
a3, recording the website request link, user agent, request mode, and request parameters on the right side of the developer mode window.
Step B, the step of determining the webpage loading characteristics comprises the following steps:
b1, opening the source code of the target webpage;
b2, comparing the data to be crawled in the target webpage with the content of the corresponding label in the source code. Information that the user can see in the webpage does not appear in the webpage source code, so China Judgements Online is judged to be a dynamic webpage rendered by javascript scripts and some encryption algorithms; the true webpage result cannot be returned directly by the downloader, and the webpage source code containing the required data can only be obtained by performing actions with the help of the automatic testing technology.
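As an illustration of this check (not part of the patent; the URL, headers, and marker text are placeholders), the raw source can be fetched and tested for a fragment that is visible in the rendered page:
import requests

url = 'https://wenshu.court.gov.cn/'       # placeholder: the request link recorded in step A
headers = {'User-Agent': 'Mozilla/5.0'}    # placeholder: the user agent recorded in step A
marker = 'some text visible on the page'   # placeholder: content seen in the browser

raw = requests.get(url, headers=headers).text
if marker in raw:
    print('static webpage: data appears in the raw source')
else:
    print('dynamic webpage: data is rendered by scripts, use the automatic test tool')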
Step C, the step of determining the area of the crawled data comprises the following steps:
c1, opening the browser developer mode of the target website and clicking the "Elements" tab, which shows the webpage source code after script rendering;
c2, finding the data and field names to be crawled through the browser's automatic code-positioning function, and recording the xpath path or css selector of the corresponding code field area.
Crawling China Judgements Online involves two pages in total, namely the document list page and the document detail page.
First, the link of each document in the document list page needs to be acquired, and the iterative page visits continue after each link is acquired. The link is recorded as:
Link: //a[@class="caseName"]/@href;
The document detail page is accessed through each document link in the list; each part of the document detail page is acquired, a field name is declared for each piece of information, and its xpath path or css selector is recorded as follows:
Title: ./div[@class="PDF_title"],
Release time: .PDF_cut > div:nth-child(1) > table:nth-child(1) > tr:nth-child(1) > td:nth-child(1),
Browsing amount: .PDF_cut > div:nth-child(1) > table:nth-child(1) > tr:nth-child(1) > td:nth-child(2),
Court: ./div[@class="PDF_pox"]/div[1].
Because the writing of the documents is not standardized, the other fields have no fixed xpath or css expression and must be obtained through loops and condition judgments; these fields comprise the document type, parties, trial cause, trial result, trial length, trial judges, trial time, judge assistants, and clerk.
Step D, the information of the deployment unstructured database comprises the following steps:
d1, deploying the MongoDB database on a local computer, with the database address "localhost" and the designated port number 27017;
d2, connecting the deployed MongoDB database and creating a database named "test" to store the crawled data.
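A minimal sketch of this step, assuming MongoDB is already running locally on the default port (MongoDB creates the database lazily on first write):
import pymongo

client = pymongo.MongoClient('localhost', 27017)
db = client['test']     # database name recorded for storing the crawled data
print(db.name)          # -> test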
Step E, configuring a Selenium automatic test tool
Selenium is an automatic testing tool for testing website applications; Selenium tests can run directly in the browser and simulate a user's operation of the browser;
Selenium is installed using python's package installation command pip: pip install selenium. Then the version number of the Chrome browser is checked, the chromedriver.exe of the corresponding version is downloaded, and it is stored in python's Scripts directory.
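A quick way to verify the installation (an illustrative check, not prescribed by the patent; the URL is an arbitrary reachable page) is to start and quit a browser session:
from selenium import webdriver

browser = webdriver.Chrome()        # fails here if chromedriver is missing or version-mismatched
browser.get('https://www.example.com')
print(browser.title)
browser.quit()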
Step F, building the crawler framework based on the Scrapy technology:
f1, setting the recorded request information in the initial function of the spider project file, passing it together with the website request link as parameters of the request function, and iterating over the initial link with a yield operation.
The initial link contains Unicode-encoded request parameters, and the request is initiated through the Scrapy framework's Request method, specifically:
yield Request(url=self.start_urls, callback=self.parse_origin, meta={'tag': 0}, dont_filter=True)
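For context, a hedged sketch of the spider skeleton around this yield (the patent shows only the yield line; the class and method names other than parse_origin are assumptions):
import scrapy
from scrapy import Request

class WenshuSpider(scrapy.Spider):
    name = 'wenshu'
    # A single Unicode-encoded request link, kept as a plain string to match
    # the url=self.start_urls usage above (the real link is elided)
    start_urls = 'http://wenshu.court.gov.cn/...'

    def start_requests(self):
        yield Request(url=self.start_urls, callback=self.parse_origin,
                      meta={'tag': 0}, dont_filter=True)

    def parse_origin(self, response):
        pass   # list-page parsing, shown under F4 below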
f2, the request sent by the spider project file reaches the download middleware through the engine and the queue, where the automatic testing technology is introduced to acquire the dynamic webpage source code and to control the browser through a customized series of chain operations;
the Selenium kit used was first introduced, operating as follows:
from selenium import webdriver;
the object of the Chrome browser to perform the test is then declared, the operation is as follows:
self.browser=webdriver.Chrome();
Pages are then accessed through self.browser.get().
The document list page holds 600 records in total, and a single page can display only 5 records, which would require 120 page visits. However, by changing the page's single-page display count through a Selenium select operation, the links of all 600 documents can be obtained with a single visit to the document list page, improving the execution efficiency of the project. The operation is as follows:
import time
from selenium.webdriver.support.ui import Select

# Rewrite the third option of the page-size selector to 600 via javascript
self.browser.execute_script("""document.querySelector("select.pageSizeSelect option:nth-child(3)").text="600";""")
time.sleep(4)   # wait for the page to react
# Locate the page-size select element and choose the new 600 option
driver = self.browser.find_element_by_xpath('//div[@class="WS_my_pages"]/select[@class="pageSizeSelect"]')
sel = Select(driver)
sel.select_by_visible_text('600')
f3, defining a project class, wherein the class defines the collection name of the stored data, and defines a Field class variable for each Field of the recorded required data;
depending on the division of the document detail page information, the following fields may be declared in the project class:
title=Field()
release=Field()
views=Field()
court=Field()
type=Field()
prelude=Field()
parties=Field()
justification=Field()
end=Field()
chief=Field()
judge=Field()
time=Field()
assistant=Field()
clerk=Field()
In addition, a variable collection recording the collection name is declared: collection = 'wenshu'.
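Putting these declarations together, a sketch of the complete item class (the wrapper and the class name WenshuItem follow standard Scrapy practice and the pipeline check shown later, rather than text quoted from the patent):
from scrapy import Item, Field

class WenshuItem(Item):
    collection = 'wenshu'   # name of the MongoDB collection used by the pipeline
    title = Field()
    release = Field()
    views = Field()
    court = Field()
    type = Field()
    prelude = Field()
    parties = Field()
    justification = Field()
    end = Field()
    chief = Field()
    judge = Field()
    time = Field()
    assistant = Field()
    clerk = Field()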
f4, specifying the callback function corresponding to the response result through the callback parameter in the request function. The callback functions comprise:
1) Parsing the source webpage: obtaining the list-page links to access deeper pages, i.e., each acquired link is requested anew through a yield iteration, with the callback parameter pointing to another parsing function.
The initial Request has a callback parameter, which delivers the current response result as an argument to the specified parsing function, namely parse_origin.
The set of all document links is acquired through the xpath path, as follows:
urls = response.xpath('//a[@class="caseName"]/@href').extract()
The link set is then looped over and, combined with the yield iteration, the detail pages are accessed in turn:
for url in urls:
    yield Request(url=url, callback=self.parse_detail, meta={'tag': 1}, dont_filter=False)
the callback function becomes parse _ detail, i.e. the response result jumps to the new document parsing function.
2) Parsing the detail page: screening the useful information contained in the response result according to each recorded field of the required data and its corresponding xpath path or css selector, declaring an object of the defined item class to store the corresponding field values, and then passing the object to the item pipeline through a yield iteration to store the data.
An object of the item class is declared and created: item = WenshuItem(). The corresponding field data is then screened through the xpath path or css selector, as follows:
box = response.xpath('//div[@class="PDF_box"]')
item['title'] = box.xpath('./div[@class="PDF_title"]/text()').extract_first().replace('\n', '')   # newline cleanup (the replace() argument is an assumption)
item['court'] = box.xpath('./div[@class="PDF_pox"]/div[1]/text()').extract_first()
Because the writing of the documents is not standardized, the other fields use for loops and condition judgments to match against the response result. Finally the item object of each document is iterated out through a yield item operation and passed on for the subsequent database storage operation.
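As an illustration of this loop-and-condition matching, continuing the snippet above (the labels and cleanup are assumptions; the patent does not list the exact conditions):
# Walk the text lines of the document body and match fields by their labels
for line in box.xpath('.//text()').extract():
    text = line.strip()
    if text.startswith('审判长'):        # presiding judge
        item['chief'] = text[len('审判长'):].strip()
    elif text.startswith('书记员'):      # clerk
        item['clerk'] = text[len('书记员'):].strip()
yield item   # hand the completed item to the item pipeline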
First, the MongoDB toolkit is imported (import pymongo) and the selected database under the specified path is connected, as follows:
self.client=pymongo.MongoClient(self.mongo_uri)
self.db=self.client[self.mongo_db]。
An insert operation is then performed on the item object; it must first be checked whether the item belongs to the WenshuItem class, to avoid mismatched object types. The specific operations are as follows:
if isinstance(item,WenshuItem):
self.db[item.collection].insert(dict(item))。
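A sketch of the pipeline class these fragments belong to (the class name WsMongoPipeline matches the settings below; the surrounding structure is standard Scrapy practice and is assumed rather than quoted from the patent):
import pymongo
from wenshu.items import WenshuItem   # assumed module path for the item class

class WsMongoPipeline:
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection settings configured in f5 below
        return cls(mongo_uri=crawler.settings.get('MONGO_URI'),
                   mongo_db=crawler.settings.get('MONGO_DB'))

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        if isinstance(item, WenshuItem):
            self.db[item.collection].insert(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()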
f5, configuring the parameters in the project's settings file, including the use priority of the download middleware and the pipeline middleware, the database address and database name, the timeout, the number of pages to access, and the number of concurrent requests.
The parameters needed in the project are configured in the settings file, which makes the project easy to manage and enhances its maintainability. Some of these parameters are:
Number of pages visited: MAX_PAGE = 2
Number of concurrent requests: CONCURRENT_REQUESTS = 3
Selenium test timeout: SELENIUM_TIMEOUT = 60
Access path of MongoDB: MONGO_URI = 'localhost'
Database name used by MongoDB: MONGO_DB = 'test'
Item pipeline use priority: ITEM_PIPELINES = {
'wenshu.pipelines.WsMongoPipeline': 300,
'wenshu.pipelines.MongoPipeline': 302}
Download middleware priority: DOWNLOADER_MIDDLEWARES = {
'wenshu.middlewares.WenshuSeleniumMiddleware': 543,
'scrapy_splash.SplashCookiesMiddleware': 723,
'scrapy_splash.SplashMiddleware': 725}
To obtain the value of a configured parameter, the name of the corresponding parameter is used; for example, to obtain the number of pages to visit: pageSize = self.settings.get('MAX_PAGE').
While embodiments of the invention have been disclosed above, the invention is not limited to the uses set forth in the specification and examples; it can be applied to all fields for which it is suitable, and additional modifications will readily occur to those skilled in the art. The invention is therefore not limited to the exact details and illustrations described and shown herein, but falls within the scope of the appended claims and their equivalents.

Claims (9)

1. A data crawling method with an automatic test function based on an asynchronous processing framework comprises the following steps:
A. determining information required for requesting a target website
The method comprises a target website request link, a user agent, a request mode and a request parameter;
B. determining web page loading characteristics
Checking a webpage source code, and determining whether the source code is consistent with the content presented by the current webpage;
C. determining code segment regions to crawl data
Positioning the data to be crawled in the webpage and the information of each field;
D. deploying unstructured database information
Determining an unstructured database address, a port and a database name for storing crawl data;
E. configuring a Selenium automatic test tool
Selenium is an automatic testing tool for testing website applications; Selenium tests can run directly in a browser and imitate a user's operation of the browser;
installing a Selenium toolkit and a browser driver of a corresponding version;
F. building a crawler framework based on the Scrapy technology
The Scrapy framework is a fast, high-level web crawling framework for Python; its modules are loosely coupled and highly extensible, and it is used to crawl websites and extract structured data from webpages.
2. The data crawling method with automatic test function based on asynchronous processing framework as claimed in claim 1, wherein:
step A, the step of determining the information required by the request target website comprises the following steps:
a1, opening a browser developer mode of a target website, clicking a 'Network' tab, and refreshing a current page;
a2, clicking the item whose path is consistent with the browser navigation bar of the page;
a3, recording information of website request links, user agents, request modes and request parameters on the right side of the developer mode window.
3. The data crawling method with automatic test function based on asynchronous processing framework as claimed in claim 1, wherein:
step B, the step of determining the webpage loading characteristics comprises the following steps:
b1, opening the source code of the target webpage;
b2, comparing the data to be crawled in the target webpage with the content of the corresponding label in the source code and judging whether they are the same; if they are the same, the front-end webpage is static; if they differ, the webpage is a dynamic webpage rendered by javascript scripts and possibly encryption algorithms.
4. The data crawling method with automatic test function based on asynchronous processing framework as claimed in claim 1, wherein:
step C, the step of determining the area of the crawled data comprises the following steps:
c1, opening a browser developer mode of the target website, clicking an Elements tab, and displaying the webpage source code after script rendering;
c2, finding the data needed to be crawled in turn through the function of automatic code positioning of the browser, declaring field names respectively, and recording positioning elements corresponding to the code field areas.
5. The data crawling method with automatic test function based on asynchronous processing framework as claimed in claim 1, wherein:
step D, the information of the deployment unstructured database comprises the following steps:
d1, the unstructured database may be deployed on a local computer or a server; the database can be connected to as long as the database address and the designated port number set during deployment are known;
d2, connecting the deployed unstructured database, creating the database used to store the crawled data and recording its name.
6. The data crawling method with automatic test function based on asynchronous processing framework as claimed in claim 1, wherein:
step F, dividing the Scrapy framework into the following parts:
the Engine is mainly responsible for transmitting data and signals among different modules of the whole system;
an Item, which defines the data structure of the crawled information;
the Scheduler receives the request sent by the engine and adds the request into the queue;
the Downloader downloads the webpage content sent by the engine and returns the webpage content;
the spider Spiders define crawling logic and analysis rules and generate extraction results and new requests;
an Item pipe Item Pipeline, which is responsible for processing results extracted from the web page by the spider, performing data cleansing and storage, and the like;
middleware Middlewares, comprising download middleware and spider middleware, which are hook structures positioned between the engine and the downloader and between the engine and the spiders; they implement customized request and download extensions and filter returned response results.
7. The data crawling method with automatic test function based on asynchronous processing framework as claimed in claim 1 or 6, wherein:
step F, building the crawler framework based on the Scrapy technology:
f1, setting the recorded request information in the initial function of the spider project file, passing it together with the website request link as parameters of the request function, and iterating over the initial links using a yield operation;
f2, the request sent by the spider project file reaches the download middleware through the engine and the queue, where an automatic testing technology is introduced to acquire dynamic webpage source code and to control the browser through a customized series of chain operations;
f3, defining a project class, wherein the class defines the collection name of the stored data, and defines a Field class variable for each Field of the recorded required data;
f4, a callback function corresponding to the response result is appointed by a callback parameter in the request function;
f5, configuring parameters in the setting file of the project, including the use priority of the downloading middleware and the pipeline middleware, the database address and the database name, the timeout time, the number of accessed web pages and the number of concurrent requests.
8. The data crawling method with automatic test function based on asynchronous processing framework as claimed in claim 7, wherein:
the operation in the callback function comprises the following steps:
1) analyzing the source webpage and acquiring the links of the list page for deeper page access, i.e., every time a link is acquired a new request is issued through a yield operation, with the callback parameter pointing to another callback function;
2) analyzing the detail page: screening the useful information contained in the response result according to each recorded field of the required data and its corresponding positioning element, declaring an object of the defined item class to store the corresponding field values, and then passing it to the item pipeline through a yield operation to be stored in the database.
9. The data crawling method with automatic test function based on asynchronous processing framework as claimed in claim 3, wherein:
step B2, when the target webpage is determined to be a dynamic webpage rendered by javascript and encryption algorithms, an automatic testing technology needs to be introduced into the download middleware before the request links of the target webpage and related webpages are downloaded, so that the download middleware returns the webpage result after javascript or encryption-algorithm rendering.
CN202010747535.0A 2020-07-30 2020-07-30 Asynchronous processing framework-based data crawling method with automatic testing function Pending CN111859075A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010747535.0A CN111859075A (en) 2020-07-30 2020-07-30 Asynchronous processing framework-based data crawling method with automatic testing function
CN202110059894.1A CN112612943A (en) 2020-07-30 2021-01-18 Asynchronous processing framework-based data crawling method with automatic testing function

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010747535.0A CN111859075A (en) 2020-07-30 2020-07-30 Asynchronous processing framework-based data crawling method with automatic testing function

Publications (1)

Publication Number Publication Date
CN111859075A true CN111859075A (en) 2020-10-30

Family

ID=72945639

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202010747535.0A Pending CN111859075A (en) 2020-07-30 2020-07-30 Asynchronous processing framework-based data crawling method with automatic testing function
CN202110059894.1A Pending CN112612943A (en) 2020-07-30 2021-01-18 Asynchronous processing framework-based data crawling method with automatic testing function

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN202110059894.1A Pending CN112612943A (en) 2020-07-30 2021-01-18 Asynchronous processing framework-based data crawling method with automatic testing function

Country Status (1)

Country Link
CN (2) CN111859075A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113515681A * 2021-04-30 2021-10-19 广东科学技术职业学院 Real estate data crawler method and device based on Scrapy framework
CN114969474A (en) * 2022-03-31 2022-08-30 安徽希施玛数据科技有限公司 Webpage data acquisition method, webpage data acquisition device and storage medium

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114297462A (en) * 2021-12-13 2022-04-08 中国电子科技集团公司第二十八研究所 Intelligent website asynchronous sequence data acquisition method based on dynamic self-adaption
CN116719986B (en) * 2023-08-10 2023-12-26 深圳传趣网络技术有限公司 Python-based data grabbing method, device, equipment and storage medium
CN116975408A (en) * 2023-08-11 2023-10-31 国网吉林省电力有限公司经济技术研究院 Automatic grabbing method for rural industrial database website based on manual simulation

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109033115B (en) * 2017-06-12 2021-02-19 广东技术师范学院 Dynamic webpage crawler system
CN110147476A (en) * 2019-04-12 2019-08-20 深圳壹账通智能科技有限公司 Data crawling method, terminal device and computer readable storage medium based on Scrapy


Also Published As

Publication number Publication date
CN112612943A (en) 2021-04-06

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20201030