CN112612943A - Asynchronous processing framework-based data crawling method with automatic testing function - Google Patents

Asynchronous processing framework-based data crawling method with automatic testing function

Info

Publication number
CN112612943A
CN112612943A
Authority
CN
China
Prior art keywords
data
request
webpage
automatic test
database
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110059894.1A
Other languages
Chinese (zh)
Inventor
康辉
孙鑫
赵旭
李佳辉
卢凌锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jilin University
Original Assignee
Jilin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jilin University filed Critical Jilin University
Publication of CN112612943A publication Critical patent/CN112612943A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/958Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention belongs to the technical field of web crawlers and relates to a data crawling method with an automatic testing function based on an asynchronous processing framework. Building on a relatively mature web crawler framework in the field, the invention targets crawling tasks against websites that adopt anti-crawler strategies from the very beginning of their design, in particular websites whose data are generated dynamically by scripts, and introduces an automatic testing technique at the point where the initial request link issued by the spider project file reaches the download middleware via the engine and the queue, so that the source code of the dynamic web page can be acquired. The web page response obtained by the method has already been rendered by the scripts, and the automatic testing technique makes it possible to control the browser in a user-defined way to complete a series of chained operations. This saves developers the framework analysis of the target website, reduces the project development difficulty, lets more time be spent on web page parsing, improves the quality of the crawled data, and shortens the project development cycle.

Description

Asynchronous processing framework-based data crawling method with automatic testing function
Technical Field
The invention belongs to the technical field of web crawlers, relates to a data crawling method based on an asynchronous processing framework, and particularly relates to a data crawling method with an automatic testing function based on the asynchronous processing framework.
Background
With the advent of the big data age, web crawlers occupy an increasingly important position on the internet. The data on the internet is massive, and how to acquire the information people are interested in automatically and efficiently is an important problem; crawler technology exists to solve it. As big data analysis, data mining, natural language processing and other artificial intelligence technologies continue to develop, their rapid progress presupposes the availability of data, and of high-quality data. Web crawlers not only solve the problem of data acquisition but also extract structured data from irregular web pages, and breakthroughs in this technology therefore play a significant role.
Web crawlers can be divided into personal crawlers and enterprise crawlers, but whether for an individual or an enterprise, web crawlers are an integral part of many project workflows. With the continued spread of web crawler applications, a large number of open source crawler frameworks have emerged, such as the Pyspider framework and the Scrapy framework, which are by now relatively mature.
With the development of web crawler technology, a crawler program in actual development is often a distributed crawler deployed on servers in order to improve efficiency. Because multiple computers with different physical addresses run the program simultaneously, de-duplicating the access links in the request queue during operation becomes a problem that must be considered first; compared with traditional crawler tools, existing crawler frameworks provide a scheduler structure that can solve this problem. On the other hand, in order to protect the privacy of site resources, the major data-source websites add anti-crawler strategies from the very beginning of website design: most of a website's data is generated dynamically by JavaScript scripts, and the server can identify whether the resources are being accessed by a script in order to judge whether the visitor is a real user. This places new requirements on existing crawler frameworks.
Disclosure of Invention
The invention aims to solve the above problems and provides a data crawling method with an automatic testing function based on an asynchronous processing framework.
The purpose of the invention is realized by the following technical scheme:
the method comprises the following steps:
A. determining information required for requesting a target website
The method comprises a target website request link, a user agent, a request mode and a request parameter;
B. determining web page loading characteristics
Checking a webpage source code, and determining whether the source code is consistent with the content presented by the current webpage;
C. determining code segment regions to crawl data
Positioning the position of the data needing to be crawled of the webpage and information of each field;
D. deploying unstructured database information
Determining an unstructured database address, a port and a database name for storing crawl data;
E. configuring a Selenium automatic test tool
Selenium is an automated testing tool for web applications; Selenium tests run directly in the browser and imitate a user's operations on the browser;
installing a Selenium toolkit and a browser driver of a corresponding version;
F. building a crawler framework based on the Scrapy technology
The Scrapy framework is a fast, high-level web crawling framework for Python; its modules are loosely coupled and highly extensible, and it is used to crawl websites and extract structured data from web pages.
Further, in step A, determining the information required to request the target website includes the following steps:
a1, opening a browser developer mode of a target website, clicking a 'Network' tab, and refreshing a current page;
a2, clicking an item consistent with the path of the navigation bar of the page browser;
a3, recording information of website request links, user agents, request modes and request parameters on the right side of the developer mode window.
Further, in step B, said determining the loading characteristics of the web page includes the following steps:
b1, opening the source code of the target webpage;
b2, comparing the data to be crawled in the target web page with the content of the corresponding tags in the source code and judging whether they are the same; if they are the same, the front-end web page is static; if they are different, it is a dynamic web page rendered by JavaScript scripts and possibly encryption algorithms.
Further, in step C, the determining the area where the crawling data is located includes the following steps:
c1, opening a browser developer mode of the target website, clicking an Elements tab, and displaying the webpage source code after script rendering;
c2, finding the data needed to be crawled in turn through the function of automatic code positioning of the browser, declaring field names respectively, and recording positioning elements corresponding to the code field areas.
Further, in step D, the deploying unstructured database information comprises the following steps:
d1, the unstructured database may be deployed on a local computer or on a server; the database can be connected to as long as the database address and the port number designated during deployment are known;
d2, connecting the deployed unstructured databases, creating the databases used for storing the crawled data and recording the names of the databases;
further, in step F, the Scapy framework is divided into the following parts:
the Engine is mainly responsible for transmitting data and signals among different modules of the whole system;
an Item, which defines the data structure of the crawled information;
the Scheduler receives the request sent by the engine and adds the request into the queue;
the Downloader downloads the webpage content sent by the engine and returns the webpage content;
the spider Spiders define crawling logic and analysis rules and generate extraction results and new requests;
an Item pipe Item Pipeline, which is responsible for processing results extracted from the web page by the spider, performing data cleansing and storage, and the like;
middleware Middlewares, comprising download middleware and spider middleware, are hooks positioned between the engine and the downloader and between the engine and the spiders; they implement customized request and download extensions and filter the returned response results.
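For orientation, the following is a minimal sketch of how these parts surface in a Scrapy project; it is illustrative only, and all names and selectors in it are assumptions rather than part of the claimed method:
import scrapy

class ExampleItem(scrapy.Item):
    # Item: defines the data structure of the crawled information
    title = scrapy.Field()

class ExampleSpider(scrapy.Spider):
    # Spider: defines the crawling logic and parsing rules
    name = 'example'
    start_urls = ['https://example.com/']

    def parse(self, response):
        # the engine routes the request through the scheduler and downloader;
        # the downloaded response arrives here for parsing
        for text in response.css('h1::text').getall():
            yield ExampleItem(title=text)              # handed to the item pipeline
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, self.parse)    # new request, queued by the scheduler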
Further, in step F, the crawler framework based on the Scrapy technology is built as follows:
f1, setting the recorded request information in the initial function of the spider project file, passing it together with the website request link as parameters of the request function, and accessing the initial link in an iterative loop;
f2, the request sent by the spider project file reaches the download middleware through the engine and the queue, where an automatic testing technique is introduced in order to acquire the dynamic web page source code and to control the browser in a user-defined way to perform a series of chained operations;
f3, defining an item class, in which the collection name of the stored data is defined and a Field class variable is declared for each recorded field of the required data;
f4, a callback function corresponding to the response result is appointed by a callback parameter in the request function;
f5, configuring parameters in the setting file of the project, including the use priority of the downloading middleware and the pipeline middleware, the database address and the database name, the timeout time, the number of accessed web pages and the number of concurrent requests.
Further, the operation in the callback function includes:
1) analyzing a source webpage, acquiring links of a list page to perform deeper page access, namely requesting a new link through iterative operation every time a link is acquired, and enabling a callback parameter to point to other callback functions;
2) and analyzing the detail page, screening useful information contained in the response result according to each recorded field of the required data and the corresponding positioning element, declaring a defined item class object to store a corresponding field value, and then transmitting the field value to an item pipeline through iterative operation to store the field value in a database.
Further, in step B2, when the target web page is determined to be a dynamic web page rendered by JavaScript and encryption algorithms, an automatic testing technique needs to be introduced into the download middleware before the request links of the target web page and related web pages are downloaded, so that the result of the web page rendered by the JavaScript or encryption algorithms is returned.
Compared with prior methods, the invention has the following improvement: by introducing an automatic testing technique into the download middleware, the crawler project can obtain web pages processed by JavaScript scripts and encryption algorithms, developers are spared the structural analysis of complex web pages, the project development difficulty is reduced, and the development cycle is shortened.
Drawings
FIG. 1 is a diagram of an asynchronous processing framework of the present invention incorporating automatic test techniques.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. The following examples are presented only for further understanding and implementation of the technical solution of the present invention, and do not constitute a further limitation to the claims of the present invention, therefore, all other examples obtained by one of ordinary skill in the art without creative efforts based on the examples of the present invention shall fall within the protection scope of the present invention.
The invention provides a data crawling method with an automatic testing function based on an asynchronous processing framework. By combining the Scrapy crawler framework with the Selenium automatic testing technique, it solves the problem that a front-end dynamic web page cannot yield a response result through the crawler framework alone, so that the target web page can then be parsed to obtain structured data. This saves developers the framework analysis of the target website, reduces the project development difficulty, lets more time be spent on web page parsing, improves the quality of the crawled data, and shortens the project development cycle.
The following example illustrates a specific embodiment of the present invention, and it should be understood that the example is only illustrative and not restrictive. The target website of the embodiment is the China Judgements Online website ("中国裁判文书网"), and the crawling target is all data on the home page of a specified court, i.e. the detailed information of all of that court's judgment documents, amounting to 600 records (the website only allows a user to view 600 records).
A data crawling method with an automatic test function based on an asynchronous processing framework comprises the following steps:
Step A, determining the information required to request the target website comprises the following steps:
a1, opening a browser developer mode of a target website, clicking a 'Network' tab, and refreshing a current page;
a2, clicking the item (generally the first one) consistent with the path in the page browser's navigation bar;
A3, recording information of website request links, user agents, request modes and request parameters on the right side of the developer mode window.
Step B, the step of determining the webpage loading characteristics comprises the following steps:
b1, opening the source code of the target webpage;
b2, comparing the data to be crawled in the target web page with the content of the corresponding tags in the source code, it is found that the information visible to the user in the web page does not appear in the web page source code; the China Judgements Online website is therefore judged to be a dynamic web page rendered by JavaScript and encryption algorithms, so the downloader cannot directly return the real web page result, and the web page source code containing the required data can only be obtained through dynamic rendering with the help of an automatic testing technique.
Step C, the step of determining the area of the crawled data comprises the following steps:
c1, opening a browser developer mode of the target website, and clicking an Elements tab to display the webpage source codes after the script is rendered;
c2, automatically positioning the code through the browser, finding each data to be crawled in turn, declaring the field name, and recording the xpath path or css selector of the corresponding code field area.
Crawling of the "chinese referee's paperweb" involves two pages in total, namely, a paperlist page and a paperdetail page.
Firstly, the link of each document in the document list page needs to be acquired, and the document detail page is continuously accessed iteratively after the link is acquired, wherein the record of the link is as follows: linking: // a [ @ class ═ caseName "]/@ href;
and (3) linking and accessing the document detail page by each document in the document list, then acquiring each part of information in the document detail page, declaring field names for each part of information, and recording an xpath path, wherein each part of information is recorded as follows:
Title: ./div[@class="PDF_title"],
Release time: .PDF_cut > div:nth-child(1) > table:nth-child(1) > tr:nth-child(1) > td:nth-child(1),
Browsing amount: .PDF_cut > div:nth-child(1) > table:nth-child(1) > tr:nth-child(1) > td:nth-child(2),
Court: ./div[@class="PDF_pox"]/div[1].
Because the writing of the documents is not standardized, the other fields do not have a fixed XPath or CSS selector format and need to be obtained through loops and conditional judgments; these fields include the document type, the parties, the case prelude, the trial reasons, the trial result, the presiding judge, the judges, the trial time, the judge assistant and the court clerk.
Step D, the information of the deployment unstructured database comprises the following steps:
d1, deploying the MongoDB database in a local computer, determining that the address of the database is localhost and the designated port number is 27017;
d2, connecting to the deployed MongoDB database and creating a database "test" that is used to store the crawled data.
Step E, configuring a Selenium automatic test tool
Selenium is an automated testing tool for web applications; Selenium tests run directly in the browser and imitate a user's operations on the browser;
Selenium is installed using Python's package installation command pip, namely pip install selenium; the version number of the Chrome browser is checked, the browser driver .exe file of the corresponding version is downloaded, and it is saved into Python's Scripts directory.
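A quick way to check this configuration is to launch the driver once from Python; the sketch below is illustrative only, and the URL is a placeholder rather than the target website of the embodiment:
from selenium import webdriver

browser = webdriver.Chrome()         # locates the browser driver saved in Python's Scripts directory / on PATH
browser.get('https://example.com')   # placeholder URL used only to confirm that rendering works
print(browser.title)
browser.quit()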
Step F, building a crawler framework based on the Scrapy technology:
f1, setting the recorded request information in the initial function of the spider project file, passing it together with the website request link as parameters of the request function, and accessing the initial link in an iterative loop using the yield operation.
The initial link contains request parameters encoded in Unicode, and the request is initiated through the Request method built into the Scrapy framework; the specific operation is as follows:
yield Request(url=self.start_urls, callback=self.parse_origin, meta={'tag': 0}, dont_filter=True)
f2, the request sent by the spider project file reaches the download middleware through the engine and the queue, where an automatic testing technique is introduced in order to acquire the dynamic web page source code and to control the browser in a user-defined way to perform a series of chained operations;
The Selenium toolkit used is first imported, as follows:
from selenium import webdriver;
An object of the Chrome browser used to perform the test is then declared, as follows:
self.browser=webdriver.Chrome();
access is performed again through self.
For the document list page, 600 records are to be displayed, but a single page can only show 5 records, which would require 120 page accesses. However, by using Selenium to change the number of documents displayed per page to 600 and then selecting that option, the document list page only needs to be accessed once to obtain the links of all 600 documents, which greatly improves the execution efficiency of the project; the operation is as follows:
import time
from selenium.webdriver.support.ui import Select  # imports required by the operations below

self.browser.execute_script("""document.querySelector("select.pageSizeSelect option:nth-child(3)").text="600";""")
time.sleep(4)
driver = self.browser.find_element_by_xpath('//div[@class="WS_my_pages"]/select[@class="pageSizeSelect"]')
sel = Select(driver)
sel.select_by_visible_text('600')
f3, defining an item class, in which the collection name of the stored data is defined and a Field class variable is declared for each recorded field of the required data;
According to the division of the document detail page information, the following fields are declared in the item class:
title=Field()
release=Field()
views=Field()
court=Field()
type=Field()
prelude=Field()
parties=Field()
justification=Field()
end=Field()
chief=Field()
judge=Field()
time=Field()
assistant=Field()
clerk=Field()
In addition, a variable collection recording the collection name is declared as 'wenshu'.
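Putting the declarations above together, the item class would look roughly as follows; this is a reconstruction from the listed fields and names, not the patent's verbatim code:
from scrapy import Item, Field

class WenshuItem(Item):
    collection = 'wenshu'   # name of the MongoDB collection that will store the documents
    title = Field()
    release = Field()
    views = Field()
    court = Field()
    type = Field()
    prelude = Field()
    parties = Field()
    justification = Field()
    end = Field()
    chief = Field()
    judge = Field()
    time = Field()
    assistant = Field()
    clerk = Field()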
F4, the callback parameter in the request function specifies the callback function corresponding to the response result; the operations in the callback function include:
1) Parsing the source web page: the list-page links are obtained in order to access deeper pages, i.e. every time a link is obtained a new request is issued through the yield iteration operation, with the callback parameter pointing to another callback function;
The initial request function Request has a parameter callback, which is used to return the response result of the currently accessed link as a parameter to the specified parsing function, namely parse_origin.
The set of all document links is acquired through an XPath path, as follows:
urls = response.xpath('//a[@class="caseName"]/@href').extract()
Then, by looping over the link set combined with the yield iteration operation, the detail pages of all documents can be accessed in turn, as follows:
for url in urls:
    # build the full detail-page link from the extracted href (the derivation is assumed here for completeness)
    target_url = response.urljoin(url)
    yield Request(url=target_url, callback=self.parse_detail, meta={'tag': 1}, dont_filter=False)
The callback function becomes parse_detail, i.e. the response result is handed to the detail page parsing function of the new document.
2) Parsing the detail page: the useful information contained in the response result is screened according to the recorded fields of the required data and the corresponding XPath path or CSS selector; an object of the defined item class is declared to store the corresponding field values, which are then passed to the item pipeline through the yield iteration and stored in the database.
An object of the item class is declared, item = WenshuItem(), and the corresponding field data are then screened through an XPath path or a CSS selector, as follows:
box = response.xpath('//div[@class="PDF_box"]')
item['title'] = box.xpath('./div[@class="PDF_title"]/text()').extract_first().replace('\n', '')
item['court'] = box.xpath('./div[@class="PDF_pox"]/div[1]/text()').extract_first()
Because the writing of the documents is not standardized, the other fields are extracted from the response result with for loops and conditional judgments that intercept and match the text; finally the item object of each document is emitted through the yield item operation and passed into the item pipeline, which executes the subsequent operation of storing it in the database.
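As an illustration only, the loop-and-condition matching for these irregular fields might look like the sketch below; the label strings and the paragraph structure are assumptions made for the example and are not taken from the patent:
# scan the rendered text blocks of the document body and match label keywords
paragraphs = [t.strip() for t in box.xpath('.//text()').extract() if t.strip()]
for text in paragraphs:
    if text.startswith('审判长'):        # presiding judge
        item['chief'] = text[len('审判长'):].strip()
    elif text.startswith('审判员'):      # judge
        item['judge'] = text[len('审判员'):].strip()
    elif text.startswith('法官助理'):    # judge assistant
        item['assistant'] = text[len('法官助理'):].strip()
    elif text.startswith('书记员'):      # court clerk
        item['clerk'] = text[len('书记员'):].strip()
yield item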
First, the MongoDB toolkit is imported (import pymongo), the database under the specified path is connected, and the selected database name is specified, as follows:
self.client = pymongo.MongoClient(self.mongo_uri)
self.db = self.client[self.mongo_db]
Then an insert operation is performed on the item object; in order to avoid the problem of mismatched object types, it must first be determined whether the item object belongs to the WenshuItem class, as follows:
if isinstance(item, WenshuItem):
    self.db[item.collection].insert(dict(item))
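In a Scrapy project this storage logic normally lives in an item pipeline class; the sketch below uses the pipeline name and settings that appear in this embodiment, but its method bodies are assumed rather than quoted from the patent:
import pymongo

from wenshu.items import WenshuItem   # assumed module path for the item class

class WsMongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # MONGO_URI and MONGO_DB come from the settings configured in step F5
        return cls(mongo_uri=crawler.settings.get('MONGO_URI'),
                   mongo_db=crawler.settings.get('MONGO_DB'))

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        if isinstance(item, WenshuItem):
            # insert() mirrors the code above; newer pymongo versions prefer insert_one()
            self.db[item.collection].insert(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()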
f5, configuring parameters in the project setting file, including the use priority of the download middleware and the pipeline middleware, the database address and the database name, the timeout time, the number of accessed web pages and the number of concurrent requests.
Parameters required in the project are configured in the setting file, so that the project is convenient to manage, and the robustness of the project is enhanced. Some of these parameters include:
Number of pages visited: MAX_PAGE = 2
Number of concurrent accesses: CONCURRENT_REQUESTS = 3
Selenium test timeout: SELENIUM_TIMEOUT = 60
Access path of MongoDB: MONGO_URI = 'localhost'
Name of the MongoDB storage database: MONGO_DB = 'test'
Item pipeline usage priority: ITEM_PIPELINES = {
'wenshu.pipelines.WsMongoPipeline': 300,
'wenshu.pipelines.MongoPipeline': 302}
Download middleware priority: DOWNLOADER_MIDDLEWARES = {
'wenshu.middlewares.WenshuSeleniumMiddleware': 543,
'scrapy_splash.SplashCookiesMiddleware': 723,
'scrapy_splash.SplashMiddleware': 725}
If a configured parameter value is to be acquired, the name of the corresponding parameter is used; for example, the number of page accesses can be read into a variable such as pageSize, as in the sketch below.
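A minimal sketch of reading these settings inside the spider, assuming the standard Scrapy settings API (the exact statements used in the original project are not reproduced in the text, so the lines below are illustrative):
pageSize = self.settings.get('MAX_PAGE')                      # e.g. 2, as configured above
mongo_uri = self.settings.get('MONGO_URI')                    # e.g. 'localhost'
selenium_timeout = self.settings.getint('SELENIUM_TIMEOUT')   # e.g. 60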
While embodiments of the invention have been disclosed above, it is not intended to be limited to the uses set forth in the specification and examples. It can be applied to all kinds of fields suitable for the present invention. Additional modifications will readily occur to those skilled in the art. It is therefore intended that the invention not be limited to the exact details and illustrations described and illustrated herein, but fall within the scope of the appended claims and equivalents thereof.

Claims (9)

1. A data crawling method with an automatic test function based on an asynchronous processing framework comprises the following steps:
A. determining information required for requesting a target website
The method comprises a target website request link, a user agent, a request mode and a request parameter;
B. determining web page loading characteristics
Checking a webpage source code, and determining whether the source code is consistent with the content presented by the current webpage;
C. determining code segment regions to crawl data
Positioning the position of the data needing to be crawled of the webpage and information of each field;
D. deploying unstructured database information
Determining an unstructured database address, a port and a database name for storing crawl data;
E. configuring a Selenium automatic test tool
Selenium is an automated testing tool for web applications; Selenium tests run directly in the browser and imitate a user's operations on the browser;
installing a Selenium toolkit and a browser driver of a corresponding version;
F. building a crawler framework based on the Scrapy technology
The Scrapy framework is a fast, high-level web crawling framework for Python; its modules are loosely coupled and highly extensible, and it is used to crawl websites and extract structured data from web pages.
2. The data crawling method with automatic test function based on asynchronous processing framework as claimed in claim 1, wherein:
step A, the step of determining the information required by the request target website comprises the following steps:
a1, opening a browser developer mode of a target website, clicking a 'Network' tab, and refreshing a current page;
a2, clicking an item consistent with the path of the navigation bar of the page browser;
a3, recording information of website request links, user agents, request modes and request parameters on the right side of the developer mode window.
3. The data crawling method with automatic test function based on asynchronous processing framework as claimed in claim 1, wherein:
step B, the step of determining the webpage loading characteristics comprises the following steps:
b1, opening the source code of the target webpage;
b2, comparing the data to be crawled in the target web page with the content of the corresponding tags in the source code and judging whether they are the same; if they are the same, the front-end web page is static; if they are different, it is a dynamic web page rendered by JavaScript scripts and possibly encryption algorithms.
4. The data crawling method with automatic test function based on asynchronous processing framework as claimed in claim 3, wherein:
Step B2, when the target web page is determined to be a dynamic web page rendered by JavaScript and encryption algorithms, an automatic testing technique needs to be introduced into the download middleware before the request links of the target web page and related web pages are downloaded, so that the result of the web page rendered by the JavaScript or encryption algorithms is returned.
5. The data crawling method with automatic test function based on asynchronous processing framework as claimed in claim 1, wherein:
step C, the code segment area of the crawling data is determined to comprise the following steps:
c1, opening a browser developer mode of the target website, clicking an Elements tab, and displaying the webpage source code after script rendering;
c2, finding the data needed to be crawled in turn through the function of automatic code positioning of the browser, declaring field names respectively, and recording positioning elements corresponding to the code field areas.
6. The data crawling method with automatic test function based on asynchronous processing framework as claimed in claim 1, wherein:
step D, the information of the deployment unstructured database comprises the following steps:
d1, the unstructured database may be deployed on a local computer or on a server; the database can be connected to as long as the database address and the port number designated during deployment are known;
d2, connecting the deployed unstructured database, creating the database used to store the crawled data and recording its name.
7. The data crawling method with automatic test function based on asynchronous processing framework as claimed in claim 1, wherein:
Step F, the Scrapy framework is divided into the following parts:
the Engine is mainly responsible for transmitting data and signals among different modules of the whole system;
an Item, which defines the data structure of the crawled information;
the Scheduler receives the request sent by the engine and adds the request into the queue;
the Downloader downloads the webpage content sent by the engine and returns the webpage content;
the spider Spiders define crawling logic and analysis rules and generate extraction results and new requests;
an Item pipe Item Pipeline, which is responsible for processing results extracted from the web page by the spider, performing data cleansing and storage, and the like;
middleware Middlewares, comprising download middleware and spider middleware, are hooks positioned between the engine and the downloader and between the engine and the spiders; they implement customized request and download extensions and filter the returned response results.
8. The data crawling method with automatic test function based on asynchronous processing framework as claimed in claim 1 or 7, wherein:
Step F, building the crawler framework based on the Scrapy technology:
f1, setting the recorded request information in the initial function of the spider project file, passing it together with the website request link as parameters of the request function, and cyclically accessing the initial link using the iterative operation;
f2, the request sent by the spider project file reaches the download middleware through the engine and the queue, where an automatic testing technique is introduced in order to acquire the dynamic web page source code and to control the browser in a user-defined way to perform a series of chained operations;
f3, defining an item class, in which the collection name of the stored data is defined and a Field class variable is declared for each recorded field of the required data;
f4, a callback function corresponding to the response result is appointed by a callback parameter in the request function;
f5, configuring parameters in the setting file of the project, including the use priority of the downloading middleware and the pipeline middleware, the database address and the database name, the timeout time, the number of accessed web pages and the number of concurrent requests.
9. The data crawling method with automatic test function based on asynchronous processing framework as claimed in claim 8, wherein:
the operation in the callback function in step F4 includes:
1) analyzing a source webpage, acquiring links of a list page to perform deeper page access, namely requesting a new link through iterative operation every time a link is acquired, and enabling a callback parameter to point to other callback functions;
2) and analyzing the detail page, screening useful information contained in the response result according to each recorded field of the required data and the corresponding positioning element, declaring a defined item class object to store a corresponding field value, and then transmitting the field value to an item pipeline through iterative operation to store the field value in a database.
CN202110059894.1A 2020-07-30 2021-01-18 Asynchronous processing framework-based data crawling method with automatic testing function Pending CN112612943A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010747535.0A CN111859075A (en) 2020-07-30 2020-07-30 Asynchronous processing framework-based data crawling method with automatic testing function
CN2020107475350 2020-07-30

Publications (1)

Publication Number Publication Date
CN112612943A true CN112612943A (en) 2021-04-06

Family

ID=72945639

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202010747535.0A Pending CN111859075A (en) 2020-07-30 2020-07-30 Asynchronous processing framework-based data crawling method with automatic testing function
CN202110059894.1A Pending CN112612943A (en) 2020-07-30 2021-01-18 Asynchronous processing framework-based data crawling method with automatic testing function

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN202010747535.0A Pending CN111859075A (en) 2020-07-30 2020-07-30 Asynchronous processing framework-based data crawling method with automatic testing function

Country Status (1)

Country Link
CN (2) CN111859075A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116719986A (en) * 2023-08-10 2023-09-08 深圳传趣网络技术有限公司 Python-based data grabbing method, device, equipment and storage medium

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113515681A (en) * 2021-04-30 2021-10-19 广东科学技术职业学院 Real estate data crawler method and device based on script framework
CN114969474A (en) * 2022-03-31 2022-08-30 安徽希施玛数据科技有限公司 Webpage data acquisition method, webpage data acquisition device and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109033115A (en) * 2017-06-12 2018-12-18 广东技术师范学院 A kind of dynamic web page crawler system
CN110147476A (en) * 2019-04-12 2019-08-20 深圳壹账通智能科技有限公司 Data crawling method, terminal device and computer readable storage medium based on Scrapy

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109033115A (en) * 2017-06-12 2018-12-18 广东技术师范学院 A kind of dynamic web page crawler system
CN110147476A (en) * 2019-04-12 2019-08-20 深圳壹账通智能科技有限公司 Data crawling method, terminal device and computer readable storage medium based on Scrapy

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ANPEDESTRIAN: "scrapy+selenium之中国裁判文书网文书爬取" (Crawling documents from China Judgements Online with scrapy+selenium), 《CSDN 博客》 (CSDN Blog) *
游攀利 et al.: "基于Scrapy的水利数据爬虫设计与实现" (Design and implementation of a Scrapy-based water conservancy data crawler), 《水利水电快报》 (Express Water Resources & Hydropower Information) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116719986A (en) * 2023-08-10 2023-09-08 深圳传趣网络技术有限公司 Python-based data grabbing method, device, equipment and storage medium
CN116719986B (en) * 2023-08-10 2023-12-26 深圳传趣网络技术有限公司 Python-based data grabbing method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN111859075A (en) 2020-10-30

Similar Documents

Publication Publication Date Title
US20210294727A1 (en) Monitoring web application behavior from a browser using a document object model
Khalil et al. RCrawler: An R package for parallel web crawling and scraping
CN112612943A (en) Asynchronous processing framework-based data crawling method with automatic testing function
CN112637361B (en) Page proxy method, device, electronic equipment and storage medium
US11263062B2 (en) API mashup exploration and recommendation
US8812551B2 (en) Client-side manipulation of tables
Jarmul et al. Python web scraping
US9122484B2 (en) Method and apparatus for mashing up web applications
CN110147476A (en) Data crawling method, terminal device and computer readable storage medium based on Scrapy
Hajba Website Scraping with Python
US11785039B2 (en) Scanning web applications for security vulnerabilities
US10114617B2 (en) Rapid visualization rendering package for statistical programming language
Berlin et al. To re-experience the web: A framework for the transformation and replay of archived web pages
Behfarshad et al. Hidden-web induced by client-side scripting: An empirical study
CN109471966B (en) Method and system for automatically acquiring target data source
CN112182338A (en) Monitoring method and device for hosting platform
Zochniak et al. Performance comparison of observer design pattern implementations in javascript
Kaczmarek et al. Harvesting deep web data through produser involvement
CN104778070B (en) Hidden variable abstracting method and equipment and information extracting method and equipment
Méndez Lobato SEO Analysis and its effects on Web Positioning
Koder Increasing Full Stack Development Productivity via Technology Selection
Salama “Down With Regression!”–Generating Test Suites for the Web
Wu et al. A web data extraction description language and its implementation
Ast et al. The SWAC Approach for Sharing a Web Application’s Codebase Between Server and Client
Ast et al. Efficient development of progressively enhanced web applications by sharing presentation and business logic between server and client

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210406