CN111859075A - Asynchronous processing framework-based data crawling method with automatic testing function - Google Patents

Asynchronous processing framework-based data crawling method with automatic testing function

Info

Publication number
CN111859075A
CN111859075A CN202010747535.0A
Authority
CN
China
Prior art keywords
data
webpage
request
automatic test
browser
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010747535.0A
Other languages
Chinese (zh)
Inventor
康辉
孙鑫
赵旭
李佳辉
卢凌锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jilin University
Original Assignee
Jilin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jilin University filed Critical Jilin University
Priority to CN202010747535.0A priority Critical patent/CN111859075A/en
Publication of CN111859075A publication Critical patent/CN111859075A/en
Priority to CN202110059894.1A priority patent/CN112612943A/en
Legal status: Pending (current)


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention belongs to the technical field of web crawlers and relates to a data crawling method with an automatic testing function based on an asynchronous processing framework. Building on a relatively mature web crawler framework, the invention targets crawling tasks on websites that adopt anti-crawler strategies from the beginning of their design, particularly websites whose data is dynamically generated through scripts. An automatic testing technology is introduced at the point where the initial request link, sent by the spider project file, reaches the download middleware through the engine and the queue, thereby acquiring the source code of dynamic webpages. The webpage response obtained by the method is the result after script rendering, and the browser can be custom-controlled through the automatic testing technology to complete a series of chain operations. This saves the developer the architecture analysis of the target website, reduces the project development difficulty, frees more time for webpage parsing, improves the quality of the crawled data, and shortens the project development period.

Description

Asynchronous processing framework-based data crawling method with automatic testing function
Technical Field
The invention belongs to the technical field of web crawlers, relates to a data crawling method based on an asynchronous processing framework, and particularly relates to a data crawling method with an automatic testing function based on the asynchronous processing framework.
Background
With the advent of the big data age, the position of web crawlers in the internet becomes more and more important. The data in the internet is massive, and how to automatically and efficiently acquire the information we are interested in is an important problem, which crawler technology was created to solve. As technologies such as big data analysis, data mining, and natural language processing in the field of artificial intelligence continue to develop, their rapid progress rests on the availability of large amounts of high-quality data. The web crawler not only solves the problem of data acquisition, it also extracts structured data from unstructured webpages, and breakthroughs in this technology play a significant role.
Web crawlers can be divided into personal crawlers and enterprise crawlers, but whether individual or enterprise, web crawlers are an integral part of many project processes. With the continuous spread of web crawler applications, a large set of open-source crawler frameworks has emerged, such as the Pyspider framework and the Scrapy framework, which are by now quite mature.
With the development of web crawler technology, a crawler program in actual development is often a distributed crawler deployed on servers in order to improve efficiency. Because multiple computers with different physical addresses run the program simultaneously, deduplication of the access links in the request queue becomes a problem that must be considered first; compared with traditional crawler tools, existing crawler frameworks provide a scheduler structure that solves this problem. On the other hand, to protect the privacy of site resources, the big data-source websites add anti-crawler strategies from the beginning of website design: most of a website's data is dynamically generated through javascript scripts, and the server can identify whether a script is accessing the resources to judge whether the visitor is a real user, which places new requirements on existing crawler frameworks.
Disclosure of Invention
The invention aims to solve the above problems and provides a data crawling method with an automatic testing function based on an asynchronous processing framework.
The purpose of the invention is realized by the following technical scheme:
the method comprises the following steps:
A. determining information required for requesting a target website
The method comprises a target website request link, a user agent, a request mode and a request parameter;
B. determining web page loading characteristics
Checking a webpage source code, and determining whether the source code is consistent with the content presented by the current webpage;
C. determining code segment regions to crawl data
Positioning the data to be crawled in the webpage and the information of each field;
D. deploying unstructured database information
Determining an unstructured database address, a port and a database name for storing crawl data;
E. configuring a Selenium automatic test tool
Selenium is an automatic testing tool for testing website applications; Selenium tests can run directly in a browser and imitate a user's operation of the browser;
installing a Selenium toolkit and a browser driver of a corresponding version;
F. building a crawler framework based on the Scrapy technology
The Scrapy framework is a fast, high-level web crawling framework for Python; its modules are loosely coupled and highly extensible, and it is used to crawl websites and extract structured data from webpages.
Further, in step a, the determining information required by the request target website includes the following steps:
a1, opening a browser developer mode of a target website, clicking a 'Network' tab, and refreshing a current page;
a2, clicking the item whose path is consistent with the browser navigation bar of the page;
a3, recording information of website request links, user agents, request modes and request parameters on the right side of the developer mode window.
Further, in step B, said determining the loading characteristics of the web page includes the following steps:
b1, opening the source code of the target webpage;
b2, comparing the data to be crawled in the target webpage with the content of the corresponding label in the source code and judging whether they are the same; if they are the same, the front-end webpage is static; if they differ, the webpage is a dynamic webpage rendered by javascript scripts and possibly encryption algorithms.
Further, in step C, the determining the area where the crawling data is located includes the following steps:
c1, opening a browser developer mode of the target website, clicking an Elements tab, and displaying the webpage source code after script rendering;
c2, finding the data needed to be crawled in turn through the function of automatic code positioning of the browser, declaring field names respectively, and recording positioning elements corresponding to the code field areas.
Further, in step D, the deploying unstructured database information comprises the following steps:
d1, the unstructured database may be deployed on a local computer or a server; the database can be connected to as long as the database address and the designated port number set during deployment are known;
d2, connecting the deployed unstructured database, creating the database used for storing the crawled data, and recording its name.
further, in step F, the Scapy framework is divided into the following parts:
the Engine is mainly responsible for transmitting data and signals among different modules of the whole system;
an Item, which defines the data structure of the crawled information;
the Scheduler receives the request sent by the engine and adds the request into the queue;
the Downloader downloads the webpage content sent by the engine and returns the webpage content;
the spider Spiders define crawling logic and analysis rules and generate extraction results and new requests;
an Item pipe Item Pipeline, which is responsible for processing results extracted from the web page by the spider, performing data cleansing and storage, and the like;
middleware Middlewares, comprising download middleware and spider middleware, which are hook structures positioned between the engine and the downloader and between the engine and the spiders; they implement customized request and download extensions and filter returned response results.
Further, in step F, the crawler framework based on the Scrapy technology is built as follows:
f1, setting the recorded request information in the initial function of the spider project file, passing it together with the website request link as parameters of the request function, and iterating over the initial links using a yield operation;
f2, the request sent by the spider project file reaches the download middleware through the engine and the queue, where an automatic testing technology is introduced to acquire dynamic webpage source code and to control the browser through a customized series of chain operations;
f3, defining a project class, wherein the class defines the collection name of the stored data, and defines a Field class variable for each Field of the recorded required data;
f4, a callback function corresponding to the response result is appointed by a callback parameter in the request function;
f5, configuring parameters in the setting file of the project, including the use priority of the downloading middleware and the pipeline middleware, the database address and the database name, the timeout time, the number of accessed web pages and the number of concurrent requests.
Further, the operation in the callback function includes:
1) analyzing the source webpage and acquiring the links of the list page for deeper page access, i.e., every time a link is acquired a new request is issued through a yield operation, with the callback parameter pointing to another callback function;
2) analyzing the detail page: screening the useful information contained in the response result according to each recorded field of the required data and its corresponding positioning element, declaring an object of the defined item class to store the corresponding field values, and then passing it to the item pipeline through a yield operation to be stored in the database.
Further, in step B2, when the target webpage is determined to be a dynamic webpage rendered by javascript and encryption algorithms, an automatic testing technology needs to be introduced into the download middleware before the request links of the target webpage and related webpages are downloaded, so that the download middleware returns the webpage result after javascript or encryption-algorithm rendering.
Compared with prior methods, the invention has the following improvement: by introducing customized operations into the download middleware, the crawler project can handle complex webpages processed by javascript scripts and encryption algorithms without a separate architecture analysis process, reducing the project development difficulty and shortening the development period.
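For concreteness, the following is a minimal sketch of such a download middleware; the class name WenshuSeleniumMiddleware and the SELENIUM_TIMEOUT setting match the configuration shown in the embodiment below, but the patent does not disclose the middleware body, so the implementation details are assumptions based on standard Scrapy and Selenium usage:
from scrapy.http import HtmlResponse
from selenium import webdriver

class WenshuSeleniumMiddleware:
    def __init__(self, timeout=60):
        # One shared browser instance; requires a chromedriver matching the installed Chrome
        self.browser = webdriver.Chrome()
        self.browser.set_page_load_timeout(timeout)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(timeout=crawler.settings.get('SELENIUM_TIMEOUT', 60))

    def process_request(self, request, spider):
        # Render the page in the real browser so javascript-generated content is present,
        # then return the rendered source to the engine as an ordinary response
        self.browser.get(request.url)
        return HtmlResponse(url=request.url, body=self.browser.page_source,
                            encoding='utf-8', request=request)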
Drawings
FIG. 1 is a diagram of an asynchronous processing framework of the present invention incorporating automatic test techniques.
Detailed description of the preferred embodiments
The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. The following examples are presented merely to further understand and practice the present invention and are not to be construed as further limiting the claims of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention provides a data crawling method with an automatic testing function based on an asynchronous processing framework. By combining the Scrapy crawler framework with the Selenium automatic testing technology, it solves the problem that the response result of a front-end dynamic webpage cannot be obtained through the crawler framework alone, so that the webpage can be further parsed to obtain structured data. This saves the developer the architecture analysis of the target website, reduces the project development difficulty, frees more time for webpage parsing, improves the quality of the crawled data, and shortens the project development period.
The following example illustrates a specific embodiment of the present invention and should not be construed as limiting the invention. The target website of this embodiment is China Judgements Online (the "Chinese referee document network"): the task is to crawl all homepage data of a specified court, i.e., all document detail information of that court (the website only allows a user to view 600 records).
A data crawling method with an automatic testing function based on an asynchronous processing framework comprises the following steps.
Step A, the information required by determining the target website request comprises the following steps:
a1, opening the browser developer mode of the target website, clicking the "Network" tab, and refreshing the current page;
a2, clicking the item (generally the first one) whose path is consistent with the browser navigation bar of the page;
a3, recording the website request link, user agent, request mode, and request parameters on the right side of the developer mode window.
Step B, the step of determining the webpage loading characteristics comprises the following steps:
b1, opening the source code of the target webpage;
b2, comparing the data to be crawled in the target webpage with the content of the corresponding label in the source code. Information that the user can see in the webpage does not appear in the webpage source code, so China Judgements Online is judged to be a dynamic webpage rendered by javascript scripts and some encryption algorithms; the true webpage result cannot be returned directly by the downloader, and the webpage source code containing the required data can only be obtained by performing actions with the help of the automatic testing technology.
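As an illustration of this check (not part of the patent; the URL, headers, and marker text are placeholders), the raw source can be fetched and tested for a fragment that is visible in the rendered page:
import requests

url = 'https://wenshu.court.gov.cn/'       # placeholder: the request link recorded in step A
headers = {'User-Agent': 'Mozilla/5.0'}    # placeholder: the user agent recorded in step A
marker = 'some text visible on the page'   # placeholder: content seen in the browser

raw = requests.get(url, headers=headers).text
if marker in raw:
    print('static webpage: data appears in the raw source')
else:
    print('dynamic webpage: data is rendered by scripts, use the automatic test tool')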
Step C, the step of determining the area of the crawled data comprises the following steps:
c1, opening the browser developer mode of the target website and clicking the "Elements" tab, which shows the webpage source code after script rendering;
c2, finding the data and field names to be crawled through the browser's automatic code-positioning function, and recording the xpath path or css selector of the corresponding code field area.
Crawling China Judgements Online involves two pages in total, namely the document list page and the document detail page.
First, the link of each document in the document list page needs to be acquired, and the iterative page visits continue after each link is acquired. The link is recorded as:
Link: //a[@class="caseName"]/@href;
The document detail page is accessed through each document link in the list; each part of the document detail page is acquired, a field name is declared for each piece of information, and its xpath path or css selector is recorded as follows:
Title: ./div[@class="PDF_title"],
Release time: .PDF_cut > div:nth-child(1) > table:nth-child(1) > tr:nth-child(1) > td:nth-child(1),
Browsing amount: .PDF_cut > div:nth-child(1) > table:nth-child(1) > tr:nth-child(1) > td:nth-child(2),
Court: ./div[@class="PDF_pox"]/div[1].
Because the writing of the documents is not standardized, the other fields have no fixed xpath or css expression and must be obtained through loops and condition judgments; these fields comprise the document type, parties, trial cause, trial result, trial length, trial judges, trial time, judge assistants, and clerk.
Step D, the information of the deployment unstructured database comprises the following steps:
d1, deploying the MongoDB database on a local computer, with the database address "localhost" and the designated port number 27017;
d2, connecting the deployed MongoDB database and creating a database named "test" to store the crawled data.
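A minimal sketch of this step, assuming MongoDB is already running locally on the default port (MongoDB creates the database lazily on first write):
import pymongo

client = pymongo.MongoClient('localhost', 27017)
db = client['test']     # database name recorded for storing the crawled data
print(db.name)          # -> test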
Step E, configuring a Selenium automatic test tool
Selenium is an automatic testing tool for testing website applications; Selenium tests can run directly in the browser and simulate a user's operation of the browser;
Selenium is installed using python's package installation command pip: pip install selenium. Then the version number of the Chrome browser is checked, the chromedriver.exe of the corresponding version is downloaded, and it is stored in python's Scripts directory.
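A quick way to verify the installation (an illustrative check, not prescribed by the patent; the URL is an arbitrary reachable page) is to start and quit a browser session:
from selenium import webdriver

browser = webdriver.Chrome()        # fails here if chromedriver is missing or version-mismatched
browser.get('https://www.example.com')
print(browser.title)
browser.quit()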
Step F, building the crawler framework based on the Scrapy technology:
f1, setting the recorded request information in the initial function of the spider project file, passing it together with the website request link as parameters of the request function, and iterating over the initial link with a yield operation.
The initial link contains Unicode-encoded request parameters, and the request is initiated through the Scrapy framework's Request method, specifically:
yield Request(url=self.start_urls, callback=self.parse_origin, meta={'tag': 0}, dont_filter=True)
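For context, a hedged sketch of the spider skeleton around this yield (the patent shows only the yield line; the class and method names other than parse_origin are assumptions):
import scrapy
from scrapy import Request

class WenshuSpider(scrapy.Spider):
    name = 'wenshu'
    # A single Unicode-encoded request link, kept as a plain string to match
    # the url=self.start_urls usage above (the real link is elided)
    start_urls = 'http://wenshu.court.gov.cn/...'

    def start_requests(self):
        yield Request(url=self.start_urls, callback=self.parse_origin,
                      meta={'tag': 0}, dont_filter=True)

    def parse_origin(self, response):
        pass   # list-page parsing, shown under F4 below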
f2, the request sent by the spider project file reaches the download middleware through the engine and the queue, where the automatic testing technology is introduced to acquire the dynamic webpage source code and to control the browser through a customized series of chain operations;
the Selenium kit used was first introduced, operating as follows:
from selenium import webdriver;
the object of the Chrome browser to perform the test is then declared, the operation is as follows:
self.browser=webdriver.Chrome();
Pages are then accessed through self.browser.get().
The document list page holds 600 records in total, and a single page can display only 5 records, which would require 120 page visits. However, by changing the page's single-page display count through a Selenium select operation, the links of all 600 documents can be obtained with a single visit to the document list page, improving the execution efficiency of the project. The operation is as follows:
import time
from selenium.webdriver.support.ui import Select

# Rewrite the third option of the page-size selector to 600 via javascript
self.browser.execute_script("""document.querySelector("select.pageSizeSelect option:nth-child(3)").text="600";""")
time.sleep(4)   # wait for the page to react
# Locate the page-size select element and choose the new 600 option
driver = self.browser.find_element_by_xpath('//div[@class="WS_my_pages"]/select[@class="pageSizeSelect"]')
sel = Select(driver)
sel.select_by_visible_text('600')
f3, defining a project class, wherein the class defines the collection name of the stored data, and defines a Field class variable for each Field of the recorded required data;
depending on the division of the document detail page information, the following fields may be declared in the project class:
title=Field()
release=Field()
views=Field()
court=Field()
type=Field()
prelude=Field()
parties=Field()
justification=Field()
end=Field()
chief=Field()
judge=Field()
time=Field()
assistant=Field()
clerk=Field()
In addition, a variable collection recording the collection name is declared: collection = 'wenshu'.
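Putting these declarations together, a sketch of the complete item class (the wrapper and the class name WenshuItem follow standard Scrapy practice and the pipeline check shown later, rather than text quoted from the patent):
from scrapy import Item, Field

class WenshuItem(Item):
    collection = 'wenshu'   # name of the MongoDB collection used by the pipeline
    title = Field()
    release = Field()
    views = Field()
    court = Field()
    type = Field()
    prelude = Field()
    parties = Field()
    justification = Field()
    end = Field()
    chief = Field()
    judge = Field()
    time = Field()
    assistant = Field()
    clerk = Field()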
f4, specifying the callback function corresponding to the response result through the callback parameter in the request function. The callback functions comprise:
1) Parsing the source webpage: obtaining the list-page links to access deeper pages, i.e., each acquired link is requested anew through a yield iteration, with the callback parameter pointing to another parsing function.
The initial Request has a callback parameter, which delivers the current response result as an argument to the specified parsing function, namely parse_origin.
The set of all document links is acquired through the xpath path, as follows:
urls = response.xpath('//a[@class="caseName"]/@href').extract()
The link set is then looped over and, combined with the yield iteration, the detail pages are accessed in turn:
for url in urls:
    yield Request(url=url, callback=self.parse_detail, meta={'tag': 1}, dont_filter=False)
the callback function becomes parse _ detail, i.e. the response result jumps to the new document parsing function.
2) Parsing the detail page: screening the useful information contained in the response result according to each recorded field of the required data and its corresponding xpath path or css selector, declaring an object of the defined item class to store the corresponding field values, and then passing the object to the item pipeline through a yield iteration to store the data.
An object of the item class is declared and created: item = WenshuItem(). The corresponding field data is then screened through the xpath path or css selector, as follows:
box = response.xpath('//div[@class="PDF_box"]')
item['title'] = box.xpath('./div[@class="PDF_title"]/text()').extract_first().replace('\n', '')   # newline cleanup (the replace() argument is an assumption)
item['court'] = box.xpath('./div[@class="PDF_pox"]/div[1]/text()').extract_first()
Because the writing of the documents is not standardized, the other fields use for loops and condition judgments to match against the response result. Finally the item object of each document is iterated out through a yield item operation and passed on for the subsequent database storage operation.
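As an illustration of this loop-and-condition matching, continuing the snippet above (the labels and cleanup are assumptions; the patent does not list the exact conditions):
# Walk the text lines of the document body and match fields by their labels
for line in box.xpath('.//text()').extract():
    text = line.strip()
    if text.startswith('审判长'):        # presiding judge
        item['chief'] = text[len('审判长'):].strip()
    elif text.startswith('书记员'):      # clerk
        item['clerk'] = text[len('书记员'):].strip()
yield item   # hand the completed item to the item pipeline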
First, the MongoDB toolkit is imported (import pymongo) and the selected database under the specified path is connected, as follows:
self.client=pymongo.MongoClient(self.mongo_uri)
self.db=self.client[self.mongo_db]。
An insert operation is then performed on the item object; it must first be checked whether the item belongs to the WenshuItem class, to avoid mismatched object types. The specific operations are as follows:
if isinstance(item,WenshuItem):
self.db[item.collection].insert(dict(item))。
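A sketch of the pipeline class these fragments belong to (the class name WsMongoPipeline matches the settings below; the surrounding structure is standard Scrapy practice and is assumed rather than quoted from the patent):
import pymongo
from wenshu.items import WenshuItem   # assumed module path for the item class

class WsMongoPipeline:
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection settings configured in f5 below
        return cls(mongo_uri=crawler.settings.get('MONGO_URI'),
                   mongo_db=crawler.settings.get('MONGO_DB'))

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        if isinstance(item, WenshuItem):
            self.db[item.collection].insert(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()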
f5, configuring the parameters in the project's settings file, including the use priority of the download middleware and the pipeline middleware, the database address and database name, the timeout, the number of pages to access, and the number of concurrent requests.
The parameters needed in the project are configured in the settings file, which makes the project easy to manage and enhances its maintainability. Some of these parameters are:
Number of pages visited: MAX_PAGE = 2
Number of concurrent requests: CONCURRENT_REQUESTS = 3
Selenium test timeout: SELENIUM_TIMEOUT = 60
Access path of MongoDB: MONGO_URI = 'localhost'
Database name used by MongoDB: MONGO_DB = 'test'
Item pipeline use priority: ITEM_PIPELINES = {
'wenshu.pipelines.WsMongoPipeline': 300,
'wenshu.pipelines.MongoPipeline': 302}
Download middleware priority: DOWNLOADER_MIDDLEWARES = {
'wenshu.middlewares.WenshuSeleniumMiddleware': 543,
'scrapy_splash.SplashCookiesMiddleware': 723,
'scrapy_splash.SplashMiddleware': 725}
To obtain the value of a configured parameter, the name of the corresponding parameter is used; for example, to obtain the number of pages to visit: pageSize = self.settings.get('MAX_PAGE').
While embodiments of the invention have been disclosed above, the invention is not limited to the uses set forth in the specification and examples; it can be applied to all fields for which it is suitable, and additional modifications will readily occur to those skilled in the art. The invention is therefore not limited to the exact details and illustrations described and shown herein, but falls within the scope of the appended claims and their equivalents.

Claims (9)

1. A data crawling method with an automatic test function based on an asynchronous processing framework comprises the following steps:
A. determining information required for requesting a target website
The method comprises a target website request link, a user agent, a request mode and a request parameter;
B. determining web page loading characteristics
Checking a webpage source code, and determining whether the source code is consistent with the content presented by the current webpage;
C. determining code segment regions to crawl data
Positioning the data to be crawled in the webpage and the information of each field;
D. deploying unstructured database information
Determining an unstructured database address, a port and a database name for storing crawl data;
E. configuring a Selenium automatic test tool
Selenium is an automatic testing tool for testing website applications; Selenium tests can run directly in a browser and imitate a user's operation of the browser;
installing a Selenium toolkit and a browser driver of a corresponding version;
F. building a crawler framework based on the Scrapy technology
The Scrapy framework is a fast, high-level web crawling framework for Python; its modules are loosely coupled and highly extensible, and it is used to crawl websites and extract structured data from webpages.
2. The data crawling method with automatic test function based on asynchronous processing framework as claimed in claim 1, wherein:
step A, the step of determining the information required by the request target website comprises the following steps:
a1, opening a browser developer mode of a target website, clicking a 'Network' tab, and refreshing a current page;
a2, clicking the item whose path is consistent with the browser navigation bar of the page;
a3, recording information of website request links, user agents, request modes and request parameters on the right side of the developer mode window.
3. The data crawling method with automatic test function based on asynchronous processing framework as claimed in claim 1, wherein:
step B, the step of determining the webpage loading characteristics comprises the following steps:
b1, opening the source code of the target webpage;
b2, comparing the data to be crawled in the target webpage with the content of the corresponding label in the source code and judging whether they are the same; if they are the same, the front-end webpage is static; if they differ, the webpage is a dynamic webpage rendered by javascript scripts and possibly encryption algorithms.
4. The data crawling method with automatic test function based on asynchronous processing framework as claimed in claim 1, wherein:
step C, the step of determining the area of the crawled data comprises the following steps:
c1, opening a browser developer mode of the target website, clicking an Elements tab, and displaying the webpage source code after script rendering;
c2, finding the data needed to be crawled in turn through the function of automatic code positioning of the browser, declaring field names respectively, and recording positioning elements corresponding to the code field areas.
5. The data crawling method with automatic test function based on asynchronous processing framework as claimed in claim 1, wherein:
step D, the information of the deployment unstructured database comprises the following steps:
d1, the unstructured database may be deployed on a local computer or a server; the database can be connected to as long as the database address and the designated port number set during deployment are known;
d2, connecting the deployed unstructured database, creating the database used to store the crawled data and recording its name.
6. The data crawling method with automatic test function based on asynchronous processing framework as claimed in claim 1, wherein:
step F, dividing the Scrapy framework into the following parts:
the Engine is mainly responsible for transmitting data and signals among different modules of the whole system;
an Item, which defines the data structure of the crawled information;
the Scheduler receives the request sent by the engine and adds the request into the queue;
the Downloader downloads the webpage content sent by the engine and returns the webpage content;
the spider Spiders define crawling logic and analysis rules and generate extraction results and new requests;
an Item pipe Item Pipeline, which is responsible for processing results extracted from the web page by the spider, performing data cleansing and storage, and the like;
middleware Middlewares, comprising download middleware and spider middleware, which are hook structures positioned between the engine and the downloader and between the engine and the spiders; they implement customized request and download extensions and filter returned response results.
7. The data crawling method with automatic test function based on asynchronous processing framework as claimed in claim 1 or 6, wherein:
step F, building the crawler framework based on the Scrapy technology:
f1, setting the recorded request information in the initial function of the spider project file, passing it together with the website request link as parameters of the request function, and iterating over the initial links using a yield operation;
f2, the request sent by the spider project file reaches the download middleware through the engine and the queue, where an automatic testing technology is introduced to acquire dynamic webpage source code and to control the browser through a customized series of chain operations;
f3, defining a project class, wherein the class defines the collection name of the stored data, and defines a Field class variable for each Field of the recorded required data;
f4, a callback function corresponding to the response result is appointed by a callback parameter in the request function;
f5, configuring parameters in the setting file of the project, including the use priority of the downloading middleware and the pipeline middleware, the database address and the database name, the timeout time, the number of accessed web pages and the number of concurrent requests.
8. The data crawling method with automatic test function based on asynchronous processing framework as claimed in claim 7, wherein:
the operation in the callback function comprises the following steps:
1) analyzing the source webpage and acquiring the links of the list page for deeper page access, i.e., every time a link is acquired a new request is issued through a yield operation, with the callback parameter pointing to another callback function;
2) analyzing the detail page: screening the useful information contained in the response result according to each recorded field of the required data and its corresponding positioning element, declaring an object of the defined item class to store the corresponding field values, and then passing it to the item pipeline through a yield operation to be stored in the database.
9. The data crawling method with automatic test function based on asynchronous processing framework as claimed in claim 3, wherein:
step B2, when the target webpage is determined to be a dynamic webpage rendered by javascript and encryption algorithms, an automatic testing technology needs to be introduced into the download middleware before the request links of the target webpage and related webpages are downloaded, so that the download middleware returns the webpage result after javascript or encryption-algorithm rendering.
CN202010747535.0A 2020-07-30 2020-07-30 Asynchronous processing framework-based data crawling method with automatic testing function Pending CN111859075A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010747535.0A CN111859075A (en) 2020-07-30 2020-07-30 Asynchronous processing framework-based data crawling method with automatic testing function
CN202110059894.1A CN112612943A (en) 2020-07-30 2021-01-18 Asynchronous processing framework-based data crawling method with automatic testing function

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010747535.0A CN111859075A (en) 2020-07-30 2020-07-30 Asynchronous processing framework-based data crawling method with automatic testing function

Publications (1)

Publication Number Publication Date
CN111859075A true CN111859075A (en) 2020-10-30

Family

ID=72945639

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202010747535.0A Pending CN111859075A (en) 2020-07-30 2020-07-30 Asynchronous processing framework-based data crawling method with automatic testing function
CN202110059894.1A Pending CN112612943A (en) 2020-07-30 2021-01-18 Asynchronous processing framework-based data crawling method with automatic testing function

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN202110059894.1A Pending CN112612943A (en) 2020-07-30 2021-01-18 Asynchronous processing framework-based data crawling method with automatic testing function

Country Status (1)

Country Link
CN (2) CN111859075A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113515681A * 2021-04-30 2021-10-19 广东科学技术职业学院 Real estate data crawler method and device based on Scrapy framework
CN114969474A (en) * 2022-03-31 2022-08-30 安徽希施玛数据科技有限公司 Webpage data acquisition method, webpage data acquisition device and storage medium

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114297462A (en) * 2021-12-13 2022-04-08 中国电子科技集团公司第二十八研究所 Intelligent website asynchronous sequence data acquisition method based on dynamic self-adaption
CN116719986B (en) * 2023-08-10 2023-12-26 深圳传趣网络技术有限公司 Python-based data grabbing method, device, equipment and storage medium
CN116975408A (en) * 2023-08-11 2023-10-31 国网吉林省电力有限公司经济技术研究院 Automatic grabbing method for rural industrial database website based on manual simulation

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109033115B (en) * 2017-06-12 2021-02-19 广东技术师范学院 Dynamic webpage crawler system
CN110147476A (en) * 2019-04-12 2019-08-20 深圳壹账通智能科技有限公司 Data crawling method, terminal device and computer readable storage medium based on Scrapy


Also Published As

Publication number Publication date
CN112612943A (en) 2021-04-06

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20201030