CN110516139B - Crawler system and method - Google Patents


Info

Publication number
CN110516139B
Authority
CN
China
Prior art keywords
crawling
server
crawler
parameters
cluster server
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910835029.4A
Other languages
Chinese (zh)
Other versions
CN110516139A (en)
Inventor
宋海伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Ctrip Business Co Ltd
Original Assignee
Shanghai Ctrip Business Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Ctrip Business Co Ltd filed Critical Shanghai Ctrip Business Co Ltd
Priority to CN201910835029.4A priority Critical patent/CN110516139B/en
Publication of CN110516139A publication Critical patent/CN110516139A/en
Application granted granted Critical
Publication of CN110516139B publication Critical patent/CN110516139B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/02Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network
    • H04L67/1001Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04MTELEPHONIC COMMUNICATION
    • H04M11/00Telephonic communication systems specially adapted for combination with other electrical systems
    • H04M11/06Simultaneous speech and data transmission, e.g. telegraphic transmission over the same conductors
    • H04M11/062Simultaneous speech and data transmission, e.g. telegraphic transmission over the same conductors using different frequency bands for speech and other data
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)
  • Computer And Data Communications (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a crawler system and a method, wherein the crawler system comprises a client and a server; the server side comprises a load balancing server and a cluster server; the client is used for generating crawling parameters according to access requirements of a preset crawling target and a target website, and sending the crawling parameters to the load balancing server; the load balancing server is used for generating a crawler task according to the crawling parameters, distributing crawler tasks to the cluster servers; the cluster server is used for crawling target data of the target website according to the crawler task. According to the invention, the client user only needs to set the crawling parameters at the client according to the access requirements of the preset crawling target and the target website, and the generated crawler tasks are uniformly processed by the cluster server, so that the system is easy to maintain, the research and development time is reduced, the repeated research and development workload is avoided, and the research and development cost is reduced.

Description

Crawler system and method
Technical Field
The invention relates to the field of information retrieval, in particular to a crawler system and a crawler method.
Background
With the rapid development of networks, the internet has become a carrier of vast amounts of information, and how to efficiently extract and utilize such information has become a great challenge.
The web crawler is a program or script for automatically capturing internet information according to a certain rule, and is widely used in internet search engines or other similar websites, and can automatically collect the contents of all the pages of the websites which can be accessed by the web crawler so as to acquire or update the contents and the retrieval modes of the websites.
At present, enterprises make increasing use of information on the internet. For cost reasons, however, a crawling function is usually developed independently each time a specific business need arises. Because target websites differ in complexity and requirements, no general-purpose crawler solution is available, and a great deal of time and labor is wasted. The independently developed crawler functions are repetitive and cumbersome, which hinders upgrading and maintenance, while the commercial crawler systems on the market are expensive and poorly tailored to an enterprise's specific business.
Disclosure of Invention
The invention aims to overcome the defects of the prior art, namely that independently developed crawler functions are repetitive and cumbersome and hinder upgrading and maintenance, and that existing crawler systems are costly and poorly tailored to specific business needs.
The invention solves the technical problems by the following technical scheme:
the crawler system comprises a client and a server; the server comprises a load balancing server and a cluster server;
the client is used for generating crawling parameters according to access requirements of a preset crawling target and a target website and sending the crawling parameters to the load balancing server;
the load balancing server is used for generating a crawler task according to the crawling parameters and distributing the crawler task to the cluster server;
and the cluster server is used for crawling target data of the target website according to the crawler task.
Preferably, the cluster server comprises a plurality of working servers, and the load balancing server is further used for generating prompt information when the number of the crawler tasks exceeds a preset number, wherein the prompt information comprises information for prompting to increase the number of the working servers.
Preferably, the cluster server is further configured to simulate an input operation of a real user by using an automated testing tool when receiving a verification code input request of the target website;
and/or the cluster server is further used for connecting the target website through ADSL dialing;
and/or the cluster server is further used for modifying the crawling parameters by utilizing a packet capture tool and generating the crawler task according to the modified crawling parameters.
Preferably, the client is further configured to set a preset time, and send the crawling parameter to the load balancing server according to the preset time;
and/or the client is further used for setting a crawling mode, the crawling mode comprises webpage element crawling and/or interface crawling, and the crawling parameters comprise the crawling mode.
Preferably, the server further comprises a monitoring server, the cluster server is used for generating process data according to the crawler task and sending the process data to the monitoring server, and the monitoring server is used for displaying the process data;
and/or the monitoring server is further used for sending a control command to the cluster server, and the cluster server is used for starting or stopping the crawler task according to the control command.
A crawler method, the crawler method comprising:
the client generates crawling parameters according to access requirements of a preset crawling target and a target website, and sends the crawling parameters to the load balancing server;
the load balancing server generates a crawler task according to the crawling parameters and distributes the crawler task to a cluster server;
and the cluster server crawls target data of the target website according to the crawler task.
Preferably, the cluster server includes a plurality of working servers, and the step of generating the crawler task by the load balancing server according to the crawling parameter further includes:
and the load balancing server generates prompt information when the number of the crawler tasks exceeds a preset number, and the prompt information comprises information for prompting to increase the number of the working servers.
Preferably, the step of crawling, by the cluster server, the target data of the target website according to the crawler task includes:
when receiving a verification code input request of the target website, the cluster server simulates input operation of a real user by using an automatic testing tool;
and/or, before the step of the cluster server crawling the target data of the target website according to the crawler task, the method comprises:
the cluster server connects to the target website through ADSL (asymmetric digital subscriber line) dialing;
and/or the step of generating the crawler task by the load balancing server according to the crawling parameters comprises the following steps:
and the cluster server modifies the crawling parameters by utilizing a packet capture tool and generates the crawler task according to the modified crawling parameters.
Preferably, the step of sending the crawling parameter to the load balancing server includes: the client sets preset time and sends the crawling parameters to the load balancing server according to the preset time;
and/or the client side also sets a crawling mode, wherein the crawling mode comprises webpage element crawling and/or interface crawling, and the crawling parameters comprise the crawling mode.
Preferably, the step of crawling, by the cluster server, the target data of the target website according to the crawler task further includes:
the cluster server generates process data according to the crawler task and sends the process data to a monitoring server;
the monitoring server displays the process data; and/or the monitoring server sends a control command to the cluster server; and the cluster server starts or stops the crawler task according to the control command.
The invention has the positive progress effects that:
according to the invention, the client sets the crawling parameters according to the preset crawling target and the access requirements of the target website, so the target data of the target website can be crawled in a targeted manner and the crawling speed is improved. The load balancing server generates the crawler task according to the crawling parameters and distributes it to the cluster server, and the cluster server crawls the target data of the target website according to the crawler task, so that a unified crawler system is realized. The user only needs to set the crawling parameters at the client, and the generated crawler tasks are uniformly processed by the cluster server, so that the system is easy to maintain, the research and development duration is reduced, repeated research and development work is avoided, and the research and development cost is reduced.
Drawings
FIG. 1 is a block diagram of a crawler system according to embodiment 1 of the present invention.
FIG. 2 is a flow chart of the crawler method of embodiment 2 of the present invention.
FIG. 3 is a flowchart of step 205 of the crawler method of embodiment 2 of the present invention.
Detailed Description
The invention is further illustrated by means of the following examples, which are not intended to limit the scope of the invention.
Example 1
The embodiment provides a crawler system, as shown in fig. 1, which comprises a client 1 and a server 2; the server side 2 includes a load balancing server 21, a cluster server 22 and a monitoring server 23.
The server side 2 is set as a Service cluster based on Web Service.
The client 1 is configured to generate crawling parameters according to access requirements of a preset crawling target and a target website 3, and send the crawling parameters to the load balancing server 21.
The load balancing server 21 is configured to generate a crawler task according to the crawling parameter, and distribute the crawler task to the cluster server 22.
The cluster server 22 is used for crawling target data of the target website 3 according to the crawler task.
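The division of work among the client 1, the load balancing server 21 and the cluster server 22 can be pictured with a minimal sketch. Everything named here is an assumption made for illustration: the field names of `CrawlParams`, the class names, and in particular the round-robin distribution policy, which the patent does not specify.

```python
from dataclasses import dataclass, field
from itertools import cycle
from typing import Dict, List


@dataclass(frozen=True)
class CrawlParams:
    """Crawling parameters produced by the client (field names are assumptions)."""
    target_url: str
    crawl_target: str        # what to extract from the target website
    mode: str = "element"    # "element" or "interface"


@dataclass
class CrawlerTask:
    task_id: int
    params: CrawlParams


@dataclass
class WorkServer:
    name: str
    queue: List[CrawlerTask] = field(default_factory=list)


class LoadBalancer:
    """Turns crawling parameters into tasks and spreads them over work servers."""

    def __init__(self, workers: List[WorkServer]) -> None:
        if not workers:
            raise ValueError("at least one work server is required")
        # Round-robin is an assumed policy; the patent only says "distribute".
        self._next_worker = cycle(workers)
        self._next_id = 1

    def submit(self, params: CrawlParams) -> CrawlerTask:
        task = CrawlerTask(self._next_id, params)
        self._next_id += 1
        next(self._next_worker).queue.append(task)
        return task


# The "client" side: build parameters and hand them to the load balancer.
workers = [WorkServer("worker-a"), WorkServer("worker-b")]
balancer = LoadBalancer(workers)
for path in ("/list", "/detail", "/reviews"):
    balancer.submit(CrawlParams(f"https://example.com{path}", "page content"))

queue_sizes: Dict[str, int] = {w.name: len(w.queue) for w in workers}
```

With three submissions and two hypothetical work servers, the first server ends up with two queued tasks and the second with one.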
The cluster server 22 includes a plurality of work servers 221, and the load balancing server 21 is further configured to generate prompt information when the number of crawler tasks exceeds a preset number, where the prompt information includes information for prompting to increase the number of work servers 221. In this way, as crawler tasks gradually increase, a prompt to add work servers is issued in time, which relieves the crawling pressure on the cluster server.
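A minimal sketch of the threshold check described above follows. The function name, its arguments and the wording of the message are assumptions; the patent only states that prompt information is generated once the number of crawler tasks exceeds a preset number.

```python
from typing import Optional


def scale_out_prompt(pending_tasks: int, preset_limit: int,
                     work_servers: int) -> Optional[str]:
    """Return prompt information when the task count exceeds the preset number."""
    if pending_tasks <= preset_limit:
        return None  # still within capacity, nothing to report
    return (
        f"{pending_tasks} crawler tasks exceed the preset limit of {preset_limit}; "
        f"consider adding work servers (currently {work_servers})."
    )
```

A load balancer could call such a check each time a task is generated and forward any non-empty result to an operator.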
The cluster server 22 is further configured to simulate an input operation of a real user by using the automated test tool when receiving the verification code input request of the target website 3.
In this embodiment, the cluster server 22 may simulate the input operation of a real user through Selenium (an automated testing tool), Firefox (a web browser), geckodriver (the driver for Firefox), and the like, for example to complete slider verification. When a scenario requires manual intervention, the manual input operation can be simulated by using Selenium to invoke the browser kernel, directly open a headless browser, and select page elements.
The cluster server 22 is also used to connect to the target web site 3 via ADSL dial-up.
If the target website 3 monitors the IP (Internet Protocol) address of the cluster server 22 that accesses it, a large number of accesses from the same IP address may be blocked by the target website 3. Therefore, the crawler task needs to be deployed on a work server 221 that supports ADSL dialing, and the IP address is switched at regular intervals by a scheduled redialing task.
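The timed IP switch can be sketched as follows. The actual ADSL redial is represented by a caller-supplied function, since the patent does not say how dialing is triggered; the class name, the `interval` parameter and the injected clock are all assumptions, and the demonstration uses a fake dialer with documentation-range addresses rather than a real modem.

```python
import time
from typing import Callable


class TimedRedialer:
    """Calls a redial hook once the current IP has been in use for `interval` seconds."""

    def __init__(self, redial: Callable[[], str], interval: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._redial = redial
        self._interval = interval
        self._clock = clock
        self.current_ip = redial()      # dial once to obtain the first address
        self._dialed_at = clock()

    def ip_for_next_request(self) -> str:
        """Redial if the interval has elapsed, then return the address in use."""
        if self._clock() - self._dialed_at >= self._interval:
            self.current_ip = self._redial()
            self._dialed_at = self._clock()
        return self.current_ip


# Demonstration with a fake clock and a fake dialer (no real ADSL line involved).
fake_now = [0.0]
fake_ips = iter(["203.0.113.1", "203.0.113.2", "203.0.113.3"])
redialer = TimedRedialer(lambda: next(fake_ips), interval=60.0,
                         clock=lambda: fake_now[0])
first = redialer.ip_for_next_request()
fake_now[0] = 61.0
second = redialer.ip_for_next_request()
```

Injecting the clock keeps the sketch testable; a deployment would leave the default and supply a redial function suited to the work server.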
The cluster server 22 is further configured to modify the crawling parameters with the packet capture tool, and generate a crawler task according to the modified crawling parameters.
Request parameters are modified by mitmproxy (an intermediate proxy tool), and the features of a non-manual request are removed or modified. Mitmproxy serves as the intermediate proxy server and works in forward proxy mode, so that a request sent by the client 1 first passes through mitmproxy for inspection. Mitmproxy modifies the content of the response sent back by the target website, removes or modifies the features of the non-manual request, generates new crawling parameters, and sends the new crawling parameters to the target website.
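The idea of removing or modifying the features of a non-manual request can be illustrated on request headers alone. The sketch below is plain Python and does not use mitmproxy itself; the marker header names, the replacement User-Agent string and the heuristics are assumptions chosen for illustration, not features listed in the patent. In a mitmproxy addon, a function like this would presumably be applied to the headers of each intercepted request.

```python
from typing import Dict

# Illustrative assumptions: headers that would give away an automated client.
AUTOMATION_HEADERS = {"x-automation", "x-crawler-id"}
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) "
    "Gecko/20100101 Firefox/68.0"
)


def strip_non_manual_features(headers: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the headers with automation markers removed or rewritten."""
    cleaned = {
        name: value
        for name, value in headers.items()
        if name.lower() not in AUTOMATION_HEADERS
    }
    # Locate an existing User-Agent key regardless of its letter case.
    ua_key = next((n for n in cleaned if n.lower() == "user-agent"), "User-Agent")
    user_agent = cleaned.get(ua_key, "")
    # Replace a missing or obviously scripted User-Agent with a browser-like one.
    if not user_agent or "python" in user_agent.lower() or "headless" in user_agent.lower():
        cleaned[ua_key] = BROWSER_USER_AGENT
    return cleaned
```

The function returns a new dictionary rather than mutating its input, so the original crawling parameters remain available for comparison or logging.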
The client 1 is further configured to set a preset time, and send crawling parameters to the load balancing server 21 according to the preset time.
The client 1 is further configured to set a crawling mode, where the crawling mode includes web page element crawling and interface crawling, and the crawling mode is set through the crawling parameters.
The cluster server 22 combines two processing logics: crawling the web page elements of the target website and crawling specific interfaces. For web page element crawling, the element code of the web page can be obtained directly through an address request tool, such as HttpClient (a client programming toolkit supporting the HTTP (Hypertext Transfer Protocol) protocol), and the element code is then parsed with jsoup (an HTML (HyperText Markup Language) parser for Java (an object-oriented programming language)). For direct interface crawling, the JavaScript (an interpreted scripting language) file obtained from the target website is analyzed and then either stored locally and used directly or translated into corresponding server-side code, so that the crawling parameters are generated before the interface request is sent.
The cluster server 22 crawls the data of the target website efficiently by being compatible with both crawling modes. The cluster server can handle either simple content returned by the target website or complex requests involving header information, cookies (data stored on the user's computer to maintain state information), and the like, so the user does not need to care about the specific crawling logic, and the workload of repeated development is reduced.
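Web page element crawling, as described above, amounts to fetching the element code and parsing it. The embodiment names HttpClient and jsoup for Java; the sketch below swaps in Python's standard `html.parser` purely for illustration and skips the fetch by parsing an inline sample. The class name, the CSS class `price` and the sample markup are assumptions.

```python
from html.parser import HTMLParser
from typing import List


class ClassTextExtractor(HTMLParser):
    """Collects the text of every element carrying a given CSS class.

    Limitation of this sketch: unclosed void tags such as <br> inside a
    matching element are not handled.
    """

    def __init__(self, css_class: str) -> None:
        super().__init__()
        self._css_class = css_class
        self._depth = 0                 # > 0 while inside a matching element
        self._buffer: List[str] = []
        self.results: List[str] = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self._depth or self._css_class in classes:
            self._depth += 1

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1
            if self._depth == 0:        # the matching element just closed
                self.results.append("".join(self._buffer).strip())
                self._buffer.clear()

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


SAMPLE_HTML = (
    '<ul><li class="price">120 <b>CNY</b></li>'
    '<li class="name">Hotel</li>'
    '<li class="price">95 CNY</li></ul>'
)
parser = ClassTextExtractor("price")
parser.feed(SAMPLE_HTML)
prices = parser.results
```

Interface crawling would instead call the website's data interface directly with generated parameters, which is why the two modes are selected through the crawling parameters rather than hard-coded.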
The cluster server 22 is configured to generate process data according to the crawler task and send the process data to the monitoring server 23, and the monitoring server 23 is configured to display the process data.
The monitoring server 23 is further configured to send a control command to the cluster server 22, and the cluster server 22 is configured to start or stop the crawler task according to the control command.
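The exchange between the monitoring server 23 and the cluster server 22 can be sketched as below. The command strings `"start"` and `"stop"`, the `ProcessData` fields and the in-process method calls standing in for network messages are all assumptions; the patent does not define the format of the process data or of the control command.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ProcessData:
    task_id: int
    status: str          # "running" or "stopped"
    pages_crawled: int


class ClusterServer:
    """Tracks crawler tasks and reacts to start/stop control commands."""

    def __init__(self) -> None:
        self._status: Dict[int, str] = {}
        self._pages: Dict[int, int] = {}

    def handle_command(self, command: str, task_id: int) -> None:
        if command == "start":
            self._status[task_id] = "running"
            self._pages.setdefault(task_id, 0)
        elif command == "stop":
            if task_id not in self._status:
                raise KeyError(f"unknown task {task_id}")
            self._status[task_id] = "stopped"
        else:
            raise ValueError(f"unsupported command: {command!r}")

    def record_page(self, task_id: int) -> None:
        """Count a crawled page, but only while the task is running."""
        if self._status.get(task_id) == "running":
            self._pages[task_id] += 1

    def report(self) -> List[ProcessData]:
        return [ProcessData(t, s, self._pages[t])
                for t, s in sorted(self._status.items())]


class MonitoringServer:
    """Sends control commands and renders the process data it receives."""

    def __init__(self, cluster: ClusterServer) -> None:
        self._cluster = cluster

    def send(self, command: str, task_id: int) -> None:
        self._cluster.handle_command(command, task_id)

    def display(self) -> List[str]:
        return [f"task {d.task_id}: {d.status}, {d.pages_crawled} pages"
                for d in self._cluster.report()]


cluster = ClusterServer()
monitor = MonitoringServer(cluster)
monitor.send("start", 1)
cluster.record_page(1)
cluster.record_page(1)
monitor.send("stop", 1)
cluster.record_page(1)      # ignored: the task has been stopped
lines = monitor.display()
```

After the stop command, further pages are no longer counted, so the displayed line reflects only the two pages crawled while the task was running.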
According to the embodiment, the client sets the crawling parameters according to the preset crawling target and the access requirements of the target website, so the target data of the target website can be crawled in a targeted manner and the crawling speed is improved. The load balancing server generates the crawler task according to the crawling parameters and distributes it to the cluster server, and the cluster server crawls the target data of the target website according to the crawler task, so that a unified crawler system is realized. The user only needs to set the crawling parameters at the client, and the generated crawler tasks are uniformly processed by the cluster server, so that the system is easy to maintain, the research and development time is reduced, repeated research and development work is avoided, and the research and development cost is reduced.
Example 2
The embodiment provides a crawler method, as shown in fig. 2, including:
Step 201, the client generates crawling parameters according to access requirements of a preset crawling target and a target website, and sends the crawling parameters to the load balancing server.
The server is a Service cluster based on Web Service.
The crawling parameters can be sent to the load balancing server immediately, or a preset time can be set through the client and the crawling parameters are then sent to the load balancing server at the preset time.
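The choice between immediate and timed sending can be sketched as a delay calculation. The `HH:MM` format of the preset time, the use of `None` for immediate sending and the function name are assumptions; the patent does not describe how the preset time is expressed.

```python
from datetime import datetime, timedelta
from typing import Optional


def seconds_until(preset: Optional[str], now: datetime) -> float:
    """Delay before sending: 0 for immediate sending, otherwise the number of
    seconds from `now` until the next occurrence of the HH:MM preset time."""
    if preset is None:
        return 0.0
    hour, minute = (int(part) for part in preset.split(":"))
    run_at = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if run_at <= now:
        run_at += timedelta(days=1)   # today's slot has passed, use tomorrow's
    return (run_at - now).total_seconds()
```

A client could pass the result to a timer or scheduler and send the crawling parameters to the load balancing server when it fires.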
The crawling mode can be set through the client, the crawling mode comprises a webpage element crawling mode and/or an interface crawling mode, and the crawling parameters comprise the set crawling mode.
The cluster server combines two processing logics: crawling the web page elements of the target website and crawling specific interfaces. For web page element crawling, the element code of the web page can be obtained directly through an address request tool, such as HttpClient, which supports the HTTP protocol, and the element code is then parsed with the jsoup tool. For direct interface crawling, the JavaScript file obtained from the target website is analyzed and then either stored locally and used directly or translated into corresponding server-side code, so that the crawling parameters are generated before the interface request is sent.
The cluster server crawls the data of the target website efficiently by being compatible with both crawling modes. The cluster server can handle either simple content returned by the target website or complex requests involving header information, cookies, and the like, so the user does not need to care about the specific crawling logic, and the workload of repeated development is reduced.
Step 202, the load balancing server generates a crawler task according to the crawling parameters and distributes the crawler task to the cluster server.
Wherein the cluster server comprises a plurality of working servers.
Step 203, the load balancing server generates prompt information when the number of crawler tasks exceeds the preset number, wherein the prompt information comprises information for prompting to increase the number of the working servers.
Step 204, the cluster server connects to the target website through ADSL dialing.
If the target website monitors the IP address of the cluster server that accesses it, a large number of accesses from the same IP address may be blocked by the target website. Therefore, the crawler task is deployed on a working server that supports ADSL dialing, and the IP address is switched at regular intervals by a scheduled redialing task.
Step 205, the cluster server crawls target data of the target website according to the crawler task.
As shown in fig. 3, a specific crawling process includes:
Step 2050, judging whether the cluster server has received a verification code input request from the target website; if yes, executing step 2051.
Step 2051, simulating an input operation of a real user by using an automated test tool.
In this embodiment, the cluster server may simulate the operation of a real user through Selenium, Firefox, geckodriver, and the like, for example to complete slider verification. When a scenario requires manual intervention, the manual operation can be simulated by using Selenium to invoke the browser kernel, open a headless browser, and select page elements.
Step 2052, the cluster server modifies the crawling parameters by using the packet capture tool, and generates a crawler task according to the modified crawling parameters.
The request parameters are modified by mitmproxy, and the features of a non-manual request are removed or modified. Mitmproxy serves as the intermediate proxy server and works in forward proxy mode, so that a request sent by the client first passes through mitmproxy for inspection. Mitmproxy modifies the content of the response sent back by the target website, removes or modifies the features of the non-manual request, generates new crawling parameters, and sends the new crawling parameters to the target website.
Step 2053, the cluster server generates process data according to the crawler task, and sends the process data to the monitoring server.
Step 2054, the monitoring server displays the process data.
Step 2055, the monitoring server sends a control command to the cluster server.
Step 2056, the cluster server starts or stops the crawler task according to the control command.
While specific embodiments of the invention have been described above, it will be appreciated by those skilled in the art that this is by way of example only, and the scope of the invention is defined by the appended claims. Various changes and modifications to these embodiments may be made by those skilled in the art without departing from the principles and spirit of the invention, but such changes and modifications fall within the scope of the invention.

Claims (10)

1. The crawler system is characterized by comprising a client and a server; the server comprises a load balancing server and a cluster server;
the client is used for generating crawling parameters according to access requirements of a preset crawling target and a target website and sending the crawling parameters to the load balancing server;
the load balancing server is used for generating a crawler task according to the crawling parameters and distributing the crawler task to the cluster server;
the cluster server is used for crawling target data of the target website according to the crawler task; the cluster server is further used for modifying the crawling parameters by utilizing a packet capture tool, removing or modifying the features of the non-manual request to obtain the modified crawling parameters, and generating the crawler task according to the modified crawling parameters.
2. The crawler system of claim 1, wherein the cluster server includes a plurality of work servers, the load balancing server further configured to generate hint information when the number of crawler tasks exceeds a preset number, the hint information including information that hints to increase the number of work servers.
3. The crawler system of claim 1, wherein the cluster server is further configured to simulate input operations of a real user with an automated testing tool upon receiving a verification code input request of the target website;
and/or the cluster server is further used for connecting the target website through ADSL dialing.
4. The crawler system of claim 1, wherein the client is further configured to set a preset time, and send the crawling parameters to the load balancing server according to the preset time;
and/or the client is further used for setting a crawling mode, the crawling mode comprises webpage element crawling and/or interface crawling, and the crawling parameters comprise the crawling mode.
5. The crawler system of claim 1, wherein the server further comprises a monitoring server, the cluster server is configured to generate process data according to the crawler task and send the process data to the monitoring server, and the monitoring server is configured to display the process data;
and/or the monitoring server is further used for sending a control command to the cluster server, and the cluster server is used for starting or stopping the crawler task according to the control command.
6. A crawler method, the crawler method comprising:
the client generates crawling parameters according to access requirements of a preset crawling target and a target website, and sends the crawling parameters to the load balancing server;
the load balancing server generates a crawler task according to the crawling parameters and distributes the crawler task to a cluster server;
the cluster server crawls target data of the target website according to the crawler task; the cluster server is further used for modifying the crawling parameters by utilizing a packet capture tool, removing or modifying the features of the non-manual request to obtain the modified crawling parameters, and generating the crawler task according to the modified crawling parameters.
7. The crawler method of claim 6, wherein the cluster server comprises a plurality of work servers, the step of the load balancing server generating crawler tasks according to the crawling parameters further comprising:
and the load balancing server generates prompt information when the number of the crawler tasks exceeds a preset number, and the prompt information comprises information for prompting to increase the number of the working servers.
8. The crawler method of claim 6, wherein crawling the target data of the target website by the cluster server according to the crawler task comprises:
when receiving a verification code input request of the target website, the cluster server simulates input operation of a real user by using an automatic testing tool;
and/or before the step of crawling the target data of the target website according to the crawler task, the cluster server comprises:
the cluster server is connected with the target website through ADSL dialing.
9. The crawler method of claim 6, wherein the step of sending the crawling parameters to a load balancing server comprises: the client sets preset time and sends the crawling parameters to the load balancing server according to the preset time;
and/or the client side also sets a crawling mode, wherein the crawling mode comprises webpage element crawling and/or interface crawling, and the crawling parameters comprise the crawling mode.
10. The crawler method of claim 6, wherein the step of crawling, by the cluster server, the target data of the target website according to the crawler task further comprises:
the cluster server generates process data according to the crawler task and sends the process data to a monitoring server;
the monitoring server displays the process data; and/or the monitoring server sends a control command to the cluster server; and the cluster server starts or stops the crawler task according to the control command.
CN201910835029.4A 2019-09-05 2019-09-05 Crawler system and method Active CN110516139B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910835029.4A CN110516139B (en) 2019-09-05 2019-09-05 Crawler system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910835029.4A CN110516139B (en) 2019-09-05 2019-09-05 Crawler system and method

Publications (2)

Publication Number Publication Date
CN110516139A CN110516139A (en) 2019-11-29
CN110516139B true CN110516139B (en) 2023-07-07

Family

ID=68631009

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910835029.4A Active CN110516139B (en) 2019-09-05 2019-09-05 Crawler system and method

Country Status (1)

Country Link
CN (1) CN110516139B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111522654A (en) * 2020-03-18 2020-08-11 大箴(杭州)科技有限公司 Scheduling processing method, device and equipment for distributed crawler
CN111428179B (en) * 2020-03-19 2023-09-19 新方正控股发展有限责任公司 Picture monitoring method and device and electronic equipment
CN112486741B (en) * 2020-12-11 2021-07-20 深圳前瞻资讯股份有限公司 Multi-process and multi-thread distributed crawler method, system and device
CN112765438B (en) * 2021-01-25 2024-03-26 北京星汉博纳医药科技有限公司 Automatic crawler management method based on micro-service
CN113190737B (en) * 2021-05-06 2024-04-16 上海慧洲信息技术有限公司 Website information acquisition system based on cloud platform
CN113965371B (en) * 2021-10-19 2023-08-29 北京天融信网络安全技术有限公司 Task processing method, device, terminal and storage medium in website monitoring process

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105447088A (en) * 2015-11-06 2016-03-30 杭州掘数科技有限公司 Volunteer computing based multi-tenant professional cloud crawler
CN105760550A (en) * 2016-03-23 2016-07-13 江苏物联网研究发展中心 Big data storage center-oriented internet data acquisition system and acquisition method
CN106897357A (en) * 2017-01-04 2017-06-27 北京京拍档科技股份有限公司 Method for distributed intelligent crawling of network information with verification
CN107071009A (en) * 2017-03-28 2017-08-18 江苏飞搏软件股份有限公司 Load-balanced distributed big data crawler system
CN107562541A (en) * 2017-09-05 2018-01-09 广东科杰通信息科技有限公司 Load-balanced distributed crawler method and crawler system
CN108595510A (en) * 2018-03-22 2018-09-28 成都数聚城堡科技有限公司 Browser-based crawler, and distributed crawler system and method
CN109815385A (en) * 2019-01-31 2019-05-28 无锡火球普惠信息科技有限公司 Crawler and crawling method based on APP client

Also Published As

Publication number Publication date
CN110516139A (en) 2019-11-29

Similar Documents

Publication Publication Date Title
CN110516139B (en) Crawler system and method
US10567407B2 (en) Method and system for detecting malicious web addresses
CN104158836B (en) Method for rendering a mobile application interface from data
CN109033115B (en) Dynamic webpage crawler system
US7814410B2 (en) Initial server-side content rendering for client-script web pages
US8365188B2 (en) Content management
CN105095280A (en) Caching method and apparatus for browser
CN107229633A (en) Static page generation method, Web access method and device
US7478142B1 (en) Self-contained applications that are applied to be received by and processed within a browser environment and that have a first package that includes a manifest file and an archive of files including a markup language file and second package
CN111177519A (en) Webpage content acquisition method and device, storage medium and equipment
CN106598991A (en) Web crawler system capable of realizing website interaction and automatic form extraction by conversational mode
CN102932469A (en) Method for achieving client browser and client browser
CN106649581B (en) Webpage repairing method and client
CN104123143A (en) User control loading system and method
CN102929489A (en) Implementation method of client browser and client browser
CN105556918A (en) Resource downloading method, electronic device, and apparatus
CN103905434A (en) Method and device for processing network data
CA2938293A1 (en) Control program for accessing browser data and for controlling appliance
CN102169486A (en) File downloading method and device
CN105095070B (en) QQ group's data capture method and system based on browser testing component
CA3060005A1 (en) Systems and methods for retrieving web data
CN103677951A (en) Method and system for controlling executing process of JavaScript
CN107391132B (en) Method, device and equipment for target App to execute preset action
US20170031884A1 (en) Automated dependency management based on page components
CN103617223B (en) webpage collection method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant