CN109471979B - Method, system, equipment and medium for capturing dynamic page - Google Patents

Publication number: CN109471979B
Authority: CN (China)
Prior art keywords: chrome, page, request, load, browser
Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Application number: CN201811562767.8A
Other languages: Chinese (zh)
Other versions: CN109471979A (en)
Inventors: 沈鹏, 顾鹏飞
Current Assignee: Qianxin Technology Group Co Ltd (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Original Assignee: Qianxin Technology Group Co Ltd
Priority date: 2018-12-20 (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date: 2018-12-20
Publication date: 2021-09-10
Application filed by Qianxin Technology Group Co Ltd
Priority to CN201811562767.8A
Publication of CN109471979A
Application granted
Publication of CN109471979B
Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F2206/00: Indexing scheme related to dedicated interfaces for computers
    • G06F2206/10: Indexing scheme related to storage interfaces for computers, indexing schema related to group G06F3/06
    • G06F2206/1012: Load balancing

Abstract

The invention provides a method for crawling dynamic pages using a Chromium browser, comprising: receiving a page crawl request from a crawler; and distributing the load of the page crawl requests to management programs of a plurality of servers, so that the management programs distribute that load across a plurality of Chromium processes of the Chromium browser, and the dynamic pages in web pages are crawled by the plurality of Chromium processes. The invention also provides a system, an electronic device, and a computer-readable medium for crawling dynamic pages. A custom-developed Chromium browser intercepts unnecessary information, reducing redundant page rendering during crawling. The management program distributes the crawl request load across a plurality of Chromium processes that crawl simultaneously, improving crawl efficiency. Balancing through a load balancing device allows multiple Chromium browser clusters to be built, giving good generality and scalability.

Description

Method, system, equipment and medium for capturing dynamic page
Technical Field
The invention relates to the technical field of crawlers, in particular to a method, a system, equipment and a medium for capturing a dynamic page.
Background
Current distributed crawler frameworks can crawl dynamic pages, but under high concurrency they are slow, fail often, and use server resources poorly. They also couple dynamic page crawling tightly to the crawler framework, so the crawling function cannot be reused elsewhere. Finally, when they render pages in a browser, many redundant processes run, which lowers the efficiency of rendering and crawling pages.
Disclosure of Invention
(I) Technical problem to be solved
The invention provides a method, a system, a device, and a medium for crawling dynamic pages, which improve the efficiency, stability, and generality of a crawler when crawling dynamic pages.
(II) Technical solution
The invention provides a method for crawling dynamic pages using a Chromium browser, comprising: receiving a page crawl request from a crawler; and distributing the load of the page crawl requests to management programs of a plurality of servers, so that the management programs distribute that load across a plurality of Chromium processes of the Chromium browser, and the dynamic pages in web pages are crawled by the plurality of Chromium processes.
Optionally, distributing the load of the page crawl requests across the plurality of Chromium processes specifically means that the management program distributes the load according to a least-connections scheduling algorithm.
Optionally, the load of the page crawl requests is distributed to the management programs of the plurality of servers by a load balancing device.
Optionally, the load balancing device is one of LVS (Linux Virtual Server), Nginx, and HAProxy.
Optionally, the management program is connected to the Chromium browser and to the load balancing device, and forwards the page crawl requests assigned by the load balancing device to the Chromium browser.
Optionally, the Chromium browser exchanges information with the management program over WebSocket.
Optionally, the Chromium browser further includes an interception program that intercepts pop-up windows, ignores SSL certificate errors, and blocks information that does not belong to the page being crawled.
In another aspect, the present invention further provides an electronic device, including: a processor; a memory storing a computer executable program which, when executed by the processor, causes the processor to perform the above method of fetching dynamic pages.
In another aspect, the present invention further provides a system for crawling a dynamic page, including: the receiving module is used for receiving a crawler page grabbing request; and the distribution module is used for distributing the load of the page grabbing request to the management programs of the servers so that the management programs distribute the load of the page grabbing request to the plurality of the chrome processes of the chrome browser and grab the dynamic page in the webpage through the plurality of the chrome processes.
In yet another aspect, the present invention further provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor, implements the method for crawling dynamic pages as described above.
(III) advantageous effects
A custom-developed Chromium browser intercepts unnecessary information, reducing redundant page rendering during crawling. The management program distributes the crawl request load across a plurality of Chromium processes that crawl simultaneously, improving crawl efficiency. Load balancing through LVS allows multiple Chromium browser clusters to be built, giving good generality and scalability.
Drawings
FIG. 1 is a diagram that schematically illustrates method steps for crawling dynamic pages, in an embodiment of the present disclosure;
FIG. 2 schematically illustrates a flow diagram for crawling dynamic pages in an embodiment of the present disclosure;
FIG. 3 schematically shows a block diagram of an electronic device in an embodiment of the disclosure;
FIG. 4 schematically illustrates a system diagram for crawling dynamic pages in an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to specific embodiments and the accompanying drawings.
One aspect of the present invention provides a method for crawling dynamic pages using a Chromium browser. Referring to FIG. 1, the method includes: receiving a page crawl request from a crawler; and distributing the load of the page crawl requests to management programs of a plurality of servers, so that the management programs distribute that load across a plurality of Chromium processes of the Chromium browser, and the dynamic pages in web pages are crawled by the plurality of Chromium processes.
Specifically, in S1, a page crawl request from a crawler is received.
The load balancing device receives the page crawl request from the crawler. In the embodiment of the invention, the load balancing device is preferably one of LVS, Nginx, and HAProxy; LVS is used as the example here. The LVS receives the page crawl request and distributes the corresponding load among the servers according to a preset algorithm, and it can accept many page crawl requests at the same time.
In S2, the load of the page crawl requests is distributed to the management programs of the servers, so that the management programs distribute that load across the plurality of Chromium processes of the Chromium browser, and the dynamic pages in web pages are crawled by the plurality of Chromium processes.
The LVS distributes the page crawl requests to a server cluster. The number of servers in the cluster can be set as needed; the LVS provides an extensible server connection interface, so servers are easy to add and remove.
Each server in the cluster runs a management program, which is connected to the Chromium browser and to the load balancing device and forwards the page crawl requests assigned by the load balancing device to the Chromium browser. The management program can use a least-connections scheduling algorithm to distribute the load of the page crawl requests across the Chromium processes of the Chromium browser, and it provides a separate process-management service, which makes it easy to build a cluster of Chromium processes that meets the required crawl speed. Least-connections scheduling is a dynamic scheduling algorithm: it estimates the load on each Chromium browser from its number of currently active connections. The number of established connections is recorded for each Chromium browser; when a request is scheduled to a browser, its connection count is incremented by one, and when a connection is terminated or times out, its count is decremented by one.
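The connection-counting rule described above can be illustrated with a minimal Python sketch. This is only an illustration of least-connections scheduling in general, not the patented management program (which the text says is written in Go). The class name, the process identifiers, and the tie-breaking rule (the earliest-listed process wins) are assumptions made for the example.

```python
class LeastConnectionsScheduler:
    """Pick the Chromium process with the fewest active connections."""

    def __init__(self, process_ids):
        # Connection count per process; every process starts idle.
        self.active = {pid: 0 for pid in process_ids}

    def acquire(self):
        """Choose a process for a new request and count the new connection."""
        if not self.active:
            raise RuntimeError("no Chromium process available")
        # min() keeps the first of several equal minima, so ties go to the
        # process listed earliest (an assumption of this sketch).
        chosen = min(self.active, key=self.active.get)
        self.active[chosen] += 1
        return chosen

    def release(self, process_id):
        """Call when a connection ends or times out."""
        if self.active[process_id] > 0:
            self.active[process_id] -= 1


if __name__ == "__main__":
    # Hypothetical process names, used only for the demonstration.
    scheduler = LeastConnectionsScheduler(["chromium-1", "chromium-2"])
    first = scheduler.acquire()
    second = scheduler.acquire()
    scheduler.release(first)
    print(first, second, scheduler.acquire())
```

Keeping the counters in one place means that a timed-out connection only needs a single `release` call to make its process eligible for new requests again.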
The management program can be developed in the Go language (golang) and receives crawl requests from the crawler over a high-performance transport protocol.
The browser can be developed using the Chrome DevTools Protocol, so that it exchanges information with the management program over WebSocket to receive call instructions and invoke Chromium processes. WebSocket is a protocol for full-duplex communication over a single TCP connection. It simplifies data exchange between a browser and a server and allows the server to push data to the browser. With the WebSocket API, the browser and the server complete a single handshake, after which a persistent connection carries data in both directions. In the embodiment of the invention, WebSocket therefore establishes a communication connection between the Chromium browser and the server hosting the dynamic page, which simplifies data exchange: the crawled dynamic page can be pushed to the Chromium browser, which then passes the crawled page on to the user.
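To make the message exchange concrete, the following Python sketch shows how a management program might frame Chrome DevTools Protocol traffic. In that protocol a command is a JSON object with `id`, `method`, and `params`; a reply repeats the `id`; an event carries a `method` but no `id`. The class `CdpSession` and the `send_text` callable are hypothetical names for this example, and the WebSocket transport itself is deliberately left out, so this is a sketch of the framing only, not the patent's implementation.

```python
import itertools
import json


class CdpSession:
    """Builds DevTools Protocol commands and matches replies by id.

    `send_text` is a hypothetical callable that would write one text frame
    to an already-open WebSocket connection to the browser.
    """

    def __init__(self, send_text):
        self._send_text = send_text
        self._ids = itertools.count(1)
        self.pending = {}  # command id -> method still awaiting a reply

    def call(self, method, **params):
        """Send one command and remember it until its reply arrives."""
        msg_id = next(self._ids)
        self.pending[msg_id] = method
        self._send_text(
            json.dumps({"id": msg_id, "method": method, "params": params})
        )
        return msg_id

    def on_message(self, text):
        """Classify an incoming frame as a reply or an event."""
        msg = json.loads(text)
        if "id" in msg:
            method = self.pending.pop(msg["id"], None)
            return ("reply", method, msg.get("result", msg.get("error")))
        return ("event", msg.get("method"), msg.get("params", {}))


if __name__ == "__main__":
    frames = []
    session = CdpSession(frames.append)
    session.call("Page.navigate", url="https://example.com/")
    session.call(
        "Runtime.evaluate",
        expression="document.documentElement.outerHTML",
        returnByValue=True,
    )
    for frame in frames:
        print(frame)
```

Because replies are matched by `id`, several commands can be in flight on the same persistent connection, which is the property of full-duplex communication that the paragraph above relies on.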
In addition, the Chromium browser includes an interception program that can intercept pop-up windows, ignore SSL certificate errors, block information that does not belong to the page being crawled, and so on. This reduces redundant work when the browser renders a page, so pages are crawled faster.
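The patent builds this interception into its customized browser and does not say how. As a rough approximation only, similar behaviour can be requested from outside the browser through Chrome DevTools Protocol commands. The Python sketch below maps incoming events to the command a controller could send back. The choice of blocked resource types, the names `SETUP_COMMANDS` and `react`, and the decision to dismiss every dialog are assumptions made for illustration.

```python
# Resource types treated as irrelevant to the crawl (an assumption).
BLOCKED_RESOURCE_TYPES = {"Image", "Media", "Font"}

# Commands a controller could send once per page before navigating.
SETUP_COMMANDS = [
    ("Page.enable", {}),    # needed to receive dialog events
    ("Fetch.enable", {}),   # pauses requests so they can be filtered
    ("Security.setIgnoreCertificateErrors", {"ignore": True}),
]


def react(event):
    """Return the (method, params) command answering an event, or None."""
    method = event.get("method")
    params = event.get("params", {})

    if method == "Page.javascriptDialogOpening":
        # Dismiss alert/confirm/prompt dialogs so rendering is not blocked.
        return ("Page.handleJavaScriptDialog", {"accept": False})

    if method == "Fetch.requestPaused":
        request_id = params["requestId"]
        if params.get("resourceType") in BLOCKED_RESOURCE_TYPES:
            return (
                "Fetch.failRequest",
                {"requestId": request_id, "errorReason": "BlockedByClient"},
            )
        return ("Fetch.continueRequest", {"requestId": request_id})

    return None  # every other event is ignored by this sketch


if __name__ == "__main__":
    paused = {
        "method": "Fetch.requestPaused",
        "params": {"requestId": "r1", "resourceType": "Image"},
    }
    print(react(paused))
```

Expressing the policy as a pure function keeps it easy to test without a running browser; a real controller would feed it events read from the WebSocket connection.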
In summary, as shown in FIG. 2, the crawler issues page crawl requests for the dynamic pages that the user needs. A load balancing device such as LVS distributes the requests to a server cluster, in which each server runs a customized Chromium browser and a management program. The management program receives the page crawl requests assigned by the load balancing device and distributes them across multiple processes of the Chromium browser according to the least-connections scheduling algorithm. These processes crawl pages on the dynamic page server (WebSite), and the crawled content is sent to the user over the WebSocket protocol.
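The two-tier flow summarized above can be simulated in a few lines of Python. The sketch is a toy model under stated assumptions: round-robin stands in for the load balancer's unspecified "preset algorithm", the server and process names are invented, and no page is actually fetched.

```python
import itertools


class Manager:
    """Per-server management program: least-connections over local processes."""

    def __init__(self, name, process_count):
        self.name = name
        self.connections = {
            f"{name}/chromium-{i}": 0 for i in range(process_count)
        }

    def dispatch(self, url):
        # A real manager would forward `url` to the chosen process; here
        # only the scheduling decision is modelled.
        process = min(self.connections, key=self.connections.get)
        self.connections[process] += 1
        return process

    def finish(self, process):
        """Call when the crawl on `process` completes or times out."""
        self.connections[process] -= 1


class LoadBalancer:
    """Front tier; round-robin is an assumption standing in for LVS."""

    def __init__(self, managers):
        self._cycle = itertools.cycle(managers)

    def dispatch(self, url):
        return next(self._cycle).dispatch(url)


if __name__ == "__main__":
    balancer = LoadBalancer([Manager("srv-a", 2), Manager("srv-b", 2)])
    for i in range(5):
        url = f"https://example.com/page/{i}"
        print(url, "->", balancer.dispatch(url))
```

Because each tier only knows about the tier directly below it, servers can be added to the balancer and processes to a manager independently, which matches the scalability argument made in the text.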
In another aspect, an embodiment of the present invention provides an electronic device. FIG. 3 is a block diagram of the electronic device in an embodiment of the present invention. The electronic device 300 includes a processor 301 and a memory 302, and may perform a method according to an embodiment of the invention.
In particular, processor 301 may include, for example, a general purpose microprocessor, an instruction set processor and/or associated chipset, and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), and/or the like. The processor 301 may also include on-board memory for caching purposes. The processor 301 may be a single processing unit or a plurality of processing units for performing the different actions of the method flows according to embodiments of the present invention.
The memory 302, for example, can be any medium that can contain, store, communicate, propagate, or transport the instructions. For example, a readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, device, or propagation medium. Specific examples of the readable storage medium include: magnetic storage devices, such as magnetic tape or Hard Disk Drives (HDDs); optical storage devices, such as compact disks (CD-ROMs); a memory, such as a Random Access Memory (RAM) or a flash memory; and/or wired/wireless communication links.
The memory 302 may include a computer program 3021, which computer program 3021 may include code/computer-executable instructions that, when executed by the processor 301, cause the processor 301 to perform, for example, the method flows of the embodiments of the invention above and any variations thereof.
The computer program 3021 may be configured with, for example, computer program code comprising computer program modules. For example, in an example embodiment, the code in computer program 3021 may include one or more program modules, for example module 3021A, module 3021B, ……. It should be noted that the division and number of the modules are not fixed, and those skilled in the art may use suitable program modules or combinations of program modules according to the actual situation, so that when the program modules are executed by the processor 301, the processor 301 may execute, for example, the above method flows and any modifications thereof in connection with the embodiments of the present invention.
In another aspect, an embodiment of the present invention provides a system for crawling dynamic pages. Referring to FIG. 4, the system 400 includes a receiving module 401 and an allocation module 402.
Specifically, the receiving module 401 is configured to receive a page crawl request from a crawler. The allocation module 402 is configured to distribute the load of the page crawl requests to the management programs of a plurality of servers, so that the management programs distribute that load across a plurality of Chromium processes of the Chromium browser, and the dynamic pages in web pages are crawled by the plurality of Chromium processes.
Any number of modules, sub-modules, units, sub-units, or at least part of the functionality of any number thereof according to embodiments of the present application may be implemented in one module. Any one or more of the modules, sub-modules, units and sub-units according to the embodiments of the present application may be implemented by being split into a plurality of modules.
The present application also provides a computer readable medium, which may be embodied in the apparatus/device/system described in the above embodiments; or may exist separately and not be assembled into the device/apparatus/system. The computer readable medium carries one or more programs which, when executed, implement the method according to an embodiment of the present application.
According to embodiments of the present application, a computer readable medium may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, optical fiber cable, radio frequency signals, etc., or any suitable combination of the foregoing.
It will be appreciated by a person skilled in the art that the features described in the various embodiments and/or claims of the present application may be combined in various ways, even if such combinations are not explicitly described in the present application. In particular, such combinations may be made without departing from the spirit and teachings of the present application. All such combinations are intended to fall within the scope of this application.
While the present application has been shown and described with reference to certain exemplary embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the application as defined by the appended claims and their equivalents. Accordingly, the scope of the present application should not be limited to the above-described embodiments, but should be defined not only by the appended claims, but also by equivalents thereof.

Claims (7)

1. A method for crawling dynamic pages using a Chromium browser, characterized by comprising the following steps:
receiving a page crawl request from a crawler;
distributing, through a load balancing device, the load of the page crawl requests to management programs of a plurality of servers, wherein each management program is connected to the Chromium browser and to the load balancing device, so that the management program distributes the load of the page crawl requests across a plurality of Chromium processes of the Chromium browser, and dynamic pages in web pages are crawled by the plurality of Chromium processes;
wherein the Chromium browser is a customized Chromium browser that is developed using the Chrome DevTools Protocol and exchanges information with the management program over WebSocket to receive call instructions and invoke Chromium processes;
and the management program is developed in the Go language (golang) and receives crawl requests from the crawler over a high-performance transport protocol.
2. The method for crawling dynamic pages of claim 1, wherein distributing the load of the page crawl requests across a plurality of Chromium processes specifically comprises the management program distributing the load of the page crawl requests across the plurality of Chromium processes according to a least-connections scheduling algorithm.
3. The method for crawling dynamic pages of claim 1, wherein the load balancing device is one of LVS, Nginx, and HAProxy.
4. The method for crawling dynamic pages of claim 1, wherein the Chromium browser further comprises an interception program that intercepts pop-up windows, ignores SSL certificate errors, and blocks information that does not belong to the page being crawled.
5. An electronic device, characterized in that the device comprises:
a processor;
a memory storing a computer-executable program which, when executed by the processor, causes the processor to perform the method for crawling dynamic pages as claimed in any one of claims 1 to 4.
6. A system for crawling dynamic pages, comprising:
a receiving module, configured to receive a page crawl request from a crawler;
an allocation module, configured to distribute, through a load balancing device, the load of the page crawl requests to management programs of a plurality of servers, wherein each management program is connected to the Chromium browser and to the load balancing device, so that the management program distributes the load of the page crawl requests across a plurality of Chromium processes of the Chromium browser, and dynamic pages in web pages are crawled by the plurality of Chromium processes;
wherein the Chromium browser is a customized Chromium browser that is developed using the Chrome DevTools Protocol and exchanges information with the management program over WebSocket to receive call instructions and invoke Chromium processes;
and the management program is developed in the Go language (golang) and receives crawl requests from the crawler over a high-performance transport protocol.
7. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out a method of crawling dynamic pages as claimed in any one of claims 1 to 4.
CN201811562767.8A 2018-12-20 2018-12-20 Method, system, equipment and medium for capturing dynamic page Active CN109471979B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811562767.8A CN109471979B (en) 2018-12-20 2018-12-20 Method, system, equipment and medium for capturing dynamic page

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811562767.8A CN109471979B (en) 2018-12-20 2018-12-20 Method, system, equipment and medium for capturing dynamic page

Publications (2)

Publication Number Publication Date
CN109471979A CN109471979A (en) 2019-03-15
CN109471979B CN109471979B (en) 2021-09-10

Family

ID=65675401

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811562767.8A Active CN109471979B (en) 2018-12-20 2018-12-20 Method, system, equipment and medium for capturing dynamic page

Country Status (1)

Country Link
CN (1) CN109471979B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110119273A (en) * 2019-05-10 2019-08-13 北京墨云科技有限公司 A kind of browse request optimization method, device, terminal and storage medium
CN110569414A (en) * 2019-08-21 2019-12-13 时趣互动(北京)科技有限公司 Puppeteer-based website data collection method
CN110727426A (en) * 2019-10-12 2020-01-24 南京我爱我家信息科技有限公司 Customized version browsing system for real estate brokerage industry

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7774782B1 (en) * 2003-12-18 2010-08-10 Google Inc. Limiting requests by web crawlers to a web host
CN102902576B (en) * 2012-09-26 2014-12-24 北京奇虎科技有限公司 Method, server and system for rendering webpages
CN103902386B (en) * 2014-04-11 2017-05-10 复旦大学 Multi-thread network crawler processing method based on connection proxy optimal management
CN107193960B (en) * 2017-05-24 2020-11-10 南京大学 Distributed crawler system and periodic incremental grabbing method
CN107729564A (en) * 2017-11-13 2018-02-23 北京众荟信息技术股份有限公司 A kind of distributed focused web crawler web page crawl method and system
CN108595510A (en) * 2018-03-22 2018-09-28 成都数聚城堡科技有限公司 A kind of crawler based on browser end, distributed crawler system and method
CN108520024A (en) * 2018-03-22 2018-09-11 河海大学 Binary cycle crawler system and its operation method based on Spark Streaming

Also Published As

Publication number Publication date
CN109471979A (en) 2019-03-15

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100088 Building 3 332, 102, 28 Xinjiekouwai Street, Xicheng District, Beijing

Applicant after: Qianxin Technology Group Co., Ltd.

Address before: Beijing Chaoyang District Jiuxianqiao Road 10, building 15, floor 17, layer 1701-26, 3

Applicant before: BEIJING QI'ANXIN SCIENCE & TECHNOLOGY CO., LTD.

GR01 Patent grant