CN112637361A - Page proxy method, device, electronic equipment and storage medium

Publication number
CN112637361A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011611316.6A
Other languages
Chinese (zh)
Other versions
CN112637361B (en)
Inventor
刘德森
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Topsec Technology Co Ltd
Beijing Topsec Network Security Technology Co Ltd
Beijing Topsec Software Co Ltd
Original Assignee
Beijing Topsec Technology Co Ltd
Beijing Topsec Network Security Technology Co Ltd
Beijing Topsec Software Co Ltd
Application filed by Beijing Topsec Technology Co Ltd, Beijing Topsec Network Security Technology Co Ltd, Beijing Topsec Software Co Ltd
Priority to CN202011611316.6A
Publication of CN112637361A
Application granted
Publication of CN112637361B
Status: Active

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/50 Network services
    • H04L67/56 Provisioning of proxy services
    • H04L67/562 Brokering proxy services
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/957 Browsing optimisation, e.g. caching or content distillation
    • G06F16/9574 Browsing optimisation, e.g. caching or content distillation of access to content, e.g. by caching
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/30 Creation or generation of source code
    • G06F8/38 Creation or generation of source code for implementing user interfaces
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00 Traffic control in data switching networks
    • H04L47/10 Flow control; Congestion control
    • H04L47/12 Avoiding congestion; Recovering from congestion
    • H04L47/125 Avoiding congestion; Recovering from congestion by balancing the load, e.g. traffic engineering
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L61/00 Network arrangements, protocols or services for addressing or naming
    • H04L61/09 Mapping addresses
    • H04L61/10 Mapping addresses of different types
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L61/00 Network arrangements, protocols or services for addressing or naming
    • H04L61/45 Network directories; Name-to-address mapping
    • H04L61/4505 Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols
    • H04L61/4511 Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols using domain name system [DNS]

Abstract

The application provides a page proxy method, a page proxy device, electronic equipment and a storage medium, which are used for solving the problem of low crawling rate of resource links in a page. The method comprises the following steps: receiving a page request sent by a crawler terminal, and acquiring a response page corresponding to an access link in the page request; judging whether the response page comprises script codes or not; if so, loading and rendering the response page by using the browser, modifying the script codes in the rendered response page, injecting the modified script codes into the response page, and sending the injected response page to the crawler terminal, so that the crawler terminal triggers the script codes and acquires the resource links in the response page.

Description

Page proxy method, device, electronic equipment and storage medium
Technical Field
The present application relates to the technical field of computer data processing and network security, and in particular, to a page proxy method, apparatus, electronic device, and storage medium.
Background
HyperText Markup Language (HTML) uses tags to unify the formatting of documents on the web and links distributed Internet resources into a logical whole; HTML uses markup symbols to mark the parts of a web page to be displayed. A web page file is a text file, and markup added to the text file tells the browser how to display its contents (such as how to process text, how to arrange pictures, and how to display pictures).
JavaScript, abbreviated as JS, is an interpreted scripting language. It is a dynamically typed, weakly typed, prototype-based scripting language that is widely used on user terminals; it was first used on HTML web pages, where it can add dynamic functions to the pages.
Currently, a method for capturing resource links in a web page generally obtains the web page corresponding to an access link and then uses a crawler to traverse the resource links in the attribute values of all tag elements in the web page. However, most existing websites use an architecture in which the front end and the back end are separated: the front end renders an HTML web page from a template, the HTML web page includes JavaScript script code, and when a user operation is received the JavaScript script code is triggered to obtain back-end data asynchronously. A web page that obtains back-end data asynchronously in this way is usually called a dynamic page. In practice, it is found that when the script code in a dynamic page is executed, a crawler cannot acquire the resource links of all the asynchronous requests issued by trigger events in the page, so the crawling rate of resource links in a page by existing crawlers is not high.
Disclosure of Invention
An object of the embodiments of the present application is to provide a page proxy method, an apparatus, an electronic device, and a storage medium, which are used to solve the problem that the crawling rate of resource links in a page is not high.
An embodiment of the present application provides a page proxy method, applied to a proxy server, comprising: receiving a page request sent by a crawler terminal, and acquiring a response page corresponding to an access link in the page request; determining whether the response page includes script code; and if so, loading and rendering the response page using a browser, modifying the script code in the rendered response page, injecting the modified script code into the response page, and sending the injected response page to the crawler terminal, so that the crawler terminal triggers the script code and acquires the resource links in the response page. In this implementation, the proxy server modifies the script code in the response page that is to be returned to the crawler terminal and then returns the modified response page to the crawler terminal. When the browser of the crawler terminal renders the modified response page, it automatically executes the modified script code, so that the script code is triggered automatically and the resource links of all asynchronous requests issued by trigger events in the response page are acquired, which effectively improves the crawling rate of resource links in the page.
Optionally, in an embodiment of the present application, after determining whether the response page includes script code, the method further includes: if the response page does not include script code, sending the response page to the crawler terminal. In this implementation, when the response page does not include script code, the static response page is sent directly to the crawler terminal, so that the crawler terminal obtains the resource links in the static response page directly.
Optionally, in an embodiment of the present application, the script code includes an add-event-listener function (addEventListener); modifying the script code in the rendered response page and injecting the modified script code into the response page includes: overriding the add-event-listener function with a predefined event listener function to obtain rewritten script code, where the predefined event listener function is used to add a trigger event to a tag element in the response page when the response page is rendered, so that the crawler terminal acquires the resource link generated when the trigger event of the tag element is triggered; and injecting the rewritten script code into the response page. In this implementation, overriding the add-event-listener function with the predefined event listener function allows a trigger event to be added to a tag element when the response page is rendered, so that the crawler terminal acquires the resource link generated when the trigger event of the tag element is triggered, which effectively improves the crawling rate of resource links in the page.
Optionally, in an embodiment of the present application, before injecting the rewritten script code into the response page, the method further includes: acquiring all tag elements in the response page; and rejecting, from all the tag elements in the response page, the tag elements that cannot trigger an event. In this implementation, rejecting the tag elements that cannot trigger an event allows all the tag elements in the response page to be managed in a unified manner and effectively reduces the possibility that resource links are missed.
Optionally, in an embodiment of the present application, rejecting the tag elements that cannot trigger an event from all the tag elements in the response page includes: if a tag element has no bound event or is invisible, rejecting the tag element. In this implementation, tag elements that have no bound event or are invisible are rejected, so that the tag elements in the response page are screened in a unified manner, which effectively improves the crawling rate of resource links in the page.
Optionally, in an embodiment of the present application, acquiring the response page corresponding to the access link in the page request includes: sending the page request to the website server corresponding to the access link, so that the website server returns the response page corresponding to the page request; and receiving the response page sent by the website server. In this implementation, the proxy server forwards the page request to the website server on behalf of the crawler terminal and receives the corresponding response page, which improves the flexibility of controlling the response page.
Optionally, in an embodiment of the present application, sending the page request to the website server corresponding to the access link includes: resolving a plurality of Internet Protocol addresses from the domain name in the access link; and sending page requests to the plurality of Internet Protocol addresses in a load-balanced manner. In this implementation, sending page requests to the plurality of Internet Protocol addresses in a load-balanced manner reduces the pressure on the website server and avoids being blocked by the website server because of frequent access.
An embodiment of the present application further provides a page proxy apparatus, including: the response page acquisition module is used for receiving a page request sent by the crawler terminal and acquiring a response page corresponding to the access link in the page request; the script code judging module is used for judging whether the response page comprises script codes or not; and the page injection sending module is used for loading and rendering the response page by using a browser if the response page comprises the script codes, modifying the script codes in the rendered response page, injecting the modified script codes into the response page, and sending the injected response page to the crawler terminal so that the crawler terminal triggers the script codes and acquires the resource links in the response page.
Optionally, in this embodiment of the present application, the page proxy apparatus further includes: and the response page sending module is used for sending the response page to the crawler terminal if the response page does not comprise the script code.
Optionally, in this embodiment of the present application, the response page obtaining module includes: the page request sending module is used for sending a page request to the website server corresponding to the access link so that the website server returns a response page corresponding to the page request; and the response page receiving module is used for receiving a response page sent by the website server.
Optionally, in an embodiment of the present application, the page request sending module includes: an access link resolution module, configured to resolve a plurality of Internet Protocol addresses from the domain name in the access link; and a request load balancing module, configured to send page requests to the plurality of Internet Protocol addresses in a load-balanced manner.
Optionally, in an embodiment of the present application, the script code includes an add-event-listener function; the page injection sending module includes: a script override module, configured to override the add-event-listener function with a predefined event listener function to obtain rewritten script code, where the predefined event listener function is used to add a trigger event to a tag element in the response page when the response page is rendered, so that the crawler terminal acquires the resource link generated when the trigger event of the tag element is triggered; and a response page injection module, configured to inject the rewritten script code into the response page.
Optionally, in this embodiment of the present application, the page injection sending module further includes: the tag element acquisition module is used for acquiring all tag elements in the response page; and the label element removing module is used for removing the label elements which cannot trigger the event in all the label elements in the response page.
Optionally, in this embodiment of the present application, the tag element removing module is specifically configured to: and if the tag element does not have the binding event or is invisible, rejecting the tag element.
An embodiment of the present application further provides an electronic device, including: a processor and a memory, the memory storing processor-executable machine-readable instructions, the machine-readable instructions when executed by the processor performing the method as described above.
Embodiments of the present application also provide a storage medium having a computer program stored thereon, where the computer program is executed by a processor to perform the method as described above.
Drawings
In order to illustrate the technical solutions of the embodiments of the present application more clearly, the drawings required by the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and therefore should not be regarded as limiting its scope; those skilled in the art can obtain other related drawings from these drawings without inventive effort.
Fig. 1 is a schematic flowchart illustrating a page proxy method provided in an embodiment of the present application;
Fig. 2 is a schematic flowchart illustrating the processing performed by a crawler terminal on a received response page according to an embodiment of the present application;
Fig. 3 is a schematic structural diagram of a page proxy apparatus provided in an embodiment of the present application;
Fig. 4 is a schematic structural diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
The technical solution in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
Before introducing the page proxy method provided in the embodiment of the present application, some concepts related in the embodiment of the present application are introduced:
A proxy server is a server that obtains network information on behalf of network users. The proxy server is a transfer station for network information, an intermediate proxy between a source host and a destination host (for example, between a host in a personal network and a server of an Internet service provider), and is responsible for forwarding legitimate network information and for controlling and recording the forwarding.
A reverse proxy refers to a reverse proxy service provided by a proxy server in a computer network, that is, the proxy server can obtain resources from one or more sets of backend servers (e.g., Web servers) related to the client according to a request of the client, and then return the resources to the client, and the client only knows an Internet Protocol (IP) address of the reverse proxy and does not know existence of a server cluster behind the proxy server.
A forward proxy is a proxy service provided by a proxy server in a computer network. Like the reverse proxy service, the forward proxy forwards request messages as a proxy; the difference is that in the forward proxy process the server does not know the actual IP address of the client that actually initiated the request, whereas in the reverse proxy process the client does not know the actual IP address of the server that actually provides the service.
Nginx is HTTP server software designed for performance. It can provide high-performance HTTP service and reverse proxy service, and compared with Apache HTTP Server and Lighttpd it has advantages such as lower memory usage and high stability; Nginx is also a web server with an asynchronous architecture and can be used as a reverse proxy, a load balancer, and an HTTP cache.
A headless browser is a browser without a graphical user interface; a headless browser provides automated control of web pages in an environment similar to popular web browsers, but does so through a command line interface or network communication.
WebDriver is an open-source tool that can control different browsers (such as Firefox, Chrome, Safari, and IE) through browser-specific drivers; WebDriver can open a URL and interact with the rendered page.
jQuery is a cross-browser JavaScript library that simplifies operations between HyperText Markup Language (HTML) and JavaScript.
The Document Object Model (DOM) is a tree-structured internal data model that describes the result of parsing an XML document; an XML document includes a root node, internal nodes, leaf nodes, comment nodes, and so on.
It should be noted that the page proxy method provided in the embodiment of the present application may be executed by an electronic device, where the electronic device refers to a device terminal or a server capable of executing a computer program. Device terminals include, for example, smart phones, personal computers (PCs), and tablet computers. A server refers to a device that provides computing services over a network, for example x86 servers and non-x86 servers; non-x86 servers include mainframes, minicomputers, and UNIX servers.
Before introducing the page proxy method provided by the embodiment of the present application, an application scenario to which the page proxy method is applicable is introduced. Application scenarios include, but are not limited to: using the page proxy method to enhance the functions or performance of crawler software or crawler hardware, so that the resource links crawled by the crawler software or crawler terminal device are more complete and fewer resource links are omitted during crawling.
Please refer to Fig. 1, which is a schematic flowchart of the page proxy method provided in the embodiment of the present application. The page proxy method can be applied to a proxy server, that is, the method can be executed by the proxy server. The main idea of the page proxy method is that the proxy server modifies the script code in the response page that is to be returned to the crawler terminal and then returns the modified response page to the crawler terminal; when the browser of the crawler terminal renders the modified response page, it automatically executes the modified script code, so that the script code is triggered automatically and the resource links of all asynchronous requests issued by trigger events in the response page are acquired, which effectively addresses the problem that a crawler cannot acquire the resource links of all asynchronous requests issued by trigger events in a page. The page proxy method may include:
Step S110: the proxy server receives a page request sent by the crawler terminal and acquires a response page corresponding to the access link in the page request.
An embodiment of the proxy server receiving the page request sent by the crawler terminal in step S110 is, for example: the proxy server receives the page request sent by the crawler terminal through the Hypertext Transfer Protocol (HTTP) or the Hypertext Transfer Protocol Secure (HTTPS); the page request includes an access link.
The above embodiment of obtaining the response page corresponding to the access link in the page request in step S110 may include:
Step S111: the proxy server sends the page request to the website server corresponding to the access link, so that the website server returns the response page corresponding to the page request.
Step S112: the proxy server receives the response page sent by the website server.
Since steps S111 and S112 are closely related, the two steps are described together; there are many embodiments of steps S111 to S112, including but not limited to the following:
In a first implementation, the page request is sent to the website server corresponding to the access link in a reverse proxy manner, and the response page sent by the website server is received. Specifically, for example: the proxy server resolves a plurality of Internet Protocol addresses from the domain name in the access link, uses reverse proxy software to send page requests to the plurality of Internet Protocol addresses in a load-balanced manner, and then uses the reverse proxy software to receive the response page sent by the website server and send the response page to the crawler terminal. Reverse proxy software that can be used includes Nginx, Tengine, Apache HTTP Server, HAProxy, Hiawatha HTTP Server, and the like.
In a second implementation, the page request is sent to the website server corresponding to the access link in a forward proxy manner, and the response page sent by the website server is received. Specifically, for example: forward proxy software is used to receive the response page sent by the website server and to send the response page to the crawler terminal. Forward proxy software that can be used includes CERN HTTPd Server, Cherokee HTTP Server, Nginx HTTP Server, Apache HTTP Server, Lighttpd HTTP Server, and the like.
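The address resolution and load balancing described in the first implementation can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: the text says requests are sent "in a load-balanced manner" without naming a policy, so round-robin is assumed here, and the function names `resolve_ips` and `make_balancer` as well as the sample addresses are hypothetical.

```python
import itertools
import socket
from urllib.parse import urlsplit


def resolve_ips(access_link):
    """Resolve the domain name in an access link to a de-duplicated IP list."""
    host = urlsplit(access_link).hostname
    infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    ips = []
    for info in infos:
        ip = info[4][0]  # sockaddr tuple: the address is its first field
        if ip not in ips:
            ips.append(ip)
    return ips


def make_balancer(ips):
    """Return a function that hands out IPs in round-robin order (assumed policy)."""
    pool = itertools.cycle(ips)
    return lambda: next(pool)


# Example with fixed documentation addresses, so no DNS lookup is needed here.
pick = make_balancer(["192.0.2.10", "192.0.2.11", "192.0.2.12"])
targets = [pick() for _ in range(5)]
```

In a deployment that uses reverse proxy software such as Nginx, this distribution would normally be handled by the proxy software's own upstream configuration rather than by hand-written code.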
After step S110, step S120 is performed: the proxy server determines whether the response page includes script code.
The script code refers to code written in an interpreted scripting language that runs in a browser, and the script code may include an add-event-listener function; the interpreted scripting language here includes, but is not limited to, the JavaScript language.
An embodiment of step S120 is, for example: the proxy server parses the source code of the response page using a source code parsing program and then uses a regular expression to search for a preset tag in the source code of the response page; if the preset tag exists, the response page is determined to include script code, and otherwise the response page is determined not to include script code. The preset tag is the script tag: if the script tag has a src attribute, the source code of the script can be obtained by accessing the src attribute value, and if the script tag has no src attribute, the source code inside the script tag can be obtained using a regular expression.
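The regular-expression check in step S120 might look like the following Python sketch. The patterns and the names `find_scripts` and `has_script_code` are assumptions made for illustration; the source does not give the actual expressions, and a regular expression is only an approximation of real HTML parsing.

```python
import re

# Matches <script ...>...</script>, capturing the attribute text and the inline body.
SCRIPT_RE = re.compile(r"<script\b([^>]*)>(.*?)</script\s*>", re.IGNORECASE | re.DOTALL)
# Pulls the value of a quoted src attribute out of the attribute text.
SRC_RE = re.compile(r"""\bsrc\s*=\s*["']([^"']+)["']""", re.IGNORECASE)


def find_scripts(html):
    """Return (external_srcs, inline_sources) for the script tags in html."""
    srcs, inline = [], []
    for attrs, body in SCRIPT_RE.findall(html):
        match = SRC_RE.search(attrs)
        if match:
            srcs.append(match.group(1))   # source must be fetched from this URL
        elif body.strip():
            inline.append(body.strip())   # source is the tag's own content
    return srcs, inline


def has_script_code(html):
    """Step S120: treat the page as containing script code if a script tag exists."""
    return SCRIPT_RE.search(html) is not None


page = ('<html><head><script src="/app.js"></script></head>'
        '<body><script>load();</script></body></html>')
srcs, inline = find_scripts(page)
```

Here `srcs` holds the src attribute values whose targets would still have to be requested, while `inline` holds script source taken directly from the page.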
After step S120, step S130 is performed: if the response page includes script code, the proxy server loads and renders the response page using a browser, modifies the script code in the rendered response page, injects the modified script code into the response page, and sends the injected response page to the crawler terminal, so that the crawler terminal triggers the script code and acquires the resource links in the response page.
There are many embodiments of the proxy server in step S130 loading and rendering the response page using the browser, including but not limited to the following:
In a first embodiment, the proxy server uses a program to control a browser to load and render the response page; specifically, for example, a Selenium program, a jQuery program, or the WebDriver tool is used to control the browser to load and render the response page.
In a second embodiment, the proxy server uses a program or tool to control a headless browser to load and render the response page. Headless browsers that may be used include, but are not limited to, the PhantomJS browser, the Chrome browser in headless mode (Headless Chrome), and the Firefox browser in headless mode; the headless browser can execute JavaScript scripts and can load returned data including, but not limited to, style files such as CSS and picture files.
There are many embodiments for modifying the script code in the rendered response page and injecting the modified script code into the response page in step S130, including but not limited to the following:
In a first implementation, the script code is modified by overriding the add-event-listener function, and the modified script code is injected into the response page. Specifically, for example: a predefined event listener function is used to override the add-event-listener (addEventListener) function to obtain the rewritten script code, and the rewritten script code is then injected into the response page. Here addEventListener is the function that registers a listener for an event together with the function that handles the event. By overriding the addEventListener function, the proxy server can override the window event listener, where window refers to a built-in browser object; the proxy server can likewise override the Document event listener, where Document refers to the document object of the HTML response page; and the proxy server can also override the event listener of document object model nodes (DOM Nodes). The predefined event listener function is used to add a trigger event to a tag element in the response page when the response page is rendered, so that the crawler terminal acquires the resource link generated when the trigger event of the tag element is triggered.
In a second implementation, the add-event-listener function is overridden, the tag elements in the response page are then screened and rejected to obtain the modified script code, and the modified script code is injected into the response page. Specifically, for example: first, the add-event-listener function is overridden, and JavaScript code is used to acquire all tag elements in the response page; second, the tag elements that cannot trigger an event are filtered out of all the tag elements in the response page, tag elements not allowed by the configuration file are filtered out, and invisible tag elements are filtered out; then, the events of ancestor tag elements are passed down in turn to descendant tag elements, each descendant tag element inherits the events passed down by its ancestors, and all the events bound to the descendant tag element are merged, that is, duplicate events, or duplicate valid events and invalid events, are removed; finally, the rewritten script code is injected into the response page.
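The injection step common to both implementations can be sketched in Python as below. This is a hedged illustration, not the patent's actual script: the JavaScript in `HOOK_JS` is an assumed example of an override that records each binding before delegating to the original addEventListener, the `window.__boundEvents` name is hypothetical, and placing the script right after the opening head tag is an assumption made so that the override is installed before the page's own scripts run.

```python
import re

# Assumed example of an override: record every (target, event type) binding,
# then delegate to the original addEventListener. Names are hypothetical.
HOOK_JS = """
(function () {
  var original = EventTarget.prototype.addEventListener;
  window.__boundEvents = [];
  EventTarget.prototype.addEventListener = function (type, listener, options) {
    window.__boundEvents.push({ target: this, type: type });
    return original.call(this, type, listener, options);
  };
})();
"""

HEAD_RE = re.compile(r"<head\b[^>]*>", re.IGNORECASE)


def inject_hook(html, hook_js=HOOK_JS):
    """Insert the hook script right after <head>, or prepend it if there is no head."""
    tag = "<script>" + hook_js + "</script>"
    match = HEAD_RE.search(html)
    if match is None:
        return tag + html
    return html[:match.end()] + tag + html[match.end():]


page = "<html><head><title>t</title></head><body></body></html>"
injected = inject_hook(page)
```

With bindings recorded this way, a script running in the crawler terminal's browser could later dispatch the recorded event types on the recorded targets to trigger their asynchronous requests.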
Tag elements that cannot trigger an event are removed from the tag elements in the response page according to, for example, the following rules: if a tag element has no bound event or is invisible, it is removed; if a tag element has no bound event of its own but its parent (or another ancestor) tag element does, the tag element inherits the bound event of that parent (or ancestor) tag element; if neither the tag element nor its parent (or any ancestor) tag element has a bound event, the tag element is considered to have no bound event and is removed; if the only events bound to a tag element are invalid events that cannot be triggered, the tag element is likewise considered to have no bound event and is removed, where an invalid event may also be an event bound to the tag element without a name; if a tag element is bound to several events that include both valid events and invalid events, the tag element is not removed.
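The culling and inheritance rules above can be sketched in Python on a simplified element model. This is only an illustration under stated assumptions: the `TagElement` class, the `valid_events` set, and both function names are hypothetical and do not appear in the patent, and an event is treated as invalid when it is unnamed or absent from `valid_events`.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class TagElement:
    """Simplified, hypothetical stand-in for a DOM tag element."""
    name: str
    events: Set[str] = field(default_factory=set)  # events bound directly
    visible: bool = True
    parent: Optional["TagElement"] = None


def effective_events(element: TagElement, valid_events: Set[str]) -> Set[str]:
    """Merge the element's own events with those inherited from ancestors.

    Using a set removes duplicates; intersecting with valid_events drops
    invalid events such as unnamed ones.
    """
    merged: Set[str] = set()
    node: Optional[TagElement] = element
    while node is not None:
        merged |= node.events
        node = node.parent
    return merged & valid_events


def filter_elements(elements: List[TagElement], valid_events: Set[str]) -> List[TagElement]:
    """Keep only visible elements that end up with at least one valid event."""
    return [
        e for e in elements
        if e.visible and effective_events(e, valid_events)
    ]
```

In this sketch an element bound to both valid and invalid events is kept, because at least one valid event survives the intersection, which matches the last rule in the paragraph above.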
Optionally, after step S120, step S140 is performed: if the response page does not include script code, the proxy server sends the response page to the crawler terminal.
Step S140 may be implemented, for example, as follows: if the response page does not include script code, the response page may be a static page, that is, a page whose entire content can be acquired without further interaction with the website server that serves it. In that case the proxy server can send the static page, that is, the response page, directly to the crawler terminal.
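The branch between modifying a page and forwarding it unchanged can be sketched as follows. This is a minimal Python illustration, assuming that "includes script code" means the page contains at least one `<script>` element; the patent does not specify the detection method, and the names `contains_script` and `route_response` are hypothetical. Inline handlers such as `onclick` attributes are not detected by this sketch.

```python
from html.parser import HTMLParser


class _ScriptDetector(HTMLParser):
    """Sets has_script once any <script> start tag is seen."""

    def __init__(self) -> None:
        super().__init__()
        self.has_script = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        if tag == "script":
            self.has_script = True


def contains_script(html: str) -> bool:
    """Return True if the response page contains a <script> element."""
    detector = _ScriptDetector()
    detector.feed(html)
    return detector.has_script


def route_response(html: str) -> str:
    """Name the branch the proxy would take for this response page."""
    return "modify_and_inject" if contains_script(html) else "forward_unchanged"
```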
In the above implementation, a page request sent by the crawler terminal is first received, and the response page corresponding to the access link in the page request is acquired. Then, if the response page includes script code, the response page is loaded and rendered using a browser, the script code in the rendered response page is modified, the modified script code is injected into the response page, and the injected response page is sent to the crawler terminal, so that the crawler terminal triggers the script code and acquires the resource links in the response page. In other words, the proxy server modifies the script code in the response page before returning the modified response page to the crawler terminal. When the browser of the crawler terminal renders the modified response page, it automatically executes the modified script code, so that the script code is triggered automatically and all resource links requested asynchronously by trigger events in the response page are acquired. This effectively solves the problem that a crawler cannot acquire the resource links of the asynchronous requests of all trigger events in a page.
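The injection step summarized above can be sketched in Python for the case where the add-event-listener function is overridden. Everything named here is an assumption made for illustration: `HOOK_JS`, `__boundEvents`, and `inject_hook` are hypothetical, the hook merely records each listener registration rather than reproducing the patent's predefined event listener function, and simple string insertion stands in for whatever injection mechanism the proxy server actually uses.

```python
import re

# Hypothetical hook: wraps addEventListener so every registration is recorded
# before the original function runs. EventTarget.prototype is shared by
# window, document, and DOM nodes, so one override covers all three.
HOOK_JS = """
(function () {
  var original = EventTarget.prototype.addEventListener;
  window.__boundEvents = [];
  EventTarget.prototype.addEventListener = function (type, listener, options) {
    window.__boundEvents.push({ target: this, type: type });
    return original.call(this, type, listener, options);
  };
})();
"""


def inject_hook(html: str, hook_js: str = HOOK_JS) -> str:
    """Insert the hook script ahead of the page's first <script> tag.

    Placing it first means the override is in force before any page script
    registers a listener. A page without a <script> tag is returned unchanged.
    """
    hook_tag = "<script>" + hook_js + "</script>"
    match = re.search(r"<script\b", html, flags=re.IGNORECASE)
    if match is None:
        return html
    return html[: match.start()] + hook_tag + html[match.start():]
```

Inserting the hook before the first existing script is a design choice of this sketch: an override injected after the page's own scripts have run would miss listeners that were already registered.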
Please refer to fig. 2, which is a schematic flow diagram illustrating a process performed by a crawler terminal on a received response page according to an embodiment of the present application; the implementation manner of processing the received response page by the crawler terminal may include:
Step S210: the crawler terminal receives the response page sent by the proxy server and parses the modified script code in the response page.
Step S210 may be implemented, for example, as follows: the crawler terminal receives the response page sent by the proxy server over the HyperText Transfer Protocol (HTTP) or HyperText Transfer Protocol Secure (HTTPS), and loads and renders the response page using a headless browser. After the response page has been loaded and rendered, a JavaScript or jQuery program locates the script tags in the response page, a regular expression extracts the source code inside the script tags, and that source code is taken to be the modified script code. Although the script code has been modified by the proxy server, the crawler terminal does not know whether it has been modified.
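The regular-expression extraction of script source described above can be sketched in Python. The pattern and the function name `extract_inline_scripts` are illustrative assumptions; the patent does not give the actual expression, and a regular expression of this kind is only a rough approximation of HTML parsing.

```python
import re
from typing import List

# Matches a <script ...> element and captures its body, across lines and
# regardless of tag case.
SCRIPT_RE = re.compile(r"<script\b[^>]*>(.*?)</script\s*>", re.IGNORECASE | re.DOTALL)


def extract_inline_scripts(html: str) -> List[str]:
    """Return the non-empty source text of every <script> element."""
    return [body.strip() for body in SCRIPT_RE.findall(html) if body.strip()]
```

Script elements that only reference an external file through a `src` attribute have an empty body and are skipped by this sketch.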
After step S210, step S220 is performed: the crawler terminal executes the modified script code to acquire all events bound to all tag elements in the response page.
There are many embodiments of the above step S220, including but not limited to the following:
In the first implementation, the crawler terminal executes the modified script code using the Selenium and WebDriver tools, thereby acquiring all events bound to all tag elements in the response page. For example, all tag elements in the response page are acquired in a Python program using regular expressions, XPath, and the Beautiful Soup package, and all events bound to those tag elements are then acquired using a JavaScript or jQuery program.
In the second implementation, the modified script code consists of a JavaScript program and a jQuery program, and the crawler terminal executes them to acquire all events bound to all tag elements in the response page. For example, after the response page has been loaded and rendered, a jQuery selector selects all DOM nodes in the response page (a DOM node is another name for a tag element in DOM operations), and each DOM node is checked for bound events; if a DOM node has a bound event, the event is extracted using a JavaScript program. The events here include, but are not limited to, hyperlink click events, form click events, mouse click events, and keyboard key-press events in the web page to be processed.
After step S220, step S230 is performed: the crawler terminal controls threads of the headless browser to simulate triggering all events, intercepts the page requests generated while the events are triggered, and acquires the resource links generated in the page requests.
Step S230 may be implemented, for example, as follows: the crawler terminal uses the Selenium tool to start multiple threads of a headless browser and uses those threads to simulate triggering all events bound to all tag elements; it then uses a Python program to intercept the page requests generated while the events are triggered, and acquires the resource links generated in the page requests using programs such as JavaScript scripts, jQuery, and Python, or using regular expressions, XPath, and the Beautiful Soup package in a Python program, or using tools such as node. Headless browsers that may be used include, but are not limited to, the PhantomJS browser, the Chrome browser in headless mode, and the Firefox browser in headless mode.
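Once the page requests generated by the simulated events have been intercepted, turning them into a clean list of resource links can be sketched as below. The browser automation itself is deliberately left out; the function name `collect_resource_links` and the choice to drop URL fragments and de-duplicate are assumptions made for this illustration, not details taken from the patent.

```python
from typing import Iterable, List
from urllib.parse import urldefrag, urljoin


def collect_resource_links(page_url: str, intercepted: Iterable[str]) -> List[str]:
    """Resolve intercepted request URLs against the page URL, drop fragments,
    and return them de-duplicated in first-seen order."""
    seen = set()
    links: List[str] = []
    for raw in intercepted:
        # urljoin makes relative URLs absolute; urldefrag strips "#fragment".
        absolute, _fragment = urldefrag(urljoin(page_url, raw))
        if absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links
```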
Please refer to fig. 3, which illustrates a schematic structural diagram of a page proxy apparatus according to an embodiment of the present application; the embodiment of the present application provides a page proxy apparatus 300, including:
A response page obtaining module 310, configured to receive a page request sent by the crawler terminal and obtain a response page corresponding to an access link in the page request.
A script code judging module 320, configured to judge whether the response page includes script code.
A page injection sending module 330, configured to, if the response page includes script code, load and render the response page using a browser, modify the script code in the rendered response page, inject the modified script code into the response page, and send the injected response page to the crawler terminal, so that the crawler terminal triggers the script code and obtains the resource links in the response page.
Optionally, in this embodiment of the present application, the page proxy apparatus further includes:
A response page sending module, configured to send the response page to the crawler terminal if the response page does not include script code.
Optionally, in this embodiment of the present application, the response page obtaining module includes:
A page request sending module, configured to send a page request to the website server corresponding to the access link, so that the website server returns a response page corresponding to the page request.
A response page receiving module, configured to receive the response page sent by the website server.
Optionally, in this embodiment of the present application, the page request sending module includes:
An access link resolving module, configured to resolve a plurality of Internet Protocol addresses from the domain name in the access link.
A request load balancing module, configured to send the page request to the plurality of Internet Protocol addresses in a load-balanced manner.
Optionally, in an embodiment of the present application, the script code includes an add-event-listener function, and the page injection sending module includes:
A script override rewriting module, configured to override the add-event-listener function with a predefined event listener function to obtain rewritten script code, where the predefined event listener function is used to add a trigger event to the tag elements in the response page when the response page is rendered, so that the crawler terminal obtains the resource link generated when the trigger event of a tag element is triggered.
A response page injection module, configured to inject the rewritten script code into the response page.
Optionally, in this embodiment of the present application, the page injection sending module further includes:
A tag element acquisition module, configured to acquire all tag elements in the response page.
A tag element removing module, configured to remove, from all the tag elements in the response page, the tag elements that cannot trigger an event.
Optionally, in this embodiment of the present application, the tag element removing module is specifically configured to remove a tag element if the tag element has no bound event or is invisible.
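The behavior of the modules that resolve the domain name and balance page requests across the resulting addresses can be sketched in Python. The patent does not name a load-balancing algorithm, so the round-robin rotation used here is an assumption, as are the function names `resolve_addresses` and `round_robin`.

```python
import itertools
import socket
from typing import Iterator, List


def resolve_addresses(domain: str, port: int = 80) -> List[str]:
    """Resolve a domain name to its distinct IP addresses (performs a DNS lookup)."""
    infos = socket.getaddrinfo(domain, port, proto=socket.IPPROTO_TCP)
    addresses: List[str] = []
    for _family, _type, _proto, _canon, sockaddr in infos:
        ip = sockaddr[0]
        if ip not in addresses:
            addresses.append(ip)
    return addresses


def round_robin(addresses: List[str]) -> Iterator[str]:
    """Return an endless rotation over the addresses so that successive
    page requests are spread evenly across them."""
    if not addresses:
        raise ValueError("no addresses to balance across")
    return itertools.cycle(addresses)
```

A caller would take `next(...)` from the iterator returned by `round_robin` before each page request to choose the target address.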
It should be understood that the apparatus corresponds to the above-mentioned page proxy method embodiment, and can perform the steps related to the above-mentioned method embodiment, and the specific functions of the apparatus can be referred to the above description, and the detailed description is appropriately omitted here to avoid redundancy. The device includes at least one software function that can be stored in memory in the form of software or firmware (firmware) or solidified in the Operating System (OS) of the device.
Please refer to fig. 4 for a schematic structural diagram of an electronic device according to an embodiment of the present application. An electronic device 400 provided in an embodiment of the present application includes: a processor 410 and a memory 420, the memory 420 storing machine-readable instructions executable by the processor 410, the machine-readable instructions when executed by the processor 410 performing the method as above.
The embodiment of the present application also provides a storage medium 430, where the storage medium 430 stores a computer program, and the computer program is executed by the processor 410 to perform the method as above.
The storage medium 430 may be implemented by any type of volatile or nonvolatile storage device or combination thereof, such as a Static Random Access Memory (SRAM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), an Erasable Programmable Read-Only Memory (EPROM), a Programmable Read-Only Memory (PROM), a Read-Only Memory (ROM), a magnetic Memory, a flash Memory, a magnetic disk, or an optical disk.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules of the embodiments in the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.
The above description is only an optional implementation of the embodiments of the present application, and the scope of the embodiments of the present application is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive of within the technical scope of the embodiments of the present application shall be covered by the scope of the embodiments of the present application.

Claims (10)

1. A page proxy method, applied to a proxy server, comprising:
receiving a page request sent by a crawler terminal, and acquiring a response page corresponding to an access link in the page request;
judging whether the response page comprises script code; and
if so, loading and rendering the response page using a browser, modifying the script code in the rendered response page, injecting the modified script code into the response page, and sending the injected response page to the crawler terminal, so that the crawler terminal triggers the script code and acquires resource links in the response page.
2. The method of claim 1, further comprising, after said judging whether the response page comprises script code:
if the response page does not comprise script code, sending the response page to the crawler terminal.
3. The method of claim 1, wherein the script code comprises an add-event-listener function, and the modifying the script code in the rendered response page and injecting the modified script code into the response page comprises:
overriding the add-event-listener function with a predefined event listener function to obtain rewritten script code, wherein the predefined event listener function is used for adding a trigger event to a tag element in the response page when the response page is rendered, so that the crawler terminal acquires a resource link generated when the trigger event of the tag element is triggered; and
injecting the rewritten script code into the response page.
4. The method of claim 3, further comprising, prior to said injecting the rewritten script code into the response page:
acquiring all tag elements in the response page; and
removing, from all the tag elements in the response page, the tag elements that cannot trigger an event.
5. The method of claim 4, wherein the removing, from all the tag elements in the response page, the tag elements that cannot trigger an event comprises:
if a tag element has no bound event or is invisible, removing the tag element.
6. The method according to claim 1, wherein the obtaining of the response page corresponding to the access link in the page request comprises:
sending a page request to a website server corresponding to the access link so that the website server returns a response page corresponding to the page request;
and receiving the response page sent by the website server.
7. The method of claim 6, wherein sending a page request to a website server corresponding to the access link comprises:
resolving a plurality of internet protocol addresses according to the domain name in the access link;
sending the page request to the plurality of Internet protocol addresses in a load balanced manner.
8. A page proxy apparatus, comprising:
a response page acquisition module, configured to receive a page request sent by a crawler terminal and acquire a response page corresponding to an access link in the page request;
a script code judging module, configured to judge whether the response page comprises script code; and
a page injection sending module, configured to, if the response page comprises script code, load and render the response page using a browser, modify the script code in the rendered response page, inject the modified script code into the response page, and send the injected response page to the crawler terminal, so that the crawler terminal triggers the script code and acquires resource links in the response page.
9. An electronic device, comprising: a processor and a memory, the memory storing machine-readable instructions executable by the processor, the machine-readable instructions, when executed by the processor, performing the method of any of claims 1 to 7.
10. A storage medium, having stored thereon a computer program which, when executed by a processor, performs the method of any one of claims 1 to 7.
CN202011611316.6A 2020-12-29 Page proxy method, device, electronic equipment and storage medium Active CN112637361B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011611316.6A CN112637361B (en) 2020-12-29 Page proxy method, device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112637361A true CN112637361A (en) 2021-04-09
CN112637361B CN112637361B (en) 2022-09-16

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104408204A (en) * 2014-12-18 2015-03-11 北京国双科技有限公司 Method and device for obtaining webpage page link address
CN105095260A (en) * 2014-05-08 2015-11-25 广州爱九游信息技术有限公司 Webpage processing method and device aiming at search engine optimization
US20180013848A1 (en) * 2016-07-08 2018-01-11 Facebook, Inc. Methods and Systems for Rewriting Scripts to Direct Requests
CN109656670A (en) * 2018-12-27 2019-04-19 广州华多网络科技有限公司 A kind of page rendering method and device
CN109670100A (en) * 2018-12-21 2019-04-23 第四范式(北京)技术有限公司 A kind of page data grasping means and device
CN110069683A (en) * 2017-09-18 2019-07-30 北京国双科技有限公司 A kind of method and device crawling data based on browser
US20190303500A1 (en) * 2018-03-27 2019-10-03 Capital One Services, Llc Systems and methods for single page application server side renderer
CN111680247A (en) * 2020-04-28 2020-09-18 平安国际智慧城市科技股份有限公司 Local calling method, device, equipment and storage medium of webpage character string

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113076501A (en) * 2021-04-21 2021-07-06 广州虎牙科技有限公司 Page processing method, storage medium and equipment
CN113986322A (en) * 2021-12-29 2022-01-28 天津联想协同科技有限公司 Method, device and storage medium for dynamically modifying page codes
CN113986322B (en) * 2021-12-29 2022-03-11 天津联想协同科技有限公司 Method, device and storage medium for dynamically modifying page codes

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant