CN113656674B

CN113656674B - Automatic processing method and device for click type hyperlink in website crawler

Info

Publication number: CN113656674B
Application number: CN202111018080.XA
Authority: CN
Inventors: 董仲舒; 张阳光; 何文欢; 程杰; 毕静静; 姚金龙
Original assignee: Valley Network Polytron Technologies Inc
Current assignee: Valley Network Polytron Technologies Inc
Priority date: 2021-08-30
Filing date: 2021-08-30
Publication date: 2023-06-27
Anticipated expiration: 2041-08-30
Also published as: CN113656674A

Abstract

The invention discloses an automatic processing method and device for click-type hyperlinks in a web crawler. The method comprises the following steps: capturing page links through a web crawler; generating webpage content; generating a hyperlink queue; using xpath expressions to judge whether the content contains an element that needs to be clicked, and if so returning the matching expression and continuing, otherwise jumping to the last step; using Selenium to call a virtual browser and reload the current page, continuing if the loading succeeds and otherwise jumping to the last step; locating the element that needs to be clicked according to the expression and performing a simulated click, continuing if the browser responds successfully and otherwise jumping to the last step; obtaining the response content, encoding it as UTF-8, and jumping back to the second step; and, as the last step, taking out the next hyperlink, continuing to crawl with a breadth-first or depth-first traversal algorithm, and jumping to the first step. The invention greatly improves the completeness and accuracy of the content captured by the web crawler.

Description

Automatic processing method and device for click type hyperlink in website crawler

Technical Field

The invention belongs to the technical field of website crawlers, and particularly relates to an automatic processing method and device for click-type hyperlinks in a website crawler, suitable for links encountered during crawling that can only be followed by a manual click.

Background

With the development of modern web front-end technology, particularly the front-end language JavaScript, many excellent front-end frameworks such as jQuery, Vue, React and Angular have appeared, and with them several excellent UI component libraries such as Bootstrap and Element-UI. These frameworks perform very well in compatibility, applicability, convenience and internationalization and greatly improve the efficiency of website development, so more and more websites are developed with them.

While these frameworks bring great convenience to website development, they pose great difficulties and challenges for website crawlers and the field of content retrieval. One of the most prominent is that some hyperlinks must be clicked before access can continue. A traditional web crawler only grabs hyperlinks of the form <a href="xxx">xxx</a> that exist on a web page, and fails on hyperlinks such as <a onclick="xxx">xxx</a>. Such hyperlinks are increasingly common on existing websites, in particular as the "previous" and "next" hyperlinks, which are the basic style and method of paginating website content; as a result, the content crawled by website crawlers is incomplete and inaccurate.
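The gap described above can be illustrated with a minimal sketch that uses only the Python standard library. The page fragment, the class name `LinkCollector` and the handler name `goPage` are made up for illustration and do not come from the patent; the sketch only shows why an extractor that looks at href values alone never sees the pagination links.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects <a> tags, split by whether a plain fetch can follow them."""

    def __init__(self):
        super().__init__()
        self.followable = []   # href values a traditional crawler can request
        self.click_only = []   # handlers that only run when the link is clicked

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href") or ""
        if href and href != "#" and not href.startswith("javascript:"):
            self.followable.append(href)
        elif "onclick" in attr or href.startswith("javascript:"):
            self.click_only.append(attr.get("onclick") or href)


# Hypothetical page fragment, made up for illustration.
SAMPLE = """
<a href="/news/1.html">Article</a>
<a onclick="goPage(2)">next page</a>
<a href="javascript:goPage(0)">previous page</a>
"""

collector = LinkCollector()
collector.feed(SAMPLE)
print(collector.followable)  # ['/news/1.html']
print(collector.click_only)  # ['goPage(2)', 'javascript:goPage(0)']
```

Only the first link yields a URL that can be requested directly; the other two expose nothing but a script handler, which is the case the invention targets.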

Disclosure of Invention

To address the problem that a traditional website crawler cannot grab click-type hyperlinks (such as 'previous page' and 'next page' hyperlinks), which leaves the grabbed content incomplete and inaccurate, the invention provides an automatic processing method and device for click-type hyperlinks in a website crawler.

In order to achieve the above purpose, the present invention adopts the following technical scheme:

In one aspect, the present invention proposes an automated processing method for click-type hyperlinks in a web crawler, including:

step 1: capturing page links through a web crawler;

step 2: generating webpage content according to the page links;

step 3: generating a hyperlink queue after de-duplicating all hyperlinks in the web page;

step 4: judging, on the basis of the webpage content, whether it contains an element that needs to be clicked by using xpath expressions; if yes, returning the matching xpath expression and executing step 5; if not, jumping to step 8;

step 5: using Selenium to call a virtual browser and reloading the current page; if the loading is successful, executing step 6, otherwise jumping to step 8;

step 6: selecting with the xpath expression returned in step 4 to find the element that needs to be clicked, performing a simulated click on it, and waiting for the browser's response; if the browser responds successfully, executing step 7, otherwise jumping to step 8;

step 7: obtaining the response content from the browser, encoding it as UTF-8, and then jumping to step 2;

step 8: taking out the next hyperlink in the hyperlink queue and continuing to crawl using a breadth-first or depth-first traversal algorithm; then jumping to step 1.
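The control flow of steps 1 to 8 can be sketched as a breadth-first loop. This is one possible reading under stated assumptions, not the patented implementation: the four callables (`fetch`, `extract_links`, `find_click_xpath`, `click_in_browser`) and the `max_pages` guard are hypothetical names introduced here, and how the browser session keeps track of the "current page" between clicks is left to `click_in_browser`.

```python
from collections import deque


def crawl(start_url, fetch, extract_links, find_click_xpath, click_in_browser,
          max_pages=100):
    """Breadth-first sketch of steps 1-8; all four callables are hypothetical.

    fetch(url)                   -> page content or None          (steps 1-2)
    extract_links(content)       -> list of hyperlinks            (step 3)
    find_click_xpath(content)    -> matching xpath or None        (step 4)
    click_in_browser(url, xpath) -> content after click, or None  (steps 5-7)
    """
    queue = deque([start_url])
    seen = {start_url}
    pages = []
    while queue and len(pages) < max_pages:
        url = queue.popleft()                       # step 8, then step 1
        content = fetch(url)                        # step 2
        while content is not None and len(pages) < max_pages:
            pages.append((url, content))
            for link in extract_links(content):     # step 3: de-duplicate
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
            xpath = find_click_xpath(content)       # step 4
            if xpath is None:
                break                               # no click needed: step 8
            # Steps 5-7; a None result means a failure and falls through to
            # step 8, otherwise the new content re-enters at step 2.
            content = click_in_browser(url, xpath)
    return pages
```

Swapping `popleft()` for `pop()` would turn the traversal into a depth-first one. The `max_pages` bound is an added safety net against a "next page" link that never disappears.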

Further, in the step 4, the xpath expression includes:

1) //td[contains(text(), 'next page') and starts-with(@onclick, 'window.location')];

2) //a[contains(text(), 'next page') and starts-with(@href, 'javascript:')];

3) //a[contains(text(), 'next page') and @onclick and @href='#'];

4) //a[contains(text(), 'next page') and @onclick];

5) //a[@onclick].
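As a rough illustration of step 4, the five expressions can be mirrored as Python predicates and tried in the listed order, returning the first expression that matches. This is a standard-library approximation and not an XPath evaluation: the names `ClickCandidates`, `RULES` and `find_click_xpath` are hypothetical, the ordering from most to least specific is an assumption, the literal 'next page' stands in for whatever label the target site uses, and text inside nested child tags is attributed to the innermost open `a` or `td` element, which is looser than XPath's text().

```python
from html.parser import HTMLParser


class ClickCandidates(HTMLParser):
    """Collects <a> and <td> elements with their attributes and text."""

    def __init__(self):
        super().__init__()
        self.elements = []
        self._open = []

    def handle_starttag(self, tag, attrs):
        if tag in ("a", "td"):
            el = {"tag": tag, "attrs": {k: v or "" for k, v in attrs}, "text": ""}
            self.elements.append(el)
            self._open.append(el)

    def handle_data(self, data):
        if self._open:
            self._open[-1]["text"] += data

    def handle_endtag(self, tag):
        if any(el["tag"] == tag for el in self._open):
            while self._open.pop()["tag"] != tag:
                pass


# (xpath expression, Python predicate approximating it), in the listed order.
RULES = [
    ("//td[contains(text(), 'next page') and starts-with(@onclick, 'window.location')]",
     lambda e: e["tag"] == "td" and "next page" in e["text"]
     and e["attrs"].get("onclick", "").startswith("window.location")),
    ("//a[contains(text(), 'next page') and starts-with(@href, 'javascript:')]",
     lambda e: e["tag"] == "a" and "next page" in e["text"]
     and e["attrs"].get("href", "").startswith("javascript:")),
    ("//a[contains(text(), 'next page') and @onclick and @href='#']",
     lambda e: e["tag"] == "a" and "next page" in e["text"]
     and "onclick" in e["attrs"] and e["attrs"].get("href") == "#"),
    ("//a[contains(text(), 'next page') and @onclick]",
     lambda e: e["tag"] == "a" and "next page" in e["text"]
     and "onclick" in e["attrs"]),
    ("//a[@onclick]",
     lambda e: e["tag"] == "a" and "onclick" in e["attrs"]),
]


def find_click_xpath(html):
    """Returns the first expression whose predicate matches, else None."""
    parser = ClickCandidates()
    parser.feed(html)
    for xpath, matches in RULES:
        if any(matches(e) for e in parser.elements):
            return xpath
    return None


print(find_click_xpath('<a href="#" onclick="go(2)">next page</a>'))
print(find_click_xpath('<a href="/about.html">About</a>'))
```

In a real crawler the expressions would more likely be evaluated by an XPath engine or by the browser driver itself; the predicates here only make the matching logic visible without extra dependencies.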

In another aspect, the present invention proposes an automated processing apparatus for click-type hyperlinks in a web crawler, comprising:

a page link grabbing module, used for grabbing page links through a web crawler;

a webpage content generation module, used for generating webpage content according to the page links;

a de-duplication module, used for de-duplicating all hyperlinks in the web page and then generating a hyperlink queue;

a first judging module, used for judging, on the basis of the webpage content, whether it contains an element that needs to be clicked by using xpath expressions; if yes, returning the matching xpath expression and executing the second judging module; if not, executing the loop module;

a second judging module, used for calling a virtual browser with Selenium and reloading the current page; executing the third judging module if the loading is successful, otherwise executing the loop module;

a third judging module, used for selecting with the xpath expression returned by the first judging module to find the element that needs to be clicked, performing a simulated click on it, and waiting for the browser's response; executing the encoding module if the browser responds successfully, otherwise executing the loop module;

an encoding module, used for obtaining the response content from the browser, encoding it as UTF-8, and then executing the webpage content generation module;

a loop module, used for taking out the next hyperlink in the hyperlink queue and continuing to crawl using a breadth-first or depth-first traversal algorithm, and then executing the page link grabbing module.

Further, the xpath expression includes:

2) //a[contains(text(), 'next page') and starts-with(@href, 'javascript:')];

3) //a[contains(text(), 'next page') and @onclick and @href='#'];

4) //a[contains(text(), 'next page') and @onclick];

5) //a[@onclick].

Compared with the prior art, the invention has the following beneficial effects:

When the webpage content is judged to contain an element that must be clicked manually, the invention uses the simulated browser to load the current page once, finds the link that needs to be clicked and simulates the click event on it, lets the browser load the result of that click, and finally returns the corresponding content. With this method and device, a web crawler can capture traditional webpage content and also click-type hyperlinks (such as 'previous page' and 'next page' hyperlinks) that a traditional web crawler cannot capture, which greatly improves the completeness and accuracy of the crawled content.
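The load, click and return sequence described above (steps 5 to 7) might look roughly like the following. This is a sketch under assumptions rather than the patent's code: the function name `click_and_capture` is hypothetical, and `driver` is only assumed to behave like a Selenium WebDriver, that is, to offer `get(url)`, `find_element(by, value)` returning an element with `click()`, and a `page_source` attribute. Any exception is treated as the "jump to step 8" case.

```python
def click_and_capture(driver, url, xpath):
    """Reloads the page, clicks the matched element, returns UTF-8 bytes.

    Returns None when loading, locating or clicking fails, which the caller
    can treat as "move on to the next hyperlink in the queue".
    """
    try:
        driver.get(url)                                # step 5: reload the page
        element = driver.find_element("xpath", xpath)  # step 6: locate element
        element.click()                                # step 6: simulated click
        html = driver.page_source                      # step 7: response content
    except Exception:
        return None
    return html.encode("utf-8")                        # step 7: UTF-8 encoding
```

With a real browser, `driver` could be, for example, an instance created by `selenium.webdriver.Chrome()`. The sketch reads `page_source` immediately after the click; pages that update asynchronously would probably need an explicit wait, such as Selenium's `WebDriverWait`, before the content is read.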

Drawings

FIG. 1 is a basic flow diagram of an automated processing method for click type hyperlinks in web site crawlers in accordance with an embodiment of the present invention;

FIG. 2 is a comparison of web site crawler data;

FIG. 3 is a schematic diagram of an automated processing unit related to click type hyperlinks in web crawlers according to an embodiment of the present invention.

Detailed Description

The invention is further illustrated by the following description of specific embodiments in conjunction with the accompanying drawings:

As shown in FIG. 1, an automated processing method for click-type hyperlinks in a website crawler includes:

step 1: capturing page links through a web crawler;

step 2: generating webpage content according to the page links;

Further, in the step 4, the xpath expression includes:

2) //a[contains(text(), 'next page') and starts-with(@href, 'javascript:')];

3) //a[contains(text(), 'next page') and @onclick and @href='#'];

4) //a[contains(text(), 'next page') and @onclick];

5) //a[@onclick].

To verify the effect of the invention, the following experiments were performed:

Five websites were used in the test. To eliminate the influence of network jitter and server performance, ten tests were carried out on each website: the first five with a standard crawler method and the last five with the method of the invention. The test results are shown in the following table. For privacy and similar reasons, website names are replaced with website ids.

Table 1: Comparison test results

The comparative test chart is shown in FIG. 2.

As can be seen from Table 1 and FIG. 2, with the method of the invention the amount of data captured by the web crawler on certain websites is clearly increased, and the completeness of the crawled data is greatly improved.

On the basis of the above embodiment, as shown in FIG. 3, another aspect of the present invention proposes an automated processing apparatus for click-type hyperlinks in a website crawler, including:

Further, the xpath expression includes:

2) //a[contains(text(), 'next page') and starts-with(@href, 'javascript:')];

3) //a[contains(text(), 'next page') and @onclick and @href='#'];

4) //a[contains(text(), 'next page') and @onclick];

5) //a[@onclick].

In summary, the invention adopts simulated-browser technology. When the webpage content contains an element that must be clicked manually, the invention uses the simulated browser to load the current page once, finds the link that needs to be clicked and simulates the click event on it, lets the browser load the result of that click, and finally returns the corresponding content. With this method and device, a web crawler can capture traditional webpage content and also click-type hyperlinks (such as 'previous page' and 'next page' hyperlinks) that a traditional web crawler cannot capture, which greatly improves the completeness and accuracy of the crawled content.

The foregoing is merely illustrative of the preferred embodiments of this invention, and it will be appreciated by those skilled in the art that changes and modifications may be made without departing from the principles of this invention, and it is intended to cover such modifications and changes as fall within the true scope of the invention.

Claims

1. An automated processing method for click-type hyperlinks in web site crawlers, comprising:

step 1: capturing page links through a web crawler;

step 2: generating webpage content according to the page links;

2. The automated processing method for click-type hyperlinks on web crawlers according to claim 1, wherein in said step 4, said xpath expression comprises:

2) //a[contains(text(), 'next page') and starts-with(@href, 'javascript:')];

3) //a[contains(text(), 'next page') and @onclick and @href='#'];

4) //a[contains(text(), 'next page') and @onclick];

5) //a[@onclick].

3. An automated processing apparatus for click-type hyperlinks in web site crawlers, comprising:

4. The automated processing apparatus for click-type hyperlinks in web crawlers according to claim 3, wherein said xpath expression comprises:

2) //a[contains(text(), 'next page') and starts-with(@href, 'javascript:')];

3) //a[contains(text(), 'next page') and @onclick and @href='#'];

4) //a[contains(text(), 'next page') and @onclick];

5) //a[@onclick].