CN113656674B - Automatic processing method and device for click type hyperlink in website crawler - Google Patents

Publication number: CN113656674B
Authority: CN (China)
Prior art keywords: page, module, executing, clicking, browser
Legal status: Active
Application number: CN202111018080.XA
Other languages: Chinese (zh)
Other versions: CN113656674A (en)
Inventors: 董仲舒, 张阳光, 何文欢, 程杰, 毕静静, 姚金龙
Current Assignee: Valley Network Polytron Technologies Inc
Original Assignee: Valley Network Polytron Technologies Inc
Priority date: 2021-08-30
Filing date: 2021-08-30
Publication date: 2023-06-27
Application filed by Valley Network Polytron Technologies Inc
Priority to CN202111018080.XA
Publication of CN113656674A
Application granted
Publication of CN113656674B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9558 Details of hyperlinks; Management of linked annotations
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses an automatic processing method and device for click-type hyperlinks in a website crawler. The method comprises the following steps: capturing a page link through a web crawler; generating webpage content from the link; de-duplicating the hyperlinks in the page and generating a hyperlink queue; using an XPath expression to judge whether the content contains an element that needs clicking; if so, returning the matching expression and continuing, and if not, jumping to the last step; using Selenium to call a virtual browser and reload the current page, continuing if the load succeeds and otherwise jumping to the last step; locating the element that needs clicking according to the expression and simulating a click on it; continuing if the browser responds successfully and otherwise jumping to the last step; obtaining the response content, encoding it as UTF-8, and jumping back to the second step; and, in the last step, taking out the next hyperlink, continuing to crawl with a breadth-first or depth-first traversal algorithm, and jumping back to the first step. The invention greatly improves the completeness and accuracy of the content captured by the web crawler.

Description

Automatic processing method and device for click type hyperlink in website crawler
Technical Field
The invention belongs to the technical field of website crawlers, and particularly relates to an automatic processing method and device for click-type hyperlinks in a website crawler, suitable for links that can only be followed further by a manual click during crawling.
Background
With the development of modern web front-end technology, and of the front-end language JavaScript in particular, many front-end frameworks such as jQuery, Vue, React, and Angular have appeared, and with them UI component libraries such as Bootstrap and Element UI. These frameworks perform well in compatibility, applicability, convenience, and internationalization, and they greatly improve the efficiency of website development, so more and more websites are built with them.
While these frameworks make website development far more convenient, they pose considerable difficulties for website crawlers and for content retrieval. One of the most prominent is that some hyperlinks must be clicked before access can continue. A traditional web crawler only grabs hyperlinks of the form <a href="xxx">xxx</a> that exist on a web page and cannot handle hyperlinks such as <a onclick="xxx">xxx</a>. Such hyperlinks are increasingly common on existing websites, particularly as "previous page" and "next page" links. Because these links are the basic means of paginating website content, a crawler that misses them captures incomplete and inaccurate content.
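The gap described above can be illustrated with a minimal sketch of an href-only link extractor, written with Python's standard `html.parser`. The class name `HrefCollector`, the `click_only` counter, and the sample fragment are hypothetical and are not taken from the patent; the sketch only shows that a click-type anchor yields no URL to follow.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects href targets the way a simple href-only crawler would."""

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self.click_only = 0  # anchors reachable only by running their click handler

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if href and href != "#" and not href.startswith("javascript:"):
            self.hrefs.append(href)
        elif "onclick" in attributes or (href or "").startswith("javascript:"):
            self.click_only += 1


# Hypothetical page fragment: one ordinary link and one click-type pager link.
SAMPLE = '<a href="/list?page=1">first</a><a onclick="goPage(2)">next page</a>'

collector = HrefCollector()
collector.feed(SAMPLE)
print(collector.hrefs)       # URLs an href-only crawler can follow
print(collector.click_only)  # anchors it cannot follow without a click
```

The second anchor carries the pagination, yet it contributes nothing to `hrefs`, which is the situation the invention sets out to handle.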
Disclosure of Invention
To address the problem that a traditional website crawler cannot grab click-type hyperlinks (such as "previous page" and "next page" hyperlinks), which leaves the grabbed content incomplete and inaccurate, the invention provides an automatic processing method and device for click-type hyperlinks in a website crawler.
To achieve the above purpose, the invention adopts the following technical scheme:
In one aspect, the invention proposes an automated processing method for click-type hyperlinks in a website crawler, including:
Step 1: capturing page links through a web crawler;
Step 2: generating webpage content according to the page links;
Step 3: de-duplicating all hyperlinks in the web page and then generating a hyperlink queue;
Step 4: using an XPath expression to judge, on the basis of the webpage content, whether the content contains an element that needs clicking; if so, returning the matching XPath expression and executing step 5; if not, jumping to step 8;
Step 5: using Selenium to call a virtual browser and reload the current page; if loading succeeds, executing step 6; otherwise jumping to step 8;
Step 6: locating the element that needs clicking with the XPath expression returned in step 4, simulating a click on it, and waiting for the browser's response; if the browser responds successfully, executing step 7; otherwise jumping to step 8;
Step 7: obtaining the response content from the browser, encoding it as UTF-8, and then jumping to step 2;
Step 8: taking the next hyperlink out of the hyperlink queue and continuing to crawl with a breadth-first or depth-first traversal algorithm, then jumping to step 1.
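The control flow of steps 1 to 8 can be sketched roughly as follows. This is a minimal breadth-first sketch under stated assumptions, not the patented implementation: the function name `crawl` and the four hooks (`fetch`, `extract_links`, `find_click_xpath`, `click_and_render`) are hypothetical, and the `seen_content` guard is an addition not described in the text, included so that a click which keeps returning the same page cannot loop forever.

```python
from collections import deque


def crawl(start_url, fetch, extract_links, find_click_xpath, click_and_render,
          max_pages=100):
    """Breadth-first crawl that also follows click-type links.

    Hypothetical hooks supplied by the caller:
      fetch(url) -> str or None                   (steps 1-2)
      extract_links(html) -> iterable of URLs     (step 3)
      find_click_xpath(html) -> str or None       (step 4)
      click_and_render(url, xpath) -> str or None (steps 5-7)
    """
    queue = deque([start_url])
    seen_urls = {start_url}
    seen_content = set()  # assumption: stop clicking once content repeats
    pages = []
    while queue and len(pages) < max_pages:
        url = queue.popleft()                  # step 8 hands the next link to step 1
        html = fetch(url)                      # steps 1-2
        while (html is not None and html not in seen_content
               and len(pages) < max_pages):
            seen_content.add(html)
            pages.append((url, html))
            for link in extract_links(html):   # step 3: de-duplicated queue
                if link not in seen_urls:
                    seen_urls.add(link)
                    queue.append(link)
            xpath = find_click_xpath(html)     # step 4
            if xpath is None:
                break                          # no click-type element: go to step 8
            html = click_and_render(url, xpath)  # steps 5-7, then back to step 2
    return pages
```

Using `queue.pop()` instead of `queue.popleft()` would turn the traversal depth-first, which corresponds to the other option named in step 8.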
Further, in step 4, the XPath expressions include:
1) //td[contains(text(), 'lower page') and starts-with(@onclick, 'window.location')];
2) //a[contains(text(), 'next page') and starts-with(@href, 'javascript:')];
3) //a[contains(text(), 'next page') and @onclick and @href='#'];
4) //a[contains(text(), 'next page') and @onclick];
5) //a[@onclick].
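One way to apply these expressions in step 4 is to try them in the listed order and return the first one that matches. The sketch below assumes that ordering (the text lists the expressions but does not state how they are combined). The names `CLICK_XPATHS`, `find_click_xpath`, and the `evaluate` hook are hypothetical; `evaluate(html, expression)` stands for any XPath engine that reports whether the expression matches.

```python
# The five expressions from the text, in the order listed.
CLICK_XPATHS = (
    "//td[contains(text(), 'lower page') and starts-with(@onclick, 'window.location')]",
    "//a[contains(text(), 'next page') and starts-with(@href, 'javascript:')]",
    "//a[contains(text(), 'next page') and @onclick and @href='#']",
    "//a[contains(text(), 'next page') and @onclick]",
    "//a[@onclick]",
)


def find_click_xpath(html, evaluate, expressions=CLICK_XPATHS):
    """Return the first expression that matches html, or None (step 4).

    evaluate(html, expression) -> bool is a caller-supplied hook; it is an
    assumption of this sketch, not part of the patent text.
    """
    for expression in expressions:
        if evaluate(html, expression):
            return expression
    return None
```

With the third-party lxml library installed, the hook could be written as `lambda html, expr: bool(lxml.html.fromstring(html).xpath(expr))`; this is one possible choice of engine, and the patent does not name one.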
In another aspect, the invention proposes an automated processing device for click-type hyperlinks in a website crawler, comprising:
a page link grabbing module, used for grabbing page links through a web crawler;
a webpage content generation module, used for generating webpage content according to the page links;
a de-duplication module, used for de-duplicating all hyperlinks in the web page and then generating a hyperlink queue;
a first judging module, used for judging, with an XPath expression and on the basis of the webpage content, whether the content contains an element that needs clicking; if so, returning the matching XPath expression and executing the second judging module; if not, executing the circulation module;
a second judging module, used for calling a virtual browser with Selenium and reloading the current page; if loading succeeds, executing the third judging module; otherwise executing the circulation module;
a third judging module, used for locating the element that needs clicking with the XPath expression returned by the first judging module, simulating a click on it, and waiting for the browser's response; if the browser responds successfully, executing the coding module; otherwise executing the circulation module;
a coding module, used for obtaining the response content from the browser, encoding it as UTF-8, and then executing the webpage content generation module;
a circulation module, used for taking the next hyperlink out of the hyperlink queue and continuing to crawl with a breadth-first or depth-first traversal algorithm, then executing the page link grabbing module.
Further, the XPath expressions include:
1) //td[contains(text(), 'lower page') and starts-with(@onclick, 'window.location')];
2) //a[contains(text(), 'next page') and starts-with(@href, 'javascript:')];
3) //a[contains(text(), 'next page') and @onclick and @href='#'];
4) //a[contains(text(), 'next page') and @onclick];
5) //a[@onclick].
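The de-duplication module described above could be sketched as follows, using only the Python standard library. The function name `build_link_queue` is hypothetical, and two choices here are assumptions the text does not specify: relative links are resolved against the page URL, and URL fragments are dropped before comparison.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin


def build_link_queue(base_url, hrefs):
    """Resolve and de-duplicate hrefs, preserving first-seen order."""
    queue = deque()
    seen = set()
    for href in hrefs:
        # Assumption: links that differ only by "#fragment" count as the same page.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if absolute and absolute not in seen:
            seen.add(absolute)
            queue.append(absolute)
    return queue
```

A `deque` supports both traversal orders: taking links from the left gives breadth-first crawling, and taking them from the right gives depth-first crawling.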
Compared with the prior art, the invention has the following beneficial effects:
When the webpage content is judged to contain an element that requires a manual click, the invention uses the simulated browser to load the current page once, locates the link that needs clicking, and simulates the click event; the browser then loads the linked content and returns it. With this method and device, the crawler captures ordinary webpage content as before and also captures click-type hyperlinks (such as "previous page" and "next page" hyperlinks) that a traditional web crawler cannot, which greatly improves the completeness and accuracy of the crawled content.
Drawings
FIG. 1 is a basic flow diagram of an automated processing method for click type hyperlinks in web site crawlers in accordance with an embodiment of the present invention;
FIG. 2 is a comparison of web site crawler data;
FIG. 3 is a schematic diagram of an automated processing unit related to click type hyperlinks in web crawlers according to an embodiment of the present invention.
Detailed Description
The invention is further described below through specific embodiments in conjunction with the accompanying drawings:
As shown in FIG. 1, an automated processing method for click-type hyperlinks in a website crawler includes:
Step 1: capturing page links through a web crawler;
Step 2: generating webpage content according to the page links;
Step 3: de-duplicating all hyperlinks in the web page and then generating a hyperlink queue;
Step 4: using an XPath expression to judge, on the basis of the webpage content, whether the content contains an element that needs clicking; if so, returning the matching XPath expression and executing step 5; if not, jumping to step 8;
Step 5: using Selenium to call a virtual browser and reload the current page; if loading succeeds, executing step 6; otherwise jumping to step 8;
Step 6: locating the element that needs clicking with the XPath expression returned in step 4, simulating a click on it, and waiting for the browser's response; if the browser responds successfully, executing step 7; otherwise jumping to step 8;
Step 7: obtaining the response content from the browser, encoding it as UTF-8, and then jumping to step 2;
Step 8: taking the next hyperlink out of the hyperlink queue and continuing to crawl with a breadth-first or depth-first traversal algorithm, then jumping to step 1.
Further, in step 4, the XPath expressions include:
1) //td[contains(text(), 'lower page') and starts-with(@onclick, 'window.location')];
2) //a[contains(text(), 'next page') and starts-with(@href, 'javascript:')];
3) //a[contains(text(), 'next page') and @onclick and @href='#'];
4) //a[contains(text(), 'next page') and @onclick];
5) //a[@onclick].
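Steps 5 to 7 of this embodiment could look roughly like the sketch below. It assumes a `driver` object that behaves like a Selenium WebDriver (`get`, `find_element`, `page_source`); the function name `click_and_render` and the optional `wait` hook are hypothetical. The locator strategy is passed as the plain string `"xpath"`, which is the value of Selenium's `By.XPATH` constant, so the block itself does not import Selenium.

```python
def click_and_render(driver, url, xpath, wait=None):
    """Steps 5-7: reload url, click the element matched by xpath, return UTF-8 bytes.

    Returns None when loading, locating, or clicking fails, which the caller
    can treat as "jump to step 8".
    """
    try:
        driver.get(url)                                 # step 5: reload the current page
        element = driver.find_element("xpath", xpath)   # step 6: locate the element
        element.click()                                 # step 6: simulated click
        if wait is not None:
            wait(driver)                                # hypothetical hook: wait for the response
        content = driver.page_source                    # step 7: content after the click
    except Exception:
        # Broad on purpose in this sketch: any failure means "go to step 8".
        return None
    return content.encode("utf-8")                      # step 7: UTF-8 encoding
```

With Selenium installed, `driver` could be created with `webdriver.Chrome()`, and the `wait` hook could wrap Selenium's `WebDriverWait`; the text only says that the method waits for the browser's response and does not specify how.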
To verify the effect of the invention, the following experiment was performed:
Five websites were tested. To eliminate the influence of network jitter and server performance, each website was tested ten times: the first five runs used a standard crawler method and the last five used the method of the invention. The results are shown in the table below. For privacy and similar reasons, website names are replaced with website ids.
Table 1: Comparison test results
[Table 1 appears only as images in the source (Figure BDA0003237349940000041, Figure BDA0003237349940000051).]
The comparison chart is shown in FIG. 2.
As Table 1 and FIG. 2 show, with the method of the invention the amount of data captured by the web crawler increases markedly on some websites, which greatly improves the completeness of the crawled data.
On the basis of the above embodiment, as shown in FIG. 3, another aspect of the invention proposes an automated processing device for click-type hyperlinks in a website crawler, including:
a page link grabbing module, used for grabbing page links through a web crawler;
a webpage content generation module, used for generating webpage content according to the page links;
a de-duplication module, used for de-duplicating all hyperlinks in the web page and then generating a hyperlink queue;
a first judging module, used for judging, with an XPath expression and on the basis of the webpage content, whether the content contains an element that needs clicking; if so, returning the matching XPath expression and executing the second judging module; if not, executing the circulation module;
a second judging module, used for calling a virtual browser with Selenium and reloading the current page; if loading succeeds, executing the third judging module; otherwise executing the circulation module;
a third judging module, used for locating the element that needs clicking with the XPath expression returned by the first judging module, simulating a click on it, and waiting for the browser's response; if the browser responds successfully, executing the coding module; otherwise executing the circulation module;
a coding module, used for obtaining the response content from the browser, encoding it as UTF-8, and then executing the webpage content generation module;
a circulation module, used for taking the next hyperlink out of the hyperlink queue and continuing to crawl with a breadth-first or depth-first traversal algorithm, then executing the page link grabbing module.
Further, the XPath expressions include:
1) //td[contains(text(), 'lower page') and starts-with(@onclick, 'window.location')];
2) //a[contains(text(), 'next page') and starts-with(@href, 'javascript:')];
3) //a[contains(text(), 'next page') and @onclick and @href='#'];
4) //a[contains(text(), 'next page') and @onclick];
5) //a[@onclick].
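The hand-offs between the modules of the device can be summarized as a small transition table. This is an illustrative sketch only: the identifiers are hypothetical snake_case renderings of the module names, and the first three transitions (grabbing, content generation, de-duplication, first judging) are inferred from the order in which the modules are listed rather than stated explicitly.

```python
# (module, succeeded) -> next module
TRANSITIONS = {
    ("page_link_grabbing", True): "webpage_content_generation",
    ("webpage_content_generation", True): "deduplication",
    ("deduplication", True): "first_judging",
    ("first_judging", True): "second_judging",    # click-type element found
    ("first_judging", False): "circulation",
    ("second_judging", True): "third_judging",    # page reloaded in the browser
    ("second_judging", False): "circulation",
    ("third_judging", True): "coding",            # browser responded to the click
    ("third_judging", False): "circulation",
    ("coding", True): "webpage_content_generation",
    ("circulation", True): "page_link_grabbing",
}


def trace(outcomes, start="page_link_grabbing"):
    """Return the modules visited, given one outcome per visited module."""
    module = start
    visited = [module]
    for succeeded in outcomes:
        module = TRANSITIONS[(module, succeeded)]
        visited.append(module)
    return visited
```

For a page without click-type elements, `trace([True, True, True, False, True])` passes through the first judging module into the circulation module and returns to page link grabbing.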
In summary, the invention adopts simulated-browser technology. When the webpage content contains an element that requires a manual click, the invention uses the simulated browser to load the current page once, locates the link that needs clicking, and simulates the click event; the browser then loads the linked content and returns it. With this method and device, the crawler captures ordinary webpage content as before and also captures click-type hyperlinks (such as "previous page" and "next page" hyperlinks) that a traditional web crawler cannot, which greatly improves the completeness and accuracy of the crawled content.
The foregoing merely illustrates preferred embodiments of the invention. Those skilled in the art will appreciate that changes and modifications may be made without departing from the principles of the invention, and such changes and modifications are intended to fall within its scope of protection.

Claims (4)

1. An automated processing method for click-type hyperlinks in web site crawlers, comprising:
Step 1: capturing page links through a web crawler;
Step 2: generating webpage content according to the page links;
Step 3: de-duplicating all hyperlinks in the web page and then generating a hyperlink queue;
Step 4: using an XPath expression to judge, on the basis of the webpage content, whether the content contains an element that needs clicking; if so, returning the matching XPath expression and executing step 5; if not, jumping to step 8;
Step 5: using Selenium to call a virtual browser and reload the current page; if loading succeeds, executing step 6; otherwise jumping to step 8;
Step 6: locating the element that needs clicking with the XPath expression returned in step 4, simulating a click on it, and waiting for the browser's response; if the browser responds successfully, executing step 7; otherwise jumping to step 8;
Step 7: obtaining the response content from the browser, encoding it as UTF-8, and then jumping to step 2;
Step 8: taking the next hyperlink out of the hyperlink queue and continuing to crawl with a breadth-first or depth-first traversal algorithm, then jumping to step 1.
2. The automated processing method for click-type hyperlinks in web site crawlers according to claim 1, wherein in said step 4, said XPath expressions comprise:
1) //td[contains(text(), 'lower page') and starts-with(@onclick, 'window.location')];
2) //a[contains(text(), 'next page') and starts-with(@href, 'javascript:')];
3) //a[contains(text(), 'next page') and @onclick and @href='#'];
4) //a[contains(text(), 'next page') and @onclick];
5) //a[@onclick].
3. An automated processing apparatus for click-type hyperlinks in web site crawlers, comprising:
a page link grabbing module, used for grabbing page links through a web crawler;
a webpage content generation module, used for generating webpage content according to the page links;
a de-duplication module, used for de-duplicating all hyperlinks in the web page and then generating a hyperlink queue;
a first judging module, used for judging, with an XPath expression and on the basis of the webpage content, whether the content contains an element that needs clicking; if so, returning the matching XPath expression and executing the second judging module; if not, executing the circulation module;
a second judging module, used for calling a virtual browser with Selenium and reloading the current page; if loading succeeds, executing the third judging module; otherwise executing the circulation module;
a third judging module, used for locating the element that needs clicking with the XPath expression returned by the first judging module, simulating a click on it, and waiting for the browser's response; if the browser responds successfully, executing the coding module; otherwise executing the circulation module;
a coding module, used for obtaining the response content from the browser, encoding it as UTF-8, and then executing the webpage content generation module;
a circulation module, used for taking the next hyperlink out of the hyperlink queue and continuing to crawl with a breadth-first or depth-first traversal algorithm, then executing the page link grabbing module.
4. The automated processing apparatus for click-type hyperlinks in web site crawlers according to claim 3, wherein said XPath expressions comprise:
1) //td[contains(text(), 'lower page') and starts-with(@onclick, 'window.location')];
2) //a[contains(text(), 'next page') and starts-with(@href, 'javascript:')];
3) //a[contains(text(), 'next page') and @onclick and @href='#'];
4) //a[contains(text(), 'next page') and @onclick];
5) //a[@onclick].

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202111018080.XA, CN113656674B (en) | 2021-08-30 | 2021-08-30 | Automatic processing method and device for click type hyperlink in website crawler

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202111018080.XA, CN113656674B (en) | 2021-08-30 | 2021-08-30 | Automatic processing method and device for click type hyperlink in website crawler

Publications (2)

Publication Number | Publication Date
CN113656674A | 2021-11-16
CN113656674B | 2023-06-27

Family

ID=78493394

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202111018080.XA (Active), CN113656674B (en) | Automatic processing method and device for click type hyperlink in website crawler | 2021-08-30 | 2021-08-30

Country Status (1)

Country | Link
CN (1) | CN113656674B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2016101343A4 (en) * 2015-07-30 2016-09-01 M Hassan & S Hassan & E Kravchenko & A Shchurov Method and systems for operating dynamic dashboard style website menus
CN105243159A (en) * 2015-10-28 2016-01-13 福建亿榕信息技术有限公司 Visual script editor-based distributed web crawler system
CN108062468A (en) * 2017-12-25 2018-05-22 南京烽火软件科技有限公司 A kind of web crawlers method based on picture validation code identification
CN112632358A (en) * 2020-12-29 2021-04-09 北京天融信网络安全技术有限公司 Resource link obtaining method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
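Design and implementation of targeted network data acquisition based on the Selenium framework; 何苗; 张蕴; 工业控制计算机 (Industrial Control Computer) (06); full text *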

Also Published As

Publication number Publication date
CN113656674A (en) 2021-11-16


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant