CN113656674A - Automatic processing method and device for click type hyperlink in website crawler - Google Patents

Publication number
CN113656674A
Authority
CN
China
Prior art keywords
page
module
executing
browser
hyperlink
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111018080.XA
Other languages
Chinese (zh)
Other versions
CN113656674B (en)
Inventor
董仲舒
张阳光
何文欢
程杰
毕静静
姚金龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Valley Network Polytron Technologies Inc
Original Assignee
Valley Network Polytron Technologies Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Valley Network Polytron Technologies Inc
Priority to CN202111018080.XA
Publication of CN113656674A
Application granted
Publication of CN113656674B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9558 Details of hyperlinks; Management of linked annotations
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses an automatic processing method and device for click-type hyperlinks in a website crawler. The method comprises the following steps: capturing a page link with a web crawler; generating the webpage content; generating a hyperlink queue; judging with XPath expressions whether the page contains an element that must be clicked, continuing if a matching expression is returned and otherwise jumping to the last step; invoking a virtual browser through Selenium and reloading the current page, continuing if loading succeeds and otherwise jumping to the last step; finding the element to be clicked according to the expression and simulating a click on it, continuing if the browser responds successfully and otherwise jumping to the last step; obtaining the response content, encoding it as UTF-8, and jumping to the second step; and taking out the next hyperlink, continuing to crawl with a breadth-first or depth-first traversal algorithm, and then jumping to the first step. The invention greatly improves the completeness and accuracy of the content captured by the web crawler.

Description

Automatic processing method and device for click type hyperlink in website crawler
Technical Field
The invention belongs to the technical field of website crawlers, and particularly relates to an automatic processing method and device for click-type hyperlinks in a website crawler. It is suited to links that, during crawling, can only be followed after a manual click.
Background
With the development of modern front-end web technology, and of the front-end language JavaScript in particular, many excellent front-end frameworks such as jQuery, Vue, React, and Angular have appeared, along with UI component libraries such as Bootstrap and Element-UI. These frameworks offer good compatibility, applicability, convenience, and internationalization support and greatly improve the efficiency of website development, so more and more websites are built with them.
Although these frameworks bring great convenience to website development, they pose serious difficulties for website crawlers and content retrieval. One of the most prominent problems is that some hyperlinks can only be followed after a click. A traditional web crawler captures only hyperlinks of the form <a href="xxx">xxx</a> that exist in the web page and is helpless against hyperlinks of the form <a onclick="xxx">xxx</a>. Such hyperlinks are increasingly common on existing websites, particularly as "previous page" and "next page" links. Because these links are the basic means of paginating website content, the content captured by a website crawler is incomplete and inaccurate.
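The gap described above can be illustrated with a minimal sketch that uses only the Python standard library. The names `HrefCollector`, `split_links`, and `SAMPLE`, and the `goPage` handler in the sample markup, are hypothetical and chosen for illustration; the sketch only separates links a conventional crawler could follow from those that would need a simulated click.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Separates ordinary href links from links that need a click."""

    def __init__(self):
        super().__init__()
        self.followable = []   # plain href targets a conventional crawler follows
        self.click_only = []   # handlers that only run when the link is clicked

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href") or ""
        if href and not href.startswith(("javascript:", "#")):
            self.followable.append(href)
        elif "onclick" in attr or href.startswith("javascript:"):
            self.click_only.append(attr.get("onclick") or href)


def split_links(html):
    """Return (followable, click_only) for the anchors found in html."""
    parser = HrefCollector()
    parser.feed(html)
    return parser.followable, parser.click_only


# Hypothetical markup: one ordinary link and two click-type pagination links.
SAMPLE = (
    '<a href="/news/1.html">Article</a>'
    '<a href="#" onclick="goPage(2)">Next page</a>'
    '<a href="javascript:goPage(3)">Last page</a>'
)
```

A crawler that reads only `followable` would never reach the pages behind the two pagination links in `SAMPLE`, which is the incompleteness the background section describes.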
Disclosure of Invention
To address the problem that a traditional website crawler cannot capture click-type hyperlinks (such as "previous page" and "next page" hyperlinks), which leaves the captured content incomplete and inaccurate, the invention provides an automatic processing method and device for click-type hyperlinks in a website crawler.
In order to achieve the purpose, the invention adopts the following technical scheme:
In one aspect, the invention provides an automatic processing method for click-type hyperlinks in a website crawler, which comprises the following steps:
Step 1: capturing a page link with a web crawler;
Step 2: generating the webpage content from the page link;
Step 3: deduplicating all hyperlinks in the webpage and generating a hyperlink queue;
Step 4: judging, with XPath expressions applied to the webpage content, whether the page contains an element that must be clicked; if so, returning the matching XPath expression and executing step 5; otherwise jumping to step 8;
Step 5: invoking a virtual browser through Selenium and reloading the current page; if loading succeeds, executing step 6; otherwise jumping to step 8;
Step 6: selecting with the XPath expression returned in step 4; once the element to be clicked is found, simulating a click on it and waiting for the browser to respond; if the browser responds successfully, executing step 7; otherwise jumping to step 8;
Step 7: obtaining the response content from the browser, encoding it as UTF-8, and jumping to step 2;
Step 8: taking the next hyperlink from the hyperlink queue and continuing to crawl with a breadth-first or depth-first traversal algorithm; then jumping to step 1.
Further, in step 4, the XPath expressions include:
1) //td[contains(text(), 'lower page') and starts-with(@onclick, 'window.location')];
2) //a[contains(text(), 'Next page') and starts-with(@href, 'javascript:')];
3) //a[contains(text(), 'next page') and @onclick and @href='#'];
4) //*[contains(text(), 'Next page') and @onclick];
5) //a[@onclick].
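The control flow of steps 1 to 8 can be sketched as follows. This is an illustrative outline under stated assumptions, not the patented implementation: every I/O operation is injected as a callable, and the names `crawl`, `fetch`, `extract_links`, `find_clickable`, `click_and_render`, `max_pages`, and `max_clicks` are hypothetical. The two limits and the click counter passed to `click_and_render` are additions that keep the loop finite; the source does not mention them.

```python
from collections import deque


def crawl(seed, fetch, extract_links, find_clickable, click_and_render,
          max_pages=100, max_clicks=20):
    """Run steps 1-8 with all I/O supplied by the caller.

    fetch(url) -> html or None                              (step 2)
    extract_links(html) -> iterable of URLs                 (step 3)
    find_clickable(html) -> XPath string or None            (step 4)
    click_and_render(url, xpath, clicks) -> html or None    (steps 5-7)
    """
    queue = deque([seed])
    seen = {seed}
    pages = []                                   # (url, html) pairs collected

    while queue and len(pages) < max_pages:
        url = queue.popleft()                    # steps 8 and 1, breadth-first
        html = fetch(url)                        # step 2
        clicks = 0
        while html is not None and len(pages) < max_pages:
            pages.append((url, html))
            for link in extract_links(html):     # step 3: dedupe, then enqueue
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
            xpath = find_clickable(html)         # step 4
            if xpath is None or clicks >= max_clicks:
                break                            # nothing to click: go to step 8
            clicks += 1
            # Steps 5-7. A None result (load or click failed) ends the inner
            # loop, which corresponds to jumping to step 8.
            html = click_and_render(url, xpath, clicks)
    return pages
```

Injecting the callables keeps the loop independent of any particular HTTP client or browser driver, so it can be exercised with in-memory fakes before a real browser is attached.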
In another aspect, the present invention provides an apparatus for automatically processing click-type hyperlinks in a website crawler, comprising:
a page link capturing module, used for capturing page links with a web crawler;
a webpage content generating module, used for generating webpage content from the page link;
a duplication eliminating module, used for deduplicating all hyperlinks in the webpage and generating a hyperlink queue;
a first judgment module, used for judging, with XPath expressions applied to the webpage content, whether the webpage contains an element that must be clicked; if so, returning the matching XPath expression and executing the second judgment module; otherwise executing the circulation module;
a second judgment module, used for invoking a virtual browser through Selenium and reloading the current page; if loading succeeds, executing the third judgment module; otherwise executing the circulation module;
a third judgment module, used for selecting with the XPath expression returned by the first judgment module, simulating a click on the element once it is found, and waiting for the browser to respond; if the browser responds successfully, executing the encoding module; otherwise executing the circulation module;
an encoding module, used for obtaining the response content from the browser, encoding it as UTF-8, and then executing the webpage content generating module;
a circulation module, used for taking the next hyperlink from the hyperlink queue and continuing to crawl with a breadth-first or depth-first traversal algorithm, and then executing the page link capturing module.
Further, the XPath expressions include:
1) //td[contains(text(), 'lower page') and starts-with(@onclick, 'window.location')];
2) //a[contains(text(), 'Next page') and starts-with(@href, 'javascript:')];
3) //a[contains(text(), 'next page') and @onclick and @href='#'];
4) //*[contains(text(), 'Next page') and @onclick];
5) //a[@onclick].
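The check performed by the first judgment module can be approximated without an XPath engine. The sketch below is an assumption-laden stand-in: it pairs each of the five expressions (written as they are best read from the text, with `contains(text(), ...)` assumed as the first argument) with a plain Python predicate, and it tries the rules in the listed order, which the source does not state explicitly. The names `RULES`, `find_clickable`, and `_Elements` are hypothetical, the label strings are the translated literals from the text, and matching on an element's direct text only approximates XPath semantics. A real implementation would more likely evaluate the XPath expressions directly.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


def _attr(attrs, name):
    """Attribute value as a string ('' when missing or valueless)."""
    return attrs.get(name) or ""


# (XPath expression, stdlib predicate approximating it), in priority order.
RULES = [
    ("//td[contains(text(), 'lower page') and "
     "starts-with(@onclick, 'window.location')]",
     lambda tag, a, text: tag == "td" and "lower page" in text
     and _attr(a, "onclick").startswith("window.location")),
    ("//a[contains(text(), 'Next page') and starts-with(@href, 'javascript:')]",
     lambda tag, a, text: tag == "a" and "Next page" in text
     and _attr(a, "href").startswith("javascript:")),
    ("//a[contains(text(), 'next page') and @onclick and @href='#']",
     lambda tag, a, text: tag == "a" and "next page" in text
     and "onclick" in a and a.get("href") == "#"),
    ("//*[contains(text(), 'Next page') and @onclick]",
     lambda tag, a, text: "Next page" in text and "onclick" in a),
    ("//a[@onclick]",
     lambda tag, a, text: tag == "a" and "onclick" in a),
]


class _Elements(HTMLParser):
    """Records every element as [tag, attrs, direct_text]."""

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open elements
        self.nodes = []    # all elements, in document order

    def handle_starttag(self, tag, attrs):
        node = [tag, dict(attrs), ""]
        self.nodes.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name, if there is one.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if self.stack:
            self.stack[-1][2] += data


def find_clickable(html):
    """Return the first rule's XPath that matches some element, else None."""
    parser = _Elements()
    parser.feed(html)
    for xpath, matches in RULES:
        if any(matches(tag, attrs, text) for tag, attrs, text in parser.nodes):
            return xpath
    return None
```

Returning the expression rather than the element mirrors the text, where the matching expression is handed to the browser-driven stage so the element can be located again in the live page.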
Compared with the prior art, the invention has the following beneficial effects:
The method adopts simulated-browser technology: when the webpage content is judged to contain a feature that requires a manual click, the simulated browser loads the current page once, finds the link that needs to be clicked, fires a simulated click event on it, loads the result, and finally returns the corresponding content. With the invention, a website crawler can capture conventional webpage content and also the click-type hyperlinks (such as "previous page" and "next page" hyperlinks) that a traditional crawler cannot capture, which greatly improves the completeness and accuracy of the crawled content.
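The browser-driven part (reload, locate, click, wait, return UTF-8 content) might look like the sketch below. It is written against a duck-typed `driver` that is assumed to behave like a Selenium WebDriver, using only `get(url)`, `find_element(by, value)`, and `page_source`; the string `"xpath"` is the value of Selenium's `By.XPATH` locator constant. The function name `click_and_capture`, the `timeout` and `poll` parameters, and the "page source changed" success test are assumptions made for illustration; the source only says to wait for the browser to respond.

```python
import time


def click_and_capture(driver, url, xpath, timeout=10.0, poll=0.2):
    """Reload url, click the element matched by xpath, return UTF-8 bytes.

    Returns None when loading, locating, or clicking fails, or when the page
    does not change before the timeout. In the method described above, each
    of those cases leads to the final step (take the next hyperlink).
    """
    try:
        driver.get(url)                                 # reload the current page
        before = driver.page_source
        element = driver.find_element("xpath", xpath)   # "xpath" == By.XPATH
        element.click()                                 # simulated click
    except Exception:
        # Broad on purpose in this sketch: any driver error counts as failure.
        return None

    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        after = driver.page_source
        if after != before:                 # heuristic: the DOM was updated
            return after.encode("utf-8")    # response content as UTF-8
        time.sleep(poll)
    return None
```

With a real browser, `driver` could be an instance such as `selenium.webdriver.Chrome()`, and Selenium's explicit waits would usually be a more robust way to detect that the click has taken effect than comparing page sources.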
Drawings
FIG. 1 is a basic flowchart of a method for automated processing of click-type hyperlinks in web crawlers, according to an embodiment of the present invention;
FIG. 2 is a graph comparing website crawler data;
FIG. 3 is a block diagram of an exemplary automated processing device for clicking on a hyperlink in a web crawler.
Detailed Description
The invention is further illustrated by the following examples in conjunction with the accompanying drawings:
As shown in FIG. 1, an automatic processing method for click-type hyperlinks in a website crawler includes:
Step 1: capturing a page link with a web crawler;
Step 2: generating the webpage content from the page link;
Step 3: deduplicating all hyperlinks in the webpage and generating a hyperlink queue;
Step 4: judging, with XPath expressions applied to the webpage content, whether the page contains an element that must be clicked; if so, returning the matching XPath expression and executing step 5; otherwise jumping to step 8;
Step 5: invoking a virtual browser through Selenium and reloading the current page; if loading succeeds, executing step 6; otherwise jumping to step 8;
Step 6: selecting with the XPath expression returned in step 4; once the element to be clicked is found, simulating a click on it and waiting for the browser to respond; if the browser responds successfully, executing step 7; otherwise jumping to step 8;
Step 7: obtaining the response content from the browser, encoding it as UTF-8, and jumping to step 2;
Step 8: taking the next hyperlink from the hyperlink queue and continuing to crawl with a breadth-first or depth-first traversal algorithm; then jumping to step 1.
Further, in step 4, the XPath expressions include:
1) //td[contains(text(), 'lower page') and starts-with(@onclick, 'window.location')];
2) //a[contains(text(), 'Next page') and starts-with(@href, 'javascript:')];
3) //a[contains(text(), 'next page') and @onclick and @href='#'];
4) //*[contains(text(), 'Next page') and @onclick];
5) //a[@onclick].
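Steps 3 and 8 rely on a deduplicated hyperlink queue that can be consumed in breadth-first or depth-first order. A minimal sketch of such a queue follows; the class name `Frontier`, its methods, and the normalisation choices (resolving relative links, dropping fragments, keeping only http and https URLs) are assumptions for illustration and are not specified in the text.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin


class Frontier:
    """Deduplicated hyperlink queue with breadth- or depth-first order."""

    def __init__(self, strategy="breadth"):
        if strategy not in ("breadth", "depth"):
            raise ValueError("strategy must be 'breadth' or 'depth'")
        self.strategy = strategy
        self._pending = deque()
        self._seen = set()

    def add_all(self, base_url, hrefs):
        """Resolve each href against base_url and enqueue unseen URLs."""
        for href in hrefs:
            url, _fragment = urldefrag(urljoin(base_url, href))
            if url.startswith(("http://", "https://")) and url not in self._seen:
                self._seen.add(url)
                self._pending.append(url)

    def next(self):
        """Return the next URL to crawl, or None when the queue is empty."""
        if not self._pending:
            return None
        if self.strategy == "breadth":
            return self._pending.popleft()   # FIFO: breadth-first traversal
        return self._pending.pop()           # LIFO: depth-first traversal
```

Keeping the set of seen URLs separate from the pending queue means a link is never enqueued twice, even after it has already been taken out and crawled.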
To verify the effect of the present invention, the following experiment was performed:
Five websites were used in the test. To eliminate the influence of network jitter and server performance, each website was tested ten times: the first five runs used a standard crawler method and the last five runs used the method of the invention. The test results are shown in the following table. For privacy and similar reasons, website names are replaced with website IDs.
Table 1: Comparative test results [table images not reproduced]
The comparative test chart is shown in fig. 2.
As can be seen from Table 1 and FIG. 2, for certain websites the amount of data captured by the crawler is markedly higher with the method of the invention, and the completeness of the crawler data is greatly improved.
On the basis of the above embodiment, as shown in FIG. 3, another aspect of the present invention provides an automatic processing apparatus for click-type hyperlinks in a website crawler, comprising:
a page link capturing module, used for capturing page links with a web crawler;
a webpage content generating module, used for generating webpage content from the page link;
a duplication eliminating module, used for deduplicating all hyperlinks in the webpage and generating a hyperlink queue;
a first judgment module, used for judging, with XPath expressions applied to the webpage content, whether the webpage contains an element that must be clicked; if so, returning the matching XPath expression and executing the second judgment module; otherwise executing the circulation module;
a second judgment module, used for invoking a virtual browser through Selenium and reloading the current page; if loading succeeds, executing the third judgment module; otherwise executing the circulation module;
a third judgment module, used for selecting with the XPath expression returned by the first judgment module, simulating a click on the element once it is found, and waiting for the browser to respond; if the browser responds successfully, executing the encoding module; otherwise executing the circulation module;
an encoding module, used for obtaining the response content from the browser, encoding it as UTF-8, and then executing the webpage content generating module;
a circulation module, used for taking the next hyperlink from the hyperlink queue and continuing to crawl with a breadth-first or depth-first traversal algorithm, and then executing the page link capturing module.
Further, the XPath expressions include:
1) //td[contains(text(), 'lower page') and starts-with(@onclick, 'window.location')];
2) //a[contains(text(), 'Next page') and starts-with(@href, 'javascript:')];
3) //a[contains(text(), 'next page') and @onclick and @href='#'];
4) //*[contains(text(), 'Next page') and @onclick];
5) //a[@onclick].
In summary, the invention adopts simulated-browser technology: when the webpage content is judged to contain a feature that requires a manual click, the simulated browser loads the current page once, finds the link that needs to be clicked, fires a simulated click event on it, loads the result, and finally returns the corresponding content. With the invention, a website crawler can capture conventional webpage content and also the click-type hyperlinks (such as "previous page" and "next page" hyperlinks) that a traditional crawler cannot capture, which greatly improves the completeness and accuracy of the crawled content.
The above are only preferred embodiments of the present invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the principle of the present invention, and these modifications and improvements shall also fall within the protection scope of the present invention.

Claims (4)

1. An automatic processing method for click-type hyperlinks in a website crawler, comprising:
Step 1: capturing a page link with a web crawler;
Step 2: generating the webpage content from the page link;
Step 3: deduplicating all hyperlinks in the webpage and generating a hyperlink queue;
Step 4: judging, with XPath expressions applied to the webpage content, whether the page contains an element that must be clicked; if so, returning the matching XPath expression and executing step 5; otherwise jumping to step 8;
Step 5: invoking a virtual browser through Selenium and reloading the current page; if loading succeeds, executing step 6; otherwise jumping to step 8;
Step 6: selecting with the XPath expression returned in step 4; once the element to be clicked is found, simulating a click on it and waiting for the browser to respond; if the browser responds successfully, executing step 7; otherwise jumping to step 8;
Step 7: obtaining the response content from the browser, encoding it as UTF-8, and jumping to step 2;
Step 8: taking the next hyperlink from the hyperlink queue and continuing to crawl with a breadth-first or depth-first traversal algorithm; then jumping to step 1.
2. The method of claim 1, wherein in step 4, the XPath expressions comprise:
1) //td[contains(text(), 'lower page') and starts-with(@onclick, 'window.location')];
2) //a[contains(text(), 'Next page') and starts-with(@href, 'javascript:')];
3) //a[contains(text(), 'next page') and @onclick and @href='#'];
4) //*[contains(text(), 'Next page') and @onclick];
5) //a[@onclick].
3. An automatic processing apparatus for click-type hyperlinks in a website crawler, comprising:
a page link capturing module, used for capturing page links with a web crawler;
a webpage content generating module, used for generating webpage content from the page link;
a duplication eliminating module, used for deduplicating all hyperlinks in the webpage and generating a hyperlink queue;
a first judgment module, used for judging, with XPath expressions applied to the webpage content, whether the webpage contains an element that must be clicked; if so, returning the matching XPath expression and executing the second judgment module; otherwise executing the circulation module;
a second judgment module, used for invoking a virtual browser through Selenium and reloading the current page; if loading succeeds, executing the third judgment module; otherwise executing the circulation module;
a third judgment module, used for selecting with the XPath expression returned by the first judgment module, simulating a click on the element once it is found, and waiting for the browser to respond; if the browser responds successfully, executing the encoding module; otherwise executing the circulation module;
an encoding module, used for obtaining the response content from the browser, encoding it as UTF-8, and then executing the webpage content generating module;
a circulation module, used for taking the next hyperlink from the hyperlink queue and continuing to crawl with a breadth-first or depth-first traversal algorithm, and then executing the page link capturing module.
4. The automatic processing apparatus for click-type hyperlinks according to claim 3, wherein the XPath expressions comprise:
1) //td[contains(text(), 'lower page') and starts-with(@onclick, 'window.location')];
2) //a[contains(text(), 'Next page') and starts-with(@href, 'javascript:')];
3) //a[contains(text(), 'next page') and @onclick and @href='#'];
4) //*[contains(text(), 'Next page') and @onclick];
5) //a[@onclick].
CN202111018080.XA 2021-08-30 2021-08-30 Automatic processing method and device for click type hyperlink in website crawler Active CN113656674B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111018080.XA CN113656674B (en) 2021-08-30 2021-08-30 Automatic processing method and device for click type hyperlink in website crawler

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111018080.XA CN113656674B (en) 2021-08-30 2021-08-30 Automatic processing method and device for click type hyperlink in website crawler

Publications (2)

Publication Number Publication Date
CN113656674A true CN113656674A (en) 2021-11-16
CN113656674B CN113656674B (en) 2023-06-27

Family

ID=78493394

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111018080.XA Active CN113656674B (en) 2021-08-30 2021-08-30 Automatic processing method and device for click type hyperlink in website crawler

Country Status (1)

Country Link
CN (1) CN113656674B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2016101343A4 (en) * 2015-07-30 2016-09-01 M Hassan & S Hassan & E Kravchenko & A Shchurov Method and systems for operating dynamic dashboard style website menus
CN105243159A (en) * 2015-10-28 2016-01-13 福建亿榕信息技术有限公司 Visual script editor-based distributed web crawler system
CN108062468A (en) * 2017-12-25 2018-05-22 南京烽火软件科技有限公司 A kind of web crawlers method based on picture validation code identification
CN112632358A (en) * 2020-12-29 2021-04-09 北京天融信网络安全技术有限公司 Resource link obtaining method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
何苗;张蕴;: "基于Selenium框架的定向网络数据获取的设计与实现", 工业控制计算机, no. 06 *

Also Published As

Publication number Publication date
CN113656674B (en) 2023-06-27

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant