CN113656674A - Automatic processing method and device for click type hyperlink in website crawler - Google Patents
- Publication number
- CN113656674A
- Authority
- CN
- China
- Prior art keywords
- page
- module
- executing
- browser
- hyperlink
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
- G06F16/9558—Details of hyperlinks; Management of linked annotations
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The invention discloses an automated processing method and device for click-type hyperlinks in a website crawler. The method comprises: capturing page links with a web crawler; generating the webpage content; generating a hyperlink queue; using XPath expressions to judge whether the page contains an element that must be clicked; if a matching expression is returned, continuing, otherwise jumping to the last step; calling a virtual browser through Selenium and reloading the current page; if loading succeeds, continuing, otherwise jumping to the last step; locating the element to be clicked with the expression and simulating a click; if the browser responds successfully, continuing, otherwise jumping to the last step; obtaining the response content, encoding it as UTF-8, and jumping back to the second step; and, in the last step, taking the next hyperlink from the queue, continuing to crawl with a breadth-first or depth-first traversal algorithm, and jumping back to the first step. The invention greatly improves the completeness and accuracy of the content captured by the web crawler.
Description
Technical Field
The invention belongs to the technical field of website crawlers, and in particular relates to an automated processing method and device for click-type hyperlinks in a website crawler. It is suited to links that, during crawling, can only be followed by a manual click.
Background
With the development of modern front-end web technology, and of the front-end language JavaScript in particular, many excellent front-end frameworks such as jQuery, Vue, React, and Angular have appeared. Alongside them came UI component libraries such as Bootstrap and Element-UI. These frameworks perform well in compatibility, applicability, convenience, and internationalization, which greatly improves the efficiency of website development, so more and more websites are built with them.
Although these frameworks make website development much more convenient, they pose great difficulties and challenges for website crawlers and content retrieval. One of the most prominent problems is that some hyperlinks require a click before access can continue. A traditional web crawler only captures hyperlinks of the form <a href="xxx">xxx</a> in a page and is helpless with hyperlinks of the form <a onclick="xxx">xxx</a>. Such hyperlinks are increasingly common on existing websites, particularly for "previous page" and "next page" links. Because these links are the basic style and method for paginating website content, the content captured by a website crawler is incomplete and inaccurate.
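The gap described above can be illustrated with a minimal sketch that uses only Python's standard `html.parser`. The class name `AnchorCollector` and the sample markup are hypothetical and serve only to show that a crawler which reads `href` values alone finds one usable link where the page actually offers three.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Split <a> elements into plain links and click-only links."""

    def __init__(self):
        super().__init__()
        self.href_links = []   # URLs a plain crawler can follow
        self.click_only = []   # anchors that only react to a click

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href") or ""
        # "#" and "javascript:" values are not fetchable URLs.
        is_real_url = (bool(href) and href != "#"
                       and not href.lower().startswith("javascript:"))
        if is_real_url:
            self.href_links.append(href)
        elif "onclick" in attr or href:
            self.click_only.append(attr.get("onclick") or href)


# Hypothetical sample page with one plain link and two click-type links.
SAMPLE = """
<a href="/news/1.html">First article</a>
<a href="#" onclick="gotoPage(2)">Next page</a>
<a href="javascript:gotoPage(3)">Last page</a>
"""

collector = AnchorCollector()
collector.feed(SAMPLE)
collector.close()
print(collector.href_links)   # links an href-only crawler would queue
print(collector.click_only)   # links it would silently miss
```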
Disclosure of Invention
Traditional website crawlers cannot capture click-type hyperlinks (such as "previous page" and "next page" hyperlinks), so the captured content is incomplete and inaccurate. To address this problem, the invention provides an automated processing method and device for click-type hyperlinks in a website crawler.
To achieve this purpose, the invention adopts the following technical solution:
The invention provides an automated processing method for click-type hyperlinks in a website crawler, comprising the following steps:
Step 1: capturing a page link with a web crawler;
Step 2: generating the webpage content from the page link;
Step 3: deduplicating all hyperlinks in the webpage and generating a hyperlink queue;
Step 4: based on the webpage content, using XPath expressions to judge whether the page contains an element that must be clicked; if so, returning the matching XPath expression and executing step 5; otherwise jumping to step 8;
Step 5: calling a virtual browser through Selenium and reloading the current page; if loading succeeds, executing step 6; otherwise jumping to step 8;
Step 6: selecting with the XPath expression returned in step 4; once the element to be clicked is found, simulating a click on it and waiting for the browser's response; if the browser responds successfully, executing step 7; otherwise jumping to step 8;
Step 7: obtaining the response content from the browser, encoding it as UTF-8, and jumping to step 2;
Step 8: taking the next hyperlink from the hyperlink queue and continuing to crawl with a breadth-first or depth-first traversal algorithm; then jumping to step 1.
Further, in step 4, the XPath expressions include:
1) //td[contains(text(), 'lower page') and starts-with(@onclick, 'window.location')];
2) //a[contains(text(), 'Next page') and starts-with(@href, 'javascript:')];
3) //a[contains(text(), 'next page') and @onclick and @href='#'];
4) //*[contains(text(), 'Next page') and @onclick];
5) //a[@onclick].
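The judgment in step 4 can be sketched as trying the five patterns in order and returning the first one that matches. The sketch below is an approximation under stated assumptions: it reads the listed expressions as standard XPath 1.0, but instead of running an XPath engine it mirrors each expression with a Python predicate over elements collected by the standard `html.parser`. The names `ElementCollector`, `CLICK_PATTERNS`, and `find_click_xpath` are hypothetical, and the label strings are the translated ones from the text; a real site would need the label text actually shown on its pages.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ElementCollector(HTMLParser):
    """Record tag, attributes, and direct text for every element."""

    def __init__(self):
        super().__init__()
        self.elements = []   # all elements in document order
        self._open = []      # stack of elements that are still open

    def handle_starttag(self, tag, attrs):
        element = {"tag": tag, "attrs": dict(attrs), "text": ""}
        self.elements.append(element)
        if tag not in VOID_TAGS:
            self._open.append(element)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate sloppy markup.
        for i in range(len(self._open) - 1, -1, -1):
            if self._open[i]["tag"] == tag:
                del self._open[i:]
                break

    def handle_data(self, data):
        if self._open:
            self._open[-1]["text"] += data


def _attr(element, name):
    return element["attrs"].get(name) or ""


# Each entry pairs an XPath string with a predicate that approximates it.
CLICK_PATTERNS = [
    ("//td[contains(text(), 'lower page') and starts-with(@onclick, 'window.location')]",
     lambda e: e["tag"] == "td" and "lower page" in e["text"]
     and _attr(e, "onclick").startswith("window.location")),
    ("//a[contains(text(), 'Next page') and starts-with(@href, 'javascript:')]",
     lambda e: e["tag"] == "a" and "Next page" in e["text"]
     and _attr(e, "href").startswith("javascript:")),
    ("//a[contains(text(), 'next page') and @onclick and @href='#']",
     lambda e: e["tag"] == "a" and "next page" in e["text"]
     and "onclick" in e["attrs"] and e["attrs"].get("href") == "#"),
    ("//*[contains(text(), 'Next page') and @onclick]",
     lambda e: "Next page" in e["text"] and "onclick" in e["attrs"]),
    ("//a[@onclick]",
     lambda e: e["tag"] == "a" and "onclick" in e["attrs"]),
]


def find_click_xpath(html):
    """Return the first pattern's XPath string that matches, else None."""
    collector = ElementCollector()
    collector.feed(html)
    collector.close()
    for xpath, matches in CLICK_PATTERNS:
        if any(matches(e) for e in collector.elements):
            return xpath
    return None


print(find_click_xpath('<a href="#" onclick="go(2)">next page</a>'))
```

Ordering the patterns from most to least specific means the catch-all `//a[@onclick]` is only used when none of the pagination-specific patterns applies.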
In another aspect, the invention provides an automated processing device for click-type hyperlinks in a website crawler, comprising:
a page link capturing module, configured to capture page links with a web crawler;
a webpage content generating module, configured to generate webpage content from the page link;
a deduplication module, configured to deduplicate all hyperlinks in the webpage and generate a hyperlink queue;
a first judgment module, configured to use XPath expressions, based on the webpage content, to judge whether the page contains an element that must be clicked; if so, the matching XPath expression is returned and the second judgment module is executed; otherwise the circulation module is executed;
a second judgment module, configured to call a virtual browser through Selenium and reload the current page; if loading succeeds, the third judgment module is executed; otherwise the circulation module is executed;
a third judgment module, configured to select with the XPath expression returned by the first judgment module, simulate a click on the element once it is found, and wait for the browser's response; if the browser responds successfully, the encoding module is executed; otherwise the circulation module is executed;
an encoding module, configured to obtain the response content from the browser, encode it as UTF-8, and then execute the webpage content generating module;
a circulation module, configured to take the next hyperlink from the hyperlink queue and continue crawling with a breadth-first or depth-first traversal algorithm, and then execute the page link capturing module.
Further, the XPath expressions include:
1) //td[contains(text(), 'lower page') and starts-with(@onclick, 'window.location')];
2) //a[contains(text(), 'Next page') and starts-with(@href, 'javascript:')];
3) //a[contains(text(), 'next page') and @onclick and @href='#'];
4) //*[contains(text(), 'Next page') and @onclick];
5) //a[@onclick].
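The deduplication step and the circulation module can be sketched together as a small frontier that rejects links it has already seen and hands out the next link in breadth-first or depth-first order. This is a minimal sketch using only the standard library; the class name `HyperlinkQueue`, its methods, and the choice to ignore URL fragments when deduplicating are assumptions, not details given in the text.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin


class HyperlinkQueue:
    """Deduplicated hyperlink queue with a BFS/DFS switch."""

    def __init__(self, depth_first=False):
        self._pending = deque()
        self._seen = set()
        self._depth_first = depth_first

    def add_all(self, base_url, hrefs):
        """Resolve each href against base_url and queue unseen URLs."""
        for href in hrefs:
            # Drop "#fragment" so that page.html and page.html#top are one link.
            url, _fragment = urldefrag(urljoin(base_url, href))
            if url and url not in self._seen:
                self._seen.add(url)
                self._pending.append(url)

    def next(self):
        """Return the next URL to crawl, or None when the queue is empty."""
        if not self._pending:
            return None
        if self._depth_first:
            return self._pending.pop()      # newest first: depth-first
        return self._pending.popleft()      # oldest first: breadth-first


queue = HyperlinkQueue()
queue.add_all("http://example.com/a/", ["b.html", "b.html#top", "/c"])
print(queue.next())
print(queue.next())
```

Using one `deque` for both orders keeps the traversal choice to a single line: popping from the right gives depth-first behavior, popping from the left gives breadth-first behavior.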
Compared with the prior art, the invention has the following beneficial effects:
The method uses simulated-browser technology: when the webpage content is judged to contain an element that requires a manual click, the simulated browser loads the current page once, locates the link that needs clicking, fires a simulated click event, loads the resulting content, and returns it. As a result, the crawler captures not only conventional webpage content but also click-type hyperlinks (such as "previous page" and "next page" hyperlinks) that a traditional crawler cannot follow, which greatly improves the completeness and accuracy of the crawled content.
Drawings
FIG. 1 is a basic flowchart of an automated processing method for click-type hyperlinks in a website crawler according to an embodiment of the invention;
FIG. 2 is a comparison chart of website crawler data;
FIG. 3 is a structural diagram of an automated processing device for click-type hyperlinks in a website crawler according to an embodiment of the invention.
Detailed Description
The invention is further described below with reference to the embodiments and the accompanying drawings:
As shown in FIG. 1, an automated processing method for click-type hyperlinks in a website crawler comprises:
Step 1: capturing a page link with a web crawler;
Step 2: generating the webpage content from the page link;
Step 3: deduplicating all hyperlinks in the webpage and generating a hyperlink queue;
Step 4: based on the webpage content, using XPath expressions to judge whether the page contains an element that must be clicked; if so, returning the matching XPath expression and executing step 5; otherwise jumping to step 8;
Step 5: calling a virtual browser through Selenium and reloading the current page; if loading succeeds, executing step 6; otherwise jumping to step 8;
Step 6: selecting with the XPath expression returned in step 4; once the element to be clicked is found, simulating a click on it and waiting for the browser's response; if the browser responds successfully, executing step 7; otherwise jumping to step 8;
Step 7: obtaining the response content from the browser, encoding it as UTF-8, and jumping to step 2;
Step 8: taking the next hyperlink from the hyperlink queue and continuing to crawl with a breadth-first or depth-first traversal algorithm; then jumping to step 1.
Further, in step 4, the XPath expressions include:
1) //td[contains(text(), 'lower page') and starts-with(@onclick, 'window.location')];
2) //a[contains(text(), 'Next page') and starts-with(@href, 'javascript:')];
3) //a[contains(text(), 'next page') and @onclick and @href='#'];
4) //*[contains(text(), 'Next page') and @onclick];
5) //a[@onclick].
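The control flow of the eight steps can be sketched as two nested loops: the outer loop walks the hyperlink queue (steps 8 and 1), and the inner loop keeps feeding clicked-through content back into step 2. In this sketch the four callables `fetch`, `extract_links`, `find_click_xpath`, and `click_and_fetch` are hypothetical placeholders supplied by the caller. The `max_pages` limit and the `seen_content` guard are additions that are not in the text; they are included so that a page whose click target never disappears cannot loop forever.

```python
from collections import deque


def crawl(start_url, fetch, extract_links, find_click_xpath, click_and_fetch,
          max_pages=100):
    """Return (url, content) pairs gathered by the eight-step loop."""
    queue = deque([start_url])      # hyperlink queue, breadth-first here
    seen_urls = {start_url}
    seen_content = set()            # guard against re-reading the same page
    pages = []
    while queue and len(pages) < max_pages:
        url = queue.popleft()                       # steps 8 and 1
        content = fetch(url)                        # step 2
        while (content is not None and content not in seen_content
               and len(pages) < max_pages):
            seen_content.add(content)
            pages.append((url, content))
            for link in extract_links(content):     # step 3
                if link not in seen_urls:
                    seen_urls.add(link)
                    queue.append(link)
            xpath = find_click_xpath(content)       # step 4
            if xpath is None:
                break                               # nothing to click: step 8
            content = click_and_fetch(url, xpath)   # steps 5 to 7, None on failure
    return pages


# Tiny in-memory "site" used only to exercise the control flow.
SITE = {"/": "HOME next", "/a": "A"}
result = crawl(
    "/",
    fetch=SITE.get,
    extract_links=lambda content: ["/a"] if "HOME" in content else [],
    find_click_xpath=lambda content: "//a[@onclick]" if "next" in content else None,
    click_and_fetch=lambda url, xpath: "HOME page2" if url == "/" else None,
)
print(result)
```

Passing the four operations in as callables keeps the loop independent of any particular HTTP client or browser driver, so each step can be replaced or tested on its own.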
To verify the effect of the invention, the following experiment was performed:
Five websites were used in the test. To eliminate the influence of network jitter and server performance, each website was tested ten times: the first five runs used a standard crawler method and the last five used the method of the invention. The test results are shown in the table below. For privacy and other reasons, website names are replaced with website IDs.
Table 1: Comparative test results
The comparative test chart is shown in FIG. 2.
As can be seen from Table 1 and FIG. 2, for certain websites the amount of data captured by the method is significantly higher, and the completeness of the crawler data is greatly improved.
On the basis of the above embodiment, as shown in FIG. 3, another aspect of the invention provides an automated processing device for click-type hyperlinks in a website crawler, comprising:
a page link capturing module, configured to capture page links with a web crawler;
a webpage content generating module, configured to generate webpage content from the page link;
a deduplication module, configured to deduplicate all hyperlinks in the webpage and generate a hyperlink queue;
a first judgment module, configured to use XPath expressions, based on the webpage content, to judge whether the page contains an element that must be clicked; if so, the matching XPath expression is returned and the second judgment module is executed; otherwise the circulation module is executed;
a second judgment module, configured to call a virtual browser through Selenium and reload the current page; if loading succeeds, the third judgment module is executed; otherwise the circulation module is executed;
a third judgment module, configured to select with the XPath expression returned by the first judgment module, simulate a click on the element once it is found, and wait for the browser's response; if the browser responds successfully, the encoding module is executed; otherwise the circulation module is executed;
an encoding module, configured to obtain the response content from the browser, encode it as UTF-8, and then execute the webpage content generating module;
a circulation module, configured to take the next hyperlink from the hyperlink queue and continue crawling with a breadth-first or depth-first traversal algorithm, and then execute the page link capturing module.
Further, the XPath expressions include:
1) //td[contains(text(), 'lower page') and starts-with(@onclick, 'window.location')];
2) //a[contains(text(), 'Next page') and starts-with(@href, 'javascript:')];
3) //a[contains(text(), 'next page') and @onclick and @href='#'];
4) //*[contains(text(), 'Next page') and @onclick];
5) //a[@onclick].
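The encoding module's job of handing UTF-8 content back to the webpage content generating module can be sketched as a small normalizer. This is a minimal sketch under assumptions: the helper name `to_utf8` and the `declared_charset` parameter are hypothetical, and the text does not say whether the browser response arrives as text or as bytes, so both cases are handled.

```python
def to_utf8(content, declared_charset="utf-8"):
    """Return content as UTF-8 bytes.

    content may be text (for example a browser's page source) or raw
    bytes in declared_charset (for example a plain HTTP response body).
    """
    if isinstance(content, bytes):
        # Undecodable bytes are replaced rather than aborting the crawl.
        content = content.decode(declared_charset, errors="replace")
    return content.encode("utf-8")


label = "下一页"  # sample link label meaning "next page"
print(to_utf8(label))
print(to_utf8(label.encode("gbk"), declared_charset="gbk") == to_utf8(label))
```

Normalizing every response to one encoding before it re-enters the pipeline means the text-matching in the first judgment module sees the same characters regardless of how the site served the page.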
In summary, the invention uses simulated-browser technology: when the webpage content is judged to contain an element that requires a manual click, the simulated browser loads the current page once, locates the link that needs clicking, fires a simulated click event, loads the resulting content, and returns it. As a result, the crawler captures not only conventional webpage content but also click-type hyperlinks (such as "previous page" and "next page" hyperlinks) that a traditional crawler cannot follow, which greatly improves the completeness and accuracy of the crawled content.
The above are only preferred embodiments of the invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the principle of the invention, and these modifications and improvements should also be regarded as falling within the protection scope of the invention.
Claims (4)
1. An automated processing method for click-type hyperlinks in a website crawler, comprising:
Step 1: capturing a page link with a web crawler;
Step 2: generating the webpage content from the page link;
Step 3: deduplicating all hyperlinks in the webpage and generating a hyperlink queue;
Step 4: based on the webpage content, using XPath expressions to judge whether the page contains an element that must be clicked; if so, returning the matching XPath expression and executing step 5; otherwise jumping to step 8;
Step 5: calling a virtual browser through Selenium and reloading the current page; if loading succeeds, executing step 6; otherwise jumping to step 8;
Step 6: selecting with the XPath expression returned in step 4; once the element to be clicked is found, simulating a click on it and waiting for the browser's response; if the browser responds successfully, executing step 7; otherwise jumping to step 8;
Step 7: obtaining the response content from the browser, encoding it as UTF-8, and jumping to step 2;
Step 8: taking the next hyperlink from the hyperlink queue and continuing to crawl with a breadth-first or depth-first traversal algorithm; then jumping to step 1.
2. The method of claim 1, wherein in step 4 the XPath expressions comprise:
1) //td[contains(text(), 'lower page') and starts-with(@onclick, 'window.location')];
2) //a[contains(text(), 'Next page') and starts-with(@href, 'javascript:')];
3) //a[contains(text(), 'next page') and @onclick and @href='#'];
4) //*[contains(text(), 'Next page') and @onclick];
5) //a[@onclick].
3. An automated processing apparatus for click-type hyperlinks in a website crawler, comprising:
a page link capturing module, configured to capture page links with a web crawler;
a webpage content generating module, configured to generate webpage content from the page link;
a deduplication module, configured to deduplicate all hyperlinks in the webpage and generate a hyperlink queue;
a first judgment module, configured to use XPath expressions, based on the webpage content, to judge whether the page contains an element that must be clicked; if so, the matching XPath expression is returned and the second judgment module is executed; otherwise the circulation module is executed;
a second judgment module, configured to call a virtual browser through Selenium and reload the current page; if loading succeeds, the third judgment module is executed; otherwise the circulation module is executed;
a third judgment module, configured to select with the XPath expression returned by the first judgment module, simulate a click on the element once it is found, and wait for the browser's response; if the browser responds successfully, the encoding module is executed; otherwise the circulation module is executed;
an encoding module, configured to obtain the response content from the browser, encode it as UTF-8, and then execute the webpage content generating module;
a circulation module, configured to take the next hyperlink from the hyperlink queue and continue crawling with a breadth-first or depth-first traversal algorithm, and then execute the page link capturing module.
4. The automated processing apparatus for click-type hyperlinks in a website crawler according to claim 3, wherein the XPath expressions comprise:
1) //td[contains(text(), 'lower page') and starts-with(@onclick, 'window.location')];
2) //a[contains(text(), 'Next page') and starts-with(@href, 'javascript:')];
3) //a[contains(text(), 'next page') and @onclick and @href='#'];
4) //*[contains(text(), 'Next page') and @onclick];
5) //a[@onclick].
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111018080.XA CN113656674B (en) | 2021-08-30 | 2021-08-30 | Automatic processing method and device for click type hyperlink in website crawler |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111018080.XA CN113656674B (en) | 2021-08-30 | 2021-08-30 | Automatic processing method and device for click type hyperlink in website crawler |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113656674A (en) | 2021-11-16 |
CN113656674B (en) | 2023-06-27 |
Family
ID=78493394
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111018080.XA Active CN113656674B (en) | 2021-08-30 | 2021-08-30 | Automatic processing method and device for click type hyperlink in website crawler |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113656674B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105243159A (en) * | 2015-10-28 | 2016-01-13 | 福建亿榕信息技术有限公司 | Visual script editor-based distributed web crawler system |
AU2016101343A4 (en) * | 2015-07-30 | 2016-09-01 | M Hassan & S Hassan & E Kravchenko & A Shchurov | Method and systems for operating dynamic dashboard style website menus |
CN108062468A (en) * | 2017-12-25 | 2018-05-22 | 南京烽火软件科技有限公司 | A kind of web crawlers method based on picture validation code identification |
CN112632358A (en) * | 2020-12-29 | 2021-04-09 | 北京天融信网络安全技术有限公司 | Resource link obtaining method and device, electronic equipment and storage medium |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
AU2016101343A4 (en) * | 2015-07-30 | 2016-09-01 | M Hassan & S Hassan & E Kravchenko & A Shchurov | Method and systems for operating dynamic dashboard style website menus |
CN105243159A (en) * | 2015-10-28 | 2016-01-13 | 福建亿榕信息技术有限公司 | Visual script editor-based distributed web crawler system |
CN108062468A (en) * | 2017-12-25 | 2018-05-22 | 南京烽火软件科技有限公司 | A kind of web crawlers method based on picture validation code identification |
CN112632358A (en) * | 2020-12-29 | 2021-04-09 | 北京天融信网络安全技术有限公司 | Resource link obtaining method and device, electronic equipment and storage medium |
Non-Patent Citations (1)
Title |
---|
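He Miao; Zhang Yun: "Design and implementation of targeted network data acquisition based on the Selenium framework", Industrial Control Computer (工业控制计算机), no. 06 * |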
Also Published As
Publication number | Publication date |
---|---|
CN113656674B (en) | 2023-06-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103095681B (en) | A kind of method and device detecting leak | |
CN109033115B (en) | Dynamic webpage crawler system | |
KR101908162B1 (en) | Live browser tooling in an integrated development environment | |
US10699017B2 (en) | Determining coverage of dynamic security scans using runtime and static code analyses | |
CN109033195A (en) | The acquisition methods of webpage information obtain equipment and computer-readable medium | |
CN106599270B (en) | Network data capturing method and crawler | |
CN112637361B (en) | Page proxy method, device, electronic equipment and storage medium | |
CN111367595B (en) | Data processing method, program running method, device and processing equipment | |
CN112417240A (en) | Website link detection method and device and computer equipment | |
CN112632358B (en) | Resource link obtaining method and device, electronic equipment and storage medium | |
CN111538883A (en) | Data crawling method, system and equipment | |
CN103631806A (en) | Network information fetching method and device | |
CN108306918B (en) | Automatic website access information acquisition method based on program dynamic analysis | |
US20160034378A1 (en) | Method and system for testing page link addresses | |
CN112612943A (en) | Asynchronous processing framework-based data crawling method with automatic testing function | |
CN111324894A (en) | XSS vulnerability detection method and system based on web application security | |
US9003378B2 (en) | Client-side application script error processing | |
CN104281629A (en) | Method and device for extracting picture from webpage and client equipment | |
CN106371987A (en) | Test method and device | |
CN114491560A (en) | Vulnerability detection method and device, storage medium and electronic equipment | |
US10198408B1 (en) | System and method for converting and importing web site content | |
CN110719344B (en) | Domain name acquisition method and device, electronic equipment and storage medium | |
CN113656674B (en) | Automatic processing method and device for click type hyperlink in website crawler | |
Li et al. | Automatically crawling dynamic web applications via proxy-based javascript injection and runtime analysis | |
CN110232019A (en) | Page test method and Related product |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||