CN113868501A - Deep webpage crawling method and device and vulnerability scanning system

Publication number
CN113868501A
Authority
CN
China
Prior art keywords
deep
webpage
web page
client
receiving
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111143417.XA
Other languages
Chinese (zh)
Inventor
王彪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Topsec Technology Co Ltd
Beijing Topsec Network Security Technology Co Ltd
Beijing Topsec Software Co Ltd
Original Assignee
Beijing Topsec Technology Co Ltd
Beijing Topsec Network Security Technology Co Ltd
Beijing Topsec Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Topsec Technology Co Ltd, Beijing Topsec Network Security Technology Co Ltd, Beijing Topsec Software Co Ltd
Priority to CN202111143417.XA
Publication of CN113868501A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/986 Document structures and storage, e.g. HTML extensions

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the application provide a deep webpage crawling method and device and a vulnerability scanning system, and relate to the technical field of data crawling. The method comprises: receiving a deep webpage website access request sent by a client; sending the website access request to the corresponding deep webpage website server; and receiving and storing the webpage data returned by the deep webpage website server, and sending the webpage data to the client. Because the client passes through the proxy server whenever it accesses a deep webpage, the proxy server can store every deep webpage the client visits, which addresses the problems that deep webpages are difficult to crawl and are crawled incompletely.

Description

Deep webpage crawling method and device and vulnerability scanning system
Technical Field
The application relates to the technical field of data crawling, in particular to a deep webpage crawling method and device and a vulnerability scanning system.
Background
Most web pages are dynamic. Most of the useful data in such a page is fetched from the server through Ajax/Fetch or similar mechanisms when a user-triggered event fires, and is then inserted into the page's DOM tree by JavaScript; the static HTML alone contains little useful data. The main current approach to crawling dynamic pages is to request the Ajax/Fetch URLs directly. This requires understanding the original page's code logic, which is laborious and demands expertise, and it makes the crawler code very large. Moreover, some JavaScript user-triggered events are difficult or even impossible for a crawler program to imitate.
Disclosure of Invention
An object of the embodiments of the present application is to provide a deep webpage crawling method and device and a vulnerability scanning system in which a client passes through a proxy server when accessing a deep webpage, so as to solve the problems that deep webpages are difficult to crawl and are crawled incompletely.
The embodiment of the application provides a deep webpage crawling method, which is applied to a proxy server and comprises the following steps:
receiving a deep webpage website access request sent by a client;
sending the website access request to a corresponding deep webpage website server;
and receiving and storing the webpage data returned by the deep webpage website server, and sending the webpage data to the client.
In the above implementation, the client accesses the deep webpage through the proxy server, so the proxy server can store every deep webpage the user accesses. The deep webpages the user accesses are thereby crawled, which solves the problems that deep webpages are difficult to crawl and are crawled incompletely.
Further, before the step of receiving a deep web site access request sent by a client, the method further includes:
and establishing a communication connection with the client.
In the above implementation, the proxy server first establishes a communication connection with the client. The proxy server can then receive the request message sent by the client, send the website access request to the corresponding deep webpage website server, receive and store the webpage data returned by that server, and return a response message to the client. In this way the proxy server can store all deep webpage data received by the user's client.
Further, the receiving and storing the webpage data returned by the deep webpage website server includes:
and saving HTML, CSS and JavaScript files of the deep webpage.
In the above implementation, the structure (HTML), presentation (CSS) and behavior (JavaScript) of the webpage data are stored. All data in the webpage can thus be stored, and the deep webpage can be audited on the basis of that data.
Further, before the webpage data returned by the deep webpage website server is stored, the method further includes:
calculating a hash value of the webpage data;
comparing the hash value of the webpage data with the hash value of the stored webpage;
and if the hash values are different, storing the webpage data and the corresponding hash values.
In the above implementation, hash values are compared to remove duplicates. If the hash values are the same, the two webpages are the same and the webpage is not stored again; if the hash values are different, the deep webpage is stored.
The embodiment of the present application further provides a deep webpage crawling device, and the device includes:
the receiving module is used for receiving a deep webpage website access request sent by a client;
the sending module is used for sending the website access request to a corresponding deep webpage website server;
and the storage module is used for receiving and storing the webpage data returned by the deep webpage website server and sending the webpage data to the client.
In the above implementation, the client accesses the deep webpage through the proxy server, so the proxy server can store every deep webpage the user accesses. The deep webpages the user accesses are thereby crawled, which solves the problems that deep webpages are difficult to crawl and are crawled incompletely.
Further, the apparatus further comprises:
and the connection establishing module is used for establishing communication connection with the client.
In the above implementation, the proxy server first establishes communication connections with the client and with the deep webpage website server, respectively. The proxy server can then receive the request message sent by the client, send the website access request to the corresponding deep webpage website server, receive and store the webpage data returned by that server, and return a response message to the client. In this way the proxy server can store all deep webpage data received by the user's client.
Further, the apparatus further comprises a deduplication module to:
calculating a hash value of the webpage data;
comparing the hash value of the webpage data with the hash value of the stored webpage;
and if the hash values are different, storing the webpage data and the corresponding hash values.
In the above implementation, hash values are compared to remove duplicates. If the hash values are the same, the two webpages are the same and the webpage is not stored again; if the hash values are different, the deep webpage is stored.
An embodiment of the present application further provides a vulnerability scanning system, including any one of the above deep web page crawling apparatuses, further including:
and the auditing module is used for adding the stored deep webpage into an auditing queue so as to audit the deep webpage.
In the above implementation, vulnerabilities can be identified by auditing the stored pages.
An embodiment of the present application further provides an electronic device, where the electronic device includes a memory and a processor, where the memory is used to store a computer program, and the processor runs the computer program to enable the electronic device to execute any one of the deep webpage crawling methods described above.
An embodiment of the present application further provides a readable storage medium, where computer program instructions are stored, and when the computer program instructions are read and executed by a processor, any one of the deep webpage crawling methods described above is executed.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and that those skilled in the art can also obtain other related drawings based on the drawings without inventive efforts.
Fig. 1 is a flowchart of a deep webpage crawling method according to an embodiment of the present application;
Fig. 2 is a schematic diagram of connection establishment provided in an embodiment of the present application;
Fig. 3 is a flowchart of a deep webpage deduplication process provided in an embodiment of the present application;
Fig. 4 is a structural block diagram of a deep webpage crawling apparatus according to an embodiment of the present application;
Fig. 5 is a structural block diagram of another deep webpage crawling apparatus according to an embodiment of the present application;
Fig. 6 is a schematic structural diagram of a vulnerability scanning system provided in an embodiment of the present application.
Reference numerals:
100-a receiving module; 110-a connection establishment module; 200-a sending module; 300-a storage module; 310-deduplication module.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance.
Example 1
Referring to Fig. 1, Fig. 1 is a flowchart of a deep webpage crawling method according to an embodiment of the present application. The method can be applied to a proxy server: while the user accesses a Web site, the proxy server automatically adds and stores the pages the user accesses, thereby crawling deep pages. The method specifically comprises the following steps:
step S100: receiving a deep webpage website access request sent by a client;
step S200: sending the website access request to a corresponding deep webpage website server;
step S300: and receiving and storing the webpage data returned by the deep webpage website server, and sending the webpage data to the client.
After the user configures a proxy server for Internet access in a client such as a browser, requests the browser makes to any website are no longer sent directly to the target server (the deep webpage website server) but are sent to the proxy server first. On receiving the user's request, the proxy server forwards it to the target server, receives the data the target server returns, stores that data on the proxy server, and then sends the data the user requested back to the user. In this process the proxy server stores the deep webpages the user accesses and thereby crawls them. This covers pages that can only be reached after some user interaction and pages whose data is fetched dynamically from the server by user-triggered events. The crawling of the Web site is controlled by the user, so more pages can be crawled and scanned.
To a Web client such as the user's browser, the proxy server plays the role of the destination Web site's server: it receives request messages and returns response messages. To the Web server, the proxy server plays the role of a client: it sends Web request messages and receives Web response messages. The proxy server receives the request message sent by the client, sends the website access request to the corresponding deep webpage website server, receives and stores the webpage data returned by that server, and returns a response message to the client, so the proxy server can store all deep webpage data received by the user's client. Therefore, when the user accesses the deep pages of the website to be crawled through the browser, all data exchanges between the browser and the deep pages, including those caused by user-triggered events, pass through the proxy server, and every deep page the user accesses is stored by the proxy server.
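The relay described above (steps S100 to S300) can be sketched as follows. This is a minimal illustration and not the implementation of the application: the class name `CapturingProxy`, the callable `fetch_from_origin`, and the `example.test` URL are all hypothetical, and the origin server is replaced by an in-memory dictionary so that the sketch runs without network access.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class CapturingProxy:
    """Relays a request and keeps a copy of the response (hypothetical sketch)."""

    # Stand-in for the real network call to the deep webpage website server.
    fetch_from_origin: Callable[[str], bytes]
    # Every (url, body) pair that has passed through the proxy.
    saved: List[Tuple[str, bytes]] = field(default_factory=list)

    def handle(self, url: str) -> bytes:
        # S100: the client's access request arrives as `url`.
        # S200: forward the request to the target server.
        body = self.fetch_from_origin(url)
        # S300: store the returned data, then hand it back to the client.
        self.saved.append((url, body))
        return body


# A fake origin server (assumption), so no real site is contacted.
FAKE_SITE: Dict[str, bytes] = {
    "http://example.test/orders?page=2": b"<html>orders page 2</html>",
}

proxy = CapturingProxy(fetch_from_origin=lambda url: FAKE_SITE[url])
response = proxy.handle("http://example.test/orders?page=2")
```

The client receives exactly what the origin returned, while `proxy.saved` now holds a copy of the page the user visited; this is the sense in which user browsing doubles as crawling.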
Before a deep webpage website access request sent by the client is received, connections need to be established; Fig. 2 is a schematic diagram of connection establishment. The browser first establishes an HTTP connection with the proxy server, and the proxy server establishes an HTTP connection with the deep webpage website server, so that a request sent by the client can reach the deep webpage website server through the proxy server. That is, after the client is connected to the proxy server, the proxy server establishes its communication connection with the deep webpage website server when it sends the website access request to that server.
As for storing the webpage information, the HTML, CSS and JavaScript files are stored. A webpage mainly consists of three parts: structure (HTML), presentation (CSS), and behavior (JavaScript). The structure (HTML) describes the structure of the page; the presentation (CSS) controls the style of the elements in the page; the behavior (JavaScript) responds to user actions. The HTML file contains hyperlinks (URLs) through which one page can jump to another. Storing these three kinds of files therefore stores all the webpage information of the deep webpage.
Fig. 3 is a flowchart of the deep webpage deduplication process. Before the webpage data returned by the deep webpage website server is stored, the method further includes:
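On the client side, routing traffic through the proxy is a matter of configuration. The following sketch uses Python's standard `urllib.request` in place of a browser; the proxy address `http://127.0.0.1:8080` is an assumption chosen for illustration, and no request is actually sent.

```python
import urllib.request

# Hypothetical address at which the capturing proxy listens.
PROXY_ADDRESS = "http://127.0.0.1:8080"

# Route plain-HTTP requests made through this opener via the proxy.
proxy_handler = urllib.request.ProxyHandler({"http": PROXY_ADDRESS})
opener = urllib.request.build_opener(proxy_handler)

# A later opener.open("http://...") call would connect to the proxy first,
# and the proxy would open its own connection to the target server.
```

A browser achieves the same effect through its proxy settings; the point is only that the client connects to the proxy, and the proxy connects to the website server.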
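One possible way to decide which responses belong to the three parts is to look at the `Content-Type` header. The sketch below is an assumption, since the application does not say how the files are recognized; the names `resource_kind` and `save_resource` and the MIME table are hypothetical.

```python
from typing import Dict, Optional

# MIME types treated as structure, presentation and behavior (assumption).
KIND_BY_MIME: Dict[str, str] = {
    "text/html": "html",
    "text/css": "css",
    "text/javascript": "js",
    "application/javascript": "js",
}


def resource_kind(content_type: str) -> Optional[str]:
    """Map a Content-Type header value to 'html', 'css', 'js' or None."""
    mime = content_type.split(";", 1)[0].strip().lower()
    return KIND_BY_MIME.get(mime)


def save_resource(store: Dict[str, Dict[str, bytes]], url: str,
                  content_type: str, body: bytes) -> bool:
    """Save the body under its kind; return False for other content types."""
    kind = resource_kind(content_type)
    if kind is None:
        return False
    store.setdefault(kind, {})[url] = body
    return True


store: Dict[str, Dict[str, bytes]] = {}
save_resource(store, "http://example.test/", "text/html; charset=utf-8", b"<html></html>")
save_resource(store, "http://example.test/app.js", "application/javascript", b"init();")
save_resource(store, "http://example.test/logo.png", "image/png", b"\x89PNG")
```

Skipping other content types is a choice of this sketch only; an implementation could equally keep further response types if the audit needs them.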
step S311: calculating a hash value of the webpage data;
step S312: comparing the hash value of the webpage data with the hash value of the stored webpage;
step S313: and if the hash values are different, storing the webpage data and the corresponding hash values.
This is a deduplication process. It prevents the proxy server from storing the same deep webpage multiple times and thereby occupying more storage space. The proxy server stores the deep webpages manually visited by the user only after deduplication. During deduplication, a hash value is used to decide whether two webpages are the same: the proxy server computes the hash of the webpage and compares it with the previously stored hash values. If an identical hash value exists, the webpage is not stored again; otherwise, the computed hash value and the corresponding deep webpage are stored.
Because the user's browser passes through the proxy server when visiting a deep webpage, once the user issues a crawling and scanning task, the proxy server automatically adds and stores the pages the user visits while accessing the Web site. Deep webpages can thus be crawled, and all deep webpages to be crawled can be crawled more conveniently, safely and thoroughly, which solves the problems that deep webpages are difficult to crawl and are crawled incompletely.
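Steps S311 to S313 can be sketched as follows. The application does not name a hash function, so SHA-256 from Python's `hashlib` is an assumption here, and the class name `DedupStore` is hypothetical.

```python
import hashlib
from typing import Dict


class DedupStore:
    """Stores each distinct page once, keyed by its hash (hypothetical sketch)."""

    def __init__(self) -> None:
        # Maps hash value -> stored webpage data.
        self.pages: Dict[str, bytes] = {}

    def save_if_new(self, page: bytes) -> bool:
        # S311: calculate the hash value of the webpage data.
        digest = hashlib.sha256(page).hexdigest()
        # S312: compare it with the hash values of the stored webpages.
        if digest in self.pages:
            return False  # same hash: do not store the page again
        # S313: hash values differ, so store the page and its hash value.
        self.pages[digest] = page
        return True


store = DedupStore()
first = store.save_if_new(b"<html>cart</html>")
second = store.save_if_new(b"<html>cart</html>")    # identical page, skipped
third = store.save_if_new(b"<html>profile</html>")  # different page, stored
```

Note that an exact hash only detects byte-identical pages; two responses that differ in a timestamp or token would both be stored under this scheme.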
Example 2
An embodiment of the present application provides a deep webpage crawling apparatus. Fig. 4 is a structural block diagram of the deep webpage crawling apparatus. The apparatus includes:
the receiving module 100 is configured to receive a deep web site access request sent by a client;
a sending module 200, configured to send the website access request to a corresponding deep web website server;
the storage module 300 is configured to receive and store the webpage data returned by the deep webpage website server, and send the webpage data to the client.
Fig. 5 is a structural block diagram of another deep webpage crawling apparatus, which further includes:
a connection establishing module 110, configured to establish communication connections with the client and the deep webpage website server, respectively.
The apparatus further comprises a deduplication module 310, the deduplication module 310 to:
calculating a hash value of the webpage data;
comparing the hash value of the webpage data with the hash value of the stored webpage;
and if the hash values are different, storing the webpage data and the corresponding hash values.
Example 3
An embodiment of the present application provides a vulnerability scanning system; Fig. 6 is a schematic structural diagram of the vulnerability scanning system. The system comprises a proxy module (proxy server) and an auditing module, where the auditing module is used for adding the stored deep webpages to an audit queue so that the deep webpages can be audited.
The proxy server that the user configures in the browser for Internet access is the vulnerability scanning system itself, so the user passes through the proxy module of the vulnerability scanning system when accessing a website through the browser. Therefore, after the user issues a crawling and scanning task and accesses the deep pages of the website to be crawled through the browser, all deep pages the user accesses can be stored by the proxy module, because the access process passes through it. The proxy module of the vulnerability scanning system deduplicates and then stores the deep webpages manually visited by the user and adds the stored deep webpages to the audit queue; vulnerabilities are reported after the page audit is completed.
In summary, the crawling of deep webpages is controlled by the user and is more thorough: the user can access, and thereby crawl, every deep webpage that needs audit scanning. This solves the problem that dynamic pages of Web sites built with frameworks such as AngularJS, ReactJS and VueJS cannot be crawled, makes the crawling of Web sites more complete, and effectively reduces missed vulnerabilities.
An embodiment of the present application further provides an electronic device, where the electronic device includes a memory and a processor, the memory is used to store a computer program, and the processor runs the computer program to enable the electronic device to execute the deep webpage crawling method according to embodiment 1.
An embodiment of the present application further provides a readable storage medium, where computer program instructions are stored, and when the computer program instructions are read and executed by a processor, the deep webpage crawling method according to embodiment 1 is executed.
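The hand-off from the proxy module to the auditing module can be pictured as a simple first-in, first-out queue. The sketch below is illustrative only: the application does not describe the audit rules, so the single check in `audit_page` (a password field served over plain HTTP) is a hypothetical placeholder, as are the function name and the `example.test` URLs.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple

Page = Tuple[str, bytes]  # (url, stored webpage data)


def audit_page(url: str, body: bytes) -> List[str]:
    """Placeholder audit rule (assumption); real scanners apply many checks."""
    findings: List[str] = []
    if url.startswith("http://") and b'type="password"' in body:
        findings.append("password field served over plain HTTP")
    return findings


# Pages stored by the proxy module are appended to the audit queue.
audit_queue: Deque[Page] = deque()
audit_queue.append(("http://example.test/login", b'<input type="password">'))
audit_queue.append(("https://example.test/home", b"<html>home</html>"))

# The auditing module drains the queue and collects findings per page.
report: Dict[str, List[str]] = {}
while audit_queue:
    url, body = audit_queue.popleft()
    findings = audit_page(url, body)
    if findings:
        report[url] = findings
```

Decoupling storage from auditing through a queue lets the proxy keep answering the browser while pages wait to be audited.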
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application, or the part of it that contributes over the prior art, may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Claims (10)

1. A deep web page crawling method is applied to a proxy server, and comprises the following steps:
receiving a deep webpage website access request sent by a client;
sending the website access request to a corresponding deep webpage website server;
and receiving and storing the webpage data returned by the deep webpage website server, and sending the webpage data to the client.
2. The deep web page crawling method according to claim 1, wherein before the step of receiving the deep web page website access request sent by the client, the method further comprises:
and establishing a communication connection with the client.
3. The deep web page crawling method according to claim 1, wherein the receiving and storing the web page data returned by the deep web page website server comprises:
and saving HTML, CSS and JavaScript files of the deep webpage.
4. The deep web page crawling method according to claim 1, wherein before the step of receiving and saving the web page data returned by the deep web page website server, the method further comprises:
calculating a hash value of the webpage data;
comparing the hash value of the webpage data with the hash value of the stored webpage;
and if the hash values are different, storing the webpage data and the corresponding hash values.
5. A deep web page crawling apparatus, the apparatus comprising:
the receiving module is used for receiving a deep webpage website access request sent by a client;
the sending module is used for sending the website access request to a corresponding deep webpage website server;
and the storage module is used for receiving and storing the webpage data returned by the deep webpage website server and sending the webpage data to the client.
6. The deep web page crawling apparatus of claim 5, further comprising:
and the connection establishing module is used for establishing communication connection with the client.
7. The deep web page crawling apparatus of claim 5, further comprising a deduplication module for:
calculating a hash value of the webpage data;
comparing the hash value of the webpage data with the hash value of the stored webpage;
and if the hash values are different, storing the webpage data and the corresponding hash values.
8. A vulnerability scanning system, comprising the deep web page crawling apparatus of any of claims 5 to 7, further comprising:
and the auditing module is used for adding the stored deep webpage into an auditing queue so as to audit the deep webpage.
9. An electronic device, comprising a memory for storing a computer program and a processor for executing the computer program to cause the computer device to perform the deep web page crawling method according to any one of claims 1 to 4.
10. A readable storage medium having stored thereon computer program instructions which, when read and executed by a processor, perform the deep web crawling method according to any one of claims 1 to 4.
CN202111143417.XA 2021-09-28 2021-09-28 Deep webpage crawling method and device and vulnerability scanning system Pending CN113868501A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111143417.XA CN113868501A (en) 2021-09-28 2021-09-28 Deep webpage crawling method and device and vulnerability scanning system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111143417.XA CN113868501A (en) 2021-09-28 2021-09-28 Deep webpage crawling method and device and vulnerability scanning system

Publications (1)

Publication Number Publication Date
CN113868501A true CN113868501A (en) 2021-12-31

Family

ID=78991747

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111143417.XA Pending CN113868501A (en) 2021-09-28 2021-09-28 Deep webpage crawling method and device and vulnerability scanning system

Country Status (1)

Country Link
CN (1) CN113868501A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination