CN114528457A - Web fingerprint detection method and related equipment - Google Patents

Web fingerprint detection method and related equipment

Publication number
CN114528457A
Authority
CN
China
Prior art keywords
web
information
target site
fingerprint
server
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111681406.7A
Other languages
Chinese (zh)
Inventor
金正平
刘冰
张承宇
秦素娟
时忆杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
National Computer Network and Information Security Management Center
Original Assignee
Beijing University of Posts and Telecommunications
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications, National Computer Network and Information Security Management Center filed Critical Beijing University of Posts and Telecommunications
Priority to CN202111681406.7A priority Critical patent/CN114528457A/en
Publication of CN114528457A publication Critical patent/CN114528457A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Abstract

The application provides a Web fingerprint detection method and related equipment. The method comprises: crawling the source code of a plurality of web pages from a target site with a web crawler, and obtaining key information of static file paths from that source code; sending, by the web crawler, a predefined HTTP request to the host server of the target site to obtain the header information of the host server's response message; identifying the content management system (CMS) type by matching the key information against a Web fingerprint library; predicting the Web server type from the header information with a machine learning model; and scanning the open ports of the host server and the services corresponding to them with a network port scanning tool to detect the host port fingerprint. The method detects the Web component information of the target site comprehensively, accurately, and efficiently.

Description

Web fingerprint detection method and related equipment
Technical Field
The present application relates to Web security technologies, and in particular, to a Web fingerprint detection method and related device.
Background
With the rapid development of Internet technology, the number of Web sites is growing quickly and their application scenarios are diversifying. Web applications are now widely used in areas closely tied to daily life, such as social networks, banking, online shopping, Web mail, and blogs. While Web applications bring great convenience, the security problems of Web sites constantly threaten people's personal and property safety. Many open-source components are used to build Web sites, but these components may themselves contain vulnerabilities and flaws that attackers can easily exploit.
At present, most Web attacks exploit known vulnerabilities in Web components to obtain high-level privileges on the server and access to important data, so the security of a Web site depends critically on whether the components it uses contain vulnerabilities. Web fingerprint identification is therefore an important step in the information-gathering phase that precedes security testing of a target site: accurately identifying the site's Web component information improves testing efficiency and helps greatly in formulating a penetration testing strategy. However, existing Web component detection systems suffer from low accuracy, poor extensibility, incomplete detection, and low detection efficiency.
Disclosure of Invention
In view of the above, an object of the present application is to provide a Web fingerprint detection method and a related device.
In view of the above, a first aspect of the present application provides a Web fingerprint detection method, including:
crawling the web page source code of a plurality of web pages from a target site with a web crawler, and obtaining key information of static file paths based on the web page source code;
sending, by the web crawler, a predefined HTTP request to a host server of the target site to obtain header information of a response message of the host server;
identifying the content management system (CMS) type of the target site by matching the key information against a Web fingerprint library;
predicting the Web server type of the target site from the header information with a trained machine learning model;
and scanning the open ports of the host server and the services corresponding to the open ports with a network port scanning tool to detect the host port fingerprint of the target site.
A second aspect of the present application provides a Web fingerprint detection apparatus, including:
a crawling module configured to: crawl the web page source code of a plurality of web pages from a target site with a web crawler, and obtain key information of static file paths based on the web page source code; and send, by the web crawler, a predefined HTTP request to a host server of the target site to obtain header information of a response message of the host server;
a Web fingerprint detection module configured to: identify the CMS type of the target site by matching the key information against a Web fingerprint library; predict the Web server type of the target site from the header information with a trained machine learning model; and scan the open ports of the host server and the services corresponding to the open ports with a network port scanning tool to detect the host port fingerprint of the target site.
A third aspect of the present application provides an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable by the processor, wherein the processor implements the method provided by the first aspect of the present application when executing the computer program.
As can be seen from the above, the Web fingerprint detection method and related device provided by the application use a web crawler to crawl the web page source code of multiple web pages from the target site and obtain key information of static file paths from that source code; crawling the source code of multiple pages gives higher identification accuracy than crawling a single page. The web crawler sends a predefined HTTP request to the host server of the target site to obtain the header information of the host server's response message; because identification uses this header information as a whole, the Web server can still be identified accurately when the Server field of the response has been modified or deleted. The CMS type of the target site is identified by matching the key information against the Web fingerprint library; the Web server type of the target site is predicted from the header information with a trained machine learning model; and the open ports of the host server and the services corresponding to them are scanned with a network port scanning tool to detect the host port fingerprint of the target site. By integrating Web server identification, CMS identification, and host information detection, the application addresses the low accuracy, poor extensibility, incomplete detection, and low detection efficiency of the related art, and detects the component information of the target site comprehensively, accurately, and efficiently. Based on the detection results, attacks targeting vulnerabilities in Web components and host services can be effectively prevented and the security of the Web site maintained.
Drawings
To illustrate the technical solutions of the present application or the related art more clearly, the drawings needed to describe the embodiments or the related art are briefly introduced below. The drawings described below are only embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of a Web fingerprint detection method according to an embodiment of the present application;
FIG. 2 is a flowchart of step 400 of the Web fingerprint detection method according to an embodiment of the present application;
FIG. 3 is a flowchart of Web server type detection according to an embodiment of the present application;
FIG. 4 is a flowchart of step 500 of the Web fingerprint detection method according to an embodiment of the present application;
FIG. 5 is a flowchart of host port fingerprint detection according to an embodiment of the present application;
FIG. 6 is a flowchart of information crawling by the crawling module according to an embodiment of the present application;
FIG. 7 is a schematic diagram of web page relationships according to an embodiment of the present application;
FIG. 8 is a flowchart of step 100 of the Web fingerprint detection method according to an embodiment of the present application;
FIG. 9 is a flowchart of CMS type detection according to an embodiment of the present application;
FIG. 10 is a flowchart of step 200 of the Web fingerprint detection method according to an embodiment of the present application;
FIG. 11 is a block diagram of a Web fingerprint detection apparatus according to an embodiment of the present application;
FIG. 12 is a flowchart of the execution process of the Web fingerprint detection module according to an embodiment of the present application;
FIG. 13 is a flowchart of a detection process according to an embodiment of the present application;
FIG. 14 is a schematic diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is further described in detail below with reference to the accompanying drawings in combination with specific embodiments.
It should be noted that technical terms or scientific terms used in the embodiments of the present application should have a general meaning as understood by those having ordinary skill in the art to which the present application belongs, unless otherwise defined. The word "comprising" or "comprises", and the like, means that the element or item listed before the word covers the element or item listed after the word and its equivalents, but does not exclude other elements or items. The terms "connected" or "coupled" and the like are not restricted to physical or mechanical connections, but may include electrical connections, whether direct or indirect.
In some embodiments, as shown in fig. 1, a Web fingerprint detection method includes:
Step 100: crawl the web page source code of a plurality of web pages from a target site with a web crawler, and obtain key information of static file paths based on the web page source code.
In this step, crawling the source code of multiple web pages yields higher identification accuracy than crawling a single page.
Step 200: the web crawler sends a predefined HTTP request to the host server of the target site and obtains the header information of the host server's response message.
In this step, the header information of the response message includes the relative position information of important fields and the content information of important fields. After feature normalization of the header information, the Web server type can still be identified accurately even when the Server field has been modified or deleted.
Step 300: identify the content management system (CMS) type of the target site by matching the key information against the Web fingerprint library.
In this step, CMS identification relies mainly on key information in the web page source code and on the path information of static resource files. The related art often identifies the CMS with detection tools such as Wappalyzer and WhatWeb, which fetch the source code of a single page, analyze the comments and keyword information in it, and match them against a Web fingerprint library using regular expressions; if the developer hides or modifies the keyword information, this method cannot identify the CMS. The related art may also collect the static file paths found in several tags of a single response page and match them against the Web fingerprint library to identify the CMS among the Web application components, but this places high demands on the completeness of the fingerprint library and generalizes poorly. Existing CMS identification thus relies mainly on either keywords in a single page's source code or full static resource file paths; both approaches have low identification accuracy and require a highly complete Web fingerprint library. In the method of the present application, the crawler obtains the source code of multiple web pages of the target site, key static paths are extracted from the static file paths in that source code, and these are finally matched against the Web fingerprint library, so that the CMS can be identified accurately.
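The matching in step 300 can be sketched as follows. This is a minimal illustration under stated assumptions rather than the applicants' implementation: the library contents, the name `FINGERPRINT_LIBRARY`, and the vote-counting rule are all hypothetical.

```python
from collections import Counter

# Hypothetical fingerprint library: CMS name -> characteristic static-path fragments.
FINGERPRINT_LIBRARY = {
    "WordPress": ["wp-content", "wp-includes"],
    "Drupal": ["sites/default/files", "core/misc"],
    "Joomla": ["media/jui", "components/com_"],
}


def identify_cms(key_info, library=FINGERPRINT_LIBRARY):
    """Return the CMS whose fragments match the most key-information entries, or None."""
    votes = Counter()
    for path in key_info:
        lowered = path.lower()
        for cms, fragments in library.items():
            if any(fragment in lowered for fragment in fragments):
                votes[cms] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Key information as it might be collected from several pages of one site.
key_info = [
    "/wp-content/themes/demo/style.css",
    "/wp-includes/js/jquery/jquery.js",
    "/static/app.js",
]
print(identify_cms(key_info))  # prints "WordPress"
```

Because votes accumulate over paths gathered from many pages, a single hidden or renamed path does not by itself prevent a match, which is the benefit the text attributes to multi-page crawling.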
Step 400: predict the Web server type of the target site from the header information with the trained machine learning model.
In this step, identification of the Web server component relies mainly on the server's response information, including the error page, the response code, and the content of the message header. One approach in the related art constructs an abnormal HTTP request for a resource that does not exist on the server and judges the Web server type from the response code, the content of the response header, and the body of the error page. A second approach constructs a variety of malformed HTTP requests and judges the Web server type with a naive Bayes classification model that uses the response status codes as features; it must send multiple HTTP requests to the server, which makes identification inefficient. A third approach analyzes the order of header fields across a large number of response messages and summarizes fixed rules for identifying the Web server type; it has a high false positive rate, identifies only the three mainstream Web servers Apache, IIS, and Nginx, is complex to implement, and extends poorly. The method of the present application extracts the relative position information and content information of important fields from the response header, normalizes these features, and trains a random forest algorithm on them to obtain a prediction model. The Web server type can therefore be identified accurately even when the Server field has been modified or deleted, with higher accuracy and better extensibility than traditional identification based on fixed rules.
Step 500: scan the open ports of the host server and the services corresponding to the open ports with the network port scanning tool, and detect the host port fingerprint of the target site.
In this step, optionally, host service scanning relies mainly on the Nmap and ZMap scanning tools. Both obtain basic information about the target host by probing its IP address; ZMap has a clear advantage in scanning speed, whereas Nmap has the advantage in scanning accuracy and comprehensiveness.
By integrating Web server identification, CMS identification, and host information detection, the method addresses the low accuracy, poor extensibility, incomplete detection, and low detection efficiency of the related art and detects the component information of the target site comprehensively, accurately, and efficiently. Based on the detection results, attacks targeting vulnerabilities in Web components and host services can be effectively prevented and the security of the Web site maintained.
In some embodiments, as shown in FIG. 2 and FIG. 3, predicting the Web server type of the target site from the header information with the trained machine learning model includes:
Step 410: preprocess the header information.
In this step, the header information of the response message undergoes feature normalization according to the response rules, and the text vectors in the header information are converted into numeric vectors, providing the input that the random-forest-based machine learning model needs to predict the Web server type.
Step 420: based on the preprocessed header information, predict the Web server type with the machine learning model using a random forest algorithm.
In this step, the trained machine learning model applies the random forest algorithm to the numeric vectors obtained from preprocessing to produce a classification result, which gives the predicted Web server type. The random forest algorithm has the following advantages:
1. It can produce highly accurate classification results for many kinds of input variables, which helps improve the accuracy of Web server type identification.
2. It can handle a large number of input variables, so it places few restrictions on the number of input variables and extends well.
3. It can evaluate the importance of each variable when deciding the class, which can be used to improve the accuracy of the prediction.
4. While the forest is being built, it can produce an internal unbiased estimate of the generalization error, which helps improve the accuracy of Web server type identification.
5. It can estimate missing data and maintain accuracy even when a large portion of the data is missing. This effectively handles incomplete input variables caused by problems in other steps and improves the accuracy of Web server type identification.
6. The learning process is fast, which helps improve the efficiency of Web server type identification.
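The preprocessing of step 410 can be sketched as below. The source does not give the exact encoding, so everything here is an assumption made for illustration: positions are scaled into [0, 1], missing fields are encoded as -1, and the `X-Powered-By` value is mapped through a small hypothetical vocabulary (`POWERED_BY_CODES`).

```python
# Fields whose relative position is used as a feature (taken from the description).
FIELD_ORDER = ["Date", "Server", "Content-Type", "Content-Length", "Connection", "Expires"]

# Hypothetical vocabulary for a categorical header value; 0 means unknown or missing.
POWERED_BY_CODES = {"php": 1, "asp.net": 2, "express": 3}


def to_feature_vector(positions, contents):
    """Turn header positions and contents into a fixed-length numeric vector."""
    last_index = len(FIELD_ORDER) - 1
    vector = []
    for field in FIELD_ORDER:
        pos = positions.get(field, -1)
        # Scale the position into [0, 1]; keep -1.0 as the "field absent" marker.
        vector.append(-1.0 if pos < 0 else pos / last_index)
    length = contents.get("Content-Length")
    vector.append(float(length) if length and length.isdigit() else -1.0)
    powered = (contents.get("X-Powered-By") or "").lower()
    code = next((c for key, c in POWERED_BY_CODES.items() if powered.startswith(key)), 0)
    vector.append(float(code))
    return vector


positions = {"Date": 1, "Server": 0, "Content-Type": 2,
             "Content-Length": 3, "Connection": 4, "Expires": -1}
contents = {"Content-Length": "153", "X-Powered-By": "PHP/7.4.3"}
print(to_feature_vector(positions, contents))
```

Vectors of this shape could then be passed to any random forest implementation, for example scikit-learn's `RandomForestClassifier`; the source does not name a library.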
In some embodiments, as shown in FIG. 4 and FIG. 5, scanning the open ports of the host server and the services corresponding to the open ports with a network port scanning tool and detecting the host port fingerprint of the target site includes:
Step 510: scan the open ports and the services corresponding to the open ports with the scanning tool Nmap, and generate a probe report.
In this step, the Nmap parameters are preset and Nmap is integrated into the detection system through inter-process calls. When the open ports and their corresponding services need to be scanned, an execution command is generated that invokes Nmap to perform the scan and produce the probe report.
Optionally, to meet this embodiment's requirement for automatic scanning of the host server's open ports, running services, and operating system, the Nmap command is set to: nmap -sS -sV -O -T5 ip, and Nmap is integrated into the detection system through inter-process calls.
Step 520: parse the probe report to obtain the host port fingerprint.
In this step, the probe report is parsed, and the key information in it is filtered and formatted to obtain the host port fingerprint.
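Steps 510 and 520 can be sketched as a command builder plus a report parser. The function names and the `SAMPLE_REPORT` text are hypothetical; the sample is hand-written to resemble the port table in Nmap's normal output and is not the result of a real scan.

```python
import re


def build_nmap_command(ip):
    """Build the argument list for the scan described in the text."""
    # -sS: TCP SYN scan, -sV: service/version detection,
    # -O: OS detection, -T5: most aggressive timing template.
    return ["nmap", "-sS", "-sV", "-O", "-T5", ip]


# Matches port-table rows such as "22/tcp open ssh OpenSSH 7.4".
PORT_LINE = re.compile(r"^(\d+)/(tcp|udp)\s+(\S+)\s+(\S+)\s*(.*)$")


def parse_nmap_report(report):
    """Keep only open ports and format them as a list of dictionaries."""
    fingerprint = []
    for line in report.splitlines():
        match = PORT_LINE.match(line.strip())
        if match and match.group(3) == "open":
            port, proto, _state, service, version = match.groups()
            fingerprint.append({
                "port": int(port),
                "protocol": proto,
                "service": service,
                "version": version.strip(),
            })
    return fingerprint


SAMPLE_REPORT = """\
PORT     STATE  SERVICE VERSION
22/tcp   open   ssh     OpenSSH 7.4 (protocol 2.0)
80/tcp   open   http    nginx 1.18.0
443/tcp  closed https
"""

for entry in parse_nmap_report(SAMPLE_REPORT):
    print(entry)
```

In a running system the command list could be passed to `subprocess.run(..., capture_output=True, text=True)` and the captured output handed to the parser. SYN scanning and OS detection normally require elevated privileges.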
In some embodiments, as shown in FIG. 6, crawling the web page source code of a plurality of web pages from a target site with a web crawler includes: crawling the web page source code from the target site with the web crawler using a breadth-first strategy.
The number of user threads, the crawl depth, and the execution time are set from the information of the task to be detected in the crawler task queue. A connection pool is generated from the proxy pool and the User-Agent pool, and a connection, which may be a socket connection, is selected from it. The web page source code is then crawled from the target site with a breadth-first strategy. Breadth-first search, also called width-first search, means that during crawling the next level is searched only after the current level has been completed. The algorithm is relatively simple to design and implement, covers as many web pages as possible, and thus helps obtain the source code of many pages. In crawling, it is generally assumed that pages within a certain link distance of the original URL are likely to be topically relevant. During crawling, links found in a newly downloaded page are appended directly to the end of the queue of URLs to be crawled; that is, the web crawler first crawls all pages linked from the initial page, then selects one of those linked pages and crawls all pages linked from it. Referring to FIG. 7 and taking the web page relationships shown there as an example, the breadth-first crawling order is A-B-C-D-E-F-G-H-I. Optionally, breadth-first search may be combined with web page filtering: pages are first crawled with the breadth-first strategy, and irrelevant pages are then filtered out.
Optionally, the breadth-first traversal strategy may be used to obtain the pages whose response code is 2xx.
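The breadth-first order can be illustrated on a small in-memory link graph. The graph `LINKS` below is hypothetical (the actual relationships of FIG. 7 are not reproduced in the text); it is chosen so that the traversal yields the order A through I. A real crawler would replace the dictionary lookup with a page download and link extraction.

```python
from collections import deque

# Hypothetical link graph standing in for a site: page -> pages it links to.
LINKS = {
    "A": ["B", "C", "D"],
    "B": ["E", "F"],
    "C": ["G"],
    "D": ["H", "I"],
    "E": ["A"],  # back-link, to show that visited pages are not crawled twice
}


def breadth_first_order(start, links, max_depth=None):
    """Return pages in breadth-first order, optionally limited to max_depth levels."""
    order = []
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        order.append(page)
        if max_depth is not None and depth >= max_depth:
            continue  # do not expand links beyond the configured crawl depth
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))  # newly found links go to the queue tail
    return order


print("-".join(breadth_first_order("A", LINKS)))  # prints "A-B-C-D-E-F-G-H-I"
```

The `max_depth` argument corresponds to the crawl depth parameter mentioned above; with `max_depth=1` only the start page and the pages it links to are visited.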
In some embodiments, as shown in FIG. 8 and FIG. 9, obtaining the key information of the static file paths based on the web page source code includes:
Step 110: parse the predetermined tags out of the web page source code.
In this step, optionally, the predetermined tags may be HTML tags.
Step 120: extract static file path information from the predetermined tags with a regular expression.
In this step, the static file path information is extracted from the predetermined tags by element screening with a regular expression.
Step 130: store the static file path information as a target text, and denoise the target text.
In this step, the screened static file path information is stored as a target text, and irrelevant character strings are removed from the target text by a filter to denoise it.
Step 140: perform text slicing on the denoised target text to extract the key information.
In this step, the denoised target text is divided into a plurality of slices by text slicing, and the key information of the static file paths is found in and extracted from those slices, providing the basis for detection in step 300.
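Steps 110 to 140 can be sketched as a small pipeline. The choice of tags (`script`, `link`, `img`), the noise rules (query strings, host names, numeric or hash-like segments), and slicing on `/` are all assumptions for illustration; the source does not specify them.

```python
import re

# Assumed "predetermined tags": script, link, img; the path sits in src or href.
TAG_PATH = re.compile(
    r"<(?:script|link|img)\b[^>]*?\b(?:src|href)\s*=\s*[\"']([^\"']+)[\"']",
    re.IGNORECASE,
)
# Assumed noise: purely numeric or version-like segments and long hex hashes.
NOISE = re.compile(r"^[0-9a-f]{8,}$|^\d+(\.\d+)*$")


def extract_static_paths(html):
    """Steps 110 and 120: pull src/href values out of the chosen tags."""
    return TAG_PATH.findall(html)


def denoise(path):
    """Step 130: strip scheme and host, then the query string and fragment."""
    path = re.sub(r"^[a-z]+://[^/]+", "", path, flags=re.IGNORECASE)
    return re.split(r"[?#]", path, maxsplit=1)[0]


def slice_path(path):
    """Step 140: split into directory slices, dropping the file name and noise."""
    slices = [s for s in path.split("/") if s]
    return [s for s in slices[:-1] if not NOISE.match(s)]


def key_information(html):
    """Return the distinct directory slices in first-seen order."""
    keys = []
    for raw in extract_static_paths(html):
        for piece in slice_path(denoise(raw)):
            if piece not in keys:
                keys.append(piece)
    return keys


SAMPLE_HTML = """
<link rel="stylesheet" href="https://example.com/wp-content/themes/demo/style.css?ver=5.8">
<script src="/wp-includes/js/jquery/jquery.min.js"></script>
<img src="/wp-content/uploads/2021/12/logo.png">
"""
print(key_information(SAMPLE_HTML))
```

For the sample page the slices include `wp-content` and `wp-includes`, which is the kind of key information that a fingerprint library entry could match.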
In some embodiments, as shown in FIG. 10 and FIG. 6, sending a predefined HTTP request from the web crawler to the host server of the target site and obtaining the header information of the host server's response message includes:
Step 210: the web crawler sends an HTTP request to the host server.
In this step, a custom HTTP request is constructed from the parameter information of the task to be detected in the crawler task queue, and the web crawler then sends it to the host server.
Step 220: obtain the host server's response message to the HTTP request.
In this step, the host server receives the HTTP request and returns a response message.
Step 230: extract, as the header information, the relative position information of the first predetermined fields and the content information of the second predetermined fields from the header of the response message.
The HTTP request comprises "GET /404page.html HTTP/1.1\r\n\r\n".
The first predetermined fields include the "Date" field, the "Server" field, the "Content-Type" field, the "Content-Length" field, the "Connection" field, and the "Expires" field.
The second predetermined fields include the "Content-Length" field and the "X-Powered-By" field.
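Step 230 can be sketched as a parser over a raw response. The interpretation of "relative position" as the rank of each field among the first predetermined fields that are present is an assumption, as are the function name and the hand-written `SAMPLE_RESPONSE`. The second field list uses the standard header spelling `X-Powered-By`.

```python
FIRST_FIELDS = ["Date", "Server", "Content-Type", "Content-Length", "Connection", "Expires"]
SECOND_FIELDS = ["Content-Length", "X-Powered-By"]


def extract_header_info(raw_response):
    """Return (relative positions of FIRST_FIELDS, contents of SECOND_FIELDS)."""
    head = raw_response.split(b"\r\n\r\n", 1)[0].decode("iso-8859-1")
    names, values = [], {}
    for line in head.split("\r\n")[1:]:  # skip the status line
        name, sep, value = line.partition(":")
        if sep:
            key = name.strip().lower()
            names.append(key)
            values.setdefault(key, value.strip())
    wanted = {field.lower() for field in FIRST_FIELDS}
    present = [name for name in names if name in wanted]
    # Rank of each field among the predetermined fields present; -1 if absent.
    positions = {
        field: present.index(field.lower()) if field.lower() in present else -1
        for field in FIRST_FIELDS
    }
    contents = {field: values.get(field.lower()) for field in SECOND_FIELDS}
    return positions, contents


# Hand-written example response, not captured from a real server.
SAMPLE_RESPONSE = (
    b"HTTP/1.1 404 Not Found\r\n"
    b"Server: nginx\r\n"
    b"Date: Mon, 01 Jan 2024 00:00:00 GMT\r\n"
    b"Content-Type: text/html\r\n"
    b"X-Powered-By: PHP/7.4.3\r\n"
    b"Content-Length: 153\r\n"
    b"Connection: keep-alive\r\n"
    b"\r\n"
    b"<html></html>"
)

positions, contents = extract_header_info(SAMPLE_RESPONSE)
print(positions)
print(contents)
```

Because the positions describe field order and do not depend on the value of `Server`, they remain informative when that value has been altered, which matches the motivation given in the text.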
In some embodiments, the Web fingerprint detection method provided in the embodiments of the present application further includes: writing the CMS type, the Web server type, and the host port fingerprint into a Redis (Remote Dictionary Server) queue as the Web fingerprint detection result of the target site.
The List data structure of Redis is used as a message queue to transfer data and reduce coupling within the system. Because Redis is an in-memory database, data is read and written in memory, which improves the efficiency of data transfer between modules.
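The use of a Redis List as a message queue can be sketched with an in-memory stand-in, so that the example runs without a Redis server. `InMemoryListStore`, the key name `fingerprint:results`, and the JSON payload layout are all hypothetical; the stand-in only mimics the LPUSH and RPOP commands.

```python
import json
from collections import defaultdict, deque


class InMemoryListStore:
    """Stand-in that mimics the LPUSH/RPOP subset of a Redis List (not a Redis client)."""

    def __init__(self):
        self._lists = defaultdict(deque)

    def lpush(self, name, *values):
        for value in values:
            self._lists[name].appendleft(value)  # push onto the head of the list
        return len(self._lists[name])

    def rpop(self, name):
        items = self._lists[name]
        return items.pop() if items else None  # pop from the tail: FIFO overall


RESULT_QUEUE = "fingerprint:results"  # hypothetical key name


def publish_result(store, site, cms, web_server, ports):
    """Producer side: serialize one detection result and enqueue it."""
    payload = json.dumps(
        {"site": site, "cms": cms, "web_server": web_server, "ports": ports}
    )
    store.lpush(RESULT_QUEUE, payload)


def consume_result(store):
    """Consumer side: dequeue the oldest result, or return None if the queue is empty."""
    raw = store.rpop(RESULT_QUEUE)
    return json.loads(raw) if raw is not None else None


store = InMemoryListStore()
publish_result(store, "example.com", "WordPress", "nginx", [22, 80])
publish_result(store, "example.org", None, "Apache", [443])
print(consume_result(store)["site"])  # prints "example.com"
```

A `redis.Redis` client from the redis-py package exposes `lpush` and `rpop` methods with the same names, so it could be substituted for the stand-in; by default it returns bytes, which `json.loads` also accepts.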
It should be noted that the method of the embodiment of the present application may be executed by a single device, such as a computer or a server. The method of the embodiment can also be applied to a distributed scene and completed by the mutual cooperation of a plurality of devices. In such a distributed scenario, one of the multiple devices may only perform one or more steps of the method of the embodiment, and the multiple devices interact with each other to complete the method.
It should be noted that the above describes some embodiments of the present application. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments described above and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
Based on the same inventive concept, corresponding to the method of any embodiment, the application also provides a Web fingerprint detection device.
Referring to fig. 11, the Web fingerprint detection apparatus may include:
a crawling module 10 configured to: crawl the web page source code of a plurality of web pages from a target site with a web crawler, and obtain key information of static file paths based on the web page source code; and send, by the web crawler, a predefined HTTP request to the host server of the target site to obtain the header information of the host server's response message.
The web crawler proceeds as follows. To collect as much static file path information as possible, it uses a breadth-first strategy: it downloads the page source code for the seed URL, extracts the static file paths and URL links, filters out off-site links by regular-expression matching, adds the on-site URLs to the seed URL queue, and repeats this process until the number of static file paths reaches a preset threshold.
Furthermore, to improve crawling efficiency, three parameters are exposed: thread count, crawl depth, and maximum execution time. To cope with the target site's anti-crawling strategy, the access IP, access frequency, and browser type are adjusted according to the responses of the target host server, and pages that fail to be fetched are discarded to improve the overall efficiency of the system.
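The off-site link filtering in the crawl loop can be sketched as follows. The text describes regular-expression matching; this sketch swaps in the standard library's URL parsing, which achieves the same filtering for the illustration. The function name and the example URLs are hypothetical.

```python
from urllib.parse import urljoin, urlparse


def same_site_links(page_url, hrefs):
    """Resolve hrefs against page_url and keep only http(s) links on the same host."""
    site = urlparse(page_url).netloc.lower()
    kept = []
    for href in hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        parsed = urlparse(absolute)
        if parsed.scheme in ("http", "https") and parsed.netloc.lower() == site:
            clean = absolute.split("#", 1)[0]  # drop the fragment
            if clean not in kept:
                kept.append(clean)
    return kept


hrefs = [
    "/about.html",
    "item.html#top",
    "https://cdn.other.net/lib.js",
    "mailto:admin@example.com",
    "https://example.com/about.html",
]
for url in same_site_links("https://example.com/news/index.html", hrefs):
    print(url)
```

The URLs that survive the filter are the ones that would be appended to the seed URL queue for the next breadth-first level.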
A Web fingerprint detection module 20 configured to: identify the CMS type of the target site by matching the key information against the Web fingerprint library; predict the Web server type of the target site from the header information with a trained machine learning model; and scan the open ports of the host server and the services corresponding to the open ports with a network port scanning tool to detect the host port fingerprint of the target site.
A system storage module 30 configured to: store user data, crawl information, and scan result data.
The system storage module 30 uses two storage systems: the relational database MySQL and the in-memory database Redis. The List data structure of Redis is used as a message queue to transfer data between modules and reduce coupling between the modules of the system. Because Redis is an in-memory database, data is read and written in memory, which improves the efficiency of data transfer between modules. The data stored in Redis mainly comprises basic information of the main task, intermediate crawler data, creation information for detection subtasks, and detection report information. MySQL is used for persistent storage; the stored data includes user information, login information, Web fingerprint information, task information, and detection report information.
A task scheduling module 40 configured to: schedule execution of the crawling module 10, the Web fingerprint detection module 20, and the system storage module 30.
The task scheduling module 40 manages and schedules the modules in a unified way and decouples the system as a whole, realizing a design of high cohesion and low coupling and improving the reusability and extensibility of the program modules. The core function of the system is accomplished through the cooperation of the crawling module 10, the Web fingerprint detection module 20, and the system storage module 30, so the execution logic and data transfer among these modules must be scheduled and managed. To manage the execution logic among modules, the system uses the Quartz task scheduling framework, which schedules tasks, reduces the complexity of the business logic, and improves the fault tolerance of the system. In addition, Redis is used mainly as a message queue to decouple the system in a producer-consumer pattern: modules do not call each other directly but communicate through the message queue. The intermediate data passed between modules mainly comprises basic information of detection tasks, intermediate data collected by the crawler, and detection results.
A user interaction module 50 configured to: user management, detection task management and detection result display.
The user interaction module 50 is configured to provide various interaction functions for a user on the front-end interface, and mainly includes user management, detection task management, and detection result display functions.
The Web fingerprint detection module 20 is a core module of the system, and includes a CMS system type detection submodule, a host port information detection submodule, and a Web server type detection submodule. As shown in fig. 12, the Web fingerprint detection module 20 specifically executes the following flow:
(1) The CMS system type detection submodule extracts the static path key information from the Redis intermediate data queue and matches it against the Web fingerprint library to identify the CMS system of the target site.
(2) The host port information detection submodule extracts the seed URL and IP address from the Redis intermediate data queue, calls Nmap for detection, and identifies the port fingerprint of the host server.
(3) The Web server type detection submodule extracts the response message header information from the Redis intermediate data queue, preprocesses the data, and identifies the type of the Web server by using a random forest detection model.
(4) The detection results of (1) to (3) are integrated and written into a Redis result queue.
(5) The data in the Redis result queue that needs to be retained is written into the database.
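Step (1) of the flow above, matching static path key information against a fingerprint library, might look like the following sketch. The library contents, the substring-matching rule, and the "most hits wins" scoring are illustrative assumptions; the application does not specify the structure of the Web fingerprint library or the matching rule.

```python
from typing import Dict, List, Optional

# Hypothetical fingerprint library: CMS name -> characteristic path fragments.
FINGERPRINT_LIBRARY: Dict[str, List[str]] = {
    "WordPress": ["wp-content", "wp-includes"],
    "Drupal": ["sites/default/files", "misc/drupal.js"],
    "Joomla": ["media/jui", "components/com_content"],
}


def identify_cms(
    key_info: List[str],
    library: Dict[str, List[str]] = FINGERPRINT_LIBRARY,
) -> Optional[str]:
    """Return the CMS whose fragments match the most key-information items.

    Returns None when no fragment of any CMS appears in the key information.
    """
    best_name, best_hits = None, 0
    for name, fragments in library.items():
        # Count the items that contain at least one fragment of this CMS.
        hits = sum(
            1 for item in key_info if any(frag in item for frag in fragments)
        )
        if hits > best_hits:
            best_name, best_hits = name, hits
    return best_name


detected = identify_cms([
    "/wp-content/themes/demo/style.css",
    "/wp-includes/js/jquery/jquery.js",
])
```

A production library would likely need version information and weighting per fingerprint; this sketch only shows the lookup step.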
In some embodiments, based on the above modules, the Web fingerprint detection method can be summarized as follows. The crawling module 10 is responsible for collecting information about the target site, including key information of the static file paths and header information of the response message, and preprocesses the acquired information as the input of the Web fingerprint detection module 20. The task scheduling module 40 is responsible for coordinating the execution sequence of the other modules, mainly covering crawler task execution, fingerprint detection task execution and result storage. The system storage module 30 is responsible for storing user data and the data or results generated during the execution of the other modules. The user interaction module 50 provides interaction functions for the user on the front-end interface, mainly user management, detection task management and detection result display. The Web fingerprint detection module 20 is the specific implementation of the Web component detection technology and mainly includes Web server type detection, CMS system type detection and host port fingerprint detection.
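As an illustration of the header preprocessing mentioned above, the sketch below turns a raw response head into two feature groups: the relative position of selected fields and the content of selected fields, following the field lists given in claim 7. The function names, the use of -1 for absent fields, and the standard `X-Powered-By` spelling are assumptions of this sketch; the resulting features could then be encoded and passed to a classifier such as a random forest, which is not shown here.

```python
from typing import List, Tuple

# Fields whose relative order in the response head is recorded.
ORDER_FIELDS = ["Date", "Server", "Content-Type",
                "Content-Length", "Connection", "Expires"]
# Fields whose content is recorded (standard header spelling assumed).
CONTENT_FIELDS = ["Content-Length", "X-Powered-By"]


def parse_headers(raw: str) -> List[Tuple[str, str]]:
    """Split a raw HTTP response head into ordered (name, value) pairs."""
    pairs = []
    for line in raw.split("\r\n")[1:]:  # skip the status line
        if not line:
            break  # a blank line ends the header section
        name, _, value = line.partition(":")
        pairs.append((name.strip(), value.strip()))
    return pairs


def header_features(raw: str) -> Tuple[List[int], List[str]]:
    """Return (positions, contents); an absent field gives -1 or ''."""
    pairs = parse_headers(raw)
    names = [name.lower() for name, _ in pairs]
    values = {name.lower(): value for name, value in pairs}
    positions = [
        names.index(f.lower()) if f.lower() in names else -1
        for f in ORDER_FIELDS
    ]
    contents = [values.get(f.lower(), "") for f in CONTENT_FIELDS]
    return positions, contents


SAMPLE = (
    "HTTP/1.1 404 Not Found\r\n"
    "Server: nginx\r\n"
    "Date: Mon, 01 Jan 2024 00:00:00 GMT\r\n"
    "Content-Type: text/html\r\n"
    "Content-Length: 153\r\n"
    "Connection: keep-alive\r\n"
    "\r\n"
)
positions, contents = header_features(SAMPLE)
# positions == [1, 0, 2, 3, 4, -1]; contents == ["153", ""]
```

Field order is informative because different server implementations tend to emit the same headers in different sequences, which is why position is kept as a feature alongside content.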
In order to better implement the present invention, further, the crawling module 10 obtains key information of the static file paths of the target website and text information of the response message header, and provides analysis data for the Web fingerprint detection module 20. In order to collect as much static file path information as possible, the crawling module 10 employs a breadth-first strategy. To improve crawler efficiency and acquire as much static file path information as possible in a short time, the acquisition of static file path key information exposes three parameters, namely the number of threads, the crawl depth and the maximum execution time, which the user can set as required.
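A breadth-first crawl bounded by depth and execution time can be sketched as follows. To stay self-contained, the sketch is single-threaded and takes a `get_links` callback in place of real HTTP fetching; the function name, the callback, and the sample link graph are all hypothetical, and the thread-count parameter mentioned above is not modeled.

```python
import time
from collections import deque
from typing import Callable, Iterable, List


def crawl_breadth_first(
    seed: str,
    get_links: Callable[[str], Iterable[str]],
    max_depth: int,
    max_seconds: float,
) -> List[str]:
    """Visit pages level by level, stopping at max_depth or the deadline."""
    deadline = time.monotonic() + max_seconds
    visited = {seed}
    order = []
    queue = deque([(seed, 0)])
    while queue and time.monotonic() < deadline:
        url, depth = queue.popleft()  # FIFO order gives breadth-first
        order.append(url)
        if depth >= max_depth:
            continue  # do not expand pages at the depth limit
        for link in get_links(url):
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return order


# Hypothetical site structure standing in for fetched pages.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/a/1"],
    "/b": ["/a", "/b/1"],
    "/a/1": ["/deep"],
}
pages = crawl_breadth_first("/", lambda u: SITE.get(u, []),
                            max_depth=2, max_seconds=5.0)
# pages == ["/", "/a", "/b", "/a/1", "/b/1"]
```

Breadth-first order reaches many distinct directories early, which suits the stated goal of collecting as many static file paths as possible before the time limit expires.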
In order to better implement the present invention, further, the task scheduling module 40 is responsible for coordinating the execution of the three core modules, namely the crawling module 10, the Web fingerprint detection module 20 and the system storage module 30, and performing scheduling management on the execution logic and data transfer among the modules.
In order to better implement the present invention, further, the system storage module 30 is responsible for storing user data and the data or results generated during the execution of the other modules, including the basic parameter information of tasks, the website static file path information and response message header text acquired by the crawling module 10, the result information produced by the Web fingerprint detection module 20, user information, and the like.
In order to better implement the present invention, further, the user interaction module 50 provides interaction functions for the user at the front-end interface: a user management function that assigns operation permissions according to the role of the user; a task management function for creating, deleting, suspending and starting tasks; and a Web fingerprint function that lets a system administrator add Web fingerprints to the Web fingerprint database.
In some embodiments, as shown in fig. 13, a single detection process is illustrated as follows:
(1) At the front end of the Web fingerprint detection apparatus, the user configures the scan parameters through the user interaction module 50, including the target site URL or IP address, the number of crawler threads, the task limit, the maximum scanning time, and so on. A task is created from the parameters set by the user, the parameter information of the task is stored in the system storage module 30, and the task is marked as to be detected.
(2) The task scheduling module 40 monitors the tasks to be detected in the database, serializes the basic information of each task, stores it in the crawler task queue, and sends the queue of tasks to be detected to the crawling module 10 for processing.
(3) The crawling module 10 takes the information of a task to be detected from the crawler task queue, collects the target site information, preprocesses the collected information, and stores the processed data in an intermediate data queue held in Redis in the system storage module 30.
(4) After detecting that the intermediate data queue has been stored in the system storage module 30, the task scheduling module 40 schedules the system storage module 30 to store the preprocessed data in its database, and issues a detection task queue to the Web fingerprint detection module 20.
(5) The Web fingerprint detection module 20 monitors the detection task queue at regular time, and when there is a task to be detected, calls the CMS system type detection submodule, the host port information detection submodule, and the Web server type detection submodule to detect the CMS system type, the host port fingerprint, and the Web server type, respectively, and stores the detection result data in the detection result queue.
(6) After detecting task information in the detection result queue, the task scheduling module 40 stores the detection result in the system storage module 30 and sets the status of the task in the system storage module 30 to completed.
(7) Using the user interaction module 50, the user queries the detection result from the database of the system storage module 30 through the front-end interface, and the detection result is displayed on the Web interface.
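For the host port fingerprint produced in step (5), one plausible approach is to have Nmap write its report as XML and parse that report. The sketch below assumes the XML report format and uses a hand-written, heavily abbreviated sample in place of real scanner output; the function name and the output dictionary keys are hypothetical, and the application does not state which report format is used.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Abbreviated, hand-written sample in the shape of an Nmap XML report.
SAMPLE_REPORT = """<nmaprun>
  <host>
    <address addr="192.0.2.10" addrtype="ipv4"/>
    <ports>
      <port protocol="tcp" portid="22">
        <state state="open"/>
        <service name="ssh" product="OpenSSH"/>
      </port>
      <port protocol="tcp" portid="80">
        <state state="open"/>
        <service name="http" product="nginx"/>
      </port>
      <port protocol="tcp" portid="3306">
        <state state="closed"/>
        <service name="mysql"/>
      </port>
    </ports>
  </host>
</nmaprun>"""


def parse_port_fingerprints(report_xml: str) -> List[Dict]:
    """Collect the open ports and their services from an XML report."""
    root = ET.fromstring(report_xml)
    results = []
    for host in root.iter("host"):
        address = host.find("address")
        ip = address.get("addr") if address is not None else None
        for port in host.iter("port"):
            state = port.find("state")
            if state is None or state.get("state") != "open":
                continue  # keep open ports only
            service = port.find("service")
            results.append({
                "ip": ip,
                "port": int(port.get("portid")),
                "protocol": port.get("protocol"),
                "service": service.get("name") if service is not None else None,
                "product": service.get("product") if service is not None else None,
            })
    return results


fingerprints = parse_port_fingerprints(SAMPLE_REPORT)
```

Each resulting dictionary can be serialized and written to the detection result queue alongside the CMS and Web server results.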
For convenience of description, the above apparatus is described as being divided into various modules by function, each described separately. Of course, when implementing the present application, the functions of the modules may be implemented in the same one or more pieces of software and/or hardware.
The apparatus in the foregoing embodiment is used to implement the corresponding Web fingerprint detection method in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
Based on the same inventive concept, corresponding to the method of any embodiment described above, the present application further provides an electronic device, which includes a memory, a processor, and a computer program stored on the memory and running on the processor, and when the processor executes the program, the Web fingerprint detection method described in any embodiment above is implemented.
Fig. 14 is a schematic diagram illustrating a more specific hardware structure of an electronic device according to this embodiment, where the electronic device may include: a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040, and a bus 1050. Wherein the processor 1010, memory 1020, input/output interface 1030, and communication interface 1040 are communicatively coupled to each other within the device via bus 1050.
The processor 1010 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits, and is configured to execute related programs to implement the technical solutions provided in the embodiments of the present disclosure.
The Memory 1020 may be implemented in the form of a ROM (Read Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 1020 may store an operating system and other application programs, and when the technical solution provided by the embodiments of the present specification is implemented by software or firmware, the relevant program codes are stored in the memory 1020 and called to be executed by the processor 1010.
The input/output interface 1030 is used for connecting an input/output module to input and output information. The input/output module may be configured as a component in the device (not shown) or may be external to the device to provide a corresponding function. The input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output devices may include a display, a speaker, a vibrator, an indicator light, etc.
The communication interface 1040 is used for connecting a communication module (not shown in the drawings) to implement communication interaction between the present apparatus and other apparatuses. The communication module can realize communication in a wired mode (such as USB, network cable and the like) and also can realize communication in a wireless mode (such as mobile network, WIFI, Bluetooth and the like).
The bus 1050 includes a path to transfer information between various components of the device, such as the processor 1010, memory 1020, input/output interface 1030, and communication interface 1040.
It should be noted that although the above-mentioned device only shows the processor 1010, the memory 1020, the input/output interface 1030, the communication interface 1040 and the bus 1050, in a specific implementation, the device may also include other components necessary for normal operation. In addition, those skilled in the art will appreciate that the above-described apparatus may also include only the components necessary to implement the embodiments of the present disclosure, and need not include all of the components shown in the figures.
The electronic device of the foregoing embodiment is used to implement the corresponding Web fingerprint detection method in any one of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
Based on the same inventive concept, corresponding to any of the above-mentioned embodiment methods, the present application also provides a non-transitory computer-readable storage medium storing computer instructions for causing the computer to execute the Web fingerprint detection method according to any of the above-mentioned embodiments.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device.
The computer instructions stored in the storage medium of the foregoing embodiment are used to enable the computer to execute the Web fingerprint detection method according to any one of the foregoing embodiments, and have the beneficial effects of the corresponding method embodiment, which are not described herein again.
Those of ordinary skill in the art will understand that: the discussion of any embodiment above is meant to be exemplary only, and is not intended to imply that the scope of the disclosure, including the claims, is limited to these examples; within the context of the present application, features from the above embodiments or from different embodiments may also be combined, steps may be implemented in any order, and there are many other variations of the different aspects of the embodiments of the present application as described above, which are not provided in detail for the sake of brevity.
In addition, well-known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown in the provided figures for simplicity of illustration and discussion, and so as not to obscure the embodiments of the application. Furthermore, devices may be shown in block diagram form in order to avoid obscuring embodiments of the application, and this also takes into account the fact that specifics with respect to implementation of such block diagram devices are highly dependent upon the platform within which the embodiments of the application are to be implemented (i.e., specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the application, it should be apparent to one skilled in the art that the embodiments of the application can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative instead of restrictive.
While the present application has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of these embodiments will be apparent to those of ordinary skill in the art in light of the foregoing description. For example, other memory architectures (e.g., dynamic RAM (DRAM)) may use the discussed embodiments.
The present embodiments are intended to embrace all such alternatives, modifications and variances which fall within the broad scope of the appended claims. Therefore, any omissions, modifications, substitutions, improvements, and the like that may be made without departing from the spirit and principles of the embodiments of the present application are intended to be included within the scope of the present application.

Claims (10)

1. A Web fingerprint detection method is characterized by comprising the following steps:
crawling source codes of a plurality of webpages from a target website by using a web crawler, and acquiring key information of a static file path based on the source codes;
the web crawler sends a predefined HTTP request to a host server of the target site to acquire the header information of a response message of the host server;
identifying the CMS type of the content management system of the target site by matching the key information with a Web fingerprint library;
predicting the type of the Web server of the target site by utilizing a trained machine learning model based on the head information;
and scanning the open port of the host server and the service corresponding to the open port by using a network connection end scanning tool, and detecting the host port fingerprint of the target site.
2. The method of claim 1, wherein the predicting, based on the header information, a Web server type for the target site using a trained machine learning model comprises:
preprocessing the header information;
and predicting the type of the Web server by utilizing the machine learning model through a random forest algorithm based on the preprocessed head information.
3. The method of claim 1, wherein the detecting the host port fingerprint of the target site by scanning an open port of the host server and a service corresponding to the open port with a network connection end scanning tool comprises:
scanning the open port and the service corresponding to the open port by using a scanning tool Nmap to generate a detection report;
and analyzing the detection report to obtain the host port fingerprint.
4. The method of claim 1, wherein the crawling a plurality of web pages from a target site with a web crawler comprises:
and crawling the source code from the target site by using the web crawler and adopting a breadth-first strategy.
5. The method of claim 4, wherein the obtaining key information of a static file path based on the source code comprises:
analyzing a predetermined label from the source code;
extracting static file path information from the predetermined label by using a regular expression;
saving the static file path information as a target text, and denoising the target text;
and performing text slicing processing on the target text subjected to denoising processing to extract the key information.
6. The method of claim 1, wherein the obtaining header information of the response message of the host server by the web crawler sending a predefined HTTP request to the host server of the target site comprises:
sending, by the web crawler, the HTTP request to the host server;
acquiring a response message of the host server to the HTTP request;
and extracting the relative position information of the first preset field and the content information of the second preset field from the head part of the response message as the head information.
7. The method of claim 6, wherein,
the HTTP request comprises "GET /404page.html HTTP/1.1\r\n\r\n";
the first preset field comprises a 'Date' field, a 'Server' field, a 'Content-Type' field, a 'Content-Length' field, a 'Connection' field and an 'Expires' field;
the second preset field comprises a "Content-Length" field and an "X-Powered-By" field.
8. The method of any of claims 1 to 7, further comprising:
and writing the CMS type, the Web server type and the host port fingerprint into a remote dictionary service Redis queue as a detection result of the Web fingerprint of the target site.
9. A Web fingerprint detection apparatus comprising:
a crawling module configured to: crawling source codes of a plurality of webpages from a target website by using a web crawler, and acquiring key information of a static file path based on the source codes; the web crawler sends a predefined HTTP request to a host server of the target site to acquire the header information of a response message of the host server;
a Web fingerprint detection module configured to: identifying the CMS type of the target site by matching the key information with a Web fingerprint library; predicting the type of the Web server of the target site by utilizing a trained machine learning model based on the head information; and scanning the open port of the host server and the service corresponding to the open port by using a network connection end scanning tool, and detecting the host port fingerprint of the target site.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable by the processor, wherein the processor implements the method of any one of claims 1 to 8 when executing the computer program.
CN202111681406.7A 2021-12-31 2021-12-31 Web fingerprint detection method and related equipment Pending CN114528457A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111681406.7A CN114528457A (en) 2021-12-31 2021-12-31 Web fingerprint detection method and related equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111681406.7A CN114528457A (en) 2021-12-31 2021-12-31 Web fingerprint detection method and related equipment

Publications (1)

Publication Number Publication Date
CN114528457A true CN114528457A (en) 2022-05-24

Family

ID=81621061

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111681406.7A Pending CN114528457A (en) 2021-12-31 2021-12-31 Web fingerprint detection method and related equipment

Country Status (1)

Country Link
CN (1) CN114528457A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115941280A (en) * 2022-11-10 2023-04-07 北京源堡科技有限公司 Penetration method, device, equipment and medium based on web fingerprint information
CN115941280B (en) * 2022-11-10 2024-01-26 北京源堡科技有限公司 Penetration method, device, equipment and medium based on web fingerprint information
CN116304901A (en) * 2023-02-01 2023-06-23 北京市燃气集团有限责任公司 Webpage server fingerprint identification method, device, equipment and storage medium
CN116304901B (en) * 2023-02-01 2024-01-30 北京市燃气集团有限责任公司 Webpage server fingerprint identification method, device, equipment and storage medium
CN116127236A (en) * 2023-04-19 2023-05-16 远江盛邦(北京)网络安全科技股份有限公司 Webpage web component identification method and device based on parallel structure
JP7344614B1 (en) 2023-05-08 2023-09-14 株式会社エーアイセキュリティラボ Systems, methods, and programs for testing website vulnerabilities
JP7440150B1 (en) 2023-05-08 2024-02-28 株式会社エーアイセキュリティラボ Systems, methods, and programs for testing website vulnerabilities

Similar Documents

Publication Publication Date Title
CN111522922B (en) Log information query method and device, storage medium and computer equipment
CN108471429B (en) Network attack warning method and system
US9614862B2 (en) System and method for webpage analysis
CN114528457A (en) Web fingerprint detection method and related equipment
EP2871574B1 (en) Analytics for application programming interfaces
CN103888490B (en) A kind of man-machine knowledge method for distinguishing of full automatic WEB client side
CN111585955B (en) HTTP request abnormity detection method and system
CN112491602B (en) Behavior data monitoring method and device, computer equipment and medium
CN105743730A (en) Method and system used for providing real-time monitoring for webpage service of mobile terminal
CN110830483B (en) Webpage log attack information detection method, system, equipment and readable storage medium
CN102984161A (en) Identification method and device for reliable website
CN114817968B (en) Method, device and equipment for tracing path of featureless data and storage medium
CN111447224A (en) Web vulnerability scanning method and vulnerability scanner
CA2781391A1 (en) Identifying equivalent links on a page
CN111460803B (en) Equipment identification method based on Web management page of industrial Internet of things equipment
KR20190058141A (en) Method for generating data extracted from document and apparatus thereof
US11055631B2 (en) Automated meta parameter search for invariant based anomaly detectors in log analytics
EP3789882B1 (en) Automatic configuration of logging infrastructure for software deployments using source code
CN115033876A (en) Log processing method, log processing device, computer device and storage medium
CN112688966A (en) Webshell detection method, device, medium and equipment
CN113965407A (en) IOC information file generation method and device, storage medium and electronic equipment
CN111597422A (en) Buried point mapping method and device, computer equipment and storage medium
CN114465741B (en) Abnormality detection method, abnormality detection device, computer equipment and storage medium
CN112347328A (en) Network platform identification method, device, equipment and readable storage medium
CN111931186A (en) Software risk identification method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination