CN111092769A - Web fingerprint identification method based on machine learning - Google Patents
- Publication number
- CN111092769A
- Authority
- CN
- China
- Prior art keywords
- web
- prediction
- fingerprint
- training
- web fingerprint
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/14—Network analysis or design
- H04L41/147—Network analysis or design for predicting network behaviour
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/02—Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Signal Processing (AREA)
- Artificial Intelligence (AREA)
- Computer Networks & Wireless Communication (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Software Systems (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Medical Informatics (AREA)
- Evolutionary Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Image Analysis (AREA)
- Collating Specific Patterns (AREA)
Abstract
The invention relates to the technical field of Web services, and in particular to a Web fingerprint identification method based on machine learning. The method uses a machine learning algorithm to train a Web fingerprint identification model on the content of response messages and their corresponding Web fingerprints. Inputting a response message into the model quickly yields the corresponding Web fingerprint, which achieves fast, high-precision prediction of the type and version of a Web service. The method comprises: S1, data collection: collecting or creating a plurality of sites, sending designed requests to the sites, labelling each returned response message with its corresponding Web fingerprint, and dividing the response messages into a training set and a test set; S2, data processing; S3, model training; S4, finishing training; S5, prediction.
Description
Technical Field
The invention relates to the technical field of web services, in particular to a web fingerprint identification method based on machine learning.
Background
A Web fingerprint arises because the components of a Web service are designed and developed with different development tools, technologies, and implementation approaches, so each finished component carries information unique to it. This unique information can be used to identify the component and is called its Web fingerprint. A Web fingerprint generally covers the Web container, Web server-side language, Web back-end framework, Web application, Web front-end framework, Web front-end language, Web operator, third-party content, CDN operator, and the like. Web fingerprint identification means identifying the service components of a target Web object by comparing known Web fingerprint information with the characteristics of that object. Common identification methods include header inspection, body keyword matching, MD5 matching of URL content, URL status code checks, and URL relative path matching. The technique is generally used to gather information about a target Web site and learn which Web service components it runs. Knowing the specific type and version of a component, a user can often look up which vulnerabilities exist in that version and then decide on the next operation.
Existing Web fingerprint identification methods generally construct a special HTTP request, send it to the target Web service, and determine the type and version of the service from special content, characters, or strings in the response message it returns. Some mainstream Web fingerprint scanning tools, including WhatWeb and Wappalyzer, determine the Web fingerprint directly from the banner in the response message and keywords in the HTML. All of these methods require extensive manual collection and analysis of response messages to find the special content and keywords that can serve as the basis for a decision. In addition, banner information, other special content in the response message, and the HTML content are all very easy to modify or delete. When a target Web service has altered its keywords or hidden its key content, these common methods tend to fail, and the detection success rate drops sharply.
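The rule-based approach described above can be sketched in Python as follows. This is only an illustration: the rule table, the `match_rules` name, and the sample values are hypothetical, and real scanning tools maintain far larger, hand-curated rule sets.

```python
import hashlib

# Hypothetical rule table: each rule names a fingerprint and exactly one check.
RULES = [
    {"fingerprint": "Nginx", "header": ("Server", "nginx")},
    {"fingerprint": "WordPress", "body_keyword": "wp-content"},
    {"fingerprint": "ExampleApp",
     "body_md5": hashlib.md5(b"example favicon bytes").hexdigest()},
]

def match_rules(headers, body, rules=RULES):
    """Return the fingerprints whose rule matches the given response."""
    lowered = {k.lower(): v.lower() for k, v in headers.items()}
    body_md5 = hashlib.md5(body).hexdigest()
    text = body.decode("utf-8", errors="replace")
    hits = []
    for rule in rules:
        if "header" in rule:
            # Banner check: substring match on a response header value.
            name, needle = rule["header"]
            if needle in lowered.get(name.lower(), ""):
                hits.append(rule["fingerprint"])
        elif "body_keyword" in rule:
            # Keyword check: substring match on the response body.
            if rule["body_keyword"] in text:
                hits.append(rule["fingerprint"])
        elif "body_md5" in rule:
            # MD5 check: exact hash match on the response body.
            if rule["body_md5"] == body_md5:
                hits.append(rule["fingerprint"])
    return hits
```

With these rules, a response whose `Server` header has been rewritten and whose HTML no longer contains the keyword matches nothing, which is the weakness the background section points out.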
Disclosure of Invention
In order to solve the technical problems, the invention provides a Web fingerprint identification method based on machine learning, which uses a machine learning algorithm to train the content in a response message and the corresponding Web fingerprint to generate a Web fingerprint identification model, and inputs the response message into the model to quickly judge the corresponding Web fingerprint, thereby finally realizing high-precision and quick prediction of the type version of the Web service.
The invention relates to a web fingerprint identification method based on machine learning, which comprises the following steps:
S1, data collection: collecting or creating a plurality of sites, sending designed requests to the sites, labelling each returned response message with its corresponding Web fingerprint, and dividing the response messages into a training set and a test set;
S2, data processing: for the collected response messages with different Web fingerprints, replacing each character contained in a response message with a number;
S3, model training;
S4, finishing training;
S5, prediction.
In the Web fingerprint identification method based on machine learning of the present invention, step S2 specifically comprises: let the total number of distinct characters contained in all the response messages be $n$, and let each character $a$ correspond to a number $x_a$, so that a response message can be replaced by a vector of length $n$. For the $i$-th character $m_i$ of a message $m$, whose corresponding number is $x_{m_i}$, 1 is added to the $x_{m_i}$-th element of the message's vector. Repeating this for every character yields a numeric vector of length $n$, which is the numeric vector form of the original message;
step S3 specifically comprises: after obtaining the character vectors corresponding to the response messages, the vectors are first clustered, the number of classes being the number of different Web fingerprint types to be classified in the original response message set;
for two different response messages, their distance is:
[formula image not reproduced]
where $t_i$ is the parameter of the $i$-th element in the vector;
the distances between a response message in the prediction set and all the response messages in the test set are calculated to obtain the 10 nearest response messages, and the Web fingerprint that appears most frequently among those 10 response messages is returned as the predicted Web fingerprint;
after each prediction, if the predicted result matches the actual result, the error is recorded as 0; if it does not, the 5 messages nearest to the predicted message among those carrying the corresponding (actual) Web fingerprint are taken, their average vector is calculated, and the error is:
[formula image not reproduced]
when a round of training finishes, the sum of the errors of all its predictions is the prediction error of that round;
step S4 specifically comprises: if the error of the current round is within a specified threshold, or the maximum number of rounds has been reached, the training process ends and the current parameters are stored locally; the model is then judged to have reached a stable state and can be used to predict Web fingerprints; later predictions can load the locally stored parameters directly, with no further training of model parameters; if the error of the current round is still outside the specified threshold and the maximum number of rounds has not been reached, the model parameters are updated according to the error:
[formula image not reproduced]
where α is the learning rate and Δ is the error;
step S5 specifically comprises: once the trained Web fingerprint prediction model is obtained, predicting the Web fingerprint of a target site only requires sending a constructed Web request to the target site, converting the returned response message into numeric vector form, and inputting it into the Web fingerprint prediction model; the model outputs the predicted Web fingerprint, completing the prediction, and the user can perform the next operation according to the predicted Web fingerprint.
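As a rough illustration of step S2, the distance and voting part of step S3, and step S5, the following Python sketch counts characters into a vector and predicts by majority vote among the nearest labelled messages. The function names are hypothetical. Because the distance formula itself is missing from the text, `weighted_distance` assumes one plausible form, a Euclidean distance in which each squared difference is scaled by $t_i$; the original formula may differ. Indices are 0-based, and characters outside the known alphabet are skipped, which is also an assumption.

```python
import math
from collections import Counter

def build_alphabet(messages):
    """Map each distinct character a to an index x_a (0-based here)."""
    chars = sorted(set("".join(messages)))
    return {ch: i for i, ch in enumerate(chars)}

def vectorize(message, alphabet):
    """Count how often each known character occurs in the message."""
    vec = [0] * len(alphabet)
    for ch in message:
        idx = alphabet.get(ch)
        if idx is not None:  # assumption: unseen characters are ignored
            vec[idx] += 1
    return vec

def weighted_distance(u, v, t):
    """Assumed form: sqrt(sum_i t_i * (u_i - v_i)^2)."""
    return math.sqrt(sum(ti * (ui - vi) ** 2 for ui, vi, ti in zip(u, v, t)))

def predict(query, labelled, t, k=10):
    """Majority vote over the k nearest (vector, fingerprint) pairs."""
    nearest = sorted(labelled,
                     key=lambda item: weighted_distance(query, item[0], t))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

For prediction, a returned response message would be passed through `vectorize` with the alphabet built during training and then through `predict` with the stored parameters `t`. Note that `vectorize("aab", alphabet)` and `vectorize("aba", alphabet)` give the same vector, since character order is discarded.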
In the Web fingerprint identification method based on machine learning of the present invention, the Web fingerprint in step S1 specifically includes: Web containers, including Nginx, Apache, and the like; Web server-side languages, including PHP, ASP, Java, and the like; Web back-end frameworks, including Laravel, ThinkPHP, Django, and the like; Web applications, including Dedecms, WordPress, Discuz, and the like; Web front-end frameworks, including Bootstrap, ExtJS, Flex, and the like; and Web front-end languages, including JavaScript, PHP, C#, and the like.
Compared with the prior art, the invention has the following beneficial effects.
When processing the collected response messages of Web services, each character is replaced by a number in order to convert the messages from character form to numeric vector form, and the number of times each character appears in the message is stored in the corresponding element of the vector. This retains the frequency of each character while ignoring the order in which characters appear. The results show that the unordered representation already reaches the Web fingerprint prediction accuracy target; adding character order would raise model complexity and lengthen training time without improving prediction accuracy much.
When calculating the distance between different messages, the ordinary Euclidean distance is not used; instead, a parameter is added to the Euclidean distance. Different elements of a response message vector can differ in importance, so the ordinary Euclidean distance may not reflect the true distance between two messages. With the added parameters, which are shaped by the results on the training set, the distance between different messages becomes more accurate.
When a prediction is evaluated during training, the 5 messages nearest to the predicted message within the corresponding Web fingerprint are selected as the target result. The assumption is that if these 5 messages were among the 10 nearest response messages, the prediction would have succeeded. These 5 messages therefore give the direction in which the parameterised distance to the predicted message should move, so that prediction accuracy improves after the parameters are updated. The 5 nearest messages are chosen to limit the effect of the update on other predictions as far as possible.
Compared with common fingerprint scanning tools or Web fingerprint detection based on banner and HTML features, the method needs little detection time once training is complete and achieves high prediction accuracy. Where manual modification of the banner or HTML reduces the accuracy of those approaches, the method can still identify the fingerprint information effectively, reduces the impact of missing or tampered information, and can identify the fingerprints of a large number of Web services at relatively low cost.
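A small numeric example may help show why per-element parameters matter. The vectors and weights below are made up for illustration, and the distance assumes the form $\sqrt{\sum_i t_i (u_i - v_i)^2}$, which the text does not confirm.

```python
import math

def weighted_distance(u, v, t):
    # Assumed form of the parameterised distance: sqrt(sum_i t_i * (u_i - v_i)^2)
    return math.sqrt(sum(ti * (ui - vi) ** 2 for ui, vi, ti in zip(u, v, t)))

# Toy count vectors over a hypothetical 3-character alphabet.
query = [5, 0, 1]
same_class = [1, 0, 1]   # differs from the query only in element 0
other = [5, 2, 1]        # differs from the query only in element 1

plain = [1.0, 1.0, 1.0]    # all parameters equal: ordinary Euclidean distance
learned = [0.1, 4.0, 1.0]  # hypothetical parameters after training

# With equal parameters the other-class message is nearer (4.0 versus 2.0).
d_plain = (weighted_distance(query, same_class, plain),
           weighted_distance(query, other, plain))

# With the hypothetical learned parameters the same-class message is nearer.
d_learned = (weighted_distance(query, same_class, learned),
             weighted_distance(query, other, learned))
```

Here element 0 behaves like a noisy character count and element 1 like a distinguishing one; lowering the parameter of the first and raising that of the second reverses which neighbour is closest.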
Detailed Description
The following examples are given to further illustrate the embodiments of the present invention. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.
The invention relates to a web fingerprint identification method based on machine learning, which comprises the following steps:
S1, data collection: collecting or creating a plurality of sites that contain Web service components of various types and versions: Web containers, including Nginx, Apache, and the like; Web server-side languages, including PHP, ASP, Java, and the like; Web back-end frameworks, including Laravel, ThinkPHP, Django, and the like; Web applications, including Dedecms, WordPress, Discuz, and the like; Web front-end frameworks, including Bootstrap, ExtJS, Flex, and the like; and Web front-end languages, including JavaScript, PHP, C#, and the like; then sending designed requests to the sites, labelling each returned response message with its corresponding Web fingerprint, and dividing the response messages into a training set and a test set;
S2, data processing: for the collected response messages with different Web fingerprints, each character contained in a response message is replaced by a number. Let the total number of distinct characters contained in all the response messages be $n$, and let each character $a$ correspond to a number $x_a$, so that a response message can be replaced by a vector of length $n$. For the $i$-th character $m_i$ of a message $m$, whose corresponding number is $x_{m_i}$, 1 is added to the $x_{m_i}$-th element of the message's vector. Repeating this for every character yields a numeric vector of length $n$, which is the numeric vector form of the original message;
S3, model training: after obtaining the character vectors corresponding to the response messages, the vectors are first clustered, the number of classes being the number of different Web fingerprint types to be classified in the original response message set;
for two different response messages, their distance is:
[formula image not reproduced]
where $t_i$ is the parameter of the $i$-th element in the vector;
the distances between a response message in the prediction set and all the response messages in the test set are calculated to obtain the 10 nearest response messages, and the Web fingerprint that appears most frequently among those 10 response messages is returned as the predicted Web fingerprint;
after each prediction, if the predicted result matches the actual result, the error is recorded as 0; if it does not, the 5 messages nearest to the predicted message among those carrying the corresponding (actual) Web fingerprint are taken, their average vector is calculated, and the error is:
[formula image not reproduced]
when a round of training finishes, the sum of the errors of all its predictions is the prediction error of that round;
S4, finishing training: if the error of the current round is within a specified threshold, or the maximum number of rounds has been reached, the training process ends and the current parameters are stored locally; the model is then judged to have reached a stable state and can be used to predict Web fingerprints; later predictions can load the locally stored parameters directly, with no further training of model parameters; if the error of the current round is still outside the specified threshold and the maximum number of rounds has not been reached, the model parameters are updated according to the error:
[formula image not reproduced]
where α is the learning rate and Δ is the error;
S5, prediction: once the trained Web fingerprint prediction model is obtained, predicting the Web fingerprint of a target site only requires sending a constructed Web request to the target site, converting the returned response message into numeric vector form, and inputting it into the Web fingerprint prediction model; the model outputs the predicted Web fingerprint, completing the prediction, and the user can perform the next operation according to the predicted Web fingerprint.
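The training rounds of S3 and the stopping rule of S4 could be organised roughly as in the Python sketch below. The error and update formulas are missing from the text, so both are assumptions made only for illustration: the per-element error is taken as the squared gap between a mispredicted vector and the mean of its nearest same-fingerprint vectors, and the update subtracts the learning rate times that error from each parameter, clamped at zero. All function and parameter names are hypothetical.

```python
import math
from collections import Counter

def weighted_distance(u, v, t):
    # Assumed distance form: sqrt(sum_i t_i * (u_i - v_i)^2)
    return math.sqrt(sum(ti * (ui - vi) ** 2 for ui, vi, ti in zip(u, v, t)))

def knn_predict(query, reference, t, k):
    """Majority fingerprint among the k nearest (vector, fingerprint) pairs."""
    nearest = sorted(reference,
                     key=lambda item: weighted_distance(query, item[0], t))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def per_element_error(query, true_label, reference, t, m):
    """Assumed error: squared gap to the mean of the m nearest same-label vectors."""
    same = [vec for vec, label in reference if label == true_label]
    same.sort(key=lambda vec: weighted_distance(query, vec, t))
    closest = same[:m]
    if not closest:  # no reference message carries this fingerprint
        return [0.0] * len(query)
    mean = [sum(col) / len(closest) for col in zip(*closest)]
    return [(q - c) ** 2 for q, c in zip(query, mean)]

def train(samples, reference, n, k=10, m=5, alpha=0.01,
          threshold=0.0, max_rounds=50):
    """Return the parameters t after at most max_rounds rounds."""
    t = [1.0] * n  # start from the ordinary Euclidean distance
    for _ in range(max_rounds):
        total = [0.0] * n
        for query, true_label in samples:
            # A correct prediction contributes an error of 0.
            if knn_predict(query, reference, t, k) != true_label:
                delta = per_element_error(query, true_label, reference, t, m)
                total = [a + b for a, b in zip(total, delta)]
        if sum(total) <= threshold:  # round error within the threshold: stop
            break
        # Assumed update rule: t_i <- max(t_i - alpha * delta_i, 0)
        t = [max(ti - alpha * di, 0.0) for ti, di in zip(t, total)]
    return t
```

The returned list `t` is what would be stored locally after training and loaded again for later predictions. Shrinking the parameters of elements where a mispredicted message differs from its own class pulls it toward that class, which is one way to read the "distance moving direction" described in the beneficial effects.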
The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and variations can be made without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as the protection scope of the present invention.
Claims (3)
1. A web fingerprint identification method based on machine learning is characterized by comprising the following steps:
S1, data collection: collecting or creating a plurality of sites, sending designed requests to the sites, labelling each returned response message with its corresponding Web fingerprint, and dividing the response messages into a training set and a test set;
S2, data processing: for the collected response messages with different Web fingerprints, replacing each character contained in a response message with a number;
S3, model training;
S4, finishing training;
S5, prediction.
2. The Web fingerprint identification method based on machine learning of claim 1, wherein step S2 specifically comprises: let the total number of distinct characters contained in all the response messages be $n$, and let each character $a$ correspond to a number $x_a$, so that a response message can be replaced by a vector of length $n$; for the $i$-th character $m_i$ of a message $m$, whose corresponding number is $x_{m_i}$, 1 is added to the $x_{m_i}$-th element of the message's vector; repeating this for every character yields a numeric vector of length $n$, which is the numeric vector form of the original message;
step S3 specifically comprises: after obtaining the character vectors corresponding to the response messages, the vectors are first clustered, the number of classes being the number of different Web fingerprint types to be classified in the original response message set;
for two different response messages, their distance is:
[formula image not reproduced]
where $t_i$ is the parameter of the $i$-th element in the vector;
the distances between a response message in the prediction set and all the response messages in the test set are calculated to obtain the 10 nearest response messages, and the Web fingerprint that appears most frequently among those 10 response messages is returned as the predicted Web fingerprint;
after each prediction, if the predicted result matches the actual result, the error is recorded as 0; if it does not, the 5 messages nearest to the predicted message among those carrying the corresponding (actual) Web fingerprint are taken, their average vector is calculated, and the error is:
[formula image not reproduced]
when a round of training finishes, the sum of the errors of all its predictions is the prediction error of that round;
step S4 specifically comprises: if the error of the current round is within a specified threshold, or the maximum number of rounds has been reached, the training process ends and the current parameters are stored locally; the model is then judged to have reached a stable state and can be used to predict Web fingerprints; later predictions can load the locally stored parameters directly, with no further training of model parameters; if the error of the current round is still outside the specified threshold and the maximum number of rounds has not been reached, the model parameters are updated according to the error:
[formula image not reproduced]
where α is the learning rate and Δ is the error;
step S5 specifically comprises: once the trained Web fingerprint prediction model is obtained, predicting the Web fingerprint of a target site only requires sending a constructed Web request to the target site, converting the returned response message into numeric vector form, and inputting it into the Web fingerprint prediction model; the model outputs the predicted Web fingerprint, completing the prediction, and the user can perform the next operation according to the predicted Web fingerprint.
3. The Web fingerprint identification method based on machine learning of claim 2, wherein the Web fingerprint in step S1 specifically includes: Web containers, including Nginx, Apache, and the like; Web server-side languages, including PHP, ASP, Java, and the like; Web back-end frameworks, including Laravel, ThinkPHP, Django, and the like; Web applications, including Dedecms, WordPress, Discuz, and the like; Web front-end frameworks, including Bootstrap, ExtJS, Flex, and the like; and Web front-end languages, including JavaScript, PHP, C#, and the like.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911335939.2A CN111092769A (en) | 2019-12-24 | 2019-12-24 | Web fingerprint identification method based on machine learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911335939.2A CN111092769A (en) | 2019-12-24 | 2019-12-24 | Web fingerprint identification method based on machine learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111092769A (en) | 2020-05-01 |
Family
ID=70396442
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911335939.2A Pending CN111092769A (en) | 2019-12-24 | 2019-12-24 | Web fingerprint identification method based on machine learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111092769A (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104102700A (en) * | 2014-07-04 | 2014-10-15 | 华南理工大学 | Categorizing method oriented to Internet unbalanced application flow |
US20160162779A1 (en) * | 2014-12-05 | 2016-06-09 | RealMatch, Inc. | Device, system and method for generating a predictive model by machine learning |
CN106548210A (en) * | 2016-10-31 | 2017-03-29 | 腾讯科技(深圳)有限公司 | Machine learning model training method and device |
US20180218303A1 (en) * | 2017-02-01 | 2018-08-02 | Weather Build, Inc. | Systems and methods for analyzing weather event impacts on schedule activities |
CN107241352A (en) * | 2017-07-17 | 2017-10-10 | 浙江鹏信信息科技股份有限公司 | A kind of net security accident classificaiton and Forecasting Methodology and system |
CN109948650A (en) * | 2019-02-13 | 2019-06-28 | 南京中一物联科技有限公司 | A kind of smart home device type decision method based on message characteristic |
Non-Patent Citations (4)
Title |
---|
YOUYUAN LI,YINGPING ZHUANG: "fmpRPMF:A Web Implementation for Protein Indentication by Robust Peptide Mass Fingerprinting", 《IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS》 * |
南世惠,魏伟: "基于KNN和GBDT的Web服务器指纹识别技术", 《计算机科学》 * |
尹长昕: "主动Web站点指纹攻击技术的研究与实现", 《中国优秀硕士学位论文数据库》 * |
李峰: "Web应用安全漏洞扫描技术研究", 《信息通信》 * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112989315A (en) * | 2021-02-03 | 2021-06-18 | 杭州安恒信息安全技术有限公司 | Fingerprint generation method, device and equipment for terminal of Internet of things and readable storage medium |
CN112989256A (en) * | 2021-05-08 | 2021-06-18 | 北京华云安信息技术有限公司 | Method and device for identifying web fingerprint in response information |
CN112989256B (en) * | 2021-05-08 | 2021-09-24 | 北京华云安信息技术有限公司 | Method and device for identifying web fingerprint in response information |
CN115618291A (en) * | 2022-10-14 | 2023-01-17 | 吉林省吉林祥云信息技术有限公司 | Method, system, equipment and storage medium for identifying web fingerprint based on Transformer |
CN115618291B (en) * | 2022-10-14 | 2023-09-29 | 吉林省吉林祥云信息技术有限公司 | Web fingerprint identification method, system, equipment and storage medium based on Transformer |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110032623B (en) | Method and device for matching question of user with title of knowledge point | |
CN111092769A (en) | Web fingerprint identification method based on machine learning | |
JP5370159B2 (en) | Information extraction apparatus and information extraction system | |
CN110674255B (en) | Text content auditing method and device | |
CN111182162B (en) | Telephone quality inspection method, device, equipment and storage medium based on artificial intelligence | |
CN109325201A (en) | Generation method, device, equipment and the storage medium of entity relationship data | |
WO2023115761A1 (en) | Event detection method and apparatus based on temporal knowledge graph | |
JP2010501096A (en) | Cooperative optimization of wrapper generation and template detection | |
CN110909868A (en) | Node representation method and device based on graph neural network model | |
CN110175851A (en) | A kind of cheating detection method and device | |
CN105069077A (en) | Search method and device | |
WO2022042297A1 (en) | Text clustering method, apparatus, electronic device, and storage medium | |
CN103631787A (en) | Webpage type recognition method and webpage type recognition device | |
CN112069069A (en) | Defect automatic positioning analysis method, device and readable storage medium | |
CN110413307B (en) | Code function association method and device and electronic equipment | |
CN110781673B (en) | Document acceptance method and device, computer equipment and storage medium | |
CN114492601A (en) | Resource classification model training method and device, electronic equipment and storage medium | |
CN110209780B (en) | Question template generation method and device, server and storage medium | |
CN111158973A (en) | Web application dynamic evolution monitoring method | |
CN116318974A (en) | Site risk identification method and device, computer readable medium and electronic equipment | |
CN114925750A (en) | Information recommendation method and device, computer readable storage medium and electronic equipment | |
CN114141236A (en) | Language model updating method and device, electronic equipment and storage medium | |
CN106506614A (en) | The generation method of mobile terminal identification code, generating means and mobile terminal | |
CN114676705A (en) | Dialogue relation processing method, computer and readable storage medium | |
CN112287005A (en) | Data processing method, device, server and medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20200501 |