CN110245314A - A web page fingerprint generation method
- Publication number
- CN110245314A (application number CN201910469738.5A)
- Authority
- CN
- China
- Prior art keywords
- web page
- website
- dimensional vector
- generation method
- web
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Collating Specific Patterns (AREA)
Abstract
The invention discloses a web page fingerprint generation method. Web page content is identified to obtain website elements; the weighted features in the website elements are obtained to form a feature set; an N-bit binary value is generated for each website element corresponding to a weighted feature; the N-bit binary values form an N-dimensional vector V; each dimension of the vector is calculated separately to update the vector; and a new binary value is generated from the updated vector as the website fingerprint information. By identifying web pages, the method can quickly determine the composition of a website and its changes, and can determine whether a web page or the website it belongs to is affected by the latest threats or attacks. The content extracted from generic web pages is complete, recognition accuracy is higher, versatility is better, and network security risk is effectively reduced.
Description
Technical field
The invention relates to computer fingerprinting technology, and in particular to a method for generating web page fingerprints.
Background
Web page fingerprint identification is the application of fingerprinting technology to a practical scenario. In many network information exchange activities, websites are important information carriers, and web pages, as the basic elements of websites, play a very important role: people enter websites and browse web pages to obtain information or to leave their own information on the pages. For network behavior monitoring, therefore, determining a target and classifying a website cannot be separated from classifying its web pages.
From a web page it is possible to quickly identify information such as the content management system (CMS), blog platform, statistics/analytics package, JavaScript libraries, web server, and embedded devices. A component is the smallest unit of cyberspace; web applications, databases, middleware, and the like are all components. A fingerprint is a piece of characteristic information on a component that identifies the type of the object. It is used in the information-gathering stage to quickly identify the target service and to determine how a website has evolved by recognizing its components.
At present, as the internet becomes more important to social development and business activity, the requirements for network security management keep rising. Web pages carry more and more information. Identifying web pages to quickly determine the composition of a website and its changes, and to determine whether a web page or the website it belongs to is affected by the latest threats or attacks, is one method of assessing website security from the outside. Traditional methods of identifying generic websites suffer from incomplete extraction of website content, low recognition accuracy, and poor versatility.
Summary of the invention
In view of the above disadvantages, the invention provides a web page fingerprint generation method with high recognition accuracy and good versatility.
The technical solution of the invention is a web page fingerprint generation method, characterized by comprising the following steps:
identifying web page content to obtain website elements: the website is requested over HTTP or HTTPS, keyword matches against the response body or header information are obtained, regular-expression matches against the response body or header information are obtained, and service information is identified based on TCP/IP request protocols;
obtaining the weighted features in the website elements to form a feature set;
generating an N-bit binary value for each website element corresponding to a weighted feature, each website element corresponding to its own distinct binary value;
forming an N-dimensional vector V (V1, V2, ..., VN) from the N-bit binary values, calculating each dimension separately, and updating the N-dimensional vector;
generating, from the updated N-dimensional vector, a new binary value as the website fingerprint information.
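The keyword and regular-expression matching in the first step can be sketched as follows. This is only an illustration under stated assumptions: the rule table, the rule names, and the function `match_elements` are hypothetical and do not come from the application, and the HTTP/HTTPS request itself is left out (the response headers and body could be obtained with, for example, `urllib.request.urlopen`).

```python
import re

# Hypothetical rule table: (element name, where to look, rule kind, pattern).
RULES = [
    ("wordpress", "body", "keyword", "wp-content"),
    ("php", "header", "regex", r"^x-powered-by:\s*php/[\d.]+"),
]


def match_elements(headers, body, rules=RULES):
    """Return the names of website elements whose rule matches the response."""
    # Flatten the headers into "Name: value" lines so that both rule kinds
    # can be applied to headers and body in the same way.
    header_text = "\n".join(f"{name}: {value}" for name, value in headers.items())
    found = []
    for name, where, kind, pattern in rules:
        text = header_text if where == "header" else body
        if kind == "keyword":
            hit = pattern.lower() in text.lower()
        else:
            hit = re.search(pattern, text, re.IGNORECASE | re.MULTILINE) is not None
        if hit:
            found.append(name)
    return found
```

Each matched name would then become a candidate website element for the weighting step.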
Further, in the web page fingerprint generation method, each dimension of the N-dimensional vector V is calculated as follows: if the corresponding bit of a website element's binary value is 1, the feature weight of that website element is added to the dimension; if the bit is 0, the feature weight of that website element is subtracted from the dimension.
Further, in the web page fingerprint generation method, if the i-th dimension of the vector V is positive, the i-th bit of the N-bit fingerprint is set to 1; if the i-th dimension of the vector V is non-positive, the i-th bit of the N-bit fingerprint is set to 0.
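The two rules above can be written compactly. The notation is introduced here for illustration and is not taken from the application: F is the feature set, w_f is the weight of feature f, b_i(f) is the i-th bit (0 or 1) of the N-bit value generated for the website element of f, and s_i is the i-th bit of the resulting fingerprint.

```latex
V_i = \sum_{f \in F} \bigl(2\,b_i(f) - 1\bigr)\, w_f ,
\qquad
s_i =
\begin{cases}
1, & V_i > 0 \\
0, & V_i \le 0
\end{cases}
\qquad (i = 1, \dots, N)
```

The factor 2 b_i(f) - 1 equals +1 when the bit is 1 and -1 when the bit is 0, which reproduces the addition and subtraction of the feature weight.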
Further, in the web page fingerprint generation method, the binary value generated for each website element is multi-bit, usually 64 bits.
Further, in the web page fingerprint generation method, the website elements include: HTML text, HTTP response headers, URL feature values, and host features.
The beneficial effects of the invention are as follows: the invention provides a web page fingerprint generation method in which the feature weights of the website elements are used in an update calculation to generate a web page fingerprint. By identifying web pages, the method can quickly determine the composition of a website and its changes, and can determine whether a web page or the website it belongs to is affected by the latest threats or attacks. The content extracted from generic web pages is complete, recognition accuracy is higher, versatility is better, and network security risk is effectively reduced.
Brief description of the drawings
Fig. 1 is a block diagram of the web page fingerprint generation method of the invention;
Fig. 2 is a diagram of an embodiment of the web page fingerprint generation method of the invention.
Detailed description of the embodiments
The technical solution of the invention is further described below with reference to the drawings.
As shown in Fig. 1, the web page fingerprint generation method comprises the following steps:
S1: identify web page content to obtain website elements. The website is requested over HTTP or HTTPS; keyword matches against the response body or header information are obtained; regular-expression matches against the response body or header information are obtained; and service information is identified based on TCP/IP request protocols.
S2: obtain the weighted features in the website elements to form a feature set.
S3: generate an N-bit binary value for each website element corresponding to a weighted feature, each website element corresponding to its own distinct binary value. The binary value generated for each website element is multi-bit, usually 64 bits.
S4: form an N-dimensional vector V (V1, V2, ..., VN) from the N-bit binary values, calculate each dimension separately, and update the N-dimensional vector. Each dimension of the N-dimensional vector V is calculated as follows: if the corresponding bit of a website element's binary value is 1, the feature weight of that website element is added to the dimension; if the bit is 0, the feature weight of that website element is subtracted from the dimension.
S5: generate, from the updated N-dimensional vector, a new binary value as the website fingerprint information. If the i-th dimension of the vector V is positive, the i-th bit of the N-bit fingerprint is set to 1; if the i-th dimension of the vector V is non-positive, the i-th bit of the N-bit fingerprint is set to 0.
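Steps S3 to S5 can be sketched in a few lines of Python. This is a minimal sketch, not the applicant's implementation: the application does not say how the N-bit value of a website element is produced, so truncating an MD5 digest is an assumption made here, and the names `feature_bits`, `fingerprint`, and `hash_fn` are hypothetical. Bit 0 is taken to be the least significant bit, which is also only a convention of this sketch.

```python
import hashlib


def feature_bits(feature, n_bits=64):
    """Assumed S3: map a feature string to an n_bits-wide integer (n_bits <= 128)."""
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") >> (128 - n_bits)


def fingerprint(weighted_features, n_bits=64, hash_fn=feature_bits):
    """Compute an n_bits-wide fingerprint from a dict of feature -> weight."""
    v = [0] * n_bits  # S4: the N-dimensional vector V, initially all zero
    for feature, weight in weighted_features.items():
        bits = hash_fn(feature, n_bits)
        for i in range(n_bits):
            if (bits >> i) & 1:
                v[i] += weight  # bit is 1: add the feature weight
            else:
                v[i] -= weight  # bit is 0: subtract the feature weight
    fp = 0
    for i, value in enumerate(v):  # S5: positive dimension -> bit 1, else 0
        if value > 0:
            fp |= 1 << i
    return fp
```

As a small worked case with N = 3, two elements with binary values 101 (weight 3) and 001 (weight 1) give V = (4, -4, 2) when read from bit 0 upward, so bits 0 and 2 of the fingerprint are set and bit 1 is cleared.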
As shown in Fig. 2, the web page fingerprint generation method of the invention has good versatility. The web pages to which it applies contain the following website elements: HTML text, HTTP response headers, URL feature values, and host features.
The identifiable HTML text elements include: distinctive text; tags in the HTML such as title, keywords, and description; distinctive classes, that is, certain div tags in the HTML that carry a specific class attribute; meta tag features; and the HTML source code (distinctive class names) and its comment sections.
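As an illustration of how such HTML features might be collected, the sketch below uses Python's standard `html.parser` module. The class name `HtmlFeatureParser` and its attributes are hypothetical; the application does not prescribe a parser or a data layout.

```python
from html.parser import HTMLParser


class HtmlFeatureParser(HTMLParser):
    """Collect the title, named meta tags, div class names, and comments."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}            # meta name (lowercased) -> content
        self.div_classes = set()  # class names seen on div tags
        self.comments = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content"):
            self.meta[attrs["name"].lower()] = attrs["content"]
        elif tag == "div" and attrs.get("class"):
            self.div_classes.update(attrs["class"].split())

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

    def handle_comment(self, data):
        self.comments.append(data.strip())
```

After `parser.feed(html_text)`, the collected values can be compared against known signatures or turned into weighted features.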
The identifiable HTTP response header elements include: probing with an HTTP request and observing the response headers, with particular attention to the Server, X-Powered-By, and Set-Cookie fields of the response header. For example, one characteristic of Struts2 is that it sets a distinctive JSESSIONID cookie field. Other examples are the Server field and WP-Super-Cache.
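A minimal sketch of reading these header fields is shown below. The function name `header_features` and the `name=value` feature strings are assumptions made for illustration, and the sketch handles only a single Set-Cookie value.

```python
def header_features(headers):
    """Extract simple fingerprint features from a dict of response headers."""
    # Header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    features = []
    for name in ("server", "x-powered-by"):
        if name in lowered:
            features.append(f"{name}={lowered[name]}")
    if "set-cookie" in lowered:
        # Keep only the cookie name; the value changes from session to session.
        cookie_name = lowered["set-cookie"].split("=", 1)[0].strip()
        features.append(f"cookie={cookie_name}")
    return features
```

A response carrying `Set-Cookie: JSESSIONID=...` would therefore contribute the feature `cookie=JSESSIONID`.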
The identifiable URL feature values include: characteristic paths, such as readme.html, wp-content/uploads/, and wp-admin; the default error page, that is, the error page or HTTP response code of the website, such as the Apache default 404 page; robots.txt, since a CMS has a default robots.txt whose text directly states the CMS name and version; and matching of special files such as favicon.ico. For the latter, the favicon.ico image is downloaded, its MD5 digest is computed, and the icon library is checked for an icon with the same MD5. A method that costs more computation but retrieves better is similar-image retrieval (based on SIFT or other image features). Besides favicon.ico, there are other logo files, such as /img/buttom_logo.gif.
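The favicon comparison can be sketched as a digest lookup. The function `identify_favicon` and the `icon_library` mapping (MD5 hex digest to product name) are hypothetical; downloading the icon and building the library are outside this sketch.

```python
import hashlib


def identify_favicon(icon_bytes, icon_library):
    """Return the library entry whose MD5 matches the icon, or None."""
    digest = hashlib.md5(icon_bytes).hexdigest()
    return icon_library.get(digest)
```

An exact digest match only recognises byte-identical icons, which is why the text mentions similar-image retrieval as the more expensive alternative.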
The identifiable host features include: default port features, judged from the open ports of the host on which the website resides; OS fingerprint, since the operating system can be tentatively determined from website features and TCP responses; and SSL certificate, that is, the SSL certificate information.
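A judgment based on default ports might look like the sketch below. The port table lists a few conventional default ports and is illustrative only; the names `DEFAULT_PORT_HINTS` and `port_features` are hypothetical, and the port scan that produces `open_ports` is assumed to have been done elsewhere.

```python
# Illustrative table of conventional default ports (an assumption of this sketch).
DEFAULT_PORT_HINTS = {80: "http", 443: "https", 3306: "mysql", 6379: "redis"}


def port_features(open_ports):
    """Turn a collection of open port numbers into host feature strings."""
    return [
        f"port:{port}={DEFAULT_PORT_HINTS[port]}"
        for port in sorted(open_ports)
        if port in DEFAULT_PORT_HINTS
    ]
```

A service can run on a non-default port, so a feature of this kind is a hint rather than proof of what is running.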
Claims (5)
1. A web page fingerprint generation method, characterized by comprising the following steps:
identifying web page content to obtain website elements, wherein the website is requested over HTTP or HTTPS, keyword matches against the response body or header information are obtained, regular-expression matches against the response body or header information are obtained, and service information is identified based on TCP/IP request protocols;
obtaining the weighted features in the website elements to form a feature set;
generating an N-bit binary value for each website element corresponding to a weighted feature, each website element corresponding to its own distinct binary value;
forming an N-dimensional vector V (V1, V2, ..., VN) from the N-bit binary values, calculating each dimension separately, and updating the N-dimensional vector;
generating, from the updated N-dimensional vector, a new binary value as the website fingerprint information.
2. The web page fingerprint generation method according to claim 1, characterized in that each dimension of the N-dimensional vector V is calculated as follows: if the corresponding bit of a website element's binary value is 1, the feature weight of that website element is added to the dimension; if the bit is 0, the feature weight of that website element is subtracted from the dimension.
3. The web page fingerprint generation method according to claim 1, characterized in that if the i-th dimension of the vector V is positive, the i-th bit of the N-bit fingerprint is set to 1; if the i-th dimension of the vector V is non-positive, the i-th bit of the N-bit fingerprint is set to 0.
4. The web page fingerprint generation method according to claim 1, characterized in that the binary value generated for each website element is multi-bit, usually 64 bits.
5. The web page fingerprint generation method according to claim 1, characterized in that the website elements include: HTML text, HTTP response headers, URL feature values, and host features.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910469738.5A CN110245314A (en) | 2019-05-31 | 2019-05-31 | A kind of web page fingerprint generation method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110245314A true CN110245314A (en) | 2019-09-17 |
Family
ID=67885689
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910469738.5A Pending CN110245314A (en) | 2019-05-31 | 2019-05-31 | A kind of web page fingerprint generation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110245314A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111061934A (en) * | 2019-11-27 | 2020-04-24 | 西安四叶草信息技术有限公司 | Fingerprint identification method, equipment and storage medium |
CN113949564A (en) * | 2021-10-15 | 2022-01-18 | 天津大学 | Website fingerprint identification method based on resource loading tree |
CN114221787A (en) * | 2021-11-17 | 2022-03-22 | 广东机电职业技术学院 | Network security processing method, system and storage medium based on time strategy |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104572787A (en) * | 2013-10-29 | 2015-04-29 | 腾讯科技(深圳)有限公司 | Method and device for recognizing pseudo original website |
CN107679073A (en) * | 2017-08-25 | 2018-02-09 | 中国科学院信息工程研究所 | A kind of quick similarity matching methods of compressed webpage fingerprint base construction method and compressed webpage |
CN109376291A (en) * | 2018-11-08 | 2019-02-22 | 杭州安恒信息技术股份有限公司 | A kind of method and device of the website fingerprint information scanning based on web crawlers |
Non-Patent Citations (1)
Title |
---|
GURMEET SINGH MANKU: "Detecting Near-Duplicates for Web Crawling", 《WWW '07: PROCEEDINGS OF THE 16TH INTERNATIONAL CONFERENCE ON WORLD WIDE WEB》 * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111061934A (en) * | 2019-11-27 | 2020-04-24 | 西安四叶草信息技术有限公司 | Fingerprint identification method, equipment and storage medium |
CN111061934B (en) * | 2019-11-27 | 2023-04-07 | 西安四叶草信息技术有限公司 | Fingerprint identification method, equipment and storage medium |
CN113949564A (en) * | 2021-10-15 | 2022-01-18 | 天津大学 | Website fingerprint identification method based on resource loading tree |
CN113949564B (en) * | 2021-10-15 | 2022-10-14 | 天津大学 | Website fingerprint identification method based on resource loading tree |
CN114221787A (en) * | 2021-11-17 | 2022-03-22 | 广东机电职业技术学院 | Network security processing method, system and storage medium based on time strategy |
CN114221787B (en) * | 2021-11-17 | 2023-12-19 | 广东机电职业技术学院 | Network security processing method, system and storage medium based on time strategy |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190917 |