CN110245314A - A kind of web page fingerprint generation method - Google Patents

A kind of web page fingerprint generation method

Info

Publication number
CN110245314A
Authority
CN
China
Prior art keywords
web page
website
dimensional vector
generation method
web
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910469738.5A
Other languages
Chinese (zh)
Inventor
张振伟
邬鹏程
陈可义
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Baida Wisdom Network Technology Co Ltd
Original Assignee
Jiangsu Baida Wisdom Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Baida Wisdom Network Technology Co Ltd

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/958: Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Collating Specific Patterns (AREA)

Abstract

The invention discloses a web page fingerprint generation method. Web page content is identified to obtain website elements; the weighted features of the website elements are obtained to form a feature set; an N-bit binary value is generated for each website element corresponding to a weighted feature; the N-bit binary values are combined into an N-dimensional vector V, each dimension of which is calculated separately to update the vector; and a new binary value is generated from the updated vector as the website fingerprint. The fingerprint makes it possible to quickly determine a website's composition and changes from its web pages, and to determine whether a page or the site it belongs to is exposed to the latest threats or attacks. Content extracted from generic web pages is complete, recognition accuracy is higher, versatility is better, and network security risk is effectively reduced.

Description

A kind of web page fingerprint generation method
Technical field
The invention relates to computer fingerprinting technology, and in particular to a method for generating web page fingerprints.
Background art
Web page fingerprint identification is a concrete application of fingerprinting technology in practical scenarios. In many networked information exchanges the website is an important information carrier, and the web page, as the basic element of a website, plays a very important role: people visit websites and browse pages to obtain information or to leave their own information on a page. For network behavior monitoring, therefore, identifying a target and classifying a website cannot be separated from classifying its web pages.
From a web page it is possible to quickly identify information such as the content management system (CMS), blog platform, statistics/analytics packages, JavaScript libraries, web server, and embedded devices. A component is the smallest unit of cyberspace; web applications, databases, middleware and the like are all components. A fingerprint is a piece of characteristic information on a component that identifies the type of the object. It is used in the information-gathering stage for a website to quickly identify the target service, and identifying components makes it possible to determine how the web deployment has evolved.
At present, as the importance of the internet to social development and business activity grows rapidly, the requirements on network security management keep rising. Web pages carry more and more information. Using web page identification to quickly determine a website's composition and changes, and to determine whether a page or the site it belongs to is exposed to the latest threats or attacks, is one way of assessing a website's security from the outside. Conventional methods for identifying generic websites extract website content incompletely, have low recognition accuracy, and generalize poorly.
Summary of the invention
To address the above shortcomings, the invention provides a web page fingerprint generation method with high recognition accuracy and good versatility.
The technical solution of the invention is a web page fingerprint generation method, characterized by comprising the following steps:
Identify the web page content to obtain website elements: request the website over HTTP or HTTPS and apply keyword matching to the response body or header information, apply regular-expression matching to the response body or header information, and identify service information based on TCP/IP protocol requests;
Obtain the weighted features of the website elements to form a feature set;
Generate an N-bit binary value for each website element corresponding to a weighted feature, each website element having its own distinct binary value;
Combine the N-bit binary values into an N-dimensional vector V = (V1, V2, ..., VN), calculate each dimension of the vector separately, and update the vector;
Generate a new binary value from the updated N-dimensional vector as the website fingerprint.
Further, in the web page fingerprint generation method, each dimension of the N-dimensional vector V is calculated as follows: if the bit of a website element's binary value that corresponds to the dimension is 1, the feature weight of that website element is added to the dimension; if the bit is 0, the feature weight of that website element is subtracted from it.
Further, in the web page fingerprint generation method, if the i-th dimension of vector V is positive, the i-th bit of the N-bit fingerprint is set to 1; if the i-th dimension of vector V is non-positive, the i-th bit of the N-bit fingerprint is set to 0.
Further, in the web page fingerprint generation method, the binary value generated for each website element has multiple bits, usually 64.
Further, in the web page fingerprint generation method, the website elements include: HTML text, HTTP response headers, URL feature values, and host features.
Beneficial effects of the invention: the invention provides a web page fingerprint generation method in which the feature weights of the website elements are used in an update calculation to generate the web page fingerprint. The fingerprint makes it possible to quickly determine a website's composition and changes from its web pages, and to determine whether a page or the site it belongs to is exposed to the latest threats or attacks. Content extracted from generic web pages is complete, recognition accuracy is higher, versatility is better, and network security risk is effectively reduced.
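The text does not say how two fingerprints are compared in order to detect a change. One common approach for bit-vector fingerprints of this kind is to count differing bits (the Hamming distance). The sketch below assumes that approach; the function names and the threshold of 3 bits are hypothetical choices made for illustration, not values taken from the source.

```python
def hamming_distance(fp_a: int, fp_b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(fp_a ^ fp_b).count("1")


def has_changed(fp_old: int, fp_new: int, threshold: int = 3) -> bool:
    """Report a change when more than `threshold` bits differ.

    The threshold is an assumption; a suitable value would have to be
    tuned on real data.
    """
    return hamming_distance(fp_old, fp_new) > threshold
```

A small distance would suggest that the site's composition is largely unchanged between two scans, while a large one would suggest that it is worth a closer look.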
Brief description of the drawings
Fig. 1 is a block diagram of the web page fingerprint generation method of the invention;
Fig. 2 is a diagram of an example implementation of the web page fingerprint generation method of the invention.
Detailed description of the embodiments
The technical solution of the invention is further described below with reference to the accompanying drawings:
As shown in Fig. 1, a web page fingerprint generation method comprises the following steps:
S1: Identify the web page content to obtain website elements: request the website over HTTP or HTTPS and apply keyword matching to the response body or header information, apply regular-expression matching to the response body or header information, and identify service information based on TCP/IP protocol requests;
S2: Obtain the weighted features of the website elements to form a feature set;
S3: Generate an N-bit binary value for each website element corresponding to a weighted feature, each website element having its own distinct binary value. The binary value generated for each website element has multiple bits, usually 64.
S4: Combine the N-bit binary values into an N-dimensional vector V = (V1, V2, ..., VN), calculate each dimension of the vector separately, and update the vector;
Each dimension of the N-dimensional vector V is calculated as follows: if the bit of a website element's binary value that corresponds to the dimension is 1, the feature weight of that website element is added to the dimension; if the bit is 0, the feature weight of that website element is subtracted from it.
S5: Generate a new binary value from the updated N-dimensional vector as the website fingerprint. If the i-th dimension of vector V is positive, the i-th bit of the N-bit fingerprint is set to 1; if the i-th dimension of vector V is non-positive, the i-th bit of the N-bit fingerprint is set to 0.
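Steps S3 to S5 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the applicant's implementation: the text does not name the function that produces each element's N-bit value, so the leading bits of a SHA-256 digest are used here, and the names `element_hash` and `web_fingerprint` as well as the sample weights are hypothetical.

```python
import hashlib
from typing import Mapping

N_BITS = 64  # the text says the per-element value is "usually 64" bits


def element_hash(element: str, n_bits: int = N_BITS) -> int:
    """S3: map a website element to an n-bit integer (n_bits <= 256).

    Assumption: the leading n_bits of a SHA-256 digest stand in for the
    unspecified hash function.
    """
    digest = hashlib.sha256(element.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") >> (len(digest) * 8 - n_bits)


def web_fingerprint(weighted_elements: Mapping[str, float],
                    n_bits: int = N_BITS) -> int:
    """S4 and S5: fold weighted element hashes into one n-bit fingerprint."""
    v = [0.0] * n_bits  # assumed to start at zero
    for element, weight in weighted_elements.items():
        h = element_hash(element, n_bits)
        for i in range(n_bits):
            # S4: a 1 bit adds the weight, a 0 bit subtracts it
            if (h >> i) & 1:
                v[i] += weight
            else:
                v[i] -= weight
    fingerprint = 0
    for i in range(n_bits):
        # S5: positive dimension -> bit 1, non-positive -> bit 0
        if v[i] > 0:
            fingerprint |= 1 << i
    return fingerprint
```

A call such as `web_fingerprint({"server:nginx": 3.0, "path:wp-admin": 1.0})` returns a single 64-bit integer. Because every element contributes to every dimension in proportion to its weight, heavily weighted elements dominate the resulting bits.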
As shown in Fig. 2, the web page fingerprint generation method of the invention is broadly applicable. The web pages to which it can be applied contain the following website elements: HTML text, HTTP response headers, URL feature values, and host features.
The HTML text elements that can be identified include: distinctive text; tags in the HTML such as title, keywords, and description; distinctive classes, that is, div tags in the HTML that carry a specific class attribute; meta tag features; and the HTML source code (distinctive class names) together with its comments.
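As a rough illustration of how such HTML features might be collected, the sketch below uses Python's standard `html.parser` module. The class name, the function name, and the `prefix:value` feature format are assumptions made for this example; the source does not prescribe a parser or a feature encoding.

```python
from html.parser import HTMLParser
from typing import List


class HtmlFeatureParser(HTMLParser):
    """Collect title text, selected meta tags, div classes, and comments."""

    def __init__(self) -> None:
        super().__init__()
        self.features: List[str] = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attr.get("name") in ("keywords", "description"):
            self.features.append(f"meta:{attr['name']}={attr.get('content') or ''}")
        elif tag == "div" and attr.get("class"):
            # one feature per class name on the div
            for cls in attr["class"].split():
                self.features.append(f"div.class:{cls}")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and data.strip():
            self.features.append(f"title:{data.strip()}")

    def handle_comment(self, data):
        if data.strip():
            self.features.append(f"comment:{data.strip()}")


def extract_html_features(html: str) -> List[str]:
    """Return the feature strings found in an HTML document."""
    parser = HtmlFeatureParser()
    parser.feed(html)
    parser.close()
    return parser.features
```

Each returned string could then be assigned a weight and treated as one website element in the fingerprint calculation.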
The HTTP response header elements that can be identified are obtained by probing with HTTP requests and observing the response headers, with particular attention to the Server, X-Powered-By, and Set-Cookie fields of the response header. One feature of Struts2, for example, is that it sets a distinctive JSESSIONID cookie field. Other examples are the Server field and WP-Super-Cache.
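The regular-expression matching on header information mentioned in step S1 might look like the following sketch. It works on an already captured header mapping, so no network access is needed. The rule table is hypothetical and far smaller than any real rule base; the labels and the function name `match_headers` are likewise assumptions for illustration.

```python
import re
from typing import List, Mapping, Tuple

# (header name, regular expression, label): hypothetical rules, not from the source
HEADER_RULES: List[Tuple[str, str, str]] = [
    ("server", r"nginx", "server:nginx"),
    ("server", r"apache", "server:apache"),
    ("x-powered-by", r"php", "lang:php"),
    ("set-cookie", r"\bJSESSIONID=", "cookie:JSESSIONID"),
]


def match_headers(headers: Mapping[str, str]) -> List[str]:
    """Return the labels of all rules whose regex matches the named header."""
    # header names are case-insensitive, so normalise them first
    lowered = {name.lower(): value for name, value in headers.items()}
    labels = []
    for name, pattern, label in HEADER_RULES:
        value = lowered.get(name)
        if value is not None and re.search(pattern, value, re.IGNORECASE):
            labels.append(label)
    return labels
```

For a response carrying `Server: nginx/1.18.0` and a `JSESSIONID` cookie, the function returns the two matching labels in rule order.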
The URL feature values that can be identified include: certain characteristic paths, such as readme.html, wp-content/uploads/, and wp-admin; the default error page, that is, the site's error page or HTTP status code, such as the Apache default 404 page; robots.txt, since a CMS ships a default robots.txt whose text directly states the CMS name and version; and matching of special files such as favicon.ico. For the latter, the favicon.ico image is downloaded, its MD5 digest is computed, and the icon library is checked for an icon with the same MD5. A more computationally expensive method with better retrieval results is similar-image retrieval (based on SIFT or other image features). Besides favicon.ico there are other logo files, such as /img/buttom_logo.gif.
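The favicon comparison described above reduces to a digest lookup. The sketch below assumes the icon bytes have already been downloaded and that the icon library is a simple mapping from MD5 hex digest to product name; both the library layout and the function names are hypothetical.

```python
import hashlib
from typing import Mapping, Optional


def favicon_md5(icon_bytes: bytes) -> str:
    """Return the MD5 hex digest of the raw favicon bytes."""
    return hashlib.md5(icon_bytes).hexdigest()


def lookup_icon(icon_bytes: bytes,
                icon_library: Mapping[str, str]) -> Optional[str]:
    """Return the product recorded for this icon, or None if it is unknown.

    `icon_library` maps MD5 hex digests to product names (assumed layout).
    """
    return icon_library.get(favicon_md5(icon_bytes))
```

An exact digest match only recognises byte-identical icons, which is why the text mentions similar-image retrieval as the more expensive but more effective alternative.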
The host features that can be identified include: default port features, judged from the open ports of the host on which the website runs; the OS fingerprint, since the operating system can be tentatively determined from website features and TCP responses; and the SSL certificate, that is, SSL certificate information.

Claims (5)

1. A web page fingerprint generation method, characterized by comprising the following steps:
Identify the web page content to obtain website elements: request the website over HTTP or HTTPS and apply keyword matching to the response body or header information, apply regular-expression matching to the response body or header information, and identify service information based on TCP/IP protocol requests;
Obtain the weighted features of the website elements to form a feature set;
Generate an N-bit binary value for each website element corresponding to a weighted feature, each website element having its own distinct binary value;
Combine the N-bit binary values into an N-dimensional vector V = (V1, V2, ..., VN), calculate each dimension of the vector separately, and update the vector;
Generate a new binary value from the updated N-dimensional vector as the website fingerprint.
2. The web page fingerprint generation method according to claim 1, characterized in that each dimension of the N-dimensional vector V is calculated as follows: if the bit of a website element's binary value that corresponds to the dimension is 1, the feature weight of that website element is added to the dimension; if the bit is 0, the feature weight of that website element is subtracted from it.
3. The web page fingerprint generation method according to claim 1, characterized in that if the i-th dimension of vector V is positive, the i-th bit of the N-bit fingerprint is set to 1; if the i-th dimension of vector V is non-positive, the i-th bit of the N-bit fingerprint is set to 0.
4. The web page fingerprint generation method according to claim 1, characterized in that the binary value generated for each website element has multiple bits, usually 64.
5. The web page fingerprint generation method according to claim 1, characterized in that the website elements include: HTML text, HTTP response headers, URL feature values, and host features.
CN201910469738.5A 2019-05-31 2019-05-31 A kind of web page fingerprint generation method Pending CN110245314A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910469738.5A CN110245314A (en) 2019-05-31 2019-05-31 A kind of web page fingerprint generation method

Publications (1)

Publication Number Publication Date
CN110245314A true CN110245314A (en) 2019-09-17

Family

ID=67885689

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910469738.5A Pending CN110245314A (en) 2019-05-31 2019-05-31 A kind of web page fingerprint generation method

Country Status (1)

Country Link
CN (1) CN110245314A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111061934A (en) * 2019-11-27 2020-04-24 西安四叶草信息技术有限公司 Fingerprint identification method, equipment and storage medium
CN113949564A (en) * 2021-10-15 2022-01-18 天津大学 Website fingerprint identification method based on resource loading tree
CN114221787A (en) * 2021-11-17 2022-03-22 广东机电职业技术学院 Network security processing method, system and storage medium based on time strategy

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104572787A (en) * 2013-10-29 2015-04-29 腾讯科技(深圳)有限公司 Method and device for recognizing pseudo original website
CN107679073A (en) * 2017-08-25 2018-02-09 中国科学院信息工程研究所 A kind of quick similarity matching methods of compressed webpage fingerprint base construction method and compressed webpage
CN109376291A (en) * 2018-11-08 2019-02-22 杭州安恒信息技术股份有限公司 A kind of method and device of the website fingerprint information scanning based on web crawlers

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Gurmeet Singh Manku: "Detecting Near-Duplicates for Web Crawling", WWW '07: Proceedings of the 16th International Conference on World Wide Web *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111061934A (en) * 2019-11-27 2020-04-24 西安四叶草信息技术有限公司 Fingerprint identification method, equipment and storage medium
CN111061934B (en) * 2019-11-27 2023-04-07 西安四叶草信息技术有限公司 Fingerprint identification method, equipment and storage medium
CN113949564A (en) * 2021-10-15 2022-01-18 天津大学 Website fingerprint identification method based on resource loading tree
CN113949564B (en) * 2021-10-15 2022-10-14 天津大学 Website fingerprint identification method based on resource loading tree
CN114221787A (en) * 2021-11-17 2022-03-22 广东机电职业技术学院 Network security processing method, system and storage medium based on time strategy
CN114221787B (en) * 2021-11-17 2023-12-19 广东机电职业技术学院 Network security processing method, system and storage medium based on time strategy

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190917