CN110049052A - Malicious domain name detection method based on the similarity of DOM-tree tags and attributes - Google Patents

Malicious domain name detection method based on the similarity of DOM-tree tags and attributes

Info

Publication number
CN110049052A
Authority
CN
China
Prior art keywords
domain name
DOM tree
malicious
binary string
tag
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910327562.XA
Other languages
Chinese (zh)
Inventor
张兆心
刘晓燕
程亚楠
许海燕
闫健恩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology Weihai
Original Assignee
Harbin Institute of Technology Weihai
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology Weihai
Priority to CN201910327562.XA
Publication of CN110049052A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L61/00 Network arrangements, protocols or services for addressing or naming
    • H04L61/45 Network directories; Name-to-address mapping
    • H04L61/4505 Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols
    • H04L61/4511 Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols using domain name system [DNS]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic

Abstract

The present invention provides a method for detecting malicious domain names based on the similarity of DOM-tree tags and attributes, which solves the technical problem that existing malicious domain name detection methods have a low detection rate and poor accuracy. The method comprises: collecting a set of malicious domain names, converting the set into binary strings, and storing them in a database; converting a domain name of unknown type into a binary string; and comparing the binary string of the unknown-type domain name with the binary strings of the malicious domain name set in the database, judging the maliciousness of the unknown-type domain name by the similarity between them. The invention can be widely applied in network security systems.

Description

Malicious domain name detection method based on the similarity of DOM-tree tags and attributes
Technical field
The present invention relates to a method for detecting malicious domain names, and in particular to a method for detecting malicious domain names based on the similarity of DOM-tree tags and attributes.
Background art
In recent years, the sustained growth in the number of malicious domain names of all kinds has posed a grave threat to users' personal privacy, property security, and even physical and mental health, and the existence of malicious domain names seriously hampers the healthy development of the Internet. Although the number of malicious domain names is huge, in practice the registrants of malicious domain names, aiming to generate such names cheaply, quickly, and in bulk, register large numbers of different domain names whose web page structures are identical or similar.
At present, malicious domain name detection methods that study the web pages behind malicious domain names rely on web page content. Because web page content changes constantly, detecting malicious domain names by content similarity yields a low detection rate, greatly impairs the recognition of malicious web pages, and gives poor accuracy.
Summary of the invention
To address the technical problem that existing malicious domain name detection methods have a low detection rate and poor accuracy, the present invention provides an accurate and efficient method for detecting malicious domain names based on the similarity of DOM-tree tags and attributes.
To this end, the technical solution of the present invention is a method for detecting malicious domain names based on the similarity of DOM-tree tags and attributes, comprising:
collecting a set of malicious domain names, converting the set into binary strings, and storing them in a database;
converting a domain name of unknown type into a binary string;
comparing the binary string of the unknown-type domain name with the binary strings of the malicious domain name set in the database, and judging the maliciousness of the unknown-type domain name by the similarity between them.
Preferably, the steps of converting the malicious domain name set into binary strings are:
(1) obtaining, for each domain name in the malicious domain name set, the HTML document present once the corresponding web page has finished loading;
(2) constructing the DOM tree corresponding to the HTML document;
(3) extracting from each DOM tree the tag names of the nodes within a certain number of levels and all of their attribute names, and converting the text sequence of the extracted tag names and attribute names into a binary string.
Preferably, the DOM tree corresponding to the HTML document is constructed by parsing the HTML document into a DOM tree with a third-party Python parsing library.
Preferably, the text sequence is constructed from the tag names and attribute names of the DOM tree as follows: for the DOM tree of each domain name in the set, each node within a certain number of levels is visited according to a given traversal method, and the tag names and attribute names of the visited nodes are extracted, turning the DOM tree structure into a text sequence.
Preferably, the DOM tree is traversed breadth-first.
Preferably, the DOM tree is a DOM tree that contains the hierarchical relationships between nodes.
Preferably, the text sequence of tag names and attribute names is converted into a binary string with the Simhash algorithm.
Preferably, a domain name of unknown type is converted into a binary string by the same process used for the malicious domain name set.
Preferably, the maliciousness of an unknown-type domain name is judged from binary strings as follows: the binary string of the unknown-type domain name is compared with the binary strings of the malicious domain name set; when the similarity measure exceeds the threshold, the type of the domain name cannot be determined; when the similarity measure is within the threshold, the two are deemed similar and the unknown-type domain name is detected as malicious.
Preferably, when comparing binary-string similarity, the binary string of the unknown-type domain name is compared pairwise, one by one, with each binary string in the database; the Hamming distance between the two is calculated and used to measure their similarity.
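The comparison described in the last two items can be stated compactly. The notation is introduced here for illustration only: s_u is the binary string of the unknown-type domain name, M is the set of stored malicious binary strings, L is the string length, and tau is the threshold. The patent does not fix L or tau, and treating "within the threshold" as "less than or equal to" is an assumption.

```latex
d_H(a, b) = \sum_{i=1}^{L} \mathbf{1}\left[a_i \neq b_i\right],
\qquad
\operatorname{label}(u) =
\begin{cases}
\text{malicious} & \text{if } \min_{m \in M} d_H(s_u, s_m) \le \tau, \\
\text{undetermined} & \text{otherwise.}
\end{cases}
```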
The beneficial effects of the present invention are:
(1) judging maliciousness by comparing binary strings improves both the accuracy and the efficiency of the judgment;
(2) the HTML document of a domain name is represented by a hierarchical, ordered DOM tree: each node in the tree corresponds to an element in the document, and each edge corresponds to the parent-child relationship between two nodes. The web page structure of a malicious domain name can therefore be represented accurately by the constructed DOM tree, which supports accurate comparison between domain names and a more accurate judgment;
(3) the HTML document is parsed into a DOM tree with a third-party Python parsing library. This approach calls an existing library directly; the third-party library provides powerful parsing functions and more reliably produces the same result as browser parsing;
(4) breadth-first traversal is used, that is, the nodes of each level are visited in order, level by level, which preserves both the original hierarchy of the DOM tree and the adjacency of the nodes within each level;
(5) the text sequence is converted into a binary string with the Simhash algorithm. Simhash is a locality-sensitive hashing algorithm that produces identical or similar hash codes for nearly identical text, so the similarity of the hash codes directly reflects the similarity of the inputs, which greatly improves detection precision;
(6) judging similarity by the Hamming distance between binary strings identifies the difference between the two accurately and efficiently, and avoids uncertainty about the domain name type to some extent;
(7) starting from the DOM tree of the domain name's HTML document, the DOM tree structure is turned into a text sequence, and the text sequence is converted into a binary string with the locality-sensitive Simhash algorithm. Structural similarity comparison is thereby reduced to pairwise similarity comparison between binary strings, which greatly simplifies the model structure, speeds up similarity comparison, saves computing resources, and largely reduces the harm that malicious domain names do to the healthy development of the Internet.
Brief description of the drawings
Fig. 1 is a flowchart of an embodiment of the present invention.
Detailed description of the embodiments
The present invention is further described below with reference to an embodiment.
Fig. 1 shows the flowchart of the method for detecting malicious domain names based on the similarity of DOM-tree tags and attributes. The flow drawn with solid lines is the processing of the domain name set of known type, and the flow drawn with dashed lines is the maliciousness detection of a domain name of unknown type.
Analysis of DOM trees shows that the shallow nodes of the tree have the greater influence on web page structure: if two web pages are dissimilar, their shallow node information differs greatly; if two web pages are similar, their shallow tag information is close, and the differences grow with depth. The similarity of web page structure is therefore measured starting from the DOM tree structure of malicious domain names, which leads to a method for detecting malicious domain names based on the similarity of DOM-tree tags and attributes.
As shown in Fig. 1, a method for detecting malicious domain names based on the similarity of DOM-tree tags and attributes proceeds as follows. A web crawler collects a large set of malicious domain names of known type from authoritative third-party domain name detection websites. An automated process simulates a user opening the URL of each domain name, and the HTML document present once the corresponding web page has finished loading is obtained for each domain name in the set. A third-party Python parsing library parses each HTML document into a DOM tree that contains the hierarchical relationships between nodes. For each DOM tree in the set, the nodes in the first N levels are visited breadth-first, and the parsing library is used to extract the tag name of every node down to and including level N together with all attribute names of that tag. The extracted tag names and attribute names of all nodes down to and including level N form the text sequence of each domain name in the set, and each text sequence is converted into a binary string with the Simhash algorithm and stored in the database. Because the nodes are visited according to a fixed traversal algorithm, the resulting text sequence still reflects the DOM tree structure of the domain name.
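The extraction step in the paragraph above can be sketched as follows. This is a minimal illustration, not the patent's implementation: it swaps the unnamed third-party parsing library for Python's standard-library `html.parser` so that it runs without dependencies, and the names `Node`, `TreeBuilder`, `tag_attr_sequence`, and `max_depth` (the N of the text) are hypothetical. It also performs no browser-style error recovery and does not cover obtaining the fully loaded page.

```python
from collections import deque
from html.parser import HTMLParser

# Void elements never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """One element: tag name, attribute names, and child elements."""

    def __init__(self, tag, attr_names):
        self.tag = tag
        self.attr_names = attr_names
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a simplified element tree (assumes reasonably well-formed HTML)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document", [])
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, [name for name, _ in attrs])
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest open element with this tag name, if any.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break


def tag_attr_sequence(html_text, max_depth):
    """Breadth-first list of tag and attribute names for the first max_depth levels."""
    builder = TreeBuilder()
    builder.feed(html_text)
    builder.close()
    tokens = []
    queue = deque((child, 1) for child in builder.root.children)
    while queue:
        node, depth = queue.popleft()
        tokens.append(node.tag)
        tokens.extend(node.attr_names)
        if depth < max_depth:  # do not descend below level N
            queue.extend((child, depth + 1) for child in node.children)
    return tokens
```

Only tag names and attribute names enter the sequence; attribute values and text content are ignored, which matches the text's focus on structure rather than content.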
Parsing the HTML document into a DOM tree with a third-party Python parsing library calls an existing library directly; the third-party library provides powerful parsing functions and more reliably produces the same result as browser parsing. The common depth-first traversal yields the subtree paths of the DOM tree and considers only the parent-child relationships of the DOM, without preserving the hierarchy of the tree or the adjacency of sibling nodes. This application instead uses breadth-first traversal, visiting the nodes of each level in order, level by level, which preserves both the original hierarchy of the DOM tree and the adjacency of the nodes within each level.
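The difference between the two traversal orders can be seen on a toy tree. The sketch below is illustrative only; the tree shape and the function names `bfs_order` and `dfs_order` are made up for the example.

```python
from collections import deque

# Toy tree as nested (tag, children) tuples; the shape is invented for illustration.
TREE = ("html", [
    ("head", [("title", [])]),
    ("body", [("div", [("p", [])]), ("img", [])]),
])


def bfs_order(root):
    """Visit nodes level by level, left to right."""
    order, queue = [], deque([root])
    while queue:
        tag, children = queue.popleft()
        order.append(tag)
        queue.extend(children)
    return order


def dfs_order(root):
    """Visit each subtree completely before moving to the next sibling."""
    tag, children = root
    order = [tag]
    for child in children:
        order.extend(dfs_order(child))
    return order
```

Breadth-first order gives `html, head, body, title, div, img, p`, so siblings such as `head` and `body` stay adjacent and each level forms one contiguous run. Depth-first order gives `html, head, title, body, div, p, img`, which interleaves levels.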
A domain name of unknown type is converted into a binary string in the same way as the domain names of known type. Its binary string is compared pairwise, one by one, with each binary string in the database, and the Hamming distance between the two is calculated to measure their similarity. When the Hamming distance exceeds the set threshold, the type of the domain name cannot be determined; when the Hamming distance is within the set threshold, the two are deemed similar and the unknown-type domain name is detected as malicious. Judging similarity by the Hamming distance between binary strings identifies the difference between the two accurately and efficiently, and avoids uncertainty about the domain name type to some extent.
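A minimal sketch of this comparison, assuming fingerprints are held as Python integers and that "within the threshold" means less than or equal to it. The function names, the return labels, and the linear scan over the database are illustrative choices; the patent does not state a threshold value.

```python
from typing import Iterable


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions at which two fingerprints differ."""
    return bin(a ^ b).count("1")


def classify(unknown: int, known_malicious: Iterable[int], threshold: int) -> str:
    """Compare one fingerprint against every stored malicious fingerprint.

    Returns "malicious" as soon as one stored fingerprint is within
    `threshold` bits; otherwise the type stays "undetermined", because
    a large distance says nothing about whether the domain is benign.
    """
    for fingerprint in known_malicious:
        if hamming_distance(unknown, fingerprint) <= threshold:
            return "malicious"
    return "undetermined"
```

For example, `0b1011` and `0b1101` differ in two bit positions, so with a threshold of 2 the unknown fingerprint `0b1011` would be labelled malicious against a database containing `0b1101`.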
The HTML document of a domain name is represented by a hierarchical, ordered DOM tree: each node in the tree corresponds to an element in the document, and each edge corresponds to the parent-child relationship between two nodes. The web page structure of a malicious domain name can therefore be represented accurately by the constructed DOM tree, which supports accurate comparison between domain names and a more accurate judgment.
Starting from the DOM tree of the domain name's HTML document, the DOM tree structure is turned into a text sequence, and the text sequence is converted into a binary string with the locality-sensitive Simhash algorithm. Structural similarity comparison is thereby reduced to pairwise similarity comparison between binary strings, which greatly simplifies the model structure, speeds up similarity comparison, saves computing resources, and largely reduces the harm that malicious domain names do to the healthy development of the Internet.
The text sequence is converted into a binary string with the Simhash algorithm. Simhash is a locality-sensitive hashing algorithm, first proposed by Google and used by Google for web page deduplication. For nearly identical text it produces identical or similar hash codes, so the similarity of the hash codes directly reflects the similarity of the inputs, which greatly improves detection precision.
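A minimal Simhash sketch over a token sequence is given below. Several choices are assumptions, since the patent does not specify them: a 64-bit fingerprint, SHA-256 as the per-token hash, and unit weight for every token. The function names are hypothetical.

```python
import hashlib


def simhash(tokens, bits=64):
    """Fold a token sequence into a `bits`-wide fingerprint (unit weight per token)."""
    counts = [0] * bits
    mask = (1 << bits) - 1
    for token in tokens:
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        h = int.from_bytes(digest, "big") & mask  # keep the low `bits` bits
        for i in range(bits):
            # Each token votes +1 for a set bit and -1 for a clear bit.
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, count in enumerate(counts):
        if count > 0:  # majority of tokens had this bit set
            fingerprint |= 1 << i
    return fingerprint


def to_binary_string(fingerprint, bits=64):
    """Render the fingerprint as a fixed-width string of 0s and 1s."""
    return format(fingerprint, "0{}b".format(bits))
```

Because every bit is decided by a majority vote across tokens, changing a few tokens flips only a few bits, which is what makes Hamming distance a meaningful similarity measure. With single-token features as used here, the fingerprint does not depend on token order; hashing overlapping runs of tokens (n-grams) instead would be one way to make it order-sensitive, though the patent does not say which features it hashes.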
The above is only a specific embodiment of the present invention and does not limit the scope in which the invention may be practiced. Substitutions of equivalent components, and equivalent changes and modifications made within the scope of protection of this patent, all remain within the scope covered by the claims of the present invention.

Claims (10)

1. A method for detecting malicious domain names based on the similarity of DOM-tree tags and attributes, characterized in that the method comprises:
collecting a set of malicious domain names, converting the set into binary strings, and storing them in a database;
converting a domain name of unknown type into a binary string;
comparing the binary string of the unknown-type domain name with the binary strings of the malicious domain name set in the database, and judging the maliciousness of the unknown-type domain name by the similarity between them.
2. The method for detecting malicious domain names based on the similarity of DOM-tree tags and attributes according to claim 1, characterized in that the steps of converting the malicious domain name set into binary strings are:
(1) obtaining, for each domain name in the malicious domain name set, the HTML document present once the corresponding web page has finished loading;
(2) constructing the DOM tree corresponding to the HTML document;
(3) extracting from each DOM tree the tag names of the nodes within a certain number of levels and all of their attribute names, and converting the text sequence of the extracted tag names and attribute names into a binary string.
3. The method for detecting malicious domain names based on the similarity of DOM-tree tags and attributes according to claim 2, characterized in that the DOM tree corresponding to the HTML document is constructed by parsing the HTML document into a DOM tree with a third-party Python parsing library.
4. The method for detecting malicious domain names based on the similarity of DOM-tree tags and attributes according to claim 2, characterized in that the text sequence is constructed from the tag names and attribute names of the DOM tree as follows: for the DOM tree of each domain name in the set, each node within a certain number of levels is visited according to a given traversal method, and the tag names and attribute names of the visited nodes are extracted, turning the DOM tree structure into a text sequence.
5. The method for detecting malicious domain names based on the similarity of DOM-tree tags and attributes according to claim 4, characterized in that the DOM tree is traversed breadth-first.
6. The method for detecting malicious domain names based on the similarity of DOM-tree tags and attributes according to claim 2, characterized in that the DOM tree is a DOM tree that contains the hierarchical relationships between nodes.
7. The method for detecting malicious domain names based on the similarity of DOM-tree tags and attributes according to claim 6, characterized in that the text sequence of tag names and attribute names is converted into a binary string with the Simhash algorithm.
8. The method for detecting malicious domain names based on the similarity of DOM-tree tags and attributes according to claim 7, characterized in that a domain name of unknown type is converted into a binary string by the same process used for the malicious domain name set.
9. The method for detecting malicious domain names based on the similarity of DOM-tree tags and attributes according to claim 8, characterized in that the maliciousness of an unknown-type domain name is judged from binary strings as follows: the binary string of the unknown-type domain name is compared with the binary strings of the malicious domain name set; when the similarity measure exceeds the threshold, the type of the domain name cannot be determined; when the similarity measure is within the threshold, the two are deemed similar and the unknown-type domain name is detected as malicious.
10. The method for detecting malicious domain names based on the similarity of DOM-tree tags and attributes according to claim 9, characterized in that, when comparing binary-string similarity, the binary string of the unknown-type domain name is compared pairwise, one by one, with each binary string in the database; the Hamming distance between the two is calculated and used to measure their similarity.
CN201910327562.XA 2019-04-23 2019-04-23 Malicious domain name detection method based on the similarity of DOM-tree tags and attributes Pending CN110049052A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910327562.XA CN110049052A (en) 2019-04-23 2019-04-23 Malicious domain name detection method based on the similarity of DOM-tree tags and attributes

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910327562.XA CN110049052A (en) 2019-04-23 2019-04-23 Malicious domain name detection method based on the similarity of DOM-tree tags and attributes

Publications (1)

Publication Number Publication Date
CN110049052A (en) 2019-07-23

Family

ID=67278561

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910327562.XA Pending CN110049052A (en) 2019-04-23 2019-04-23 Malicious domain name detection method based on the similarity of DOM-tree tags and attributes

Country Status (1)

Country Link
CN (1) CN110049052A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7765236B2 (en) * 2007-08-31 2010-07-27 Microsoft Corporation Extracting data content items using template matching
CN102316081A (en) * 2010-06-30 2012-01-11 北京启明星辰信息技术股份有限公司 Method and device for identifying similar webpage
CN105528357A (en) * 2014-09-30 2016-04-27 中国银联股份有限公司 Webpage content extraction method based on similarity of URLs and similarity of webpage document structures
CN107204960A (en) * 2016-03-16 2017-09-26 阿里巴巴集团控股有限公司 Web page identification method and device, server

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7765236B2 (en) * 2007-08-31 2010-07-27 Microsoft Corporation Extracting data content items using template matching
CN102316081A (en) * 2010-06-30 2012-01-11 北京启明星辰信息技术股份有限公司 Method and device for identifying similar webpage
CN105528357A (en) * 2014-09-30 2016-04-27 中国银联股份有限公司 Webpage content extraction method based on similarity of URLs and similarity of webpage document structures
CN107204960A (en) * 2016-03-16 2017-09-26 阿里巴巴集团控股有限公司 Web page identification method and device, server

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111884813A (en) * 2020-08-05 2020-11-03 哈尔滨工业大学(威海) Malicious certificate detection method
CN111884813B (en) * 2020-08-05 2022-03-25 哈尔滨工业大学(威海) Malicious certificate detection method
CN112214737A (en) * 2020-11-10 2021-01-12 山东比特智能科技股份有限公司 Method, system, device and medium for identifying picture-based fraudulent webpage
CN112214737B (en) * 2020-11-10 2022-06-24 山东比特智能科技股份有限公司 Method, system, device and medium for identifying picture-based fraudulent webpage
CN117081865A (en) * 2023-10-17 2023-11-17 北京启天安信科技有限公司 Network security defense system based on malicious domain name detection method
CN117081865B (en) * 2023-10-17 2023-12-29 北京启天安信科技有限公司 Network security defense system based on malicious domain name detection method

Similar Documents

Publication Publication Date Title
Resch et al. Combining machine-learning topic models and spatiotemporal analysis of social media data for disaster footprint and damage assessment
CN102841920B (en) Method and device for extracting webpage frame information
CN103514234B (en) A kind of page info extracting method and device
CN103853738B (en) A kind of recognition methods of info web correlation region
Flatow et al. On the accuracy of hyper-local geotagging of social media content
CN103491205B (en) The method for pushing of a kind of correlated resources address based on video search and device
Kovbasistyi et al. Method for detection of non-relevant and wrong information based on content analysis of web resources
CN107967208A (en) A kind of Python resource sensitive defect code detection methods based on deep neural network
CN110049052A (en) The malice domain name detection method of label and attribute similarity based on dom tree
CN110472066A (en) A kind of construction method of urban geography semantic knowledge map
US9519718B2 (en) Webpage information detection method and system
CN106484764A (en) User's similarity calculating method based on crowd portrayal technology
CN104899243B (en) The method and device of detection point of interest POI data accuracy
CN106202041A (en) A kind of method and apparatus of the entity alignment problem solved in knowledge mapping
CN104699835A (en) Method and device used for determining webpages including POI (point of interest) data
CN104182420A (en) Ontology-based Chinese name disambiguation method
CN102915361B (en) Webpage text extracting method based on character distribution characteristic
CN110377747A (en) A kind of knowledge base fusion method towards encyclopaedia website
CN106547770A (en) A kind of user's classification based on address of theenduser information, user identification method and device
CN108111526A (en) A kind of illegal website method for digging based on abnormal WHOIS information
CN103023874B (en) A kind of detection method for phishing site
CN105279086A (en) Flow chart-based method for automatically detecting logic loopholes of electronic commerce websites
CN107102993A (en) A kind of user's demand analysis method and device
CN106096040A (en) Organization web ownership place method of discrimination based on search engine and device thereof
CN108536664A (en) The knowledge fusion method in commodity field

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190723