CN110049052A

CN110049052A - Malicious domain name detection method based on the similarity of DOM tree tags and attributes

Info

Publication number: CN110049052A
Application number: CN201910327562.XA
Authority: CN
Inventors: 张兆心; 刘晓燕; 程亚楠; 许海燕; 闫健恩
Original assignee: Harbin Institute of Technology Weihai
Current assignee: Harbin Institute of Technology Weihai
Priority date: 2019-04-23
Filing date: 2019-04-23
Publication date: 2019-07-23

Abstract

The present invention provides a malicious domain name detection method based on the similarity of DOM tree tags and attributes, which solves the technical problem that existing malicious domain name detection methods have a low detection rate and poor accuracy. The method comprises: collecting a set of domain names of malicious type, converting the malicious domain name set into binary strings, and storing them in a database; converting a domain name of unknown type into a binary string; and comparing the binary string of the unknown-type domain name with the binary strings of the malicious domain name set in the database, and judging the maliciousness of the unknown-type domain name from the similarity between the two. The invention can be widely applied in network security systems.

Description

Malicious domain name detection method based on the similarity of DOM tree tags and attributes

Technical field

The present invention relates to a malicious domain name detection method, and more particularly to a malicious domain name detection method based on the similarity of DOM tree tags and attributes.

Background art

In recent years, the sustained growth in the number of malicious domain names of all kinds has posed a serious threat to users' personal privacy, property security, and even physical and mental health, and the existence of malicious domain names seriously hampers the healthy development of the Internet. Although the number of malicious domain names is huge, in practice their registrants, in order to generate malicious domain names cheaply, quickly, and in bulk, register large numbers of different domain names whose web page structures are identical or similar.

At present, malicious domain name detection methods that study the web pages corresponding to malicious domain names rely on web page content. However, because web page content changes constantly, detecting malicious domain names from the similarity of their web page content yields a low detection rate, which greatly reduces the recognition rate for malicious domain name web pages, and the accuracy is poor.

Summary of the invention

To address the technical problem that existing malicious domain name detection methods have a low detection rate and poor accuracy, the present invention provides a malicious domain name detection method based on the similarity of DOM tree tags and attributes that has high accuracy and high efficiency.

To this end, the technical solution of the present invention is a malicious domain name detection method based on the similarity of DOM tree tags and attributes, comprising:

collecting a set of domain names of malicious type, converting the malicious domain name set into binary strings, and storing them in a database;

converting a domain name of unknown type into a binary string;

comparing the binary string of the unknown-type domain name with the binary strings of the malicious domain name set in the database, and judging the maliciousness of the unknown-type domain name from the similarity between the two.

Preferably, the steps of converting the malicious domain name set into binary strings are:

(1) obtaining, for each domain name in the malicious domain name set, the HTML document after the corresponding web page has finished loading;

(2) constructing the DOM tree corresponding to the HTML document;

(3) extracting from each DOM tree the tag names of the nodes within a certain number of levels and all of their attribute names, and converting the text sequence of the extracted tag names and attribute names into a binary string.

Preferably, the DOM tree corresponding to the HTML document is constructed by parsing the HTML document into a DOM tree using a third-party Python parsing library.

Preferably, the text sequence is constructed from the tag names and attribute names of the DOM tree as follows: for the DOM tree of each domain name in the domain name set, each node within a certain number of levels is traversed according to a given search traversal method, and the tag name and attribute names of each node are extracted, thereby converting the DOM tree structure into a text sequence.

Preferably, the search traversal method for the DOM tree is breadth-first traversal.

Preferably, the DOM tree is a DOM tree that preserves the hierarchical relationships between nodes.

Preferably, the text sequence of tag names and attribute names is converted into a binary string using the Simhash algorithm.

Preferably, the process of converting the unknown-type domain name into a binary string is the same as the process of converting the malicious domain name set into binary strings.

Preferably, the maliciousness of the unknown-type domain name is judged from the binary strings as follows: the binary string of the unknown-type domain name is compared with the binary strings of the malicious domain name set; when the similarity measure exceeds the threshold, the type of the domain name cannot be determined; when the similarity measure is within the threshold, the two are deemed similar, and the unknown-type domain name is thereby detected as malicious.

Preferably, when comparing binary string similarity, the binary string of the unknown-type domain name is compared pairwise, one by one, with each binary string in the database; the Hamming distance between the two is calculated and used to measure their similarity.

The beneficial effects of the present invention are:

(1) Judging the maliciousness of a domain name by comparing binary strings improves both the accuracy and the efficiency of the judgment;

(2) The HTML document of a domain name is represented by a hierarchical, ordered DOM tree in which each node corresponds to an element of the document and each edge corresponds to the parent-child relationship between two nodes. The web page structure of a malicious domain name can therefore be represented accurately by the constructed DOM tree, which facilitates accurate comparison between domain names and yields a more accurate judgment;

(3) The HTML document is parsed into a DOM tree using a third-party Python parsing library. This approach can directly call an existing library; the third-party library provides powerful parsing functions and more reliably produces the same result as browser parsing;

(4) Breadth-first traversal is used, that is, the nodes of each level are traversed in order, level by level, which preserves both the original hierarchy of the DOM tree and the adjacency of the nodes within each level;

(5) The text sequence is converted into a binary string using the Simhash algorithm. Simhash is a locality-sensitive hashing algorithm that produces identical or similar hash codes for nearly identical text; that is, the similarity of the hash codes directly reflects the similarity of the inputs, which greatly improves detection precision;

(6) Judging the similarity of two binary strings by calculating the Hamming distance between them makes it possible to determine the difference between the two accurately, with high efficiency and high accuracy, and avoids uncertainty about the domain name type to some extent;

(7) Starting from the DOM tree of the domain name's HTML document, the DOM tree structure is converted into a text sequence, and the text sequence is converted into a binary string using the locality-sensitive Simhash algorithm, so that the comparison of structural similarity is ultimately reduced to pairwise similarity comparison between binary strings. This greatly simplifies the model structure, speeds up similarity comparison, saves computing resources, and largely reduces the harm that malicious domain names do to the healthy development of the Internet.

Brief description of the drawings

Fig. 1 is a flow chart of an embodiment of the present invention.

Detailed description of the embodiments

The present invention is described further below with reference to an embodiment.

Fig. 1 shows the flow chart of the malicious domain name detection method based on the similarity of DOM tree tags and attributes. The flow drawn with solid lines is the processing of the known-type domain name set, and the flow drawn with dashed lines is the maliciousness detection of an unknown-type domain name.

Analysis of DOM trees shows that the shallow nodes of the tree have a large influence on web page structure: if two web pages are dissimilar, their shallow node information differs greatly; if two web pages are similar, their shallow tag information is close, but the differences grow with depth. Therefore, to measure the similarity of web page structure starting from the DOM tree structure of malicious domain names, a malicious domain name detection method based on the similarity of DOM tree tags and attributes is proposed.

As shown in Fig. 1, a malicious domain name detection method based on the similarity of DOM tree tags and attributes comprises: using a web crawler to collect a large set of malicious domain names of known type from authoritative third-party domain name detection websites; automatically simulating a user opening the URL corresponding to each domain name, and obtaining the HTML document after the corresponding web page has finished loading; parsing the HTML document, using a third-party Python parsing library, into a DOM tree that preserves the hierarchical relationships between nodes; for each DOM tree of the domain name set, traversing the nodes in the first N levels of the DOM tree by breadth-first traversal and, with the help of the third-party Python parsing library, extracting the tag name of each node up to and including level N together with all attribute names of that tag; constructing, from the extracted tag names and attribute names of all nodes up to and including level N, the text sequence of each domain name in the set; and converting each text sequence into a binary string using the Simhash algorithm and storing it in the database. Because the nodes are traversed according to a fixed search traversal algorithm, the resulting text sequence still reflects the DOM tree structure of the domain name.
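
The extraction step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the source uses an unnamed third-party Python parsing library, whereas this sketch substitutes the standard-library `html.parser` so that it is self-contained (it does not reproduce browser-grade error recovery). The names `Node`, `DomBuilder`, `text_sequence`, and `max_depth` (standing in for N), and the choice to count the root element as level 1, are hypothetical.

```python
from collections import deque
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """One DOM element: tag name, attribute names, and child elements."""

    def __init__(self, tag, attr_names):
        self.tag = tag
        self.attr_names = attr_names
        self.children = []


class DomBuilder(HTMLParser):
    """Builds a simplified, element-only DOM tree from an HTML string."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document", [])
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, [name for name, _ in attrs])
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open element, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break


def text_sequence(html, max_depth):
    """Breadth-first walk of the first max_depth levels (assumed: root = level 1).

    Returns the tag name of each visited node followed by its attribute names.
    """
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    tokens = []
    queue = deque((child, 1) for child in builder.root.children)
    while queue:
        node, depth = queue.popleft()
        tokens.append(node.tag)
        tokens.extend(node.attr_names)
        if depth < max_depth:
            queue.extend((child, depth + 1) for child in node.children)
    return tokens


page = ('<html lang="en"><head><title>t</title></head>'
        '<body class="c" id="b"><div id="x"><p>hi</p></div></body></html>')
print(text_sequence(page, 3))
# ['html', 'lang', 'head', 'body', 'class', 'id', 'title', 'div', 'id']
```

With `max_depth=3` the `<p>` element at level 4 is left out, which matches the idea of keeping only the shallow levels of the tree.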

The HTML document is parsed into a DOM tree using a third-party Python parsing library; this approach can directly call an existing library, and the third-party library provides powerful parsing functions and more reliably produces the same result as browser parsing. The common depth-first traversal yields subtree paths of the DOM tree; it considers only the parent-child relationships of the DOM and preserves neither the hierarchy of the DOM tree nor the adjacency of sibling nodes. In contrast, the present application uses breadth-first traversal, that is, the nodes of each level are traversed in order, level by level, which preserves both the original hierarchy of the DOM tree and the adjacency of the nodes within each level.
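
The difference between the two traversal orders can be seen on a small example. The sketch below is illustrative only; the toy tree `TREE` and the function names `breadth_first` and `depth_first` are hypothetical and do not come from the source.

```python
from collections import deque

# Toy DOM as nested (tag, children) tuples; contents are illustrative only.
TREE = ("html", [
    ("head", [("title", []), ("meta", [])]),
    ("body", [("div", [("p", [])]), ("script", [])]),
])


def breadth_first(root):
    """Visit nodes level by level, left to right within each level."""
    order, queue = [], deque([root])
    while queue:
        tag, children = queue.popleft()
        order.append(tag)
        queue.extend(children)
    return order


def depth_first(root):
    """Visit a node, then each of its subtrees in turn (pre-order)."""
    tag, children = root
    order = [tag]
    for child in children:
        order.extend(depth_first(child))
    return order


print(breadth_first(TREE))
# ['html', 'head', 'body', 'title', 'meta', 'div', 'script', 'p']
print(depth_first(TREE))
# ['html', 'head', 'title', 'meta', 'body', 'div', 'p', 'script']
```

In the breadth-first order the siblings `head` and `body` stay adjacent and every level is emitted as one contiguous run, whereas the depth-first order separates them by the whole `head` subtree.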

The domain name of unknown type is converted into a binary string in the same way as the known-type domain names. The binary string of the unknown-type domain name is compared pairwise, one by one, with each binary string in the database; the Hamming distance between the two is calculated and used to measure their similarity. When the Hamming distance exceeds the set threshold, the type of the domain name cannot be determined; when the Hamming distance is within the set threshold, the two are deemed similar, and the unknown-type domain name is thereby detected as malicious. Judging the similarity of two binary strings by calculating the Hamming distance between them makes it possible to determine the difference between the two accurately, with high efficiency and high accuracy, and avoids uncertainty about the domain name type to some extent.
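
The comparison step can be sketched as below. This is a hedged illustration: the source does not state a threshold value or the fingerprint length, so the 8-bit fingerprints, the threshold of 2, the reading of "within the threshold" as less than or equal, and the names `hamming_distance` and `classify` are all assumptions.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def classify(fingerprint, malicious_fingerprints, threshold):
    """Compare one fingerprint against every stored malicious fingerprint.

    Returns "malicious" if any stored fingerprint lies within the threshold
    (assumed to mean distance <= threshold); otherwise the type stays
    undetermined, as in the source.
    """
    for known in malicious_fingerprints:
        if hamming_distance(fingerprint, known) <= threshold:
            return "malicious"
    return "undetermined"


# Hypothetical 8-bit fingerprints standing in for the database contents.
database = [0b10110010, 0b01001101]

print(classify(0b10110110, database, threshold=2))  # malicious (distance 1)
print(classify(0b11111111, database, threshold=2))  # undetermined (distance 4 to both)
```

Note that a result of "undetermined" is not a verdict of benign: the method only asserts maliciousness when a sufficiently close match exists in the database.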

The HTML document of a domain name is represented by a hierarchical, ordered DOM tree in which each node corresponds to an element of the document and each edge corresponds to the parent-child relationship between two nodes. The web page structure of a malicious domain name can therefore be represented accurately by the constructed DOM tree, which facilitates accurate comparison between domain names and yields a more accurate judgment.

Starting from the DOM tree of the domain name's HTML document, the DOM tree structure is converted into a text sequence, and the text sequence is converted into a binary string using the locality-sensitive Simhash algorithm, so that the comparison of structural similarity is ultimately reduced to pairwise similarity comparison between binary strings. This greatly simplifies the model structure, speeds up similarity comparison, saves computing resources, and largely reduces the harm that malicious domain names do to the healthy development of the Internet.

The text sequence is converted into a binary string using the Simhash algorithm. Simhash is a locality-sensitive hashing algorithm that Google adopted early on for web page deduplication. For nearly identical text it produces identical or similar hash codes; that is, the similarity of the hash codes directly reflects the similarity of the inputs, which greatly improves detection precision.
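
A minimal Simhash sketch is given below for orientation. The source does not specify the fingerprint length, the per-token hash, or any token weighting, so the 64-bit length, the use of SHA-256 from `hashlib`, the equal weights, and the function name `simhash` are assumptions.

```python
import hashlib


def simhash(tokens, bits=64):
    """Return a `bits`-wide Simhash fingerprint of a token sequence.

    Each token is hashed; every bit position receives +1 if the token's hash
    has that bit set and -1 otherwise. The fingerprint bit is 1 wherever the
    accumulated vote is positive. `bits` is assumed to be a multiple of 8
    and at most 256.
    """
    votes = [0] * bits
    for token in tokens:
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, vote in enumerate(votes):
        if vote > 0:
            fingerprint |= 1 << i
    return fingerprint


sequence = ["html", "lang", "head", "body", "class", "id", "title", "div", "id"]
fingerprint = simhash(sequence)
print(format(fingerprint, "064b"))  # the binary string that would be stored
```

Because each token contributes only one vote per bit, changing a few tokens in a long sequence flips few fingerprint bits, which is what makes the Hamming distance a meaningful similarity measure. This token-level variant ignores token order; hashing overlapping token pairs instead of single tokens is one possible way to retain ordering, though the source does not say how its sequence is tokenized.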

The above is merely a specific embodiment of the present invention and shall not be taken to limit the scope in which the invention may be practiced. Accordingly, substitutions of equivalent components, and equivalent changes and modifications made within the scope of protection of this patent, shall still fall within the scope covered by the claims of the present invention.

Claims

1. A malicious domain name detection method based on the similarity of DOM tree tags and attributes, characterized in that the method comprises:

collecting a set of domain names of malicious type, converting the malicious domain name set into binary strings, and storing them in a database;

converting a domain name of unknown type into a binary string;

comparing the binary string of the unknown-type domain name with the binary strings of the malicious domain name set in the database, and judging the maliciousness of the unknown-type domain name from the similarity between the two.

2. The malicious domain name detection method based on the similarity of DOM tree tags and attributes according to claim 1, characterized in that the steps of converting the malicious domain name set into binary strings are:

(1) obtaining, for each domain name in the malicious domain name set, the HTML document after the corresponding web page has finished loading;

(2) constructing the DOM tree corresponding to the HTML document;

(3) extracting from each DOM tree the tag names of the nodes within a certain number of levels and all of their attribute names, and converting the text sequence of the extracted tag names and attribute names into a binary string.

3. The malicious domain name detection method based on the similarity of DOM tree tags and attributes according to claim 2, characterized in that the DOM tree corresponding to the HTML document is constructed by parsing the HTML document into a DOM tree using a third-party Python parsing library.

4. The malicious domain name detection method based on the similarity of DOM tree tags and attributes according to claim 2, characterized in that the text sequence is constructed from the tag names and attribute names of the DOM tree as follows: for the DOM tree of each domain name in the domain name set, each node within a certain number of levels is traversed according to a given search traversal method, and the tag name and attribute names of each node are extracted, thereby converting the DOM tree structure into a text sequence.

5. The malicious domain name detection method based on the similarity of DOM tree tags and attributes according to claim 4, characterized in that the search traversal method for the DOM tree is breadth-first traversal.

6. The malicious domain name detection method based on the similarity of DOM tree tags and attributes according to claim 2, characterized in that the DOM tree is a DOM tree that preserves the hierarchical relationships between nodes.

7. The malicious domain name detection method based on the similarity of DOM tree tags and attributes according to claim 6, characterized in that the text sequence of tag names and attribute names is converted into a binary string using the Simhash algorithm.

8. The malicious domain name detection method based on the similarity of DOM tree tags and attributes according to claim 7, characterized in that the process of converting the unknown-type domain name into a binary string is the same as the process of converting the malicious domain name set into binary strings.

9. The malicious domain name detection method based on the similarity of DOM tree tags and attributes according to claim 8, characterized in that the maliciousness of the unknown-type domain name is judged from the binary strings as follows: the binary string of the unknown-type domain name is compared with the binary strings of the malicious domain name set; when the similarity measure exceeds the threshold, the type of the domain name cannot be determined; when the similarity measure is within the threshold, the two are deemed similar, and the unknown-type domain name is thereby detected as malicious.

10. The malicious domain name detection method based on the similarity of DOM tree tags and attributes according to claim 9, characterized in that, when comparing binary string similarity, the binary string of the unknown-type domain name is compared pairwise, one by one, with each binary string in the database; the Hamming distance between the two is calculated and used to measure their similarity.