CN110049052A - Malicious domain name detection method based on DOM tree tag and attribute similarity
- Publication number
- CN110049052A (application CN201910327562.XA)
- Authority
- CN
- China
- Prior art keywords
- domain name
- DOM tree
- malicious
- binary string
- tag
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L61/00—Network arrangements, protocols or services for addressing or naming
- H04L61/45—Network directories; Name-to-address mapping
- H04L61/4505—Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols
- H04L61/4511—Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols using domain name system [DNS]
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1441—Countermeasures against malicious traffic
Abstract
The present invention provides a malicious domain name detection method based on the similarity of DOM tree tags and attributes, which addresses the technical problem that existing malicious domain name detection methods have low detection rates and poor accuracy. The method comprises: collecting a set of domain names of malicious type, converting the set into binary strings, and storing them in a database; converting a domain name of unknown type into a binary string; and comparing the binary string of the unknown domain name with the binary strings of the malicious domain name set in the database, the maliciousness of the unknown domain name being judged from their similarity. The invention can be widely applied in network security systems.
Description
Technical field
The present invention relates to a malicious domain name detection method, and more particularly to a malicious domain name detection method based on the similarity of DOM tree tags and attributes.
Background
In recent years, the sustained growth in the number of malicious domain names of all kinds has posed a serious threat to users' privacy, property, and even physical and mental health, and the existence of malicious domain names seriously hampers the healthy development of the Internet. Although the number of malicious domain names is huge, in practice their registrants, in order to generate large numbers of malicious domain names quickly and at low cost, register many different domain names whose web page structures are identical or similar.
Existing detection methods that study the web pages corresponding to malicious domain names rely on web page content. Because web page content changes continually, detecting malicious domain names from the similarity of page content yields a low detection rate, which greatly impairs the recognition of malicious web pages and results in poor accuracy.
Summary of the invention
To address the technical problem that existing malicious domain name detection methods have low detection rates and poor accuracy, the present invention provides an accurate and efficient malicious domain name detection method based on the similarity of DOM tree tags and attributes.
To this end, the technical solution of the invention is a malicious domain name detection method based on the similarity of DOM tree tags and attributes, comprising:
collecting a set of domain names of malicious type, converting the set into binary strings, and storing them in a database;
converting a domain name of unknown type into a binary string;
comparing the binary string of the unknown domain name with the binary strings of the malicious domain name set in the database, and judging the maliciousness of the unknown domain name from their similarity.
Preferably, the steps for converting the malicious domain name set into binary strings are:
(1) obtaining, for each domain name in the set, the HTML document of the corresponding web page after the page has finished loading;
(2) constructing the DOM tree corresponding to the HTML document;
(3) extracting from each DOM tree the tag names of the nodes within a certain number of layers together with all of their attribute names, and converting the text sequence of extracted tag names and attribute names into a binary string.
Preferably, the DOM tree corresponding to the HTML document is constructed by parsing the HTML document into a DOM tree with a third-party Python parsing library.
Preferably, the text sequence is constructed from the tag names and attribute names of the DOM tree as follows: for the DOM tree of each domain name in the set, every node within a certain number of layers is visited according to a chosen traversal method, and the tag name and attribute names of each visited node are extracted, so that the DOM tree structure is converted into a text sequence.
Preferably, the traversal method used for the DOM tree is breadth-first traversal.
Preferably, the DOM tree is a DOM tree that preserves the hierarchical relationships between nodes.
Preferably, the text sequence of tag names and attribute names is converted into a binary string with the Simhash algorithm.
Preferably, a domain name of unknown type is converted into a binary string by the same process used to convert the malicious domain name set into binary strings.
Preferably, the maliciousness of a domain name of unknown type is judged from binary strings as follows: the binary string of the unknown domain name is compared with the binary strings of the malicious domain name set; when the similarity measure (the Hamming distance described in the next paragraph) exceeds the threshold, the type of the domain name cannot be determined; when it is within the threshold, the two are deemed similar and the unknown domain name is detected as malicious.
Preferably, when binary string similarity is compared, the binary string of the unknown domain name is compared one by one with each binary string in the database, the Hamming distance between the two is calculated, and the Hamming distance is used to measure their similarity.
The beneficial effects of the present invention are:
(1) judging the maliciousness of a domain name by comparing binary strings improves both the accuracy and the efficiency of the judgment;
(2) the HTML document of a domain name is represented by a hierarchical, ordered DOM tree in which each node corresponds to an element of the document and each edge corresponds to the parent-child relationship between two nodes; the page structure of a malicious domain name can thus be represented accurately by the constructed DOM tree, which enables accurate comparison between domain names and a more accurate judgment;
(3) parsing the HTML document into a DOM tree with a third-party Python parsing library allows existing libraries to be called directly; such libraries provide powerful parsing functions and better ensure that the result matches what a browser would parse;
(4) breadth-first traversal visits the nodes of each layer in order, layer by layer, which preserves both the original hierarchy of the DOM tree and the adjacency of the nodes within each layer;
(5) the text sequence is converted into a binary string with the Simhash algorithm, a locality-sensitive hashing algorithm that produces identical or similar hash codes for nearly identical text, so that the similarity of the hash codes directly reflects the similarity of the inputs, which greatly improves detection precision;
(6) judging similarity from the Hamming distance between binary strings distinguishes the two accurately and efficiently, and reduces to some extent the uncertainty about the type of a domain name;
(7) starting from the DOM tree of a domain name's HTML document, the DOM tree structure is converted into a text sequence and the text sequence into a binary string by the locality-sensitive Simhash algorithm, so that the comparison of structural similarity is reduced to pairwise comparison of binary strings; this greatly simplifies the model, speeds up similarity comparison, saves computing resources, and substantially reduces the harm that malicious domain names do to the healthy development of the Internet.
Brief description of the drawings
Fig. 1 is a flowchart of an embodiment of the present invention.
Detailed description of the embodiments
The present invention is further described below with reference to an embodiment.
Fig. 1 shows the flowchart of the malicious domain name detection method based on the similarity of DOM tree tags and attributes. The flow drawn with solid lines is the processing of the domain name set of known type; the flow drawn with dashed lines is the detection of the maliciousness of a domain name of unknown type.
Analysis of DOM trees shows that the shallow nodes of the tree have the greatest influence on page structure: if two pages are dissimilar, their shallow-node information differs greatly; if two pages are similar, their shallow tag information is close, and the differences grow with depth. The invention therefore measures page-structure similarity starting from the DOM tree structure of malicious domain names, and studies a malicious domain name detection method based on the similarity of DOM tree tags and attributes.
As shown in Fig. 1, a malicious domain name detection method based on the similarity of DOM tree tags and attributes proceeds as follows. A web crawler collects a large set of malicious domain names of known type from authoritative third-party domain name detection websites. The URL corresponding to each domain name is opened by automatically simulating a user, and the HTML document of the web page is obtained after the page has finished loading. A third-party Python parsing library parses the HTML document into a DOM tree that preserves the hierarchical relationships between nodes. For each DOM tree in the set, the nodes in the first N layers are traversed breadth-first, and the parsing library is used to extract the tag name of every node up to and including layer N together with all attribute names of that tag. The extracted tag names and attribute names form the text sequence of each domain name in the set. Each text sequence is then converted into a binary string with the Simhash algorithm and stored in the database. Because the nodes are visited according to a fixed traversal algorithm, the resulting text sequence still reflects the DOM tree structure of the domain name.
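The extraction step described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the source names only "a third-party Python parsing library", so the standard-library `html.parser` is swapped in here to keep the sketch self-contained, and the names `Node`, `TreeBuilder`, `text_sequence`, and `max_depth` (the N of the text) are hypothetical. The sample page is also made up.

```python
from collections import deque
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """One DOM element: tag name, attribute names, and child elements."""

    def __init__(self, tag, attr_names):
        self.tag = tag
        self.attr_names = attr_names
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a simplified element tree (stand-in for a third-party parser)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document", [])
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, [name for name, _ in attrs])
        self._stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self._stack.append(node)

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/>: add the node but do not descend.
        self._stack[-1].children.append(Node(tag, [name for name, _ in attrs]))

    def handle_endtag(self, tag):
        # Pop back to the nearest open element with this tag, if any.
        for i in range(len(self._stack) - 1, 0, -1):
            if self._stack[i].tag == tag:
                del self._stack[i:]
                break


def text_sequence(html, max_depth):
    """Breadth-first list of tag and attribute names in the first max_depth layers."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    tokens = []
    queue = deque((child, 1) for child in builder.root.children)
    while queue:
        node, depth = queue.popleft()
        tokens.append(node.tag)
        tokens.extend(node.attr_names)
        if depth < max_depth:
            queue.extend((child, depth + 1) for child in node.children)
    return tokens


page = ('<html lang="en"><head><title>t</title></head>'
        '<body class="x" id="y"><div><p>hi</p></div></body></html>')
print(text_sequence(page, 3))
# ['html', 'lang', 'head', 'body', 'class', 'id', 'title', 'div']
```

With `max_depth` set to 3 the `<p>` element in layer 4 is left out, which matches the observation that shallow layers carry most of the structural signal. A real browser-grade parser would also repair malformed markup, which this simplified builder does not attempt.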
Parsing the HTML document into a DOM tree with a third-party Python parsing library allows existing libraries to be called directly; such libraries provide powerful parsing functions and better ensure that the result matches what a browser would parse. A conventional depth-first traversal yields subtree paths of the DOM tree: it captures only the parent-child relationships and preserves neither the layering of the tree nor the adjacency of sibling nodes. The present application instead uses breadth-first traversal, visiting the nodes of each layer in order, layer by layer, which preserves both the original hierarchy of the DOM tree and the adjacency of the nodes within each layer.
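The difference between the two traversal orders can be seen on a small example. The sketch below uses a made-up toy tree of `(tag, children)` tuples; the function names are illustrative only and do not come from the source.

```python
from collections import deque

# Hypothetical toy tree: each node is (tag, [children]).
tree = ("html", [
    ("head", [("title", [])]),
    ("body", [("div", [("p", [])]), ("footer", [])]),
])


def depth_first(node):
    """Pre-order traversal: follows each subtree path to the bottom first."""
    tag, children = node
    order = [tag]
    for child in children:
        order.extend(depth_first(child))
    return order


def breadth_first(node):
    """Level-order traversal: emits every node of a layer before the next layer."""
    order, queue = [], deque([node])
    while queue:
        tag, children = queue.popleft()
        order.append(tag)
        queue.extend(children)
    return order


print(depth_first(tree))    # ['html', 'head', 'title', 'body', 'div', 'p', 'footer']
print(breadth_first(tree))  # ['html', 'head', 'body', 'title', 'div', 'footer', 'p']
```

In the breadth-first output the siblings `head` and `body` stay adjacent and each layer forms a contiguous run, whereas the depth-first output interleaves layers along subtree paths.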
A domain name of unknown type is converted into a binary string in the same way as the domain names of known type. Its binary string is compared one by one with each binary string in the database: the Hamming distance between the two is calculated and used to measure their similarity. When the Hamming distance exceeds the set threshold, the type of the domain name cannot be determined; when the Hamming distance is within the set threshold, the two are deemed similar and the unknown domain name is detected as malicious. Judging similarity from the Hamming distance between binary strings distinguishes the two accurately and efficiently, and reduces to some extent the uncertainty about the type of a domain name.
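The comparison rule above can be sketched as follows, with fingerprints held as Python integers. The 8-bit values, the threshold of 1, and the choice to treat "within the threshold" as less than or equal are assumptions made for illustration; the source does not state a fingerprint length or a threshold value.

```python
def hamming_distance(a, b):
    """Number of bit positions in which two integer fingerprints differ."""
    return bin(a ^ b).count("1")


def classify(unknown, malicious_fingerprints, threshold):
    """Return 'malicious' if any stored fingerprint is within the threshold."""
    for known in malicious_fingerprints:
        # Assumption: "within the threshold" is read as distance <= threshold.
        if hamming_distance(unknown, known) <= threshold:
            return "malicious"
    # No stored fingerprint is close enough: the type cannot be determined.
    return "undetermined"


database = [0b10110010, 0b01001101]   # made-up 8-bit fingerprints
print(classify(0b10110011, database, threshold=1))  # malicious (distance 1)
print(classify(0b11111111, database, threshold=1))  # undetermined (distance 4 to both)
```

Note that the second outcome is "undetermined" rather than "benign": as the text says, a large distance only means the database holds no similar malicious page.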
The HTML document of a domain name is represented by a hierarchical, ordered DOM tree in which each node corresponds to an element of the document and each edge corresponds to the parent-child relationship between two nodes. The page structure of a malicious domain name can thus be represented accurately by the constructed DOM tree, which enables accurate comparison between domain names and a more accurate judgment.
Starting from the DOM tree of a domain name's HTML document, the DOM tree structure is converted into a text sequence, and the text sequence is converted into a binary string by the locality-sensitive Simhash algorithm, so that the comparison of structural similarity is reduced to pairwise comparison of binary strings. This greatly simplifies the model, speeds up similarity comparison, saves computing resources, and substantially reduces the harm that malicious domain names do to the healthy development of the Internet.
The text sequence is converted into a binary string with the Simhash algorithm. Simhash is a locality-sensitive hashing algorithm that Google adopted early on for near-duplicate web page detection. For nearly identical text it produces identical or similar hash codes, so the similarity of the hash codes directly reflects the similarity of the inputs, which greatly improves detection precision.
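A minimal Simhash sketch is given below to make the conversion concrete. Several details are assumptions not stated in the source: a 64-bit fingerprint, MD5 as the per-token hash, equal weight for every token, and single tokens (rather than token n-grams) as features. The token list is made up.

```python
import hashlib


def simhash(tokens, bits=64):
    """Unweighted Simhash of a token list, returned as an integer fingerprint."""
    counts = [0] * bits
    for token in tokens:
        # Hash each token to a fixed-width integer (MD5 is an assumed choice).
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        # Each bit of the token hash votes +1 (bit set) or -1 (bit clear).
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    # A fingerprint bit is 1 where the votes for that position are positive.
    fingerprint = 0
    for i in range(bits):
        if counts[i] > 0:
            fingerprint |= 1 << i
    return fingerprint


seq = ["html", "lang", "head", "body", "class", "id", "title", "div"]
fp = simhash(seq)
print(format(fp, "064b"))   # the 64-character binary string form of the fingerprint
```

Because each output bit is a majority vote over all tokens, changing a few tokens flips only a few bits, which is why Hamming distance is a meaningful similarity measure on these fingerprints. With single-token features as used here, the fingerprint does not depend on token order; hashing overlapping token n-grams instead would be one way to make it order-sensitive. The source does not say which features it uses.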
The above is only a specific embodiment of the present invention and shall not limit the scope of its implementation; accordingly, substitutions of equivalent components, and equivalent changes and modifications made within the scope of protection of this patent, shall still fall within the scope covered by the claims of the present invention.
Claims (10)
1. A malicious domain name detection method based on the similarity of DOM tree tags and attributes, characterized in that the method comprises:
collecting a set of domain names of malicious type, converting the set into binary strings, and storing them in a database;
converting a domain name of unknown type into a binary string;
comparing the binary string of the unknown domain name with the binary strings of the malicious domain name set in the database, and judging the maliciousness of the unknown domain name from their similarity.
2. The malicious domain name detection method based on the similarity of DOM tree tags and attributes according to claim 1, characterized in that the steps for converting the malicious domain name set into binary strings are:
(1) obtaining, for each domain name in the set, the HTML document of the corresponding web page after the page has finished loading;
(2) constructing the DOM tree corresponding to the HTML document;
(3) extracting from each DOM tree the tag names of the nodes within a certain number of layers together with all of their attribute names, and converting the text sequence of extracted tag names and attribute names into a binary string.
3. The malicious domain name detection method based on the similarity of DOM tree tags and attributes according to claim 2, characterized in that the DOM tree corresponding to the HTML document is constructed by parsing the HTML document into a DOM tree with a third-party Python parsing library.
4. The malicious domain name detection method based on the similarity of DOM tree tags and attributes according to claim 2, characterized in that the text sequence is constructed from the tag names and attribute names of the DOM tree as follows: for the DOM tree of each domain name in the set, every node within a certain number of layers is visited according to a chosen traversal method, and the tag name and attribute names of each visited node are extracted, so that the DOM tree structure is converted into a text sequence.
5. The malicious domain name detection method based on the similarity of DOM tree tags and attributes according to claim 4, characterized in that the traversal method used for the DOM tree is breadth-first traversal.
6. The malicious domain name detection method based on the similarity of DOM tree tags and attributes according to claim 2, characterized in that the DOM tree is a DOM tree that preserves the hierarchical relationships between nodes.
7. The malicious domain name detection method based on the similarity of DOM tree tags and attributes according to claim 6, characterized in that the text sequence of tag names and attribute names is converted into a binary string with the Simhash algorithm.
8. The malicious domain name detection method based on the similarity of DOM tree tags and attributes according to claim 7, characterized in that a domain name of unknown type is converted into a binary string by the same process used to convert the malicious domain name set into binary strings.
9. The malicious domain name detection method based on the similarity of DOM tree tags and attributes according to claim 8, characterized in that the maliciousness of a domain name of unknown type is judged from binary strings as follows: the binary string of the unknown domain name is compared with the binary strings of the malicious domain name set; when the similarity measure exceeds the threshold, the type of the domain name cannot be determined; when it is within the threshold, the two are deemed similar and the unknown domain name is detected as malicious.
10. The malicious domain name detection method based on the similarity of DOM tree tags and attributes according to claim 9, characterized in that, when binary string similarity is compared, the binary string of the unknown domain name is compared one by one with each binary string in the database, the Hamming distance between the two is calculated, and the Hamming distance is used to measure their similarity.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910327562.XA CN110049052A (en) | 2019-04-23 | 2019-04-23 | Malicious domain name detection method based on DOM tree tag and attribute similarity |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110049052A true CN110049052A (en) | 2019-07-23 |
Family
ID=67278561
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910327562.XA Pending CN110049052A (en) | Malicious domain name detection method based on DOM tree tag and attribute similarity | 2019-04-23 | 2019-04-23 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110049052A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111884813A (en) * | 2020-08-05 | 2020-11-03 | 哈尔滨工业大学(威海) | Malicious certificate detection method |
CN112214737A (en) * | 2020-11-10 | 2021-01-12 | 山东比特智能科技股份有限公司 | Method, system, device and medium for identifying picture-based fraudulent webpage |
CN117081865A (en) * | 2023-10-17 | 2023-11-17 | 北京启天安信科技有限公司 | Network security defense system based on malicious domain name detection method |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7765236B2 (en) * | 2007-08-31 | 2010-07-27 | Microsoft Corporation | Extracting data content items using template matching |
CN102316081A (en) * | 2010-06-30 | 2012-01-11 | 北京启明星辰信息技术股份有限公司 | Method and device for identifying similar webpage |
CN105528357A (en) * | 2014-09-30 | 2016-04-27 | 中国银联股份有限公司 | Webpage content extraction method based on similarity of URLs and similarity of webpage document structures |
CN107204960A (en) * | 2016-03-16 | 2017-09-26 | 阿里巴巴集团控股有限公司 | Web page identification method and device, server |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111884813A (en) * | 2020-08-05 | 2020-11-03 | 哈尔滨工业大学(威海) | Malicious certificate detection method |
CN111884813B (en) * | 2020-08-05 | 2022-03-25 | 哈尔滨工业大学(威海) | Malicious certificate detection method |
CN112214737A (en) * | 2020-11-10 | 2021-01-12 | 山东比特智能科技股份有限公司 | Method, system, device and medium for identifying picture-based fraudulent webpage |
CN112214737B (en) * | 2020-11-10 | 2022-06-24 | 山东比特智能科技股份有限公司 | Method, system, device and medium for identifying picture-based fraudulent webpage |
CN117081865A (en) * | 2023-10-17 | 2023-11-17 | 北京启天安信科技有限公司 | Network security defense system based on malicious domain name detection method |
CN117081865B (en) * | 2023-10-17 | 2023-12-29 | 北京启天安信科技有限公司 | Network security defense system based on malicious domain name detection method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Resch et al. | Combining machine-learning topic models and spatiotemporal analysis of social media data for disaster footprint and damage assessment | |
CN102841920B (en) | Method and device for extracting webpage frame information | |
CN103514234B (en) | A kind of page info extracting method and device | |
CN103853738B (en) | A kind of recognition methods of info web correlation region | |
Flatow et al. | On the accuracy of hyper-local geotagging of social media content | |
CN103491205B (en) | The method for pushing of a kind of correlated resources address based on video search and device | |
Kovbasistyi et al. | Method for detection of non-relevant and wrong information based on content analysis of web resources | |
CN107967208A (en) | A kind of Python resource sensitive defect code detection methods based on deep neural network | |
CN110049052A (en) | Malicious domain name detection method based on DOM tree tag and attribute similarity | |
CN110472066A (en) | A kind of construction method of urban geography semantic knowledge map | |
US9519718B2 (en) | Webpage information detection method and system | |
CN106484764A (en) | User's similarity calculating method based on crowd portrayal technology | |
CN104899243B (en) | The method and device of detection point of interest POI data accuracy | |
CN106202041A (en) | A kind of method and apparatus of the entity alignment problem solved in knowledge mapping | |
CN104699835A (en) | Method and device used for determining webpages including POI (point of interest) data | |
CN104182420A (en) | Ontology-based Chinese name disambiguation method | |
CN102915361B (en) | Webpage text extracting method based on character distribution characteristic | |
CN110377747A (en) | A kind of knowledge base fusion method towards encyclopaedia website | |
CN106547770A (en) | A kind of user's classification based on address of theenduser information, user identification method and device | |
CN108111526A (en) | A kind of illegal website method for digging based on abnormal WHOIS information | |
CN103023874B (en) | A kind of detection method for phishing site | |
CN105279086A (en) | Flow chart-based method for automatically detecting logic loopholes of electronic commerce websites | |
CN107102993A (en) | A kind of user's demand analysis method and device | |
CN106096040A (en) | Organization web ownership place method of discrimination based on search engine and device thereof | |
CN108536664A (en) | The knowledge fusion method in commodity field |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20190723 |