CN113807087A - Website domain name similarity detection method and device - Google Patents

Info

Publication number
CN113807087A
Authority
CN
China
Prior art keywords
domain name
character string
website
local maximum
matrix obtained
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010548219.0A
Other languages
Chinese (zh)
Other versions
CN113807087B (en)
Inventor
蔡鑫
施丽佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Telecom Corp Ltd
Original Assignee
China Telecom Corp Ltd
Application filed by China Telecom Corp Ltd
Priority to CN202010548219.0A
Publication of CN113807087A
Application granted
Publication of CN113807087B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]

Abstract

The invention discloses a method and a device for detecting the similarity of website domain names, and relates to the field of internet security. The method comprises the following steps: matching the domain name character string of the website to be detected with the domain name character string of the target website; acquiring a plurality of common substrings between the website domain name character string to be detected and the target website domain name character string, wherein each common substring comprises one or more characters, two adjacent common substrings are separated by at least one character, and the at least one character does not belong to any common substring; and determining the similarity between the domain name of the website to be detected and the domain name of the target website according to the ratio of the sum of the numbers of characters in the plurality of common substrings to the number of characters in the target website domain name character string. The method and the device optimize the calculation of website domain name similarity, so that the calculated similarity is more accurate.

Description

Website domain name similarity detection method and device
Technical Field
The disclosure relates to the field of internet security, and in particular to a method and a device for detecting similarity of website domain names.
Background
Some malicious web pages on the internet carry out phishing fraud by embedding, in a Uniform Resource Locator (URL), character strings that are identical or similar to another party's domain name, thereby confusing users and counterfeiting that domain name. The similarity between the counterfeit domain name and the counterfeited domain name can be calculated by an edit distance algorithm such as Jaro, so that attempted counterfeiters can be found. However, such an algorithm is computationally expensive and is inefficient when similarity must be compared for a large batch of domain names to be detected.
In the related art, the similarity between domain names is calculated by the longest common substring method. However, the longest common substring method can only strictly match substring parts that are completely identical, and cannot achieve the longest match in the practical sense for 'edit' situations such as local modification, replacement, character addition, and character deletion.
Disclosure of Invention
The technical problem to be solved by the present disclosure is to provide a method and an apparatus for detecting website domain name similarity, which can improve the accuracy of calculating the website domain name similarity.
According to one aspect of the disclosure, a method for detecting the similarity of website domain names is provided, which includes: matching the domain name character string of the website to be detected with the domain name character string of the target website; acquiring a plurality of common substrings between the website domain name character string to be detected and the target website domain name character string, wherein each common substring comprises one or more characters, two adjacent common substrings are separated by at least one character, and the at least one character does not belong to any common substring; and determining the similarity between the domain name of the website to be detected and the domain name of the target website according to the ratio of the sum of the numbers of characters in the plurality of common substrings to the number of characters in the target website domain name character string.
In some embodiments, acquiring the plurality of common substrings between the website domain name character string to be detected and the target website domain name character string includes: constructing a matrix between the website domain name character string to be detected and the target website domain name character string based on the recurrence matrix corresponding to the longest common substring algorithm; determining the local maximum matrix of each iteration by repeated iterative search, in descending order of the number of characters of the corresponding character string; and taking the character string corresponding to each local maximum matrix as a common substring.
In some embodiments, the coordinates of the local maximum matrix obtained by the nth iteration and the local maximum matrix obtained by any previous kth iteration satisfy $(x_n - x_k)(x_n + L_n - x_k - L_k) > 0$ and $(y_n - y_k)(y_n + L_n - y_k - L_k) > 0$, where n and k are natural numbers and n > k; $x_n$ is the position, in the target website domain name character string, of the first character of the character string corresponding to the local maximum matrix obtained by the nth iteration; $x_k$ is the position, in the target website domain name character string, of the first character of the character string corresponding to the local maximum matrix obtained by the kth iteration; $y_n$ is the position, in the website domain name character string to be detected, of the first character of the character string corresponding to the local maximum matrix obtained by the nth iteration; $y_k$ is the position, in the website domain name character string to be detected, of the first character of the character string corresponding to the local maximum matrix obtained by the kth iteration; $L_n$ is the number of characters of the character string corresponding to the local maximum matrix obtained by the nth iteration; and $L_k$ is the number of characters of the character string corresponding to the local maximum matrix obtained by the kth iteration.
In some embodiments, the coordinates of the local maximum matrix obtained by the nth iteration and the local maximum matrix obtained by any previous kth iteration satisfy $[x_n - (x_k + L_k - 1)][y_n - (y_k + L_k - 1)] > 0$, where n and k are natural numbers and n > k; $x_n$ is the position, in the target website domain name character string, of the first character of the character string corresponding to the local maximum matrix obtained by the nth iteration; $x_k$ is the position, in the target website domain name character string, of the first character of the character string corresponding to the local maximum matrix obtained by the kth iteration; $y_n$ is the position, in the website domain name character string to be detected, of the first character of the character string corresponding to the local maximum matrix obtained by the nth iteration; $y_k$ is the position, in the website domain name character string to be detected, of the first character of the character string corresponding to the local maximum matrix obtained by the kth iteration; and $L_k$ is the number of characters of the character string corresponding to the local maximum matrix obtained by the kth iteration.
In some embodiments, it is determined whether the similarity between the domain name of the website to be detected and the domain name of the target website is greater than a threshold value and less than 1, and if so, it is determined that the domain name of the website to be detected is a counterfeit domain name.
According to another aspect of the present disclosure, a website domain name similarity detection apparatus is further provided, including: a character string matching unit configured to match the website domain name character string to be detected with the target website domain name character string; a common substring acquisition unit configured to acquire a plurality of common substrings between the website domain name character string to be detected and the target website domain name character string, wherein each common substring comprises one or more characters, two adjacent common substrings are separated by at least one character, and the at least one character does not belong to any common substring; and a similarity comparison unit configured to determine the similarity between the domain name of the website to be detected and the domain name of the target website according to the ratio of the sum of the numbers of characters in the plurality of common substrings to the number of characters in the target website domain name character string.
In some embodiments, the common substring acquisition unit is configured to: construct a matrix between the website domain name character string to be detected and the target website domain name character string based on the recurrence matrix corresponding to the longest common substring algorithm; determine the local maximum matrix of each iteration by repeated iterative search, in descending order of the number of characters of the corresponding character string; and take the character string corresponding to each local maximum matrix as a common substring.
In some embodiments, the coordinates of the local maximum matrix obtained by the nth iteration and the local maximum matrix obtained by any previous kth iteration satisfy $(x_n - x_k)(x_n - x_k + L_n - L_k) > 0$ and $(y_n - y_k)(y_n - y_k + L_n - L_k) > 0$, where n and k are natural numbers and n > k; $x_n$ is the position, in the target website domain name character string, of the first character of the character string corresponding to the local maximum matrix obtained by the nth iteration; $x_k$ is the position, in the target website domain name character string, of the first character of the character string corresponding to the local maximum matrix obtained by the kth iteration; $y_n$ is the position, in the website domain name character string to be detected, of the first character of the character string corresponding to the local maximum matrix obtained by the nth iteration; $y_k$ is the position, in the website domain name character string to be detected, of the first character of the character string corresponding to the local maximum matrix obtained by the kth iteration; $L_n$ is the number of characters of the character string corresponding to the local maximum matrix obtained by the nth iteration; and $L_k$ is the number of characters of the character string corresponding to the local maximum matrix obtained by the kth iteration.
In some embodiments, the coordinates of the local maximum matrix obtained by the nth iteration and the local maximum matrix obtained by any previous kth iteration satisfy $[x_n - (x_k + L_k - 1)][y_n - (y_k + L_k - 1)] > 0$, where n and k are natural numbers and n > k; $x_n$ is the position, in the target website domain name character string, of the first character of the character string corresponding to the local maximum matrix obtained by the nth iteration; $x_k$ is the position, in the target website domain name character string, of the first character of the character string corresponding to the local maximum matrix obtained by the kth iteration; $y_n$ is the position, in the website domain name character string to be detected, of the first character of the character string corresponding to the local maximum matrix obtained by the nth iteration; $y_k$ is the position, in the website domain name character string to be detected, of the first character of the character string corresponding to the local maximum matrix obtained by the kth iteration; and $L_k$ is the number of characters of the character string corresponding to the local maximum matrix obtained by the kth iteration.
According to another aspect of the present disclosure, a website domain name similarity detecting apparatus is further provided, including: a memory; and a processor coupled to the memory, the processor configured to perform the website domain name similarity detection method as described above based on instructions stored in the memory.
According to another aspect of the present disclosure, a non-transitory computer-readable storage medium is also proposed, on which computer program instructions are stored, which when executed by a processor implement the above-mentioned website domain name similarity detection method.
In the embodiments of the disclosure, a plurality of common substrings between the website domain name character string to be detected and the target website domain name character string are acquired, and the similarity between the domain name of the website to be detected and the domain name of the target website is determined according to the ratio of the sum of the numbers of characters in the plurality of common substrings to the number of characters in the target website domain name character string. This optimizes the calculation of website domain name similarity, so that the calculated similarity is more accurate.
Other features of the present disclosure and advantages thereof will become apparent from the following detailed description of exemplary embodiments thereof, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description, serve to explain the principles of the disclosure.
The present disclosure may be more clearly understood from the following detailed description, taken with reference to the accompanying drawings, in which:
Fig. 1 is a flowchart illustrating a website domain name similarity detection method according to some embodiments of the present disclosure.
Fig. 2 is a flowchart illustrating another embodiment of a website domain name similarity detection method according to the present disclosure.
Fig. 3 is a schematic structural diagram of some embodiments of the website domain name similarity detection apparatus according to the present disclosure.
Fig. 4 is a schematic structural diagram of another embodiment of a website domain name similarity detection apparatus according to the present disclosure.
Fig. 5 is a schematic structural diagram of another embodiment of a website domain name similarity detection apparatus according to the present disclosure.
Detailed Description
Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless specifically stated otherwise.
Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
For the purpose of promoting a better understanding of the objects, aspects and advantages of the present disclosure, reference is made to the following detailed description taken in conjunction with the accompanying drawings.
Fig. 1 is a flowchart illustrating a website domain name similarity detection method according to some embodiments of the present disclosure.
In step 110, the website domain name string to be detected is matched with the target website domain name string.
For example, the domain name of the target website is www.roses-garden.com, and the domain name of the website to be detected is www.r0ses-gaden.com. The string roses-garden is matched against the string r0ses-gaden.
In step 120, a plurality of common substrings between the website domain name character string to be detected and the target website domain name character string are obtained, wherein each common substring comprises one or more characters, two adjacent common substrings are separated by at least one character, and the at least one character does not belong to any common substring.
For example, the resulting common substrings include r, ses-ga, den.
In step 130, the similarity between the domain name of the website to be detected and the domain name of the target website is determined according to the ratio of the sum of the numbers of characters in the plurality of common substrings to the number of characters in the target website domain name character string.
For example, the common substring r includes one character, the common substring ses-ga includes six characters, the common substring den includes three characters, the target website domain name character string includes 12 characters, and the similarity between the website domain name to be detected and the target website domain name is 10/12.
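The arithmetic of this example can be reproduced with a short sketch. The function name `similarity` is hypothetical and is introduced only for illustration; the common substrings are taken as given rather than computed.

```python
from fractions import Fraction


def similarity(common_substrings, target):
    """Ratio of the total number of matched characters to the target length."""
    matched = sum(len(s) for s in common_substrings)
    return Fraction(matched, len(target))


target = "roses-garden"  # 12 characters
score = similarity(["r", "ses-ga", "den"], target)
print(score, float(score))  # 10/12 reduces to 5/6
```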
In some embodiments, after determining the similarity between the domain name of the website to be detected and the domain name of the target website, determining whether the similarity between the domain name of the website to be detected and the domain name of the target website is greater than a threshold value and less than 1, and if so, determining that the domain name of the website to be detected is a counterfeit website domain name.
For example, if the threshold is set to 0.8 and the similarity between the domain name of the website to be detected and the domain name of the target website obtained by the method of this embodiment is 10/12 (about 0.83), it can be determined that the domain name of the website to be detected is a counterfeit website domain name. By contrast, when the similarity is calculated by the related-art longest common substring method, the similarity between the domain name of the website to be detected and the domain name of the target website is 6/12. In that case it can only be determined that the domain name of the website to be detected is not the same as the domain name of the target website, and the domain name of the website to be detected is regarded as unrelated to the domain name of the target website.
In this embodiment, a plurality of common substrings between the website domain name character string to be detected and the target website domain name character string are acquired, and the similarity between the domain name of the website to be detected and the domain name of the target website is determined according to the ratio of the sum of the numbers of characters in the plurality of common substrings to the number of characters in the target website domain name character string. This optimizes the calculation of the similarity between the two domain names, so that the calculated similarity is more accurate.
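The threshold test described above might be sketched as follows. The function name `is_counterfeit` and the default threshold are assumptions for illustration; 0.8 is only the example value used in the text.

```python
def is_counterfeit(similarity, threshold=0.8):
    """Flag a domain when its similarity is above the threshold but below 1.

    A similarity of exactly 1 means the strings are identical, so the domain
    is the target itself rather than an imitation.
    """
    return threshold < similarity < 1


multi_substring = 10 / 12   # several common substrings (this embodiment)
longest_only = 6 / 12       # related-art longest common substring

print(is_counterfeit(multi_substring))  # True
print(is_counterfeit(longest_only))     # False
print(is_counterfeit(1.0))              # False
```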
Fig. 2 is a flowchart illustrating another embodiment of a website domain name similarity detection method according to the present disclosure.
In step 210, a matrix between the domain name character string of the website to be detected and the domain name character string of the target website is constructed based on the recursion matrix corresponding to the longest common substring algorithm.
In some embodiments, the recurrence matrix of the standard longest common substring algorithm is used. Suppose the ith character of the target website domain name character string is the same as the jth character of the website domain name character string to be detected, where i and j are positive integers greater than or equal to 1. If the characters preceding these two identical characters in their respective strings differ, the preceding entry of the recurrence matrix is 0 and a new run starts at this entry; if the preceding characters are also identical, the entry at (i, j) is the entry at (i-1, j-1) plus 1. For example, the data of the recurrence matrix is denoted res[i][j]. When A[i] = B[j], res[i][j] = res[i-1][j-1] + 1. When A[i] ≠ B[j], res[i][j] = 0. Here A[i] is the ith character of the target website domain name character string, and B[j] is the jth character of the website domain name character string to be detected. When i is 0 or j is 0, A[i] and B[j] have no corresponding characters, which is why i and j are positive integers greater than or equal to 1.
Take the target website domain name www.roses-garden.com and the website domain name to be detected www.r0ses-gaden.com as an example. The matrix between the website domain name character string to be detected and the target website domain name character string is shown in Table 1.
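The recurrence can be written out as a minimal sketch. The function name `build_recurrence_matrix` is hypothetical; row 0 and column 0 are kept as all-zero padding so that the indices i and j are 1-based, as in the text.

```python
def build_recurrence_matrix(a, b):
    """Return res where res[i][j] is the length of the common run that ends
    at the ith character of a and the jth character of b (1-based)."""
    res = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the run that ended on the previous diagonal cell.
                res[i][j] = res[i - 1][j - 1] + 1
            # Otherwise the entry stays 0.
    return res


res = build_recurrence_matrix("roses-garden", "r0ses-gaden")
print(res[8][8])    # 6, the run ses-ga ends at position 8 in both strings
print(res[12][11])  # 3, the run den ends at positions 12 and 11
```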
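TABLE 1 (matrix between the website domain name character string to be detected and the target website domain name character string; the table is an image in the original and is not reproduced here)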
In step 220, the local maximum matrix of each iteration is determined by repeated iterative search, in descending order of the number of characters of the corresponding character string, and the character string corresponding to each local maximum matrix is taken as a common substring.
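One possible reading of step 220 is sketched below. It is not the patented implementation: all function names are hypothetical, a "local maximum matrix" is interpreted as a maximal diagonal run in the recurrence matrix, and a candidate is kept only if it satisfies both coordinate constraints given later in this description against every substring already kept. The description presents those constraints as separate embodiments, so combining them is an assumption of this sketch.

```python
def build_matrix(a, b):
    res = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                res[i][j] = res[i - 1][j - 1] + 1
    return res


def candidate_runs(a, b):
    """Return (x, y, L) for every maximal diagonal run; x and y are the
    1-based start positions in a and b, L is the run length."""
    res = build_matrix(a, b)
    runs = []
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            length = res[i][j]
            run_ends = i == len(a) or j == len(b) or res[i + 1][j + 1] == 0
            if length > 0 and run_ends:
                runs.append((i - length + 1, j - length + 1, length))
    # Longest first; ties broken by position to keep the result deterministic.
    return sorted(runs, key=lambda r: (-r[2], r[0], r[1]))


def compatible(new, old):
    xn, yn, ln = new
    xk, yk, lk = old
    same_side_x = (xn - xk) * (xn + ln - xk - lk) > 0
    same_side_y = (yn - yk) * (yn + ln - yk - lk) > 0
    same_order = (xn - (xk + lk - 1)) * (yn - (yk + lk - 1)) > 0
    return same_side_x and same_side_y and same_order


def common_substrings(target, detected):
    kept = []
    for run in candidate_runs(target, detected):
        if all(compatible(run, old) for old in kept):
            kept.append(run)  # otherwise the candidate is discarded
    kept.sort()
    return [target[x - 1:x - 1 + length] for x, _, length in kept]


print(common_substrings("roses-garden", "r0ses-gaden"))
# ['r', 'ses-ga', 'den']
```

The search stops when the candidate list is exhausted, which corresponds to the case where no further local maximum matrix appears.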
In some embodiments, the number of iterations may be limited or the iterative search may be stopped when it is determined that the local maximum matrix no longer appears. In some embodiments, if the local maximum matrix found in the nth iteration does not satisfy the following constraint requirement, the local maximum matrix is discarded, and the next iteration is turned to.
In some embodiments, the coordinates of the local maximum matrix obtained by the nth iteration and the local maximum matrix obtained by any previous kth iteration satisfy $(x_n - x_k)(x_n + L_n - x_k - L_k) > 0$ and $(y_n - y_k)(y_n + L_n - y_k - L_k) > 0$, where n and k are natural numbers and n > k; $x_n$ is the position, in the target website domain name character string, of the first character of the character string corresponding to the local maximum matrix obtained by the nth iteration; $x_k$ is the position, in the target website domain name character string, of the first character of the character string corresponding to the local maximum matrix obtained by the kth iteration; $y_n$ is the position, in the website domain name character string to be detected, of the first character of the character string corresponding to the local maximum matrix obtained by the nth iteration; $y_k$ is the position, in the website domain name character string to be detected, of the first character of the character string corresponding to the local maximum matrix obtained by the kth iteration; $L_n$ is the number of characters of the character string corresponding to the local maximum matrix obtained by the nth iteration; and $L_k$ is the number of characters of the character string corresponding to the local maximum matrix obtained by the kth iteration.
That is, if, in the target website domain name character string, the first character of the character string corresponding to the local maximum matrix obtained by the nth iteration is positioned after the first character of the character string corresponding to the local maximum matrix obtained by the kth iteration, then the last character of the nth character string is also positioned after the last character of the kth character string. Likewise, if, in the website domain name character string to be detected, the first character of the nth character string is positioned after the first character of the kth character string, then the last character of the nth character string is also positioned after the last character of the kth character string.
Conversely, if, in the target website domain name character string, the first character of the character string corresponding to the local maximum matrix obtained by the nth iteration is positioned before the first character of the character string corresponding to the local maximum matrix obtained by the kth iteration, then the last character of the nth character string is also positioned before the last character of the kth character string. Likewise, if, in the website domain name character string to be detected, the first character of the nth character string is positioned before the first character of the kth character string, then the last character of the nth character string is also positioned before the last character of the kth character string.
For example, in the first iteration, the recurrence formula res[i][j] is used to find the local maximum matrix Rank1 in res[i][j]. The number of characters of the character string corresponding to the local maximum matrix Rank1 of the first iteration is $L_1 = 6$. The corresponding character string ses-ga has the coordinates $[x_1, x_1+1, \ldots, x_1+L_1-1]$ in the target website domain name character string, with $x_1 = 3$, and the coordinates $[y_1, y_1+1, \ldots, y_1+L_1-1]$ in the website domain name character string to be detected, with $y_1 = 3$.
In the second iteration, the next local maximum matrix Rank2 is found. The number of characters of the character string corresponding to the local maximum matrix Rank2 of the second iteration is $L_2 = 3$. The corresponding character string den has the coordinates $[x_2, x_2+1, \ldots, x_2+L_2-1]$ in the target website domain name character string, with $x_2 = 10$, and the coordinates $[y_2, y_2+1, \ldots, y_2+L_2-1]$ in the website domain name character string to be detected, with $y_2 = 9$.
$(x_2 - x_1)(x_2 - x_1 + L_2 - L_1) = (10-3)(10-3+3-6) = 28 > 0$, and $(y_2 - y_1)(y_2 - y_1 + L_2 - L_1) = (9-3)(9-3+3-6) = 18 > 0$. That is, the projections of Rank2 onto the vertical axis and the horizontal axis each lie on one side of the corresponding projection of Rank1, with no intersection.
In the third iteration, the next local maximum matrix Rank3 is found. The number of characters of the character string corresponding to the local maximum matrix Rank3 of the third iteration is $L_3 = 1$. The corresponding character string r has the coordinate $[x_3]$ in the target website domain name character string, with $x_3 = 1$, and the coordinate $[y_3]$ in the website domain name character string to be detected, with $y_3 = 1$.
$(x_3 - x_2)(x_3 - x_2 + L_3 - L_2) = (1-10)(1-10+1-3) = 99 > 0$, and $(y_3 - y_2)(y_3 - y_2 + L_3 - L_2) = (1-9)(1-9+1-3) = 80 > 0$. $(x_3 - x_1)(x_3 - x_1 + L_3 - L_1) = (1-3)(1-3+1-6) = 14 > 0$, and $(y_3 - y_1)(y_3 - y_1 + L_3 - L_1) = (1-3)(1-3+1-6) = 14 > 0$. That is, the projections of Rank3 onto the vertical axis and the horizontal axis each lie on one side of the corresponding projection of Rank2, and likewise on one side of the corresponding projection of Rank1, with no intersection.
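The checks in this worked example can be replayed with a small sketch. The function name `same_side` and the (x, y, L) tuple layout are assumptions for illustration; the numeric values are those of Rank1, Rank2 and Rank3 above.

```python
def same_side(new, old):
    """Evaluate the two products of the first constraint for a new run
    against an earlier one; both must be strictly positive."""
    xn, yn, ln = new
    xk, yk, lk = old
    x_term = (xn - xk) * (xn - xk + ln - lk)
    y_term = (yn - yk) * (yn - yk + ln - lk)
    return x_term, y_term, x_term > 0 and y_term > 0


rank1 = (3, 3, 6)    # ses-ga
rank2 = (10, 9, 3)   # den
rank3 = (1, 1, 1)    # r

print(same_side(rank2, rank1))  # (28, 18, True)
print(same_side(rank3, rank2))  # (99, 80, True)
print(same_side(rank3, rank1))  # (14, 14, True)
```

A candidate that overlaps an earlier run fails the test. For instance, the single character s at position 5 of the target and position 3 of the string to be detected gives `same_side((5, 3, 1), rank1)`, whose products are -6 and 0.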
In some embodiments, the coordinates of the local maximum matrix obtained by the nth iteration and the local maximum matrix obtained by any previous kth iteration satisfy $[x_n - (x_k + L_k - 1)][y_n - (y_k + L_k - 1)] > 0$, where n and k are natural numbers and n > k; $x_n$ is the position, in the target website domain name character string, of the first character of the character string corresponding to the local maximum matrix obtained by the nth iteration; $x_k$ is the position, in the target website domain name character string, of the first character of the character string corresponding to the local maximum matrix obtained by the kth iteration; $y_n$ is the position, in the website domain name character string to be detected, of the first character of the character string corresponding to the local maximum matrix obtained by the nth iteration; $y_k$ is the position, in the website domain name character string to be detected, of the first character of the character string corresponding to the local maximum matrix obtained by the kth iteration; and $L_k$ is the number of characters of the character string corresponding to the local maximum matrix obtained by the kth iteration.
That is, if, in the target website domain name character string, the first character of the character string corresponding to the local maximum matrix obtained by the nth iteration lies after the last character of the character string corresponding to the local maximum matrix obtained by the kth iteration, then the same ordering holds in the domain name character string of the website to be detected.
Conversely, if the first character of the nth-iteration character string lies before the last character of the kth-iteration character string in the target website domain name character string, then it also lies before that last character in the domain name character string of the website to be detected.
For example, during the first iteration, the position of the last character "a" of the common substring "ses-ga" is 8 in the target website domain name character string and 8 in the domain name character string of the website to be detected. During the second iteration, the position of the first character of the common substring "den" is 10 in the target website domain name character string and 9 in the domain name character string of the website to be detected. Thus 10 > 8 and 9 > 8, so the condition is satisfied.
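As a minimal sketch of this second condition, the product $[x_n-(x_k+L_k-1)][y_n-(y_k+L_k-1)]$ can be evaluated directly. The function name is hypothetical; the values come from the example (the first substring starts at position 3 in both strings and has 6 characters, so its last character is at position 8).

```python
def order_preserved(xn, yn, xk, yk, lk):
    """Evaluate [x_n - (x_k + L_k - 1)] * [y_n - (y_k + L_k - 1)] > 0."""
    last_x = xk + lk - 1  # position of the kth substring's last character in the target string
    last_y = yk + lk - 1  # same, in the string to be detected
    return (xn - last_x) * (yn - last_y) > 0


# Second iteration ("den": x=10, y=9) against the first ("ses-ga": x=3, y=3, L=6).
print(order_preserved(10, 9, 3, 3, 6))
# Third iteration ("r": x=1, y=1) against the first.
print(order_preserved(1, 1, 3, 3, 6))
```

Both calls evaluate to true: (10 - 8)(9 - 8) = 2 and (1 - 8)(1 - 8) = 49.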
In step 230, the similarity between the domain name of the website to be detected and the domain name of the target website is determined according to the ratio of the sum of the numbers of characters in the plurality of common substrings to the number of characters in the target website domain name character string.
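The ratio in step 230 can be sketched as follows. The function name and the sample strings are hypothetical and chosen only for illustration; they do not appear in the disclosure.

```python
def similarity(common_substrings, target_domain):
    """Ratio of the total length of the common substrings to the target domain length."""
    if not target_domain:
        raise ValueError("target domain name must not be empty")
    return sum(len(s) for s in common_substrings) / len(target_domain)


# Hypothetical example: two common substrings covering 11 of the 12 target characters.
score = similarity(["example", "bank"], "example-bank")
print(round(score, 4))
```

Because the denominator is the length of the target domain name, the score describes how much of the target is reproduced in the domain name to be detected.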
In step 240, it is determined whether the similarity between the domain name of the website to be detected and the domain name of the target website is greater than a threshold value and less than 1, if yes, step 250 is executed, otherwise, step 260 is executed.
In step 250, the domain name of the website to be detected is determined to be a counterfeit website domain name.
In step 260, it is determined that the domain name of the website to be detected is not a counterfeit website domain name. For example, a similarity of 1 indicates that the domain name of the website to be detected and the domain name of the target website are identical, whereas a similarity smaller than the threshold indicates that the two domain names are unrelated.
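Steps 240 to 260 amount to a range check on the similarity value. In the sketch below the function name and the default threshold of 0.8 are assumptions made for illustration; the disclosure does not state a specific threshold.

```python
def is_counterfeit(similarity, threshold=0.8):
    """Return True when threshold < similarity < 1 (steps 240 and 250).

    The default threshold is an assumed value, not one given in the disclosure.
    """
    return threshold < similarity < 1


print(is_counterfeit(0.9375))  # close to the target but not identical
print(is_counterfeit(1.0))     # identical to the target domain name
print(is_counterfeit(0.4))     # below the threshold, treated as unrelated
```

The strict upper bound excludes the target domain name itself, which would otherwise always score 1.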
In this embodiment, the common substrings are obtained through improved multiple iterations, achieving a calculation effect similar to that of an edit distance, so that counterfeit website domain names produced by edit-like operations such as partial modification, replacement, character insertion, and character deletion can be detected. In addition, the scheme avoids computationally expensive algorithms such as Jaro and other edit-distance measures, and is therefore efficient to implement.
Fig. 3 is a schematic structural diagram of some embodiments of the website domain name similarity detection apparatus according to the present disclosure. The device comprises a character string matching unit 310, a common substring obtaining unit 320 and a similarity comparison unit 330.
The character string matching unit 310 is configured to match the website domain name character string to be detected with the target website domain name character string.
The common substring acquiring unit 320 is configured to acquire a plurality of common substrings between the website domain name character string to be detected and the target website domain name character string, wherein each common substring comprises one or more characters, two adjacent common substrings are separated by at least one character, and the at least one character does not belong to any common substring.
In some embodiments, a matrix between the domain name character string of the website to be detected and the domain name character string of the target website is constructed based on the recurrence matrix corresponding to the longest common substring algorithm. The local maximum matrix of each iteration is determined through repeated iterative searches in descending order of the number of characters of the character string, and the character string corresponding to each local maximum matrix is taken as a common substring.
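One way to picture the matrix construction and the iterative search is the greedy sketch below. It is an illustration under stated assumptions, not the patented procedure: the function names and the sample domain names are hypothetical, positions are 0-based, and the compatibility test is a plain interval check (a new block must lie entirely before, or entirely after, every accepted block in both strings), which is stricter than the inequalities given in the disclosure.

```python
def lcs_matrix(target, candidate):
    """m[i][j] is the length of the common substring ending at target[i] and candidate[j]."""
    m = [[0] * len(candidate) for _ in target]
    for i, a in enumerate(target):
        for j, b in enumerate(candidate):
            if a == b:
                m[i][j] = (m[i - 1][j - 1] if i > 0 and j > 0 else 0) + 1
    return m


def compatible(x, y, length, xk, yk, lk):
    """True if the new block is wholly before or wholly after block k in both strings."""
    before = x + length <= xk and y + length <= yk
    after = x >= xk + lk and y >= yk + lk
    return before or after


def common_substrings(target, candidate):
    """Greedily accept common substrings from longest to shortest."""
    m = lcs_matrix(target, candidate)
    # Each non-zero cell describes a block as (length, start in target, start in candidate).
    cells = [
        (m[i][j], i - m[i][j] + 1, j - m[i][j] + 1)
        for i in range(len(target))
        for j in range(len(candidate))
        if m[i][j] > 0
    ]
    cells.sort(key=lambda c: (-c[0], c[1], c[2]))
    accepted = []
    for length, x, y in cells:
        if all(compatible(x, y, length, xk, yk, lk) for xk, yk, lk in accepted):
            accepted.append((x, y, length))
    accepted.sort()
    return [(target[x:x + length], x, y) for x, y, length in accepted]


# Hypothetical domain names: the digit "1" replaces the letter "l".
blocks = common_substrings("example-bank.com", "examp1e-bank.com")
print(blocks)
```

For the sample pair, the search accepts the 10-character block first and then the 5-character block, so 15 of the 16 target characters are covered by common substrings.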
In some embodiments, the coordinates of the local maximum matrix obtained from the nth iteration and the local maximum matrix obtained from any previous kth iteration satisfy $(x_n-x_k)(x_n-x_k+L_n-L_k) > 0$ and $(y_n-y_k)(y_n-y_k+L_n-L_k) > 0$, where n and k are natural numbers and n > k; $x_n$ and $x_k$ are the positions, in the target website domain name character string, of the first character of the character strings corresponding to the local maximum matrices obtained by the nth and kth iterations, respectively; $y_n$ and $y_k$ are the corresponding positions in the domain name character string of the website to be detected; and $L_n$ and $L_k$ are the numbers of characters of the character strings corresponding to the local maximum matrices obtained by the nth and kth iterations, respectively.
In other embodiments, the coordinates of the local maximum matrix obtained from the nth iteration and the local maximum matrix obtained from any previous kth iteration satisfy $[x_n-(x_k+L_k-1)][y_n-(y_k+L_k-1)] > 0$, where n and k are natural numbers and n > k; $x_n$ and $x_k$ are the positions, in the target website domain name character string, of the first character of the character strings corresponding to the local maximum matrices obtained by the nth and kth iterations, respectively; $y_n$ and $y_k$ are the corresponding positions in the domain name character string of the website to be detected; and $L_k$ is the number of characters of the character string corresponding to the local maximum matrix obtained by the kth iteration.
The similarity comparison unit 330 is configured to determine the similarity between the domain name of the website to be detected and the domain name of the target website according to the ratio of the sum of the numbers of characters in the plurality of common substrings to the number of characters in the target website domain name character string.
In this embodiment, a plurality of common substrings between the domain name character string of the website to be detected and the target website domain name character string are acquired, and the similarity between the two domain names is determined according to the ratio of the sum of the numbers of characters in the plurality of common substrings to the number of characters in the target website domain name character string. This optimizes the similarity calculation and makes its result more accurate.
In other embodiments of the present disclosure, as shown in fig. 4, the apparatus further includes a counterfeit website domain name determining unit 410, configured to determine whether the similarity between the website domain name to be detected and the target website domain name is greater than a threshold and smaller than 1, and if so, determine that the website domain name to be detected is a counterfeit website domain name.
In this embodiment, counterfeit website domain names produced by edit-like operations such as partial modification, replacement, character insertion, and character deletion can be detected.
Fig. 5 is a schematic structural diagram of another embodiment of a website domain name similarity detection apparatus according to the present disclosure. The apparatus 500 includes a memory 510 and a processor 520. Wherein: the memory 510 may be a magnetic disk, flash memory, or any other non-volatile storage medium. The memory is used to store instructions in the embodiments corresponding to fig. 1-2. Processor 520 is coupled to memory 510 and may be implemented as one or more integrated circuits, such as a microprocessor or microcontroller. The processor 520 is configured to execute instructions stored in memory.
In some embodiments, processor 520 is coupled to memory 510 by a bus 530. The apparatus 500 may also be connected to an external storage system 550 through a storage interface 540 for calling external data, and may also be connected to a network or another computer system (not shown) through a network interface 560. These are not described in detail herein.
In the embodiment, the data instruction is stored in the memory, and the processor processes the instruction, so that the calculation result of the similarity of the domain name of the website is optimized, and the calculation result of the similarity is more accurate.
In other embodiments, a computer-readable storage medium has stored thereon computer program instructions which, when executed by a processor, implement the steps of the method in the embodiments corresponding to fig. 1-2. As will be appreciated by one skilled in the art, embodiments of the present disclosure may be provided as a method, apparatus, or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable non-transitory storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Thus far, the present disclosure has been described in detail. Some details that are well known in the art have not been described in order to avoid obscuring the concepts of the present disclosure. It will be fully apparent to those skilled in the art from the foregoing description how to practice the presently disclosed embodiments.
Although some specific embodiments of the present disclosure have been described in detail by way of example, it should be understood by those skilled in the art that the foregoing examples are for purposes of illustration only and are not intended to limit the scope of the present disclosure. It will be appreciated by those skilled in the art that modifications may be made to the above embodiments without departing from the scope and spirit of the present disclosure. The scope of the present disclosure is defined by the appended claims.

Claims (11)

1. A website domain name similarity detection method comprises the following steps:
matching the domain name character string of the website to be detected with the domain name character string of the target website;
acquiring a plurality of common substrings between the website domain name character string to be detected and the target website domain name character string, wherein each common substring comprises one or more characters, at least one character is separated between every two adjacent common substrings, and the at least one character does not belong to any common substring; and
determining the similarity between the domain name of the website to be detected and the domain name of the target website according to the ratio of the sum of the numbers of characters in the plurality of common substrings to the number of characters in the target website domain name character string.
2. The website domain name similarity detection method according to claim 1, wherein the obtaining of the plurality of common substrings between the website domain name character string to be detected and the target website domain name character string comprises:
constructing a matrix between the domain name character string of the website to be detected and the domain name character string of the target website based on a recursion matrix corresponding to a longest common substring algorithm;
determining a local maximum matrix of each iteration through repeated iteration searching according to the sequence of the number of the characters of the character string from large to small; and
taking the character string corresponding to each local maximum matrix as a common substring.
3. The website domain name similarity detection method according to claim 2,
the coordinates of the local maximum matrix obtained by the nth iteration and the local maximum matrix obtained by any previous kth iteration satisfy $(x_n-x_k)(x_n+L_n-x_k-L_k) > 0$ and $(y_n-y_k)(y_n+L_n-y_k-L_k) > 0$, wherein n and k are natural numbers and n > k, $x_n$ is the position of the first character in the character string corresponding to the local maximum matrix obtained by the nth iteration in the target website domain name character string, $x_k$ is the position of the first character in the character string corresponding to the local maximum matrix obtained by the kth iteration in the target website domain name character string, $y_n$ is the position of the first character in the character string corresponding to the local maximum matrix obtained by the nth iteration in the website domain name character string to be detected, $y_k$ is the position of the first character in the character string corresponding to the local maximum matrix obtained by the kth iteration in the website domain name character string to be detected, $L_n$ is the number of characters of the character string corresponding to the local maximum matrix obtained by the nth iteration, and $L_k$ is the number of characters of the character string corresponding to the local maximum matrix obtained by the kth iteration.
4. The website domain name similarity detection method according to claim 2,
the coordinates of the local maximum matrix obtained by the nth iteration and the local maximum matrix obtained by any previous kth iteration satisfy $[x_n-(x_k+L_k-1)][y_n-(y_k+L_k-1)] > 0$, wherein n and k are natural numbers and n > k, $x_n$ is the position of the first character in the character string corresponding to the local maximum matrix obtained by the nth iteration in the target website domain name character string, $x_k$ is the position of the first character in the character string corresponding to the local maximum matrix obtained by the kth iteration in the target website domain name character string, $y_n$ is the position of the first character in the character string corresponding to the local maximum matrix obtained by the nth iteration in the website domain name character string to be detected, $y_k$ is the position of the first character in the character string corresponding to the local maximum matrix obtained by the kth iteration in the website domain name character string to be detected, and $L_k$ is the number of characters of the character string corresponding to the local maximum matrix obtained by the kth iteration.
5. The website domain name similarity detection method according to any one of claims 1 to 4, further comprising:
judging whether the similarity between the domain name of the website to be detected and the domain name of the target website is greater than a threshold value and less than 1, and if so, determining that the domain name of the website to be detected is a counterfeit website domain name.
6. A website domain name similarity detection device comprises:
the character string matching unit is configured to match the website domain name character string to be detected with the target website domain name character string;
the common substring acquisition unit is configured to acquire a plurality of common substrings between the website domain name character string to be detected and the target website domain name character string, wherein each common substring comprises one or more characters, at least one character is separated between every two adjacent common substrings, and the at least one character does not belong to any common substring; and
the similarity comparison unit is configured to determine the similarity between the domain name of the website to be detected and the domain name of the target website according to the ratio of the sum of the numbers of characters in the plurality of common substrings to the number of characters in the target website domain name character string.
7. The website domain name similarity detection apparatus according to claim 6,
the common substring acquisition unit is configured to construct a matrix between the domain name character string of the website to be detected and the domain name character string of the target website based on a recursion matrix corresponding to a longest common substring algorithm; determine a local maximum matrix of each iteration through repeated iterative searches in descending order of the number of characters of the character string; and take the character string corresponding to each local maximum matrix as a common substring.
8. The website domain name similarity detection apparatus according to claim 7,
the coordinates of the local maximum matrix obtained by the nth iteration and the local maximum matrix obtained by any previous kth iteration satisfy $(x_n-x_k)(x_n-x_k+L_n-L_k) > 0$ and $(y_n-y_k)(y_n-y_k+L_n-L_k) > 0$, wherein n and k are natural numbers and n > k, $x_n$ is the position of the first character in the character string corresponding to the local maximum matrix obtained by the nth iteration in the target website domain name character string, $x_k$ is the position of the first character in the character string corresponding to the local maximum matrix obtained by the kth iteration in the target website domain name character string, $y_n$ is the position of the first character in the character string corresponding to the local maximum matrix obtained by the nth iteration in the website domain name character string to be detected, $y_k$ is the position of the first character in the character string corresponding to the local maximum matrix obtained by the kth iteration in the website domain name character string to be detected, $L_n$ is the number of characters of the character string corresponding to the local maximum matrix obtained by the nth iteration, and $L_k$ is the number of characters of the character string corresponding to the local maximum matrix obtained by the kth iteration.
9. The website domain name similarity detection apparatus according to claim 7,
the coordinates of the local maximum matrix obtained by the nth iteration and the local maximum matrix obtained by any previous kth iteration satisfy $[x_n-(x_k+L_k-1)][y_n-(y_k+L_k-1)] > 0$, wherein n and k are natural numbers and n > k, $x_n$ is the position of the first character in the character string corresponding to the local maximum matrix obtained by the nth iteration in the target website domain name character string, $x_k$ is the position of the first character in the character string corresponding to the local maximum matrix obtained by the kth iteration in the target website domain name character string, $y_n$ is the position of the first character in the character string corresponding to the local maximum matrix obtained by the nth iteration in the website domain name character string to be detected, $y_k$ is the position of the first character in the character string corresponding to the local maximum matrix obtained by the kth iteration in the website domain name character string to be detected, and $L_k$ is the number of characters of the character string corresponding to the local maximum matrix obtained by the kth iteration.
10. A website domain name similarity detection device comprises:
a memory; and
a processor coupled to the memory, the processor configured to perform the website domain name similarity detection method of any one of claims 1 to 5 based on instructions stored in the memory.
11. A non-transitory computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the website domain name similarity detection method of any one of claims 1 to 5.
CN202010548219.0A 2020-06-16 2020-06-16 Method and device for detecting similarity of website domain names Active CN113807087B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010548219.0A CN113807087B (en) 2020-06-16 2020-06-16 Method and device for detecting similarity of website domain names

Publications (2)

Publication Number Publication Date
CN113807087A (en) 2021-12-17
CN113807087B (en) 2023-11-28

Family

ID=78944259

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010548219.0A Active CN113807087B (en) 2020-06-16 2020-06-16 Method and device for detecting similarity of website domain names

Country Status (1)

Country Link
CN (1) CN113807087B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114710468A (en) * 2022-03-31 2022-07-05 绿盟科技集团股份有限公司 Domain name generation and identification method, device, equipment and medium
CN114710468B (en) * 2022-03-31 2024-05-14 绿盟科技集团股份有限公司 Domain name generation and identification method, device, equipment and medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102222187A (en) * 2011-06-02 2011-10-19 国家计算机病毒应急处理中心 Domain name structural feature-based hang horse web page detection method
CN103428307A (en) * 2013-08-09 2013-12-04 中国科学院计算机网络信息中心 Method and equipment for detecting counterfeit domain names
CN106127222A (en) * 2016-06-13 2016-11-16 中国科学院信息工程研究所 The similarity of character string computational methods of a kind of view-based access control model and similarity determination methods
CN106202543A (en) * 2016-07-27 2016-12-07 苏州家佳宝妇幼医疗科技有限公司 Ontology Matching method and system based on machine learning
CN106776768A (en) * 2016-11-23 2017-05-31 福建六壬网安股份有限公司 A kind of URL grasping means of distributed reptile engine and system
CN107609059A (en) * 2017-08-28 2018-01-19 昆明理工大学 A kind of Chinese domain name Similarity Measures based on J W distances
CN108628953A (en) * 2018-04-08 2018-10-09 中山大学 A kind of parallel by character string matching algorithm based on FPGA

Also Published As

Publication number Publication date
CN113807087B (en) 2023-11-28

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant