CN107609059B

CN107609059B - Chinese domain name similarity measurement method based on J-W distance

Info

Publication number: CN107609059B
Application number: CN201710749659.0A
Authority: CN
Inventors: 龙华; 祁俊辉; 邵玉斌; 杜庆治
Original assignee: Kunming University of Science and Technology
Current assignee: Kunming University of Science and Technology
Priority date: 2017-08-28
Filing date: 2017-08-28
Publication date: 2020-10-20
Anticipated expiration: 2037-08-28
Also published as: CN107609059A

Abstract

The invention relates to a Chinese domain name similarity measurement method based on J-W (Jaro-Winkler) distance, belonging to the technical field of network security. The method maps Chinese characters to a numeric string through a Unicode Chinese character stroke-order table, and introduces the Jaro-Winkler Distance algorithm from the field of machine learning, combined with the longest common substring, to measure the similarity of Chinese domain names. First, a domain name to be detected and a target domain name are acquired and initialized to produce their domain name main bodies. Second, each main body is encoded according to the Unicode Chinese character stroke-order table into a numeric string, which serves as input to the Jaro-Winkler Distance algorithm to generate a detection matrix. Then the similarity of the numeric strings is calculated according to the relevant rules in combination with their longest common substring; this similarity effectively represents the similarity between the Chinese characters.

Description

Chinese domain name similarity measurement method based on J-W distance

Technical Field

The invention relates to a Chinese domain name similarity measurement method based on J-W distance, belonging to the technical field of network security.

Background

With the development and popularization of the Internet, Chinese domain names have gradually become an important component of internationalized domain names. At the same time, counterfeiting attacks against Chinese domain names are increasing, and the forms of counterfeiting are becoming increasingly complex. Because many Chinese characters are close to one another in shape, and because people tend to read quickly, a certain degree of visual misjudgment is inevitable.

Traditional domain name similarity measurement methods are suited to English domain names and are not effective for Chinese domain names. Moreover, domestic research on Chinese domain name similarity measurement is currently scarce, with few published results.

At present, most Chinese domain name similarity measurement methods calculate similarity from single-character similarity and overall similarity. These methods have shortcomings in time complexity or accuracy, and they give no specific algorithm for calculating either the single-character similarity or the overall similarity.

Disclosure of Invention

The technical problem to be solved by the invention is to overcome the limitations and deficiencies of the prior art by providing a Chinese domain name similarity measurement method based on J-W distance. Compared with prior-art Chinese domain name similarity measurement methods, the method mainly addresses insufficient accuracy and poor efficiency, and aims to improve both the accuracy and the timeliness of Chinese domain name similarity measurement.

The technical scheme of the invention is as follows: a Chinese domain name similarity measurement method based on J-W distance comprises the following specific steps:

Step 1: Acquire a domain name X to be detected and a target domain name Y;

Step 2: Split the domain name X to be detected and the target domain name Y at the dot "." or the Chinese full stop "。", ignore the network name and the domain suffix, and retain the domain name main body, generating the Chinese character sets x: {x₁, x₂, …, x_p} and y: {y₁, y₂, …, y_q};

Step 3: Following the Unicode Chinese character stroke-order table, traverse the main-body Chinese character sets x: {x₁, x₂, …, x_p} and y: {y₁, y₂, …, y_q} obtained in Step 2 in character order. For each Chinese character x_i, i∈[1,p], or y_i, i∈[1,q], look up its stroke order and convert it according to the corresponding encoding rule, generating the encoded string str_x of the main body of the domain name X to be detected and the encoded string str_y of the main body of the target domain name Y, and obtain the lengths len_x and len_y of str_x and str_y;

Step 4.1: Take the encoded strings str_x and str_y of the main bodies of the domain name X to be detected and the target domain name Y as input to the J-W algorithm, and generate the detection matrix I(X,Y) of size len_x × len_y;

Step 4.2: Calculate the matching window value MW according to formula (1):

Step 4.3: From the detection matrix I(X,Y) and the matching window value MW, calculate the number of matched characters m and the number of transpositions n of the matched characters according to the relevant rules;

Step 4.4: From m and n obtained in Step 4.3, calculate the Jaro Distance Dis_j of str_x and str_y according to formula (2):

Step 4.5: Obtain the longest common substring str_xy of str_x and str_y, and obtain its length len_xy;

Step 4.6: Further calculate the Jaro-Winkler Distance Dis_jw of str_x and str_y according to formula (3):

where b_t is the threshold that determines whether the further calculation is required, and p is a scaling factor.
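The images for formulas (1) and (2) are not reproduced in this text. As a hedged reference only, the standard Jaro definitions are given below. They are assumed, not confirmed, to be what formulas (1) and (2) contain, although they are consistent with the values of m and n and with the halving of the transposition count reported in Examples 2 and 3. Formula (3) is specific to this method and is not reconstructed here.

```latex
% (1) matching window, standard Jaro form (assumed)
MW = \left\lfloor \frac{\max(len_x,\; len_y)}{2} \right\rfloor - 1

% (2) Jaro Distance, standard form (assumed); Dis_j = 0 when m = 0
Dis_j = \frac{1}{3}\left( \frac{m}{len_x} + \frac{m}{len_y} + \frac{m - n}{m} \right)
```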

The domain name X to be detected and the target domain name Y in Step 1 may each be a primary domain name or a secondary domain name.

In Step 2, if the domain name X to be detected and the target domain name Y are primary domain names, only the domain name suffix needs to be ignored. In addition, because Chinese domain names are not yet widespread, some domain names may be registered without the "www" network name; in that case the initialization in Step 2 can be adjusted accordingly. In short, the next step can proceed as long as the domain name main body has been extracted.
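The initialization in Step 2 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `domain_main_body` is hypothetical, the name is assumed to have the shape `[www.]main body.suffix`, and everything between an optional leading "www" label and the final suffix label is treated as the main body.

```python
import re

def domain_main_body(domain: str) -> str:
    """Return the main body of a Chinese domain name (illustrative sketch).

    Assumptions: labels are separated by "." or the Chinese full stop "。";
    a leading "www" label is the network name; the last label is the suffix.
    """
    labels = [part for part in re.split(r"[.。]", domain.strip()) if part]
    if labels and labels[0].lower() == "www":
        labels = labels[1:]            # drop the network name when present
    if len(labels) >= 2:
        labels = labels[:-1]           # drop the domain suffix
    return "".join(labels)

print(domain_main_body("www.今日科技.中国"))   # 今日科技
print(domain_main_body("令日科技。中国"))       # 令日科技
```

Detecting the network name by its literal value rather than by position keeps the sketch usable for names registered without "www".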

In Step 1, the domain name X to be detected and the target domain name Y must conform to conventional domain name rules; that is, after the initialization in Step 2, the generated main-body Chinese character sets x: {x₁, x₂, …, x_p} and y: {y₁, y₂, …, y_q} must satisfy:

p, q ∈ N₊

Similarly, after the encoding in Step 3, the lengths len_x and len_y of the generated encoded strings str_x and str_y must satisfy:

len_x, len_y ∈ N₊.

The Unicode Chinese character stroke-order table in Step 3 encodes the strokes horizontal, vertical, left-falling, right-falling and turning as the digits 1, 2, 3, 4 and 5 respectively, and every Chinese character is encoded in this way.

The encoding rule in Step 3 is as follows. An empty string is generated first. Then the main-body Chinese character set x: {x₁, x₂, …, x_p} of the domain name X to be detected is traversed in order; for each Chinese character x_i, i∈[1,p], its stroke order is looked up in the Unicode Chinese character stroke-order table and appended to the tail of the string. Once all elements of the set have been processed, the resulting string is the encoded string str_x of the main body of the domain name X to be detected. The target domain name Y is processed in the same way to generate the encoded string str_y of its main body.
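The encoding rule can be sketched as below. The table is a hypothetical three-entry excerpt: the digit codes are the ones quoted in Example 2, while the characters paired with them are an assumption made for illustration. A real system would load the full Unicode stroke-order table.

```python
# Hypothetical excerpt of a stroke-order table. The digit codes come from
# Example 2; the character assigned to each code is an assumption.
STROKE_TABLE = {
    "治": "44154251",
    "病": "4134112534",
    "冶": "4154251",
}

def encode_main_body(main_body: str, table: dict = STROKE_TABLE) -> str:
    """Concatenate the stroke codes of each character, in character order."""
    encoded = ""                           # start from an empty string
    for ch in main_body:
        if ch not in table:
            raise KeyError(f"no stroke code for {ch!r}")
        encoded += table[ch]               # append the code to the tail
    return encoded

str_x = encode_main_body("治病")
str_y = encode_main_body("冶病")
print(str_x, len(str_x))   # 441542514134112534 18
print(str_y, len(str_y))   # 41542514134112534 17
```

Raising on an unknown character, rather than skipping it, keeps len_x and len_y meaningful for the later steps.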

For the number of matched characters m in Step 4.3: if the distance between two identical characters in the encoded strings str_x and str_y is smaller than the matching window value MW, the characters are considered matched. Note, however, that characters already matched must be excluded during matching; once a match is found, the search stops and matching moves on to the next character.

For the number of transpositions n of the matched characters, compare the order of the matched characters in str_x with the order of the matched characters in str_y; half of the number of positions at which the two orders differ (rounded down) is the number of transpositions n. In addition, the number of matched characters m and the number of transpositions n should satisfy the following requirement:

In Step 4.6, the threshold b_t for further calculation is 0.7 and may be adjusted slightly according to actual detection results, mainly to improve detection accuracy. The scaling factor p is 0.1 and may likewise be adjusted slightly according to actual detection results; its main purpose is to keep the final result from exceeding 1. However, because this method improves the calculation formula by adding the reciprocal of the longest length among the encoded strings str_x and str_y, the value of the scaling factor p has little influence on the final result.
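The sketch below illustrates Steps 4.5 and 4.6. The longest-common-substring routine is a standard dynamic-programming technique. The function `boosted_distance`, however, is only a guess at the shape of formula (3), which is not reproduced in this text: it assumes the Winkler prefix length is replaced by len_xy and multiplied by the reciprocal of the longer string length, as the prose suggests. Both function names are hypothetical.

```python
def longest_common_substring_len(a: str, b: str) -> int:
    """Length of the longest contiguous substring shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr[j] = prev[j - 1] + 1      # extend the run ending here
                best = max(best, curr[j])
        prev = curr
    return best

def boosted_distance(dis_j: float, str_x: str, str_y: str,
                     b_t: float = 0.7, p: float = 0.1) -> float:
    """ASSUMED form of formula (3); not taken from the source."""
    if dis_j < b_t:
        return dis_j                           # below threshold: no further step
    len_xy = longest_common_substring_len(str_x, str_y)
    longest = max(len(str_x), len(str_y))
    return dis_j + len_xy * p * (1.0 / longest) * (1.0 - dis_j)

x = "344525113123444121211254"
y = "3445425113123444121211254"
print(longest_common_substring_len(x, y))      # 20
```

Under this assumed form the result stays within [Dis_j, 1] whenever p ≤ 1, because len_xy can never exceed the longer string length.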

The values Dis_j and Dis_jw calculated in Step 4.4 and Step 4.6 should satisfy the following requirement:

If the requirement is not satisfied, the calculation is wrong and must be repeated. If it is satisfied, the closer the value is to 1, the more similar the domain name X to be detected and the target domain name Y are.

Usually, the domain name X to be detected must be compared against a set of target domain names Y₁, Y₂, …, Y_k. To improve detection speed, the encoded string of each target domain name Y_i, i∈[1,k], can be calculated in advance and stored in a database, then retrieved directly when needed.
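The precomputation idea can be sketched as follows. A plain dictionary stands in for the database, the function names are hypothetical, and `difflib.SequenceMatcher` is used only as a placeholder scoring function so that the example runs on its own; it is not the measure described in this document. The character-to-code pairs are assumptions for illustration.

```python
from difflib import SequenceMatcher

def build_target_index(targets, encode):
    """Encode every target main body once; the dict stands in for a database."""
    return {name: encode(name) for name in targets}

def rank_targets(str_x, index, similarity):
    """Score str_x against each stored code, most similar first."""
    scored = [(name, similarity(str_x, code)) for name, code in index.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Assumed character-to-code pairs, used here as a stand-in encoder.
codes = {
    "治病": "441542514134112534",
    "今日科技": "344525113123444121211254",
}
index = build_target_index(codes, codes.__getitem__)

def placeholder_similarity(a: str, b: str) -> float:
    """Placeholder only; swap in the J-W based measure in practice."""
    return SequenceMatcher(None, a, b).ratio()

ranking = rank_targets("41542514134112534", index, placeholder_similarity)
print(ranking[0][0])   # 治病
```

Because the target codes are computed once, each detection request only encodes X and performs k comparisons.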

The beneficial effects of the invention are as follows. The method maps Chinese characters to a numeric string through a Unicode Chinese character stroke-order table, and introduces the Jaro-Winkler Distance algorithm from the field of machine learning, combined with the longest common substring, to measure the similarity of Chinese domain names. First, a domain name to be detected and a target domain name are acquired and initialized to produce their domain name main bodies. Second, each main body is encoded according to the Unicode Chinese character stroke-order table into a numeric string, which serves as input to the Jaro-Winkler Distance algorithm to generate a detection matrix. Then the similarity of the numeric strings is calculated according to the relevant rules in combination with their longest common substring; this similarity effectively represents the similarity between the Chinese characters. Compared with the prior art, the method mainly addresses insufficient accuracy and poor efficiency, and aims to improve both the accuracy and the timeliness of Chinese domain name similarity measurement.

Drawings

FIG. 1 is a schematic flow diagram of the present invention.

Detailed Description

The invention is further described with reference to the following drawings and detailed description.

Example 1: As shown in FIG. 1, a Chinese domain name similarity measurement method based on J-W distance comprises the following steps:

Step 1: Acquire a domain name X to be detected and a target domain name Y;

Step 4.2: Calculate the matching window value MW according to formula (1):

Step 4.3: From the detection matrix and the matching window value MW, calculate the number of matched characters m and the number of transpositions n of the matched characters according to the relevant rules.

The main-body Chinese character sets generated by the initialization in Step 2 must satisfy:

p, q ∈ N₊

and the lengths of the encoded strings generated by the encoding in Step 3 must satisfy:

len_x, len_y ∈ N₊.

The encoding rule in Step 3 is as follows. An empty string is generated first. Then the main-body Chinese character set x: {x₁, x₂, …, x_p} of the domain name X to be detected is traversed in order; for each Chinese character x_i, i∈[1,p], its stroke order is looked up in the Unicode Chinese character stroke-order table and appended to the tail of the string. Once all elements of the set have been processed, the resulting string is the encoded string str_x of the main body of the domain name X to be detected. The target domain name Y is processed in the same way to generate the encoded string str_y of its main body.

For the number of matched characters m: if the distance between two identical characters in the encoded strings str_x and str_y is smaller than the matching window value MW, the characters are considered matched. Note, however, that characters already matched must be excluded during matching; once a match is found, the search stops and matching moves on to the next character.


Example 2: The calculation of the number of matched characters m and the number of transpositions n is further explained on the basis of Example 1. Assume that the main bodies of the domain name X to be detected and the target domain name Y are "治病" and "冶病" respectively. Looking up the Unicode Chinese character stroke-order table gives the codes "治" → 44154251, "病" → 4134112534 and "冶" → 4154251, so the encoded strings str_x and str_y are "441542514134112534" and "41542514134112534" respectively.

Calculate the matching window value MW:
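The equation image is missing at this point. Assuming formula (1) is the standard Jaro matching window, the encoded lengths len_x = 18 and len_y = 17 would give:

```latex
MW = \left\lfloor \frac{\max(18,\; 17)}{2} \right\rfloor - 1 = 8
```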

Using the detection matrix I(X,Y) of size 18×17, calculate the number of matched characters m and the number of transpositions n:

In the detection matrix table, "/" indicates that the cell lies outside the matching window value MW and is not considered for matching; "1" indicates that the column character matches the row character; "0" indicates that the column character does not match the row character.
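The matrix table itself is not reproduced in this text. A minimal sketch of how such a matrix could be built is shown below. It assumes that rows follow str_x and columns follow str_y, that a cell is inside the window when |i − j| ≤ MW, and that MW takes the standard Jaro form; the function name is hypothetical.

```python
def detection_matrix(str_x: str, str_y: str) -> list:
    """Build a len_x-by-len_y matrix of "/", "1" and "0" cells (sketch)."""
    # Assumed window: standard Jaro form, clamped at zero for short strings.
    mw = max(0, max(len(str_x), len(str_y)) // 2 - 1)
    matrix = []
    for i, cx in enumerate(str_x):
        row = []
        for j, cy in enumerate(str_y):
            if abs(i - j) > mw:
                row.append("/")                # outside the matching window
            else:
                row.append("1" if cx == cy else "0")
        matrix.append(row)
    return matrix

matrix_xy = detection_matrix("441542514134112534", "41542514134112534")
print(len(matrix_xy), "x", len(matrix_xy[0]))  # 18 x 17
for row in matrix_xy[:3]:
    print(" ".join(row))
```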

In summary, the number of matched characters m is 17. The matched character sequence of str_x is {4,4,1,5,4,2,5,1,4,1,3,4,1,1,2,5,3}, while that of str_y is "41542514134112534"; the two sequences differ at 15 positions, so the number of transpositions n of the matched characters is 7.
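These counts can be reproduced with a short sketch of standard Jaro matching. The sketch assumes the standard window ⌊max(len_x, len_y)/2⌋ − 1 with an inclusive distance test, which is an assumption about formula (1) and the matching rule; the function name `match_counts` is hypothetical. With these assumptions the values of both Example 2 and Example 3 come out as stated.

```python
def match_counts(str_x: str, str_y: str) -> tuple:
    """Return (m, n) using standard Jaro matching (assumed rules)."""
    len_x, len_y = len(str_x), len(str_y)
    mw = max(0, max(len_x, len_y) // 2 - 1)
    used_y = [False] * len_y
    matched_x = []
    for i, ch in enumerate(str_x):
        lo, hi = max(0, i - mw), min(len_y, i + mw + 1)
        for j in range(lo, hi):
            if not used_y[j] and str_y[j] == ch:
                used_y[j] = True           # exclude it from later matches
                matched_x.append(ch)
                break                      # move on to the next character
    matched_y = [str_y[j] for j in range(len_y) if used_y[j]]
    differing = sum(a != b for a, b in zip(matched_x, matched_y))
    return len(matched_x), differing // 2  # n is half the differing positions

print(match_counts("441542514134112534", "41542514134112534"))
# (17, 7)
print(match_counts("344525113123444121211254", "3445425113123444121211254"))
# (24, 8)
```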

Example 3: The practice of the invention is further illustrated on the basis of Example 1. Assume that the domain name X to be detected and the target domain name Y are "今日科技.中国" and "令日科技.中国" respectively, so that the initialized domain name main bodies are "今日科技" and "令日科技". Looking up the corresponding Chinese character codes in the Unicode Chinese character stroke-order table and applying the encoding rule gives the encoded strings str_x and str_y as "344525113123444121211254" and "3445425113123444121211254" respectively.

Calculate the matching window value MW:

Using the detection matrix I(X,Y) of size 24×25, the number of matched characters is calculated as m = 24 and the number of transpositions of the matched characters as n = 8.

Calculate the Jaro Distance of the encoded strings str_x and str_y:
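The equation image is missing at this point. Assuming formula (2) is the standard Jaro Distance, substituting len_x = 24, len_y = 25, m = 24 and n = 8 would give the value below. It is a worked illustration under that assumption, not the figure printed in the original.

```latex
Dis_j = \frac{1}{3}\left( \frac{24}{24} + \frac{24}{25} + \frac{24 - 8}{24} \right) \approx 0.8756
```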

The longest common substring str_xy has length len_xy = 20; the Jaro-Winkler Distance of the encoded strings str_x and str_y is then calculated:

The result shows that the domain name X to be detected and the target domain name Y look similar to the human eye, and the result calculated by the invention agrees with this visual judgment. The method effectively guards against makers of counterfeit websites replacing a Chinese character with a similar-looking one, overcomes the low accuracy and low efficiency of prior-art judgment, and is more user-friendly in practical application.

Example 4: On the basis of Example 3, suppose that the domain name X to be detected and the target domain name Y are "science and technology of this day" and "science and technology of china of this day" respectively. The final Jaro-Winkler Distance is calculated by the steps described in Example 3:

Taken together, this example and Example 3 show that the method judges the similarity of Chinese domain names well, with results almost identical to human visual judgment. It effectively guards against counterfeiters replacing a Chinese character with a similar-looking one, overcomes the low accuracy and low efficiency of prior-art judgment, and is more user-friendly in practical application.

While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit and scope of the present invention.

Claims

1. A Chinese domain name similarity measurement method based on J-W distance is characterized by comprising the following steps:

Step 1: Acquire a domain name X to be detected and a target domain name Y;

Step 3: Following the Unicode Chinese character stroke-order table, traverse the main-body Chinese character sets x: {x₁, x₂, …, x_p} and y: {y₁, y₂, …, y_q} obtained in Step 2 in character order. For each Chinese character x_i, i∈[1,p], or y_i, i∈[1,q], look up its stroke order and convert it according to the corresponding encoding rule, generating the encoded string str_x of the main body of the domain name X to be detected and the encoded string str_y of the main body of the target domain name Y, and obtain the lengths len_x and len_y of str_x and str_y;

Step 4.2: Calculate the matching window value MW according to formula (1):

Step 4.3: From the detection matrix and the matching window value MW, calculate the number of matched characters m and the number of transpositions n of the matched characters according to the relevant rules. For the number of matched characters m: if the distance between two identical characters in the encoded strings str_x and str_y is smaller than the matching window value MW, the characters are considered matched; during matching, characters already matched must be excluded, and once a match is found the search stops and matching moves on to the next character;

For the number of transpositions n of the matched characters, compare the order of the matched characters in str_x with the order of the matched characters in str_y; half of the number of positions at which the two orders differ is the number of transpositions n. The number of matched characters m and the number of transpositions n should satisfy the following requirement:

Step 4.4: From the number of matched characters m and the number of transpositions n obtained in Step 4.3, calculate the Jaro Distance of the encoded strings str_x and str_y of the main bodies of the domain name X to be detected and the target domain name Y according to formula (2):

2. The J-W distance-based Chinese domain name similarity measurement method according to claim 1, wherein: the domain name X to be detected and the target domain name Y in Step 1 may each be a primary domain name or a secondary domain name.

3. The J-W distance-based Chinese domain name similarity measurement method according to claim 1, wherein: in Step 2, if the domain name X to be detected and the target domain name Y are primary domain names, only the domain name suffix needs to be ignored.

4. The J-W distance-based Chinese domain name similarity measurement method according to claim 1, wherein: in Step 1, the domain name X to be detected and the target domain name Y must conform to conventional domain name rules; that is, after the initialization in Step 2, the generated main-body Chinese character sets x: {x₁, x₂, …, x_p} and y: {y₁, y₂, …, y_q} must satisfy:

p, q ∈ N₊

Similarly, after the encoding in Step 3, the lengths len_x and len_y of the generated encoded strings must satisfy:

len_x, len_y ∈ N₊.

5. The J-W distance-based Chinese domain name similarity measurement method according to claim 1, wherein: the Unicode Chinese character stroke-order table in Step 3 encodes the strokes horizontal, vertical, left-falling, right-falling and turning as the digits 1, 2, 3, 4 and 5 respectively, and every Chinese character is encoded in this way.

6. The J-W distance-based Chinese domain name similarity measurement method according to claim 1, wherein: according to the encoding rule in Step 3, an empty string is generated first; then the main-body Chinese character set x: {x₁, x₂, …, x_p} of the domain name X to be detected is traversed in order; for each Chinese character x_i, i∈[1,p], its stroke order is looked up in the Unicode Chinese character stroke-order table and appended to the tail of the string; once all elements of the set have been processed, the resulting string is the encoded string str_x of the main body of the domain name X to be detected; the target domain name Y is processed in the same way to generate the encoded string str_y of its main body.

7. The J-W distance-based Chinese domain name similarity measurement method according to claim 1, wherein: in Step 4.6, the threshold b_t for further calculation is 0.7, and the scaling factor p is 0.1.

8. The J-W distance-based Chinese domain name similarity measurement method according to claim 1, wherein: the values Dis_j and Dis_jw calculated in Step 4.4 and Step 4.6 should satisfy the following requirement: