Phishing website identification system and method
The present application is a divisional application of the Chinese invention patent application filed on June 28, 2012, with Application No. 201210224485.3 and entitled "Phishing website identification system and method".
Technical field
The present invention relates to the technical field of network security, and in particular to a phishing website identification system and method.
Background art
With the development of the Internet, the number of Internet users has grown year by year. In addition to traditional online threats such as Trojans and viruses, the number of phishing websites has increased significantly over the past two years.
The current mainstream phishing website identification technique collects common phishing websites into a knowledge base and then calculates the similarity between a newly discovered web page and the phishing websites in the knowledge base, thereby judging whether the page is a phishing website.
This knowledge-base approach can generally identify only phishing websites of known categories and cannot identify phishing websites of new types. For example, if the knowledge base contains only phishing websites related to Bank of China, a phishing website that counterfeits Industrial and Commercial Bank cannot be identified.
Summary of the invention
The technical problem to be solved by the present invention is how to provide a phishing website identification system and method that effectively identify phishing websites of new types.
To solve the above technical problem, the present invention provides a phishing website identification system comprising: a domain name acquisition unit, a domain name statistics unit, and a website identification unit.
The domain name acquisition unit is adapted to collect all links appearing in a website to be identified and to obtain the domain name corresponding to each link.
The domain name statistics unit is adapted to count the number of times each domain name appears in the website to be identified and to find the domain name with the most occurrences, which is recorded as the target domain name.
The website identification unit is adapted to judge, according to the target domain name and the own domain name of the website to be identified, whether the website to be identified is a phishing website.
The website identification unit comprises: a comparison subunit and an identification subunit.
The comparison subunit is adapted to compare the target domain name with the own domain name and, when the comparison result shows that the two are identical, to judge that the website to be identified is not a phishing website.
The identification subunit is adapted to, when the target domain name differs from the own domain name, calculate the ratio between the number of occurrences of the target domain name and the number of occurrences of the own domain name, calculate the similarity between the target domain name and the own domain name, and then judge according to the ratio and the similarity whether the website to be identified is a phishing website.
The identification subunit comprises: a ratio calculation module, a similarity calculation module, and a judgment module.
The ratio calculation module is adapted to calculate the ratio between the number of occurrences of the target domain name and the number of occurrences of the own domain name.
The similarity calculation module is adapted to calculate the similarity between the target domain name and the own domain name.
The judgment module is adapted to judge whether the ratio and the similarity satisfy the following condition: the ratio is greater than a predetermined ratio, and the similarity is greater than a predetermined threshold. If the condition is satisfied, the website to be identified is judged to be a phishing website; otherwise, it is judged not to be a phishing website.
The similarity calculation module comprises: a character string comparison submodule, an initial value calculation submodule, and a final value calculation submodule.
The character string comparison submodule is adapted to build a comparison array from the character string of the target domain name and the character string of the own domain name: the character string of the target domain name is placed in the first row of the comparison array and kept in a fixed position, the character string of the own domain name is placed in the second row and moved from left to right, and the overlapping characters of the two rows are compared.
The initial value calculation submodule is adapted to calculate a first similarity calculation value $Q_1$ between the target domain name and the own domain name when the first character of the target domain name is aligned with the last character of the own domain name; to calculate a second similarity calculation value $Q_2$ when the second character of the target domain name is aligned with the last character of the own domain name; and so on, until the last character of the target domain name is aligned with the first character of the own domain name, at which point the m-th similarity calculation value $Q_m$ is calculated. Here $m = n_1 + n_2 - 1$, where $n_1$ is the string length of the target domain name and $n_2$ is the string length of the own domain name.
The final value calculation submodule is adapted to calculate the similarity $Q_{\max}$ between the target domain name and the own domain name according to the following formula:
$Q_{\max} = \max\{Q_1, Q_2, Q_3, \ldots, Q_m\}$.
In the initial value calculation submodule, the i-th similarity calculation value $Q_i$ is calculated using the following formula:
$Q_i = M_i^2 \times L_i$;
where i is a natural number with $1 \le i \le m$, and
$M_i = s_i / n_{\max}$;
$L_i = r_i / n_{\max}$;
where $r_i$ is the number of overlapping characters between the character string of the own domain name and the character string of the target domain name at the i-th comparison; $n_{\max}$ is the number of characters in the longer of the two character strings; $L_i$ is the overlap rate of the two character strings at the i-th comparison; $s_i$ is the number of characters that overlap and are identical at the i-th comparison; and $M_i$ is the matching rate of the two character strings at the i-th comparison.
Alternatively, in the initial value calculation submodule, the i-th similarity calculation value $Q_i$ is calculated as follows:
At the i-th comparison, the number of characters that overlap and are identical between the character string of the target domain name and the character string of the own domain name is counted, and this number is taken as the i-th similarity calculation value $Q_i$.
The system further comprises a supplementary identification unit.
The supplementary identification unit is adapted to record a website to be identified that has been judged to be a phishing website as a suspicious website, to perform supplementary identification on the suspicious website, and, when the supplementary identification result still shows that the suspicious website is a phishing website, to add the suspicious website to a phishing website library.
The domain name corresponding to a link is determined from the absolute address of the link.
The system further comprises a website acquisition unit.
The website acquisition unit is adapted to search for newly created websites to serve as websites to be identified.
The present invention also provides a phishing website identification method comprising the steps of:
collecting all links appearing in a website to be identified, and obtaining the domain name corresponding to each link;
counting the number of times each domain name appears in the website to be identified, and finding the domain name with the most occurrences, which is recorded as the target domain name;
judging, according to the target domain name and the own domain name of the website to be identified, whether the website to be identified is a phishing website.
The step of judging, according to the target domain name and the own domain name of the website to be identified, whether the website to be identified is a phishing website further comprises the steps of:
judging whether the target domain name is identical to the own domain name; if so, judging that the website to be identified is not a phishing website and ending the flow; otherwise, performing the next step;
calculating the ratio between the number of occurrences of the target domain name and the number of occurrences of the own domain name, calculating the similarity between the target domain name and the own domain name, and judging according to the ratio and the similarity whether the website to be identified is a phishing website.
The step of calculating the ratio and the similarity and judging according to them whether the website to be identified is a phishing website further comprises the steps of:
calculating the ratio between the number of occurrences of the target domain name and the number of occurrences of the own domain name;
calculating the similarity between the target domain name and the own domain name;
judging whether the following condition is satisfied: the ratio is greater than a predetermined ratio, and the similarity is greater than a predetermined threshold; if so, judging that the website to be identified is a phishing website; otherwise, judging that the website to be identified is not a phishing website.
The step of calculating the similarity between the target domain name and the own domain name further comprises the steps of:
building a comparison array from the character string of the target domain name and the character string of the own domain name, placing the character string of the target domain name in the first row of the comparison array and keeping its position fixed, placing the character string of the own domain name in the second row of the comparison array and moving it from left to right, and comparing the overlapping characters of the two rows;
when the first character of the target domain name is aligned with the last character of the own domain name, calculating a first similarity calculation value $Q_1$ between the target domain name and the own domain name; when the second character of the target domain name is aligned with the last character of the own domain name, calculating a second similarity calculation value $Q_2$; and so on, until the last character of the target domain name is aligned with the first character of the own domain name, at which point the m-th similarity calculation value $Q_m$ is calculated; here $m = n_1 + n_2 - 1$, where $n_1$ is the string length of the target domain name and $n_2$ is the string length of the own domain name;
calculating the similarity $Q_{\max}$ between the target domain name and the own domain name according to the following formula:
$Q_{\max} = \max\{Q_1, Q_2, Q_3, \ldots, Q_m\}$.
In the above step of calculating the similarity calculation values $Q_1$ to $Q_m$ as the character string of the own domain name is moved across the character string of the target domain name, the i-th similarity calculation value $Q_i$ is calculated by the following formula:
$Q_i = M_i^2 \times L_i$;
where i is a natural number with $1 \le i \le m$, and
$M_i = s_i / n_{\max}$;
$L_i = r_i / n_{\max}$;
where $r_i$ is the number of overlapping characters between the character string of the own domain name and the character string of the target domain name at the i-th comparison; $n_{\max}$ is the number of characters in the longer of the two character strings; $L_i$ is the overlap rate of the two character strings at the i-th comparison; $s_i$ is the number of characters that overlap and are identical at the i-th comparison; and $M_i$ is the matching rate of the two character strings at the i-th comparison.
Alternatively, in the above step of calculating the similarity calculation values $Q_1$ to $Q_m$, the i-th similarity calculation value $Q_i$ is calculated as follows:
At the i-th comparison, the number of characters that overlap and are identical between the character string of the target domain name and the character string of the own domain name is counted, and this number is taken as the i-th similarity calculation value $Q_i$.
After the step of judging, according to the target domain name and the own domain name of the website to be identified, whether the website to be identified is a phishing website, the method further comprises the step of: recording a website to be identified that has been judged to be a phishing website as a suspicious website, performing supplementary identification on the suspicious website, and, when the supplementary identification result still shows that the suspicious website is a phishing website, adding the suspicious website to a phishing website library.
The domain name corresponding to a link is determined from the absolute address of the link.
Before the step of collecting all links appearing in the website to be identified and obtaining the domain name corresponding to each link, the method further comprises the step of: searching for newly created websites to serve as websites to be identified.
The phishing website identification system and method of the present invention identify phishing websites based on the link relationships within a website and can therefore effectively identify phishing websites of new types. They also help enrich the number and types of phishing websites in the phishing website library, which facilitates further identification and searching of phishing websites, and they have broad application prospects in the field of network security.
Brief description of the drawings
Fig. 1 is a schematic diagram of the module structure of the phishing website identification system according to Embodiment 1 of the present invention;
Fig. 2 is a schematic diagram of the module structure of the website identification unit;
Fig. 3 is a schematic diagram of the module structure of the identification subunit;
Fig. 4 is a schematic diagram of the module structure of the similarity calculation module;
Fig. 5 is a schematic diagram of the module structure of the phishing website identification system according to Embodiment 2 of the present invention;
Fig. 6 is a flow chart of the phishing website identification method according to Embodiment 3 of the present invention;
Fig. 7 is a flow chart of the phishing website identification method according to Embodiment 4 of the present invention.
Detailed description of the embodiments
Specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings and examples. The following embodiments illustrate the present invention but do not limit its scope.
Fig. 1 is a schematic diagram of the module structure of the phishing website identification system according to Embodiment 1 of the present invention. As shown in Fig. 1, the system comprises: a domain name acquisition unit 100, a domain name statistics unit 200, and a website identification unit 300.
The domain name acquisition unit 100 is adapted to collect all links appearing in the website to be identified and to obtain the domain name corresponding to each link. The domain name corresponding to a link is determined from the absolute address of the link; if a link in the website to be identified uses a relative address, it must first be converted into an absolute address.
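By way of illustration only, the collection of links and the conversion of relative addresses might be sketched in Python as follows. The class name LinkCollector and the function name link_domains are hypothetical names introduced here; the sketch also assumes that only the href attributes of <a> tags count as links and that the host name of the absolute address is used as the domain name, neither of which is specified by the embodiment.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag (assumed definition of a link)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def link_domains(page_url, html):
    """Return the domain name of every link, resolving relative addresses."""
    collector = LinkCollector()
    collector.feed(html)
    domains = []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)  # relative address -> absolute address
        host = urlparse(absolute).hostname  # assumed: host name is the domain name
        if host:                            # skips links without a host, e.g. mailto:
            domains.append(host)
    return domains
```

A relative link such as /login on a page at http://cocc.cn/index.html is resolved against the page address, so it is counted under the website's own domain name.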
The domain name statistics unit 200 is adapted to count the number of times each domain name appears in the website to be identified and to find the domain name with the most occurrences, which is recorded as the target domain name. The domain name statistics unit 200 may use the domain name as the key and the number of occurrences as the value to generate a key-value table, and then sort the domain names by value to obtain the domain name with the most occurrences.
The website identification unit 300 is adapted to judge, according to the target domain name and the own domain name of the website to be identified, whether the website to be identified is a phishing website.
Fig. 2 is a schematic diagram of the module structure of the website identification unit. As shown in Fig. 2, the website identification unit 300 further comprises: a comparison subunit 310 and an identification subunit 320.
The comparison subunit 310 is adapted to compare the target domain name with the own domain name and, when the comparison result shows that the two are identical, to judge that the website to be identified is not a phishing website.
The identification subunit 320 is adapted to, when the target domain name differs from the own domain name, calculate the ratio between the number of occurrences of the target domain name and the number of occurrences of the own domain name, calculate the similarity between the target domain name and the own domain name, and then judge according to the ratio and the similarity whether the website to be identified is a phishing website.
Fig. 3 is a schematic diagram of the module structure of the identification subunit. As shown in Fig. 3, the identification subunit 320 further comprises: a ratio calculation module 321, a similarity calculation module 322, and a judgment module 323.
The ratio calculation module 321 is adapted to calculate the ratio between the number of occurrences of the target domain name and the number of occurrences of the own domain name.
The similarity calculation module 322 is adapted to calculate the similarity between the target domain name and the own domain name.
Fig. 4 is a schematic diagram of the module structure of the similarity calculation module. As shown in Fig. 4, the similarity calculation module 322 further comprises: a character string comparison submodule 322a, an initial value calculation submodule 322b, and a final value calculation submodule 322c.
The character string comparison submodule 322a is adapted to build a comparison array from the character string of the target domain name and the character string of the own domain name: the character string of the target domain name is placed in the first row of the comparison array and kept in a fixed position, the character string of the own domain name is placed in the second row and moved from left to right, and the overlapping characters of the two rows are compared.
The initial value calculation submodule 322b is adapted to calculate a first similarity calculation value $Q_1$ between the target domain name and the own domain name when the first character of the target domain name is aligned with the last character of the own domain name; to calculate a second similarity calculation value $Q_2$ when the second character of the target domain name is aligned with the last character of the own domain name; and so on, until the last character of the target domain name is aligned with the first character of the own domain name, at which point the m-th similarity calculation value $Q_m$ is calculated. Here $m = n_1 + n_2 - 1$, where $n_1$ is the string length of the target domain name and $n_2$ is the string length of the own domain name.
In the initial value calculation submodule 322b, the i-th similarity calculation value $Q_i$ is calculated using the following formula:
$Q_i = M_i^2 \times L_i$;
where i is a natural number with $1 \le i \le m$, and
$M_i = s_i / n_{\max}$;
$L_i = r_i / n_{\max}$;
where $r_i$ is the number of overlapping characters between the character string of the own domain name and the character string of the target domain name at the i-th comparison; $n_{\max}$ is the number of characters in the longer of the two character strings; $L_i$ is the overlap rate of the two character strings at the i-th comparison; $s_i$ is the number of characters that overlap and are identical at the i-th comparison; and $M_i$ is the matching rate of the two character strings at the i-th comparison.
For example, suppose the own domain name is boc.cn and is moved from left to right, while the target domain name is cocc.cn and is kept in a fixed position. At the 1st comparison, only the character n (of boc.cn) overlaps with the character c (of cocc.cn), so $r_1 = 1$ and $s_1 = 0$. At the 2nd comparison, the character n overlaps with the character o and the character c overlaps with the character c, so $r_2 = 2$ and $s_2 = 1$.
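The sliding comparison in this example might be sketched in Python as follows. This is an illustrative sketch only; the function name sliding_overlaps and its generator interface are hypothetical and are not part of the described system.

```python
def sliding_overlaps(target, own):
    """Yield (i, r_i, s_i) for each of the m = n1 + n2 - 1 comparisons.

    The target string stays fixed; the own string slides from left to right.
    At comparison i, the last character of `own` sits under the i-th
    character of `target`.
    """
    n1, n2 = len(target), len(own)
    for i in range(1, n1 + n2):
        r = s = 0
        for j, ch in enumerate(own):
            pos = i - n2 + j              # index in target above own[j]
            if 0 <= pos < n1:
                r += 1                    # overlapping character
                if target[pos] == ch:
                    s += 1                # overlapping and identical
        yield i, r, s
```

Calling list(sliding_overlaps("cocc.cn", "boc.cn")) produces 12 comparisons, beginning with (1, 1, 0) and (2, 2, 1), which agrees with the values of $r_1$, $s_1$, $r_2$ and $s_2$ given in the example.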
Alternatively, in the initial value calculation submodule, the i-th similarity calculation value $Q_i$ may be calculated as follows:
At the i-th comparison, the number of characters that overlap and are identical between the character string of the target domain name and the character string of the own domain name is counted, and this number is taken as the i-th similarity calculation value $Q_i$.
Other known methods may also be used to calculate the i-th similarity calculation value $Q_i$; since they are not the focus of the present invention, they are not described here.
The final value calculation submodule 322c is adapted to calculate the similarity $Q_{\max}$ between the target domain name and the own domain name according to the following formula:
$Q_{\max} = \max\{Q_1, Q_2, Q_3, \ldots, Q_m\}$.
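Combining the sliding comparison with the formula $Q_i = M_i^2 \times L_i$, the calculation of $Q_{\max}$ might be sketched in Python as follows. The function name domain_similarity is a hypothetical name introduced for illustration, and returning 0.0 when no comparison is possible is an assumption.

```python
def domain_similarity(target, own):
    """Return Q_max = max(Q_1 .. Q_m), with Q_i = M_i**2 * L_i."""
    n1, n2 = len(target), len(own)
    n_max = max(n1, n2)                  # length of the longer string
    q_max = 0.0                          # assumed result if there is nothing to compare
    for i in range(1, n1 + n2):          # m = n1 + n2 - 1 comparisons
        r = s = 0
        for j, ch in enumerate(own):
            pos = i - n2 + j             # index in target above own[j]
            if 0 <= pos < n1:
                r += 1                   # overlapping character
                if target[pos] == ch:
                    s += 1               # overlapping and identical
        m_i = s / n_max                  # matching rate M_i
        l_i = r / n_max                  # overlap rate L_i
        q_max = max(q_max, m_i ** 2 * l_i)
    return q_max
```

For the domain names cocc.cn and boc.cn, this sketch gives $Q_{\max} = 96/343 \approx 0.28$, reached at the 7th comparison ($r_7 = 6$, $s_7 = 4$, $n_{\max} = 7$); two identical strings give a similarity of 1.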
The judgment module 323 is adapted to judge whether the ratio and the similarity satisfy the following condition: the ratio is greater than a predetermined ratio, and the similarity is greater than a predetermined threshold. If the condition is satisfied, the website to be identified is judged to be a phishing website; otherwise, it is judged not to be a phishing website. The predetermined ratio and the predetermined threshold can be set and adjusted according to actual use; in this embodiment, the predetermined ratio is preferably 1.0 and the predetermined threshold is preferably 80%.
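The decision logic of the comparison subunit and the judgment module might be sketched in Python as follows, using the preferred values of this embodiment as defaults. The function name judge_phishing and its parameter names are hypothetical; treating an own domain name that never appears among the links as an infinite ratio is an assumption, since the embodiment does not address that case.

```python
import math


def judge_phishing(target, own, target_count, own_count, similarity,
                   predetermined_ratio=1.0, predetermined_threshold=0.8):
    """Return True if the website to be identified is judged to be phishing."""
    if target == own:
        return False                     # most links point to the site itself
    # Assumption: an own-domain count of zero is treated as an infinite ratio.
    ratio = target_count / own_count if own_count else math.inf
    return ratio > predetermined_ratio and similarity > predetermined_threshold
```

Both conditions must hold: a page that links mostly to another domain is flagged only if that domain name also closely resembles the page's own domain name.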
Fig. 5 is a schematic diagram of the module structure of the phishing website identification system according to Embodiment 2 of the present invention. As shown in Fig. 5, the system of this embodiment is essentially the same as that of Embodiment 1, differing only in that it further comprises: a website acquisition unit 000 and a supplementary identification unit 400.
The website acquisition unit 000 is adapted to search for newly created websites to serve as websites to be identified. Phishing websites are usually newly created websites; therefore, by providing the website acquisition unit 000 and taking only newly created websites as websites to be identified, the scope of phishing website identification can be narrowed and the accuracy and speed of identification improved. Newly created websites may be found by the following methods: monitoring search engine result pages using specific keywords; or discovering, through a client, websites with few user visits.
The supplementary identification unit 400 is adapted to record a website to be identified that has been judged to be a phishing website as a suspicious website, to perform supplementary identification on the suspicious website, and, when the supplementary identification result still shows that the suspicious website is a phishing website, to add the suspicious website to the phishing website library. The supplementary identification may be performed by manual review. Providing the supplementary identification unit 400 can further improve the accuracy of phishing website identification.
Fig. 6 is a flow chart of the phishing website identification method according to Embodiment 3 of the present invention. As shown in Fig. 6, the method comprises the steps of:
A: Collect all links appearing in the website to be identified, and obtain the domain name corresponding to each link. The domain name corresponding to a link is determined from the absolute address of the link.
B: Count the number of times each domain name appears in the website to be identified, and find the domain name with the most occurrences, which is recorded as the target domain name.
C: Judge, according to the target domain name and the own domain name of the website to be identified, whether the website to be identified is a phishing website.
Step C further comprises the steps of:
C1: Judge whether the target domain name is identical to the own domain name. If so, judge that the website to be identified is not a phishing website and end the flow; otherwise, perform step C2.
C2: Calculate the ratio between the number of occurrences of the target domain name and the number of occurrences of the own domain name, calculate the similarity between the target domain name and the own domain name, and judge according to the ratio and the similarity whether the website to be identified is a phishing website.
Step C2 further comprises the steps of:
C21: Calculate the ratio between the number of occurrences of the target domain name and the number of occurrences of the own domain name.
C22: Calculate the similarity between the target domain name and the own domain name.
Step C22 further comprises the steps of:
C221: Build a comparison array from the character string of the target domain name and the character string of the own domain name. Place the character string of the target domain name in the first row of the comparison array and keep its position fixed, place the character string of the own domain name in the second row of the comparison array and move it from left to right, and compare the overlapping characters of the two rows.
C222: When the first character of the target domain name is aligned with the last character of the own domain name, calculate a first similarity calculation value $Q_1$ between the target domain name and the own domain name; when the second character of the target domain name is aligned with the last character of the own domain name, calculate a second similarity calculation value $Q_2$; and so on, until the last character of the target domain name is aligned with the first character of the own domain name, at which point the m-th similarity calculation value $Q_m$ is calculated. Here $m = n_1 + n_2 - 1$, where $n_1$ is the string length of the target domain name and $n_2$ is the string length of the own domain name.
In step C222, the i-th similarity calculation value $Q_i$ is calculated by the following formula:
$Q_i = M_i^2 \times L_i$;
where i is a natural number with $1 \le i \le m$, and
$M_i = s_i / n_{\max}$;
$L_i = r_i / n_{\max}$;
where $r_i$ is the number of overlapping characters between the character string of the own domain name and the character string of the target domain name at the i-th comparison; $n_{\max}$ is the number of characters in the longer of the two character strings; $L_i$ is the overlap rate of the two character strings at the i-th comparison; $s_i$ is the number of characters that overlap and are identical at the i-th comparison; and $M_i$ is the matching rate of the two character strings at the i-th comparison.
Alternatively, in step C222, the i-th similarity calculation value $Q_i$ may be calculated as follows:
At the i-th comparison, the number of characters that overlap and are identical between the character string of the target domain name and the character string of the own domain name is counted, and this number is taken as the i-th similarity calculation value $Q_i$.
C223: Calculate the similarity $Q_{\max}$ between the target domain name and the own domain name according to the following formula:
$Q_{\max} = \max\{Q_1, Q_2, Q_3, \ldots, Q_m\}$.
C23: Judge whether the following condition is satisfied: the ratio is greater than the predetermined ratio, and the similarity is greater than the predetermined threshold. If so, judge that the website to be identified is a phishing website; otherwise, judge that the website to be identified is not a phishing website.
Fig. 7 is a flow chart of the phishing website identification method according to Embodiment 4 of the present invention. As shown in Fig. 7, the method of this embodiment is essentially the same as that of Embodiment 3, differing only in the following:
Step A' is further included before step A: search for newly created websites to serve as websites to be identified. Newly created websites may be found by the following methods: monitoring search engine result pages using specific keywords; or discovering, through a client, websites with few user visits.
Step D is further included after step C: record a website to be identified that has been judged to be a phishing website as a suspicious website, perform supplementary identification on the suspicious website, and, when the supplementary identification result still shows that the suspicious website is a phishing website, add the suspicious website to the phishing website library. The supplementary identification may be performed by manual review.
The phishing website identification system and method according to the embodiments of the present invention identify phishing websites based on the link relationships within a website and can therefore effectively identify phishing websites of new types. They also help enrich the number and types of phishing websites in the phishing website library, which facilitates further identification and searching of phishing websites, and they have broad application prospects in the field of network security.
The above embodiments merely illustrate the present invention and do not limit it. A person of ordinary skill in the relevant technical field may make various changes and modifications without departing from the spirit and scope of the present invention; all equivalent technical solutions therefore also fall within the scope of the present invention, and the scope of patent protection of the present invention shall be defined by the claims.