CN103581347A - Inundation sub-domain identification method and system - Google Patents

Info

Publication number
CN103581347A
Authority
CN
China
Prior art keywords
slice groups
subdomain
fragment
effective slice
fragments
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201210256109.2A
Other languages
Chinese (zh)
Other versions
CN103581347B (en)
Inventor
李学凯
张锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Shiji Guangsu Information Technology Co Ltd
Original Assignee
Shenzhen Shiji Guangsu Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Shiji Guangsu Information Technology Co Ltd
Priority to CN201210256109.2A
Publication of CN103581347A
Application granted
Publication of CN103581347B
Active legal status
Anticipated expiration

Landscapes

  • Complex Calculations (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The embodiment of the invention discloses an inundation sub-domain identification method and system, relates to the technical field of computers, and is applied to search engines. Inundation sub-domains are identified according to the dispersion or concentration of the fragment lengths of any valid fragment set of the sub-domain names, which effectively improves the identification degree of inundation sub-domains. The method in the embodiment includes: obtaining sub-domain names having the same main domain name; and, if the fragment lengths of any valid fragment set of those sub-domain names are determined to be in dispersed distribution or concentrated distribution, identifying the sub-domain names corresponding to that valid fragment set as inundation sub-domains, wherein a valid fragment set is a set of fragments, within the same fragment level of the sub-domain names having the same main domain name, whose left-side and/or right-side domain name parts are respectively identical.

Description

Inundation sub-domain identification method and system
Technical field
The present invention relates to the field of computer technology, and in particular to a method and system for identifying inundation sub-domains.
Background art
The development of computer network technology has greatly improved the convenience with which people obtain information. Computer networks store a massive amount of information, and search engines are widely used so that people can find the information they need. Search engines index websites and control quality in units of sub-domain names. A sub-domain name is one of the multiple domain names that a webmaster derives from a main domain name to serve different businesses; for example, bbs.163.com provides a forum service, and blog.163.com is the sub-domain name of the NetEase blog. A sub-domain name can be named arbitrarily and can even be a multi-level sub-domain name, for example twocold.blog.sina.com.cn. A sub-domain name consists of a left-side domain name part joined to the main domain name. With the main domain name excluded, the remainder of the sub-domain name can be split on the symbol "." into multiple levels of fragments. For example, www.163.com yields one level of fragment, "www", and twocold.blog.sina.com.cn yields two levels of fragments, "twocold" and "blog".
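The splitting described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `split_fragments` is hypothetical, and the main domain name is assumed to be known in advance rather than derived from the sub-domain name.

```python
def split_fragments(subdomain: str, main_domain: str) -> list[str]:
    """Return the fragments to the left of main_domain, in left-to-right order."""
    suffix = "." + main_domain
    if not subdomain.endswith(suffix):
        return []  # not a sub-domain name of this main domain name
    # Drop the main domain name, then split the remainder on "."
    return subdomain[: -len(suffix)].split(".")


print(split_fragments("www.163.com", "163.com"))                    # ['www']
print(split_fragments("twocold.blog.sina.com.cn", "sina.com.cn"))  # ['twocold', 'blog']
```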
However, some webmasters deliberately generate a huge number of sub-domain names whose content and quality are very similar, giving search engines the false impression that the main domain name hosts a very large business. Sub-domain names produced in bulk in this way are called inundation sub-domains. Because the content and quality of inundation sub-domains are very similar, updating them and evaluating their quality as if they were ordinary sub-domains greatly increases the burden on the search engine. Identifying inundation sub-domains and applying corresponding scheduling measures therefore makes resource allocation more reasonable and greatly reduces that burden.
The method commonly used in the prior art to identify inundation sub-domains is to count the sub-domain names under the same main domain name and, when the count exceeds a certain threshold, to treat them as inundation sub-domains.
The inventors found that the prior art has at least the following shortcoming: because it decides solely by quantity, it can only catch the most severe inundation cases, so its identification degree for inundation sub-domains is low.
Summary of the invention
Embodiments of the present invention provide a method and system for identifying inundation sub-domains, which identify inundation sub-domains according to the dispersion or concentration of the fragment lengths of any valid fragment set of the sub-domain names and can effectively improve the identification degree.
To achieve the above object, the embodiments of the present invention adopt the following technical solutions.
In one aspect, an embodiment of the present invention provides a method for identifying inundation sub-domains, comprising:
Obtaining sub-domain names having the same main domain name;
If the fragment lengths of any valid fragment set of the sub-domain names having the same main domain name are determined to be in dispersed distribution or concentrated distribution, identifying the sub-domain names corresponding to that valid fragment set as inundation sub-domains, wherein a valid fragment set is a set of fragments, within the same fragment level of the sub-domain names having the same main domain name, whose left-side domain name parts and/or right-side domain name parts are respectively identical.
Preferably, determining that the fragment lengths of any valid fragment set of the sub-domain names having the same main domain name are in dispersed distribution comprises:
Obtaining an average fragment count, wherein the average fragment count is obtained by dividing the total number of fragments in the valid fragment set by the total number of distinct fragment lengths in the valid fragment set;
If the average fragment count is less than a first dispersion threshold, determining that the fragment lengths of the valid fragment set are in dispersed distribution.
Preferably, determining that the fragment lengths of any valid fragment set of the sub-domain names having the same main domain name are in dispersed distribution further comprises:
If the average fragment count is not less than the first dispersion threshold, counting the number of fragments in the valid fragment set that contain a separator, or the naming patterns of the valid fragment set;
If the number of fragments in the valid fragment set that contain a separator is greater than a preset separator threshold, or the proportion of any one naming pattern of the valid fragment set is greater than a preset proportion threshold, and the average fragment count is less than a second dispersion threshold, determining that the fragment lengths of the valid fragment set are in dispersed distribution.
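The two-stage dispersion decision above might be sketched as follows. This is a minimal sketch, not the patented implementation: the function name, parameter names, and all threshold values are hypothetical, and the separator count and naming-pattern proportion are assumed to have been computed elsewhere.

```python
def is_dispersed(
    average_fragment_count: float,
    first_dispersion_threshold: float,
    second_dispersion_threshold: float,
    separator_fragment_count: int,
    separator_threshold: int,
    max_naming_pattern_ratio: float,
    ratio_threshold: float,
) -> bool:
    """Two-stage dispersion test on one valid fragment set."""
    # Stage 1: the average fragment count alone is low enough.
    if average_fragment_count < first_dispersion_threshold:
        return True
    # Stage 2: naming information supports the decision, provided the
    # average fragment count is still below the second threshold.
    naming_signal = (
        separator_fragment_count > separator_threshold
        or max_naming_pattern_ratio > ratio_threshold
    )
    return naming_signal and average_fragment_count < second_dispersion_threshold


# Hypothetical values: stage 1 fails (50 >= 40), stage 2 succeeds.
print(is_dispersed(50, 40, 100, 600, 500, 0.2, 0.8))
```

The sketch assumes the second dispersion threshold is looser (larger) than the first, which is what makes the second stage meaningful; the text itself does not state the relation between the two.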
Preferably, determining that the fragment lengths of any valid fragment set of the sub-domain names having the same main domain name are in concentrated distribution comprises:
Obtaining an average fragment count, wherein the average fragment count is obtained by dividing the total number of fragments in the valid fragment set by the total number of distinct fragment lengths in the valid fragment set;
Obtaining the number of effective lengths in the valid fragment set, wherein a fragment length is an effective length if the number of fragments having that length is greater than the product of the average fragment count and an adjustment factor;
If the ratio of the number of effective lengths to the total number of distinct fragment lengths is less than a first concentration threshold, determining that the fragment lengths of the valid fragment set are in concentrated distribution.
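The concentration test above can be illustrated with a short sketch. The function name, the default adjustment factor of 1.0, and the threshold values used in the example are assumptions made for illustration; the text does not fix them. The sample data reuses the four-length distribution from the embodiment (500, 200, 250 and 50 fragments of lengths 2, 3, 8 and 11).

```python
from collections import Counter


def is_concentrated(fragments, first_concentration_threshold, adjustment_factor=1.0):
    """First-stage concentration test on one valid fragment set."""
    if not fragments:
        raise ValueError("the valid fragment set must not be empty")
    length_counts = Counter(len(f) for f in fragments)  # length -> fragment count
    distinct_lengths = len(length_counts)
    average_fragment_count = len(fragments) / distinct_lengths
    # A length is "effective" if it holds more fragments than
    # average_fragment_count * adjustment_factor.
    effective_lengths = sum(
        1 for count in length_counts.values()
        if count > average_fragment_count * adjustment_factor
    )
    return effective_lengths / distinct_lengths < first_concentration_threshold


sample = ["a" * 2] * 500 + ["a" * 3] * 200 + ["a" * 8] * 250 + ["a" * 11] * 50
# Average is 1000 / 4 = 250; only length 2 exceeds it, so the ratio is 1 / 4.
print(is_concentrated(sample, first_concentration_threshold=0.3))
```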
Preferably, determining that the fragment lengths of any valid fragment set of the sub-domain names having the same main domain name are in concentrated distribution further comprises:
If the ratio of the number of effective lengths to the total number of distinct fragment lengths is not less than the first concentration threshold, counting the number of fragments in the valid fragment set that contain a separator, or the naming patterns of the valid fragment set;
If the number of fragments in the valid fragment set that contain a separator is greater than a preset separator threshold, or the proportion of any one naming pattern of the valid fragment set is greater than a preset proportion threshold, and the ratio of the number of effective lengths to the total number of distinct fragment lengths is less than a second concentration threshold, determining that the fragment lengths of the valid fragment set are in concentrated distribution.
Preferably, the method further comprises:
If the fragment lengths of every valid fragment set of the sub-domain names having the same main domain name are determined to be in dispersed distribution or concentrated distribution, and the sub-domain names having the same main domain name have at least two levels of fragments, merging at least two adjacent fragment levels of the sub-domain names having the same main domain name into a single-level fragment;
Obtaining a new valid fragment set from the merged single-level fragments and, if the fragment lengths of the new valid fragment set are determined to be in dispersed distribution or concentrated distribution, identifying the sub-domain names corresponding to the new valid fragment set as inundation sub-domains.
Preferably, before the sub-domain names corresponding to a valid fragment set are identified as inundation sub-domains upon determining that the fragment lengths of that valid fragment set are in dispersed distribution or concentrated distribution, the method further comprises:
Filtering out, according to a predefined exemption rule, the fragments or sub-domain names that satisfy the exemption rule, so that they are not subjected to inundation sub-domain identification.
Preferably, the method further comprises setting an update period; correspondingly,
Obtaining the sub-domain names having the same main domain name comprises: obtaining, according to the set update period, the sub-domain names having the same main domain name within each update period;
Identifying the sub-domain names corresponding to a valid fragment set as inundation sub-domains upon determining that the fragment lengths of that valid fragment set are in dispersed distribution or concentrated distribution comprises: according to the set update period, if within an update period the fragment lengths of any valid fragment set of the sub-domain names having the same main domain name are determined to be in dispersed distribution or concentrated distribution, identifying the sub-domain names corresponding to that valid fragment set as inundation sub-domains.
In another aspect, an embodiment of the present invention provides a system for identifying inundation sub-domains, comprising:
An acquiring unit, configured to obtain sub-domain names having the same main domain name;
A judging unit, configured to determine whether the fragment lengths of any valid fragment set of the sub-domain names obtained by the acquiring unit are in dispersed distribution or concentrated distribution, wherein a valid fragment set is a set of fragments, within the same fragment level of the sub-domain names having the same main domain name, whose left-side domain name parts and/or right-side domain name parts are respectively identical;
A recognition unit, configured to identify the sub-domain names corresponding to a valid fragment set as inundation sub-domains after the judging unit determines that the fragment lengths of that valid fragment set are in dispersed distribution or concentrated distribution.
Preferably, the judging unit comprises:
An acquisition module, configured to obtain an average fragment count, wherein the average fragment count is obtained by dividing the total number of fragments in the valid fragment set by the total number of distinct fragment lengths in the valid fragment set;
A first judging module, configured to determine that the fragment lengths of the valid fragment set are in dispersed distribution after determining that the average fragment count obtained by the acquisition module is less than a first dispersion threshold.
Preferably, the judging unit further comprises:
A statistical module, configured to count the number of fragments in the valid fragment set that contain a separator, or the naming patterns of the valid fragment set, after the first judging module determines that the average fragment count is not less than the first dispersion threshold;
A second judging module, configured to determine that the fragment lengths of the valid fragment set are in dispersed distribution after determining that the number of fragments containing a separator counted by the statistical module is greater than a preset separator threshold, or that the proportion of any one naming pattern of the valid fragment set is greater than a preset proportion threshold, and after the first judging module determines that the average fragment count is less than a second dispersion threshold.
Preferably, the judging unit comprises:
A first acquisition module, configured to obtain an average fragment count, wherein the average fragment count is obtained by dividing the total number of fragments in the valid fragment set by the total number of distinct fragment lengths in the valid fragment set;
A second acquisition module, configured to obtain the number of effective lengths in the valid fragment set, wherein a fragment length is an effective length if the number of fragments having that length is greater than the product of the average fragment count and an adjustment factor;
A first judging module, configured to determine that the fragment lengths of the valid fragment set are in concentrated distribution after determining that the ratio of the number of effective lengths obtained by the second acquisition module to the total number of distinct fragment lengths is less than a first concentration threshold.
Preferably, the judging unit further comprises:
A statistical module, configured to count the number of fragments in the valid fragment set that contain a separator, or the naming patterns of the valid fragment set, after the first judging module determines that the ratio of the number of effective lengths to the total number of distinct fragment lengths is not less than the first concentration threshold;
A second judging module, configured to determine that the fragment lengths of the valid fragment set are in concentrated distribution after determining that the number of fragments containing a separator counted by the statistical module is greater than a preset separator threshold, or that the proportion of any one naming pattern of the valid fragment set is greater than a preset proportion threshold, and after the first judging module determines that the ratio of the number of effective lengths to the total number of distinct fragment lengths is less than a second concentration threshold.
Preferably, the system further comprises:
A merging unit, configured to merge at least two adjacent fragment levels of the sub-domain names having the same main domain name into a single-level fragment after the judging unit determines that the fragment lengths of every valid fragment set of those sub-domain names are in dispersed distribution or concentrated distribution and that those sub-domain names have at least two levels of fragments;
The judging unit is further configured to obtain a new valid fragment set from the merged single-level fragments and to determine whether the fragment lengths of the new valid fragment set are in dispersed distribution or concentrated distribution;
The recognition unit is further configured to identify the sub-domain names corresponding to the new valid fragment set as inundation sub-domains after the judging unit determines that the fragment lengths of the new valid fragment set are in dispersed distribution or concentrated distribution.
Preferably, the system further comprises:
A filtering unit, configured to filter out, according to a predefined exemption rule, the fragments or sub-domain names that satisfy the exemption rule, so that the judging unit and the recognition unit do not perform inundation sub-domain identification on them.
Preferably, the system further comprises an update period setting unit configured to set an update period; correspondingly,
The acquiring unit is further configured to obtain, according to the update period set by the update period setting unit, the sub-domain names having the same main domain name within each update period;
The judging unit is further configured to determine, according to the update period set by the update period setting unit, whether within each update period the fragment lengths of any valid fragment set of the sub-domain names having the same main domain name are in dispersed distribution or concentrated distribution;
The recognition unit is further configured to identify, according to the update period set by the update period setting unit and within each update period, the sub-domain names corresponding to a valid fragment set as inundation sub-domains after the judging unit determines that the fragment lengths of that valid fragment set are in dispersed distribution or concentrated distribution.
The method and system for identifying inundation sub-domains provided by the embodiments of the present invention identify inundation sub-domains according to the dispersion or concentration of the fragment lengths of any valid fragment set of the sub-domain names having the same main domain name: if the fragment lengths of a valid fragment set are in dispersed distribution or concentrated distribution, the sub-domain names corresponding to that valid fragment set are identified as inundation sub-domains. This improves the identification degree of inundation sub-domains and solves the problem that the prior art, deciding solely by the number of sub-domains, can only catch the most severe inundation cases and has a low identification degree.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings used in describing the embodiments or the prior art are briefly introduced below. The drawings described below are evidently only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flow chart of a method for identifying inundation sub-domains provided by an embodiment of the present invention;
Fig. 2 is a diagram of a system for identifying inundation sub-domains provided by an embodiment of the present invention;
Fig. 3 is a structure diagram of the judging unit in the system for identifying inundation sub-domains provided by an embodiment of the present invention;
Fig. 4 is another structure diagram of the judging unit in the system for identifying inundation sub-domains provided by an embodiment of the present invention;
Fig. 5 is a diagram of another system for identifying inundation sub-domains provided by an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments of the present invention. The described embodiments are evidently only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
An embodiment of the present invention provides a method for identifying inundation sub-domains which, referring to Fig. 1, comprises:
S101: obtaining sub-domain names having the same main domain name;
Exemplarily, all sub-domain names on the network that the search engine has indexed may be collected; the sub-domain names may be represented as a list or in another form. The collected sub-domain names are classified by main domain name, and each group of sub-domain names having the same main domain name serves as a data source for inundation sub-domain identification. This embodiment describes the identification using the sub-domain names under one main domain name as an example, so "the main domain name" below refers to one specific main domain name; the identification process and principle are the same for the sub-domain names under other main domain names.
Preferably, in step S101 an update period may be set, and the sub-domain names having the same main domain name are obtained and updated in each update period.
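Step S101 amounts to grouping the collected sub-domain names by main domain name. The sketch below assumes the set of main domain names is already known and supplied by the caller; the text does not say how main domain names are determined, and the function name `group_by_main_domain` is hypothetical.

```python
from collections import defaultdict


def group_by_main_domain(subdomains, main_domains):
    """Group sub-domain names under the longest matching known main domain name."""
    # Try longer main domain names first so the most specific one wins.
    ordered = sorted(main_domains, key=len, reverse=True)
    groups = defaultdict(list)
    for name in subdomains:
        for main in ordered:
            if name.endswith("." + main):
                groups[main].append(name)
                break
    return dict(groups)


names = ["bbs.163.com", "blog.163.com", "twocold.blog.sina.com.cn"]
print(group_by_main_domain(names, ["163.com", "sina.com.cn"]))
```

Each resulting group can then be treated as the data source for one run of the identification.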
S102: if the fragment lengths of any valid fragment set of the sub-domain names having the same main domain name are determined to be in dispersed distribution or concentrated distribution, identifying the sub-domain names corresponding to that valid fragment set as inundation sub-domains, wherein a valid fragment set is a set of fragments, within the same fragment level of the sub-domain names having the same main domain name, whose left-side domain name parts and/or right-side domain name parts are respectively identical.
Exemplarily, this embodiment calls the fragment adjacent to the main domain name the first-level fragment, the fragment adjacent to the first-level fragment the second-level fragment, and so on. For example, for the domain name twocold.blog.sina.com.cn, the main domain name is "sina.com.cn", the first-level fragment is "blog", and the second-level fragment is "twocold"; "twocold" is the left-side domain name part of the first-level fragment, and "sina.com.cn" is the right-side domain name part of the first-level fragment.
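The level numbering and the left-side and right-side domain name parts can be made concrete with a small sketch. The function name `fragment_context` is hypothetical, and the main domain name is assumed to be given.

```python
def fragment_context(subdomain: str, main_domain: str, level: int):
    """Return (left part, fragment, right part) for the fragment at `level`.

    Level 1 is the fragment adjacent to the main domain name.
    """
    suffix = "." + main_domain
    if not subdomain.endswith(suffix):
        raise ValueError("not a sub-domain name of the given main domain name")
    labels = subdomain[: -len(suffix)].split(".")  # left-to-right order
    if not 1 <= level <= len(labels):
        raise ValueError("the sub-domain name has no fragment at this level")
    index = len(labels) - level
    left = ".".join(labels[:index])
    right = ".".join(labels[index + 1:] + [main_domain])
    return left, labels[index], right


print(fragment_context("twocold.blog.sina.com.cn", "sina.com.cn", 1))
# ('twocold', 'blog', 'sina.com.cn')
```

For the highest-level fragment the left part is the empty string, which matches the later remark that the highest-level fragment has only a right-side domain name part.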
When the same-level fragment is not the highest-level fragment of the sub-domain names having the same main domain name, "the left-side domain name parts and/or right-side domain name parts are respectively identical" means that both the left-side domain name parts and the right-side domain name parts of the same-level fragments are respectively identical.
The second-level fragments of the following sub-domain names are taken as an example:
www.cid-3c148c1cd8599f5e.profile.live.com
www.cid-fc56648fc658c405.profile.live.com
www.cid-f4bd27e168f86267.profile.live.com
www.51senv.space.live.com
The second-level fragments of the above domain names are "cid-3c148c1cd8599f5e", "cid-fc56648fc658c405", "cid-f4bd27e168f86267" and "51senv". The fragments "cid-3c148c1cd8599f5e", "cid-fc56648fc658c405" and "cid-f4bd27e168f86267" have identical left-side domain name parts and identical right-side domain name parts and belong to the same fragment level, so they form a valid fragment set. The fragment "51senv" belongs to the same fragment level as "cid-3c148c1cd8599f5e" and the others, but its right-side domain name part differs, so it cannot belong to the same valid fragment set as they do.
When the same-level fragment is the highest-level fragment of the sub-domain names having the same main domain name, "the left-side domain name parts and/or right-side domain name parts are respectively identical" means that the right-side domain name parts of the same-level fragments are respectively identical.
For example, when inundation identification is performed on the highest-level fragments of the sub-domain names, the highest-level fragment has only a right-side domain name part, so fragments form a valid fragment set as long as their right-side domain name parts are identical. The third-level fragments of the following sub-domain names are taken as an example:
ihazo.qh.gzszyl.go.cn
fidoo.qh.gzszyl.go.cn
npvny.qh.gzszyl.go.cn
tmtmk.ne.gzszyl.go.cn
The third-level fragments of the above domain names are the highest-level fragments: "ihazo", "fidoo", "npvny" and "tmtmk". The fragments "ihazo", "fidoo" and "npvny" have identical right-side domain name parts, so they form a valid fragment set. The fragment "tmtmk" belongs to the same fragment level as "ihazo" and the others, but its right-side domain name part differs, so it cannot belong to the same valid fragment set as they do.
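Both cases above reduce to grouping the fragments of one level by their left-side and right-side domain name parts, with an empty left part for highest-level fragments. The following sketch illustrates this; the function name is hypothetical, and the main domain names "live.com" and "go.cn" are assumptions inferred from the level numbering used in the two examples.

```python
from collections import defaultdict


def valid_fragment_sets(subdomains, main_domain, level):
    """Group the fragments at `level` by (left part, right part)."""
    suffix = "." + main_domain
    groups = defaultdict(list)
    for name in subdomains:
        if not name.endswith(suffix):
            continue  # belongs to a different main domain name
        labels = name[: -len(suffix)].split(".")  # left-to-right order
        if level > len(labels):
            continue  # this name has no fragment at the requested level
        index = len(labels) - level
        left = ".".join(labels[:index])  # empty for the highest-level fragment
        right = ".".join(labels[index + 1:] + [main_domain])
        groups[(left, right)].append(labels[index])
    return dict(groups)


names = [
    "ihazo.qh.gzszyl.go.cn",
    "fidoo.qh.gzszyl.go.cn",
    "npvny.qh.gzszyl.go.cn",
    "tmtmk.ne.gzszyl.go.cn",
]
for key, fragments in valid_fragment_sets(names, "go.cn", 3).items():
    print(key, fragments)
```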
The sub-domain names having the same main domain name may contain multiple valid fragment sets. This embodiment describes inundation identification according to the dispersion or concentration of the fragment lengths of a first valid fragment set as an example; identification according to the dispersion or concentration of the fragment lengths of the other valid fragment sets follows the same principle and process. Here, the concentration of the fragment lengths of a valid fragment set refers to the degree to which the fragment lengths in the set are concentrated on a few lengths; the dispersion of the fragment lengths of a valid fragment set refers to the degree to which the set contains many distinct fragment lengths while each length accounts for only a very low proportion of the domain names.
Exemplarily, from the total number of fragments in the first valid fragment set and the fragment length of each fragment, the total number of distinct fragment lengths in the first valid fragment set can be counted.
In addition, in step S102 an update period may be set, and in each update period it is determined whether the fragment lengths of the first valid fragment set are in dispersed distribution or in concentrated distribution.
The methods for determining whether the fragment lengths of the first valid fragment set are in dispersed distribution and whether they are in concentrated distribution are briefly introduced below in turn.
First, determining whether the fragment lengths of the first valid fragment set are in dispersed distribution may comprise:
a. Obtaining an average fragment count, wherein the average fragment count is obtained by dividing the total number of fragments in the valid fragment set by the total number of distinct fragment lengths in the valid fragment set;
Exemplarily, this embodiment assumes that the first valid fragment set contains 1000 fragments in total;
The fragment length of a fragment is the number of characters it contains, for example:
The fragment length of the second-level fragment of cid-3c148c1cd8599f5e.profile.live.com is 20;
The fragment length of the first-level fragment of www.thhhhshhh.live.com is 9;
The fragment length of the first-level fragment of www.live.com is 3;
The total number of distinct fragment lengths in the first valid fragment set is the number of different fragment lengths that occur in the set; the average fragment count is the ratio of the total number of fragments in the first valid fragment set to the total number of distinct fragment lengths in the set.
This embodiment assumes that the first valid fragment set contains four fragment lengths, as follows:
Fragment length (len) = 2, with 500 fragments of this length;
len = 3, with 200 fragments of this length;
len = 8, with 250 fragments of this length;
len = 11, with 50 fragments of this length.
The average fragment count is therefore 1000/4 = 250.
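The calculation in step a can be reproduced with a short sketch. The function name `average_fragment_count` and the placeholder fragments (runs of the letter "a") are illustrative only; the only thing taken from the text is the definition of the average and the four-length distribution.

```python
from collections import Counter


def average_fragment_count(fragments):
    """Total fragments divided by the number of distinct fragment lengths."""
    if not fragments:
        raise ValueError("the valid fragment set must not be empty")
    length_counts = Counter(len(f) for f in fragments)  # length -> fragment count
    return len(fragments) / len(length_counts)


# Fragment length is the character count of the fragment.
print(len("cid-3c148c1cd8599f5e"), len("thhhhshhh"), len("www"))  # 20 9 3

# Placeholder set with the distribution from the text: 500, 200, 250, 50.
sample = ["a" * 2] * 500 + ["a" * 3] * 200 + ["a" * 8] * 250 + ["a" * 11] * 50
print(average_fragment_count(sample))  # 250.0
```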
b. If the average number of fragments is less than a first dispersion threshold, judging that the fragment lengths of the effective slice group are discretely distributed.
By way of example, the first dispersion threshold may be set in advance. Its range may be determined from the dispersion characteristics of current inundation subdomains; if the dispersion of commonly occurring inundation subdomains shows a new trend, the first dispersion threshold may be adjusted. The dispersion characteristics of current inundation subdomains may be obtained statistically, which is not limited here.
For example, the first dispersion threshold may range from 12 to 40, and is preferably 12.
When the average number of fragments is less than the first dispersion threshold, the fragment lengths of the first effective slice group can be considered discretely distributed.
For example, with a first dispersion threshold of 40, an average number of fragments of 250 does not allow the fragment lengths of the first effective slice group to be judged discretely distributed, whereas an average number of fragments of 25 does.
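Step b reduces to a single comparison. The sketch below assumes the group is summarized as a dictionary mapping each fragment length to its number of fragments; the function name and the default threshold of 40 (taken from the example, not the preferred value of 12) are illustrative choices.

```python
def is_discrete_first_pass(length_hist: dict, first_threshold: float = 40) -> bool:
    """Step b: discrete if the average number of fragments is below the threshold."""
    if not length_hist:
        raise ValueError("the effective slice group is empty")
    total = sum(length_hist.values())
    average = total / len(length_hist)
    return average < first_threshold
```

A histogram such as `{2: 500, 3: 200, 8: 250, 11: 50}` has an average of 250 and is not judged discrete at a threshold of 40, while four lengths with 25 fragments each have an average of 25 and are.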
Preferably, when the average number of fragments alone does not show the fragment lengths of the first effective slice group to be discretely distributed, the naming information of the first effective slice group may additionally be used to help judge whether its fragment lengths are discretely distributed, so as to improve the recognition rate for inundation domain names. The method may therefore further comprise:
c. If the average number of fragments is not less than the first dispersion threshold, counting the number of fragments in the first effective slice group that contain a separator, or determining the naming scheme of the first effective slice group;
By way of example, suppose the first dispersion threshold is 40 and the average number of fragments is 250. The fragment lengths of the first effective slice group cannot be judged discretely distributed, so the number of fragments containing a separator is counted, or the naming scheme of the first effective slice group is determined.
Counting the fragments that contain a separator and determining the naming scheme are described separately below.
1) Counting the number of fragments in the first effective slice group that contain a separator.
By way of example, the separator may be preset according to the symbols permitted in domain names. For example, if the hyphen "-" is permitted in domain names, the separator may be preset as "-", and the number of fragments containing a separator is the number of fragments that contain "-". If symbols such as "-" and "_" may appear in domain names, the separators may be preset as "-", "_" and the like, and the number of fragments containing a separator is the sum of the numbers of fragments containing these symbols.
In addition, a separator threshold may be preset in order to determine whether fragments containing a separator are prevalent in the first effective slice group. For example, the preset separator threshold may be set to 60%, preferably 80%.
For example, with a preset separator threshold of 60%: when the fragments containing a separator make up more than 60% of the first effective slice group, such fragments can be considered prevalent in the group; when they make up no more than 60% of the first effective slice group, such fragments can be considered not prevalent.
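The separator statistic can be sketched as follows. Two assumptions are made here that the text does not settle: a fragment containing several separators is counted once, and a share exactly equal to the threshold is treated as not prevalent. The names `SEPARATORS`, `separator_ratio` and `separators_are_prevalent` are hypothetical.

```python
SEPARATORS = ("-", "_")  # preset separators, as in the example above


def separator_ratio(fragments, separators=SEPARATORS):
    """Share of fragments that contain at least one preset separator."""
    if not fragments:
        raise ValueError("the effective slice group is empty")
    with_separator = sum(
        1 for fragment in fragments if any(sep in fragment for sep in separators)
    )
    return with_separator / len(fragments)


def separators_are_prevalent(fragments, threshold=0.6):
    """Assumption: strictly more than the threshold counts as prevalent."""
    return separator_ratio(fragments) > threshold
```

For instance, a group in which three of four fragments contain "-" has a ratio of 0.75 and is treated as prevalent at the 60% threshold.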
2) Determining the naming scheme of the first effective slice group.
By way of example, naming schemes may be preset. For example, four naming schemes may be included (all digits, all letters, all digits plus letters, all subdomain plus digits). The preset naming schemes may of course be updated as domain naming rules change, which is not limited here.
By way of example, a ratio threshold may be preset in order to judge whether the naming scheme of the first effective slice group is uniform. For example, the preset ratio threshold may be set to 60%, preferably 80%.
For example, with a preset ratio threshold of 60%: when the fragments of any one naming scheme account for 60% or more of the first effective slice group, the naming scheme of the group can be considered uniform; when the fragments of every naming scheme each account for less than 60% of the first effective slice group, the naming scheme of the group can be considered not uniform.
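The naming-scheme check might be sketched as below. Only three of the four listed schemes are modeled (all digits, all letters, digits plus letters); the fourth, "subdomain plus digits", is not defined precisely enough in the text to implement and is deliberately left out. Function names are hypothetical, and fragments are assumed to be ASCII.

```python
from collections import Counter


def naming_scheme(fragment: str) -> str:
    """Classify one fragment into an illustrative naming scheme."""
    if fragment.isdigit():
        return "digits"
    if fragment.isalpha():
        return "letters"
    if fragment.isalnum():
        return "digits+letters"
    return "other"  # e.g. fragments containing separators


def dominant_scheme_ratio(fragments):
    """Largest share of the group held by any single modeled scheme."""
    if not fragments:
        raise ValueError("the effective slice group is empty")
    counts = Counter(naming_scheme(fragment) for fragment in fragments)
    counts.pop("other", None)  # "other" is not a preset scheme
    return max(counts.values(), default=0) / len(fragments)


def naming_is_uniform(fragments, threshold=0.6):
    """Uniform when one scheme accounts for the threshold share or more."""
    return dominant_scheme_ratio(fragments) >= threshold
```

A group such as `["123", "456", "789", "abc"]` is 75% all-digit and would be considered uniform at the 60% threshold.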
d. If the number of fragments in the first effective slice group that contain a separator, as a proportion of the group, is greater than the preset separator threshold, or the proportion of any one naming scheme in the first effective slice group is greater than the preset ratio threshold, and the average number of fragments is less than a second dispersion threshold, judging that the fragment lengths of the first effective slice group are discretely distributed.
By way of example, when the average number of fragments is less than the second dispersion threshold and the first effective slice group satisfies at least one of the two conditions (fragments containing a separator are prevalent; the naming scheme is uniform), the fragment lengths of the first effective slice group can be judged discretely distributed. Whether fragments containing a separator are prevalent and whether the naming scheme is uniform are judged as described in step c above and are not repeated here.
The second dispersion threshold may be set in advance. Its range may be determined from the dispersion characteristics of current inundation subdomains and may, for example, be 15 to 50. The second dispersion threshold should, however, be greater than the first dispersion threshold. For example, when the first dispersion threshold is 40, the second dispersion threshold may be 50; preferably, when the first dispersion threshold is 12, the second dispersion threshold may be 15.
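Steps b to d can be combined into one decision function. The sketch below takes the two naming signals as already-computed booleans, so it makes no claim about how they are derived; the function name and parameter names are hypothetical, and the defaults use the preferred threshold pair 12 and 15.

```python
def is_discrete(
    average_count: float,
    separators_prevalent: bool,
    naming_uniform: bool,
    first_threshold: float = 12,
    second_threshold: float = 15,
) -> bool:
    """Two-stage dispersion judgment (steps b, c and d)."""
    if second_threshold <= first_threshold:
        raise ValueError("second threshold must exceed the first")
    # Step b: the average alone is enough.
    if average_count < first_threshold:
        return True
    # Steps c and d: a naming signal relaxes the bound to the second threshold.
    has_naming_signal = separators_prevalent or naming_uniform
    return has_naming_signal and average_count < second_threshold
```

The second threshold only widens the accepted range for groups that also show a naming signal, which is why it has to be larger than the first.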
II. Judging whether the fragment lengths of the first effective slice group are concentratedly distributed may comprise:
a. Obtaining the average number of fragments, wherein the average number of fragments is obtained by dividing the total number of fragments in the effective slice group by the number of distinct fragment lengths in the effective slice group;
By way of example, the first effective slice group is again assumed to contain 1000 fragments in total and four distinct fragment lengths:
fragment length (len) = 2, with 500 fragments of this length;
len = 3, with 200 fragments of this length;
len = 8, with 250 fragments of this length;
len = 11, with 50 fragments of this length.
The average number of fragments is therefore 1000/4 = 250.
b. Obtaining the number of effective lengths in the first effective slice group, wherein an effective length is a fragment length for which the number of fragments is greater than the product of the average number of fragments and an adjustment factor;
By way of example, the adjustment factor may range from 0.9 to 1.5, and is preferably 0.9.
For example, when the adjustment factor is 0.9, the product of the average number of fragments and the adjustment factor is 250*0.9 = 225. The fragment lengths with more than 225 fragments are len = 2 (500 fragments) and len = 8 (250 fragments), so the number of effective lengths is 2.
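A minimal sketch of step b, assuming the group is summarized as a dictionary mapping fragment length to number of fragments. The function name `effective_lengths` is hypothetical.

```python
def effective_lengths(length_hist: dict, adjustment_factor: float = 0.9) -> list:
    """Fragment lengths whose fragment count exceeds average * adjustment factor."""
    if not length_hist:
        raise ValueError("the effective slice group is empty")
    average = sum(length_hist.values()) / len(length_hist)
    cutoff = average * adjustment_factor
    return sorted(length for length, count in length_hist.items() if count > cutoff)
```

For the histogram `{2: 500, 3: 200, 8: 250, 11: 50}` the cutoff is 225, so lengths 2 and 8 qualify and the number of effective lengths is 2.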
c. If the ratio of the number of effective lengths to the number of distinct fragment lengths is less than a first concentration threshold, judging that the fragment lengths of the first effective slice group are concentratedly distributed.
By way of example, the first concentration threshold may be set in advance. Its range may be determined from the concentration characteristics of current inundation subdomains; if the concentration of commonly occurring inundation subdomains shows a new trend, the first concentration threshold may be adjusted. The concentration characteristics of current inundation subdomains may be obtained statistically, which is not limited here.
For example, the first concentration threshold may range from 0.45 to 0.6, and is preferably 0.45.
When the ratio of the number of effective lengths to the number of distinct fragment lengths is less than the first concentration threshold, the fragment lengths of the first effective slice group can be considered concentratedly distributed.
For example, suppose the number of effective lengths is 2 and the number of distinct fragment lengths is 4, so the ratio is 2/4 = 0.5. With a first concentration threshold of 0.45, the ratio is greater than the threshold and the fragment lengths of the first effective slice group cannot be judged concentratedly distributed; with a first concentration threshold of 0.6, they can.
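The ratio test in step c can be sketched in the same style, again assuming a length-to-count dictionary as input. The function names are hypothetical.

```python
def concentration_ratio(length_hist: dict, adjustment_factor: float = 0.9) -> float:
    """Number of effective lengths divided by the number of distinct lengths."""
    if not length_hist:
        raise ValueError("the effective slice group is empty")
    average = sum(length_hist.values()) / len(length_hist)
    cutoff = average * adjustment_factor
    effective = sum(1 for count in length_hist.values() if count > cutoff)
    return effective / len(length_hist)


def is_concentrated_first_pass(length_hist: dict, first_threshold: float = 0.45) -> bool:
    """Step c: concentrated if the ratio is below the first concentration threshold."""
    return concentration_ratio(length_hist) < first_threshold
```

With the example histogram the ratio is 2/4, so the outcome depends on whether 0.45 or 0.6 is chosen as the first concentration threshold.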
Preferably, when the ratio of the number of effective lengths to the number of distinct fragment lengths alone does not show the fragment lengths of the first effective slice group to be concentratedly distributed, the naming information of the first effective slice group may additionally be used to help judge whether its fragment lengths are concentratedly distributed, so as to improve the recognition rate for inundation domain names. The method may therefore further comprise:
d. If the ratio of the number of effective lengths to the number of distinct fragment lengths is not less than the first concentration threshold, counting the number of fragments in the first effective slice group that contain a separator, or determining the naming scheme of the first effective slice group;
By way of example, the counting of fragments containing a separator, the determination of the naming scheme, and their use are the same as described above and are not repeated here.
e. If the number of fragments in the first effective slice group that contain a separator, as a proportion of the group, is greater than the preset separator threshold, or the proportion of any one naming scheme in the first effective slice group is greater than the preset ratio threshold, and the ratio of the number of effective lengths to the number of distinct fragment lengths is less than a second concentration threshold, judging that the fragment lengths of the first effective slice group are concentratedly distributed.
By way of example, when the ratio of the number of effective lengths to the number of distinct fragment lengths is less than the second concentration threshold and the first effective slice group satisfies at least one of the two conditions (fragments containing a separator are prevalent; the naming scheme is uniform), the fragment lengths of the first effective slice group can be judged concentratedly distributed.
The second concentration threshold may be set in advance. Its range may be determined from the concentration characteristics of current inundation subdomains and may, for example, be 0.6 to 0.7. The second concentration threshold should, however, be greater than the first concentration threshold. For example, when the first concentration threshold is 0.6, the second concentration threshold may be 0.7; preferably, when the first concentration threshold is 0.45, the second concentration threshold may be 0.6.
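The two-stage concentration judgment mirrors the dispersion judgment. In this sketch the naming signals are passed in as booleans computed elsewhere; the function name, parameter names and the default threshold pair 0.45 and 0.6 (the preferred values) are illustrative.

```python
def is_concentrated(
    ratio: float,
    separators_prevalent: bool,
    naming_uniform: bool,
    first_threshold: float = 0.45,
    second_threshold: float = 0.6,
) -> bool:
    """Two-stage concentration judgment.

    `ratio` is the number of effective lengths divided by the number of
    distinct fragment lengths.
    """
    if second_threshold <= first_threshold:
        raise ValueError("second threshold must exceed the first")
    # The ratio alone is enough when it is below the first threshold.
    if ratio < first_threshold:
        return True
    # Otherwise a naming signal relaxes the bound to the second threshold.
    has_naming_signal = separators_prevalent or naming_uniform
    return has_naming_signal and ratio < second_threshold
```

With the running example (ratio 0.5), the group is not concentrated on the ratio alone at 0.45, but becomes so if separators are prevalent or the naming scheme is uniform.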
Preferably, to improve the efficiency and reliability of identification, identification may first be performed on the effective slice groups of the first-level fragments; after the subdomain names identified as inundation subdomains have been deleted, identification is performed on the effective slice groups of the second-level fragments, and so on.
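The level-by-level procedure can be sketched as a loop that removes flagged subdomain names before moving to the next level. This is only an illustration under several assumptions: levels are counted leftwards from the main domain name, fragments are grouped when both their left-side and right-side parts match, and the actual dispersion or concentration test is supplied by the caller as a hypothetical `judge` callable.

```python
from collections import defaultdict


def identify_by_level(subdomains, main_domain, judge, max_level=3):
    """Flag subdomain names level by level, deleting hits before the next level.

    `judge` receives the list of fragments of one group and returns True
    when the group should be treated as inundation subdomains.
    """
    suffix = "." + main_domain
    remaining = [name for name in subdomains if name.endswith(suffix)]
    flagged = []
    for level in range(1, max_level + 1):
        groups = defaultdict(list)  # (left part, right part) -> [(fragment, name)]
        for name in remaining:
            labels = name[: -len(suffix)].split(".")
            if level > len(labels):
                continue
            index = len(labels) - level
            key = (".".join(labels[:index]), ".".join(labels[index + 1:]))
            groups[key].append((labels[index], name))
        hits = set()
        for members in groups.values():
            if judge([fragment for fragment, _ in members]):
                hits.update(name for _, name in members)
        flagged.extend(name for name in remaining if name in hits)
        remaining = [name for name in remaining if name not in hits]
    return flagged
```

Removing the names flagged at level 1 keeps them from being examined again at level 2, which is where the efficiency gain mentioned above would come from.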
Preferably, when the subdomain names contain multiple levels of fragments and no inundation subdomain can be identified from any effective slice group using the above method, the method may, in order to improve the recognition rate, further comprise:
If it is judged that the fragment lengths of none of the effective slice groups of the subdomain names having the same main domain name are discretely or concentratedly distributed, merging at least two adjacent levels of fragments of those subdomain names into a single-level fragment;
By way of example, the merging step may remove the "." between fragments. The removal may be escalated step by step: first one "." is removed and identification is attempted; if identification still fails, two "." are removed, and so on.
For example, for the following domain names:
www.ihazo.qh.gzszyl.go.cn
www.fidoo.edu.gzszyl.go.cn
www.npvny.hb.gzszyl.go.cn
www.tmtmk.ne.gzszyl.go.cn
After the second-level and third-level fragments are merged, they become:
www.ihazoqh.gzszyl.go.cn
www.fidooedu.gzszyl.go.cn
www.npvnyhb.gzszyl.go.cn
www.tmtmkne.gzszyl.go.cn
www.tlekafj.gzszyl.go.cn
The new one-level fragment obtaining is the second level fragment of domain name after above-mentioned merging.
A new effective slice group is obtained from the merged single-level fragments; if the fragment lengths of the new effective slice group are judged to be discretely or concentratedly distributed, the subdomain names corresponding to the new effective slice group are identified as inundation subdomains.
By way of example, the new effective slice group may be obtained from the second-level fragments of the merged domain names above. The principle and process of identifying inundation subdomains from the fragment-length distribution of the new effective slice group are the same as in the method described above and are not repeated here.
Preferably, before step S102, the method further comprises:
Judging whether the number of fragments in the effective slice group is greater than a predetermined threshold, and performing inundation subdomain identification if it is.
By way of example, inundation subdomains are generally generated in batches and are therefore numerous. To simplify identification, a predetermined threshold may be set, and the above method is applied to an effective slice group only when its size exceeds this threshold. The range of the predetermined threshold may be set according to the quantity characteristics of current inundation domain names, which may be obtained statistically. For example, the predetermined threshold may be greater than or equal to 50; it is preferably 500 for effective slice groups of first-level fragments and preferably 100 for effective slice groups of second-level fragments.
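The size gate can be sketched as a per-level lookup. The values 500 and 100 are the preferred thresholds from the text; using 50 for all other levels is an assumption made here for illustration, as are the names `LEVEL_THRESHOLDS`, `DEFAULT_THRESHOLD` and `should_identify`.

```python
LEVEL_THRESHOLDS = {1: 500, 2: 100}  # preferred values for levels 1 and 2
DEFAULT_THRESHOLD = 50               # assumption: lower bound used for other levels


def should_identify(group_size: int, level: int) -> bool:
    """Run identification only when the group is larger than its threshold."""
    threshold = LEVEL_THRESHOLDS.get(level, DEFAULT_THRESHOLD)
    return group_size > threshold
```

Small groups are skipped entirely, so the dispersion and concentration statistics are only computed where batch generation is plausible.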
Preferably, in step S102, fragments or subdomain names that satisfy a predefined exemption rule are filtered out according to that rule and are not subjected to inundation subdomain identification.
By way of example, as a protection for important subdomain names, exemption rules may be set according to actual needs; subdomain names that satisfy an exemption rule will not be identified as inundation domain names.
For example, based on statistics of subdomain fragments across different main domain names, fragments that are meaningful or ubiquitous, such as "bbs", "blog" and "www", are saved in advance as exempt fragments.
As another example, by analyzing subdomain quality and user traffic, particularly important subdomain names, such as qzone.163.com and bbs.163.com, are collected as exempt subdomains.
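An exemption filter along these lines might look as follows. The two sets contain only the examples named in the text; how real exemption lists are built and stored is not specified, so the constants and function names are hypothetical.

```python
EXEMPT_FRAGMENTS = {"bbs", "blog", "www"}
EXEMPT_SUBDOMAINS = {"qzone.163.com", "bbs.163.com"}


def is_exempt(subdomain: str, fragment: str) -> bool:
    """True when either the fragment or the whole subdomain name is exempt."""
    return fragment in EXEMPT_FRAGMENTS or subdomain in EXEMPT_SUBDOMAINS


def filter_exempt(candidates):
    """Drop exempt entries from (subdomain, fragment) pairs before identification."""
    return [
        (subdomain, fragment)
        for subdomain, fragment in candidates
        if not is_exempt(subdomain, fragment)
    ]
```

Filtering before the statistics are computed keeps common labels such as "www" from distorting the fragment-length histogram.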
Preferably, after step S102, the method may further comprise: scheduling the identified inundation subdomains corresponding to each effective slice group as a single subdomain.
By way of example, because inundation subdomains are similar in quality and content, they may be scheduled as one virtual subdomain, which saves bandwidth.
For example, the left-side and right-side domain name parts of the effective slice group may be retained, and the effective fragment represented by "*".
For example, subdomain names such as
cid-3c148c1cd8599f5e.profile.live.com
cid-fc56648fc658c405.profile.live.com
cid-f4bd27e168f86267.profile.live.com
and other subdomain names matching the rule *.profile.live.com can be virtualized as a single subdomain name: *.profile.live.com.
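Building the virtual subdomain name is a simple string operation. The sketch assumes the left-side and right-side parts are given as dotted strings, with the right-side part including the main domain name; the function names are hypothetical.

```python
def virtual_subdomain(left_part: str, right_part: str) -> str:
    """Keep the left and right parts and replace the effective fragment by "*"."""
    parts = [part for part in (left_part, "*", right_part) if part]
    return ".".join(parts)


def schedule_targets(flagged_groups):
    """One virtual subdomain per flagged group, given as (left_part, right_part)."""
    return sorted({virtual_subdomain(left, right) for left, right in flagged_groups})
```

For the cid-... example the left part is empty and the right part is profile.live.com, so every member of the group maps to the same scheduling target.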
The inundation subdomain identification method provided by this embodiment identifies inundation subdomains according to the dispersion or concentration of the fragment lengths of any effective slice group of the subdomain names having the same main domain name: if the fragment lengths of an effective slice group are discretely or concentratedly distributed, the subdomain names corresponding to that group are identified as inundation subdomains. This improves the recognition rate for inundation subdomains. It addresses the problem that the prior art decides whether subdomains are inundation subdomains solely from the number of subdomains, and can therefore catch only the most severe cases, resulting in a low recognition rate.
Another embodiment of the present invention provides an inundation subdomain identification system, which applies the method shown in Fig. 1. Referring to Fig. 2, the system comprises:
An acquiring unit 201, configured to obtain subdomain names having the same main domain name;
By way of example, the acquiring unit 201 may collect all subdomain names on the network that are indexed by the search engine. The subdomain names may be represented as a list or in another form. All collected subdomain names are classified by main domain name, so that the subdomain names under each main domain name can be counted separately; the subdomain names under each main domain name then serve as a data source for inundation domain name identification. This embodiment describes inundation domain name identification using the subdomain names under one main domain name as an example; the process and principle are the same for the subdomain names under other main domain names.
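The classification step of the acquiring unit can be sketched as grouping names by a known list of main domain names. How main domain names are determined is not described in the text, so the sketch simply assumes the caller supplies them; the function name `group_by_main_domain` is hypothetical.

```python
from collections import defaultdict


def group_by_main_domain(subdomains, main_domains):
    """Map each main domain name to the collected subdomain names under it."""
    groups = defaultdict(list)
    # Try longer main domains first so the most specific match wins.
    ordered = sorted(main_domains, key=len, reverse=True)
    for name in subdomains:
        for main in ordered:
            if name.endswith("." + main):
                groups[main].append(name)
                break
    return dict(groups)
```

Each resulting list can then be passed on as the data source for the judging unit.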
A judging unit 202, configured to judge whether the fragment lengths of any effective slice group of the subdomain names having the same main domain name obtained by the acquiring unit 201 are discretely distributed or concentratedly distributed, wherein the effective slice group is: among the same-level fragments of the subdomain names having the same main domain name, the set of fragments whose left-side domain name parts and/or right-side domain name parts are respectively identical;
By way of example, the meaning of an effective slice group in this embodiment is the same as in the method embodiment and is not repeated here. The subdomain names having the same main domain name may comprise a plurality of effective slice groups. This embodiment is described using identification based on the dispersion or concentration of the fragment lengths of the first effective slice group as an example; identification based on the dispersion or concentration of the fragment lengths of other effective slice groups follows the same principle and process.
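One possible reading of the effective slice group definition is sketched below: fragments at the same level are grouped when both their left-side and right-side parts are identical. The "and/or" in the definition admits other readings (matching on one side only), which this sketch does not cover. Levels are assumed to be counted leftwards from the main domain name, and the function name is hypothetical.

```python
from collections import defaultdict


def effective_slice_groups(subdomains, main_domain):
    """Group fragments by (level, left part, right part) under one main domain."""
    suffix = "." + main_domain
    groups = defaultdict(list)
    for name in subdomains:
        if not name.endswith(suffix):
            continue
        labels = name[: -len(suffix)].split(".")
        for index, fragment in enumerate(labels):
            level = len(labels) - index  # level 1 is next to the main domain
            left = ".".join(labels[:index])
            right = ".".join(labels[index + 1:])
            groups[(level, left, right)].append(fragment)
    return dict(groups)
```

For names of the form cid-....profile.live.com, the second-level fragments share an empty left part and the right part "profile", so they fall into one group whose fragment lengths can then be analyzed.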
Two cases are described below.
In the first case,
the judging unit may comprise:
An acquisition module 301, configured to obtain the average number of fragments, wherein the average number of fragments is obtained by dividing the total number of fragments in the effective slice group by the number of distinct fragment lengths in the effective slice group;
By way of example, this embodiment is described using the first effective slice group, which is assumed to contain 1000 fragments in total.
The fragment length of a fragment is the number of characters it contains. For example:
The fragment length of the second level fragment of cid-3c148c1cd8599f5e.profile.live.com is 20;
The fragment length of the first order fragment of www.thhhhshhh.live.com is 9;
The fragment length of the first order fragment of www.live.com is 3;
The number of distinct fragment lengths in the first effective slice group is the count of different fragment lengths that occur in the group. The average number of fragments is the ratio of the total number of fragments in the first effective slice group to that number of distinct fragment lengths.
This embodiment assumes that the first effective slice group contains four distinct fragment lengths:
fragment length (len) = 2, with 500 fragments of this length;
len = 3, with 200 fragments of this length;
len = 8, with 250 fragments of this length;
len = 11, with 50 fragments of this length.
The average number of fragments is therefore 1000/4 = 250.
A first judging module 302, configured to judge, after determining that the average number of fragments obtained by the acquisition module is less than a first dispersion threshold, that the fragment lengths of the effective slice group are discretely distributed.
By way of example, the system may set the first dispersion threshold in advance. Its range may be determined from the dispersion characteristics of current inundation subdomains; if the dispersion of commonly occurring inundation subdomains shows a new trend, the first dispersion threshold may be adjusted. The dispersion characteristics of current inundation subdomains may be obtained statistically, which is not limited here.
For example, the first dispersion threshold may range from 12 to 40, and is preferably 12.
When the average number of fragments is less than the first dispersion threshold, the fragment lengths of the first effective slice group can be considered discretely distributed.
For example, with a first dispersion threshold of 40, an average number of fragments of 250 does not allow the fragment lengths of the first effective slice group to be judged discretely distributed, whereas an average number of fragments of 25 does.
A statistics module 303, configured to count, after the first judging module 302 judges that the average number of fragments is not less than the first dispersion threshold, the number of fragments in the effective slice group that contain a separator, or to determine the naming scheme of the effective slice group;
By way of example, when the average number of fragments alone does not show the fragment lengths of the first effective slice group to be discretely distributed, the naming information of the first effective slice group may additionally be used to help judge whether its fragment lengths are discretely distributed, so as to improve the recognition rate for inundation domain names.
Counting the fragments that contain a separator and determining the naming scheme are described separately below.
1) Counting the number of fragments in the first effective slice group that contain a separator.
By way of example, the separator may be preset according to the symbols permitted in domain names. For example, if the hyphen "-" is permitted in domain names, the separator may be preset as "-", and the number of fragments containing a separator is the number of fragments that contain "-". If symbols such as "-" and "_" may appear in domain names, the separators may be preset as "-", "_" and the like, and the number of fragments containing a separator is the sum of the numbers of fragments containing these symbols.
In addition, a separator threshold may be preset in order to determine whether fragments containing a separator are prevalent in the first effective slice group. For example, the preset separator threshold may be set to 60%, preferably 80%.
For example, with a preset separator threshold of 60%: when the fragments containing a separator make up more than 60% of the first effective slice group, such fragments can be considered prevalent in the group; when they make up no more than 60% of the first effective slice group, such fragments can be considered not prevalent.
2) Determining the naming scheme of the first effective slice group.
By way of example, naming schemes may be preset. For example, four naming schemes may be included (all digits, all letters, all digits plus letters, all subdomain plus digits). The preset naming schemes may of course be updated as domain naming rules change, which is not limited here.
By way of example, a ratio threshold may be preset in order to judge whether the naming scheme of the first effective slice group is uniform. For example, the preset ratio threshold may be set to 60%, preferably 80%.
For example, with a preset ratio threshold of 60%: when the fragments of any one naming scheme account for 60% or more of the first effective slice group, the naming scheme of the group can be considered uniform; when the fragments of every naming scheme each account for less than 60% of the first effective slice group, the naming scheme of the group can be considered not uniform.
A second judging module 304, configured to judge that the fragment lengths of the effective slice group are discretely distributed after determining that the number of fragments containing a separator counted by the statistics module, as a proportion of the group, is greater than the preset separator threshold, or that the proportion of any one naming scheme in the effective slice group is greater than the preset ratio threshold, and that the first judging module has determined the average number of fragments to be less than a second dispersion threshold.
By way of example, the system may set the second dispersion threshold in advance. Its range may be determined from the dispersion characteristics of current inundation subdomains and may, for example, be 15 to 50. The second dispersion threshold should, however, be greater than the first dispersion threshold. For example, when the first dispersion threshold is 40, the second dispersion threshold may be 50; preferably, when the first dispersion threshold is 12, the second dispersion threshold may be 15.
In the second case, referring to Fig. 4,
the judging unit comprises:
A first acquisition module 401, configured to obtain the average number of fragments, wherein the average number of fragments is obtained by dividing the total number of fragments in the effective slice group by the number of distinct fragment lengths in the effective slice group;
By way of example, the first effective slice group is again used for illustration and is assumed to contain 1000 fragments in total and four distinct fragment lengths:
fragment length (len) = 2, with 500 fragments of this length;
len = 3, with 200 fragments of this length;
len = 8, with 250 fragments of this length;
len = 11, with 50 fragments of this length.
The average number of fragments is therefore 1000/4 = 250.
A second acquisition module 402, configured to obtain the number of effective lengths in the effective slice group, wherein an effective length is a fragment length for which the number of fragments is greater than the product of the average number of fragments and an adjustment factor;
By way of example, the adjustment factor may range from 0.9 to 1.5, and is preferably 0.9.
For example, when the adjustment factor is 0.9, the product of the average number of fragments and the adjustment factor is 250*0.9 = 225. The fragment lengths with more than 225 fragments are len = 2 (500 fragments) and len = 8 (250 fragments), so the number of effective lengths is 2.
A first judging module 403, configured to judge, after determining that the ratio of the number of effective lengths obtained by the second acquisition module to the number of distinct fragment lengths is less than a first concentration threshold, that the fragment lengths of the effective slice group are concentratedly distributed.
By way of example, the first concentration threshold may be set in advance. Its range may be determined from the concentration characteristics of current inundation subdomains; if the concentration of commonly occurring inundation subdomains shows a new trend, the first concentration threshold may be adjusted. The concentration characteristics of current inundation subdomains may be obtained statistically, which is not limited here.
For example, the first concentration threshold may range from 0.45 to 0.6, and is preferably 0.45.
When the ratio of the number of effective lengths to the number of distinct fragment lengths is less than the first concentration threshold, the fragment lengths of the first effective slice group can be considered concentratedly distributed.
For example, suppose the number of effective lengths is 2 and the number of distinct fragment lengths is 4, so the ratio is 2/4 = 0.5. With a first concentration threshold of 0.45, the ratio is greater than the threshold and the fragment lengths of the first effective slice group cannot be judged concentratedly distributed; with a first concentration threshold of 0.6, they can.
The statistical module 404 is configured to, after the first judging module 403 determines that the ratio of the number of effective lengths to the total number of distinct fragment lengths is not less than the first concentration threshold, count the number of fragments in the valid fragment set that contain a separator, or the naming patterns of the valid fragment set;
Exemplarily, the counting and use of the separator-containing fragment count and of the naming patterns are the same as described for the first case and are not repeated here.
The second judging module 405 is configured to judge that the fragment lengths of the valid fragment set are in concentrated distribution after determining that the number of separator-containing fragments counted by the statistical module 404 is greater than a preset separator threshold, or that the proportion of any one naming pattern of the valid fragment set is greater than a preset proportion threshold, and after the first judging module 403 determines that the ratio of the number of effective lengths to the total number of distinct fragment lengths is less than a second concentration threshold.
The second concentration threshold can be set in advance, and its value range can be determined from the concentration characteristics of the inundation sub-domains currently observed. For example, the second concentration threshold can range from 0.6 to 0.7, but it should be greater than the first concentration threshold. For example, when the first concentration threshold is 0.6, the second concentration threshold can be 0.7; preferably, when the first concentration threshold is 0.45, the second concentration threshold can be 0.6.
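The interplay of the two concentration thresholds can be sketched as follows. This is a hedged illustration, not the embodiment's implementation: the function name `judge_concentrated` and the boolean parameter `pattern_signal` are assumptions. `pattern_signal` stands for the result of the separator or naming-pattern check, which is computed elsewhere.

```python
def judge_concentrated(ratio, pattern_signal,
                       first_threshold=0.45, second_threshold=0.6):
    """Two-stage judgment of concentrated distribution.

    ratio          -- effective lengths / distinct fragment lengths
    pattern_signal -- True if the separator-containing fragment count exceeds
                      its preset threshold, or any naming pattern's proportion
                      exceeds the preset proportion threshold (assumed input)
    """
    if second_threshold <= first_threshold:
        raise ValueError("second threshold must exceed the first threshold")
    if ratio < first_threshold:
        # Stage 1: the ratio alone is low enough.
        return True
    # Stage 2: a looser threshold applies only when the pattern check fires.
    return pattern_signal and ratio < second_threshold
```

For a ratio of 0.5 with the preferred thresholds 0.45 and 0.6, the set is judged concentrated only when the separator or naming-pattern condition also holds.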
The identifying unit 203 is configured to identify the sub-domain names corresponding to the valid fragment set as inundation sub-domains after the judging unit 202 judges that the fragment lengths of the valid fragment set are in dispersed distribution or concentrated distribution.
Preferably, to improve the efficiency and reliability of identification, the judging unit 202 and the identifying unit 203 can first perform identification on the valid fragment sets of the first-level fragments, delete the sub-domain names identified as inundation sub-domains, then perform identification on the valid fragment sets of the second-level fragments, and so on.
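The level-by-level procedure can be sketched as below. This is a minimal sketch under stated assumptions: all function names are hypothetical, levels are indexed from the left starting at 0 (the leftmost label has no left-side part, so only its right-side part is compared), every input name is assumed to end with the main domain name, and the judgment itself is passed in as a caller-supplied predicate `is_inundation_set` rather than implemented here.

```python
from collections import defaultdict


def split_labels(subdomain, main_domain):
    """Labels of the sub-domain name to the left of the main domain name."""
    prefix = subdomain[: -len(main_domain)].rstrip(".")
    return prefix.split(".") if prefix else []


def valid_fragment_sets(subdomains, main_domain, level):
    """Group names whose left-side and right-side parts around `level` match."""
    groups = defaultdict(list)
    for name in subdomains:
        labels = split_labels(name, main_domain)
        if level < len(labels):
            key = (tuple(labels[:level]), tuple(labels[level + 1:]))
            groups[key].append(name)
    return groups


def identify_by_level(subdomains, main_domain, is_inundation_set, max_levels):
    """Identify level by level, deleting flagged names before the next level."""
    remaining = list(subdomains)
    flagged = []
    for level in range(max_levels):
        groups = valid_fragment_sets(remaining, main_domain, level)
        for members in groups.values():
            fragments = [split_labels(m, main_domain)[level] for m in members]
            if is_inundation_set(fragments):
                flagged.extend(members)
        flagged_set = set(flagged)
        remaining = [s for s in remaining if s not in flagged_set]
    return flagged, remaining
```

The predicate would typically wrap the dispersed or concentrated distribution judgment; any callable that takes a list of fragments and returns a boolean fits.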
Further, referring to Fig. 5, the system also comprises:
The merging unit 204 is configured to merge at least two adjacent levels of fragments of the sub-domain names having the same main domain name into one level of fragment, after the judging unit 202 judges that the fragment lengths of each valid fragment set of the sub-domain names having the same main domain name are all in dispersed distribution or concentrated distribution;
Exemplarily, merging can be done by removing the "." between fragments. The removal can be stepped up progressively: first remove one "." and attempt identification; if identification still fails, remove two ".", and so on.
For example, for the following domain names:
www.ihazo.qh.gzszyl.go.cn
www.fidoo.edu.gzszyl.go.cn
www.npvny.hb.gzszyl.go.cn
www.tmtmk.ne.gzszyl.go.cn
After the second-level fragment and the third-level fragment are merged, they become:
www.ihazoqh.gzszyl.go.cn
www.fidooedu.gzszyl.go.cn
www.npvnyhb.gzszyl.go.cn
www.tmtmkne.gzszyl.go.cn
The newly obtained fragment is the second-level fragment of the merged domain names above.
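The merge step in this example can be sketched as follows. It is only an illustrative sketch: the function name `merge_adjacent` and its parameters are hypothetical, `gzszyl.go.cn` is assumed to be the main domain name, and fragments are indexed from the left starting at 0, so the second-level and third-level fragments are indices 1 and 2.

```python
def merge_adjacent(subdomain, main_domain, start, count=2):
    """Merge `count` adjacent fragments, beginning at index `start`,
    by removing the "." between them."""
    suffix = "." + main_domain
    if not subdomain.endswith(suffix):
        raise ValueError("sub-domain name does not belong to the main domain")
    labels = subdomain[: -len(suffix)].split(".")
    if start < 0 or count < 2 or start + count > len(labels):
        raise ValueError("merge range is outside the available fragments")
    merged = "".join(labels[start:start + count])
    new_labels = labels[:start] + [merged] + labels[start + count:]
    return ".".join(new_labels + [main_domain])
```

Calling `merge_adjacent("www.ihazo.qh.gzszyl.go.cn", "gzszyl.go.cn", start=1)` yields `www.ihazoqh.gzszyl.go.cn`; increasing `count` removes one more "." per step, which corresponds to the progressive merging described above.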
Accordingly, the judging unit 202 is further configured to obtain new valid fragment sets from the merged fragments and to judge whether the fragment lengths of a new valid fragment set are in dispersed distribution or concentrated distribution;
The identifying unit 203 is further configured to identify the sub-domain names corresponding to the new valid fragment set as inundation sub-domains after the judging unit judges that the fragment lengths of the new valid fragment set are in dispersed distribution or concentrated distribution.
The comparing unit 205 is configured to compare whether the size of the valid fragment set is greater than a preset threshold, so that the judging unit 202 and the identifying unit 203 perform inundation sub-domain identification after the comparing unit 205 determines that the size of the valid fragment set is greater than the preset threshold.
Exemplarily, because inundation sub-domains are usually generated in batches, their number is large. To simplify identification, the system can set a preset threshold and perform inundation sub-domain identification only when the size of a valid fragment set exceeds it. The value range of the preset threshold can be set according to the quantity characteristics of the inundation domain names currently observed, which can be obtained statistically. For example, the preset threshold can be greater than or equal to 50; for the valid fragment sets of first-level fragments it is preferably 500, and for the valid fragment sets of second-level fragments it is preferably 100.
The filtering unit 206 is configured to filter out, according to predefined exemption rules, the fragments or sub-domain names that satisfy the exemption rules, so that the judging unit 202 and the identifying unit 203 do not perform inundation sub-domain identification on them.
Exemplarily, to protect important sub-domain names, exemption rules can be set according to actual needs; a sub-domain name that satisfies an exemption rule will not be identified as an inundation sub-domain.
For example, based on statistics of sub-domain fragments across different main domains, fragments that are meaningful or ubiquitous, such as "bbs", "blog", and "www", are saved in advance as exempt fragments.
As another example, by analyzing sub-domain quality and user visits, particularly important sub-domain names, such as qzone.163.com and bbs.163.com, are collected as exempt sub-domains.
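The two kinds of exemption can be sketched as simple set lookups. This is a hedged sketch: the constant and function names are hypothetical, and the sets contain only the examples named in the text; a real exemption list would come from the statistics described above.

```python
# Exempt fragments and sub-domain names taken from the examples in the text.
EXEMPT_FRAGMENTS = frozenset({"bbs", "blog", "www"})
EXEMPT_SUBDOMAINS = frozenset({"qzone.163.com", "bbs.163.com"})


def drop_exempt_subdomains(subdomains, exempt=EXEMPT_SUBDOMAINS):
    """Remove whole sub-domain names that match an exemption rule."""
    return [name for name in subdomains if name not in exempt]


def drop_exempt_fragments(fragments, exempt=EXEMPT_FRAGMENTS):
    """Remove individual fragments that match an exemption rule."""
    return [fragment for fragment in fragments if fragment not in exempt]
```

Filtering before the judgment keeps exempt entries out of the fragment length statistics, so they cannot be identified as inundation sub-domains.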
The update period setting unit 207 is configured to set an update period. Correspondingly,
The acquiring unit 201 is further configured to obtain, within each update period set by the update period setting unit 207, the sub-domain names having the same main domain name;
The judging unit 202 is further configured to judge, within each update period set by the update period setting unit 207, whether the fragment lengths of any valid fragment set of the sub-domain names having the same main domain name are in dispersed distribution or concentrated distribution;
The identifying unit 203 is further configured to identify, within each update period set by the update period setting unit 207, the sub-domain names corresponding to the valid fragment set as inundation sub-domains after the judging unit 202 judges that the fragment lengths of the valid fragment set are in dispersed distribution or concentrated distribution.
The scheduling unit 208 is configured to schedule the inundation sub-domains corresponding to each valid fragment set identified by the identifying unit 203 as a single sub-domain.
Exemplarily, because inundation sub-domains are similar in quality and content, they can be scheduled as one virtual sub-domain, which saves bandwidth resources.
For example, the left-side domain name part and the right-side domain name part of the valid fragment set can be retained, and the valid fragment can be represented by "*".
For example,
cid-3c148c1cd8599f5e.profile.live.com,
cid-fc56648fc658c405.profile.live.com,
cid-f4bd27e168f86267.profile.live.com,
and other sub-domain names that match the rule *.profile.live.com can be virtualized as one sub-domain name: *.profile.live.com.
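Collapsing one valid fragment set into a virtual sub-domain can be sketched as follows. This is an illustrative sketch only: the function name `virtual_subdomain` is hypothetical, and `level` is assumed to be the index, counted from the left starting at 0, of the fragment that varies within the set.

```python
def virtual_subdomain(members, level):
    """Replace the varying fragment at index `level` with "*" and check that
    every member of the set collapses to the same pattern."""
    patterns = set()
    for name in members:
        labels = name.split(".")
        labels[level] = "*"  # keep the left-side and right-side parts as-is
        patterns.add(".".join(labels))
    if len(patterns) != 1:
        raise ValueError("members do not form a single valid fragment set")
    return patterns.pop()


members = [
    "cid-3c148c1cd8599f5e.profile.live.com",
    "cid-fc56648fc658c405.profile.live.com",
    "cid-f4bd27e168f86267.profile.live.com",
]
```

For the three names above, `virtual_subdomain(members, 0)` returns `*.profile.live.com`, so the whole set can be scheduled once under that pattern.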
The inundation sub-domain identification system provided by this embodiment identifies inundation sub-domains by judging the dispersion or concentration of the fragment lengths of any valid fragment set of the sub-domain names having the same main domain name: if the fragment lengths of a valid fragment set are in dispersed distribution or concentrated distribution, the sub-domain names corresponding to that valid fragment set are identified as inundation sub-domains. This improves the identification rate of inundation sub-domains and solves the problem in the prior art that deciding solely by the number of sub-domains catches only the most severe inundation sub-domains and therefore has a low identification rate.
One of ordinary skill in the art will appreciate that all or part of the steps of the above method embodiments can be carried out by hardware under the control of program instructions. The program can be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes various media capable of storing program code, such as ROM, RAM, magnetic disks, or optical discs.
The above are only specific embodiments of the present invention, but the scope of protection of the present invention is not limited thereto. Any change or substitution readily conceivable by a person skilled in the art within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be defined by the claims.

Claims (18)

1. A method for identifying inundation sub-domains, characterized by comprising:
Obtaining sub-domain names having the same main domain name;
If the fragment lengths of any valid fragment set of the sub-domain names having the same main domain name are judged to be in dispersed distribution or concentrated distribution, identifying the sub-domain names corresponding to that valid fragment set as inundation sub-domains, wherein the valid fragment set is: among the same-level fragments of the sub-domain names having the same main domain name, the set of fragments whose left-side domain name parts and/or right-side domain name parts are respectively identical.
2. The method for identifying inundation sub-domains according to claim 1, characterized in that:
When the same-level fragment is not the highest-level fragment of the sub-domain names having the same main domain name, the left-side domain name parts and/or right-side domain name parts being respectively identical comprises: the left-side domain name parts and the right-side domain name parts of the same-level fragments are respectively identical;
When the same-level fragment is the highest-level fragment of the sub-domain names having the same main domain name, the left-side domain name parts and/or right-side domain name parts being respectively identical comprises: the right-side domain name parts of the same-level fragments are identical.
3. The method for identifying inundation sub-domains according to claim 1 or 2, characterized in that judging that the fragment lengths of any valid fragment set of the sub-domain names having the same main domain name are in dispersed distribution comprises:
Obtaining an average fragment count, wherein the average fragment count is obtained by dividing the total number of fragments in the valid fragment set by the total number of distinct fragment lengths in the valid fragment set;
If the average fragment count is less than a first dispersion threshold, judging that the fragment lengths of the valid fragment set are in dispersed distribution.
4. The method for identifying inundation sub-domains according to claim 3, characterized in that judging that the fragment lengths of any valid fragment set of the sub-domain names having the same main domain name are in dispersed distribution further comprises:
If the average fragment count is not less than the first dispersion threshold, counting the number of fragments in the valid fragment set that contain a separator, or the naming patterns of the valid fragment set;
If the number of separator-containing fragments in the valid fragment set is greater than a preset separator threshold, or the proportion of any one naming pattern of the valid fragment set is greater than a preset proportion threshold, and the average fragment count is less than a second dispersion threshold, judging that the fragment lengths of the valid fragment set are in dispersed distribution.
5. The method for identifying inundation sub-domains according to claim 1 or 2, characterized in that judging that the fragment lengths of any valid fragment set of the sub-domain names having the same main domain name are in concentrated distribution comprises:
Obtaining an average fragment count, wherein the average fragment count is obtained by dividing the total number of fragments in the valid fragment set by the total number of distinct fragment lengths in the valid fragment set;
Obtaining the number of effective lengths in the valid fragment set, wherein a fragment length whose fragment count is greater than the product of the average fragment count and an adjustment factor is an effective length;
If the ratio of the number of effective lengths to the total number of distinct fragment lengths is less than a first concentration threshold, judging that the fragment lengths of the valid fragment set are in concentrated distribution.
6. The method for identifying inundation sub-domains according to claim 5, characterized in that judging that the fragment lengths of any valid fragment set of the sub-domain names having the same main domain name are in concentrated distribution further comprises:
If the ratio of the number of effective lengths to the total number of distinct fragment lengths is not less than the first concentration threshold, counting the number of fragments in the valid fragment set that contain a separator, or the naming patterns of the valid fragment set;
If the number of separator-containing fragments in the valid fragment set is greater than a preset separator threshold, or the proportion of any one naming pattern of the valid fragment set is greater than a preset proportion threshold, and the ratio of the number of effective lengths to the total number of distinct fragment lengths is less than a second concentration threshold, judging that the fragment lengths of the valid fragment set are in concentrated distribution.
7. The method for identifying inundation sub-domains according to any one of claims 1-6, characterized in that the method further comprises:
If the fragment lengths of each valid fragment set of the sub-domain names having the same main domain name are judged to be all in dispersed distribution or concentrated distribution, and the sub-domain names having the same main domain name have at least two levels of fragments, merging at least two adjacent levels of fragments of the sub-domain names having the same main domain name into one level of fragment;
Obtaining new valid fragment sets from the merged fragments, and, if the fragment lengths of a new valid fragment set are judged to be in dispersed distribution or concentrated distribution, identifying the sub-domain names corresponding to the new valid fragment set as inundation sub-domains.
8. The method for identifying inundation sub-domains according to claim 7, characterized in that, before identifying the sub-domain names corresponding to the valid fragment set as inundation sub-domains if the fragment lengths of any valid fragment set of the sub-domain names having the same main domain name are judged to be in dispersed distribution or concentrated distribution, the method further comprises:
Filtering out, according to predefined exemption rules, the fragments or sub-domain names that satisfy the exemption rules, so that inundation sub-domain identification is not performed on them.
9. The method for identifying inundation sub-domains according to claim 8, characterized in that the method further comprises setting an update period;
Obtaining the sub-domain names having the same main domain name comprises: obtaining, within each set update period, the sub-domain names having the same main domain name;
Identifying the sub-domain names corresponding to the valid fragment set as inundation sub-domains if the fragment lengths of any valid fragment set of the sub-domain names having the same main domain name are judged to be in dispersed distribution or concentrated distribution comprises: within each set update period, if the fragment lengths of any valid fragment set of the sub-domain names having the same main domain name are judged to be in dispersed distribution or concentrated distribution, identifying the sub-domain names corresponding to that valid fragment set as inundation sub-domains.
10. A system for identifying inundation sub-domains, characterized by comprising:
An acquiring unit, configured to obtain sub-domain names having the same main domain name;
A judging unit, configured to judge whether the fragment lengths of any valid fragment set of the sub-domain names having the same main domain name obtained by the acquiring unit are in dispersed distribution or concentrated distribution, wherein the valid fragment set is: among the same-level fragments of the sub-domain names having the same main domain name, the set of fragments whose left-side domain name parts and/or right-side domain name parts are respectively identical;
An identifying unit, configured to identify the sub-domain names corresponding to the valid fragment set as inundation sub-domains after the judging unit judges that the fragment lengths of the valid fragment set are in dispersed distribution or concentrated distribution.
11. The system for identifying inundation sub-domains according to claim 10, characterized in that:
When the same-level fragment is not the highest-level fragment of the sub-domain names having the same main domain name, the left-side domain name parts and/or right-side domain name parts being respectively identical comprises: the left-side domain name parts and the right-side domain name parts of the same-level fragments are respectively identical;
When the same-level fragment is the highest-level fragment of the sub-domain names having the same main domain name, the left-side domain name parts and/or right-side domain name parts being respectively identical comprises: the right-side domain name parts of the same-level fragments are identical.
12. The system for identifying inundation sub-domains according to claim 10 or 11, characterized in that the judging unit comprises:
An acquisition module, configured to obtain an average fragment count, wherein the average fragment count is obtained by dividing the total number of fragments in the valid fragment set by the total number of distinct fragment lengths in the valid fragment set;
A first judging module, configured to judge that the fragment lengths of the valid fragment set are in dispersed distribution after determining that the average fragment count obtained by the acquisition module is less than a first dispersion threshold.
13. The system for identifying inundation sub-domains according to claim 12, characterized in that the judging unit further comprises:
A statistical module, configured to count the number of fragments in the valid fragment set that contain a separator, or the naming patterns of the valid fragment set, after the first judging module judges that the average fragment count is not less than the first dispersion threshold;
A second judging module, configured to judge that the fragment lengths of the valid fragment set are in dispersed distribution after determining that the number of separator-containing fragments counted by the statistical module is greater than a preset separator threshold, or that the proportion of any one naming pattern of the valid fragment set is greater than a preset proportion threshold, and after the first judging module determines that the average fragment count is less than a second dispersion threshold.
14. The system for identifying inundation sub-domains according to claim 10 or 11, characterized in that the judging unit comprises:
A first acquisition module, configured to obtain an average fragment count, wherein the average fragment count is obtained by dividing the total number of fragments in the valid fragment set by the total number of distinct fragment lengths in the valid fragment set;
A second acquisition module, configured to obtain the number of effective lengths in the valid fragment set, wherein a fragment length whose fragment count is greater than the product of the average fragment count and an adjustment factor is an effective length;
A first judging module, configured to judge that the fragment lengths of the valid fragment set are in concentrated distribution after determining that the ratio of the number of effective lengths obtained by the second acquisition module to the total number of distinct fragment lengths is less than a first concentration threshold.
15. The system for identifying inundation sub-domains according to claim 14, characterized in that the judging unit further comprises:
A statistical module, configured to count the number of fragments in the valid fragment set that contain a separator, or the naming patterns of the valid fragment set, after the first judging module determines that the ratio of the number of effective lengths to the total number of distinct fragment lengths is not less than the first concentration threshold;
A second judging module, configured to judge that the fragment lengths of the valid fragment set are in concentrated distribution after determining that the number of separator-containing fragments counted by the statistical module is greater than a preset separator threshold, or that the proportion of any one naming pattern of the valid fragment set is greater than a preset proportion threshold, and after the first judging module determines that the ratio of the number of effective lengths to the total number of distinct fragment lengths is less than a second concentration threshold.
16. The system for identifying inundation sub-domains according to claim 15, characterized in that the system further comprises:
A merging unit, configured to merge at least two adjacent levels of fragments of the sub-domain names having the same main domain name into one level of fragment, after the judging unit judges that the fragment lengths of each valid fragment set of the sub-domain names having the same main domain name are all in dispersed distribution or concentrated distribution and that the sub-domain names having the same main domain name have at least two levels of fragments;
The judging unit is further configured to obtain new valid fragment sets from the merged fragments and to judge whether the fragment lengths of a new valid fragment set are in dispersed distribution or concentrated distribution;
The identifying unit is further configured to identify the sub-domain names corresponding to the new valid fragment set as inundation sub-domains after the judging unit judges that the fragment lengths of the new valid fragment set are in dispersed distribution or concentrated distribution.
17. The system for identifying inundation sub-domains according to claim 16, characterized in that the system further comprises:
A filtering unit, configured to filter out, according to predefined exemption rules, the fragments or sub-domain names that satisfy the exemption rules, so that the judging unit and the identifying unit do not perform inundation sub-domain identification on them.
18. The system for identifying inundation sub-domains according to claim 17, characterized in that the system further comprises an update period setting unit, configured to set an update period;
The acquiring unit is further configured to obtain, within each update period set by the update period setting unit, the sub-domain names having the same main domain name;
The judging unit is further configured to judge, within each update period set by the update period setting unit, whether the fragment lengths of any valid fragment set of the sub-domain names having the same main domain name are in dispersed distribution or concentrated distribution;
The identifying unit is further configured to identify, within each update period set by the update period setting unit, the sub-domain names corresponding to the valid fragment set as inundation sub-domains after the judging unit judges that the fragment lengths of the valid fragment set are in dispersed distribution or concentrated distribution.
CN201210256109.2A 2012-07-23 2012-07-23 The recognition methods and system of inundation sub-domain Active CN103581347B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210256109.2A CN103581347B (en) 2012-07-23 2012-07-23 The recognition methods and system of inundation sub-domain

Publications (2)

Publication Number Publication Date
CN103581347A true CN103581347A (en) 2014-02-12
CN103581347B CN103581347B (en) 2019-03-26

Family

ID=50052255

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210256109.2A Active CN103581347B (en) 2012-07-23 2012-07-23 The recognition methods and system of inundation sub-domain

Country Status (1)

Country Link
CN (1) CN103581347B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101650715A (en) * 2008-08-12 2010-02-17 厦门市美亚柏科信息股份有限公司 Method and device for screening links on web pages
CN101702660A (en) * 2009-11-12 2010-05-05 中国科学院计算技术研究所 Abnormal domain name detection method and system
CN102158427A (en) * 2011-03-23 2011-08-17 陈伟强 Email address structure and mail sending and receiving system
CN102523311A (en) * 2011-11-25 2012-06-27 中国科学院计算机网络信息中心 Illegal domain name recognition method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘资茂等: "基于代理控制力的Fast-Flux僵尸网络检测方法", 《广西大学学报:自然科学版》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108933846A (en) * 2018-06-21 2018-12-04 北京谷安天下科技有限公司 A kind of recognition methods, device and the electronic equipment of general parsing domain name
CN108933846B (en) * 2018-06-21 2021-08-27 北京谷安天下科技有限公司 Method and device for identifying domain name by pan-resolution and electronic equipment

Also Published As

Publication number Publication date
CN103581347B (en) 2019-03-26

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
EXSB Decision made by sipo to initiate substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant