CN107566389A

CN107566389A - A method for identifying URL-imitating phishing domain names based on a C4.5 decision tree

Info

Publication number: CN107566389A
Application number: CN201710843991.3A
Authority: CN
Inventors: 张永斌; 姚强
Original assignee: Ji'nan Mutual Trust Software Co Ltd
Current assignee: Ji'nan Mutual Trust Software Co Ltd
Priority date: 2017-09-19
Filing date: 2017-09-19
Publication date: 2018-01-09

Abstract

The invention provides a method, based on a C4.5 decision tree, for identifying phishing domain names that imitate URL links. The method comprises the following steps: S1, extracting the domain names that imitate URL links together with their features; S2, classifying the URL-imitating domain names with the C4.5 algorithm and building a classification tree; S3, intercepting the domain names that match the type identified by the classification tree. The invention can extract the high-risk domain names among them and assess the security of such domain names in real time.

Description

A method for identifying URL-imitating phishing domain names based on a C4.5 decision tree

Technical field

The present invention relates to the field of Internet technology, and in particular to a method for identifying URL-imitating phishing domain names based on a C4.5 decision tree.

Background

Phishing is a form of electronic theft in which an attacker poses as a trustworthy entity in e-commerce to obtain sensitive information from unsuspecting users. As the Internet has spread, the harm that phishing causes to Internet users has become increasingly common, and a large number of phishing websites exist on the network. The Anti-Phishing Working Group (APWG) found 1,220,523 phishing attacks in the fourth quarter of 2016 [1]. The Anti-Phishing Alliance of China (APAC) found a total of 4,958 phishing websites in the first quarter of 2017 [2]. The phishing situation is severe and seriously affects the network environment. Research shows that a large number of phishing domain names share an obvious feature, for example: www.paypal.com.signin.country.en.locale.en.diamondzapper.com. Users who lack networking knowledge easily mistake such a domain name for a URL link. Domain names of this kind are referred to here as URL-imitating domain names. Because they are highly deceptive to users, quickly assessing their security is important for improving the user's online experience and cleaning up the network.

Summary of the invention

The invention provides a method for identifying URL-imitating phishing domain names based on a C4.5 decision tree, which extracts the high-risk domain names among them and assesses the security of such domain names in real time.

To solve the above technical problem, an embodiment of the present application provides a method for identifying URL-imitating phishing domain names based on a C4.5 decision tree, comprising the following steps:

S1, extract the domain names that imitate URL links and their features;

S2, classify the URL-imitating domain names with the C4.5 algorithm and build a classification tree;

S3, intercept the domain names that match the type identified by the classification tree.

In a preferred technical scheme of the present invention, domain names that imitate URL links have the following features:

1) the domain name has more levels and is longer;

2) the character-type transition frequency of the domain name is high, and the maximum run of consecutive letters or the maximum run of consecutive digits is shorter;

3) the domain name contains more hyphens;

4) the domain name contains a brand name, and the brand name sits in a prominent position;

5) the level of the longest subdomain is higher.
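Features 1) to 3) are purely lexical and can be computed from the domain string alone. The following is a minimal sketch under stated assumptions, not the patented extraction procedure: the function name `lexical_features` and the dictionary keys are hypothetical, and "character-type transition frequency" is assumed here to mean the number of adjacent letter/digit switches in the name.

```python
import re


def lexical_features(domain: str) -> dict:
    """Compute simple lexical features of a domain name (illustrative definitions)."""
    name = domain.lower().rstrip(".")
    labels = name.split(".")

    # Runs of consecutive letters and consecutive digits anywhere in the name.
    letter_runs = re.findall(r"[a-z]+", name)
    digit_runs = re.findall(r"[0-9]+", name)

    # Assumed definition: count adjacent character pairs that switch
    # between a letter and a digit (in either direction).
    transitions = sum(
        1
        for a, b in zip(name, name[1:])
        if (a.isalpha() and b.isdigit()) or (a.isdigit() and b.isalpha())
    )

    return {
        "levels": len(labels),          # number of dot-separated labels
        "length": len(name),            # total length including dots
        "transitions": transitions,
        "max_letter_run": max((len(r) for r in letter_runs), default=0),
        "max_digit_run": max((len(r) for r in digit_runs), default=0),
        "hyphens": name.count("-"),
    }
```

Applied to the example from the background section, www.paypal.com.signin.country.en.locale.en.diamondzapper.com, this sketch reports ten levels, no hyphens and no digits.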

In a preferred technical scheme of the present invention, the classification tree is built as follows:

Step1: Preprocess the sample data and normalize the data format to form the training set of the decision tree;

Step2: Calculate the information gain ratio of each attribute;

Assume the training sample set is $S$ and the training samples fall into $k$ classes, $C=\{C_1,C_2,\ldots,C_k\}$, where $p(S_i)$ denotes the proportion of samples belonging to $C_i$. The information entropy of the set $S$ is then given by formula (1),

$$I(S) = -\sum_{i=1}^{k} p(S_i)\log_2 p(S_i) \tag{1}$$

Assume the attribute set is $A$, with $A=\{A_1,A_2,\ldots,A_m\}$. Select $A_j$ as the test attribute to partition the samples, and let $Values(A_j)$ be the range of values of $A_j$. The information gain of attribute $A_j$ is then given by formula (2),

$$Gain(S,A_j) = I(S) - \sum_{v\in Values(A_j)} \frac{|S_v|}{|S|}\, I(S_v) \tag{2}$$

where $|S|$ is the number of elements in the sample set and $|S_v|$ is the number of elements of $S$ whose attribute $A_j$ takes the value $v$. The breadth and uniformity with which attribute $A_j$ partitions the sample set $S$ (the split information) is then given by formula (3),

$$SplitInfo(S,A_j) = -\sum_{v\in Values(A_j)} \frac{|S_v|}{|S|}\log_2\frac{|S_v|}{|S|} \tag{3}$$

The information gain ratio of attribute $A_j$ follows from the information gain and the split information, as shown in formula (4),

$$GainRatio(S,A_j) = \frac{Gain(S,A_j)}{SplitInfo(S,A_j)} \tag{4}$$
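The entropy, information gain, split information and gain ratio described in Step2 can be sketched in a few lines. This is an illustrative sketch only: it assumes every attribute has already been discretized into a small set of values (the text does not say how continuous features such as length are binned), and the function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Information entropy I(S) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def partition(rows, labels, attr):
    """Group the class labels by the value each row takes for attribute attr."""
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attr], []).append(y)
    return groups


def gain(rows, labels, attr):
    """Information gain of splitting on attr."""
    n = len(labels)
    groups = partition(rows, labels, attr)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())


def split_info(rows, labels, attr):
    """Split information of attr: entropy of the partition sizes."""
    n = len(labels)
    groups = partition(rows, labels, attr)
    return -sum(len(g) / n * log2(len(g) / n) for g in groups.values())


def gain_ratio(rows, labels, attr):
    """Gain ratio; defined as 0 when the attribute does not split the set."""
    si = split_info(rows, labels, attr)
    return gain(rows, labels, attr) / si if si > 0 else 0.0
```

For a balanced two-class set, such as a training set with equal numbers of phishing and legitimate samples, `entropy` returns exactly 1 bit, which is the largest value a two-class set can take.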

Step3: Build the decision tree model

Select the attribute with the highest information gain ratio (for example, the maximum subdomain level) as the root node of the decision tree. Among the remaining candidate attributes, select the attribute with the highest information gain ratio as the branch node, and recurse to form the decision tree model.
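The recursive construction in Step3 can be sketched as follows. This is a simplified illustration, not the full C4.5 algorithm: it assumes discretized attributes, uses a nested-dictionary tree representation chosen only for this example, and omits continuous-threshold splits, missing-value handling and pruning. The attribute names and toy samples in the usage note are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def gain_ratio(rows, labels, attr):
    """Gain ratio of a discrete attribute (0 if it does not split the set)."""
    n = len(labels)
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attr], []).append(y)
    gain = entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())
    split = -sum(len(g) / n * log2(len(g) / n) for g in groups.values())
    return gain / split if split > 0 else 0.0


def build_tree(rows, labels, attrs):
    """Return a class label (leaf) or a dict node split on the best attribute."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attrs:
        return majority
    # Pick the candidate attribute with the highest gain ratio.
    best = max(attrs, key=lambda a: gain_ratio(rows, labels, a))
    if gain_ratio(rows, labels, best) <= 0:
        return majority
    node = {"attr": best, "default": majority, "children": {}}
    rest = [a for a in attrs if a != best]
    for value in {row[best] for row in rows}:
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        node["children"][value] = build_tree(
            [rows[i] for i in idx], [labels[i] for i in idx], rest
        )
    return node


def classify(tree, row):
    """Walk the tree; unseen attribute values fall back to the node majority."""
    while isinstance(tree, dict):
        tree = tree["children"].get(row[tree["attr"]], tree["default"])
    return tree
```

As a usage example, four toy samples described by `brand` (yes/no) and `hyphens` (high/low), where only the sample with no brand name and few hyphens is legitimate, yield a two-level tree that classifies all four samples correctly.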

The one or more technical schemes provided in the embodiments of the present application have at least the following technical effect or advantage:

The high-risk domain names among URL-imitating domain names can be extracted, and the security of such domain names can be assessed in real time.

Brief description of the drawings

To illustrate the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed to describe the embodiments or the prior art are briefly introduced below. The drawings described below are evidently only some embodiments of the invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a schematic flow diagram of an embodiment of the present application;

Fig. 2 is the distribution of brand-name position prominence for domain names in an embodiment of the present application;

Fig. 3 is the distribution of the maximum consecutive-letter length of domain names in an embodiment of the present application;

Fig. 4 is the distribution of the maximum consecutive-digit length of domain names in an embodiment of the present application;

Fig. 5 is the distribution of longest-subdomain prominence of domain names in an embodiment of the present application.

Detailed description of the embodiments

For a better understanding of the above technical scheme, it is described in detail below with reference to the drawings and specific embodiments.

The method for identifying URL-imitating phishing domain names based on a C4.5 decision tree described in this embodiment comprises the following steps:

S1, extract the domain names that imitate URL links and their features;

S2, classify the URL-imitating domain names with the C4.5 algorithm and build a classification tree;

S3, intercept the domain names that match the type identified by the classification tree.

In this embodiment, domain names that imitate URL links have the following features:

1) The domain name has more levels and is longer. Legitimate domain names, by contrast, are meant to be easy for users to remember, so they are usually short and have few levels.

2) The character-type transition frequency of the domain name is high, and the maximum run of consecutive letters or the maximum run of consecutive digits is shorter. Legitimate domain names are usually named by hand and, to be easy to remember, often use consecutive letters or digits, so their letter/digit transition frequency is low.

3) The domain name contains more hyphens. The character structure of legitimate domain names is simpler, and they contain fewer hyphens.

4) The domain name contains a brand name, and the brand name sits in a prominent position. Legitimate domain names that are not well known rarely contain a brand name. To increase the probability that users visit, phishers use a brand name as a subdomain and place it in a prominent position. In addition, some phishers nest a well-known domain name inside the domain name to make it more deceptive.

5) The level of the longest subdomain is higher. To make the real main domain hard for users to find, the longest subdomain of a phishing domain name usually sits at a relatively low level, whereas legitimate domain names do not have this characteristic.
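Features 4) and 5) depend on where a brand name and the longest label sit within the domain name. The text does not give a formula for position prominence, so the following sketch only locates the two labels and reports their levels. It rests on several assumptions: levels are counted from the right with the top-level domain as level 1, the brand list `BRANDS` is a hypothetical placeholder, and a brand is detected by simple substring matching.

```python
BRANDS = ("paypal", "google")  # hypothetical brand list, for illustration only


def brand_and_longest_label(domain, brands=BRANDS):
    """Locate the leftmost brand-bearing label and the longest label."""
    labels = domain.lower().rstrip(".").split(".")
    n = len(labels)

    # Assumed convention: level counted from the right, top-level domain = 1.
    brand_level = None
    for idx, label in enumerate(labels):
        if any(b in label for b in brands):
            brand_level = n - idx
            break  # keep the leftmost occurrence

    # The first label of maximal length wins if several share that length.
    longest_idx = max(range(n), key=lambda i: len(labels[i]))

    return {
        "levels": n,
        "has_brand": brand_level is not None,
        "brand_level": brand_level,
        "longest_label": labels[longest_idx],
        "longest_label_level": n - longest_idx,
    }
```

For www.paypal.com.signin.country.en.locale.en.diamondzapper.com this sketch finds the brand name at level 9 of 10, near the left edge where a reader looks first, while the longest label, diamondzapper, is the real main domain at level 2.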

In this embodiment, the classification tree is built as follows:

Step1: Preprocess the sample data and normalize the data format to form the training set of the decision tree;

Step2: Calculate the information gain ratio of each attribute;

Step3: Build the decision tree model

While the decision tree is being created, noise and outliers in the data can produce abnormal branches that reflect the training set rather than the underlying pattern. Overfitting of this kind is handled by pruning: unreliable branches are cut off on the basis of statistical measures, so that the pruned decision tree classifies the data under test faster and more accurately.
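The text does not say which statistical measure is used for pruning. One measure commonly described for C4.5 is a pessimistic error estimate: the upper confidence bound of a node's observed error rate. The sketch below is an assumption-laden illustration of that idea, not the patented procedure; the constant `Z_25`, the function names and the leaf counts in the usage note are all hypothetical choices for this example.

```python
from collections import Counter
from math import sqrt

Z_25 = 0.69  # normal deviate usually quoted for a 25% confidence level (assumed default)


def pessimistic_error(n, errors, z=Z_25):
    """Upper confidence bound on the error rate of a node covering n samples."""
    f = errors / n
    num = f + z * z / (2 * n) + z * sqrt(f / n - f * f / n + z * z / (4 * n * n))
    return num / (1 + z * z / n)


def should_prune(leaf_counts, z=Z_25):
    """Decide whether a subtree should collapse into a single majority leaf.

    leaf_counts holds one Counter of class counts per leaf of the subtree.
    The subtree is pruned when the estimated error of the merged leaf is
    no worse than the size-weighted estimate over the existing leaves.
    """
    merged = Counter()
    for counts in leaf_counts:
        merged.update(counts)
    n_total = sum(merged.values())
    merged_est = pessimistic_error(n_total, n_total - max(merged.values()), z)

    subtree_est = 0.0
    for counts in leaf_counts:
        n = sum(counts.values())
        subtree_est += n / n_total * pessimistic_error(n, n - max(counts.values()), z)
    return merged_est <= subtree_est
```

With three small, impure leaves holding 4/2, 1/1 and 4/2 samples of the two classes, the merged leaf has the lower estimate and the subtree is pruned; with two pure leaves of ten samples each, the subtree is kept.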

Test results and analysis

Data source

A large number of known phishing domain names were collected from sites such as Phishtank, Openphish [15] and Watcherlab. The domain name screening conditions described here were applied to extract the set of URL-imitating domain names, from which 2,008 domain names with obvious features were taken as negative samples.

Most domain names on the Internet are legitimate and phishing domain names are comparatively rare, but the volume of domain name data is too large to label manually. This experiment therefore collected access data from an education network, filtered out the phishing domain names in the data set, and extracted 171,834 URL-imitating domain names as positive samples.

Classification performance evaluation

Test feature analysis

Statistical analysis was performed on the 2,008 labelled URL-imitating phishing domain names; part of the results is shown in Figs. 2, 3 and 4. The analysis shows that brand-name prominence, maximum consecutive-letter length, maximum consecutive-digit length and longest-subdomain prominence discriminate well when detecting phishing domain names.

As Fig. 2 shows, URL-imitating phishing domain names often contain a brand name (about 36% of them), and the brand name occupies a prominent position in the domain name. Among legitimate domain names, about 93% contain no brand name, and where one is present its position prominence is low.

As Fig. 3 shows, the maximum consecutive-letter length is below 20 for about 56% of safe domain names, but below 20 for about 94% of URL-imitating phishing domain names. The maximum consecutive-letter length of URL-imitating phishing domain names is therefore comparatively low, and that of safe domain names comparatively high.

As Fig. 4 shows, about 65% of URL-imitating phishing domain names contain no consecutive digits, whereas only about 13% of legitimate domain names contain none. The maximum consecutive-digit length of URL-imitating phishing domain names is comparatively low, and that of safe domain names comparatively high.

As Fig. 5 shows, when the longest-subdomain prominence is below 0.67, URL-imitating phishing domain names in that range account for about 21% of all such phishing domain names, while 45% of legitimate domain names fall within the range. The longest subdomain of URL-imitating phishing domain names is more prominent, and the mean longest-subdomain prominence of legitimate domain names is lower than that of phishing domain names.

Classifier performance evaluation

From the phishing domain name set and the safe domain name set, 1,041 domain names each were drawn at random to serve as the negative and positive samples of the training set, respectively. The data set was classified with a C4.5 decision tree classifier and validated by ten-fold cross-validation; the results are shown in Table 1.

Table 1. Classification results on the URL-imitating domain name training set

As Table 1 shows, the classifier's recognition accuracy reaches 91.80% for phishing domain names and 96.80% for safe domain names, so the classifier can effectively extract the high-risk domain names among URL-imitating domain names. An analysis of the misreported domain names in the experiment shows that a small number of phishing domain names were misreported as safe, for example wp-secured-accout.com, because their phishing features are not pronounced. Some safe domain names were misreported as phishing, for example a Kingsoft Cloud domain name, bd7316f02e7e46499eda436584d213dc.trace-ldns.ksyun.com: its fourth-level label is a random string, which resembles some URL-imitating phishing domain names and causes the misreport.
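The ten-fold cross-validation used for this evaluation splits the 2,082 training samples (1,041 per class) into ten folds and holds each fold out once. A minimal sketch of the index bookkeeping follows; the function name `k_fold_indices` and the fixed seed are hypothetical, and the split shown is a plain shuffled split, whereas the text does not state whether the folds were stratified by class.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)          # reproducible shuffle
    folds = [idx[i::k] for i in range(k)]     # k nearly equal-sized folds
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test
```

Each sample is used for testing exactly once and for training in the other nine rounds; the per-class accuracies are then averaged or pooled over the ten rounds.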

Classifier classification results

Because phishing domain names make up only a very small share of real network traffic, the experiment uses far more safe domain names than phishing domain names, so as to simulate a real network detection scenario and reflect how the model's classifier performs in a live network. The experiment uses 30,000 safe domain names and 967 phishing domain names. The classification results are shown in Table 2.

Table 2. Classification results for URL-imitating domain names

As Table 2 shows, 1.00% of the safe domain names and 2.70% of the phishing domain names were misreported. An analysis of the misreports shows that the misreported safe domain names are mainly content delivery network (CDN) domain names and proxy software domain names. For example, 1445516683-state-connected.D4EE071C9C86.1445535542.cc.hiwifi.com is a HiWiFi (极路由) router domain name; when a connection is requested, the information about the site to be connected is converted into random characters, and because its character structure resembles that of URL-imitating phishing domain names it is misreported. 128a5743c1148cd503b9ced8e549480b.google.com.dnsbl7.mailshell.net is a safe domain name that the network security service company Mailshell uses for detection data; because its subdomain contains the brand name Google in a prominent position and its letter/digit transition frequency is high, it is misjudged as a phishing domain name. A small number of phishing domain names were misreported as safe; analysis shows this is because their features are not pronounced.
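The misreport rates above can be turned into approximate counts, which shows how strongly the class imbalance affects the share of flagged domain names that are really phishing. The arithmetic below is a back-of-the-envelope sketch: it assumes each percentage is taken over its own class in the 30,000 / 967 test set, and the rounded counts are implied values, not figures reported in the text.

```python
n_safe, n_phishing = 30_000, 967      # test-set sizes stated in the text
safe_misreport = 0.0100               # share of safe names flagged as phishing
phishing_misreport = 0.0270           # share of phishing names passed as safe

false_positives = round(n_safe * safe_misreport)
false_negatives = round(n_phishing * phishing_misreport)
true_positives = n_phishing - false_negatives

# Share of all flagged domain names that are actually phishing.
precision = true_positives / (true_positives + false_positives)

print(false_positives, false_negatives, round(precision, 3))
```

Under these assumptions roughly 300 safe domain names and 26 phishing domain names are misreported, so about three quarters of the flagged domain names are genuine phishing domain names even though only 1.00% of safe domain names are misreported.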

Analysis and discussion of experimental results

In summary, the URL-imitating phishing domain name identification model based on a C4.5 decision tree can detect phishing domain names effectively. The experiment nevertheless has a certain miss rate, and a large number of proxy software domain names and CDN domain names of well-known websites are misreported as phishing domain names. In later work these domain names will be compiled into a list and added to a whitelist to filter out safe domain names. The domain names with weak features, which are easily missed, will be studied further to mine more effective feature information and improve the detection rate for phishing domain names.
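The whitelist step proposed above, combined with the interception step S3, can be sketched as a suffix check applied before the classifier's verdict is acted on. Everything named here is hypothetical: the whitelist entries are illustrative only (taken from the misreported examples in the text), and `classifier` stands for any callable that returns a class label for a domain name.

```python
# Illustrative whitelist entries only; a real list would be curated separately.
WHITELIST_SUFFIXES = {"ksyun.com", "hiwifi.com", "mailshell.net"}


def is_whitelisted(domain, suffixes=WHITELIST_SUFFIXES):
    """True if the domain equals a whitelisted name or is a subdomain of one."""
    name = domain.lower().rstrip(".")
    return any(name == s or name.endswith("." + s) for s in suffixes)


def should_intercept(domain, classifier, suffixes=WHITELIST_SUFFIXES):
    """Intercept only non-whitelisted domains the classifier labels as phishing."""
    if is_whitelisted(domain, suffixes):
        return False
    return classifier(domain) == "phishing"
```

Matching on the registered-domain suffix, rather than on a substring, keeps a phishing domain name that merely embeds a whitelisted name in its left-hand labels from slipping through.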

The above is only a preferred embodiment of the present invention and does not limit the invention in any form. Although the invention has been disclosed above by way of a preferred embodiment, that embodiment is not intended to limit it. Any person skilled in the art may, without departing from the scope of the technical solution of the invention, use the technical content disclosed above to make minor changes or modifications that amount to equivalent embodiments. Any simple modification, equivalent change or refinement made to the above embodiment in accordance with the technical essence of the invention, without departing from the content of its technical solution, still falls within the scope of the technical solution of the invention.

Claims

1. A method for identifying URL-imitating phishing domain names based on a C4.5 decision tree, characterized by comprising the following steps:

S1, extract the domain names that imitate URL links and their features;

S2, classify the URL-imitating domain names with the C4.5 algorithm and build a classification tree;

S3, intercept the domain names that match the type identified by the classification tree.

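2. The method for identifying URL-imitating phishing domain names based on a C4.5 decision tree according to claim 1, characterized in that domain names that imitate URL links have the following features:

1) the domain name has more levels and is longer;

2) the character-type transition frequency of the domain name is high, and the maximum run of consecutive letters or the maximum run of consecutive digits is shorter;

3) the domain name contains more hyphens;

4) the domain name contains a brand name, and the brand name sits in a prominent position;

5) the level of the longest subdomain is higher.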

3. The method for identifying URL-imitating phishing domain names based on a C4.5 decision tree according to claim 1, characterized in that the classification tree is built as follows:

Step1: Preprocess the sample data and normalize the data format to form the training set of the decision tree;

Step2: Calculate the information gain ratio of each attribute;

Assume the training sample set is $S$ and the training samples fall into $k$ classes, $C=\{C_1,C_2,\ldots,C_k\}$, where $p(S_i)$ denotes the proportion of samples belonging to $C_i$. The information entropy of the set $S$ is then given by formula (1),

$$I(S) = -\sum_{i=1}^{k} p(S_i)\log_2 p(S_i) \tag{1}$$

Assume the attribute set is $A$, with $A=\{A_1,A_2,\ldots,A_m\}$. Select $A_j$ as the test attribute to partition the samples, and let $Values(A_j)$ be the range of values of $A_j$. The information gain of attribute $A_j$ is then given by formula (2),

$$Gain(S,A_j) = I(S) - \sum_{v\in Values(A_j)} \frac{|S_v|}{|S|}\, I(S_v) \tag{2}$$

where $|S|$ is the number of elements in the sample set and $|S_v|$ is the number of elements of $S$ whose attribute $A_j$ takes the value $v$. The breadth and uniformity with which attribute $A_j$ partitions the sample set $S$ (the split information) is then given by formula (3),

$$SplitInfo(S,A_j) = -\sum_{v\in Values(A_j)} \frac{|S_v|}{|S|}\log_2\frac{|S_v|}{|S|} \tag{3}$$

The information gain ratio of attribute $A_j$ follows from the information gain and the split information, as shown in formula (4),

$$GainRatio(S,A_j) = \frac{Gain(S,A_j)}{SplitInfo(S,A_j)} \tag{4}$$

Step3: Build the decision tree model

Select the attribute with the highest information gain ratio (for example, the maximum subdomain level) as the root node of the decision tree; among the remaining candidate attributes, select the attribute with the highest information gain ratio as the branch node, and recurse to form the decision tree model.